Bias – p. 2
Subsampling Bias (Continued)
Note: For the equations on this page, I’m using MathML and MathJax. Elsewhere I’ve used PNG image files for most equations. The original PNG images looked fine until I got a retina display, with 2 device pixels per px, and suddenly every equation appeared fuzzy. On other pages, I’m using higher-resolution images and onload event handlers to improve the appearance. On my Mac or iPhone, with a retina display, the two approaches now look more or less the same, but MathML/MathJax gives a slightly better rendering on all my PC displays (possibly a ClearType issue).
The theorems given below provide bounds for the bias of a sample chosen according to Gy’s criterion.
Definition  Suppose a lot to be sampled and tested for an analyte is composed of $N$ fragments. A sample from this lot is defined to be a random nonempty subset of the $N$ fragments. In other words, a sample is a random variable whose possible values are nonempty subsets of the $N$ fragments. (Note that the term random sample has a different meaning.) A sample is correct if each fragment in the lot has the same probability of being included in the sample.
Notation  Index the fragments of the lot using the set $L = \{1, 2, \ldots, N\}$. For any integer $j \in L$, let $m_j$ denote the mass of the $j$th fragment, $A_j$ the mass of the critical component (analyte) in the $j$th fragment, and $a_j$ the critical content (mass fraction of analyte) in the $j$th fragment ($a_j = A_j / m_j$). The fragment masses, $m_j$, are assumed to be known, but the masses of critical component, $A_j$, and critical contents, $a_j$, are unknown. In problems where $A_j$ and $a_j$ are allowed to vary, $A_j$ will be treated as a function of $a_j$ and $m_j$, which are considered more fundamental ($A_j = a_j m_j$).
For any nonempty subset $G \subseteq L$, identify $G$ with the collection of fragments indexed by the elements of $G$. For example, if $G = \{1, 2, 3\}$, then identify $G$ with the collection that consists of the 1st, 2nd, and 3rd fragments in the lot. Also, for any nonempty subset $G \subseteq L$, let:

$$m_G = \sum_{j \in G} m_j, \qquad A_G = \sum_{j \in G} A_j, \qquad a_G = \frac{A_G}{m_G}.$$
In particular, $m_L$ denotes the total mass of the lot, $A_L$ denotes the mass of critical component in the lot, and $a_L$ denotes the critical content of the lot. Furthermore, if $S$ denotes a sample from the lot, then $m_S$ = mass of sample $S$, $A_S$ = mass of the critical component in sample $S$, and $a_S$ = critical content of sample $S$. In this case, $m_S$, $A_S$, and $a_S$ are numerical random variables.
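To make the definitions concrete, here is a minimal Python sketch using an invented lot (the fragment masses and contents below are made up purely for illustration). The function `correct_sample` draws a correct sample by including each fragment independently with the same probability $p$, redrawing in the rare case that the result is empty so that the sample is a nonempty subset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented lot of N fragments (illustration only).
N = 1000
m = rng.lognormal(mean=0.0, sigma=0.5, size=N)  # fragment masses m_j
a = rng.beta(2, 50, size=N)                     # critical contents a_j
A = a * m                                       # analyte masses A_j = a_j * m_j

a_L = A.sum() / m.sum()                         # critical content of the lot

def correct_sample(p):
    """Return a boolean mask for a correct sample: each fragment is
    included independently with the same probability p (redrawn in the
    rare case that the sample comes out empty)."""
    while True:
        keep = rng.random(N) < p
        if keep.any():
            return keep

S = correct_sample(p=0.05)
m_S, A_S = m[S].sum(), A[S].sum()
a_S = A_S / m_S                                 # critical content of the sample
print("a_L =", a_L, " a_S =", a_S)
```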
Theorem A.1  Let $S$ be a correct sample from lot $L$. Then:

$$\frac{E(A_S)}{E(m_S)} = a_L.$$
Proof: Since $S$ is correct, there is a real number $p$ with $0 < p \le 1$, such that $\Pr[\, j \in S \,] = p$ for $j = 1, 2, \ldots, N$. For any event $F$, let $I_F$ denote the random variable whose value is $1$ if $F$ occurs and $0$ if $F$ does not occur. So, for example, if $j \in L$, then $I_{[\, j \in S \,]}$ equals $1$ if fragment $j$ belongs to sample $S$ and it equals $0$ otherwise. Then:

$$m_S = \sum_{j \in L} m_j \, I_{[\, j \in S \,]}$$

and

$$A_S = \sum_{j \in L} A_j \, I_{[\, j \in S \,]}.$$

For any event $F$, the expected value of $I_F$ equals the probability of $F$. So, if $j \in L$, then $E(I_{[\, j \in S \,]}) = \Pr[\, j \in S \,] = p$. So,

$$E(m_S) = \sum_{j \in L} m_j \, E(I_{[\, j \in S \,]}) = p \sum_{j \in L} m_j = p \, m_L$$

and:

$$E(A_S) = \sum_{j \in L} A_j \, E(I_{[\, j \in S \,]}) = p \sum_{j \in L} A_j = p \, A_L.$$

So,

$$\frac{E(A_S)}{E(m_S)} = \frac{p \, A_L}{p \, m_L} = \frac{A_L}{m_L} = a_L. \qquad \blacksquare$$
A stronger result can also be proved. It can be shown that $E(A_S) / E(m_S) = a_L$ for all possible values of $a_1, a_2, \ldots, a_N$ if and only if $S$ is correct.
Note that the sampling bias is a bias in the mass fraction of analyte in the sample, which is defined by $a_S = A_S / m_S$. So, a sample $S$ is unbiased if and only if $E(A_S / m_S) = a_L$. Unfortunately, the mean of the quotient, $E(A_S / m_S)$, is not necessarily equal to the quotient of the means, $E(A_S) / E(m_S)$.
If one measured the total mass of analyte in a correct sample, $A_S$, and divided it by the expected mass of the sample, $E(m_S)$, rather than by the actual mass, the result would be unaffected by sampling bias. However, this is not typically done, and in most cases it would not be desirable anyway, because eliminating a rather small bias would not be worth the resulting increase in variability.
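The distinction is easy to see by simulation. The following Monte Carlo sketch (continuing the invented lot above) estimates both quantities for a correct sample: the quotient of means $E(A_S)/E(m_S)$ agrees with $a_L$, as Theorem A.1 requires, while the mean of the quotient, $E(a_S)$, can differ slightly:

```python
# Monte Carlo comparison for a correct sample (continuing the sketch above):
# the quotient of means E(A_S)/E(m_S) equals a_L (Theorem A.1), but the
# mean of the quotient E(a_S) generally does not.
trials = 200_000
mS = np.empty(trials)
AS = np.empty(trials)
for t in range(trials):
    keep = correct_sample(p=0.05)
    mS[t] = m[keep].sum()
    AS[t] = A[keep].sum()
aS = AS / mS

print("a_L                  :", a_L)
print("E(A_S)/E(m_S)  (est.):", AS.mean() / mS.mean())  # ~ a_L
print("E(a_S) = E(A_S/m_S)  :", aS.mean())              # mean of the quotient
print("Bias(a_S)      (est.):", aS.mean() - a_L)
```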
One may consider the sampling bias to be negligible if it is a small fraction of the standard deviation of $a_S$, and the following corollary to Theorem A.1 shows that this is true whenever the relative standard deviation of the sample mass, $m_S$, is small.
Corollary A.1.1  Assume $S$ is a correct sample. Then:

$$\mathrm{Bias}(a_S) \;=\; E(a_S) - a_L \;=\; -\rho(a_S, m_S)\, \sigma(a_S)\, \mathrm{RSD}(m_S)$$

and

$$|\mathrm{Bias}(a_S)| \;\le\; \sigma(a_S)\, \mathrm{RSD}(m_S),$$

where $\mathrm{RSD}$ denotes relative standard deviation (coefficient of variation).
Proof: First use the fact that $a_L = E(A_S) / E(m_S)$ to derive the following equations:

$$\mathrm{Bias}(a_S) = E(a_S) - a_L = E(a_S) - \frac{E(A_S)}{E(m_S)} = \frac{E(a_S)\,E(m_S) - E(a_S\, m_S)}{E(m_S)} = -\frac{\mathrm{Cov}(a_S, m_S)}{E(m_S)} = -\rho(a_S, m_S)\,\sigma(a_S)\,\frac{\sigma(m_S)}{E(m_S)} = -\rho(a_S, m_S)\,\sigma(a_S)\,\mathrm{RSD}(m_S).$$

(Here $A_S = a_S\, m_S$, so $E(A_S) = E(a_S\, m_S)$.) Then, since $|\rho(a_S, m_S)| \le 1$, it follows that $|\mathrm{Bias}(a_S)| \le \sigma(a_S) \times \mathrm{RSD}(m_S)$. $\blacksquare$
Note that a large value for $\mathrm{RSD}(m_S)$ does not necessarily imply a large sampling bias, because $a_S$ and $m_S$ may be only weakly correlated.
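The corollary can be checked numerically on the simulated draws from the sketch above, up to Monte Carlo error:

```python
# Numeric check of Corollary A.1.1 on the simulated draws above.
rho = np.corrcoef(aS, mS)[0, 1]      # rho(a_S, m_S)
rsd_m = mS.std() / mS.mean()         # RSD(m_S)

print("Bias(a_S)                    :", aS.mean() - a_L)
print("-rho * sigma(a_S) * RSD(m_S) :", -rho * aS.std() * rsd_m)  # identity
print("bound sigma(a_S) * RSD(m_S)  :", aS.std() * rsd_m)         # |Bias| <= bound
```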
Theorem A.2  Assume $S$ is a sample chosen in a manner such that the mass of the sample always falls between $(1 - \delta) \times M$ and $(1 + \delta) \times M$ for specified values of $M$ and $\delta$ (with $0 \le \delta < 1$). If $\Pr[\, j \in S \,] = M / m_L$ for all $j \in L$, then:

$$|\mathrm{Bias}(a_S)| \;\le\; \frac{\delta}{1 - \delta}\, a_L.$$
Given the premise of Theorem A.2, it can also be shown that if $S$ is unbiased for all possible values of $a_1, a_2, \ldots, a_N$, then for all $j \in L$,

$$(1 - \delta)\,\frac{M}{m_L} \;\le\; \Pr[\, j \in S \,] \;\le\; (1 + \delta)\,\frac{M}{m_L}.$$

So, if the mass of the sample is not allowed to vary much, one can ensure zero sampling bias only if all the fragments have nearly the same selection probability.
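As a rough illustration of Theorem A.2, the following sketch uses rejection sampling on the invented lot: Bernoulli samples drawn with $p = M / m_L$ are kept only when their mass lands in the allowed window. (The conditioning perturbs the inclusion probabilities slightly, so this only approximates the premise $\Pr[\, j \in S \,] = M / m_L$, and the comparison with the bound is indicative rather than exact.)

```python
# Rough illustration of Theorem A.2 via rejection sampling: keep only
# Bernoulli samples (p = M/m_L) whose mass lands in the allowed window.
# The conditioning perturbs the inclusion probabilities a little, so this
# only approximates the premise Pr[j in S] = M/m_L.
m_L = m.sum()
M = 0.05 * m_L                      # target sample mass
delta = 0.10
lo, hi = (1 - delta) * M, (1 + delta) * M

draws = []
while len(draws) < 20_000:
    keep = rng.random(N) < M / m_L
    mass = m[keep].sum()
    if lo <= mass <= hi:
        draws.append(A[keep].sum() / mass)

print("estimated Bias(a_S) :", np.mean(draws) - a_L)
print("Theorem A.2 bound   :", delta / (1 - delta) * a_L)
```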