What is statistics? Biologists study the properties of living | Writedemy

Question
Statistics and samples

1.1 What is statistics?
Biologists study the properties of living things. Measuring these properties is a challenge, though, because no two individuals from the same biological population are
ever exactly alike. We can’t measure everyone in the population, either: constrained
by time and funding, we limit our measurements to a sample of individuals
drawn from the population. Sampling brings uncertainty to the project because, by
chance, properties of the sample are not the same as the true values in the population.
Thus, measurements made from a sample are affected by who happened to get sampled and who did not.
Statistics is a technology that describes and measures aspects of nature from
samples. Most importantly, statistics lets us quantify the uncertainty of these measures—that is, statistics makes it possible to determine the likely magnitude of their
departure from the truth.
Statistics is about estimation, the process of inferring an unknown quantity of a
target population using sample data. Properly applied, the tools for estimation allow
us to approximate almost everything about populations using only samples. Examples range from the average flying speed of bumblebees, to the risks of exposure to
cell phones, to the variation in beak size of finches on a remote Galápagos island. We
can also estimate the proportion of people with a particular disease that die per year
and the fraction who recover when treated.
Most importantly, we can assess differences between groups and relationships
between variables. For example, we can test the effects of different drugs on the
possibility of recovery, we can measure the association between the lengths of fingers
and whether individuals play the piano, and we can test whether the survival of
women and children during shipwrecks differs from that of men.

Estimation is the process of inferring an unknown quantity of a population using sample data.
All of these quantities describing populations—namely, averages, proportions,
measures of variation, and measures of relationship—are called parameters. Statistics tells us how best to estimate these parameters using our measurements of a sample. The parameter is the truth, and the estimate is an approximation of the truth, subject to error. If we were able to measure every possible member of the population, we
could know the parameter without error, but this is rarely possible. Instead, we use
estimates based on incomplete data to approximate this true value. With the right statistical tools, we can determine just how good our approximations are.

A parameter is a quantity describing a population, whereas an estimate
is a related quantity calculated from a sample.
Statistics is also about hypothesis testing, the process of determining how well
a “null” hypothesis about a population quantity fits a sample of data. The null
hypothesis is a specific claim regarding the population quantity. It is made for the
purposes of argument and often embodies the skeptical point of view. Examples are
“this new drug is no improvement on its predecessor,” and “metabolic rate increases
with body mass according to Kleiber’s law, by the 3/4 power.” Biological data
usually get more interesting and informative if the null hypothesis is found to be inadequate.
Statistics has become an essential tool in almost every area of biology—as indispensable as the PCR machine, calipers, binoculars, and the microscope. This book
presents the ideas and methods needed to use statistics effectively, so that we can
improve our understanding of nature.
Chapter 1 begins with an overview of samples—how they should be gathered and
the conclusions that can be drawn from them. We also discuss the types of variables
that can be measured from samples, introducing terms that will be used throughout
the book.

1.2 Sampling populations
Our ability to obtain reliable measures of population characteristics—and to assess
the uncertainty of these measures—depends critically on how we sample populations. It is often at this early step in an investigation that the fate of a study is sealed,
for better or worse, as Example 1.2 demonstrates.
Example 1.2 Raining cats

In an article published in the Journal of the American Veterinary Medical Association, Whitney and Mehlhaff (1987) presented results on the injury rates of cats that
had plummeted from buildings in New York City according to the number of
floors they had fallen. Fear not: no experimental scientist tossed cats from different altitudes to obtain the data for this study. Rather, the cats had fallen (or
jumped) of their own accord. The researchers were merely recording the fates of
the cats that ended up at the veterinary hospital for repair. The damage caused by
such falls was dubbed Feline High-Rise Syndrome, or FHRS.¹

Not surprisingly, cats that fell five floors fared worse than those dropping
only two, and those falling seven or eight floors tended to suffer even more (see
Figure 1.2-1). But the astonishing result was that things got better after that. On
average, the number of injuries was reduced in cats that fell more than nine
floors. This was true in every injury category. Their injury rates approached those
of cats that had fallen only two floors! One cat fell 32 floors and walked away
with only a chipped tooth.
This effect cannot be attributed to the ability of cats to right themselves so as
to land on their feet—a cat needs less than one story to do that. The authors of the
article put forth a more surprising explanation. They proposed that after a cat
attains terminal velocity, which happens after it has dropped six or seven floors,
the falling cat relaxes, and this change to its muscles cushions the impact when
the cat finally meets the pavement.

¹ “The diagnosis of high-rise syndrome is not difficult. Typically, the cat is found outdoors, several stories
below, and a nearby window or patio door is open” (Ruben 2006).

[Figure 1.2-1 not reproduced. Vertical axis: number of injuries per cat (0 to 2.5). Horizontal axis: number of stories fallen, with sample sizes in parentheses: 1 (0), 2 (8), 3 (14), 4 (27), 5 (34), 6 (21), 7–8 (9), 9–32 (13).]

Figure 1.2-1 A graph plotting the average number of injuries sustained per cat according to the
number of stories fallen. Numbers in parentheses indicate sample sizes. Modified from Diamond
(1988).

Remarkable as these results seem, aspects of the sampling procedure raise questions. A clue to the problem is provided by the sample size for each height. This is the
number of cats that fell a particular number of floors, and is indicated along the horizontal axis of Figure 1.2-1. No cats fell just one floor, and the number of cats falling
increases with each floor from the second floor to the fifth. Yet, surely, every building
in New York having at least five floors has a fourth floor, too, with open windows no
less inviting. What can explain this curious trend?
Our strong suspicion is that not all falling cats were taken to the vet and that the
chance of a cat making it to the vet was affected by the number of stories it had fallen.
Perhaps most cats that tumble out of a first- or second-floor window suffer only
indignity, which is untreatable. Any cat appearing to suffer no physical damage from
a fall of even a few stories may likewise skip a trip to the vet. At the other extreme, a
cat plunging 20 stories might also avoid a trip to the vet, heading to the nearest pet
cemetery instead.
This example illustrates the kinds of questions of interpretation that arise if samples are biased. If the sample of cats delivered to the vet clinic is, as we suspect, a distorted subset of all the cats that fell, then the measures of injury rate and injury severity will also be distorted. We cannot say whether this bias is enough to cause the
surprising downturn in injuries at a high number of stories fallen. At the very least,
though, we can say that, if the chance of a cat making it to the vet depends on the
number of stories fallen, the relationship between injury rate and number of floors
fallen will be distorted.
Good samples are a foundation of good science. In the rest of this section we give
an overview of the concept of sampling, what we are trying to accomplish when we
take a sample, and the inferences that are possible when researchers get it right.

Populations and samples
The first step in collecting any biological data is to decide on the target population. A
population is the entire collection of individuals or units that a researcher is interested in. Ordinarily, a population is composed of a large number of individuals; in
most cases, we assume that the population is effectively infinitely large. Examples of
populations include
all cats that have fallen from buildings in New York City,
all the genes in the human genome,
all individuals of voting age in Australia,
all paradise flying snakes, Chrysopelea paradisi, in Borneo, and
all children in Vancouver, Canada, suffering from asthma.
A sample is a much smaller set of individuals selected from the population.² The
researcher uses this sample to draw conclusions that, hopefully, apply to the whole
population. Examples include
the fallen cats brought to one veterinary clinic in New York City,
a selection of 20 human genes,
a pub full of Australian voters,
eight paradise tree snakes caught by researchers in Borneo, and
a selection of 50 children in Vancouver, Canada, suffering from asthma.
A population is all the individuals or units of interest, whereas a sample
is a subset of units taken from the population.
In the above examples, the basic unit of sampling is literally a single individual.
Sometimes, however, the basic unit of study is a group of individuals, in which case a
sample consists of a set of such units. Examples of units include a single family, a
colony of microbes, a plot of ground in a field, an aquarium of fish, and a cage of
mice. Scientists use several terms to indicate the sampling unit, such as “unit,” “individual,” “subject,” or “replicate.”

Properties of good samples
Estimates based on samples are doomed to depart somewhat from the true population
characteristics simply by chance. This chance difference from the truth is called sampling error. The spread of estimates resulting from sampling error indicates the
precision of an estimate. The lower the sampling error, the higher the precision.
Larger samples are less affected by chance and so, all else being equal, larger samples
will have lower sampling error and higher precision.
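
The link between sample size and precision can be checked with a small simulation. The sketch below is only an illustration: the population is made up (10,000 invented measurements), and the sample sizes, number of replicates, and seed are arbitrary choices of ours. It draws many random samples of two sizes and compares the spread of the resulting estimates of the mean.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so that the run is repeatable

# Hypothetical population: 10,000 invented measurements (not real data)
population = [rng.gauss(50, 10) for _ in range(10_000)]
parameter = statistics.mean(population)  # the true population mean

def spread_of_estimates(n, replicates=1000):
    """Standard deviation of sample means across many random samples of size n."""
    estimates = [statistics.mean(rng.sample(population, n))
                 for _ in range(replicates)]
    return statistics.stdev(estimates)

spread_small = spread_of_estimates(10)    # estimates from samples of 10 units
spread_large = spread_of_estimates(100)   # estimates from samples of 100 units

print(f"population mean (parameter): {parameter:.2f}")
print(f"spread of estimates, n = 10:  {spread_small:.2f}")
print(f"spread of estimates, n = 100: {spread_large:.2f}")
```

The estimates from the larger samples should cluster more tightly around the parameter, which is what higher precision means.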

² In biology, a “blood sample” or a “tissue sample” might refer to a substance taken from a single individual.
In statistics, we reserve the word “sample” to refer to a subset of individuals drawn from a population.

Sampling error is the chance difference between an estimate and the
population parameter being estimated.
Ideally, our estimate is accurate (or unbiased), meaning that the average of estimates is centered on the true population value. If samples are not properly taken,
measurements made on them might systematically underestimate (or overestimate)
the population parameter. This is a second kind of error called bias.

Bias is a systematic discrepancy between estimates and the true population characteristic.
The major goal of sampling is to minimize sampling error and bias in estimates.
Figure 1.2-2 illustrates these goals by analogy with shooting at a target. Each point
represents an estimate of the population bull’s-eye (i.e., of the true characteristic).
Ideally, all the estimates are tightly grouped, indicating low sampling error, and they
are centered on the bull’s-eye, indicating low bias. Estimates are precise if they are
tightly grouped and highly repeatable, with different samples giving similar answers.
Estimates are accurate if they are centered on the bull’s-eye. Estimates are imprecise,
on the other hand, if they are spread out, and they are biased (inaccurate) if they are
displaced systematically to one side of the bull’s-eye. The shots (estimates) on the
upper right-hand target in Figure 1.2-2 are widely spread out but centered on the
bull’s-eye, so we say that the estimates are accurate but imprecise. The shots on the
lower left-hand target are tightly grouped but not near the bull’s-eye, so we say that
they are precise but inaccurate. Both precision and accuracy are important, because a
lack of either means that an estimate is likely to differ greatly from the truth.
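[Figure 1.2-2 not reproduced: four targets in a grid, with columns labeled Precise and Imprecise and rows labeled Accurate and Inaccurate.]

Figure 1.2-2 Analogy between estimation and target shooting. An accurate estimate is centered
around the bull’s-eye, whereas a precise estimate has low spread.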

A final goal of sampling is to allow the precision of an estimate to be quantified.
There are several quantities available to measure precision, which we discuss in
Chapter 4.
The sample of cats in Example 1.2 falls short in achieving some of these goals.
If uninjured and dead cats do not make it to the pet hospital, then estimates of
injury rate are biased. Injury rates for cats falling only two or three floors are likely
to be overestimated, whereas injury rates for cats falling many stories might be
underestimated.
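
A short simulation can show how this kind of bias differs from sampling error. Everything in the sketch below is an assumption made for illustration: the injury counts are invented, and the selection rule (uninjured cats never reach the vet) is a deliberately extreme version of our suspicion, not something measured in the study.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so that the run is repeatable

# Hypothetical injury counts for 5,000 cats that fell a short distance.
# These values are invented for illustration; they are not the study's data.
all_cats = [rng.choice([0, 0, 0, 1, 1, 2]) for _ in range(5000)]
true_rate = statistics.mean(all_cats)  # the population parameter

# Assumed selection rule: uninjured cats are never taken to the vet.
vet_cats = [injuries for injuries in all_cats if injuries > 0]

def mean_of_sample(cats, n):
    """Mean injury count in a random sample of n cats from the given list."""
    return statistics.mean(rng.sample(cats, n))

# Repeat each sampling scheme many times and average the estimates.
random_estimates = [mean_of_sample(all_cats, 50) for _ in range(500)]
vet_estimates = [mean_of_sample(vet_cats, 50) for _ in range(500)]

print(f"true injuries per cat:        {true_rate:.2f}")
print(f"average random-sample result: {statistics.mean(random_estimates):.2f}")
print(f"average vet-sample result:    {statistics.mean(vet_estimates):.2f}")
```

Under these assumptions the random-sample estimates scatter around the true value, whereas the vet-only estimates sit consistently above it. Taking more vet-only samples does not remove the discrepancy, which is what distinguishes bias from sampling error.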

Random sampling
The common assumption of the methods presented in this book is that the data come
from a random sample. A random sample is a sample from a population that fulfills
two criteria.
First, every unit in the population must have an equal chance of being
included in the sample. This is not as easy as it sounds. A botanist estimating
plant growth might be more likely to find the taller individual plants or to collect
those closer to the road. Some members of animal or human populations may be
difficult to collect because they are shy of traps, never answer the phone, ignore
questionnaires, or live at greater depths or distances than other members. These
hard-to-sample individuals might differ in their characteristics from those of the
rest of the population, so underrepresenting them in samples would lead to bias.
Second, the selection of units must be independent. In other words, the selection of any one member of the population must in no way influence the selection of
any other member. This, too, is not easy to ensure. Imagine, for example, that a
sample of adults is chosen for a survey of consumer preferences. Because of the
effort required to contact and visit each household to conduct an interview, the lazy
researcher is tempted to record the preferences of multiple adults in each household
and add their responses to those of other adults in the sample. This violates the criterion of independence, because the selection of one individual has increased the
probability that another individual from the same household will also be selected.
This will skew the data if individuals from the same household have preferences
more similar to one another than is obtained from individuals randomly chosen
from the population at large. With non-independent sampling, our sample size is
effectively smaller than we think. This, in turn, will cause us to miscalculate the
precision of the estimates.
In a random sample, each member of a population has an equal and
independent chance of being selected.
In general, the surest way to minimize bias and allow sampling error to be quantified is to obtain a random sample.

Random sampling minimizes bias and makes it possible to measure the
amount of sampling error.

How to take a random sample
Obtaining a random sample is easy in principle but can be challenging in practice. A
random sample can be obtained by using the following step-by-step procedure:
1. Create a list of every unit in the population of interest, and give each unit a
number between 1 and the total population size.
2. Decide on the number of units to be sampled (call this number n).
3. Using a random-number generator, generate n random integers between 1 and
the total number of units in the population.
4. Sample the units whose numbers match those produced by the random-number generator.
An example of this process is shown in Figure 1.2-3. In both panels of the figure,
we’ve drawn the locations of all 5699 trees present in 2001 in a carefully mapped
tract of Harvard Forest (Barker-Plotkin et al. 2006). Every tree in this population has
a unique number between 1 and 5699 to identify it. We used a computerized random-number generator to pick n = 20 random integers between 1 and 5699, where 20 is
the desired sample size. The 20 random integers, after sorting, are as follows:
156, 167, 232, 246, 826, 1106, 1476, 1968, 2084, 2222, 2223, 2284, 2790, 2898,
3103, 3739, 4315, 4978, 5258, 5500
These 20 randomly chosen trees are identified by red dots in the left panel of Figure 1.2-3.
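
The four steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used to produce the numbers listed above: the function name `draw_random_sample` and the seed are our own choices, and we assume that units are drawn without replacement so that no tree is picked twice. The output will therefore differ from the 20 integers shown.

```python
import random

def draw_random_sample(population_size, n, seed=None):
    """Return n distinct unit numbers between 1 and population_size, sorted."""
    rng = random.Random(seed)  # passing a seed makes the draw repeatable
    # random.sample draws without replacement, so no unit appears twice
    return sorted(rng.sample(range(1, population_size + 1), n))

# Steps 1-3: units are numbered 1 to 5699, and we want n = 20 of them
tree_ids = draw_random_sample(population_size=5699, n=20, seed=1)

# Step 4: these are the numbers of the trees to measure
print(tree_ids)
```

Recording the seed alongside the data lets someone else regenerate exactly the same list of units later.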
[Figure 1.2-3 not reproduced. Two map panels; axes: east–west position (feet) and north–south position (feet).]

Figure 1.2-3. The locations of all 5699 trees present in the Prospect Hill Tract of Harvard Forest in
2001 (green circles). The red dots in the left panel are a random sample of 20 trees. The squares in
the right panel are a random sample of 20 quadrats (each 20 feet on a side).

How realistic is this procedure? Creating a numbered list of every individual
member of a population might be feasible for patients recorded in a hospital database,
for children registered in an elementary-school system, or for some other populations for which a registry has been built. The feat is impractical for most plant
populations, however, and unimaginable for most populations of animals or
microbes. What can be done in such cases?
One answer is that the basic unit of sampling doesn’t have to be a single individual—it can be a group, instead. For example, it is easier to use a map to divide a forest
tract into many equal-sized blocks or plots, and then to create a numbered list of these
plots, than it is to produce a numbered list of every tree in the forest. To illustrate this
second approach, we divided the Harvard Forest tract into 836 plots of 400 square
feet each. With the aid of a random-number generator, we then identified a random
sample of 20 plots, which are identified by the squares in the right panel of Figure
1.2-3.
The trees contained within a random sample of plots do not constitute a random
sample of trees, for the same reason that all of the adults inhabiting a random sample
of households do not constitute a random sample of adults. Trees in the same plot are
not sampled independently. The data in this case must be handled carefully. A simple
technique is to take the average of the measurements of all of the individuals within a
unit as the single independent observation for that unit.
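
The averaging technique just described might look like the following sketch. The plot numbers and trunk diameters are invented for illustration, and the variable names are our own; the point is only that each sampled plot contributes one value to the analysis, however many trees it contains.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical measurements: (plot number, trunk diameter in cm).
# The values are invented; they are not Harvard Forest data.
measurements = [
    (12, 30.0), (12, 34.0), (12, 32.0),
    (407, 18.0), (407, 22.0),
    (650, 41.0),
]

# Group the individual trees by the plot in which they were found
by_plot = defaultdict(list)
for plot_id, diameter in measurements:
    by_plot[plot_id].append(diameter)

# One average per plot: these are the independent observations
plot_means = {plot_id: mean(values) for plot_id, values in by_plot.items()}

print(plot_means)       # {12: 32.0, 407: 20.0, 650: 41.0}
print(len(plot_means))  # 3 observations, not 6
```

The sample size for later calculations is then the number of plots, not the number of trees.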
Random numbers should always be generated with the aid of a computer. Haphazard numbers made up by the researcher are not likely to be random (see Example
19.1). Most spreadsheet programs and statistical software packages on the computer
include random-number generators.

The sample of convenience
One undesirable alternative to the random sample is the sample of convenience, a
sample based on individuals that are easily available to the researcher. The
researchers must assume (i.e., dream) that a sample of convenience is unbiased and
independent like a random sample, but there is no way to guarantee it.

A sample of convenience is a collection of individuals that are easily
available to the researcher.
The main problem with the sample of convenience is bias, as the following examples illustrate:
The injury rate of cats that have fallen from high-rise buildings is likely to be
underestimated compared with a random sample, if measured only on cats that
are brought to a veterinary clinic. Uninjured and fatally injured cats are less
likely to make it to the vet and into the sample.

The spectacular collapse of the North Atlantic cod fishery in the last century
was caused in part by overestimating cod densities in the sea, which led to
excessive allowable catches by fishing boats (Walters and Maguire 1996). Density estimates were too high because they relied heavily on the rates at which
the fishing boats were observed to capture cod. However, the fishing boats
tended to concentrate in the few remaining areas where cod were still numerous, and they did not randomly sample the entire fishing area.
The Literary Digest Poll was the largest poll in history (questionnaires were
sent to 10 million people, of which 2.4 million responded), but it predicted the
wrong outcome to the 1936 U.S. federal election (Freedman et al. 1997). This
was probably because the list of people to receive questionnaires was obtained
from magazine subscriptions, telephone books, automobile registrations, and
club memberships. This tended to leave out people in low-income families
whose voting preferences were very different from the higher-income people
who received questionnaires.
A sample of convenience might also violate the assumption of independence if
individuals in the sample are more similar to one another in their characteristics than
individuals chosen randomly from the whole population. This is likely if, for example, the sample includes a disproportionate number of individuals who are friends or
who are related to one another.

Volunteer bias
Human studies in particular must deal with the possibility of volunteer bias, which
is a bias resulting from a systematic difference between the pool of volunteers (the
volunteer sample) and the population to which they belong. The problem arises
when the behavior of the subjects affects whether they are sampled.
In a large experiment to test the benefits of a polio vaccine, for example, participating schoolchildren were randomly chosen to receive either the vaccine or a saline solution (serving as the control). The vaccine proved effective, but the rate at which children in the saline group contracted polio was found to be higher than in the general
population. Perhaps parents of children who had not been exposed to polio prior to the
study, and therefore had no immunity, were more likely to volunteer their children for
the study than parents of kids who had been exposed (Brownlee 1955, Bland 2000).
Compared with the rest of the population, volunteers might be
more health conscious and more proactive;
low-income (if volunteers are paid);
more ill, particularly if the therapy involves risk, because individuals who are
dying anyway might try anything;
more likely to have time on their hands (e.g., retirees and the unemployed are
more likely to answer telephone surveys);

more angry, because people who are upset are sometimes more likely to speak
up; or
less prudish, because people with liberal opinions about sex are more likely to
speak to surveyors about sex.
Such differences can cause substantial bias in the results of studies. Bias can be
minimized by careful handling of the volunteer sample, but the resulting
sample is still inferior to a random sample.

Real data in biology
In this book we use real data hard-won from observational or experimental studies in
the lab and field and published in the literature. Do the samples on which the studies
are based conform to the ideals outlined above? Alas, the answer is often no. Random
samples, however much desired, are often not achieved by biologists working in the
trenches. Real data are frequently based on samples that are not random, as the falling
cats in Example 1.2 demonstrate.
Biologists deal with this problem by acknowledging that the problem exists, by
pointing out where biases might arise in their studies,³ and by carrying out further
studies that attempt to control for any sampling problems evident in earlier work.

1.3 Types of data and variables
With a sample in hand, we can begin to measure variables. A variable is any characteristic or measurement that differs from individual to individual. Examples include
running speed, reproductive rate, and genotype. Estimates (e.g., average running
speed of a random sample of 10 lizards) are also variables, because they differ by
chance from sample to sample. Data are the raw measurements of one or more variables made on a sample of individuals.
Variables are characteristics that differ among individuals.

Categorical and numerical variables
Variables can be categorical or numerical.
Categorical variables describe membership in a category or group. They
describe named characteristics of individuals that do not have magnitude on a
numerical scale. Categorical variables are also called attribute or qualitative variables. Examples…

³ We biologists are generally happier to find such flaws in other researchers’ data than in our own.

Current telephone polls carried out to estimate
voter or consumer preferences do not make calls
to cell phones. The reason? Persons carrying
cell phones might be driving at the time of the
call (they might also have to pay for the call).
a. How might the strategy of leaving out cell
phones affect the goal of obtaining a random
sample of voters or consumers?
b. Which criterion of random sampling is most
likely to be violated by the problems you
identified in part (a): equal chance of being
selected, or the independence of the selection
of individuals?
