Click4Biology: Statistical Analysis

Topic 1: Statistical analysis

About software and calculators

Excel 2003 Toolkit package

Excel 2007 Toolkit package

Graphic display calculator

Statistics and IB Biology syllabus a word of caution!

Data source for calculations.

1.1.1 Error bars and the representation of variability in data.

1.1.2 Calculation of mean and the standard deviation of the sample data.

1.1.3 Standard deviation and the spread of data.

1.1.4 Comparison of two sets of data using their means and standard deviation.

1.1.5 Comparison of two sets of sample data using the t-test.

a. Comparison of two samples.
b. Animation of two sample comparisons.
c. T Test in Excel 2007.
d . Drawing conclusions with statistical tests

1.1.6 Correlation, causation and the calculation of correlation coefficients

 

 

 

Statistic and the IB Biology syllabus:

General Notes:

EXCEL 2003

The statistical analysis of data will be illustrated with a worked example using Excel 2003 & 2007. Links to web pages with tutorials on the use of Microsoft Excel have been included. To obtain full use of the statistical functions of Excel requires the Data Analysis package. This can be added to your TOOLs menu using the 'Add in' l ink. Note that this is not a Download but is already in your Excel package waiting to be an 'Add in'. Installation instructions Excel 2003. Those of you with the Office 2007 will find Excel looking a little different but the ToolPack2007 can also be installed.

GRAPHIC CALCULATOR (TI 84 PLUS, TI 83 PLUS)

Alternatively students can use their graphic calculators to generate the statistic covered in this topic. Graphic calculator routines are based on the TI 83plus and the TI 84 plus. The great advantage of the graphic calculator is all IB Diploma students will have one for their mathematics course. The calculator can be easily used in the laboratory and also in the field for immediate feedback on your experiments. Try to build these routines into your practical work..

As a new addition to the syllabus this section of the site should be treated with some caution. I have tried as far as possible to adhere closely the syllabus statements. However there areas and questions that remain unanswered about the interpretation of the syllabus statements. These will of course become clearer as teachers attend workshops and through the discussions on the OCC forum. Until then please check your syllabus assessment statements and cross check with your subject advisor.

Examples:

1.1.1 Error bars

1.1.2 Calculating standard deviation:

In biology experiments we normally are working with a sample of the population. Therefore we calculate the sample standard deviation. I have read the information in the Microsoft Excel support pages that suggest that the correct form of standard deviation test to use is STDEV. One of the alternatives is to use the form STDEVP but this is for the population not the sample. The 'Teacher notes' state: 'Students should specify the standard deviation (s), not the population standard deviation'.

1.1.5 Confusing syllabus statement aside I refer here to aim 7 pg 45:

 

 

An investigation of shell length variation in a mollusc species.

The following scenario has been used to generate the data used in the demonstration of the statistical test of the course.

A marine gastropod (Thersites bipartita) has been sampled from two different locations:

Thersites bipartita

 

 

 

Analysis of gastropod data:

The data was obtained by measuring the height of the shells using a ruler. Measurement are in mm with an error of + or - 1 mm.

Significant digits in the data and the uncertainty in the data must be consistent. This applies to all measuring devices, for example, digital meters, stopwatches, and so on. The number of significant digits should reflect the precision of the measurement.

There should be no variation in the precision of raw data. For example, the same number of decimal places should be used. For data derived from processing raw data (for example, means), the level of precision should be consistent with that of the raw data

 

1.1.1 Error bars and the representation of variability in data.

Biological systems are subject to a genetic program and environmental variation. Consequently when we collect a set of data for a given variable it shows variation. When displaying data in graphical formats we can show the variation using error bars.

Error bars can be used to show either the range of the data or the standard deviation.

Mean with the full data range:

The data can be represented on a graph that might show the mean and the full range of data.

Marine population: mean= 30.7, Range = 23-43                Brackish population: mean = 38.2, Range = 32-51

 

 

 

 

 

 

 

 

 

 

 

Plotting the mean and the range allows for a quick comparison of the two sets of sample data.

 

1.1.2 Calculation of mean and the standard deviation of the sample data

Data collected from an experiment falls into three categories

Mean:

The arithmetic mean or average is a measure of the central tendency(middle value) of the data. Caution should be used as the distribution maybe skewed and the mean may in fact not be the middle value. The EXCEL formula is given above.

Be careful that the data type you have (check table) allows you to calculate the mean. It may be that the median or the mode are more appropriate.

Standard Deviation (s):

The standard deviation of the sample = s

The standard deviation calculated is for the sample not the total population which could of course have a smaller or larger standard deviation (see the note below).

The image shows the calculation using the Excel spreadsheet.

The standard deviation calculated is a measure of the spread of the data values around the mean.

Population 1. Mean = 31.4 Standard deviation(s)= 5.7

Population 2. Mean =41.6 Standard deviation(s) = 4.3

 

 

 

 

STDEV or STDEVA in Excel

'Microsoft recommends that you use STDEV instead of STDEVA unless you are sure that you want TRUE, FALSE, and the text strings to be interpreted as the STDEVA function interprets them. Most of the data that you want to calculate a population standard deviation for is completely numeric; in those cases, STDEV is appropriate.'

STDEV is the appropriate formula for the calculation of the sample standard deviation rather than the population standard deviation which is STDEVP.


Microsoft support Excel STDEV

Graphing the mean and the standard deviation.

One way to represent our data is to draw a graph that includes error bars of the standard deviation. The diagram below was drawn by hand but it is possible to plot the SD as error bars in Excel

 

Here each sample has the mean +/- 1 standard deviation.

There is no overlap in the distributions for shell length between these two populations.

The question being considered is:

Is there a significant difference between the two samples from different locations?

or

Are the differences in the two samples just due to chance selection?

 

 

 

 

Comparison of graphs:

 

 

 

 

 

 

 

 

 

 

 

 

 

1.1.3 Standard deviation and the spread of values around the mean.

  1. Standard deviation is a measure of how spread out the data values are from the mean.

  2. It is assumed that there is a normal distribution of values around the mean and that the data is not skewed to either end.

  3. 68% of all the data values in a sample can be found between the mean +/- 1 standard deviation..

4. 95% of all the data values in a sample can be found between the mean + 2s and the mean -2s.

 

1.1.4 Comparing means and standard deviation of two or more samples.

 

A sample with a small standard deviation suggest narrow variation but a second sample with a larger standard deviation suggests wider variation.

Example: The two mollusc samples above provide

Population 1. Mean = 31.4 Standard deviation(s)= 5.7

Population 2. Mean =41.6 Standard deviation(s) = 4.3

The second population has a greater mean shell length but slightly narrower variation. Why this is the case would require further observation and experiment on environmental and genetic factors.

 

1.1.5 Comparison of two samples/ t-Test

Comparison for two samples:

In the introduction to this topic we considered the sampling of the same species of mollusc from two different locations. We have already calculated the means and the standard deviation for these sample.

Population 1. Mean = 31.4 Standard deviation(s)= 5.7

Population 2. Mean =41.6 Standard deviation(s) = 4.3

The question that we might now ask is:

Null Hypothesis:
Is there no significant difference between the two samples except as caused by chance selection of data.

OR

Alternative hypothesis:
Is there a significant difference between the height of shells in sample A and sample B.

Animation of the question 'Is there a significant difference between these two populations?

 

 

Statistical test of difference using the t-Test.

 

T-Test Calculation : Excel 2007 (calculating P)

 

 

 

 

 

 

 

array excel 2007

 

Enter the setting as provided:

In Excel 2003 the t test is performed using the formula: = TTEST (range1, range2, tails, type) .

For the examples you'll use in biology, tails is always 2 , and type can be:

1, paired
2,Two sample equal variance
3, Two samples unequal variance

 

 

 

 

ttest

 

The cell with the t test P can be formatted as a percentage (Format menu > cell > number tab > percentage).

This automatically multiplies the value by 100 and adds the % sign. This can make P values easier to read and understand. It's also a good idea to plot the means as a bar chart with error bars to show the difference graphically.

 

 

 

 

 

 

 

Background to the T-test

 

Drawing conclusions:

1. State the null hypothesis and the alternative hypothesis based on your research question.

Null Hypothesis: 'There is no significant difference between the height of shells in sample A and sample B.'
Alternative Hypothesis: 'There is a significant difference between the height of shells in sample A and sample B'.

2. Set the critical P level at P= 0.05 (5%)

3. Write the decision rule for rejecting the null hypothesis.

If P  > 5% then the two sets are the same (i.e. accept the null hypothesis).

If P  < 5% then the two sets are different (i.e. reject the null hypothesis).

4. Write a summary statement based on the decision.

The null hypothesis is rejected since calculated P = 0.003 < P =0.05 two-tailed test

5. Write a statement of results in standard English.

There is a significant difference between the height of shells in sample A and sample B.

 

1.1.6 Correlation, causation and the calculation of correlation coefficients

When analysing an experiment you are very often looking for an association between variables. This can be a correlation to see if two variable vary together, or a relation to see how one variable affects another.

There are two tests for correlation: the Pearson correlation coefficient ( r ), and Spearman's rank-order correlation coefficient ( r s ). These both vary from +1 (perfect correlation) through 0 (no correlation) to –1 (perfect negative correlation). If your data are continuous and normally-distributed use Pearson, otherwise use Spearman. In Excel r is calculated using the formula: = CORREL (X range, Y range) .

In Excel r is calculated using the formula: = CORREL (X range, Y range) .

To calculate r s , first make two new columns showing the ranks (or order) of the X and Y data (either by hand or using Excel's =RANK command), and then calculate the Pearson correlation on the rank data.

It is usual to draw a scatter graph of the data whenever a correlation is being investigated.

In the illustrated example the size of breeding pairs of penguins was measured to see if there was correlation between the sizes of the two sexes. The scatter graph and both correlation coefficients clearly indicate a strong positive correlation. In other words large females do pair with large males. Of course this doesn't say why, but it shows there is a correlation to investigate further.

 

 

 

 

 

If you know that one variable causes the changes in the other variable, then you can use linear regression to investigate the relation. This fits a straight line to the data, and gives the values of the slope and intercept of that line (m and c in the equation y = mx + c).

The simplest way to do this in Excel is to plot a scatter graph of the data and use the trend line feature of the graph.

Right-click on a data point on the graph, select Add Trend line, and choose Linear.

Click on the Options tab, and select Display equation on chart. You can also choose to set the intercept to be zero (or some other value). The full equation with the slope and intercept values are now shown on the chart.

 

 

Causation

It is important to realize that if the statistical analysis of data indicates a correlation between the independent and dependent variable this does not prove any causation. Only further investigation will reveal the causal effect between the two variables.

Correlation does not imply causation. Here are some unusual examples of correlation but not causation's !

Clearly there is no real interaction between the factors involved simply a co-incidence of the data.

Once a correlation between two factors has been established from experimental data it would be necessary to advance the research to determine what the causal relationship might be.