Click4Biology: Topic 1 Statistical Analysis

Topic 1: Statistical analysis

Excel 2003 Toolkit package

Excel 2007 Toolkit package

Using the Graphic display calculator for statistical tests

Choosing a statistical test by Dr. Neil Miller

Choosing a statistical test: Merlin software (excellent).

Using error bars in experimental Biology (Journal of Cell Biology). A truly excellent article (pdf download)

1.1.1 Error bars and the representation of variability in data.

1.1.2 Calculation of mean and the standard deviation of the sample data.

1.1.3 Standard deviation and the spread of data.

1.1.4 Comparison of two sets of data using their means and standard deviation.

1.1.5 Comparison of two sets of sample data using the t-test.

1.1.6 Correlation, causation and the calculation of correlation coefficients

1.1.1 State that error bars are a graphical representation of the variability of data.(1)

If we plot the mean with the range (see below) this shows the spread of the data around the mean.

The graph shows how variable the data (measurements) are in comparison to the mean where a:

This type of graph is a form of descriptive statistics.

Example of the mean with the full data range: Comparison of the shell length of two samples of gastropod from different locations.

Marine population: mean = 30.7, range = 23-43

Brackish population: mean = 38.2, range = 32-51

From the Journal of Cell Biology article 'Error bars in experimental biology' it is advised that:

Rule 1: Always state on the graph which type of error bar is being used.

Rule 2: Always state the number (n) of the sample size in the legend of the graph.

If there were 20 repeats/measurements then we add n = 20.


Rule 3: Error bars and statistics should only be shown for independently repeated experiments, and never for replicates.

If we wanted to find the mean height of sycamore trees then we would measure the height of different trees (independently repeated experiments), not the same tree many times (replicates).

Mean +/- standard deviation as an indicator of the variability of data is covered in the following syllabus statement.


1.1.2 Calculate the mean and standard deviation of a set of values.(2)

Data collected from an experiment falls into three categories

Mean:

The arithmetic mean or average is a measure of the central tendency (middle value) of the data. Caution should be used as the distribution may be skewed and the mean may in fact not be the middle value. In Excel we use =AVERAGE(number1, number2, ...).

Be careful that the data type you have (check table) allows you to calculate the mean. It may be that the median or the mode is more appropriate.

Standard Deviation (s):




The standard deviation of the sample = s

The standard deviation calculated is for the sample not the total population which could of course have a smaller or larger standard deviation (see the note below).

The image shows the calculation using the Excel spreadsheet.

The standard deviation calculated is a measure of the spread of the data values around the mean for a sample population.
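As a rough illustration of the calculation, the sketch below computes the mean and the sample standard deviation (s) by hand. The data values and the function names are hypothetical and are not taken from the populations described here; the divisor n - 1 is what makes this the sample (rather than population) standard deviation.

```python
import math

def mean(values):
    """Arithmetic mean: sum of the values divided by how many there are."""
    return sum(values) / len(values)

def sample_sd(values):
    """Sample standard deviation (s): uses n - 1 in the denominator."""
    m = mean(values)
    sum_of_squares = sum((x - m) ** 2 for x in values)
    return math.sqrt(sum_of_squares / (len(values) - 1))

# Hypothetical measurements, for illustration only
measurements = [2, 4, 4, 4, 5, 5, 7, 9]

sample_mean = mean(measurements)
s = sample_sd(measurements)
print(f"mean = {sample_mean:.1f}, s = {s:.2f}")  # mean = 5.0, s = 2.14
```

Python's standard library offers `statistics.mean` and `statistics.stdev`, which should give the same results for this data.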

Population 1. Mean = 31.4, standard deviation (s) = 5.7

Population 2. Mean = 41.6, standard deviation (s) = 4.3

Normally students will be working with sample data and should therefore use the sample version of standard deviation in Excel (STDEV), NOT STDEVP, which treats the data as the whole population, or STDEVA, which also counts text and logical values.

Graphic Calculators: A graphic calculator is an excellent way to calculate the standard deviation. Used during experiments or field work, it gives immediate feedback on the reliability of the sample.

 

Graphing the mean and the standard deviation.

One way to represent our data is to draw a graph that includes error bars of the standard deviation. The diagram below was drawn by hand but it is possible to plot the SD as error bars in Excel.

Here each sample is plotted as the mean +/- 1 standard deviation.

The graphs are examples of descriptive statistics; the data is represented in the image (left).

Inspection of the graph DOES NOT allow us to determine if there is a significant difference between the two sets of data.

Plotting the mean +/- SD is a graphic representation of the variability of the data.

Comparison of graphs:

1.1.3 State that the term standard deviation is used to summarize the spread of values around the mean, and that 68% of the values fall within one standard deviation of the mean.(1)

  1. Standard deviation is a measure of how spread out the data values are from the mean.

  2. It is assumed that there is a normal distribution of values around the mean and that the data is not skewed to either end.

  3. 68% of all the data values (measurements) in a sample can be found between the mean +/- 1 standard deviation.

  4. 95% of all the data values in a sample can be found between the mean + 2SD and the mean - 2SD.
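The 68% and 95% figures come from the shape of the normal distribution. A minimal sketch, assuming a perfectly normal distribution, checks them using Python's standard library; the helper name `fraction_within` is hypothetical.

```python
from statistics import NormalDist

# A standard normal distribution: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0, sigma=1)

def fraction_within(k):
    """Fraction of values lying between mean - k SD and mean + k SD."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within_1sd = fraction_within(1)
within_2sd = fraction_within(2)

print(f"within 1 SD: {within_1sd:.1%}")  # within 1 SD: 68.3%
print(f"within 2 SD: {within_2sd:.1%}")  # within 2 SD: 95.4%
```

Real samples are rarely exactly normal, so the percentages found in experimental data will only approximate these values.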

 

1.1.4 Explain how the standard deviation is useful for comparing the means and the spread of data between two or more samples.(3)

 

A sample with a small standard deviation suggests narrow variation (less error/less uncertainty), but a second sample with a larger standard deviation suggests wider variation (more error/more uncertainty).

Standard deviation can be used to determine if a single measurement lies outside the normal data range.

The mean +/- standard deviation cannot be used for inferential statistics (drawing conclusions regarding differences).

 

Graph of Mean +/- Standard error

For the IB Biology teacher or student there is an important distinction to be drawn between descriptive and inferential error bars. In the course of a typical IB Biology experiment the processed data might be presented as either:

Descriptive Statistics:

These types of graph allow the student to evaluate the variability of the data (in the conclusion).

Inferential Statistics:

This graph will allow the student to draw on what the authors of 'Error bars in experimental biology' call a 'graphic signal', which allows the conclusion and evaluation to consider how much uncertainty there is in the data.

Standard error is fairly straightforward to calculate, especially with a graphic calculator or a spreadsheet function (Excel requires an Add-in (not a download)).
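The standard error of the mean is the sample standard deviation divided by the square root of the sample size (SE = s / √n). A minimal sketch with hypothetical data follows; the function name `standard_error` is an assumption for illustration.

```python
import math
from statistics import mean, stdev

def standard_error(values):
    """Standard error of the mean: sample SD divided by sqrt(n)."""
    return stdev(values) / math.sqrt(len(values))

# Hypothetical measurements, for illustration only
measurements = [10, 12, 14, 16, 18]

sample_mean = mean(measurements)
se = standard_error(measurements)
print(f"mean = {sample_mean:.1f} +/- {se:.2f} (SE, n = {len(measurements)})")
# mean = 14.0 +/- 1.41 (SE, n = 5)
```

Because SE shrinks as n grows, SE error bars are always narrower than SD error bars for the same data, which is one reason the graph must state which type is plotted.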

If we plot the mean +/- standard error we can use the 'graphic signal' to draw some inferences.

The graph shows an overlap of the error bars.

If two SE error bars overlap you can conclude that the difference is not statistically significant.

Sample A and Sample B are not significantly different.



If the two error bars do not overlap then we CANNOT conclude that they are statistically different.

At this stage the student should proceed to a t-test to determine any statistically significant difference.

Reading:

A feature article in the Journal of Cell Biology by Geoff Cumming, Fiona Fidler, and David L. Vaux

An excellent and accessible article on error bars in experimental biology

1.1.5 Deduce the significance of the difference between two sets of data using calculated values for t and the appropriate tables.(3)

If you carry out a statistical significance test, such as the t-test, the result is a P value, where P is the probability of getting a difference at least as large as the one observed if there were really no difference between the two populations sampled.

A. When there is no difference between the two samples:

B. When there is a difference between the two samples:

As always with statistical conclusions, you could be wrong! It is possible there really is no effect, and you had the bad luck to get sets of results that suggest a difference where there is none (or the reverse).

Of course, even if results are statistically highly significant, it does not mean they are necessarily biologically important. Remember this when drawing conclusions in the CE section of your internal assessment (psow).

 

Statistical test of difference using the t-Test.

 

T-Test Calculation: Excel 2007 (calculating P)

Enter the setting as provided:

In Excel 2003 the t-test is performed using the formula: =TTEST(range1, range2, tails, type).

For the examples you'll use in biology, tails is always 2, and type can be:

1, Paired
2, Two samples, equal variance
3, Two samples, unequal variance
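Excel's TTEST returns P directly. Where the value of t itself is needed for comparison with a table, the calculation for type 2 (two samples, equal variance) can be sketched as follows. The sample values and the function name `t_statistic` are hypothetical; the critical value 2.306 is the standard two-tailed table value for P = 0.05 with 8 degrees of freedom.

```python
import math
from statistics import mean, variance

def t_statistic(sample_a, sample_b):
    """Student's t for two independent samples, assuming equal variances.

    Returns (t, degrees_of_freedom).
    """
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    # Pooled variance: sample variances weighted by their degrees of freedom
    pooled_var = ((n_a - 1) * variance(sample_a)
                  + (n_b - 1) * variance(sample_b)) / df
    se_difference = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (mean(sample_a) - mean(sample_b)) / se_difference
    return t, df

# Hypothetical shell heights, for illustration only
sample_a = [10, 12, 14, 16, 18]
sample_b = [15, 17, 19, 21, 23]

t, df = t_statistic(sample_a, sample_b)

# Critical value read from a t-table: two-tailed, P = 0.05, df = 8
CRITICAL_T = 2.306
significant = abs(t) > CRITICAL_T

print(f"t = {t:.2f}, df = {df}, significant at 5%: {significant}")
# t = -2.50, df = 8, significant at 5%: True
```

The sign of t only shows which sample has the larger mean; for a two-tailed test it is the size of t that is compared with the table value.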

 

 

 

 

 

The cell with the t-test P can be formatted as a percentage (Format menu > cell > number tab > percentage).

This automatically multiplies the value by 100 and adds the % sign. This can make P values easier to read and understand. It's also a good idea to plot the means as a bar chart with error bars of standard deviation to show the variability in the data.

In biology the critical probability is usually taken as 0.05 (or 5%). This may seem very low, but it reflects the fact that biology experiments are expected to produce quite varied results.

 

Drawing conclusions:

1. State the null hypothesis and the alternative hypothesis based on your research question.

Null Hypothesis: 'There is no significant difference between the height of shells in sample A and sample B.'
Alternative Hypothesis: 'There is a significant difference between the height of shells in sample A and sample B'.

2. Set the critical P level at P= 0.05 (5%)

3. Write the decision rule for rejecting the null hypothesis.

If P > 5% then there is no significant difference between the two sets (i.e. accept the null hypothesis).

If P < 5% then there is a significant difference between the two sets (i.e. reject the null hypothesis).

4. Write a summary statement based on the decision.

The null hypothesis is rejected since the calculated P = 0.003 is less than the critical P = 0.05 (two-tailed test).

5. Write a statement of results in standard English which includes the hypothesis.

There is a significant difference between the height of shells in sample A and sample B.
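The decision rule in the steps above can be written out as a short sketch. The function name `decide` and the wording of the returned strings are hypothetical; the critical level of 0.05 and the example P of 0.003 are the ones used in the steps.

```python
def decide(p_value, critical_p=0.05):
    """Apply the decision rule: reject the null hypothesis when P < critical P."""
    if p_value < critical_p:
        return "reject the null hypothesis"
    return "accept the null hypothesis"

# P value from the worked example in the steps above
example_decision = decide(0.003)
print(example_decision)  # reject the null hypothesis

# A P value above the critical level leads to the opposite decision
print(decide(0.20))  # accept the null hypothesis
```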

 

1.1.6 Explain that the existence of a correlation does not establish that there is a causal relationship between two variables.(3)

Typically in IB Biology your experiment may involve a continuous independent variable and a continuously variable dependent variable, e.g. the effect of enzyme concentration on the rate of an enzyme-catalysed reaction. The statistical analysis would set out to test the strength of the relationship (correlation).

There are two tests for correlation: the Pearson correlation coefficient (r), and Spearman's rank-order correlation coefficient (rs). These both vary from +1 (perfect correlation) through 0 (no correlation) to –1 (perfect negative correlation). If your data are continuous and normally-distributed use Pearson, otherwise use Spearman. In Excel r is calculated using the formula: =CORREL(X range, Y range).

To calculate rs, first make two new columns showing the ranks (or order) of the X and Y data (either by hand or using Excel's =RANK command), and then calculate the Pearson correlation on the rank data.
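The two coefficients can be sketched as follows, with Spearman calculated as the Pearson correlation of the ranks, as described above. The data and function names are hypothetical, and giving tied values the average of their ranks is an assumption of this sketch.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient (r) between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(spread_x * spread_y)

def ranks(values):
    """Rank from 1 (smallest); tied values share the average of their ranks."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def spearman_rs(xs, ys):
    """Spearman's rank-order coefficient: Pearson's r on the ranked data."""
    return pearson_r(ranks(xs), ranks(ys))

# Hypothetical data: y always rises with x, but not in a straight line
x_data = [1, 2, 3, 4, 5]
y_data = [1, 4, 9, 16, 25]

r = pearson_r(x_data, y_data)
rs = spearman_rs(x_data, y_data)
print(f"r = {r:.3f}, rs = {rs:.3f}")  # r = 0.981, rs = 1.000
```

Here rs is exactly 1 because the order of the two variables matches perfectly, while r is slightly lower because the relationship is curved rather than linear.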

It is usual to draw a scatter graph of the data whenever a correlation is being investigated.

In the illustrated example the size of breeding pairs of penguins was measured to see if there was correlation between the sizes of the two sexes. The scatter graph and both correlation coefficients clearly indicate a strong positive correlation. In other words large females do pair with large males. Of course this doesn't say why, but it shows there is a correlation to investigate further.

 

 

 

 

 

If you know that one variable causes the changes in the other variable, then you can use linear regression to investigate the relation. This fits a straight line to the data, and gives the values of the slope and intercept of that line (m and c in the equation y = mx + c).

The simplest way to do this in Excel is to plot a scatter graph of the data and use the trend line feature of the graph.

Right-click on a data point on the graph, select Add Trend line, and choose Linear.

Click on the Options tab, and select Display equation on chart. You can also choose to set the intercept to be zero (or some other value). The full equation with the slope and intercept values is now shown on the chart.
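For readers working outside Excel, the slope and intercept of a straight-line fit can be calculated directly. The sketch below uses ordinary least squares, which is the usual method behind a linear trend line; the data values and the function name `linear_fit` are hypothetical.

```python
def linear_fit(xs, ys):
    """Least-squares straight line y = m*x + c. Returns (m, c)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx               # slope
    c = mean_y - m * mean_x     # intercept: the line passes through the means
    return m, c

# Hypothetical data, e.g. enzyme concentration (x) against reaction rate (y)
concentration = [0, 1, 2, 3, 4]
rate = [1.0, 3.1, 4.9, 7.2, 8.8]

m, c = linear_fit(concentration, rate)
print(f"y = {m:.2f}x + {c:.2f}")  # y = 1.97x + 1.06
```

This sketch does not force the intercept through zero; that option in Excel fits a different line.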

 

 

 

 

Causation

It is important to realize that if the statistical analysis of data indicates a correlation between the independent and dependent variable this does not prove any causation. Only further investigation will reveal the causal effect between the two variables.

Correlation does not imply causation. Here are some unusual examples of correlation without causation!

Clearly there is no real interaction between the factors involved, simply a coincidence in the data.

Once a correlation between two factors has been established from experimental data it would be necessary to advance the research to determine what the causal relationship might be.