Discussion of Study Results ©2003, D.F. Parkhurst
In this section, I present and discuss the results of the pre- and post-tests. I will present each question, and follow that with a graph and a discussion of the results for that question.
Each graph represents a particular question. Each arrow (representing the answers for one student) starts with a circle symbol at a pre-test score, and points toward the post-test score for that same student; thus an upward arrow indicates improvement. The length of the arrow denotes the amount of improvement. Lack of an arrowhead indicates a student’s score that did not change between the two tests. The nth horizontal position represents the same student in all graphs.
Questions that address the concepts I considered most important are shown first, and in bold face. The numbers represent the order in which the questions appeared on the tests.
5. A team of atmospheric chemists measure concentrations of several air pollutants over two Midwestern industrial cities, each of about 100,000 population. In a talk on their results at a scientific meeting, they note that they found no significant differences in the concentrations between the two cities. After hearing the talk, a journalist reports that the researchers had shown that there were no important differences in the measured substances between the two cities. Is the journalist’s translation of the scientists’ results for the “lay reader” accurate? Explain.
[Graph: pre-test to post-test score changes for Question 5]
One of the two points I emphasized above all others in teaching this course was that failing to obtain statistical significance from a study provides no evidence that the effect being studied does not occur to an important extent. Thus, I was pleased by the results for this question, for which only one student received a negative score at the end, thirteen students improved (some substantially) over the semester, and the four declines were only ½ point each. I appear to have fulfilled my teaching goal well on this point.
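The point about non-significance can be illustrated with a small simulation. The sketch below is not part of the course materials; the effect size, standard deviation, sample size, and the use of a z-test with known standard deviation are all assumptions chosen for illustration. It generates many small two-group studies in which a real difference exists by construction, and counts how often the test fails to reach significance.

```python
import random
from statistics import NormalDist, mean


def two_sample_z_p(x, y, sigma):
    """Two-sided P-value for a difference in means, assuming a known sigma."""
    se = sigma * (1 / len(x) + 1 / len(y)) ** 0.5
    z = (mean(x) - mean(y)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def fraction_not_significant(true_diff, sigma, n, trials=2000, alpha=0.05, seed=1):
    """Share of simulated studies with P > alpha although true_diff is real."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        city_a = [rng.gauss(0.0, sigma) for _ in range(n)]
        city_b = [rng.gauss(true_diff, sigma) for _ in range(n)]
        if two_sample_z_p(city_a, city_b, sigma) > alpha:
            misses += 1
    return misses / trials


# Hypothetical settings: a true difference of half a standard deviation,
# ten measurements per city.
rate = fraction_not_significant(true_diff=0.5, sigma=1.0, n=10)
print(f"Share of studies that were 'not significant': {rate:.2f}")
```

With these made-up settings, most simulated studies miss significance even though the difference is real in every one of them, which is why "no significant difference" cannot be translated as "no important difference."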
6. Ecologists interested in the effects of rising atmospheric CO2 levels on growth of oak seedlings grow five seedlings per chamber in ten growth chambers with 340 ppm CO2 (near the present concentration) and five seedlings per chamber in ten chambers maintained at 600 ppm CO2. Average growth in the high CO2 chambers was about 6% higher on average than that in ambient (340 ppm) chambers, but in each case the results varied from chamber to chamber. The P-value from an analysis of variance of the data was 18.7%.
After reading about this study for a class, a student asked what that P value meant. The professor replied, “When P > 5%, that tells you that any differences in growth between the two treatments was just a result of random chance.” Does that reply seem correct to you? If not, explain briefly how your own interpretation would differ.
[Graph: pre-test to post-test score changes for Question 6]
This question deals with a closely related point, but one that I stated in these words only once during the semester. Eleven students improved, some substantially, although six declined and eight post-scores were negative. I have since tried to explain this question more clearly, and will continue to do so in the future.
10. Suppose (hypothetically) that public health officials in Connecticut want to spray urban and suburban forests with a certain fungus that they think will reduce the numbers of deer ticks, which transmit Lyme disease from deer and other mammals to humans. The State Department of Environmental Protection contracts with some university biologists to test whether the fungus affects the numbers of non-target species, for example soil mites. Counts of mites from a set of plots sprayed with the fungus, and from matched control plots sprayed with plain water are obtained.
A DEP statistician proposes analyzing those data with Bayesian methods rather than with the frequentist methods that are more commonly used. Explain the difference, if you know it. Even better, describe a question relevant to this situation that could be answered by Bayesian methods but that could not be answered by frequentist ones.
[Graph: pre-test to post-test score changes for Question 10]
This question represents another success story. Only one student received a positive score on the pre-test, but sixteen did at the end of the semester. I referred to the advantages of Bayesian analysis numerous times during the semester, and spent two weeks explaining the basics of how to do it near the end. The use of this analysis is (and in my view should be) growing, and I’d like our graduates to recognize and appreciate it if they see it in their jobs.
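One kind of question a Bayesian analysis can address directly is "how probable is it that spraying reduces mite numbers by more than 20%?" The sketch below is only an illustration and not the analysis taught in the course. It assumes Poisson-distributed counts, a weak Gamma prior on each mean count, and made-up data, and for simplicity it ignores the pairing of the matched plots.

```python
import random


def posterior_rate_draws(counts, rng, draws, prior_shape=0.5, prior_rate=0.001):
    """Draw from the Gamma posterior of a Poisson mean (conjugate update).

    The prior parameters are assumptions chosen to be weakly informative.
    """
    shape = prior_shape + sum(counts)
    rate = prior_rate + len(counts)
    # random.gammavariate takes a shape and a SCALE, so pass 1 / rate.
    return [rng.gammavariate(shape, 1.0 / rate) for _ in range(draws)]


def prob_reduction_exceeds(control, sprayed, fraction=0.20, draws=20000, seed=2):
    """Posterior probability that the sprayed mean is more than `fraction` below control."""
    rng = random.Random(seed)
    lam_control = posterior_rate_draws(control, rng, draws)
    lam_sprayed = posterior_rate_draws(sprayed, rng, draws)
    hits = sum(s < (1 - fraction) * c for c, s in zip(lam_control, lam_sprayed))
    return hits / draws


# Hypothetical mite counts per plot (not data from any real study).
control_counts = [52, 47, 61, 55, 49, 58]
sprayed_counts = [41, 39, 50, 44, 36, 47]

p_20 = prob_reduction_exceeds(control_counts, sprayed_counts)
print(f"P(reduction > 20% | data) is about {p_20:.2f}")
```

A frequentist significance test would instead report how surprising the data are if there were no effect at all, which is a different question from the one a manager deciding whether to spray usually wants answered.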
12. Many statistical experts believe that scientists who use statistics in their work frequently overemphasize and misinterpret null-hypothesis significance testing, often in ways that run counter to environmental protection and public-health protection. Are you aware of this belief, and can you give an example of such a misinterpretation?
[Graph: pre-test to post-test score changes for Question 12]
My greatest emphasis in E538 is on correct interpretation of statistical significance tests, because they are so frequently misinterpreted, often to the detriment of public health and environmental protection. I am therefore especially pleased that eighteen of the students finished the course with positive scores on this question, and all eighteen showed improvement from the pre-test. (The other two students started and ended with zeros, but at least not with negative scores.) On this subject, most students appear to have learned what I hoped they would.
1. Imagine that you work in an EPA office, or in an environmental consulting firm, and that you are hiring an assistant for your work. Why might you want that assistant to have good knowledge of statistics? State as specifically as you can what you see as the most important reason.
[Graph: pre-test to post-test score changes for Question 1]
Of all the questions, this one, about why statistics would be useful for an environmental scientist, received the best answers in the pre-test, with only one student starting with an answer I scored negative. To my surprise, five of the twenty students gave poorer answers at the end of the course, though the declines were small. The replies of 13 students improved. (It is possible that half- or one-point changes could represent the vagaries of how students worded their answers, and how I scored them.) I did not lecture directly about this issue during the semester, but I expected students to gain this appreciation from the examples we dealt with. Overall, I was satisfied by the improvement.
2. In your job with a state environmental agency, you obtain daily samples from the Goodwater River both upstream and downstream from a small industrial city. Among other data, you measure the dissolved oxygen concentration (DO) in the water (important for the health of fish and other organisms) at the two sites. Your supervisor asks you to perform a paired t-test to check whether DO is lower below the city than above it. However, this kind of test is valid only when data meet certain important requirements. Which one or more of those requirements might be likely to be violated by these data?
[Graph: pre-test to post-test score changes for Question 2]
Here I asked for the requirements for a paired t-test to be valid, and no one received a positive score on the pre-test, nor, unfortunately, on the post-test. Indeed, only four students improved their scores, while eight had declines! On reflection, I realize that I did not lay out these requirements explicitly during the semester. This result provides an example of how this testing can help me improve my teaching.
3. While working on a joint project, one of your collaborators states that “most data (for continuous variables) in realistic situations are normally distributed.” Would you agree or disagree? Explain your answer.
[Graph: pre-test to post-test score changes for Question 3]
It is a common misconception among scientists who know a little statistics (just enough to be dangerous?) that most sets of data are normally distributed. It did not surprise me that only four of the twenty students started out with positive scores on this question. Early in the semester I did stress that data often come from skewed, non-normal distributions, and explained why that would occur. Thirteen of the students improved their scores from pre- to post-, yet only nine ended with positive scores. It may be that by the end of the semester students had forgotten the discussion of this issue; furthermore, we talked several times about how means of data are often nearly normal, even when the individual data values are not. Students may have confused these two concepts. The results for this question leave me a little disappointed, and I will try to explain the distinctions better in the future.
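The distinction between individual data values and means of data can be made concrete with a short simulation. This is a minimal sketch, assuming a lognormal distribution as a stand-in for skewed environmental data; the distribution, its parameters, and the sample size of 50 are illustrative choices, not values from the course.

```python
import random
from statistics import mean, pstdev


def skewness(values):
    """Sample skewness: the average cubed standardized deviation."""
    m = mean(values)
    s = pstdev(values)
    return mean(((v - m) / s) ** 3 for v in values)


rng = random.Random(3)

# Individual observations from a strongly right-skewed distribution.
raw = [rng.lognormvariate(0.0, 1.0) for _ in range(20000)]

# Means of samples of 50 observations drawn from the same distribution.
sample_means = [
    mean(rng.lognormvariate(0.0, 1.0) for _ in range(50)) for _ in range(2000)
]

raw_skew = skewness(raw)
mean_skew = skewness(sample_means)
print(f"Skewness of individual values: {raw_skew:.2f}")
print(f"Skewness of sample means:      {mean_skew:.2f}")
```

The individual values stay strongly skewed no matter how many are collected, while the sample means are much closer to symmetric. Both statements can be true of the same data, which may be the source of the confusion described above.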
4. An ecologist performed an experiment in which some bags of maple leaf litter (dead leaves) were sprayed with a nitrogen solution, and other “control” bags were sprayed with distilled water. The mass of litter lost by decay after the bags lay for six months on the forest floor was then measured, and the mean losses in the two treatments were compared statistically. The resulting “P value” of 11% was greater than the 5% rejection level previously chosen by the researcher.
If the ecologist then mentioned these results casually to a colleague, could that colleague reasonably make either of the following statements? Please explain your answer. (One answer, covering both statements, is sufficient.)
· “If you were to repeat this experiment under identical conditions, but with the bags placed in different random positions in the forest, it is probable (more likely than not) that you would again fail to obtain statistical significance.”
· “If you were to repeat this experiment under identical conditions, but with the bags placed in different random positions in the forest, it is probable (more likely than not) that you would obtain a statistically significant result.”
[Graph: pre-test to post-test score changes for Question 4]
This question asks about another point that I never lectured on explicitly, namely whether a “statistically significant” (or not significant) result is likely to arise again if a study is repeated. That likely explains the generally poor and declining scores on this question. Tversky and Kahneman (1971) show that many scientists expect far too much repeatability from relatively small studies, and I plan to ask students to read that paper in future semesters. (This figure shows only very small dots at the pre-test score.)
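How repeatable a result is depends on the true effect, which is unknown. As a rough illustration only, the sketch below makes the strong assumption that the true effect exactly equals the one observed, and that the test statistic is approximately normal. Under those assumptions it computes the chance that an identical repeat study reaches significance at the 5% level.

```python
from statistics import NormalDist


def replication_probability(p_observed, alpha=0.05):
    """Chance that a repeat study gives P < alpha, ASSUMING the true effect
    equals the observed one and the test statistic is approximately normal."""
    std_normal = NormalDist()
    z_observed = std_normal.inv_cdf(1 - p_observed / 2)
    z_critical = std_normal.inv_cdf(1 - alpha / 2)
    # Significant in the same direction, plus the (tiny) chance of
    # significance in the opposite direction.
    same_direction = 1 - std_normal.cdf(z_critical - z_observed)
    opposite_direction = std_normal.cdf(-z_critical - z_observed)
    return same_direction + opposite_direction


for p in (0.11, 0.05, 0.01):
    print(f"observed P = {p:.2f} -> chance of P < 0.05 on repeat: "
          f"{replication_probability(p):.2f}")
```

Under these assumptions, a study that just reached P = 0.05 would have only about an even chance of reaching significance again, and a study with P = 0.11 would have less. If the true effect were larger or smaller than the observed one, the answer would change accordingly.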
7. State at least two reasons why a scientist or environmental manager obtaining a set of data might want to know how the data are distributed before proceeding with data analysis.
[Graph: pre-test to post-test score changes for Question 7]
The results of asking why knowing the distribution of data is useful were very similar to the results for Question 1. Here there was one decline from a positive to a negative score, but that was counterbalanced by several fairly large improvements. Overall, I appear to have explained the usefulness of knowing distributions well.
8. Some statisticians describe their work as “searching for pattern in data, subtracting out that pattern, searching for further pattern in the resulting residuals, and continuing until no apparent pattern remains.” Does this sequence of actions seem familiar to you? If so, can you describe a simple example of a situation for which it might be usefully applied?
[Graph: pre-test to post-test score changes for Question 8]
Like Question 3, this one asks about a subject dealt with only early in the semester. That may explain why the six improvements were countered by six declines and seven students stayed at zero (“don’t know”) scores for both tests.
9. Define the meaning of “the power of a statistical significance test.”
[Graph: pre-test to post-test score changes for Question 9]
When asked to define statistical power, only one student received a positive score on the pre-test, while ten did so at the end of the course. Of those who did not, two improved, four declined, and four showed no change. Given that I did not emphasize power very strongly, I consider this a general success.
11. The first kind of statistical test to which many science students seem to be exposed is the χ² (chi-squared) test. For what type of data is that test most appropriate?
[Graph: pre-test to post-test score changes for Question 11]
Somewhat like Questions 2 and 4, this one serves to show that the pre-test/post-test comparisons are meaningful. By that I mean that I did not spend much (or any) time on the subjects of these three questions, and the lack of improvement in the scores reflects that neglect nicely. I had included this question because many students’ first introduction to significance testing is with chi-squared tests. I do not teach them, however, because there are usually better ways of analyzing data. (This figure is missing the starting circles.)
13. Ready availability of computers these days eases the use of “resampling methods” of data analysis (like randomization tests and bootstrapping) in place of more traditional statistical methods. Describe the general idea of resampling methods briefly, and explain what advantages they have over traditional methods.
[Graph: pre-test to post-test score changes for Question 13]
This last question yielded odd results. Every student had a zero or negative score on the pre-test, indicating lack of knowledge about resampling methods. On the post-test only two earned positive scores, while thirteen provided substantially incorrect answers. Most of these supplied almost identical answers, suggesting that resampling methods allow increasing the amount of data available when the original sample size is small. I speculate that this idea may have arisen in one of the Friday lab sessions, but I haven’t been able to confirm this. In any case, I was careful to clarify this issue in 2002, and will do so in the future as well.
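The misconception described above can be addressed with a small bootstrap example. The sketch below is illustrative only: the dissolved-oxygen values are made up, and the percentile interval is just one of several bootstrap interval methods. It resamples the original observations with replacement to estimate the uncertainty of the median. Every resampled value comes from the original ten observations, so no new data are created.

```python
import random
from statistics import median


def bootstrap_ci(sample, stat=median, reps=5000, level=0.95, seed=4):
    """Percentile bootstrap confidence interval for `stat` of `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    # Each replicate redraws n values WITH replacement from the same sample.
    estimates = sorted(stat(rng.choices(sample, k=n)) for _ in range(reps))
    lo_idx = int((1 - level) / 2 * reps)
    hi_idx = int((1 + level) / 2 * reps) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Hypothetical dissolved-oxygen measurements in mg/L, with one high outlier.
do_mg_per_l = [6.1, 7.4, 5.8, 9.2, 6.6, 7.0, 12.3, 6.9, 7.7, 6.3]

low, high = bootstrap_ci(do_mg_per_l)
print(f"Sample median: {median(do_mg_per_l):.2f} mg/L")
print(f"95% bootstrap interval for the median: {low:.2f} to {high:.2f} mg/L")
```

The advantage illustrated here is that the interval needs no assumption that the data are normally distributed and works for statistics, such as the median, that lack a simple textbook formula. The sample size is still ten, and the width of the interval reflects that.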
Fifteen students answered the last question on the post-test about how much of any improvement in scores they would attribute to this course, in contrast to other sources. Their responses, as percentages, were 50, 75, 80, 85, 90, 95, 95, 98, 98, 100, 100, 100, 100, 100, and 100.
The table below summarizes the results shown in the graphs above, in terms of relative frequencies of students whose answers declined, remained unchanged, or improved from pre- to post-test, question by question. As with the figures above, the four questions that addressed the concepts I considered most important are listed first, and in bold face.
| Question | Topic | Score declined | Score unchanged | Score improved |
| 5 | Does “not significant” mean “not important?” | 20% | 15% | 65% |
| 6 | Does “not significant” imply “just random chance?” | 25% | 20% | 55% |
| 10 | Do you know the advantages of Bayesian analysis? | 0% | 15% | 85% |
| 12 | Have you heard that many misinterpret significance tests? | 5% | 10% | 85% |
| 1 | Why are statistics useful? | 25% | 15% | 60% |
| 2 | What are requirements for paired t tests? | 40% | 40% | 20% |
| 3 | Are most data normally distributed? | 25% | 5% | 70% |
| 4 | Is “significance” consistent across similar experiments? | 45% | 40% | 15% |
| 7 | Why is knowledge of distributions useful? | 20% | 20% | 60% |
| 8 | Is sorting pattern and residuals familiar? | 25% | 40% | 35% |
| 9 | What is statistical power? | 20% | 25% | 55% |
| 11 | When would you use chi-square test? | 40% | 45% | 15% |
| 13 | Describe resampling methods, and state advantages. | 60% | 30% | 0% |
| | Averages across questions | 26.9% | 24.6% | 47.7% |