Thursday, July 30, 2009

Pobody's Nerfect: Errors in an Age of Open Source Data

There's a saying in astronomy that you've never really reduced your data until you've reduced it at least 3 (or 4 or 5) times. Data reduction is the fairly tedious task of taking the raw digital images obtained at a telescope and turning them into calibrated, quantifiable (number-based) results. It's absolutely crucial, but it is a pain. When we get to the end, we often find some unexpected feature in the data that makes us go back and re-examine every step of the reduction. And then we do it all again to make sure we get the same answer.
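
For readers curious about what "reduction" actually means in practice, here is a minimal sketch of the two most basic steps for an optical CCD image: bias subtraction and flat-fielding. The arrays and numbers below are made up purely for illustration; a real pipeline reads raw files from the telescope and adds many more steps (dark correction, cosmic-ray rejection, astrometric and photometric calibration, and so on).

```python
import numpy as np

# Toy stand-ins for real telescope frames (illustration only).
rng = np.random.default_rng(42)
shape = (64, 64)
true_sky = 100.0 * np.ones(shape)                 # idealized sky signal (counts)
pixel_sensitivity = rng.normal(1.0, 0.05, shape)  # pixel-to-pixel response variation
bias_level = 500.0                                # electronic offset added by the camera

raw_science = true_sky * pixel_sensitivity + bias_level
raw_flat = 10000.0 * pixel_sensitivity + bias_level  # exposure of a uniformly lit screen
raw_bias = np.full(shape, bias_level)                # zero-second exposure

# Step 1: subtract the electronic bias from every frame.
science = raw_science - raw_bias
flat = raw_flat - raw_bias

# Step 2: divide by the normalized flat field to even out pixel-to-pixel sensitivity.
calibrated = science / (flat / np.median(flat))

print("mean calibrated signal:", round(calibrated.mean(), 1))  # ~100, the input sky level
```

Even in this toy version you can see the point of the exercise: the signal we actually care about only emerges after the camera's own quirks have been measured and divided out.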

This process can be relatively fast, if we used the telescope in a fairly standard way and the instrument is well understood, or it can take a long time if we were pushing the telescope or using a new instrument. Even then, problems crop up. I had to throw out an entire night's worth of beautiful-looking data from the Lick Observatory because there was an odd instrumental problem that I couldn't solve (stars known to be twice as bright as other stars did not produce twice as many detected photons), and the problem seemed to come and go through the night. Rather than risk making claims based on those data, I had to start over. Of course, I didn't discover the problem until months later when I was trying to interpret the results, so I had to wait an entire year for the Earth to swing back to the proper side of the sun for viewing those stars.
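
The sanity check that exposed that problem is simple in principle: if one star is known to be twice as bright as another, its measured counts should be about twice as large. Here is a toy version of that kind of linearity check; the count values are invented for illustration and are not from the actual Lick data.

```python
import numpy as np

# Hypothetical photometry of two stars whose catalog brightnesses
# differ by exactly a factor of two (about 0.75 magnitudes).
counts_bright = 41230.0  # detected photons from the brighter star (made up)
counts_faint = 26150.0   # detected photons from the fainter star (made up)

measured_ratio = counts_bright / counts_faint
expected_ratio = 2.0

# On a healthy, linear detector these should agree to within the noise.
print(f"measured ratio {measured_ratio:.2f} vs expected {expected_ratio:.2f}")
if not np.isclose(measured_ratio, expected_ratio, rtol=0.05):
    print("Possible non-linearity; flag this night's data for closer scrutiny.")
```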

So, it is well established among scientists that raw data and normally-reduced data often have errors, sometimes serious and hidden ones. The scientific process seeks to identify and correct those errors, preferably sooner rather than later, but sometimes we even have to change results after we've published them. After all, we want to be right when we are probing the secrets of the Universe. Nobody wants to be remembered as the one who claimed to see irrigation canals on Mars ("Yeah, he founded a world-class observatory and started the program that discovered Pluto, but he saw canals on Mars"); we would rather our names be thrown in with people like Hubble and Zwicky and Schmidt, whose then-radical conclusions, based on careful analysis of observations, were proven correct.

With the ascendancy of the Internet, many science programs (especially NASA programs) now publish their images and data online as soon as the pictures come in. Satellite weather maps have been transmitted live for years, and I remember watching live on CNN as Voyager 2's pictures from Uranus (I think; maybe it was Neptune) were sent back to Earth. But recently more and more space missions are dumping data directly to the Internet. The Mars Exploration Rovers Spirit and Opportunity are perhaps the best known, but NASA's flotilla of Earth-watching satellites, such as Terra and Aqua with their MODIS instruments, posts new images of the Earth after every data dump (and, if you have a satellite receiver, you can even tune in and capture the data yourself in real time!).

These nearly-live data streams are quickly processed, but errors do happen. When they are noticed, the errors are corrected as quickly as possible, but sometimes this takes a while. For example, many of the Mars images you will see are partial because the data stream was interrupted; these images will be re-sent by the rover during the next data relay, but that can be a day or two away. For other missions, the time to correct data can be even longer -- it depends on how fast the problem is noted, how easy it is to correct, and how busy the scientists dealing with the data are. Sometimes the data corrections are minor, and sometimes the adjustments are major.

Again, we scientists are cool with this. We assume all results are preliminary and subject to change, though we hope they are correct by the time they are published in a peer-reviewed journal (often years later).

In the August issue of Scientific American, an article points out a problem with this open access to data. In at least three recorded instances (I'm sure there are more!), data measuring Earth's climate were incorrectly reduced and analyzed. In these cases, the errors made the climate appear slightly warmer than it really was, and in one case caused the overnight "disappearance" of millions of square kilometers of sea ice. When these errors were recognized, they were corrected. My reaction as a scientist is, "I'm glad they caught and fixed the errors."

But because these data were openly available to everyone, many laypeople watching them on the Internet saw these mistakes as, at best, incompetence and, at worst, evidence of a conspiracy to inflate the effects of global warming.

It's unfortunate that such public errors happened in a very politically charged subject, but I believe neither that the scientists are incompetent nor that there is a conspiracy to cook the data. Rather, I think the public is seeing the truth of how observational science, whether astronomy, climatology, geology, or any other such field, works.

Scientific observation is not a nice, clean process where every data point is sacrosanct and set in stone as soon as it comes out of an instrument. It's a drawn-out process of analysis, interpretation, re-analysis, and re-interpretation. Sometimes we learn that our calibration was wrong. Sometimes we learn that our instrument was broken right when the most exciting part of an event was happening. Sometimes we find out months later that, although we thought we had great information, it's really unsalvageable garbage. The hardest part of data reduction is coming away confident that we haven't made a mistake, and that's why we are forgiving (to a reasonable extent) when changes are made to data.

Our problem is that, as we do science out in the open (which I think is generally a good thing), the public gets to see the uglier parts of the process. Much of the public, if not the vast majority, is unaware of how complicated the process really is. That ignorance makes people susceptible to anyone who comes along and says, "Look, the number changed! Either they were lying before, they are lying now, or they are incompetent buffoons." This is a false choice, and it casts an unwarranted shadow over otherwise good science.

So, should science be done out in the open? Almost certainly! But perhaps the open-source paradigm of instantaneous access to all data for everyone is not the best model. I find the idea very appealing, but if such access is going to harm the scientific process, then I think we had better continue to discuss the issue and not be afraid to refine the idea.
