No, this post isn't Star Trek related. It's about SCIENCE!
I realize that the real world makes a lot of things difficult for those who collect data. Consider this, then, a gripe list of things I may or may not expect to get fixed.
Moving on, I work on computer models to better understand physical processes. These models are built and parameterized from real-world data of one kind or another. This means I consume a lot of data in my job. However, the form in which I get data is often not the form I, as a modeler, would like. Here are some of the bigger issues:
- Oddball formats. On a given day, I have to deal with files in Excel format, varieties of plain text, oddball Fortran binary files (undocumented, of course), NetCDF, HDF, ESRI Shapefiles, GeoTIFF, and a whole bunch of other raster formats. And this doesn't particularly faze me, either. What fazes me? Incomplete oddball formats. Whenever a new and interesting bit of geodata ends up on my desk in an oddball format, I can wind up spending over a day poking at it. When I get gridded data, I need to know a few details about each datapoint in the grid:
- The lat/lon coordinates
- Units being used
- The digital representation of the data
Most data I get tends to provide these details: well-put-together NetCDF & HDF files, as well as anything produced by or with a GIS (such as GeoTIFF, Shapefiles, etc.), tend to be alright. I can work with this. However, if I have to figure things out myself, it slows me down. Sometimes by a lot, if I misinterpret it: if my reading looks reasonable to my eye but is actually wrong, this Can Cause Badness. This is largely the case with Fortran binary files, but I get a lot of plain text that also shares these problems, where I have an unclear idea of where or what is being represented. Examples:
- Points are addressed from 0 to 360 degrees east instead of -180 to 180 (or vice versa)
- There may be a README file with the line "value is thousands / 20" or something similarly vague.
- Oddball Measurements. Let's start with something basic. If a value is supposed to be between zero and one, then why do your measurements include numbers both less than zero and greater than one? Saying "sometimes in real life, things work out like that" doesn't satisfy me when I'm building a numerical model where the textbook definition of a function expects a value to be between zero and one. Tell me why this is, perhaps with a textbook reference that goes into more depth. I don't mind that your measurements fall outside the traditionally accepted bounds of reality. I mind that there's no systematic explanation of why this is.
- Oddball time-series. In modeling, we want to see a lot of things over time. The more fine-grained our models become, the more fine-grained we want our measurement data. If I'm simulating the growth of a crop, I would like to see measurements taken while the crop is growing, as well as after it's done growing. It's hard to determine a trend if one only has one or two datapoints to work from.
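Two of the gripes above are mechanical enough to sketch in code: the 0 to 360 versus -180 to 180 longitude mix-up, and values that stray outside their expected bounds. What follows is a minimal Python sketch, not anything from my actual toolchain; the function names and the half-open [-180, 180) convention are assumptions made for illustration.

```python
def to_signed_longitude(lon_deg):
    """Map a longitude in degrees east onto the [-180, 180) convention.

    Assumption: 180 degrees maps to -180, so the interval is half-open.
    """
    return (lon_deg + 180.0) % 360.0 - 180.0


def to_unsigned_longitude(lon_deg):
    """Map a longitude in degrees east onto the [0, 360) convention."""
    return lon_deg % 360.0


def out_of_bounds(values, lower=0.0, upper=1.0):
    """Return (index, value) pairs that fall outside [lower, upper].

    NaN fails both comparisons, so NaN values are reported as well.
    """
    return [(i, v) for i, v in enumerate(values) if not (lower <= v <= upper)]


if __name__ == "__main__":
    print(to_signed_longitude(270.0))     # -90.0
    print(to_unsigned_longitude(-90.0))   # 270.0
    print(out_of_bounds([0.2, -0.05, 1.0, 1.3]))  # [(1, -0.05), (3, 1.3)]
```

Note that a check like `out_of_bounds` only tells me which datapoints look suspicious. It can't tell me why they are that way, and neither conversion helps if nobody wrote down which longitude convention the file uses in the first place.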
I suppose the grand summary of this griping is that as a modeler, I have my hands full with a lot of things. Spending weeks trying to work with bad data is frustrating; human hands entered the data into a computer, so why can't those same hands explain the data?