The differential biology reader

 
Data have always been smoky

The first step in making sense of your data is just that: make sense of it. What is this a record of? What do the variables mean? Stefano Mazzocchi argues that much of the open data movement focuses only on the arcana of delivery, and that in the process it becomes hard to determine what individual data points refer to:

the fact that without a high relational density, having a dataset in RDF doesn’t give any practical advantage over having it in its original format.

Yet, from a marketing/political point of view, the simple act of “triplifying” a dataset and make it available on the web as linked data seems to make it appear all more powerful, all more useful and it’s being used a lot as a way to promote the idea that the web of data is finally getting traction.

By grinding all those rectangular datasets into triples, they’ve actually managed to make it *less* useful than in its original form. In the original form at least I had a little context of what this data was for and from, which is lost here. A surprising achievement, but I bet you won’t read about it at semantic web conferences any time soon.
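
To make the complaint concrete, here is a minimal sketch of what "triplifying" a rectangular dataset amounts to. The table, its column names, and the `triplify` helper are all invented for illustration; this is not Mazzocchi's code or any particular RDF toolkit, just plain Python tuples standing in for triples.

```python
# Hypothetical two-row table; the column names and coded values are made up.
rows = [
    {"id": "r1", "species": "3", "sex": "0"},
    {"id": "r2", "species": "7", "sex": "1"},
]

def triplify(rows, key="id"):
    """Flatten each row into (subject, predicate, object) tuples."""
    triples = []
    for row in rows:
        subject = row[key]
        for column, value in row.items():
            if column != key:
                triples.append((subject, column, value))
    return triples

triples = triplify(rows)
for triple in triples:
    print(triple)
```

Every triple still points at a bare code such as "3" or "0", and nothing links one record to anything outside the table. The cells have been reshuffled, but no meaning has been added, and whatever context the original file carried is not part of the output.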

I don't think this is a particularly novel problem. Even in flat files you come across coded data, such as '0' and '1' for sex, or something more obscure for species or condition. If you've looked at older government-provided data sets, you know how much flipping back and forth it takes between the data and the document explaining the variable codings. Perhaps the sheer amount of data these folks are dealing with just magnifies the problem out of all proportion.
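
The flipping back and forth can at least be automated once the codings are written down. The sketch below assumes a hypothetical codebook; which code means which sex is my invention here, and the real mapping has to come from the data set's own documentation.

```python
# Hypothetical codebook: the labels are assumptions, not taken from a real data set.
codebook = {
    "sex": {"0": "female", "1": "male"},
}

def decode(record, codebook):
    """Replace coded values with labels; leave undocumented codes untouched."""
    decoded = {}
    for variable, value in record.items():
        labels = codebook.get(variable, {})
        decoded[variable] = labels.get(value, value)
    return decoded

record = {"sex": "1", "species": "3"}
decoded = decode(record, codebook)
print(decoded)
```

The species code passes through unchanged because the codebook says nothing about it, which is exactly the situation you are in when the explanatory document has gone missing.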
