The opendata world is here!

After the age of hardware came the age of software. Now the age of data is here.

Why Now?

World Bank Is Opening Its Treasure Chest of Data

At least, it is a fun excuse to mark the date.

As a fan of visualization tools and interfaces, for some time now I have been taking notes, talking with several people and finding out the main problems you hit when you try to work with the new opendata repositories.

First, the name: opendata. Open? It is curious that the USA, UK, Australia and New Zealand simply use the name “data”, because these governments are publishing public data. The rest of the countries, the ones starting their data projects right now, are adopting the name Open Data for the same kind of project. To me, coming from a Latin-Mediterranean culture, this “open” implies that there is, and will be, a lot that stays “closed”. I would suggest naming it: public data.

Second, the formats. I suggest that public institutions publish their datasets in several formats at the same time. Of course, I am talking about free, standard formats: CSV, ODS, XML/RDF, KML, JSON, … We can open a public discussion on which free formats we prefer. With a standardization of the published formats we can achieve at least two goals. On the one hand, lots of people will stop wasting time converting files; conversion is far from being an easily reproducible process, since the versions of the file editors and converters can mess things up.

On the other hand, a fixed list of formats coming directly from the source of the data will make it easier to reference a concrete dataset, for example: “I took the July 4th dataset in CSV from data.gov.xx”.
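To make the first goal concrete, here is a minimal sketch of how a publishing institution could export one dataset to two free formats from a single source, so nobody downstream has to convert files by hand. The rows, file names and fields are hypothetical examples, and only the Python standard library is used:

```python
import csv
import json

# Hypothetical sample rows; a real publisher would pull these from its database.
rows = [
    {"name": "Town Hall", "lat": 41.3825, "lon": 2.1769},
    {"name": "Public Library", "lat": 41.3870, "lon": 2.1701},
]

# Export the dataset as CSV...
with open("buildings.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "lat", "lon"])
    writer.writeheader()
    writer.writerows(rows)

# ...and as JSON from the very same source, so users never convert by hand.
with open("buildings.json", "w", encoding="utf-8") as f:
    json.dump(rows, f, ensure_ascii=False, indent=2)
```

Since both files are generated from the same in-memory rows, they cannot drift apart the way hand-converted copies do.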

 

hey, people, data is loading!

Data life

I see every public dataset as a work in progress. Most public datasets can be improved: quite often the data is not accurate, contains errors, or could be updated and extended. All these improvements give every dataset a real life, a timeline of its history. For example, take the list of all the public buildings in Catalonia (Catalan Government). I dived into this dataset: it is great to have it, but the geolocation of the buildings needs a lot more work. The ones who can best improve this dataset are the citizens, and especially the data workers. All the efforts to clean and improve a dataset must be put together, so we do not lose them.

Data needs to be considered in time, with versions and changes. This is why I propose using version control software to follow the evolution of datasets. To be more specific, I will start storing my data on a Git server, probably GitHub at the beginning.
An advanced version control system like Git allows:
  • to follow the evolution of the datasets (contributors, contributions, time)
  • to be sure that two or more people are using exactly the same dataset
  • to improve the dataset with a simple git pull request
  • to publish a dataset as a software release
  • to change a dataset with a simple git fork
  • to use live & dynamic data, reading it directly from the repository (see the sketch after this list)
  • … and a lot of new ideas that will come out…
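To make that live-reading idea concrete, here is a minimal sketch of reading a dataset straight from a public Git repository, assuming the hosting service exposes raw files over HTTP. The URL, repository and column names are hypothetical:

```python
import csv
import io
from urllib.request import urlopen

# Hypothetical raw-file URL; any file in a public repository can be fetched this way.
URL = "https://raw.githubusercontent.com/example-user/public-data/master/buildings.csv"

with urlopen(URL) as response:
    text = response.read().decode("utf-8")

# Parse the CSV in memory: the reader always sees the latest committed version.
for row in csv.DictReader(io.StringIO(text)):
    print(row["name"], row["lat"], row["lon"])
```

Because every commit has a stable identifier, the same approach can also pin a visualization to one exact version of the dataset instead of the moving tip of a branch.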
The time of data is here, and the people need public data to hold power to account. If information is power, public data is the power of the multitude.