How to inventory all of your agency's datasets


The question for this session was "How to inventory all of your agency's datasets?"

 

I'm sure that Harlan Yu took more notes but I (Gary Berg-Cross) note the following:

 

People agreed that we need a "discovery process" for what is valuable.

Some guidance on what data is valuable can come from an agency's Enterprise Architecture (EA). Indeed, the management, technical and governance capabilities around EA should be leveraged for inventorying and publishing all of an agency's datasets.

 

We also need to provide the data in a usable form, the form people actually need it in. Even spreadsheet formats are helpful.

Some progress has been made by putting data in XML for exchange. Organizations are still struggling to go beyond this and put it in RDF, which allows richer metadata that can help people understand the data.
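As a rough illustration of what "richer metadata in RDF" could look like, here is a minimal sketch that writes a dataset description as subject-predicate-object triples in N-Triples syntax. The dataset URI, the field values, and the helper names `literal` and `dataset_triples` are hypothetical; the choice of the Dublin Core Terms vocabulary is an assumption, not something the session settled on.

```python
# Hypothetical sketch: describe one dataset as RDF triples (N-Triples syntax).
# The Dublin Core Terms namespace is real; everything else is illustrative.
DCT = "http://purl.org/dc/terms/"


def literal(text):
    """Quote a plain string as an N-Triples literal, escaping special characters."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def dataset_triples(dataset_uri, title, publisher, description):
    """Return one N-Triples line per metadata field of the dataset."""
    fields = [
        ("title", title),
        ("publisher", publisher),
        ("description", description),
    ]
    return [
        f"<{dataset_uri}> <{DCT}{name}> {literal(value)} ."
        for name, value in fields
    ]


# Made-up dataset, used only to show the shape of the output.
triples = dataset_triples(
    "http://example.gov/dataset/1",
    "Sample spending records",
    "Example Agency",
    "Illustrative record only.",
)
for line in triples:
    print(line)
```

Each line is a self-describing statement about the dataset, which is the property that makes RDF metadata easier to merge and query across agencies than an ad hoc XML layout.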

 

We discussed data quality issues, starting with the problems seen when Recovery.gov put a large amount of data out and then got a black eye when people found QA issues with it. Some recent ideas from Clay Johnson at Sunlight Labs were noted in passing.

 

On the other hand, this led to better data and reflects the idea that "crowd sourcing" may help with data quality. We need to get people involved in a "feedback loop". Harlan suggested that we think of this like a software "bug report". Martha Johnson also had a nice idea on "small, fail, fast", which expressed the idea that we fail and move forward. Perhaps we can treat some data publishing like a beta pilot and have only a focus group of people look at it, knowing it has issues that can be corrected.
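To make the "bug report" analogy concrete, here is a minimal sketch of what a data issue report and a tiny tracker might look like. All names here (`DataIssue`, `IssueTracker`, the fields, the status values) are hypothetical and chosen for illustration; no existing Data.gov or Recovery.gov feature is implied.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class DataIssue:
    """One reader-submitted problem report against a published dataset."""
    dataset_id: str
    row_ref: str        # where the problem is, e.g. a row or record identifier
    description: str
    status: str = "open"


@dataclass
class IssueTracker:
    """Keeps reports so data stewards can see and close them."""
    issues: list = field(default_factory=list)

    def report(self, dataset_id, row_ref, description):
        issue = DataIssue(dataset_id, row_ref, description)
        self.issues.append(issue)
        return issue

    def resolve(self, issue):
        issue.status = "fixed"

    def open_counts(self):
        """Number of still-open reports per dataset."""
        return Counter(i.dataset_id for i in self.issues if i.status == "open")


# Made-up reports, only to show the feedback loop end to end.
tracker = IssueTracker()
first = tracker.report("grants-2009", "row 17", "Award amount is negative")
tracker.report("grants-2009", "row 52", "ZIP code does not match state")
tracker.resolve(first)
print(tracker.open_counts())
```

The point of the sketch is the loop itself: a reader files a report, a steward fixes the record, and the open count per dataset gives a simple, visible measure of quality over time.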

 

Gary suggested that we need to develop a culture in which such imperfect data is allowed to get out and be rapidly improved by the "crowd" of reviewers.

 

For quality, regular metadata may not be enough. We may need to classify data, but by what criteria? One idea is that we need to put data in a context in which it is understood. We also need a policy track for Data.gov.

 

One question is whether any data classification would make sense to both citizens and Congress, who are both consumers of the data. Also, if we do this, might some managers misuse the classification?

 

Susan (GSA) said that people are still thinking from an older culture of fear about these things, in that we understand the cost of getting the data out but not the benefits. We will know more when we get more data out and more eyes see it. If the public sees the benefits, they will demand it.

Gary suggested that some of this discussion gets at the issue of trying to make Open Government self-correcting. We need effective feedback loops (including from data stewards and Business Owners) for this to happen.

 

As this happens, there will be a change in some of the roles that deal with data. How is data governance working? We need Best Practices for data administration, from content to metadata.

DOT noted that we may need people to annotate the data. We also need a visible presence of the people who publish the data.

 

Gary suggested that along with the raw data agencies might publish a small number of illustrated examples of the information and reports derived from a stratified selection of the data.
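A "stratified selection" can be sketched as follows: group the records by some category, then draw a few records from each group so that every category shows up in the illustrated examples. This is only a sketch under assumptions; the function name `stratified_sample`, the record layout, and the `state` field are hypothetical, and an agency would pick its own strata.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, per_stratum, seed=0):
    """Pick up to `per_stratum` records from each group defined by `key`."""
    rng = random.Random(seed)          # fixed seed so the selection is repeatable
    strata = defaultdict(list)
    for rec in records:
        strata[rec[key]].append(rec)

    sample = []
    for name in sorted(strata):        # sorted for a stable output order
        group = strata[name]
        k = min(per_stratum, len(group))
        sample.extend(rng.sample(group, k))
    return sample


# Made-up records, only to show the call.
records = [
    {"id": 1, "state": "MD"}, {"id": 2, "state": "MD"}, {"id": 3, "state": "MD"},
    {"id": 4, "state": "VA"}, {"id": 5, "state": "VA"},
    {"id": 6, "state": "DC"},
]
examples = stratified_sample(records, key="state", per_stratum=2)
print(examples)
```

Drawing from every stratum keeps the published examples from being dominated by whichever category happens to have the most rows.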

 

NASA discussed the question of who owns data, since some contractors invest in the data and it becomes unclear whether it is still in the public domain. This means it may not be released.

 

Gary suggested that in the future contracts can be written so contractors compete for work based partly on how open the resulting data products will be.