Information Quality & Data Sampling

One of the many things that I enjoy about this blog is that I get to work through issues that I encounter at work. This blog began when the library where I was working became an RDA test partner. Over the years, it has given me an outlet to explore possibilities and share them. This brings me to my last post on data sampling.

In my mind I had thought of metadata quality and data sampling as two different activities. Since that post, I’ve continued to think about the relationship between the two. One of the examples that I had in mind when writing my last post was metadata cleanup. Currently, a number of colleagues and I are working on data cleanup for a future migration. Our systems librarians are creating reports based on issues that have been called out as problematic, such as format inconsistencies between bib and item records. The people working on these reports don’t necessarily understand the data. Sometimes what has been tagged as a cleanup project turns out to be something else once you look at the data. Hence the necessity of sampling the data in the reports to determine whether it really reflects the issue that was initially tagged, something else entirely, or the original issue plus something else.
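To make the sampling step a little more concrete, here is a minimal sketch in Python of what I have in mind. It is only an illustration under assumptions of my own: the report rows, the field names (`bib_format`, `item_format`), the labels, and the functions `sample_report` and `classify` are all hypothetical, and in practice the classifying is done by a person who knows the data, not by a three-line function.

```python
import random
from collections import Counter


def sample_report(rows, k, seed=0):
    """Return up to k rows chosen at random, without replacement.

    A fixed seed keeps the sample reproducible so colleagues can
    review the same rows.
    """
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


def classify(row):
    """Stand-in for a human reviewer's judgment (hypothetical labels)."""
    bib, item = row["bib_format"], row["item_format"]
    if not bib or not item:
        return "something else: missing format"
    if bib != item:
        return "original issue: format mismatch"
    return "no format issue found"


# Hypothetical rows from a report tagged "format inconsistencies".
report = [
    {"id": 1, "bib_format": "book", "item_format": "dvd"},
    {"id": 2, "bib_format": "book", "item_format": "book"},
    {"id": 3, "bib_format": "serial", "item_format": ""},
    {"id": 4, "bib_format": "map", "item_format": "book"},
    {"id": 5, "bib_format": "dvd", "item_format": "dvd"},
    {"id": 6, "bib_format": "", "item_format": "book"},
]

sample = sample_report(report, 4)
findings = Counter(classify(row) for row in sample)
for label, count in findings.items():
    print(f"{label}: {count}")
```

If most of the sampled rows land under something other than the original issue, that is a sign the report is not really about what it was tagged for.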

The question that I’ve been asking since concerns the relationship of that activity to metadata quality. At first, my focus was on how to sample data. Then I asked why it is necessary to sample data in the first place. In the example above, the aim is to enhance the quality of the metadata by improving the accuracy of formats across bib and item records. In this case, data sampling is a means to the end of better metadata quality.

What about this scenario? A researcher receives a data set and must determine if the data are appropriate for her research. One could say that this is very different from metadata quality. However, if I consider a broader term such as information quality, I could conceive of the researcher sampling this fictitious data set as a means to discern its information quality. By this, I mean that the researcher needs to see if the data are a good fit, have the components or data points needed for her research, and are appropriate. Beyond this, it is also important to consider the quality of the data itself as data (accuracy, consistency, trustworthiness, etc.).
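One small piece of that check, whether the data points the researcher needs are actually present, could be sketched as follows. Again, this is an illustration under my own assumptions: the function `field_coverage`, the sample records, and the field names are hypothetical, and coverage says nothing about accuracy or trustworthiness, which still call for a closer look.

```python
def field_coverage(records, required_fields):
    """For each required field, return the share of records with a non-empty value."""
    total = len(records)
    coverage = {}
    for field in required_fields:
        # Count a field as present only if it exists and is not None or "".
        filled = sum(1 for r in records if r.get(field) not in (None, ""))
        coverage[field] = filled / total if total else 0.0
    return coverage


# Hypothetical sample drawn from the data set the researcher received.
sampled = [
    {"title": "A", "year": 1999, "language": "eng"},
    {"title": "B", "year": None, "language": "eng"},
    {"title": "", "year": 2005},
    {"title": "D", "year": 2010, "language": "fre"},
]

coverage = field_coverage(sampled, ["title", "year", "language"])
for field, share in coverage.items():
    print(f"{field}: {share:.0%} of sampled records")
```

A field that is missing from a large share of the sample suggests the data set may not be a good fit, or at least that the gap needs to be understood before going further.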

In this sense, data sampling falls within the realm of information quality. It is one of the activities that help ascertain information quality. Going forward, I’d like to explore this idea further, along with the other activities that make up information quality. What are these activities? And what are the relationships between them?