I was thinking about the various technologies at our fingertips for manipulating and understanding data. Immediately I thought of OpenRefine. At the same time, I also thought of the times OpenRefine was difficult to use, such as with an XML file containing repeatable tags. From this, I began to question the pros and cons of not just OpenRefine but of other tools like MarcEdit or Excel. The big question underlying all of this is: what does it mean to sample data? Honestly, being able to determine not just the quality of metadata but also the trends, and the exceptions to those trends, is work we all do as metadata librarians. This was the first time I actually took a step back to consider what is involved in the process of metadata sampling. Further, how do you teach others to effectively sample metadata?
There is a large body of research on metadata quality, but what about data sampling or profiling? At this initial point of inquiry, I mean the ability to understand the structure and content of a particular data set. Data sampling or profiling is the bread and butter of the social sciences, which have been calling this type of work exploratory data analysis for many years. After a short tour of the literature, though, I wasn't sure how it applied to library metadata sampling techniques. Searching for research on metadata sampling led me instead to techniques for determining metadata quality. Metadata sampling incorporates the need to determine the quality of that metadata, but it also goes beyond this: it is necessary to formulate an understanding of the structure of the data set, the trends of data entry, and the exceptions to those trends.

When I think of structure, I think of how the metadata is encoded, marked up, or packaged. As metadata librarians, we encounter a number of ways our metadata is packaged: spreadsheets, MARC files, XML files, etc. Notice that I'm mixing format with standards here. Teasing this out further, I like the idea of determining how the metadata is packaged, because "package" is deliberately vague and can incorporate both the standards in play and the format of the data set file(s). One reason I like this is that a package can refer to any metadata set: one from a vendor website, one from a knowledge base, one from an FTP server, one from a researcher's grant-funded project, one from a library system's API or export feature, etc. Another reason is that it implicitly illustrates just one of the complex layers of dealing with metadata. After the package, it is necessary to understand how the set was created, what needs to be done with it, and, more importantly, what trends and exceptions exist in the metadata content itself. One step is definitely determining the quality of the metadata.
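To make the structural side of this concrete, here is a minimal sketch, in Python, of profiling the "package" of a small XML metadata set: counting how often each element occurs overall and per record, which surfaces exactly the repeatable-tag situation that trips up tools like OpenRefine. The records, tag names, and values below are invented for illustration, not from any real data set.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample set with a repeatable <subject> tag and an
# inconsistently present <date> tag.
sample = """<records>
  <record>
    <title>First item</title>
    <subject>History</subject>
    <subject>Maps</subject>
  </record>
  <record>
    <title>Second item</title>
    <date>1999</date>
  </record>
</records>"""

root = ET.fromstring(sample)

# Overall tag frequencies across the whole set (skip the wrapper element).
tag_counts = Counter(el.tag for el in root.iter() if el.tag != "records")

# Per-record profile: which tags repeat within a record, which are absent.
for i, record in enumerate(root.findall("record"), start=1):
    per_record = Counter(child.tag for child in record)
    repeated = [tag for tag, n in per_record.items() if n > 1]
    print(f"record {i}: {dict(per_record)} repeated={repeated}")

print(tag_counts)
```

A profile like this, run over a sample of records rather than the whole set, is often enough to reveal the structural trends (every record has a title) and the exceptions (only some records carry a date, subjects repeat) before deciding how to transform the data.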
Here, quality is in direct relation to the needs and requirements of the project of which the data set is a part. Beyond the question of quality, identifying trends means being able to explore the data in such a way as to understand how it was entered. Can you spot a trend with dates, notes, or subject headings? Are there content standards in play? Was there any subject analysis? How were these standards applied? From these analyses, one can determine trends and the exceptions to those trends. These are the processes used to evaluate vendor records, or other sets of records, before they go into library systems.
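Spotting a trend with dates, as asked above, can be sketched the same way: classify each value in a field by pattern, treat the dominant pattern as the trend, and surface everything else as an exception for individual review. The date values and pattern names below are hypothetical, chosen only to show the shape of the technique.

```python
import re
from collections import Counter

# Invented date values, mimicking a vendor field with mixed entry practices.
dates = ["1999", "2001", "1985-06-01", "1990", "circa 1900", "2003", "19xx"]

# Candidate patterns; anything that matches none of them is an exception.
patterns = {
    "YYYY": re.compile(r"^\d{4}$"),
    "YYYY-MM-DD": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}

def classify(value: str) -> str:
    for name, rx in patterns.items():
        if rx.match(value):
            return name
    return "exception"

profile = Counter(classify(d) for d in dates)
exceptions = [d for d in dates if classify(d) == "exception"]

print(profile)      # the dominant pattern is the trend
print(exceptions)   # values needing individual review
```

Whether "circa 1900" counts as an error or an acceptable local practice is exactly the kind of judgment the surrounding project requirements decide; the profiling only makes the question visible.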
What is fascinating is that this work relies on both tools and good old-fashioned investigative work. The investigation doesn't necessarily involve the entire data set, but rather pieces of it used to create a narrative of the whole. Based on my experience, I came up with this graphic, which is also informed by my time as a research data librarian. I'm just at the beginning of what it means to profile library data in a formalized way. I'd love to hear others' thoughts on the matter, or about their own experiences.