Did you know that today is “Day of DH,” the Day of Digital Humanities? Our Digital Humanities Librarian scheduled an entire day of digital humanities fun. It started off with a vote on sessions and workshops for the morning and afternoon, mixed in with hackfests and tool/software presentations. We had anywhere from 20 to 30 people throughout the day! The best part is that metadata featured prominently in many of the sessions.
During lunch, a colleague asked how to present the topic of linked data to an audience unfamiliar with it. The added challenge, as my colleague explained, is that the audience would not necessarily have the technical background to understand linked data. We looked for examples. One example my colleague showed me was the Virtual International Authority File (VIAF). Essentially, the issue we were skirting was name disambiguation, a huge problem, especially with journals. There are partial solutions, such as author identifiers in PubMed and Scopus. Recently, ORCID, the Open Researcher and Contributor ID, launched out of beta. ORCID assigns each author a permanent unique identifier associated with a profile; profiles can be created by authors or by third parties, such as libraries. What ORCID does, essentially, is bring together the variant forms of an author's name. Though this is not linked data per se, it is related (no pun intended): you can use the unique ID in other places to link out to an author's profile. In a sense, we're talking about making relationships, linking one idea to another. In this particular case, it links variant names to one authorized form (an activity well known in the cataloging world).
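To make that linking idea concrete, here is a minimal sketch (not something we did at lunch) of dereferencing an ORCID iD against ORCID's public API to pull back the profile it points to. The endpoint path, the response fields, and the sample iD are assumptions for illustration; check ORCID's current API documentation before relying on them.

```python
# Minimal sketch: dereference an ORCID iD via the public API.
# The endpoint, response structure, and sample iD below are
# assumptions for illustration -- verify against ORCID's docs.
import json
import urllib.request

orcid_id = "0000-0002-1825-0097"  # hypothetical example iD
url = f"https://pub.orcid.org/v3.0/{orcid_id}/person"

req = urllib.request.Request(url, headers={"Accept": "application/json"})
with urllib.request.urlopen(req) as resp:
    person = json.load(resp)

# The record carries an authorized form of the name plus variants,
# which is exactly the disambiguation work catalogers know well.
name = person.get("name") or {}
print((name.get("given-names") or {}).get("value"),
      (name.get("family-name") or {}).get("value"))
for other in (person.get("other-names") or {}).get("other-name", []):
    print("variant:", other.get("content"))
```

The point of the sketch is simply that the identifier itself is actionable: the same iD works as a key in any system that wants to point at that author.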
After lunch, we played with NVivo, a software tool for analyzing unstructured data. It is a fantastic tool if you need to do any text or data mining. It works with a number of companion tools such as NCapture, RefWorks, Twitter, and SurveyMonkey, and it also accepts external data sets in various formats (xlsx, txt, csv). For the session, we wanted to analyze a Twitter search based on a hashtag. We captured a search on the hashtag #poetry using NCapture and, thanks to the functionality of Chrome, imported the data set into NVivo. From there, we saw the metadata associated with the export, visualized the data using NVivo's Map feature, and got details about the number of tweets, retweets, and other useful information. One point the session leader emphasized was that this kind of text mining is driven by metadata: if you pull from social networking sites, there is always some associated metadata, and it tells you what the data corresponding to each label mean. With Twitter, the metadata labels were easy to understand. In my next example, however, they were not.
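To illustrate that point about metadata driving the analysis, here is a small sketch (separate from the NVivo session) that tallies tweets and retweets from a hypothetical CSV export of a hashtag capture. The file name and column labels are assumptions, but they stand in for the kind of metadata labels a Twitter export carries.

```python
# Small sketch: tally tweets by their metadata from a hypothetical
# CSV export. File name and column names are assumptions.
import csv
from collections import Counter

retweets = 0
tweets_per_user = Counter()

with open("poetry_tweets.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        # Each row pairs the tweet text with metadata labels such as
        # the posting user and whether the tweet is a retweet.
        tweets_per_user[row["username"]] += 1
        if row["is_retweet"].lower() == "true":
            retweets += 1

print(f"{sum(tweets_per_user.values())} tweets, {retweets} retweets")
print("most active:", tweets_per_user.most_common(5))
```

Notice that the analysis never touches the tweet text at all; the counts, the map, and the retweet figures all come straight from the metadata labels.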
Lastly, we looked at Europeana. Our goal was to use the API service to extract data. First you need an API key, which just means registering with Europeana's API portal. With the key, you can construct a query; Europeana has plenty of examples of how to build one. If this is a little beyond you, use the API Console, which constructs the query for you. Once you do this, you put in your query (for example, search everything for Mozart), and it returns a set of results in JSON. Now you might ask: what do I do with that? Cut and paste the JSON into a simple text editor (TextWrangler, TextEdit, Notepad) and save it under whatever name you like with the extension .json. Then, with the open-source software OpenRefine, you can upload this file and create a new project. OpenRefine is great for data cleanup and visualization. Once you get the JSON data set into OpenRefine, you can start playing with the metadata and data. Of course, we did all of that! What great fun. The first thing we saw was that the metadata labels used by Europeana were not always helpful; at times we had to guess what the data were, which is never a good thing. That made me think again of the necessity of including README files with any data set, but that's a different topic.
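If you would rather script that step than cut and paste, here is a rough sketch of the same workflow in Python. It follows Europeana's documented Search API, but treat the exact endpoint URL and parameters as assumptions and check the current documentation; the output file name is just the one we might hand to OpenRefine.

```python
# Rough sketch: query the Europeana Search API and save the JSON so
# OpenRefine can open it as a new project. Endpoint and parameters
# follow Europeana's documented API but may need checking.
import json
import urllib.parse
import urllib.request

API_KEY = "YOUR_API_KEY"  # obtained by registering at Europeana's API portal

params = urllib.parse.urlencode({
    "wskey": API_KEY,
    "query": "Mozart",   # the example search from the session
    "rows": 100,
})
url = f"https://api.europeana.eu/record/v2/search.json?{params}"

with urllib.request.urlopen(url) as resp:
    results = json.load(resp)

# Write the results to disk; OpenRefine can import this .json file.
with open("mozart.json", "w", encoding="utf-8") as f:
    json.dump(results, f, indent=2)

print(f"saved {len(results.get('items', []))} records to mozart.json")
```

Either way, manual or scripted, you end up with the same .json file to load into OpenRefine, and the same puzzle of figuring out what some of those metadata labels actually mean.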
The great lesson from today was that metadata really is everywhere. It's extremely important even if users don't know they rely on it to do their work. This is why it is important to consider data consistency and accuracy, and especially name disambiguation. And this was all from the perspective of primarily English majors! Go Digital Humanists!