
Google Refine

In my last post, I talked about using mail merge and Excel for data cleanup. They are good tools, but like many tools they have their limitations. A more powerful alternative is Google Refine, an open source tool that allows you to clean up messy data.

Essentially, Google Refine looks like a spreadsheet, and for basic operations it almost acts like one. However, Google Refine goes beyond this by incorporating regular expressions to help clean up thousands of records at a time. A nice analogy: Google Refine is to metadata in XML or CSV what MarcEdit is to MARC data.

You will need to download Google Refine and install it on your computer, which is a relatively simple process. The website has several videos and user guides that introduce you to Refine. After you unzip the download, you'll see a diamond icon. Click on it, and Google Refine will open a command window; the application will then appear in your default browser.

You can create a new project or open one you've already been working on. You can upload XML, CSV, or XLSX files; I've opened XML and XLSX files without any problem. If you upload an XML file, Google Refine will ask you to delimit a metadata record and then place that XML into a "spreadsheet" presentation for you to work on.
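To picture what that XML import does, here is a rough sketch in Python of the same idea: repeated record elements get flattened into rows, with each child element becoming a column. The `<record>` element name and sample data are made up for illustration; Refine's own import is more flexible than this.

```python
# Illustrative sketch: flatten repeated XML record elements into rows,
# the way Refine presents an imported XML file as a "spreadsheet".
import xml.etree.ElementTree as ET

xml_data = """<collection>
  <record><title>First item</title><creator>Smith, A.</creator></record>
  <record><title>Second item</title><creator>Jones, B.</creator></record>
</collection>"""

root = ET.fromstring(xml_data)
rows = []
for rec in root.findall("record"):
    # each child element becomes a "column" in the row
    rows.append({child.tag: child.text for child in rec})

print(rows)  # two rows with "title" and "creator" columns
```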

What I've found most powerful in Google Refine is its expression language, GREL. With GREL you can replace, add, or remove text; there are math and date functions and much more. Depending on the extensions you install, Google Refine will also support Jython or other languages. Google Refine also has some built-in operations that help you reconcile data, trim whitespace, or transform case from upper to lower, etc.
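To give a feel for the kinds of transforms described above, here is a rough Python equivalent of a typical cell-cleanup chain. The comments note the analogous GREL calls; the `clean_cell` function and the sample value are my own illustration, not anything from Refine itself.

```python
# Rough Python equivalents of common cell transforms; GREL analogues noted.
import re

def clean_cell(value: str) -> str:
    value = value.strip()                  # GREL: value.trim()
    value = re.sub(r"\s+", " ", value)     # collapse runs of whitespace
    value = value.replace("&amp;", "&")    # GREL: value.replace(...)
    return value

print(clean_cell("  Maps  &amp; Atlases "))  # -> "Maps & Atlases"
```

In Refine you would apply a comparable expression to an entire column at once, which is where the "thousands of records at a time" payoff comes from.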

This is a very powerful tool to help you clean up thousands of records at a time. At the moment, the only limitations that I have run across are as follows:

In my experience, Google Refine has trouble viewing and manipulating more than about 2,400 records at a time; I found it works best with around 2,100.

It was difficult to find sufficient documentation on GREL. I continue to experiment with expressions to see how they behave, but there's no "GREL for Dummies" out there.

In general, I found it best to jump in and start using Google Refine. There is a learning curve, so you'll need to be comfortable experimenting. Though the introductory videos are helpful, they are really too vague in my opinion.

Have fun and definitely give this a try….


Filed under cataloging

Underthinking for a change

There are times when you have a project and it seems like the fancy tools are just out of reach. If only your department had a software engineer, or if you had time to learn Python and Perl! With the "do more with less" movement, many of us in metadata and cataloging find ourselves strapped for time. Though we could certainly take a moment to learn how to program, sometimes we need to get the project completed in very little time.

This happened to me just the other week. I was working with very messy data from a digital collection that was started some eight years ago. To date, almost a dozen people have worked to create or edit the data, and each person had their own perspective on how that data should be entered. And yes, it was fairly easy to see when a cataloger entered the data versus a volunteer. Add to this a funky export function offered by the digital library software application and voilà: very inconsistent, inaccurate, and messy data.

My job was to figure out, quickly (as in, you only have 3-4 days), how to clean this up and create METS documents based on our METS Profile, ready to ingest into a digital repository. The good news was that the data didn't have to be entirely clean, since the end goal was to test whether we could get METS files into and out of the digital repository. But the METS files had to be valid and conform to our METS profile. Any HTML had to be removed, as well as other characters that have special meaning, like brackets or question marks, which are used to make queries.
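That last cleanup step can be sketched in a few lines of code. This is a minimal illustration of the kind of scrubbing described, assuming simple regexes are good enough; the exact rules used in the actual project would depend on the METS profile and the query syntax involved.

```python
# Minimal sketch: strip HTML tags and characters with special query
# meaning (brackets, question marks) from a metadata value.
import re

def scrub(text: str) -> str:
    text = re.sub(r"<[^>]+>", "", text)   # drop HTML tags
    text = re.sub(r"[\[\]?]", "", text)   # drop brackets and question marks
    return text.strip()

print(scrub("<i>Map of Ohio</i> [ca. 1850?]"))  # -> "Map of Ohio ca. 1850"
```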

As I digested the news, the person making the request said: "Let's not overthink this. Let's use mail merge." I had never heard of using Word's (or another word processor's) mail merge function to create METS files, but we needed this information quickly for all 4,000+ records. The data were in a CSV file. I brought it into Excel and did minimal cleanup, then used a Word mail merge template, created with the wizard and linked to the Excel file, to produce a giant merged document of the first 2,000+ METS records. Once you have this document, you can use an open source application called MS Word Split (Break, Create) Mail Merge from CNET to create a separate METS file for each record. It's not perfect, but you can do this in a minimal amount of time. Thankfully, in my case, our programmer wrote a Perl script to separate the METS files.
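The same merge-then-split idea can be expressed in a short script, for those who do have a little programming time. This is a hedged sketch only: the METS-like template below is drastically simplified and the field names are invented, but the shape (one template, one row per record, one output per row) mirrors the mail merge approach.

```python
# Sketch of a mail-merge-style generation: substitute each CSV row into a
# simplified, hypothetical METS-like template, one record per output.
import csv
import io
from string import Template

template = Template("<mets><titleInfo><title>$title</title></titleInfo></mets>")

csv_data = "title,creator\nFirst item,Smith\nSecond item,Jones\n"
records = []
for row in csv.DictReader(io.StringIO(csv_data)):
    records.append(template.substitute(title=row["title"]))

# in practice each record would be written to its own file
print(records[0])
```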

I was very happy to get this advice, and happy to "under"think the project for a change. I would never have thought to use Excel to clean up data and then Word's mail merge to create new METS documents (saved as .txt). If you need to do data cleanup that is not particularly challenging, this is an excellent approach.

In my next post, I want to talk about Google Refine, which is a much better tool than Excel for data cleanup.


Filed under cataloging