Google Refine

In my last post, I talked about using mail merge and excel to do data cleanup. Mail merge and excel are good tools for data cleanup. But like many tools they have their limitations. Another tool that is more powerful is called Google Refine. Google Refine is an open source tool that allows you to cleanup up messy data.

Essentially Google Refine looks like a spreadsheet. For basic operations, it almost acts like a spreadsheet. However, Google Refine goes beyond these to incorporate the use of regular expressions to help clean up thousands of records at a time. A nice analogy is that Google Refine is to metadata in xml or csv as MarcEdit is to MARC data.

It is necessary to download Google Refine and install it on your computer. This is a relatively simply process. The website has several videos that introduce you to Refine and user guides. You’ll see that a diamond appears in the unzipped file. Click on this and Google refine will open a command window and then the application will appear in your default  browser.

You can create a new project or open one you’ve already been working on. You can upload xml, csv or xslx files. I’ve opened xml and xslx files without any problem. If you upload a xml file, Google Refine will ask you to delimit a metadata record and then place that xml into a “spreadsheet” presentation for you to work on.

What I’ve found the most powerful in Google Refine are the regular expressions or GREL. With GREL, you can replace, add or remove text. There are math or date functions and much more. Depending on the extensions you download, Google Refine will also support Jython or other languages. Google Refine also has some built in expressions that help you reconcile data, remove white spaces, or transform cases from upper to lower, etc.

This is a very powerful tool to help you clean up thousands of records at a time. At the moment, the only limitations that I have run across are as follows:

Google Refine has trouble viewing and manipulating more than 2400 records. I found it works best with around 2100 or around there.

It was difficult to find sufficient documentation on GRELs. I continue to experiment with GRELs and see how they behave. However, there’s no GRELs for dummies out there.

In general, I found it better to jump in and start using Google Refine. There is a learning curve. If you begin you’ll need to be comfortable jumping in. Though the introductory videos are helpful, they are really too vague in my opinion.

Have fun and definitely give this a try….