Underthinking for a change

There are times when you have a project and it seems like the fancy tools are just out of reach. If only your department had a software engineer or if you had time to learn python and perl! With the movement of let’s do more with less, many in metadata and cataloging find ourselves strapped for time. Though we could certainly take a moment to learn how to program, sometimes we need to get the project completed in very little time. This happened to me just the other week. I was working with very messy data from a digital collection that was started some 8 years ago. To date, almost a dozen or so people have worked to create or edit the data. Each person had their own perspective of how that data should be entered. And yes it was fairly easy to see when a cataloger entered the data versus a volunteer. Add to this a funky export function offered by the digital library software application and Voilà – very inconsistent, inaccurate and messy data. My job was to quickly (as in you only have 3-4 days) to see how to clean this up and create METS documents based on our METS Profile ready to ingest into a digital repository. The good news was that the data didn’t have to be entirely clean since the end goal was to test if we could get METS files into and out of the digital repository. But the METS files had to be valid and conform to our METS profile. Any HTML had to be removed as well as other characters that have special meaning like brackets or question marks which are used to make queries.

As I digested the news, the person making the request said: “Let’s not overthink this. Let’s use mail merge.” I have never heard of using Word’s (or other word processing software) mail merge function to create METS files. But we needed this information quickly for all 4000+ records. The data were in a csv file. I brought this into excel and did the minimal cleanup. Then used a Word mail merge template that I created using the wizard and linking it to this excel file to create a giant merged document of first 2000+ METS records. Once you have this document, you can use an open source application called MS Word Split (Break, Create) Mail Merge from CNET to create separate METS files for each record. It’s not perfect but you can do this in a minimal amount of time. Thankfully in my case, our programmer created a perl code to separate each of the METS files.

I was very happy to get this advice. I was also happy to “under”think about this project. I would have never thought to use excel to cleanup data and then Word’s mail merge to create a new METS document (saved as .txt). If you need to do data cleanup that is not particularly challenging, this is an excellent approach.

In my next post, I want to talk about Google refine, which is a much better tool than excel for data cleanup.