Over the past week or so, there have been several discussions about Google as the last library and its weak metadata. These came about after a presentation by Geoff Nunberg, as well as articles recounting Nunberg’s assessment of Google’s metadata errors.
The article that appeared in The Register, “Google Book Search – Is it the last library?” by Cade Metz, recounts Geoff Nunberg’s view of Google as the last library and his disparaging assessment of Google’s lack of good metadata (from his presentation, Google Books: A Metadata Train Wreck). Cade sets the tone with:
Geoff Nunberg, one of America’s leading linguistics researchers, laid this rather ominous tag on Google’s controversial book-scanning project amidst an amusingly-heated debate this afternoon on the campus of the University of California, Berkeley.
“This is likely to be The Last Library,” Nunberg said during a University conference dedicated to Google Book Search and the company’s accompanying $125m settlement with US authors and publishers. “Nobody is very likely to scan these books again. The cost of scanning isn’t going to come down. There’s no Moore’s Law for scanning.”
Cade continues with two important issues. First, who will have control over these scanned books? Second, now that Google is the major player (if not the only player of its size in town), how can it be the only library on which researchers of the future must rely if the metadata is unreliable and inconsistent? Unreliable metadata or not, the problem is that Google has had a head start, has more money than most libraries around, and has control not just of the scanned books but of much of the Internet as well.
Cade ends with this very ambiguous conclusion:
Google says that if it hadn’t scanned all those books, no one else would have. And now there’s less incentive to scan all those books. But Google insists it’s not The Last Library.
What is interesting is that ripples of Geoff Nunberg’s assessment have become waves of discussion. In his article, Google, the Last Library and Millions of Metadata Mistakes, Norman Oder focuses on the inconsistent metadata that Geoff Nunberg analyzed in the Google Book scanning project and that the Register article brought up. This article goes beyond Cade’s reporting in that it provides some answers from Google regarding Nunberg’s accusations.
Google’s Jon Orwant, who manages the Google Books metadata team, responded at length:
First, we know we have problems. Oh lordy we have problems. Geoff refers to us having hundreds of thousands of errors. I wish it were so. We have millions. We have collected over a trillion individual metadata fields; when we use our computing grid to shake them around and decide which books exist in the world, we make billions of decisions and commit millions of mistakes. Some of them are eminently avoidable; others persist because we are at the mercy of the data available to us. The quality of our metadata now is a lot better than it was six months ago, and it’ll be better still six months from now. We will never stop improving it.

“We have a cacophony of metadata sources—over a hundred—and they often conflict,” he added, contrasting that with library cataloging practices. “Without good metadata, effective search is impossible, and Google really wants to get search right.”
The unreliable metadata is not just Google’s fault; it also comes from the outside sources that provide metadata to Google. This point is taken up by Christine Schwartz over at her blog, Cataloging Futures, who reminds us that this is a source of worry not just for Google but for libraries as well. In her post, Christine mentions that digital projects constantly face metadata errors. She asks some great questions, such as:
- At what point is the metadata created, and by whom?
- Is metadata automatically extracted?
- Is there human oversight or any quality control?
I think we can add to Christine’s list of questions the following:
- What standards are being implemented?
- How much are local policies changing or tweaking national or international standards?
- How are unknowns, such as unknown creators or unknown dates, being dealt with?
- Are metadata creation and quality control done consistently even when librarians come and go from the digital projects?
One of the problems that can be seen is that the metadata may be consistent within one digital collection at one library, but another collection or another institution does things differently – even if the same metadata schema is used (such as Dublin Core, which comes in many local flavors).
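To make the “local flavors” problem concrete, here is a minimal sketch in Python of how two institutions might encode the same book under unqualified Dublin Core, and how a naive consistency check would flag the divergence. The records, the title, and the identifier are invented for illustration; they do not reflect any institution’s actual practice.

```python
# Two hypothetical records for the same book, each following a
# different local flavor of unqualified Dublin Core.
record_a = {
    "title": "A History of Printing",
    "creator": "Smith, Jane",        # inverted, AACR2-style name
    "date": "1923",                  # plain four-digit year
    "identifier": "OCLC:12345678",   # invented number, scheme prefixed
}

record_b = {
    "title": "A history of printing",
    "creator": "Jane Smith",         # direct-order name
    "date": "[192-?]",               # local uncertain-date convention
    "identifier": "12345678",        # same invented number, no prefix
}

def flag_divergences(a: dict, b: dict) -> list[str]:
    """Report fields where two records for the same item disagree.

    A naive literal comparison: real matching would first have to
    normalize names, dates, and identifier schemes.
    """
    return [
        field
        for field in sorted(set(a) | set(b))
        if a.get(field) != b.get(field)
    ]

print(flag_divergences(record_a, record_b))
# ['creator', 'date', 'identifier', 'title'] -- every field differs
# at the string level, even though both records describe one book.
```

A human cataloger would recognize these as the same item at a glance, but a machine comparing strings sees four conflicts. Multiply that across the hundred-plus metadata sources Orwant mentions and the scale of Google’s problem becomes clear.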
However, the issue is more complex, as Karen Coyle points out in her post, GBS and Bad Metadata: the Google book scanning project also gets much of its metadata for its books from libraries through OCLC. As Karen ventures, Google most likely has a contract with OCLC that restricts what Google can take. Karen writes:
This leaves us with a bit of a mystery, although I think I know the answer. The mystery is: why would Google only use limited metadata from the participating libraries? And why won’t they answer the question that I asked at the Conference: “Do you have a contract with OCLC? And does it restrict what data you can use?” Because if the answer is “yes and yes” then we only have ourselves (as in “libraries”) to blame. And Nunberg and his colleagues should be furious at us.
Interestingly enough, OCLC answered Karen’s question with the following statement that she recently posted:
In a recent post in the NGC4LIB list, we got a very welcome answer from Chip Nilges of OCLC about Google’s use of WorldCat records:
To answer Karen’s most recent post, Google can use any WC metadata field. And it’s important to note as well that our agreement with Google is not exclusive. We’re happy to work with others in the same way. The goal, as I said in my original post, is to support the efforts of our members to bring their collections online, make them discoverable, and drive traffic to library services.
Karen goes on in her post to discuss the types of metadata that Google should include in the book scanning project, among them Scholarship, Collection Development, Metasearch, Links to other related resources, and Computation. These types cover a broad spectrum of what researchers need when collecting and analyzing research materials, and the metadata has to be as clear as possible to support reliable connections, especially for linked data. The comprehensive approach Karen suggests seems to require more metadata rather than less. Perhaps, though, instead of taking everything from a record, the question is one of quality over quantity: rather than harvesting as much metadata as possible, gathering the qualitative metadata that uniquely identifies an item and promotes knowledge discovery is, and will become, increasingly essential.
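As a rough illustration of that quality-over-quantity point, here is another Python sketch contrasting two ways of describing the same book: a verbose record of free-text strings versus a leaner record anchored by shared identifiers of the kind linked data depends on. Every value, and in particular every URI, is an invented placeholder, not a real record.

```python
# Quantity: many free-text fields, none of them machine-resolvable.
verbose_record = {
    "title": "A History of Printing",
    "creator": "Smith, Jane",
    "publisher": "Example Press",
    "place": "London",
    "notes": "Includes bibliographical references.",
    "subject": "Printing -- History",
}

# Quality: fewer fields, but each points at a shared identifier, so
# other systems can link to the same entity without string matching.
# (All URIs below are hypothetical placeholders.)
linked_record = {
    "title": "A History of Printing",
    "creator_uri": "http://viaf.org/viaf/0000000000",
    "work_uri": "http://worldcat.org/oclc/12345678",
    "subject_uri": "http://id.loc.gov/authorities/subjects/sh00000000",
}

def linkable_fields(record: dict) -> list[str]:
    """List the fields another system could join on directly."""
    return [k for k, v in record.items() if str(v).startswith("http")]

print(len(linkable_fields(verbose_record)))  # 0 -- nothing to join on
print(len(linkable_fields(linked_record)))   # 3 -- three shared hooks
```

The point is not that free-text fields are worthless, but that a handful of shared identifiers does more for knowledge discovery across collections than a much larger pile of unlinked strings.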
A big question that comes out of reading these conversations is how libraries can help Google. Would Google even want help from libraries to improve the metadata in its book scanning project? Furthermore, can libraries improve their own metadata qualitatively and make it more interoperable across libraries and digital collections? Perhaps that is the first step libraries need to take before they can even approach Google….