Data Services in the Law Library
By Shay Elbaum, Reference Librarian, Stanford Law School
As empirical legal scholarship continues to pick up steam, academic law librarians are increasingly called upon to support empirical research and provide data services. I’ve had the chance to do so in one of my first assignments as a new reference librarian, gathering and managing a dataset for one of our faculty who’s applying statistical methods to data collected from Congress.gov.
As a new librarian with just enough tech know-how to be dangerous, I’ve found working on this project to be a learning experience in several dimensions. I’m sharing some highlights here in the hope that others in the same position will glean something useful.
Getting the data / Don’t make it harder than it needs to be
The project began before this faculty member joined us at Stanford, with the excellent Michael Lindsey at Berkeley laying the foundations and gathering the data to begin with. Michael kindly shared with us the existing SQL database along with the PHP scripts he used to scrape and process data from Congress.gov. Taking over the project wasn’t as easy as just putting the database on our own servers and tweaking the scripts, unfortunately; after encountering a few roadblocks, we began looking for an easier way.
Happily, the Congress.gov developers had been busy. They’d added a feature for downloading search results in CSV format, and the data pulled included almost everything our faculty member was interested in. Now we had a spreadsheet containing all the relevant data, no PHP or SQL needed (I did need to do a little bit of web scraping to fill in some blanks in the downloadable data, but that’s another story).
This was a reminder of something I tell students all the time in the research instruction context: see if someone else has already done the work for you! Instead of a treatise chapter with citations to all the key cases, it was a “download search results” feature, but the principle remains the same.
Cleaning the data / Use the right tools for the job
Now I needed to get the data into a format that our faculty member could use. Some information needed to be extracted from free-text fields and given its own column (e.g., names of states). There were also errors and gaps in the data and some inconsistencies in how certain information was recorded. The goal here was to prepare the data for analysis with a light hand and to document everything I did as well as anything I spotted that could affect analysis (e.g., committee names sometimes change, and the downloaded data for earlier Congresses lists the names that committees had at the time).
This is where this turns into a “Cool Tools” blog post. I attended a workshop on a tool called OpenRefine shortly before beginning this stage of the project, and it turned out to be perfect for the job. OpenRefine is “a powerful tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data.”
Among other useful features, it made it possible to mass-edit subsets of the data: for example, removing leading and trailing whitespace in every cell, splitting multivalued cells into separate columns, and identifying records with a particular kind of error and selectively editing those records. It also records every edit made, making it easier to document changes, undo, or look back at the original without losing your work. There’s a session on OpenRefine at this year’s AALL Annual Meeting. I’d recommend it to everyone who’s ever cursed at Excel.
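For readers who think more easily in code, here is a rough Python sketch of two of the mass edits mentioned above: trimming whitespace in every cell and splitting a multivalued cell into separate columns. It is only an illustration of the idea, not how OpenRefine works internally and not the steps actually used on this project. The column names, the semicolon separator, and the sample rows are all made up.

```python
def trim_cells(rows):
    """Strip leading and trailing whitespace from every cell in every row."""
    return [{key: value.strip() for key, value in row.items()} for row in rows]


def split_column(rows, column, separator=";"):
    """Split a multivalued column into numbered columns, padding short rows with ""."""
    split_values = [
        [part.strip() for part in row[column].split(separator)] for row in rows
    ]
    # Every row gets as many new columns as the longest list of values needs.
    width = max((len(parts) for parts in split_values), default=0)

    result = []
    for row, parts in zip(rows, split_values):
        new_row = {key: value for key, value in row.items() if key != column}
        padded = parts + [""] * (width - len(parts))
        for index, part in enumerate(padded, start=1):
            new_row[f"{column} {index}"] = part
        result.append(new_row)
    return result


# Hypothetical sample rows; the real dataset's columns are not shown in this post.
rows = [
    {"Nominee": "  Jane Doe ", "Committees": "Judiciary; Finance"},
    {"Nominee": "John Roe", "Committees": " Armed Services "},
]

cleaned = split_column(trim_cells(rows), "Committees")
```

Unlike this sketch, OpenRefine also keeps a history of each step, which is a large part of why it suited a project where every change needed to be documented.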
This was also my introduction to regular expressions, a tool for recognizing patterns in text. Using regular expressions in OpenRefine, I was able (for example) to pull out the names of states from a general description field (“So-and-so, of Michigan, to be the Director of…”) by describing the usual pattern (state name follows “, of” and is followed by a comma), and to do this for every record with just one command. This didn’t grab all the states because not all the entries followed that pattern – but OpenRefine’s filtering tools made it easy to find the inconsistencies and repeat the process. Probably old hat for anyone familiar with these tools, but for me, it felt pretty magical. For anyone in the same position, I looked through several regular expressions tutorials and found Regular-Expressions.info to be the most helpful; your mileage, as always, may vary.
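To make the state-name example concrete, here is a minimal sketch of the same idea in plain Python rather than in OpenRefine’s own expression language. The pattern below is an assumption based on the description above (state name follows “, of” and runs up to the next comma), and the sample descriptions are invented, not real records.

```python
import re

# Assumed pattern: ", of " followed by the state name, which ends at the next comma.
STATE_PATTERN = re.compile(r", of ([^,]+),")


def extract_state(description):
    """Return the state named in a description, or None if the pattern does not match."""
    match = STATE_PATTERN.search(description)
    return match.group(1).strip() if match else None


# Made-up sample descriptions for illustration only.
descriptions = [
    "Jane Doe, of Michigan, to be the Director of the Example Office",
    "John Roe, of New York, to be a Member of the Example Board",
    "Pat Poe of Ohio to be an Example Commissioner",  # does not follow the pattern
]

states = [extract_state(text) for text in descriptions]

# Records the pattern missed, set aside for a second pass with a different pattern.
unmatched = [text for text, state in zip(descriptions, states) if state is None]
```

The `unmatched` list plays the role that OpenRefine’s filtering played for me: it collects the entries that did not fit the usual pattern so they can be handled separately.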
Looking for help
I’ve been saying “we” a lot without explaining to whom that refers. Getting this data into shape has really been – and continues to be – a joint effort. A lot of the time spent has been just OpenRefine and me, but all the key steps have involved other people.
This whole endeavor is meant to enable and support another’s research, so there’s been plenty of back-and-forth with our faculty member who’s using the data. Within the law school, our academic technology specialist and the law school’s IT team have also been tremendously helpful. Outside of the law school, I’ve attended workshops on relevant tools as well as drop-in hours for data wrangling help hosted by our main campus library. I’m also, of course, building on work that someone else began.
I’m lucky to have these resources close at hand, but I could have missed many of them if I hadn’t been looking for support to start with. It’s also been an excellent opportunity to build cross-campus connections, reaching outside of the law library to learn from colleagues for whom this sort of work is their bread and butter.
We’re now moving into the “maintenance” stage of the project. The older, messier data is mostly cleaned up, and the bulk of the work now involves periodic updates as new data is posted to Congress.gov. It’s a good vantage point for reflecting on how we got here. I’ve focused on a few of the more generally applicable takeaways here, but I’m curious what sort of problems (and solutions, hopefully) others have encountered with similar projects, particularly if working with datasets isn’t something you do frequently. Feel free to jump into the comments below if you’ve got any thoughts, dreams, tales, forecasts of doom, etc. to share!