US Statutes at Large: Essential to understanding our laws and legislative history

One of the benefits of the Congressional Data Coalition has been our ability to collaborate on projects of mutual interest. CDC members recognize that reusable, cleaned-up legislative information, especially the laws themselves, is essential for both the legislative data community and the public. Unfortunately, at least some information will likely not be provided by Congress or will not be provided in a timely manner.

Almost 3½ years ago, in November 2010, GPO and the Library of Congress were authorized by the Joint Committee on Printing to make the following three document sets available on the Internet: Statutes at Large, the Congressional Record (1878-1998), and the Constitution of the United States: Analysis and Interpretation (CONAN). Quoting from the JCP letter: “These are key primary research sources, essential to understanding our laws and legislative history, and they should all be readily available online in electronic format.”

So far, volumes 65 through 124 (1951-2010) of the Statutes at Large and PDF files only of CONAN have been published by the Legislative Branch per the November 2010 authorization.

Why are the Statutes at Large important?

The United States Statutes at Large is the legal and permanent evidence of all the laws enacted during a session of Congress (1 U.S.C. 112). Every law, public and private, is published in the order of its passage. The set contains treaties and international agreements before 1948, concurrent resolutions, proposed and ratified amendments to the Constitution, and proclamations by the President. Pretty much the whole enchilada – and before you ask about the Constitution, yes, volume 1 includes the Declaration of Independence, the Articles of Confederation, and the Constitution of the United States.

But isn’t the US Code the law? Only a subset of the laws in the Statutes at Large are contained within the U.S. Code and many of those laws have been modified by subsequent laws to the point that the original language is difficult to discern. Hundreds of laws have been enacted that never made it into the United States Code. For example, of the 440 laws enacted in 1949, 235 made it into the US Code.

The importance of Internet accessibility to the laws enacted before 1951 should be obvious. The Law Revision Counsel (the organization responsible for putting together the U.S. Code), in its Table of Acts Cited by Popular Name, has identified almost 2,100 laws that were enacted before 1951. Searching legislative text from the 112th Congress (2011-2012) shows that the past is not completely forgotten: about 6 percent of Statutes at Large citations reference pre-1951 volumes.

So while some of these laws are cited in current bills, in 2014 they remain officially available only as paper documents. Unofficially, scanned versions of the volumes are available at the Constitution Society’s website, but until now these volume files have not been broken down into individual laws, treaties, Presidential proclamations, etc.

Making more laws available

Starting in January 2014, the Congressional Data Coalition and citizens joined together to make the individual laws and other documents of the US Statutes at Large available as discrete PDF files. We’re a little over halfway through the initiative, but we need volunteers to help with the final push.

Rather than attempting to produce a full-text table of contents for each volume as was accomplished by GPO for the post-1950 volumes, we’ve extracted the page number where each component (public law, resolution, etc.) begins by reusing the OCRed text from the PDF files. We then crowdsource the proofreading and correcting of the data which is where we need your help. Once the simple table of contents is completed, software extracts the individual PDF files for each sub-document. The software to do all this is open source and available online.
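The extraction step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual software: the `TocEntry` type, the `page_ranges` function, and the label format are all hypothetical. It assumes the simple table of contents is a list of starting pages, and that a document may end partway down the page where the next one begins, so that shared page is included in both ranges.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TocEntry:
    """One row of the simple table of contents (hypothetical structure)."""
    label: str       # e.g. "Public Law 1"; the label format is an assumption
    start_page: int  # 1-based page in the volume PDF where the document begins


def page_ranges(entries: List[TocEntry], last_page: int) -> List[Tuple[str, int, int]]:
    """Return (label, first_page, last_page) for each sub-document.

    Assumption: a document runs through the page on which the next document
    starts, because laws often end partway down a shared page.
    """
    # sorted() is stable, so entries starting on the same page keep their order
    ordered = sorted(entries, key=lambda e: e.start_page)
    ranges = []
    for i, entry in enumerate(ordered):
        if i + 1 < len(ordered):
            end = ordered[i + 1].start_page
        else:
            end = last_page  # the final document runs to the end of the volume
        ranges.append((entry.label, entry.start_page, end))
    return ranges
```

Each resulting range could then be handed to whatever PDF tool is used to copy those pages into a separate file. Proofreading matters because a single wrong starting page shifts the boundaries of two neighboring documents.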

As of April 2014, volumes 28 through 64 (1893-1951) have been processed. We’ve also begun extracting the text from the tables of contents in the volume files and combining it with the simple table of contents data being used to create the files (sort of like a final QA check). By combining the two data sources (the text from the tables of contents along with the public law number and stat page data), we’ve been able to build more usable tables of contents. See the U.S. Statutes at Large Pre-1951 Directory.
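As a rough illustration of how combining the two data sources can double as a QA check, the sketch below joins the simple table of contents with titles taken from a volume's printed table of contents. The function name, the data shapes, and the choice of the public law number as the join key are all assumptions made for this example, not a description of the project's code.

```python
from typing import Dict, List, Optional, Tuple


def merge_tables(
    simple_toc: List[Tuple[str, int]],
    printed_titles: Dict[str, str],
) -> Tuple[List[Dict[str, Optional[object]]], List[str], List[str]]:
    """Join (law number, stat page) rows with titles from the printed contents.

    Returns the merged rows, the law numbers that have no printed title, and
    the printed entries that have no row in the simple table. The last two
    lists are the candidates for a human proofreader to review.
    """
    merged = []
    missing_title = []
    for law_number, stat_page in simple_toc:
        title = printed_titles.get(law_number)
        if title is None:
            missing_title.append(law_number)
        merged.append({"law": law_number, "page": stat_page, "title": title})
    known = {law_number for law_number, _ in simple_toc}
    missing_row = sorted(set(printed_titles) - known)
    return merged, missing_title, missing_row
```

Any entry that appears in only one source points to an OCR or transcription problem in one of them, which is where a volunteer's attention is most useful.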

The future of legislative data collaboration

Our approach has combined crowdsourcing, manual editing, and automated processes. We’ve received help from a variety of outstanding volunteers. In two months, we have expanded the availability of laws by 50 years and over 15,000 acts, treaties, and international agreements.

Similar approaches should be strongly considered for publishing other historical documents on the Internet. The most obvious example, the elephant in the room, is of course the Congressional Record: published since 1873 but available on the Internet only back to 1994. As software developers, both inside and outside of government, we should be thinking in terms of how crowdsourcing can help us build the necessary document repositories for the 21st century.

Our role, as the Congressional Data Coalition, includes supporting public initiatives that provide improved legislative information for ourselves and the public. Tom Bruce, Director of the Legal Information Institute at Cornell Law School, said it eloquently in his hangout session when he talked about the dream of having an open-access Westlaw or LexisNexis with layered access to information providing legal/legislative services housed under many roofs – a federation of services and data.

We should not shy away from identifying data anomalies and providing corrected data in a fully transparent and constructive way to support the public need for accurate and timely legislative information. It might not seem that having all of these laws as discrete files on the Internet would mean much. We’ve lived without them as discrete electronic files for a long time without any apparent problems. My hope for now is that these documents will extend our electronic legislative library so that our history can be read and referenced over the Internet.

Please consider helping our effort and volunteering along with us.

Special thanks to Owen Ambur, Daniel Schuman, Sara S. Frug, Joe Jerome, and Matt Steinberg.


  1. So… you’re going to recreate Wikisource? Because Wikisource could sure use your support.

    The Statutes at Large (from front to back, not just the public laws, as with the Constitution Society’s scans) have been on Wikisource and Wikimedia Commons since 2011. (Sans volumes 47-64, about years 1934-1950.) They were contributed by Prof. Timothy K. Armstrong from the University of Cincinnati College of Law, who first scanned them in 2008. The Wikisource community has also taken it upon itself to transcribe the documents into text, as opposed to pictures of text, and has digitized a significant number of their indexes.

    They most certainly have been “broken down into individual laws, treaties, Presidential proclamations, etc.” before now, even if you never scrolled down to the ridiculously inconspicuous “Wikisource has original text related to this article” link on the corresponding Wikipedia page (given the amount of time it took us to manually transcribe them).

    It’s sad that you never noticed this, but it’s even sadder that GPO got money to recreate what Wikipedians have been doing for half a decade.

  2. Actually, Prof. Armstrong did scan volumes 47-64 and post them on his university website; they just haven’t been uploaded to Wikimedia Commons, OCR’d, and indexed like the other volumes.

  3. Oops, no, he may not have. Last I checked with him about it (maybe 3 years ago, or thereabouts), he had only partially scanned them, but had gotten caught up with DJVU conversion details or something.

    In any event, I look forward to merging your work (or at least what little won’t duplicate what we’ve already done) into Wikipedia and Wikisource.

    I will try and make sure Prof. Armstrong knows about this.


    The corrected OCR is on Wikisource. Red means it hasn’t been marked as proofread, yellow means it’s been proofread by at least one person, and green means it’s been proofread by at least two editors.


    For the scanned volumes. Some have been transcribed but not pushed into the main list, while others are in the middle of transcription (proofreading the OCR’d text).