The Rare Adventist Book (“Text” or “Corpora”) Project is a scholarly effort that uses automation and volunteer transcriptionists to reproduce texts through an exhaustive Quality Management (QM) process. These texts are then published through repositories at Harvard Dataverse and the Open Science Framework (OSF).
Quality Management Process
The goal for accuracy is complete parity with the original source, with full public access as a further guarantor of quality.
Once provenance is established for a qualifying digital corpus, the OCR text is cleansed of anomalies (UTF-8 characters, ligatures, control sequences), normalized, and then corrected by hand. Comparing the two texts (OCR and corrected) yields a changeset, which lists by line number the corrections to be made.
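As an illustration only, the cleansing and changeset steps described above could be sketched in Python as follows. The function names `cleanse` and `changeset` are hypothetical, and the use of NFKC normalization and the standard `difflib` module is an assumption made for this sketch, not a description of the project's actual tooling.

```python
import difflib
import unicodedata


def cleanse(text: str) -> str:
    """Fold ligatures, drop control characters, and trim trailing whitespace."""
    # NFKC folds compatibility characters such as the "fi" ligature (U+FB01)
    # into their plain equivalents ("fi").
    text = unicodedata.normalize("NFKC", text)
    # Remove control characters (Unicode category Cc) except newline and tab.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Trim trailing whitespace on every line.
    return "\n".join(line.rstrip() for line in text.split("\n"))


def changeset(ocr: str, corrected: str) -> list:
    """Return (ocr_line_number, ocr_lines, corrected_lines) for each difference."""
    a = ocr.split("\n")
    b = corrected.split("\n")
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Line numbers are 1-based, matching how a reader counts lines.
            changes.append((i1 + 1, a[i1:i2], b[j1:j2]))
    return changes
```

In this sketch an empty list returned by `changeset` corresponds to the point at which no corrections remain.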
These steps are repeated until no corrections remain in the changeset. An additional Quality Management checklist is then employed to establish word count, paragraph count, and many other data points through procedural (visual/manual) steps to ensure an accurate text.
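The checklist itself is carried out by hand, but data points such as word count and paragraph count could be cross-checked with a few lines of code. The sketch below is hypothetical: it assumes paragraphs are separated by blank lines and words by whitespace, which may not match the project's own definitions.

```python
import re


def text_stats(text: str) -> dict:
    """Count words and paragraphs under simple, assumed definitions."""
    # Assumption: a paragraph is a run of non-blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Assumption: a word is any whitespace-delimited token.
    words = text.split()
    return {"words": len(words), "paragraphs": len(paragraphs)}
```

Comparing these figures for the source and the corrected text gives a quick signal when a paragraph has been dropped or merged.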
The corrected text of the title is then published and assigned a SHA-1 checksum (a commit hash), which serves as the identifier for the corpus. The resulting files also include an MD5 hash for verification.
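A reader who downloads a file can check it against its published MD5 hash. The following is a minimal sketch using Python's standard `hashlib`; the function names are hypothetical, and how the project publishes its hash values is not specified here. Note that a commit hash is computed by the version control system over the commit as a whole, so it cannot be reproduced by hashing a single file's bytes in this way.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of the given bytes as lowercase hex."""
    return hashlib.md5(data).hexdigest()


def verify(data: bytes, published_md5: str) -> bool:
    """Compare the file's digest with a published value, ignoring case and whitespace."""
    return md5_hex(data) == published_md5.strip().lower()
```

For a file on disk, the bytes would be read in binary mode (for example `open(path, "rb").read()`) before being passed to `verify`.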
Although now verified, the text remains available for further inspection or revision in case a better copy of the document is found or an error in copying has taken place. Thus the final guarantor of quality is the interested public, who are free to verify and contribute using the sources provided by the project or their own. Anyone may request an update (pull request) from that point, and any changes accepted by our transcriptionists are incorporated into the latest revision.
The project is a volunteer community effort and does not rely on formal funding, although we do accept donations for our book acquisition fund. This is because in many instances we need to purchase rare titles from the market.
Rare Book & Document Donations
The Rare Adventist Book Project is always eager to receive rare original books and manuscripts. If you have one, please let us know. We have a non-destructive scanning process, and can acquire the text without damage. Any donated item can be returned to you unharmed.
Please email us for more information: firstname.lastname@example.org
Open Science Framework
Open Science Framework (OSF) is the staging and repository area where completed texts are posted, listed by author. The public may review the text for a specific book, see the original sources and information about their provenance (where the book came from), view the diff files, and see how the completed text was validated for accuracy. This replication data is provided for validation and transparency.
Because we enter and share our files in the OSF repository, the public may also benefit from our ongoing work. Once files are made public, they are in the public domain and can be viewed, downloaded, and used for any purpose. We would appreciate (but do not require) a citation of the project if you publish from it, so that readers know you have used a reliable text. Citation information is available at OSF, which is the assigner of our DOI. Original page numbers are omitted from public files so as not to interfere with Natural Language Processing and statistical analysis.
Completed texts are also stored at our Harvard Dataverse repository in a slightly different format. These are assembled in a way appropriate for data scientists and scholars for tagging and analysis as Natural Language Processing (NLP) Corpora.