Current Version: 0.6 (The Man Who Couldn't Cry)
This is snatch, the page image grabbing tool. It can be used to download page images from various academic and archival websites.
Important: This script is not intended to be used for copyright infringement or other illegal purposes. It is designed for downloading of public domain materials or copyrighted materials for fair use purposes only. By running this script, you indemnify the author(s), maintainer(s), and/or distributor(s) against any and all possible liabilities.
There are one to three steps to installing snatch:
First things first! If you use guiprep or guiguts, then you already have perl installed. You just need to make sure perl is in your system's PATH.
*nix: I will assume you know how to install perl (Debian: `apt-get install perl`; Gentoo: `emerge perl`; etc.).
Windows: You basically have two options:
Mac: Please refer to the following thread at DP: http://www.pgdp.net/phpBB2/viewtopic.php?t=3777
Download the latest release.
To unzip snatch, you need an unzip program. The DP forums contain some good suggestions:
Unzip the latest release to the directory of your choice (I use ~/dp/snatch/ under linux and C:\temp\dp\snatch\ under Windows).
Now, go to the command line. (Under Windows go to Start->Run and type "cmd" [without quotes] in the "Open" field then hit Enter.) Navigate to the directory where you put snatch, and run the program as you desire (see the Examples section for, well, examples...).
Snatch now reads a config file to set some default settings. Included in the distribution is a file called "config.example.rc". If installing for the first time, you should copy (or rename) "config.example.rc" to "config.rc". For most users, that is all you will need to do. However, if you want to tweak default snatch performance, you can mess with the settings as detailed below.
It is important to remember when modifying the config.rc to include double quotation marks (") around each value and put a semi-colon (;) at the end.
| name | variable | example values | description |
|---|---|---|---|
| User Agent | $config{'useragent'} | Mozilla/6.0 | Some sites will only allow certain "user agents" (browsers and other programs that access the web) to access their site. If for some reason the default user agent isn't working, or if you prefer to use a different user agent for another reason, uncomment this option and change as desired. As of version 0.5, if this option is not set, snatch will choose a random user agent from the list in the uagents file. As of version 0.5.2, the user agent can be set by the -ua option, which overrides this setting. |
| HTTP Proxy | $config{'proxy'} | http://proxy.myserver.com/ or http://username:password@myproxy.myserver.com:8080/ | This is commented out (disabled) by default, but if you are trying to download through a proxy, just uncomment the variable and put in your proxy server address. Proxy support is provided by the LWP::UserAgent perl module. Proxy environment variables are currently not supported. As of version 0.5.2, the -p flag overrides this option. Using "-p off" on the command line will disable proxy support altogether. |
| Download Directory | $config{'dir'} | *nix: "~/snatch/dp/"; Windows: "C:\Documents and Settings\username\My Documents\dp\snatch\" | This option sets the default download directory. Please note that this option can be overridden using the "-d" flag on the command line. |
| Default format | $config{'format'} | image | This option defines the default preferred download format. Since the format is heavily site-dependent, this option may be overridden by the module. Each module defines its own default download format. If this option is not set, or if the value of this option is not valid for that module, the module may use its default value type or it may fail to download any files. Refer to the module's documentation in snatch.html for details on default module types. This value may be overridden by the "-f" option on the command line. |
| Verbose Reporting | $config{'verbose'} | 0 or 1 | This option enables or disables verbose reporting. It is usually needed only for reporting bugs or developing new modules (or for insanely watching as each page is downloaded, as I am wont to do from time to time). |
| Renumber Files | $config{'renumber'} | 0 or 1 | This option enables renumbering of downloaded files. Please note that modules may choose to override this option for technical reasons. Refer to the documentation for details on which modules override this option. Setting this option to "1" is equivalent to using the "-r" option on the command line. |
| Wait Between Files | $config{'wait'} | [integer] | Sometimes it may be useful (and even courteous) to pause for a few seconds (or longer) between downloading files. If this option is set, snatch will wait the specified number of seconds between each file. This setting can be overridden using the "-w" option on the command line. |
| Cookie File (0.5 and later) | $config{'cookiefile'} | cookies.txt | Experimental! Use this file to store cookies. |
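To make the quoting rules concrete, here is a minimal sketch of what a config.rc might contain. The values are illustrative only; the authoritative template is the bundled config.example.rc.

```
# config.rc sketch -- copy config.example.rc and edit; values below are examples.
# Remember: double quotation marks around each value, semi-colon at the end.
$config{'useragent'} = "Mozilla/6.0";  # omit to get a random entry from uagents
# $config{'proxy'}   = "http://username:password@myproxy.myserver.com:8080/";
$config{'dir'}       = "~/snatch/dp/"; # default download directory
$config{'format'}    = "image";        # preferred download format
$config{'verbose'}   = "0";            # 1 enables verbose reporting
$config{'renumber'}  = "0";            # 1 renumbers downloaded files
$config{'wait'}      = "2";            # seconds to pause between files
```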
*nix:

snatch [options] site id

Windows (using a DOS command prompt):

perl snatch [options] site id
```
Not enough arguments
Usage: snatch.pl [options] site id
  -d dir         Save to directory "dir" (create dir if it doesn't exist).
  -f format      Download format. Valid values are "pdf" or "image" (not all
                 sites support downloading of both formats).
  -l             Generate a list of available modules and exit. (Other options
                 ignored)
  -i start       Begin downloading at "start" (integer) page.
  -h             Print help and exit (other options ignored)
  -o offset      End downloading at "start" + "offset" (see -i option). If not
                 set, the selected module will determine the offset equal to
                 the total number of pages in the book.
  -p proxy       Set the proxy string to "proxy". The string "off" turns the
                 proxy feature off (e.g., if it's set in your config.rc file)
  -q             Quiet mode (verbose reporting OFF).
  -r             Renumber images sequentially. By default, images are saved
                 using the same filename as on the server.
  -u             Don't download pages; instead, print a list of URLs from
                 which the pages may be downloaded (overrides -v). Note that
                 some files may still be downloaded in order to generate
                 the URLs.
  -ua uagent     Set the user agent to "uagent".
  -v             Verbose mode (unless -u is selected).
  -w wait        Number of seconds to wait between the download of each file.
  site           Short form of the site to download from (i.e., which module
                 to use).
  id             Unique ID of which book's images to download.
  --update-cache Update the site cache and exit. (Other options ignored)
```
Austrian Literature Online has a ton of books in German. Since the site is in German, and since I can't read German, I have no idea of the copyright status of these images or the books themselves.
The alo module supports the following download formats:
| -f value | Description |
|---|---|
| image | Pages are returned as one PNG image per page. |
The ID is an integer (1 to 5 characters). To get the ID, navigate to the book you wish to retrieve. The URL will have a portion that looks like:
objid=XXXX
The XXXX is the ID.
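For instance, if the URL contains objid=1234 (a made-up ID), the corresponding command would be:

snatch alo 1234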
Early Canadiana Online (ECO) is a digital library providing access to over 1,496,000 pages of Canada's printed heritage. It features works published from the time of the first European settlers up to the early 20th Century.
DP has been given permission to use Canadiana's page images.
The can module supports the following download formats:
| -f value | Description |
|---|---|
| pdf | Pages are returned as one PDF file per page. |
| image | Pages are returned as one PNG image per page. |
The ID consists of digits and (optionally) underscores. To get the ID, navigate to the book's bibliographic record. The ID is listed as the CIHM number in the record.
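For instance, with the (not necessarily real) ID used in the Examples section below:

snatch can 67584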
The Core Historical Literature of Agriculture is exactly what it sounds like: a repository of literature about agriculture.
The copyright page contains the following text:
As a publicly supported institution, Mann Library generally does not own rights to material in its collections. Therefore it does not charge permission fees for use of such material and cannot give or deny permission to publish or otherwise distribute material in its collections. It is the obligation of the user to determine and satisfy copyright or other use restrictions when publishing or otherwise distributing materials found in the Mann Library collections.
The chla module supports the following download formats:
| -f value | Description |
|---|---|
| pdf | Pages are returned as one PDF file per page. |
| image | Pages are returned as one GIF image per page. |
The collection contains both books and journals. The IDs for each type of work can be found by navigating to the item and finding the portion of the URL that says:
id=XXXXXXX
The "XXXXXXX" is the ID. Journal IDs are in the format "XXXXXXX_XX_XX".
The New York State Historical Literature at Cornell project contains texts about NY state history.
The copyright page contains the following text:
Copyright and other rights in the images, underlying encoded text, selection, indexing, and display of materials in Cornell Digital Library Collections are held by the Cornell University Library to the extent permitted by law. Users should be aware that materials made available in Cornell Digital Library Collections may be subject to additional restrictions. These include but are not limited to the rights of copyright, privacy, and publicity. Such restrictions are likely to be controlled by parties other than the Cornell University Library. Users are solely responsible for determining the existence of such rights, obtaining any permissions, and paying any associated fees required for the proposed use.
This module supports the following download formats:
| -f value | Description |
|---|---|
| image | Pages are returned as one GIF image per page. |
The IDs for each type of work can be found by navigating to the item and finding the portion of the URL that says:
did=nys###
The ### is the ID (do not include the "nys" as part of the ID).
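The short form for this module isn't named here; run snatch -l to list the available modules. With the short form in hand and a made-up ID of 123, the command follows the usual pattern:

snatch <module> 123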
The HEARTH project is a repository of texts relating to home economics.
The copyright page contains the following text:
As a publicly supported institution, Mann Library generally does not own rights to material in its collections. Therefore it does not charge permission fees for use of such material and cannot give or deny permission to publish or otherwise distribute material in its collections. It is the obligation of the user to determine and satisfy copyright or other use restrictions when publishing or otherwise distributing materials found in the Mann Library collections.
This module supports the following download formats:
| -f value | Description |
|---|---|
| pdf | Pages are returned as one PDF file per page. |
| image | Pages are returned as one GIF image per page. |
The collection contains both books and journals. The IDs for each type of work can be found by navigating to the item and finding the portion of the URL that says:
id=XXXXXXX
The "XXXXXXX" is the ID. Journal IDs are in the format "XXXXXXX_XX_XX".
The Hockliffe Project has been designed to promote the study of early British children's literature. It will provide internet access to the full texts of the Hockliffe Collection of Early Children's Books, owned by De Montfort University, and will accompany this archive with contextualising documents and research. The aim is to work towards a reevaluation of children's literature in its own infancy, and to let these rich and varied books speak for themselves.
It is unclear as to whether or not their mechanically reproduced page images are eligible for copyright protection.
The hock module supports the following download formats:
| -f value | Description |
|---|---|
| image | Pages are returned as one JPEG image per page. |
This one is easy. The ID is a four-digit catalog number (####). To obtain the ID, just go to the project and browse to the book you want. The catalog number is in the far left column on the browse page. Note that only books marked as having images can be downloaded.
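For instance, using the same (not necessarily real) catalog number as the Examples section below:

snatch hock 0123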
The Kentuckiana Digital Library is part of the Kentucky Virtual University and contains many items about Kentucky history.
The copyright page contains the following text:
Many items offered by the Kentuckiana Digital Library may be protected by the U.S. Copyright Law (Title 17, U.S.C.). Some items may have restrictions imposed by the copyright holder or the repository owning the physical items. The holding repositories have made best efforts to identify the copyright status for online items. This information is offered as a service to the general public in determining the proper use of an item and is found in collection finding aids and/or upon inquiry to the holding repository. However, it is always the user's responsibility to determine copyright restrictions and obtain the permission of the copyright holder.
The kdl module supports the following download formats:
| -f value | Description |
|---|---|
| pdf | Pages are returned as one PDF file per page. |
| image | Pages are returned as one GIF image per page. |
The ID for each work can be found by navigating to the item and finding the portion of the URL that says:
id=AXX-XXX-XXXXXXXX
Where "A" is a letter and each "X" is a number.
moamb
Making of America (MOA) is a joint project of Michigan University and Cornell University. This module will retrieve works in the Michigan books collection.
MOA allows downloading and storing of images for personal use, but according to them, one must request permission before redistributing their images. It is unclear as to whether or not their mechanically reproduced page images are eligible for copyright protection.
The moamb module supports the following download formats:
| -f value | Description |
|---|---|
| pdf | Pages are returned as one PDF file per page. |
| image | Pages are returned as one GIF image per page. |
Note: When downloading images, each image is generated on the MOA server real-time. The script must access each page individually (i.e., send separate HTTP requests) first to generate the image, and then to download the generated image.
This moamb ID number is in the form 'XXX####.####.###' (without quotes), where X is a letter and # is a number. For example, the unique ID for James Fenimore Cooper's novel 'The Last of the Mohicans' is 'ABB2610.0001.001'.
You can find the ID number by navigating to the book you want to retrieve and then copying the link address (URL) from that book. In the URL, there is a portion that looks like:
idno=
The portion following that, and continuing to the next ampersand (&), is the ID.
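For example, to download the Cooper novel mentioned above:

snatch moamb ABB2610.0001.001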
The Posner Memorial Collection is one of Carnegie-Mellon University Library's special collections, consisting of rare and interesting books acquired by Henry Posner, Sr.
It is not clear whether or not Carnegie-Mellon University claims copyright on the scans. However, there appear to be quite a few books that were published during or after 1923, which means they might still be under copyright. The following note is posted on the collection's website:
Use of the Posner Collection is intended for educational purposes only. Users are warned that copyright laws may restrict the use of these images. Permissions for commercial use or publication should be obtained from the copyright holders.
The posner module supports the following download formats:
| -f value | Description |
|---|---|
| image | Pages are returned as one JPEG image per page. |
The ID is the call number of the book. To get the ID, browse to the desired book and then find the portion of the URL that has "call=XXX_XXXX". For example:
http://posner.library.cmu.edu/Posner/books/book.cgi?call=220_H31F
The ID for this book is 220_H31F.
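The book above could thus be downloaded with:

snatch posner 220_H31F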
Our Roots features books that highlight Canada's local history.
The roots module supports the following download formats:
| -f value | Description |
|---|---|
| image | Pages are returned as one JPEG image per page. |
The ID is an integer (4 characters). To get the ID, navigate to the table of contents page for the book you wish to retrieve. The URL will have a portion that looks like:
id=XXXX
The XXXX is the ID. Note that if you are viewing a specific page of a book, you will see a 6-digit number:
ID=XXXXXX
This is the ID for the page, not the book; return to the Table of Contents to find the correct number.
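For example, with a made-up four-character ID:

snatch roots 1234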
The Schoenberg Center for Electronic Text & Image has a lot of cool stuff.
I found no explicit notices of copyright on the site. It is unclear whether their mechanically reproduced page images are eligible for copyright.
The sceti module supports the following download formats:
| -f value | Description |
|---|---|
| image | Pages are returned as one JPEG image per page. |
To get the ID, browse to the desired book and then find the portion of the URL that has "textID=XXX_XXXX". For example:
http://dewey.library.upenn.edu/sceti/printedbooksNew/index.cfm?textID=B5083
The ID for this book is B5083.
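Assuming sceti is this module's short form (confirm with snatch -l), the book above could then be fetched with:

snatch sceti B5083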
The University of Michigan Historical Math Collection has a bunch of math books.
The ummath module supports the following download formats:
| -f value | Description |
|---|---|
| pdf | Pages are returned as one PDF file per page. |
| image | Pages are returned as one GIF image per page. |
Note: When downloading images, each image is generated on the server real-time. The script must access each page individually (i.e., send separate HTTP requests) first to generate the image, and then to download the generated image. When downloading PDF files, there is no need to send these extra HTTP requests.
The ummath ID number is in the form 'XXX####.####.###' (without quotes), where X is a letter and # is a number.
You can find the ID number by navigating to the book you want to retrieve and then copying the link address (URL) from that book. In the URL, there is a portion that looks like:
idno=
The portion following that, and continuing to the next ampersand (&), is the ID.
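Because PDF downloads skip those extra image-generation requests, you might prefer PDFs here; with a made-up ID of the form above:

snatch -f pdf ummath ABC1234.0001.001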
Historic Pittsburgh is a collection of historical texts published in the 19th and 20th centuries about the city of Pittsburgh. The collection is owned by the University of Pittsburgh.
The copyright page contains the following text:
Users of the Historic Pittsburgh website do not need to seek permission for downloading images for private or educational use. However, the University of Pittsburgh does retain the rights to the digital images available on this website.
This module supports the following download formats:
| -f value | Description |
|---|---|
| pdf | Pages are returned as one PDF file per page. |
| image | Pages are returned as one GIF image per page. |
The ID for each work can be found by navigating to the item and finding the portion of the URL that says:
id=AXX-XXX-XXXXXXXX
Where "A" is a letter and each "X" is a number.
philamer
The United States and its Territories "comprises the full text of monographs and government documents published in the United States, Spain, and the Philippines between 1870 and 1925."
The philamer module supports the following download formats:
| -f value | Description |
|---|---|
| pdf | Pages are returned as one PDF file per page. |
| image | Pages are returned as one GIF image per page. |
Note: When downloading images, each image is generated on the server real-time. The script must access each page individually (i.e., send separate HTTP requests) first to generate the image, and then to download the generated image.
The philamer ID number is in the form 'XXX####.####.###' (without quotes), where X is a letter and # is a number.
You can find the ID number by navigating to the book you want to retrieve and then copying the link address (URL) from that book. In the URL, there is a portion that looks like:
idno=
The portion following that, and continuing to the next ampersand (&), is the ID.
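For example, with a made-up ID of that form:

snatch philamer ABC1234.0001.001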
Wright American Fiction (WAF) is a collection of 19th century American fiction, as listed in Lyle Wright's bibliography American Fiction, 1851-1875. WAF is hosted by the Indiana University Digital Library Program.
WAF does not appear to have a copyright policy on its pages; however, the Digital Library Program at Indiana University has the following statement:
The university is currently seeking means to clarify the rights of use of many materials accessible on its Web pages. Unless rights of use are clearly stated with respect to an individual item, users must seek permission from the copyright owner for all uses that are allowed by fair use and other provisions of the U.S. Copyright Act. If you need assistance with identifying or locating the copyright owner of a work, please contact the owner of the page from which you linked to this statement.
It is unclear as to whether or not their mechanically reproduced page images are eligible for copyright protection.
The wright module supports the following download formats:
| -f value | Description |
|---|---|
| pdf | Pages are returned as one PDF file per page. |
| image | Pages are returned as one GIF image per page. |
Note: When downloading images, each image is generated on the WAF server real-time. The script must access each page individually (i.e., send separate HTTP requests) first to generate the image, and then to download the generated image. When downloading PDF files, there is no need to send these extra HTTP requests.
This wright ID number is in the form 'Wright2-####' (without quotes), where # is a number. For example, the unique ID for Herman Melville's book of short stories 'The Piazza Tales' is 'Wright2-1702'.
You can find the ID number by navigating to the book you want to retrieve and then copying the link address (URL) from that book. In the URL, there is a portion that looks like:
idno=
The portion following that, and continuing to the next ampersand (&), is the ID.
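For example, to download the Melville title mentioned above:

snatch wright Wright2-1702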
Here are some examples of how to use snatch – from basic to complex uses that you may never even care about or want to use. Please note that snatch requires perl to be installed.
Note: Most of these examples do not use real IDs.
Note for Windows users: You must run snatch from a DOS command prompt. Also, you will have to begin each command with "perl ".
The following command will download all the pages of a book from MOA - Michigan Books to the current directory.
snatch moamb AAA1234.0001.001
Using the -v flag will cause the script to update you on the progress of the download. The following command will download all the pages of a book from MOA - Michigan Books to the current directory.
snatch -v moamb AAA1234.0001.001
Some modules allow downloading pages in different formats. To download pages in a different format, use the -f flag. Check individual module documentation for supported formats. If an unsupported format is used, the value of the -f flag is ignored and the default format used.
The following command will download all the pages of a book from Wright American Fiction as PDF files.
snatch -f pdf wright Wright2-0987
The -d flag can be used to download files to a directory other than the current directory. The following example will download all the pages from a book at The Hockliffe Project to a subdirectory in the current directory called "hockliffe".
snatch -d hockliffe hock 0123
Note for Windows Users: By default, perl uses Unix-style directory format (e.g., /some/directory/path/); however, if you are more comfortable using a DOS-style directory path, you can. For example, if you wanted to save your images in C:\Hockliffe, you can use the following command.
snatch -d C:\Hockliffe hock 0123
You can set an offset of the number of pages you want to download using the -o flag. If the value of the -o flag is greater than the total number of pages in the book being downloaded, the script will stop once all pages have been downloaded. The following command will download the first 10 pages of a book from Wright American Fiction.
snatch -o 10 wright Wright2-4321
The -i flag will let you begin downloading pages at any page in the book. If the value of the -i flag is greater than the total number of the pages in the book, no pages will be downloaded. The following command will begin downloading at page 10 of a book from the University of Georgia.
snatch -i 10 uga abcd
Using the -i and -o flags in conjunction can be a handy device. For example, if you wanted to download a single article from a journal at MOA - Cornell Journals, and you know the article is on pages 11 through 20, you could use the following command.
snatch -i 11 -o 10 moacj harp0000
Literally, this command says, "Begin downloading at page 11 and continue for 10 pages." The tenth page would be page 20.
Not every site stores their page images in a sequential format. Using the -r flag, you can have the pages automatically renumbered (and padded) to 8 characters (plus extension). The following command will download all the pages from a book hosted by Early Canadiana Online and renumber the pages.
snatch -r can 67584
There are a couple of reasons you might want to wait between downloading files: to give your computer time to do other things, to open up your internet connection, or to be nice to the server you are downloading from by not bombarding them with multiple requests one right after another. The following command will cause the script to pause for 5 seconds before downloading each page image from a book at MOA - Michigan Books.
snatch -w 5 moamb ZZZ0987.0001.001
You may not want to actually download the page images using snatch. If instead you would rather just get a list of URLs that can be passed to another program – such as wget – you can use the -u flag. The following command will generate a list of URLs for a book at the Universal Library Scanserver.
snatch -u ulscan book0
Warning: The -u flag will override the -v flag (verbose reporting).
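To hand the list to another downloader such as wget, one approach (a sketch, assuming the URL list is printed to standard output) is to redirect it to a file and have wget read that file back:

snatch -u ulscan book0 > urls.txt

wget -i urls.txt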
Sometimes a person might need to extract an image from a PDF file. It's not very hard, but finding the right tools to do it can be tough.
There are some commercial tools available (for Windows anyway), but since both I and the people whom this tutorial is intended to help are poor, this tutorial will focus on free tools. (Incidentally, if you do have money to burn, I would recommend using PDF Extract TIFF.)
This tutorial assumes the following things:
Here are the tools we will use:
Please make sure pdfimages.exe is saved somewhere in your path (e.g., C:\Windows\system32) and that Irfanview is installed before continuing.
These steps assume the PDF files are in a directory called C:\pdfs and that the PDFs are named something like 0001.pdf, 0002.pdf, 0003.pdf, etc.
Open up your DOS prompt (Start->Run, type in "cmd" or "command" and then "Run")

Navigate to C:\pdfs
C:\> cd C:\pdfs
Extract the images from the PDF files:
To extract from just one file, run the following command:
C:\pdfs> pdfimages 0001.pdf 0001
This will extract the image and name it 0001-000.pbm. If there are multiple images, they will be sequential like 0001-000.pbm, 0001-001.pbm, 0001-002.pbm, etc.
To extract from multiple files, run the following command:
C:\pdfs> for %f in (*.pdf) do pdfimages %f %f
This will extract all the images in each of the files and name them 0001.pdf-000.pbm, 0002.pdf-000.pbm, 0003.pdf-000.pbm, etc.
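If you would rather the ".pdf" not be embedded in the image names, cmd can strip the extension with the %~nf modifier (a variant of the command above; if you put it in a batch file, double the percent signs, i.e. %%f and %%~nf):

C:\pdfs> for %f in (*.pdf) do pdfimages %f %~nf

This yields 0001-000.pbm, 0002-000.pbm, etc., matching the single-file naming shown earlier.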
At this point, your image files are extracted as PBM (Portable Bitmap) files. If these are ok for your purposes, then you are done; otherwise, continue on to convert the files to TIFF.
Before we go on, exit from DOS.
C:\pdfs> exit
To convert the PBM files to TIFF files, we will use Irfanview. Open one of the PBM files in Irfanview (doesn't matter which one) and then from the File menu select Batch Conversion/Rename

Where it says "Files of type:" select "PBM/PGM/PPM - Portable Bitmap"

On the left, select "Add all". A list of PBM files will appear

Under "Batch conversion settings:" check the "Use advanced options" checkbox and then click on "Set advanced options".

Under the "Set advanced options" dialog box, make the following selections:

Under "Batch conversion settings:" again, make sure the "Output format:" is set to "TIF - Tagged Image File Format".

On the left side again, select "Start".

A "Converting images" dialog box will open. Once the images are done converting, click "Exit".

That's it! You've got your TIFF files, nicely extracted. Do what you will with them.
Irfanview has a lot of options for converting image files. You may find that you want to use some other file format, you might want to make thumbnails, etc. Play around — have fun!
Coming soon...
The most prevalent plan for this script is to keep adding modules to support various archive sites. Some sites aren't easily snatch-able, but I will happily consider supporting any archive.
I am also working on documentation for anybody who wishes to submit their own site module.
My current development plans are below.
If you would like to see any other sites added to snatch, please let me know in the Distributed Proofreaders forum thread for this script: http://www.pgdp.net/phpBB2/viewtopic.php?t=4089
The main code of snatch is nigh complete (I think!), but there are a couple of new features I would like to implement. The following features may or may not be implemented in an upcoming release:
perl snatch --update-cache

The following people have provided code and/or ideas to help the development of snatch. Thanks a bundle!