Current Version: 0.6 (The Man Who Couldn't Cry)
This is snatch, the page image grabbing tool. It can be used to download page images from various academic and archival websites.
Important: This script is not intended to be used for copyright infringement or other illegal purposes. It is designed for downloading of public domain materials or copyrighted materials for fair use purposes only. By running this script, you indemnify the author(s), maintainer(s), and/or distributor(s) against any and all possible liabilities.
There are one to three steps to installing snatch:
First things first! If you use guiprep or guiguts, then you already have perl installed. You just need to make sure perl is in your system's PATH.
*nix: I will assume you know how to install perl (Debian: `apt-get install perl`; Gentoo: `emerge perl`; etc.).
Windows: You basically have two options:
Mac: Please refer to the following thread at DP: http://www.pgdp.net/phpBB2/viewtopic.php?t=3777
Download the latest release.
To unzip snatch, you need an unzip program. The DP forums contain some good suggestions:
Unzip the latest release to the directory of your choice (I use ~/dp/snatch/ under linux and C:\temp\dp\snatch\ under Windows).
Now, go to the command line. (Under Windows go to Start->Run and type "cmd" [without quotes] in the "Open" field then hit Enter.) Navigate to the directory where you put snatch, and run the program as you desire (see the Examples section for, well, examples...).
Snatch now reads a config file to set some default settings. Included in the distribution is a file called "config.example.rc". If installing for the first time, you should copy (or rename) "config.example.rc" to "config.rc". For most users, that is all you will need to do. However, if you want to tweak default snatch performance, you can mess with the settings as detailed below.
It is important to remember when modifying the config.rc to include double quotation marks (") around each value and put a semi-colon (;) at the end.
| name | variable | example values | description |
|---|---|---|---|
| User Agent | $config{'useragent'} | Mozilla/6.0 | Some sites will only allow certain "user agents" (browsers and other programs that access the web) to access their site. If for some reason the default user agent isn't working, or if you prefer to use a different user agent for another reason, uncomment this option and change as desired. As of version 0.5, if this option is not set, snatch will choose a random user agent from the list in the uagents file. As of version 0.5.2, the user agent can be set by the -ua option, which overrides this setting. |
| HTTP Proxy | $config{'proxy'} | http://proxy.myserver.com/ or http://username:password@myproxy.myserver.com:8080/ | This is commented out (disabled) by default, but if you are trying to download through a proxy, just uncomment the variable and put in your proxy server address. Proxy support is provided by the LWP::UserAgent perl module. Proxy environment variables are currently not supported. As of version 0.5.2, the -p flag overrides this option. Using "-p off" on the command line will disable proxy support altogether. |
| Download Directory | $config{'dir'} | *nix: "~/snatch/dp/"; Windows: "C:\Documents and Settings\username\My Documents\dp\snatch\" | This option sets the default download directory. Please note that this option can be overridden using the "-d" flag on the command line. |
| Default format | $config{'format'} | image | This option defines the default preferred download format. Since the format is heavily site-dependent, this option may be overridden by the module. Each module defines its own default download format. If this option is not set, or if the value of this option is not valid for that module, the module may use its default value type or it may fail to download any files. Refer to the module's documentation in snatch.html for details on default module types. This value may be overridden by the "-f" option on the command line. |
| Verbose Reporting | $config{'verbose'} | 0 or 1 | This option enables or disables verbose reporting. It is usually needed only for reporting bugs or developing new modules (or for insanely watching as each page is downloaded, as I am wont to do from time to time). |
| Renumber Files | $config{'renumber'} | 0 or 1 | This option enables renumbering of downloaded files. Please note that modules may choose to override this option for technical reasons. Refer to the documentation for details on which modules override this option. Setting this option to "1" is equivalent to using the "-r" option on the command line. |
| Wait Between Files | $config{'wait'} | [integer] | Sometimes it may be useful (and even courteous) to pause for a few seconds (or longer) between downloading files. If this option is set, snatch will wait the specified number of seconds between each file. This setting can be overridden using the "-w" option on the command line. |
| Cookie File (0.5 and later) | $config{'cookiefile'} | cookies.txt | Experimental! Use this file to store cookies. |
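To make the quoting rules concrete, here is a minimal sketch of what a config.rc might contain. The values are illustrative only; the authoritative template is the bundled config.example.rc.

```
# config.rc sketch -- copy config.example.rc and edit; values below are examples.
# Remember: double quotation marks around each value, semi-colon at the end.
$config{'useragent'} = "Mozilla/6.0";  # omit to get a random entry from uagents
# $config{'proxy'}   = "http://username:password@myproxy.myserver.com:8080/";
$config{'dir'}       = "~/snatch/dp/"; # default download directory
$config{'format'}    = "image";        # preferred download format
$config{'verbose'}   = "0";            # 1 enables verbose reporting
$config{'renumber'}  = "0";            # 1 renumbers downloaded files
$config{'wait'}      = "2";            # seconds to pause between files
```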
*nix:

snatch [options] site id

Windows (using a DOS command prompt):

perl snatch [options] site id
```
Not enough arguments
Usage: snatch.pl [options] site id
  -d dir         Save to directory "dir" (create dir if it doesn't exist).
  -f format      Download format. Valid values are "pdf" or "image" (not all
                 sites support downloading of both formats).
  -l             Generate a list of available modules and exit. (Other options
                 ignored)
  -i start       Begin downloading at "start" (integer) page.
  -h             Print help and exit (other options ignored)
  -o offset      End downloading at "start" + "offset" (see -i option). If not
                 set, the selected module will determine the offset equal to
                 the total number of pages in the book.
  -p proxy       Set the proxy string to "proxy". The string "off" turns the
                 proxy feature off (e.g., if it's set in your config.rc file)
  -q             Quiet mode (verbose reporting OFF).
  -r             Renumber images sequentially. By default, images are saved
                 using the same filename as on the server.
  -u             Don't download pages; instead, print a list of URLs from
                 which the pages may be downloaded (overrides -v). Note that
                 some files may still be downloaded in order to generate
                 the URLs.
  -ua uagent     Set the user agent to "uagent".
  -v             Verbose mode (unless -u is selected).
  -w wait        Number of seconds to wait between the download of each file.
  site           Short form of the site to download from (i.e., which module
                 to use).
  id             Unique ID of which book's images to download.
  --update-cache Update the site cache and exit. (Other options ignored)
```
Austrian Literature Online has a ton of books in German. Since the site is in German, and since I can't read German, I have no idea of the copyright status of these images or the books themselves.
The alo module supports the following download formats:
| -f value | Description |
|---|---|
| image | Pages are returned as one PNG image per page. |
The ID is an integer (1 to 5 characters). To get the ID, navigate to the book you wish to retrieve. The URL will have a portion that looks like:
objid=XXXX
The XXXX is the ID.
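For instance, if the URL contains objid=1234 (a made-up ID), the corresponding command would be:

snatch alo 1234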
Early Canadiana Online (ECO) is a digital library providing access to over 1,496,000 pages of Canada's printed heritage. It features works published from the time of the first European settlers up to the early 20th Century.
DP has been given permission to use Canadiana's page images.
The can module supports the following download formats:
| -f value | Description |
|---|---|
| pdf | Pages are returned as one PDF file per page. |
| image | Pages are returned as one PNG image per page. |
The ID consists of digits and (optionally) underscores. To get the ID, navigate to the book's bibliographic record. The ID is listed as the CIHM number in the record.
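For instance, with the (not necessarily real) ID used in the Examples section below:

snatch can 67584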
The Core Historical Literature of Agriculture is exactly what it sounds like: a repository of literature about agriculture.
The copyright page contains the following text:
As a publicly supported institution, Mann Library generally does not own rights to material in its collections. Therefore it does not charge permission fees for use of such material and cannot give or deny permission to publish or otherwise distribute material in its collections. It is the obligation of the user to determine and satisfy copyright or other use restrictions when publishing or otherwise distributing materials found in the Mann Library collections.
The chla module supports the following download formats:
| -f value | Description |
|---|---|
| pdf | Pages are returned as one PDF file per page. |
| image | Pages are returned as one GIF image per page. |
The collection contains both books and journals. The IDs for each type of work can be found by navigating to the item and finding the portion of the URL that says:
id=XXXXXXX
The "XXXXXXX" is the ID. Journal IDs are in the format "XXXXXXX_XX_XX".
The New York State Historical Literature at Cornell project contains texts about NY state history.
The copyright page contains the following text:
Copyright and other rights in the images, underlying encoded text, selection, indexing, and display of materials in Cornell Digital Library Collections are held by the Cornell University Library to the extent permitted by law. Users should be aware that materials made available in Cornell Digital Library Collections may be subject to additional restrictions. These include but are not limited to the rights of copyright, privacy, and publicity. Such restrictions are likely to be controlled by parties other than the Cornell University Library. Users are solely responsible for determining the existence of such rights, obtaining any permissions, and paying any associated fees required for the proposed use.
This module supports the following download formats:
| -f value | Description |
|---|---|
| image | Pages are returned as one GIF image per page. |
The IDs for each type of work can be found by navigating to the item and finding the portion of the URL that says:
did=nys###
The ### is the ID (do not include the "nys" as part of the ID).
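The short form for this module isn't named here; run snatch -l to list the available modules. With the short form in hand and a made-up ID of 123, the command follows the usual pattern:

snatch <module> 123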
The HEARTH project is a repository of texts relating to home economics.
The copyright page contains the following text:
As a publicly supported institution, Mann Library generally does not own rights to material in its collections. Therefore it does not charge permission fees for use of such material and cannot give or deny permission to publish or otherwise distribute material in its collections. It is the obligation of the user to determine and satisfy copyright or other use restrictions when publishing or otherwise distributing materials found in the Mann Library collections.
This module supports the following download formats:
| -f value | Description |
|---|---|
| pdf | Pages are returned as one PDF file per page. |
| image | Pages are returned as one GIF image per page. |
The collection contains both books and journals. The IDs for each type of work can be found by navigating to the item and finding the portion of the URL that says:
id=XXXXXXX
The "XXXXXXX" is the ID. Journal IDs are in the format "XXXXXXX_XX_XX".
The Hockliffe Project has been designed to promote the study of early British children's literature. It will provide internet access to the full texts of the Hockliffe Collection of Early Children's Books, owned by De Montfort University, and will accompany this archive with contextualising documents and research. The aim is to work towards a reevaluation of children's literature in its own infancy, and to let these rich and varied books speak for themselves.
It is unclear as to whether or not their mechanically reproduced page images are eligible for copyright protection.
The hock module supports the following download formats:
| -f value | Description |
|---|---|
| image | Pages are returned as one JPEG image per page. |
This one is easy. The ID is a four-digit catalog number (####). To obtain the ID, just go to the project and browse to the book you want. The catalog number is in the far left column on the browse page. Note that only books marked as having images can be downloaded.
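For instance, using the same (not necessarily real) catalog number as the Examples section below:

snatch hock 0123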
The Kentuckiana Digital Library is part of the Kentucky Virtual University and contains many items about Kentucky history.
The copyright page contains the following text:
Many items offered by the Kentuckiana Digital Library may be protected by the U.S. Copyright Law (Title 17, U.S.C.). Some items may have restrictions imposed by the copyright holder or the repository owning the physical items. The holding repositories have made best efforts to identify the copyright status for online items. This information is offered as a service to the general public in determining the proper use of an item and is found in collection finding aids and/or upon inquiry to the holding repository. However, it is always the user's responsibility to determine copyright restrictions and obtain the permission of the copyright holder.
The kdl module supports the following download formats:
| -f value | Description |
|---|---|
| pdf | Pages are returned as one PDF file per page. |
| image | Pages are returned as one GIF image per page. |
The ID for each work can be found by navigating to the item and finding the portion of the URL that says:
id=AXX-XXX-XXXXXXXX
Where "A" is a letter and each "X" is a number.
moamb
Making of America (MOA) is a joint project of Michigan University and Cornell University. This module will retrieve works in the Michigan books collection.
MOA allows downloading and storing of images for personal use, but according to them, one must request permission before redistributing their images. It is unclear as to whether or not their mechanically reproduced page images are eligible for copyright protection.
The moamb module supports the following download formats:
| -f value | Description |
|---|---|
| pdf | Pages are returned as one PDF file per page. |
| image | Pages are returned as one GIF image per page. |
Note: When downloading images, each image is generated on the MOA server real-time. The script must access each page individually (i.e., send separate HTTP requests) first to generate the image, and then to download the generated image.
This moamb ID number is in the form 'XXX####.####.###' (without quotes), where X is a letter and # is a number. For example, the unique ID for James Fenimore Cooper's novel 'The Last of the Mohicans' is 'ABB2610.0001.001'.
You can find the ID number by navigating to the book you want to retrieve and then copying the link address (URL) from that book. In the URL, there is a portion that looks like:
idno=
The portion following that, and continuing to the next ampersand (&), is the ID.
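For example, to download the Cooper novel mentioned above:

snatch moamb ABB2610.0001.001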
The Posner Memorial Collection is one of Carnegie-Mellon University Library's special collections, consisting of rare and interesting books acquired by Henry Posner, Sr.
It is not clear whether or not Carnegie-Mellon University claims copyright on the scans. However, there appear to be quite a few books that were published during or after 1923, which means they might still be under copyright. The following note is posted on the collection's website:
Use of the Posner Collection is intended for educational purposes only. Users are warned that copyright laws may restrict the use of these images. Permissions for commercial use or publication should be obtained from the copyright holders.
The posner module supports the following download formats:
| -f value | Description |
|---|---|
| image | Pages are returned as one JPEG image per page. |
The ID is the call number of the book. To get the ID, browse to the desired book and then find the portion of the URL that has "call=XXX_XXXX". For example:
http://posner.library.cmu.edu/Posner/books/book.cgi?call=220_H31F
The ID for this book is 220_H31F.
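The book above could thus be downloaded with:

snatch posner 220_H31F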
Our Roots features books that highlight Canada's local history.
The roots module supports the following download formats:
| -f value | Description |
|---|---|
| image | Pages are returned as one JPEG image per page. |
The ID is an integer (4 characters). To get the ID, navigate to the table of contents page for the book you wish to retrieve. The URL will have a portion that looks like:
id=XXXX
The XXXX is the ID. Note that if you are viewing a specific page of a book, you will see a 6-digit number:
ID=XXXXXX
This is the ID for the page, not the book; return to the Table of Contents to find the correct number.
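For example, with a made-up four-character ID:

snatch roots 1234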
The Schoenberg Center for Electronic Text & Image has a lot of cool stuff.
I found no explicit notices of copyright on the site. It is unclear whether their mechanically reproduced page images are eligible for copyright.
The sceti module supports the following download formats:
| -f value | Description |
|---|---|
| image | Pages are returned as one JPEG image per page. |
To get the ID, browse to the desired book and then find the portion of the URL that has "textID=XXX_XXXX". For example:
http://dewey.library.upenn.edu/sceti/printedbooksNew/index.cfm?textID=B5083
The ID for this book is B5083.
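Assuming sceti is this module's short form (confirm with snatch -l), the book above could then be fetched with:

snatch sceti B5083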
The University of Michigan Historical Math Collection has a bunch of math books.
The ummath module supports the following download formats:
| -f value | Description |
|---|---|
| pdf | Pages are returned as one PDF file per page. |
| image | Pages are returned as one GIF image per page. |
Note: When downloading images, each image is generated on the server real-time. The script must access each page individually (i.e., send separate HTTP requests) first to generate the image, and then to download the generated image. When downloading PDF files, there is no need to send these extra HTTP requests.
The ummath ID number is in the form 'XXX####.####.###' (without quotes), where X is a letter and # is a number.
You can find the ID number by navigating to the book you want to retrieve and then copying the link address (URL) from that book. In the URL, there is a portion that looks like:
idno=
The portion following that, and continuing to the next ampersand (&), is the ID.
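Because PDF downloads skip those extra image-generation requests, you might prefer PDFs here; with a made-up ID of the form above:

snatch -f pdf ummath ABC1234.0001.001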
Historic Pittsburgh is a collection of historical texts published in the 19th and 20th centuries about the city of Pittsburgh. The collection is owned by the University of Pittsburgh.
The copyright page contains the following text:
Users of the Historic Pittsburgh website do not need to seek permission for downloading images for private or educational use. However, the University of Pittsburgh does retain the rights to the digital images available on this website.
This module supports the following download formats:
| -f value | Description |
|---|---|
| pdf | Pages are returned as one PDF file per page. |
| image | Pages are returned as one GIF image per page. |
The ID for each work can be found by navigating to the item and finding the portion of the URL that says:
id=AXX-XXX-XXXXXXXX
Where "A" is a letter and each "X" is a number.
philamer
The United States and its Territories "comprises the full text of monographs and government documents published in the United States, Spain, and the Philippines between 1870 and 1925."
The philamer module supports the following download formats:
| -f value | Description |
|---|---|
| pdf | Pages are returned as one PDF file per page. |
| image | Pages are returned as one GIF image per page. |
Note: When downloading images, each image is generated on the server real-time. The script must access each page individually (i.e., send separate HTTP requests) first to generate the image, and then to download the generated image.
The philamer ID number is in the form 'XXX####.####.###' (without quotes), where X is a letter and # is a number.
You can find the ID number by navigating to the book you want to retrieve and then copying the link address (URL) from that book. In the URL, there is a portion that looks like:
idno=
The portion following that, and continuing to the next ampersand (&), is the ID.
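For example, with a made-up ID of that form:

snatch philamer ABC1234.0001.001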
Wright American Fiction (WAF) is a collection of 19th century American fiction, as listed in Lyle Wright's bibliography American Fiction, 1851-1875. WAF is hosted by the Indiana University Digital Library Program.
WAF does not appear to have a copyright policy on its pages; however, the Digital Library Program at Indiana University has the following statement:
The university is currently seeking means to clarify the rights of use of many materials accessible on its Web pages. Unless rights of use are clearly stated with respect to an individual item, users must seek permission from the copyright owner for all uses that are allowed by fair use and other provisions of the U.S. Copyright Act. If you need assistance with identifying or locating the copyright owner of a work, please contact the owner of the page from which you linked to this statement.
It is unclear as to whether or not their mechanically reproduced page images are eligible for copyright protection.
The wright module supports the following download formats:
| -f value | Description |
|---|---|
| pdf | Pages are returned as one PDF file per page. |
| image | Pages are returned as one GIF image per page. |
Note: When downloading images, each image is generated on the WAF server real-time. The script must access each page individually (i.e., send separate HTTP requests) first to generate the image, and then to download the generated image. When downloading PDF files, there is no need to send these extra HTTP requests.
This wright ID number is in the form 'Wright2-####' (without quotes), where # is a number. For example, the unique ID for Herman Melville's book of short stories 'The Piazza Tales' is 'Wright2-1702'.
You can find the ID number by navigating to the book you want to retrieve and then copying the link address (URL) from that book. In the URL, there is a portion that looks like:
idno=
The portion following that, and continuing to the next ampersand (&), is the ID.
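For example, to download the Melville title mentioned above:

snatch wright Wright2-1702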
Here are some examples of how to use snatch – from basic to complex uses that you may never even care about or want to use. Please note that snatch requires perl to be installed.
Note: Most of these examples do not use real IDs.
Note for Windows users: You must run snatch from a DOS command prompt. Also, you will have to begin each command with "perl ".
The following command will download all the pages of a book from MOA - Michigan Books to the current directory.
snatch moamb AAA1234.0001.001
Using the -v flag will cause the script to update you on the progress of the download. The following command will download all the pages of a book from MOA - Michigan Books to the current directory.
snatch -v moamb AAA1234.0001.001
Some modules allow downloading pages in different formats. To download pages in a different format, use the -f flag. Check individual module documentation for supported formats. If an unsupported format is used, the value of the -f flag is ignored and the default format used.
The following command will download all the pages of a book from Wright American Fiction as PDF files.
snatch -f pdf wright Wright2-0987
The -d flag can be used to download files to a directory other than the current directory. The following example will download all the pages from a book at The Hockliffe Project to a subdirectory in the current directory called "hockliffe".
snatch -d hockliffe hock 0123
Note for Windows Users: By default, perl uses Unix-style directory format (e.g., /some/directory/path/); however, if you are more comfortable using a DOS-style directory path, you can. For example, if you wanted to save your images in C:\Hockliffe, you can use the following command.
snatch -d C:\Hockliffe hock 0123
You can set an offset of the number of pages you want to download using the -o flag. If the value of the -o flag is greater than the total number of pages in the book being downloaded, the script will stop once all pages have been downloaded. The following command will download the first 10 pages of a book from Wright American Fiction.
snatch -o 10 wright Wright2-4321
The -i flag will let you begin downloading pages at any page in the book. If the value of the -i flag is greater than the total number of the pages in the book, no pages will be downloaded. The following command will begin downloading at page 10 of a book from the University of Georgia.
snatch -i 10 uga abcd
Using the -i and -o flags in conjunction can be a handy device. For example, if you wanted to download a single article from a journal at MOA - Cornell Journals, and you know the article is on pages 11 through 20, you could use the following command.
snatch -i 11 -o 10 moacj harp0000
Literally, this command says, "Begin downloading at page 11 and continue for 10 pages." The tenth page would be page 20.
Not every site stores their page images in a sequential format. Using the -r flag, you can have the pages automatically renumbered (and padded) to 8 characters (plus extension). The following command will download all the pages from a book hosted by Early Canadiana Online and renumber the pages.
snatch -r can 67584
There are a couple of reasons you might want to wait between downloading files: to give your computer time to do other things, to open up your internet connection, or to be nice to the server you are downloading from by not bombarding them with multiple requests one right after another. The following command will cause the script to pause for 5 seconds before downloading each page image from a book at MOA - Michigan Books.
snatch -w 5 moamb ZZZ0987.0001.001
You may not want to actually download the page images using snatch. If instead you would rather just get a list of URLs that can be passed to another program – such as wget – you can use the -u flag. The following command will generate a list of URLs for a book at the Universal Library Scanserver.
snatch -u ulscan book0
Warning: The -u flag will override the -v flag (verbose reporting).
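To hand the list to another downloader such as wget, one approach (a sketch, assuming the URL list is printed to standard output) is to redirect it to a file and have wget read that file back:

snatch -u ulscan book0 > urls.txt

wget -i urls.txt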
Sometimes a person might need to extract an image from a PDF file. It's not very hard, but finding the right tools to do it can be tough.
There are some commercial tools available (for Windows anyway), but since both I and the people whom this tutorial is intended to help are poor, this tutorial will focus on free tools. (Incidentally, if you do have money to burn, I would recommend using PDF Extract TIFF.)
This tutorial assumes the following things:
Here are the tools we will use:
Please make sure pdfimages.exe is saved somewhere in your path (e.g., C:\Windows\system32) and that Irfanview is installed before continuing.
These steps assume the PDF files are in a directory called C:\pdfs and that the PDFs are named something like 0001.pdf, 0002.pdf, 0003.pdf, etc.
Open up your DOS prompt (Start->Run, type in "cmd" or "command" and then "Run")

Navigate to C:\pdfs
C:\> cd C:\pdfs
Extract the images from the PDF files:
To extract from just one file, run the following command:
C:\pdfs> pdfimages 0001.pdf 0001
This will extract the image and name it 0001-000.pbm. If there are multiple images, they will be sequential like 0001-000.pbm, 0001-001.pbm, 0001-002.pbm, etc.
To extract from multiple files, run the following command:
C:\pdfs> for %f in (*.pdf) do pdfimages %f %f
This will extract all the images in each of the files and name them 0001.pdf-000.pbm, 0002.pdf-000.pbm, 0003.pdf-000.pbm, etc.
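If you would rather the ".pdf" not be embedded in the image names, cmd can strip the extension with the %~nf modifier (a variant of the command above; if you put it in a batch file, double the percent signs, i.e. %%f and %%~nf):

C:\pdfs> for %f in (*.pdf) do pdfimages %f %~nf

This yields 0001-000.pbm, 0002-000.pbm, etc., matching the single-file naming shown earlier.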
At this point, your image files are extracted as PBM (Portable Bitmap) files. If these are ok for your purposes, then you are done; otherwise, continue on to convert the files to TIFF.
Before we go on, exit from DOS.
C:\pdfs> exit
To convert the PBM files to TIFF files, we will use Irfanview. Open one of the PBM files in Irfanview (doesn't matter which one) and then from the File menu select Batch Conversion/Rename

Where it says "Files of type:" select "PBM/PGM/PPM - Portable Bitmap"

On the left, select "Add all". A list of PBM files will appear

Under "Batch conversion settings:" check the "Use advanced options" checkbox and then click on "Set advanced options".

Under the "Set advanced options" dialog box, make the following selections:

Under "Batch conversion settings:" again, make sure the "Output format:" is set to "TIF - Tagged Image File Format".

On the left side again, select "Start".

A "Converting images" dialog box will open. Once the images are done converting, click "Exit".

That's it! You've got your TIFF files, nicely extracted. Do what you will with them.
Irfanview has a lot of options for converting image files. You may find that you want to use some other file format, you might want to make thumbnails, etc. Play around — have fun!
Coming soon...
The most prevalent plan for this script is to keep adding modules to support various archive sites. Some sites aren't easily snatch-able, but I will happily consider supporting any archive.
I am also working on documentation for anybody who wishes to submit their own site module.
My current development plans are below.
If you would like to see any other sites added to snatch, please let me know in the Distributed Proofreaders forum thread for this script: http://www.pgdp.net/phpBB2/viewtopic.php?t=4089
The main code of snatch is nigh complete (I think!), but there are a couple of new features I would like to implement. The following features may or may not be implemented in an upcoming release:
perl snatch --update-cache

The following people have provided code and/or ideas to help the development of snatch. Thanks a bundle!