9. Defining and deriving access-quality images

Binary versus tonal images for access. The discussion of the relative merits of binary and tonal images for access began with a consideration of computer-storage and file-handling issues. Much future access to documents will be offered via computer networks which, for the current period at least, have limited abilities to move large quantities of data quickly. The consultants noted that smaller, more efficient files could be made either way: binary at moderate resolution (e.g., 200 dpi) or JPEG files with reduced spatial resolution (e.g., 100 or 150 dpi) and/or increased compression (e.g., 30:1).
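The trade-off can be made concrete with rough arithmetic. The sketch below assumes a letter-size page (8.5 x 11 inches) and 8-bit grayscale for the tonal case; both are illustrative assumptions, not figures from the project, and the function name is hypothetical. It computes uncompressed sizes only, then applies the 30:1 JPEG ratio mentioned above. The Group 4 ratio for the binary file is deliberately left out because it depends heavily on page content.

```python
def raw_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of a scanned page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8

PAGE = (8.5, 11.0)  # assumed letter-size page, in inches

# Binary (1 bit per pixel) at moderate resolution.
binary_200 = raw_bytes(*PAGE, dpi=200, bits_per_pixel=1)

# Grayscale (assumed 8 bits per pixel) at reduced resolution.
gray_150 = raw_bytes(*PAGE, dpi=150, bits_per_pixel=8)
gray_100 = raw_bytes(*PAGE, dpi=100, bits_per_pixel=8)

# Apply the 30:1 JPEG compression ratio cited in the text.
jpeg_150 = gray_150 / 30
jpeg_100 = gray_100 / 30

print(f"binary 200 dpi, uncompressed: {binary_200:,} bytes")
print(f"JPEG 150 dpi at 30:1:         {jpeg_150:,.0f} bytes")
print(f"JPEG 100 dpi at 30:1:         {jpeg_100:,.0f} bytes")
```

Under these assumptions the uncompressed binary page is about 468 KB before Group 4 compression, while the heavily compressed JPEG pages come to roughly 70 KB and 31 KB.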

The discussion turned to the likely actions of researchers who desired the images. Most library and archive researchers, committee members asserted, require printed copies; "printability" is a very important feature to be considered. The printing requirement tends to reward binary images since most present-day laser printers accommodate such images more readily than tonal images.

With this guidance, the consultants produced an array of sample images:

The binary examples were compared with grayscale images that had high levels of JPEG compression and, in two examples, reduced spatial resolution. The goal was to create JPEG-compressed files comparable in size to the proposed binary access images.

Examples of reduced spatial resolution and extreme JPEG compression with visible artifacts, produced for the discussion of options for access images (see Section 9 below).

Binary preferred. For access purposes, most committee members indicated a preference for the 300 dpi binary image compressed using Group 4 compression, formally called ITU-T Recommendation T.6. (ITU-T is the international FAX standards organization formerly known as CCITT.) For the types of content seen in the FTP collection (typewritten and handwritten letters, no small point sizes), binary images at 400 or 600 dpi (interpolated from the 300 dpi grayscale source image) showed no noticeable improvement over the 300 dpi image. The consultants noted that, because of the way Group 4 compression works, the file size of the 300 dpi example is smaller than that of the 400 and 600 dpi examples in direct proportion to the resolution, not in proportion to the square of the resolution. The 300 dpi images print well and reasonably quickly on existing laser printers, many of which are not capable of 400 or 600 dpi printing. The grayscale images that were heavily compressed with the JPEG algorithm were not favored because they were harder to print and showed visible JPEG artifacts.
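The file-size observation can be illustrated with simple ratios. The sketch below does not model Group 4 itself; it only compares what linear and quadratic scaling would predict relative to the 300 dpi image. One plausible intuition, offered here as an assumption rather than a finding of the project, is that Group 4 encodes the positions of black/white transitions on each scan line relative to the line above, so for clean text the number of transitions per line stays roughly constant while the number of lines grows with resolution. The function name is hypothetical.

```python
def predicted_ratios(base_dpi, dpi):
    """Size ratio relative to base_dpi under two scaling hypotheses."""
    scale = dpi / base_dpi
    return {"linear": scale, "quadratic": scale ** 2}

for dpi in (400, 600):
    r = predicted_ratios(300, dpi)
    print(f"{dpi} dpi: linear x{r['linear']:.2f}, quadratic x{r['quadratic']:.2f}")
```

Raw pixel counts follow the quadratic column (a 600 dpi page has four times the pixels of a 300 dpi page), whereas the consultants observed compressed file sizes that follow the linear column.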

The access images are stored in TIFF (Tagged Image File Format) version 5.0 files. Although a de facto ("industry," not formal) standard promulgated by the Aldus (now Adobe) Corporation, the TIFF family of formats is in widespread use and employs a publicly disclosed set of tags to identify various parameters of the image in the file.
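To show what a "publicly disclosed set of tags" means in practice, the following is a minimal sketch that reads a few tags from the first image file directory (IFD) of a TIFF byte string. It relies on the published TIFF layout (byte-order mark, the magic number 42, then 12-byte tag entries) and on the standard tag numbers 256 (ImageWidth), 257 (ImageLength), and 259 (Compression, where the value 4 denotes Group 4). The helper name and the sample bytes are hypothetical; the sample is a header-only fragment built for illustration, not a complete image and not a file from the project.

```python
import struct

TAG_NAMES = {256: "ImageWidth", 257: "ImageLength", 259: "Compression"}

def read_basic_tags(buf):
    """Return {tag_name: value} for single SHORT/LONG entries in the first IFD."""
    order = {b"II": "<", b"MM": ">"}[buf[:2]]  # little- or big-endian
    magic, ifd_offset = struct.unpack(order + "HI", buf[2:8])
    if magic != 42:
        raise ValueError("not a TIFF file")
    (count,) = struct.unpack(order + "H", buf[ifd_offset:ifd_offset + 2])
    tags = {}
    for i in range(count):
        start = ifd_offset + 2 + 12 * i  # each IFD entry is 12 bytes
        tag, ftype, n = struct.unpack(order + "HHI", buf[start:start + 8])
        if n != 1 or tag not in TAG_NAMES:
            continue
        if ftype == 3:    # SHORT: value occupies the first 2 bytes of the field
            (value,) = struct.unpack(order + "H", buf[start + 8:start + 10])
        elif ftype == 4:  # LONG: value occupies all 4 bytes of the field
            (value,) = struct.unpack(order + "I", buf[start + 8:start + 12])
        else:
            continue
        tags[TAG_NAMES[tag]] = value
    return tags

# Hypothetical header-only sample: 2550 x 3300 pixels, Compression = 4.
entries = [(256, 3, 1, 2550), (257, 3, 1, 3300), (259, 3, 1, 4)]
sample = struct.pack("<2sHI", b"II", 42, 8)
sample += struct.pack("<H", len(entries))
for tag, ftype, n, value in entries:
    sample += struct.pack("<HHIHH", tag, ftype, n, value, 0)
sample += struct.pack("<I", 0)  # offset of next IFD: none

tags = read_basic_tags(sample)
print(tags)
```

Because the tag layout is documented, any software can identify the image dimensions and compression method without relying on the program that wrote the file.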

GIF and PDF formats. The consultants commented on the GIF format, an alternative way to produce grayscale images and one that is well supported in the World Wide Web environment. GIF images are widely used and employ the proprietary (patented) LZW compression method, which performs relatively poorly on natural images as compared to non-noisy computer-generated graphics (where long strings of identical values are common). On the WWW, GIF images are commonly used for navigational purposes, giving the user a sense of the content of a larger image before committing to a long download.

The committee also discussed the use of the Portable Document Format (PDF, a proprietary format developed by the Adobe Corporation) for access. PDF images can be viewed in a program called Acrobat; Adobe distributes a read-only version of Acrobat for the WWW at no cost to users. Since a PDF file can contain multiple page images, a viewer can move from one page to the next once the file is opened.

PDF has promise as an access format. It is based on PostScript, which is a presentation-control language (as opposed to an archival bit-mapped image format). Although PDF version 1.0 did not support binary data without conversion to printable ASCII (gibberish) characters, version 1.1 (current at the time of the demonstration project) permits the inclusion of binary data and natively supports both Group 4 compression and JPEG compression in addition to formatted text with specified fonts. Apart from concerns about the proprietary format, the key drawback to PDF for this demonstration project was the limited capability of the then-current version of the Acrobat read-only software, called Acroread. In Acroread version 2.0, the entire PDF file had to be downloaded before the first page was displayed. This was changed in the 3.0 and later releases, which permit incremental downloading (via "byteserving" code on the server) and viewing in a style more compatible with the WWW environment.
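The idea behind incremental downloading can be sketched in a few lines. The example below assumes that "byteserving" corresponds to HTTP byte-range requests, in which the client asks for a span of bytes with a `Range: bytes=first-last` header and the server returns only that span. The function names and the stand-in document are hypothetical, and the server side is a toy that handles only the simple first-last form.

```python
def range_header(first, last):
    """Build an HTTP Range header for bytes first..last (inclusive)."""
    return {"Range": f"bytes={first}-{last}"}

def serve_range(data, header_value):
    """Toy server side: return the slice named by a 'bytes=a-b' value."""
    unit, _, span = header_value.partition("=")
    if unit != "bytes":
        raise ValueError("only byte ranges are handled in this sketch")
    first, _, last = span.partition("-")
    return data[int(first):int(last) + 1]  # the range end is inclusive

# Stand-in for a large multi-page file held on the server.
document = bytes(range(256)) * 4

# The client fetches only the first 100 bytes instead of the whole file.
chunk = serve_range(document, range_header(0, 99)["Range"])
print(len(chunk), "of", len(document), "bytes transferred")
```

A viewer that requests only the bytes needed for the page being displayed can show that page without waiting for the rest of the file, which is the behavior the 2.0 release lacked.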

In the end, the desire to accommodate printing and to minimize the reliance on proprietary formats led the committee to decide that the testbed's access images should take the form of binary images in the TIFF format with Group 4 compression. Once production was under way and the time neared for actually presenting the Federal Theatre Project collection on the World Wide Web, however, the need for "screen" or "display" access images became more evident. As will be reported in Section 13 below, the Library produced GIF images of the document pages, and the online presentation features both screen-access GIF images and printer-access TIFF images.

