#0032: Instructions on digitising physical documents

Preamble

This is a quick guide for anyone interested in creating their own digital archives of physical documents. Although there are undoubtedly any number of ways to achieve this, I only intend to show you one: the method that I use (at the time of writing) to create, label, modify, and archive document files, such as the ones hosted on this website’s “Device Document Scans” page.

Hyperlink: https://www.tinkerersblog.net/device-document-scans

Tools and equipment

Hardware:

  • flatbed scanner
  • personal computer

Software:

  • Linux Mint (operating system)
  • Bash terminal (TUI program for accessing other TUI programs)
  • simple-scan (GUI scanning program)
  • GIMP (GUI WYSIWYG image manipulation program)
  • ImageMagick convert (TUI image manipulation program)
  • img2pdf (TUI file format conversion program)
  • xviewer (GUI image displayer program)
  • xreader (GUI PDF displayer program)

Process overview

1) Scanning the physical document.
2) Initial edit, and virtual file export of scanned images.
3) Edit of image dimensions and watermark application.
4) Creation of alpha-less versions of the edited images.
5) Compilation of all alpha-less images into a single PDF file.
6) Test, organisation, and archiving of files.

Process explained

1) Scanning the physical document.

I use the flatbed scanner on a Pantum M6607NW laser printer-scanner combo, in conjunction with a standard GUI GNU/Linux program called simple-scan. One by one, I scan all the document’s pages using a 300 DPI (Dots Per Inch) image fidelity setting.

2) Initial edit, and virtual file export of scanned images.

I use simple-scan to export all the raw scanned images in a lossless PNG image file format.

Although simple-scan has some basic image editing functionality, such as image rotation and cropping, I tend to shy away from cropping images here due to the tool’s lack of precision. However, a rough crop to minimize image file size can be useful at this stage, especially when scanning documents with a smaller page size (e.g. A5), which would otherwise have a lot of needless (memory consuming) white-space in each image.

Additionally, I find rotating whole images at this stage using simple-scan to be a better experience than rotating them later using GIMP (or even xviewer). Anecdotally, it seems to use fewer system resources for some reason. It’s just a smoother experience.

As for the outputted files themselves: I like suffixing metadata onto the file name, in this case “_300DPI_scan”. This helps to identify specific files when they all get archived together.

It also adds a certain element of future-proofing, because I may want to create higher or lower DPI versions of the same documents for specific purposes in the future without causing a naming conflict and upsetting my global naming scheme.

Output:

generic_manual_p1_300DPI_scan.PNG
generic_manual_p2_300DPI_scan.PNG
generic_manual_p3_300DPI_scan.PNG …
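The naming scheme above can be written down as a small shell helper. This is only a sketch: the function name scan_name is hypothetical, and it assumes the pattern is always document name, page number, then DPI.

```shell
# Build a scan file name from a document name, page number, and DPI.
# scan_name is a hypothetical helper, not part of any program used here.
scan_name() {
    printf '%s_p%s_%sDPI_scan.PNG\n' "$1" "$2" "$3"
}

# Usage: a 600 DPI rescan of page 1 gets its own, non-conflicting name.
#   scan_name generic_manual 1 600
```

Because the DPI is part of the name, a later rescan at another setting can sit beside the 300 DPI files without overwriting them.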

3) Edit of image dimensions and watermark application.

I use GIMP (GNU Image Manipulation Program) to crop each page image with pixel perfect uniformity (i.e. to the same image dimensions). Then I apply my watermark to each page and export them as PNG images again. I mark the exported PNG files with the ‘WM_’ prefix to differentiate them from the original PNG images, which would otherwise have the same file name.

For the sake of clarity I should state that I keep all the original files (raw scan images) just in case I need to work with them again and, for some reason, do not wish to use the edited versions. It’s good practice to always keep and archive the original unadulterated images for instances like these.

Input:

generic_manual_p1_300DPI_scan.PNG
generic_manual_p2_300DPI_scan.PNG
generic_manual_p3_300DPI_scan.PNG …

Output:

generic_manual_p1_300DPI_scan.PNG
generic_manual_p2_300DPI_scan.PNG
generic_manual_p3_300DPI_scan.PNG …

WM_generic_manual_p1_300DPI_scan.PNG
WM_generic_manual_p2_300DPI_scan.PNG
WM_generic_manual_p3_300DPI_scan.PNG …
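One way to double check that every cropped page really has the same dimensions is to compare the sizes reported by ImageMagick's identify program. The sketch below is an optional extra, not part of the GIMP workflow itself; the helper name all_lines_equal is hypothetical.

```shell
# Succeed only if every line read from standard input is identical
# (an empty input also counts as uniform).
all_lines_equal() {
    count=$(sort -u | wc -l)
    [ "$((count))" -le 1 ]
}

# Usage (assumes ImageMagick is installed):
#   identify -format '%wx%h\n' WM_generic_manual_p*_300DPI_scan.PNG | all_lines_equal \
#       && echo "all pages share the same dimensions"
```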

4) Creation of alpha-less versions of the edited images.

I use the terminal “convert” program to remove the alpha layers (alpha channels) of every PNG image. This is because “img2pdf” cannot compile PNG images that contain alpha layers (i.e. clear sections/layers within an image) into a PDF. If you try to, img2pdf will return an error message that contains additional instructions. Unfortunately it will still output a 0 byte PDF file, which you will have to delete.

Error message:

WARNING:root:Image contains transparency which cannot be retained in PDF.
WARNING:root:img2pdf will not perform a lossy operation.
WARNING:root:You can remove the alpha channel using imagemagick:
WARNING:root: $ convert input.png -background white -alpha remove -alpha off output.png
ERROR:root:error: Refusing to work on images with alpha channel

The “convert” command options first set the image’s background colour to white. This is the colour that replaces any clear (or alpha) sections of the image. Next the alpha sections of the image are removed, then all alpha functionality of the PNG file is switched off.

Please note that ImageMagick applies its options in the order they are given, so it is safest to keep “-background white” ahead of “-alpha remove”, which uses that background colour. Additionally, the “convert” program does not actually convert the original files inputted into it; it instead outputs a modified copy. It will however overwrite the original file if you give the output file an identical name.

I suffix the “_no_alpha” label onto the outputted files to differentiate them from their predecessors. As you can see, the file names are getting long and unwieldy, especially if the manual itself already has a long name. However, the various prefixes and suffixes all serve a purpose and are necessary for file version distinction.

Command:

convert WM_generic_manual_p1_300DPI_scan.PNG -background white -alpha remove -alpha off WM_generic_manual_p1_300DPI_scan_no_alpha.PNG

Input:

WM_generic_manual_p1_300DPI_scan.PNG
WM_generic_manual_p2_300DPI_scan.PNG
WM_generic_manual_p3_300DPI_scan.PNG …

Output:

WM_generic_manual_p1_300DPI_scan.PNG
WM_generic_manual_p2_300DPI_scan.PNG
WM_generic_manual_p3_300DPI_scan.PNG …

WM_generic_manual_p1_300DPI_scan_no_alpha.PNG
WM_generic_manual_p2_300DPI_scan_no_alpha.PNG
WM_generic_manual_p3_300DPI_scan_no_alpha.PNG …
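Running the command once per page gets tedious for long manuals, so a loop is one option. The following is a minimal sketch, assuming every input name ends in a file extension; the helper names no_alpha_name and strip_alpha are hypothetical.

```shell
# Derive the output name by inserting "_no_alpha" before the file extension.
no_alpha_name() {
    printf '%s_no_alpha.%s\n' "${1%.*}" "${1##*.}"
}

# Write an alpha-less copy of each given image, leaving the originals untouched.
# Stops at the first image that convert fails on.
strip_alpha() {
    for img in "$@"; do
        convert "$img" -background white -alpha remove -alpha off \
            "$(no_alpha_name "$img")" || return 1
    done
}

# Usage:
#   strip_alpha WM_generic_manual_p*_300DPI_scan.PNG
```

The glob in the usage line only matches names ending in “_scan.PNG”, so files already carrying the “_no_alpha” suffix are not processed a second time.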

5) Compilation of all alpha-less images into a single PDF file.

I compile all the watermarked, alpha-less versions of the image files into a single PDF file using “img2pdf” via the terminal.

Command:

img2pdf WM_generic_manual_p1_300DPI_scan_no_alpha.PNG WM_generic_manual_p2_300DPI_scan_no_alpha.PNG … -o generic_manual_300DPI_scan.PDF

Input:

WM_generic_manual_p1_300DPI_scan_no_alpha.PNG
WM_generic_manual_p2_300DPI_scan_no_alpha.PNG
WM_generic_manual_p3_300DPI_scan_no_alpha.PNG …

Output:

generic_manual_300DPI_scan.PDF
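For documents with ten or more pages, a plain shell glob would sort p10 ahead of p2, which would put the PDF pages in the wrong order. The sketch below works around that with GNU sort's version ordering, and also tidies up the 0 byte PDF mentioned in step 4 if img2pdf refuses to run. Both helper names are hypothetical.

```shell
# Print the given file names one per line in natural page order (p2 before p10).
# Relies on "sort -V" from GNU coreutils.
pages_in_order() {
    printf '%s\n' "$@" | sort -V
}

# Delete a file only if it is missing content (0 bytes), e.g. after a failed run.
remove_if_empty() {
    [ -s "$1" ] || rm -f -- "$1"
}

# Usage (assumes file names without whitespace, as in the naming scheme above):
#   out=generic_manual_300DPI_scan.PDF
#   img2pdf $(pages_in_order WM_generic_manual_p*_300DPI_scan_no_alpha.PNG) -o "$out" \
#       || remove_if_empty "$out"
```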

6) Test, organisation, and archiving of files.

This stage firstly involves testing that the PDF works as expected: that it opens, that all the pages contained therein are in the correct order, and that they render and scale correctly. To do this I just open the file using Mint’s default PDF viewer program (namely xreader), and skim through the document’s pages.

This stage also involves putting each collection of images from the various stages of this process into its own labelled ZIP format archive file, then placing all these files into another container ZIP alongside the ultimate resultant PDF.

The container ZIP is then placed into the local “device_document_scans” folder, which is then copied over to the backups. Finally, I also upload the PDF by itself onto this website.

Output:

generic_manual_300DPI_scan.ZIP

Containing:

generic_manual_300DPI_scan.PDF
imageset_no_alpha.ZIP
imageset_raw.ZIP
imageset_watermarked.ZIP
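The archive layout above could be produced from the terminal along these lines. This is a sketch only: it assumes the “zip” command is installed, that it is run from the folder holding all the files, and that the files follow the naming scheme used throughout this guide. The function name archive_document is hypothetical.

```shell
# Bundle the three image sets and the PDF into one container archive.
# $1 is the document name (e.g. generic_manual), $2 the DPI (e.g. 300).
archive_document() {
    doc=$1
    dpi=$2
    zip -q imageset_raw.ZIP "${doc}"_p*_"${dpi}"DPI_scan.PNG &&
    zip -q imageset_watermarked.ZIP WM_"${doc}"_p*_"${dpi}"DPI_scan.PNG &&
    zip -q imageset_no_alpha.ZIP WM_"${doc}"_p*_"${dpi}"DPI_scan_no_alpha.PNG &&
    zip -q "${doc}_${dpi}DPI_scan.ZIP" "${doc}_${dpi}DPI_scan.PDF" \
        imageset_raw.ZIP imageset_watermarked.ZIP imageset_no_alpha.ZIP
}

# Usage:
#   archive_document generic_manual 300
```

The commands are chained with “&&” so that the container archive is only built if all three image sets were archived successfully.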

Thoughts on tools and equipment

Hardware

As far as hardware requirements go, it’s just the bare essentials really: a decent scanner and computer. Neither device needs to be anything special, just fit for purpose.

Computer

As for computers, whatever computer you are currently using is likely to be just fine. The main thing that may become an issue is system RAM size, and even then only when scanning large (600+ DPI) multi-page documents in one session.

This is because the scanning program will have to hold all these rather large images uncompressed within the RAM as you scan through the document. RAM may also become an issue when using image manipulation software like GIMP. If it is too low it may limit how many images you can work on concurrently. At the very least it may limit your ability to do other things on the machine as you process these images, for example running a RAM greedy application such as a modern internet browser (e.g. Firefox or Google Chrome).

Another thing that may be a limiting factor is CPU processing power. When converting file formats or compiling a series of images into a portable document file, your system may freeze or become unresponsive, especially if the programs running aren’t optimised to be multithreaded. In that case all of the work gets queued on a single CPU core, and on a weaker machine user input can end up waiting behind it, which causes the unresponsiveness.

To sum it up, any computer with at least 2-4 gigabytes of RAM and an early generation Intel i3 processor will likely suffice. However, there are many variables that may affect whether or not these system requirements are adequate, such as the desired scan image size and the resource use of the operating system, scanning program, and background processes.

Scanner

Now onto the scanner. Most if not all modern flatbed scanners should be adequate. Chances are that if they connect to your computer via USB 2.0 or better, then they are new enough to provide the 300 DPI (dots per inch) image quality that I use for digitising my manuals. If you are scanning photographs you may require a higher setting such as 600 DPI to maximize image detail retention.

However since the value of my manuals is rather utilitarian in nature, 300 DPI is a fine image quality for my use case. By ‘utilitarian’ I mean that the information printed onto the manuals is what I am primarily preserving, and not each page’s visual aesthetic. Because of this I just need them to be legible without necessarily preserving every minute page detail.

Heck, an argument could even be made to go down to a 75 DPI scan setting, as it’s perfectly useable whilst also minimizing all file sizes, including all intermediary portable network graphic images as well as the final portable document file.

However I find that working with 300 DPI images (which translate to a maximum of 2550*3507 pixels for an uncropped full scan) is a good compromise between image detail and workability/use-ability.

Example of 1200 DPI scanned image unable to be displayed with xviewer

Scan DPI example files


(Feel free to download and test these files on your own system.)

Scan image metadata translations

(Translations based on a scan of the full scanner bed of a PANTUM M6607NW)

Key: scan quality (Dots Per Inch) / image dimensions (pixels) / file size (kB or MB)

  1. 75 DPI / 637*876 p / 870.9 kB (lossless PNG)
  2. 150 DPI / 1275*1753 p / 4.2 MB (lossless PNG)
  3. 300 DPI / 2550*3507 p / 17.5 MB (lossless PNG)
  4. 600 DPI / 5100*7014 p / 62.8 MB (lossless PNG)
  5. 1200 DPI / 10200*14028 p / 211.9 MB (lossless PNG)
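The pixel dimensions in this list are simply the scan area in inches multiplied by the DPI setting. The sketch below reproduces them, assuming a scan area of 8.50 by 11.69 inches (inferred from the figures above, not taken from the Pantum's specifications) and, for the memory estimate, 8-bit RGB at 3 bytes per pixel, which is also an assumption. The function names are hypothetical.

```shell
# Assumed scan area in hundredths of an inch (8.50 x 11.69), inferred from the list above.
BED_W=850
BED_H=1169

# Print "WIDTHxHEIGHT" in pixels for a full-bed scan at the given DPI (rounded down).
scan_pixels() {
    echo "$(( BED_W * $1 / 100 ))x$(( BED_H * $1 / 100 ))"
}

# Print the size in bytes of that image when held uncompressed,
# assuming 8-bit RGB (3 bytes per pixel).
raw_bytes() {
    echo "$(( (BED_W * $1 / 100) * (BED_H * $1 / 100) * 3 ))"
}

# Usage:
#   scan_pixels 300   # prints 2550x3507
#   raw_bytes 300     # prints 26828550 (roughly 27 MB per page)
```

Under the same assumptions a single 1200 DPI page comes to roughly 429 MB uncompressed, which helps explain why RAM becomes the bottleneck at higher settings.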

Software

Since my operating system of choice is Linux Mint running the Cinnamon desktop environment, I just use the programs that are either included with the initial install as standard, or downloaded from the standard Ubuntu repository if necessary.

Simple-scan comes preinstalled with Linux Mint as the default scanning utility program. There are more robust alternatives such as ‘xsane’; however, my philosophy with regards to tools like this is that one only upgrades tools or seeks alternatives when the default tools are found to be wanting, i.e. when there’s a particular functionality or quality that the current toolset doesn’t provide. Since the default simple-scan program provides adequate functionality, I don’t need to seek alternatives just for the sake of it.

Moving on. GIMP, ImageMagick, and ‘img2pdf’ are all available within the standard Ubuntu software repository, so all three can be installed using the ‘sudo apt-get install’ command. However, if you are using a Linux distro other than Linux Mint, it is recommended that you first use the “apt-cache search [program]” command to ascertain whether or not they are available within whatever repository you are using.

sudo apt-get install gimp
sudo apt-get install imagemagick
sudo apt-get install img2pdf
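Before installing anything, it can be worth checking which of the programs are already present. Here is a minimal sketch; the function name missing_tools is hypothetical, and note that the imagemagick package is checked via its “convert” command.

```shell
# Print the name of each given command that cannot be found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Usage: prints nothing if everything is already installed.
#   missing_tools simple-scan gimp convert img2pdf
```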

To sum up GIMP: if you are coming from Windows, you may be used to other image manipulation programs like ‘paint.net’ or ‘Adobe Photoshop’, if not GIMP itself, since it is a multiplatform program that is also available on Windows. Anyway, if you have used any modern full-suite WYSIWYG image manipulation program, then GIMP will be an easy enough program to jump on to.

Finally, ImageMagick. This is a software toolkit that you access via the Bash terminal. Many people, including myself, prefer TUI based programs like this due to their ease of use, user interface uniformity, and functional robustness.

I often write scripts including commands that utilise programs that can be accessed via Bash. The programs provided by ImageMagick are no different. Once a person gets used to using them, it becomes a natural progression to create scripts which then automate the process.

This would be useful for situations such as batch conversion of multiple files, as scripting allows the user to go AFK or do something else rather than babysit the process. Scripting and chaining commands like this is probably the greatest strength of CLI/TUI programs over GUI programs.

Closing thoughts

If you aren’t already accustomed to using any Linux based distro, then one thing I recommend keeping in mind is hardware compatibility. It is probably this platform’s biggest weakness.

This is specifically because most companies build their products to target the Windows platform, often facilitating device functionality through proprietary drivers and oft times even proprietary programs, such as controller programs for LED keyboards. These drivers are sometimes absent on Linux. However, in most cases there are open-source alternatives.

In the past this used to be a bigger issue. Thankfully the list of supported peripheral devices has gotten much better as of late. At the time of writing, and according to my personal experience as well as some online reading: most devices work flawlessly plug-and-play, some devices work for the most part but are missing some advanced functionality, and some devices don’t work at all.

Unfortunately the best way to tell whether or not your device will work is by simply plugging it in and fiddling with settings and open source drivers until it either eventually works, or you give up. Whichever comes first.

As an example: I had quite a few issues with my system not recognising my Pantum M6607NW printer-scanner combo properly, despite official Linux drivers being available on the standard repository and via the company’s website. Even now, after resolving that problem and getting the thing working, I am still having some minor issues with the device.

For example, if you paid attention to the images above, you may have noticed that Simple-scan allows for a 2400 DPI scan in conjunction with the Pantum M6607NW. Unfortunately this setting doesn’t work as expected. It does scan the document, and it does so noticeably slower than on the 1200 DPI setting, which is as expected since the scan head is collecting more detail from each page segment. However, the resultant image has the same pixel dimensions as a 1200 DPI scan. So if there is a higher detail density, it isn’t reflected in larger image dimensions, as is the case with all other DPI settings.

Although xviewer failed to open images of this size, the Firefox browser did not; and upon visual inspection and detail comparison between the 1200 and the 2400 DPI scans, I have concluded that they are identical. See for yourself; the files are listed in this article. Knowing this, it is likely that simple-scan is providing an option that the scanner cannot support, although the Pantum’s slower read speed on the 2400 setting has me doubting this conclusion, since it seems to exhibit a programmed hardware response to this setting.

I could likely find the solution eventually by combing through the official generic M6600 series online manual for my machine, then hunting down more specific documentation … although it is frankly not a priority at this point, as I am not planning on using a 2400 DPI scan setting anytime soon. I only highlight this specific issue to make you aware of the kind of troubleshooting fun to expect on this platform.

So if you are moving to a Linux based platform for productivity purposes, well you can’t say that you haven’t been warned. Having said that, don’t let that stop you from using this platform for this purpose. When it works it works fantastically, and when it doesn’t there is always something that you can do yourself to make it work. You have to get used to being your own tech support.

Best of luck archiving your documents, and as always:
Thank you for reading.

Glossary of terms

AFK: Away From Keyboard
Bash: Bourne Again SHell
CLI: Command Line Interface
DPI: Dots Per Inch
GIMP: GNU Image Manipulation Program
GUI: Graphical User Interface
PDF: Portable Document Format
PNG: Portable Network Graphics
PnP: Plug and Play
TUI: Text User Interface
WYSIWYG: What You See Is What You Get

Links, references, and further reading