So I have this 13" Sony DPT-RP1 e-Reader that I use for reading studies and books. Some books I can't find online for free, or they are out of stock, and I sometimes have to resort to the Internet Archive to borrow scanned copies of them in PDF format. The problem is that the scans are not optimized for e-Ink screens: yellowish pages show up as dark gray on the monochrome screen, and text from the reverse side of the page is still visible through the paper. I needed clean images to be able to read a book under normal conditions.
For example, this online copy of Elijah Anderson's sociological study "A Place on the Corner" looked like this on my e-Reader:
The gray background distracts from the text and makes reading arduous.
So let's see how we can make this better with a few Bash commands and CLI tools on Linux.
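First extract all the images from the PDF into a folder. pdfimages treats its second argument as a file name prefix and does not create the folder for you, so create it first and then work inside it:
mkdir -p corner
pdfimages A_place_on_the_corner.pdf corner/page
cd corner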
This generates about 8GB of *.pbm and *.ppm files. The PPM files contain the scans of the pages, with text shadows and other elements such as stains or handwritten notes. The PBM files contain their negatives:
We only need the PBM files, and we'll convert them to PNG to retain image quality. You can do this for all the PBM images in the folder with
find -name '*.pbm' -print0 | xargs -0 -r mogrify -format png
You can now remove the PBM files with
rm *.pbm
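Before deleting the originals it may be worth confirming that every PBM file really has a PNG counterpart. A minimal sketch, assuming it is run against the folder holding the extracted images; the function name remove_converted_pbm is made up for this example and is not part of any tool:

```shell
# remove_converted_pbm (hypothetical helper): delete a .pbm file only when
# a non-empty .png with the same base name sits next to it.
remove_converted_pbm() {
    local dir="${1:-.}" pbm png
    for pbm in "$dir"/*.pbm; do
        [ -e "$pbm" ] || continue          # the glob matched nothing
        png="${pbm%.pbm}.png"
        if [ -s "$png" ]; then
            rm -- "$pbm"
        else
            echo "keeping $pbm: no PNG found" >&2
        fi
    done
}

# Example: remove_converted_pbm .
```

Any PBM file that failed to convert stays in place and is reported on stderr, so the conversion can be rerun for it.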
Now we will bulk-generate a negative of these images using convert, a tool from the ImageMagick package:
ls -1 *.png | xargs -n 1 bash -c 'convert "$0" -negate "${0%.png}.jpg"'
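The "${0%.png}" expansion in the command above strips the .png suffix so that .jpg can be appended. The same step could also be written as a plain loop, which avoids parsing the output of ls and copes with spaces in file names. This is only a sketch; jpg_name and negate_all are hypothetical helper names:

```shell
# jpg_name (hypothetical helper): map "page-001.png" to "page-001.jpg"
jpg_name() {
    printf '%s\n' "${1%.png}.jpg"
}

# negate_all (hypothetical helper): write a negated JPEG next to every PNG
# in the given folder, using ImageMagick's convert.
negate_all() {
    local dir="${1:-.}" png
    for png in "$dir"/*.png; do
        [ -e "$png" ] || continue          # the glob matched nothing
        convert "$png" -negate "$(jpg_name "$png")"
    done
}

# Example: negate_all .
```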
We've now created JPG files out of the PNG ones and we don't need the latter anymore, so we'll just delete them:
rm *.png
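We'll improve the contrast a bit by raising the black point, so that dark parts become even darker. Note the percent sign: without it, ImageMagick reads the value as a raw quantum level rather than a percentage:
ls -1 *.jpg | xargs -n 1 bash -c 'convert "$0" -level 60% "$0"'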
We now scale the images down to 50%, because we'll later generate a new PDF file out of them and we don't want it to be huge. mogrify overwrites the files in place, so the options go first and the file names last:
mogrify -resize 50% *.jpg
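Every step so far rewrites the JPEG, and JPEG compression loses a little detail on each save. A possible shortcut, assuming the PNG files are kept around instead of deleted, is to chain all three operations into one convert call so that each page is encoded only once. The function name clean_page and its default values (a 60% black point and a 50% scale) are assumptions made for this sketch:

```shell
# clean_page (hypothetical helper): negate, raise the black point and
# scale down in a single ImageMagick call.
#   $1 = source image, $2 = destination image
#   $3 = black point for -level (assumed default: 60%)
#   $4 = scale for -resize     (assumed default: 50%)
clean_page() {
    local src="$1" dst="$2" level="${3:-60%}" scale="${4:-50%}"
    convert "$src" -negate -level "$level" -resize "$scale" "$dst"
}

# Example: process every PNG in the current folder
# for f in *.png; do clean_page "$f" "${f%.png}.jpg"; done
```

Passing the level and scale as arguments makes it easy to try other values on a single page before running the whole book.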
To create a PDF out of these resized JPEG files use
convert *.jpg a_place_on_the_corner.pdf
ImageMagick uses predefined resource limits of 1GiB of disk storage and 256MiB of RAM, and it will output errors when working with large image files. To give it more resources, edit its policy.xml file:
sudo nano /etc/ImageMagick-6/policy.xml
Add new, bigger values for memory and disk to be able to run the above operation without errors
<policy domain="resource" name="memory" value="2256MiB"/>
<policy domain="resource" name="map" value="512MiB"/>
<policy domain="resource" name="width" value="16KP"/>
<policy domain="resource" name="height" value="16KP"/>
<policy domain="resource" name="area" value="128MB"/>
<policy domain="resource" name="disk" value="20GiB"/>
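If you would rather script the change than edit the file by hand, one sed call per resource can do it. This is a minimal sketch, assuming GNU sed and a policy file in which each resource already has a line of the form shown above. The helper name set_policy_limit is made up, and it is safer to try it on a copy first:

```shell
# set_policy_limit (hypothetical helper): replace the value of one
# resource policy in an ImageMagick policy file.
#   $1 = policy file, $2 = resource name, $3 = new value
# Assumes GNU sed (-i and -E) and an existing
# <policy domain="resource" name="..." value="..."/> line.
set_policy_limit() {
    local file="$1" name="$2" value="$3"
    sed -i -E "s|(<policy domain=\"resource\" name=\"$name\" value=\")[^\"]*|\1$value|" "$file"
}

# Example, working on a copy:
# cp /etc/ImageMagick-6/policy.xml /tmp/policy.xml
# set_policy_limit /tmp/policy.xml memory 2256MiB
# set_policy_limit /tmp/policy.xml disk 20GiB
```

Once the edited file is in place, running identify -list resource should report the new limits.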
Depending on file sizes, the convert command can take a long time to process every JPEG, and you'll need plenty of free disk space. If the resulting PDF file is too large and you don't need such a high resolution, you can scale the JPG images down even further and regenerate the output PDF. My resulting PDF file was about 120MB after scaling the images to about 25% of their original size.
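Resize percentages compound, because each pass works on the already reduced files: images that were scaled to 50% need another 50% pass to end up at 25% of the original size. A small sketch of that arithmetic; next_resize_percent is a hypothetical helper, and its integer division rounds down:

```shell
# next_resize_percent (hypothetical helper): given the scale the images
# are at now and the scale you want, both as a percentage of the original,
# print the percentage to pass to -resize.
next_resize_percent() {
    local current="$1" target="$2"
    echo $(( target * 100 / current ))
}

# Example: images are at 50% of the original, the goal is 25%
# mogrify -resize "$(next_resize_percent 50 25)%" *.jpg
# convert *.jpg a_place_on_the_corner.pdf
```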
The result is pictured below (before and after):
And here is the book on the monochrome e-Ink screen:
Better contrast, clean background, nicely-shaped letters.
Here's another example from a psychology manual. Notice how the text coming through from the other side of the page is gone and how the coffee stain at the bottom is less visible in the final output: