Automating PDF Collation for Two-Sided Scans

February 29, 2024

Imagine you want to scan documents on both sides, but your automatic document feeder (ADF) only scans one side. It can be quite a hassle to manually merge the two resulting PDFs in the correct order. That’s where the PDFCollate project comes in.

PDFCollate is a handy tool that automates the process of merging two PDFs into one correctly ordered document. Whether you frequently scan two-sided documents or just need to merge two PDFs occasionally, PDFCollate can save you time and effort.

How does it work?

PDFCollate monitors a designated directory, SOURCE_DIRECTORY, for PDF files. It assumes that the first PDF created in the directory is the front side of the document. The second PDF created is considered the back side and is merged with the front side to create a collated PDF, which is then saved in the designated DESTINATION_DIRECTORY.
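The collation step itself can be pictured with a minimal Python sketch that works on plain lists of page labels rather than real PDF files. The function name collate and the back_reversed flag are hypothetical and not part of PDFCollate. The sketch assumes the stack is flipped before the second pass, so the back sides arrive last page first; the post does not say how PDFCollate orders them.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def collate(front: Sequence[T], back: Sequence[T], back_reversed: bool = True) -> List[T]:
    """Interleave front-side and back-side pages into reading order.

    Hypothetical helper for illustration; not PDFCollate's actual code.
    """
    if len(front) != len(back):
        raise ValueError("front and back must have the same number of pages")
    # Assumption: flipping the stack for the second pass makes the back
    # sides come out last page first, so they are reversed here.
    ordered_back = list(reversed(back)) if back_reversed else list(back)
    merged: List[T] = []
    for front_page, back_page in zip(front, ordered_back):
        merged.extend([front_page, back_page])
    return merged


# A three-sheet document: fronts are pages 1, 3, 5; backs were scanned as 6, 4, 2.
print(collate(["p1", "p3", "p5"], ["p6", "p4", "p2"]))
# ['p1', 'p2', 'p3', 'p4', 'p5', 'p6']
```

If your scanner delivers the back sides in forward order, pass back_reversed=False.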

To address common problems, PDFCollate implements the following solutions:

Merging a new PDF with an old one: PDFCollate uses a timeout, COLLATE_TIMEOUT, that starts when the first PDF finishes writing. A PDF created before the timeout ends is treated as the second PDF. If the timeout elapses first, the next PDF to arrive becomes the new first PDF, and the previous one is evicted with a timeout warning.
Merging incompatible PDFs: PDFCollate checks that the two PDFs have the same number of pages. If the page counts differ, the second PDF replaces the first one and a warning is issued.
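
The two rules above amount to a small state machine with a single pending slot. The following Python sketch shows one way to express that logic; the Scan and Pairer names, the action strings, and the use of plain numeric timestamps are all assumptions made for illustration, not PDFCollate's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Scan:
    """Hypothetical record describing one scanned PDF."""
    name: str
    pages: int
    created_at: float   # when the file appeared, in seconds
    finished_at: float  # when the file finished writing, in seconds


class Pairer:
    """Pairs a front-side scan with a back-side scan (illustrative only)."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.pending: Optional[Scan] = None

    def offer(self, scan: Scan) -> Tuple[str, Optional[Tuple[Scan, Scan]]]:
        first = self.pending
        if first is None:
            # Nothing is waiting: this scan becomes the front side.
            self.pending = scan
            return "waiting", None
        if scan.created_at - first.finished_at > self.timeout:
            # Too late: evict the old scan and treat this one as a new front side.
            self.pending = scan
            return "timeout", None
        if scan.pages != first.pages:
            # Page counts differ: the second PDF replaces the first.
            self.pending = scan
            return "page-mismatch", None
        # Same page count, within the timeout: the pair can be collated.
        self.pending = None
        return "collate", (first, scan)


pairer = Pairer(timeout=60)
print(pairer.offer(Scan("front.pdf", 3, 0, 5))[0])   # waiting
print(pairer.offer(Scan("back.pdf", 3, 20, 25))[0])  # collate
```

Keeping only one pending scan keeps the behavior predictable: every new file either completes a pair or takes over the slot, so an unmatched scan does not wait indefinitely.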

While PDFCollate is a powerful tool for automating PDF collation, it does have some limitations. It currently relies on inotify, so it runs only on Linux. On other systems, you can work around this by running it in Docker.

Installation and Configuration

You can install PDFCollate using the Docker image cranium/pdfcollate, available on Docker Hub. Alternatively, you can use the Python package and install it with pip install pdfcollate.

To set up PDFCollate according to your specific needs, you can modify the following environment variables:

SOURCE_DIRECTORY: The directory to monitor for new PDF files.
DESTINATION_DIRECTORY: The directory where the collated PDF will be created.
COLLATE_TIMEOUT: The time window within which two PDFs are considered to belong to the same document.
OUTPUT_NAME_SUFFIX: A suffix added to the output PDF name.
DELETE_OLD_FILES: Controls whether or not to remove successfully merged files (True by default).
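
As a rough illustration of how such settings could be read in Python, here is a minimal sketch. The Config class and load_config function are hypothetical. Every default except DELETE_OLD_FILES (True, as stated above) is a placeholder, and treating COLLATE_TIMEOUT as a number of seconds is an assumption; check the project repository for the real defaults and units.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Config:
    """Hypothetical container for the settings listed above."""
    source_directory: str
    destination_directory: str
    collate_timeout: float
    output_name_suffix: str
    delete_old_files: bool


def load_config(env: Mapping[str, str] = os.environ) -> Config:
    # Placeholder defaults: only DELETE_OLD_FILES=True comes from the post.
    delete_raw = env.get("DELETE_OLD_FILES", "True")
    return Config(
        source_directory=env.get("SOURCE_DIRECTORY", "/source"),
        destination_directory=env.get("DESTINATION_DIRECTORY", "/destination"),
        # Assumption: the timeout is expressed in seconds.
        collate_timeout=float(env.get("COLLATE_TIMEOUT", "300")),
        output_name_suffix=env.get("OUTPUT_NAME_SUFFIX", "_collated"),
        # Assumption: these spellings count as "true"; anything else is false.
        delete_old_files=delete_raw.strip().lower() in ("1", "true", "yes"),
    )


config = load_config({"SOURCE_DIRECTORY": "/scans/two-sided", "COLLATE_TIMEOUT": "120"})
print(config.source_directory, config.collate_timeout, config.delete_old_files)
# /scans/two-sided 120.0 True
```

Passing the environment in as a mapping makes the loader easy to exercise without touching the real process environment.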

As an example of how PDFCollate can be utilized, consider the setup of the author, who has a NAS with two SAMBA directories: one for single-sided scans and the other for two-sided scans. By configuring their NAS docker-compose file with the PDFCollate Dockerfile and setting the appropriate volumes, they can easily merge two-sided scans.

Why PDFCollate?

As the saying goes, “necessity is the mother of invention.” The creator of PDFCollate needed a way to scan both sides of documents without too much hassle. PDFCollate solves this problem by automatically merging two PDFs into one correctly ordered document.

What’s Next?

While PDFCollate is already a useful tool, there are still opportunities for improvement. The following items are on the to-do list for future enhancements:

Upgrade alpine: The project is currently stuck at alpine:3.8 because it depends on the pdftk binary.
Document usage: Provide clear documentation on how to use PDFCollate as pure Python, as a Docker image, or in a docker-compose file.
Add CLI arguments for configuration: Enhance flexibility by adding command-line arguments for configuration options.
Add tests: Ensure that PDFCollate behaves correctly in all scenarios.

In conclusion, PDFCollate is a valuable tool for automating the collation of two-sided scanned documents. With its straightforward setup and powerful merging capabilities, it can save you time and effort. Give it a try and streamline your document management process!

Have any questions or want to contribute to PDFCollate? Feel free to check out the PDFCollate repository on GitHub and join the conversation.

References

Acknowledgements:
– PDFCollate. (n.d.). In GitHub Repository. Retrieved from https://github.com/RomainGehrig/PDFCollate

Licensing:
– PDFCollate is licensed under the MIT License.
