Converting AWS Textract JSON to PRImA PAGE XML using textract2page

April 16, 2024

textract2page converts OCR results from AWS Textract JSON files into PRImA PAGE XML files. This article covers installing the package, using it from Python and from the command line, and running its regression tests.

Installation

To get started with textract2page, first set up and activate a Python virtual environment (for example, one created with python3 -m venv). Then install the textract2page package using the following command:

bash
pip install textract2page

Usage

The textract2page package includes a file-based conversion function that can be used via the command line interface (CLI) or the Python API.

Python API

To convert a Textract JSON file and its corresponding image file to a PAGE XML file using the Python API, follow these steps:

Import the convert_file function from the textract2page module.
Call the convert_file function with the path to the Textract JSON file, the path to the image file, and the desired output path for the PAGE XML file.
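The image file is passed alongside the JSON because Textract reports geometry as ratios of the page width and height, while PAGE XML stores coordinates in pixels. The sketch below only illustrates that scaling step; bbox_to_page_points is a hypothetical helper, not part of textract2page, and the library's own conversion may differ in detail.

```python
from typing import Dict


def bbox_to_page_points(bbox: Dict[str, float], img_width: int, img_height: int) -> str:
    """Scale a Textract-style relative bounding box to a PAGE-style points string.

    Hypothetical helper for illustration; not the textract2page implementation.
    """
    left = round(bbox["Left"] * img_width)
    top = round(bbox["Top"] * img_height)
    right = round((bbox["Left"] + bbox["Width"]) * img_width)
    bottom = round((bbox["Top"] + bbox["Height"]) * img_height)
    # PAGE XML writes a polygon as space-separated "x,y" pixel pairs.
    corners = [(left, top), (right, top), (right, bottom), (left, bottom)]
    return " ".join(f"{x},{y}" for x, y in corners)


# A box covering the middle half of a 1000x2000 pixel image.
example_bbox = {"Left": 0.25, "Top": 0.25, "Width": 0.5, "Height": 0.5}
print(bbox_to_page_points(example_bbox, 1000, 2000))
# -> 250,500 750,500 750,1500 250,1500
```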

Here’s an example of using the Python API to convert a Textract JSON file example.json and its corresponding image file example.jpg to a PAGE XML file example.xml:

python
from textract2page import convert_file

convert_file("example.json", "example.jpg", "example.xml")
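Because convert_file works on one file at a time, converting a whole folder needs a small loop. The following is a minimal sketch, assuming each JSON file sits next to an image with the same base name and a .jpg extension; find_pairs, convert_folder, and the scans folder are names made up for this example. The converter is passed in as an argument so the pairing logic can be tried without textract2page installed.

```python
from pathlib import Path
from typing import Callable, List, Tuple


def find_pairs(folder: Path, image_suffix: str = ".jpg") -> List[Tuple[Path, Path, Path]]:
    """Return (json, image, output) path triples for JSON files with a same-named image."""
    pairs = []
    for json_path in sorted(folder.glob("*.json")):
        img_path = json_path.with_suffix(image_suffix)
        if img_path.exists():  # skip JSON files that have no matching image
            pairs.append((json_path, img_path, json_path.with_suffix(".xml")))
    return pairs


def convert_folder(folder: Path, convert: Callable[[str, str, str], None]) -> int:
    """Run the given converter on every pair in the folder; return how many were converted."""
    pairs = find_pairs(folder)
    for json_path, img_path, out_path in pairs:
        convert(str(json_path), str(img_path), str(out_path))
    return len(pairs)


# Usage (requires `pip install textract2page`; "scans" is a hypothetical folder):
#   from textract2page import convert_file
#   convert_folder(Path("scans"), convert_file)
```

JSON files without a matching image are skipped silently here; a stricter script might log or raise instead.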

CLI

The same conversion is available from the command line. By default the PAGE XML is written to standard output, so redirect it to a file:

bash
textract2page example.json example.jpg > example.xml

You can also specify the output file directly using the -O option:

bash
textract2page -O example.xml example.json example.jpg

For a full list of available options, you can use the --help or -h flag.
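When the CLI has to be driven from a script, it can be called through Python's subprocess module. This is a rough sketch that uses only the -O option and the argument order shown above; build_command and run_conversion are hypothetical helper names, and the textract2page executable is assumed to be on the PATH.

```python
import subprocess
from typing import List, Optional


def build_command(json_path: str, img_path: str, out_path: Optional[str] = None) -> List[str]:
    """Assemble the argument list for the textract2page CLI."""
    cmd = ["textract2page"]
    if out_path is not None:
        cmd += ["-O", out_path]  # write to a file instead of standard output
    cmd += [json_path, img_path]
    return cmd


def run_conversion(json_path: str, img_path: str, out_path: str) -> None:
    """Run the CLI and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_command(json_path, img_path, out_path), check=True)


print(build_command("example.json", "example.jpg", "example.xml"))
# -> ['textract2page', '-O', 'example.xml', 'example.json', 'example.jpg']
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.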

Testing

To ensure the proper functioning of textract2page, it is recommended to run regression tests. The textract2page repository includes a set of tests that can be executed using pytest.

To run the regression tests, follow these steps:

Install the required dependencies by running make deps-test.
Run the tests using the command make test-api.

To run the regression tests for the CLI instead, use the following commands:

bash
sudo apt-get install xmlstarlet # optional, required for result tree validation
make test-cli

Please note that if xmlstarlet is available, the CLI test will also validate the result tree. Otherwise, it will only check for the completion of the command without errors.
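Outside the repository's test suite, a converted file can be given a quick sanity check with Python's standard library. The sketch below is not the validation performed by make test-cli; it only confirms that the root element is PcGts and counts a few PAGE element types. summarize_page_xml is a hypothetical helper, and SAMPLE is a hand-written fragment for illustration (it is not schema-valid output of textract2page).

```python
import xml.etree.ElementTree as ET
from typing import Dict


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def summarize_page_xml(xml_text: str) -> Dict[str, int]:
    """Check the root element and count regions, lines, and words in a PAGE XML string."""
    root = ET.fromstring(xml_text)
    if local_name(root.tag) != "PcGts":
        raise ValueError(f"unexpected root element: {root.tag}")
    counts = {"TextRegion": 0, "TextLine": 0, "Word": 0}
    for elem in root.iter():
        name = local_name(elem.tag)
        if name in counts:
            counts[name] += 1
    return counts


# Hand-written minimal fragment, for illustration only.
SAMPLE = (
    '<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">'
    '<Page imageFilename="example.jpg" imageWidth="1000" imageHeight="2000">'
    '<TextRegion id="r1"><TextLine id="l1"><Word id="w1"/><Word id="w2"/></TextLine></TextRegion>'
    "</Page></PcGts>"
)

print(summarize_page_xml(SAMPLE))
# -> {'TextRegion': 1, 'TextLine': 1, 'Word': 2}
```

Matching on the local tag name keeps the check independent of which PAGE schema version the namespace refers to.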

Conclusion

textract2page converts AWS Textract JSON files to PRImA PAGE XML files with a single function call or command. It installs with pip, and its repository includes regression tests for both the Python API and the CLI.

If you have any questions or need further assistance, feel free to reach out and share your thoughts.

References

Textract2page GitHub Repository: https://github.com/slub/textract2page
PRImA PAGE XML: https://github.com/PRImA-Research-Lab/PAGE-XML
AWS Textract: https://docs.aws.amazon.com/textract/latest/dg/how-it-works-document-layout.html
