Converting AWS Textract JSON to PRImA PAGE XML using textract2page
textract2page converts OCR results from AWS Textract JSON files into PRImA PAGE XML files. This article covers installing the package, using it from Python and from the command line, and running its regression tests.
Installation
To get started with textract2page, you will need to set up a Python virtual environment. Once your virtual environment is ready, you can install the textract2page package using the following command:
bash
pip install textract2page
Usage
The textract2page package includes a file-based conversion function that can be used via the command line interface (CLI) or the Python API.
Python API
To convert a Textract JSON file and its corresponding image file to a PAGE XML file using the Python API, follow these steps:
- Import the convert_file function from the textract2page module.
- Call the convert_file function with the path to the Textract JSON file, the path to the image file, and the desired output path for the PAGE XML file.
Here's an example of using the Python API to convert a Textract JSON file example.json and its corresponding image file example.jpg to a PAGE XML file example.xml:
python
from textract2page import convert_file
convert_file("example.json", "example.jpg", "example.xml")
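For a whole directory of scans, the single call above can be wrapped in a loop. The following is a minimal sketch, not part of textract2page itself: the helper names find_pairs and convert_directory are hypothetical, and it assumes that each JSON file sits next to an image with the same file stem and a .jpg suffix. The converter is passed in as an argument so the helpers can be tried out without the package installed.

```python
from pathlib import Path
from typing import Callable, List, Tuple


def find_pairs(directory: Path, image_suffix: str = ".jpg") -> List[Tuple[Path, Path]]:
    """Return (json_path, image_path) pairs that share a file stem."""
    pairs = []
    for json_path in sorted(directory.glob("*.json")):
        image_path = json_path.with_suffix(image_suffix)
        # Skip JSON files that have no matching image (assumed naming convention).
        if image_path.exists():
            pairs.append((json_path, image_path))
    return pairs


def convert_directory(
    directory: Path,
    convert: Callable[[str, str, str], None],
    image_suffix: str = ".jpg",
) -> List[Path]:
    """Call convert(json, image, output) for each pair; return the output paths."""
    outputs = []
    for json_path, image_path in find_pairs(directory, image_suffix):
        out_path = json_path.with_suffix(".xml")
        convert(str(json_path), str(image_path), str(out_path))
        outputs.append(out_path)
    return outputs


# Usage (assumes textract2page is installed; "scans" is a placeholder directory):
#   from textract2page import convert_file
#   convert_directory(Path("scans"), convert_file)
```

Each output file is written next to its input with an .xml suffix. Adjust image_suffix if your images use a different extension.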
CLI
If you prefer using the command line interface, textract2page provides CLI options for file conversion. To convert a Textract JSON file and its corresponding image file to a PAGE XML file, use the following command:
bash
textract2page example.json example.jpg > example.xml
You can also specify the output file directly using the -O option:
bash
textract2page -O example.xml example.json example.jpg
For a full list of available options, use the --help or -h flag.
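If the CLI needs to be driven from a Python script, for example inside an existing pipeline, it can be called through the standard subprocess module. This is a small sketch under the assumption that the textract2page executable is on the PATH. The function names build_command and run_conversion are hypothetical, and only the -O option shown above is used.

```python
import subprocess
from typing import List


def build_command(json_path: str, image_path: str, output_path: str) -> List[str]:
    """Build the argument list for: textract2page -O OUTPUT JSON IMAGE."""
    return ["textract2page", "-O", output_path, json_path, image_path]


def run_conversion(json_path: str, image_path: str, output_path: str) -> None:
    """Run the CLI; check=True raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_command(json_path, image_path, output_path), check=True)


# Usage (requires the textract2page executable to be installed):
#   run_conversion("example.json", "example.jpg", "example.xml")
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.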
Testing
To ensure the proper functioning of textract2page, it is recommended to run regression tests. The textract2page repository includes a set of tests that can be executed using pytest.
To run the regression tests, follow these steps:
- Install the required dependencies by running make deps-test.
- Run the tests using the command make test-api.
To run the regression tests against the command line interface instead, use the following commands:
bash
sudo apt-get install xmlstarlet # optional, required for result tree validation
make test-cli
Please note that if xmlstarlet is available, the CLI test will also validate the result tree. Otherwise, it will only check for the completion of the command without errors.
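Independently of the repository's test targets, a converted file can be given a quick sanity check from Python. The sketch below is not what make test-cli does. It only relies on the general shape of a PAGE XML document (a PcGts root element containing a Page element with an imageFilename attribute, and TextLine elements for recognized lines). The function names are hypothetical, and tags are compared without their namespace because the exact PAGE schema version written by textract2page is not assumed here.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def summarize_page(xml_data) -> dict:
    """Parse PAGE XML (str or bytes) and report the image name and line count."""
    root = ET.fromstring(xml_data)
    if local_name(root.tag) != "PcGts":
        raise ValueError(f"unexpected root element: {root.tag}")
    # The Page element is expected as a direct child of the root.
    page = next((el for el in root if local_name(el.tag) == "Page"), None)
    if page is None:
        raise ValueError("no Page element found")
    lines = [el for el in root.iter() if local_name(el.tag) == "TextLine"]
    return {"image": page.get("imageFilename"), "text_lines": len(lines)}


# Usage on a converted file:
#   from pathlib import Path
#   print(summarize_page(Path("example.xml").read_bytes()))
```

A file that is not well-formed XML raises xml.etree.ElementTree.ParseError, so this doubles as a basic well-formedness check.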
Conclusion
In conclusion, textract2page converts AWS Textract JSON files to PRImA PAGE XML files through a single convert_file function or an equivalent command line call. It installs with pip, and its repository includes regression tests for both the Python API and the CLI.
If you have any questions or need further assistance, feel free to reach out and share your thoughts.
References
- Textract2page GitHub Repository: https://github.com/slub/textract2page
- PRImA PAGE XML: https://github.com/PRImA-Research-Lab/PAGE-XML
- AWS Textract: https://docs.aws.amazon.com/textract/latest/dg/how-it-works-document-layout.html