Streamlining Data Pipelines with onETL: A Python ETL/ELT Library
Data processing and transformation play a crucial role in today’s data-driven world. As businesses strive to harness the power of data, efficient and scalable data pipelines become essential. This is where onETL, a Python ETL/ELT library powered by Apache Spark and other open-source tools, comes into the picture. In this article, we will dive deep into onETL and explore how it can streamline your data pipelines.
Introducing onETL
onETL is a powerful Python library that provides unified classes to extract data from various sources and load it into different stores. Built on the robust Apache Spark framework, onETL leverages the Spark DataFrame API to perform transformations as part of ETL (extract, transform, load) pipelines. Whether you need to extract data from databases, run SQL queries, or execute DDL and DML statements, onETL provides a seamless experience.
Key Features of onETL
Unified Data Extraction and Loading
With onETL, you can extract data from different types of sources, including databases like Clickhouse, MSSQL, MySQL, Postgres, and Oracle, as well as file systems and file transfer protocols like HDFS, S3, SFTP, FTP, and WebDAV. onETL provides unified classes such as DBReader and FileDownloader to simplify the extraction process. Similarly, onETL offers DBWriter and FileUploader classes for loading data into different stores.
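As a minimal sketch of this unified interface (the host, credentials, and table names below are placeholders, not part of onETL), reading a table with DBReader looks roughly like this:

```python
from pyspark.sql import SparkSession
from onetl.connection import Postgres
from onetl.db import DBReader

# Spark session with the Postgres JDBC driver on the classpath
# (get_packages() is available in recent onETL releases)
spark = (
    SparkSession.builder
    .appName("onetl-demo")
    .config("spark.jars.packages", ",".join(Postgres.get_packages()))
    .getOrCreate()
)

# connection parameters are placeholders
postgres = Postgres(
    host="postgres.domain.com",
    user="etl_user",
    password="***",
    database="analytics",
    spark=spark,
)

# DBReader.run() returns an ordinary Spark DataFrame
reader = DBReader(
    connection=postgres,
    source="public.orders",
    columns=["id", "customer_id", "amount", "updated_at"],
    where="amount > 0",
)
df = reader.run()
```

FileDownloader and FileUploader follow the same pattern for files; the SFTP-to-HDFS example later in this article shows them in action.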
Spark DataFrame API for Transformations
onETL unleashes the power of Apache Spark’s DataFrame API for data transformations. You can perform complex operations on your data using Spark’s rich set of functions and methods. Whether you need to filter, aggregate, or join datasets, onETL lets you leverage the full potential of Spark’s DataFrame API.
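Because DBReader.run() hands back a regular Spark DataFrame, the transformation step is plain PySpark. Continuing the df from the previous sketch (column names are still placeholders):

```python
from pyspark.sql import functions as F

# filter, derive a column, and aggregate using the standard DataFrame API
daily_revenue = (
    df
    .filter(F.col("amount") > 0)
    .withColumn("order_date", F.to_date("updated_at"))
    .groupBy("order_date")
    .agg(
        F.sum("amount").alias("revenue"),
        F.countDistinct("customer_id").alias("customers"),
    )
)
```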
Direct Database Access
In addition to traditional ETL operations, onETL allows you to directly access databases using its built-in classes. You can execute SQL queries, perform DDL (Data Definition Language) and DML (Data Manipulation Language) statements, and even call functions and procedures. This capability enables you to build ELT (Extract, Load, Transform) pipelines and empowers you with greater flexibility.
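Reusing the postgres connection from the earlier sketch, direct access looks roughly like this (the table names and statements are made up for illustration):

```python
# run DDL / DML directly on the database
postgres.execute(
    "CREATE TABLE IF NOT EXISTS public.orders_agg (order_date date, revenue numeric)"
)
postgres.execute(
    "INSERT INTO public.orders_agg "
    "SELECT updated_at::date, sum(amount) FROM public.orders GROUP BY 1"
)

# fetch a small result set back as a DataFrame, e.g. for validation
check_df = postgres.fetch("SELECT count(*) AS row_count FROM public.orders_agg")
check_df.show()
```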
Support for Different Read Strategies
onETL provides support for different read strategies, allowing you to fetch data incrementally or in batch mode. Whether you need to ingest the latest data updates or process large volumes of historical data, onETL has got you covered. You can choose the read strategy that best fits your use case, ensuring optimal performance and efficient data processing.
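For example, an incremental read can be expressed with a strategy context manager. The sketch below reuses the postgres connection from above; note that the exact high-water-mark (HWM) argument of DBReader has changed between onETL releases, so treat this shape as an assumption and check the docs for your version:

```python
from onetl.db import DBReader
from onetl.strategy import IncrementalStrategy

reader = DBReader(
    connection=postgres,
    source="public.orders",
    # column used as the high-water mark; argument name/shape depends on the onETL version
    hwm=DBReader.AutoDetectHWM(name="orders_hwm", expression="updated_at"),
)

# only rows with updated_at greater than the previously stored HWM are fetched
with IncrementalStrategy():
    incremental_df = reader.run()
```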
Hooks and Plugins for Customization
onETL offers hooks and plugins mechanisms for altering the behavior of internal classes. You can customize the library to suit your specific needs and extend its functionality through plugins. This flexibility enables you to adapt onETL to your unique data pipeline requirements and enhance its capabilities.
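The snippet below is only a schematic illustration of the hook mechanism described in the onETL documentation; the decorated class is invented for the example, and the exact registration API should be checked against the hooks section of the docs:

```python
from onetl.hooks import support_hooks, slot, hook

@support_hooks
class MyConnection:  # hypothetical class, for illustration only
    @slot
    def connect(self, host: str):
        print(f"connecting to {host}")

@MyConnection.connect.bind
@hook
def log_connect(self, host: str):
    # called automatically whenever MyConnection.connect() is invoked
    print(f"hook: connect called for {host}")

MyConnection().connect("example.domain.com")
```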
Supported Storage Types
onETL supports various storage types for seamless integration with different databases and file systems. It provides connectors for databases like Clickhouse, MSSQL, MySQL, Postgres, Oracle, Teradata, Hive, Kafka, Greenplum, and MongoDB. Additionally, onETL supports file storage systems and protocols like HDFS, S3, SFTP, FTP, FTPS, WebDAV, and Samba. The wide range of supported storage types makes onETL a versatile solution for handling different data sources and destinations.
Practical Examples
To showcase the power and versatility of onETL, let’s explore a few practical examples:
Example 1: MSSQL to Hive
In this example, we will extract data from an MSSQL database and load it into Hive using onETL. We will initialize the necessary connections, perform data extraction using DBReader, apply transformations using Apache Spark, and finally load the transformed data into Hive using DBWriter.
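A condensed sketch of such a pipeline is shown below; the hosts, credentials, cluster name, and table names are placeholders, and the full version lives in the onETL documentation:

```python
from pyspark.sql import SparkSession, functions as F
from onetl.connection import MSSQL, Hive
from onetl.db import DBReader, DBWriter

# Spark session with the MSSQL JDBC driver pulled in via Maven coordinates
spark = (
    SparkSession.builder
    .appName("mssql-to-hive")
    .config("spark.jars.packages", ",".join(MSSQL.get_packages()))
    .enableHiveSupport()
    .getOrCreate()
)

mssql = MSSQL(
    host="mssql.domain.com",
    user="etl_user",
    password="***",
    database="sales",
    extra={"trustServerCertificate": "true"},  # driver options, adjust per environment
    spark=spark,
)
hive = Hive(cluster="my-cluster", spark=spark)

# Extract
df = DBReader(connection=mssql, source="dbo.orders").run()

# Transform with the Spark DataFrame API
df = df.withColumn("load_dt", F.current_date())

# Load
DBWriter(connection=hive, target="dwh.orders").run(df)
```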
Example 2: SFTP to HDFS
Here, we will download files from an SFTP server and upload them to HDFS using onETL. We will utilize FileDownloader and FileUploader classes to achieve this. Additionally, we will demonstrate how to apply filters and limits to control the files being processed and ensure efficient data transfer.
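A sketch of this transfer is below; paths, hosts, and credentials are placeholders, and the HDFS connection assumes WebHDFS access, so adjust it to your cluster:

```python
from onetl.connection import SFTP, HDFS
from onetl.file import FileDownloader, FileUploader
from onetl.file.filter import Glob, ExcludeDir
from onetl.file.limit import MaxFilesCount

sftp = SFTP(host="sftp.domain.com", user="etl_user", password="***")
hdfs = HDFS(host="namenode.domain.com", webhdfs_port=50070, user="etl_user")

# download only CSV files, skip a temp directory, and cap the batch size
downloader = FileDownloader(
    connection=sftp,
    source_path="/remote/export",
    local_path="/tmp/export",
    filters=[Glob("*.csv"), ExcludeDir("/remote/export/tmp")],
    limits=[MaxFilesCount(1000)],
)
download_result = downloader.run()

# upload the downloaded files to HDFS
uploader = FileUploader(
    connection=hdfs,
    local_path="/tmp/export",
    target_path="/data/export",
)
upload_result = uploader.run()
```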
Example 3: S3 to Postgres
In this example, we will directly read files from an S3 bucket, convert them to Spark DataFrames, apply transformations, and then write the transformed data into a Postgres database. We will leverage SparkS3 and Postgres connectors provided by onETL to seamlessly integrate with S3 and Postgres.
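The sketch below uses the SparkS3 connection together with onETL's file-to-DataFrame reader; the bucket, endpoint, credentials, Spark version, and file format options are all placeholders to adapt to your setup:

```python
from pyspark.sql import SparkSession, functions as F
from onetl.connection import SparkS3, Postgres
from onetl.db import DBWriter
from onetl.file import FileDFReader
from onetl.file.format import CSV

# both the S3 (hadoop-aws) and Postgres packages must be on the Spark classpath
maven_packages = SparkS3.get_packages(spark_version="3.5.0") + Postgres.get_packages()
spark = (
    SparkSession.builder
    .appName("s3-to-postgres")
    .config("spark.jars.packages", ",".join(maven_packages))
    .getOrCreate()
)

s3 = SparkS3(
    host="s3.domain.com",
    protocol="https",
    bucket="my-bucket",
    access_key="***",
    secret_key="***",
    spark=spark,
)
postgres = Postgres(
    host="postgres.domain.com",
    user="etl_user",
    password="***",
    database="analytics",
    spark=spark,
)

# read CSV files from the bucket straight into a Spark DataFrame
df = FileDFReader(
    connection=s3,
    source_path="/export/orders",
    format=CSV(header=True, delimiter=","),
).run()

# transform, then load into Postgres
df = df.withColumn("load_dt", F.current_date())
DBWriter(connection=postgres, target="public.orders").run(df)
```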
Conclusion
With its powerful capabilities, seamless integration with Apache Spark, and support for various storage types, onETL is a valuable tool for simplifying and streamlining data pipelines. Whether you are working with databases, file systems, or a combination of both, onETL empowers you to extract, transform, and load data efficiently. By leveraging the Spark DataFrame API and its integrated connectors, onETL provides a comprehensive solution for data processing and integration.
By harnessing the power of onETL, you can unlock the potential of your data and drive insights that can propel your business forward. Whether you are a data engineer, data scientist, or solution architect, onETL offers a robust framework for building scalable and performant data pipelines.
To get started with onETL, refer to the documentation at https://onetl.readthedocs.io/ and explore the various examples and tutorials available. Streamline your data pipelines with onETL and experience the power of efficient data processing.