Welcome to Pemi’s documentation!

Pemi is a framework for building testable ETL processes and workflows.

Motivation

There are many ETL tools. So why do we need another one? Many tools emphasize performance, scalability, or building ETL jobs "code free" using GUI tools. A feature that is often lacking is the ability to build testable data integration solutions. ETL can be exceedingly complex, and small changes to code can have large effects on output, with potentially devastating effects on the cleanliness of your data assets. Pemi was conceived with the goal of making it possible to build highly complex ETL workflows while maintaining testability.

This project aims to be largely agnostic to the way data is represented and manipulated. There is currently some support for working with Pandas DataFrames, in-database transformations (via SQLAlchemy), and Apache Spark DataFrames. Adding a new data representation is a matter of creating a new Pemi DataSubject class.

Pemi does not orchestrate the execution of ETL jobs (for that kind of functionality, see Apache Airflow or Luigi). And as stated above, it does not force a developer to work with data using specific representations. Instead, Pemi's main role is to fill the space between manipulating data and orchestrating jobs.

Indices and tables