Hey Folks,
I'm Robson, and I work as a Data Engineer. One of my duties is guaranteeing data quality in pipelines, and the projects I've worked on never had a really good solution for bad data quality. Every existing data quality tool I tried (Great Expectations, Soda, dbt tests) requires you to write the rule for the failure first, which is exactly the hard part: we can't anticipate all the problems in advance.
So I have been working on Scherlok, which takes the opposite approach: profile the data first, then detect when something changes. As a result, there's no YAML to write and no deep configuration to perform. It currently profiles volumes, schemas, NULL rates, distributions, freshness cadence, and cardinality, stores the results locally, and detects when something changes in subsequent runs. Severity is classified into three levels: INFO, WARNING, and CRITICAL.
The code is pure Python, z-score based, and intended to be lightweight compared to more complex and sometimes expensive commercial solutions.
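To give a feel for the idea, here is a minimal, illustrative sketch of a z-score check on a profiled metric such as daily row counts. The function name and thresholds are my own for illustration, not Scherlok's actual implementation or defaults:

```python
import statistics

def zscore_alert(history, new_value, warn=2.0, critical=3.0):
    """Classify a new metric value against its historical baseline.

    Thresholds are illustrative, not Scherlok's actual defaults.
    """
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # No historical variance: any deviation at all is suspicious.
        return "INFO" if new_value == mean else "CRITICAL"
    z = abs(new_value - mean) / stdev
    if z >= critical:
        return "CRITICAL"
    if z >= warn:
        return "WARNING"
    return "INFO"

# Daily row counts for a table, then a sudden drop:
history = [10_000, 10_200, 9_900, 10_100, 10_050]
print(zscore_alert(history, 10_080))  # → INFO (within normal range)
print(zscore_alert(history, 4_000))   # → CRITICAL (far outside baseline)
```

The appeal of this approach is that the baseline comes from the data itself, so nobody has to guess the failure modes up front.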
I decided to open-source the project so we can make it more robust; it already has a few contributors with merged PRs.
You can get started with 3 commands:

```
scherlok connect
scherlok investigate
scherlok watch
```
It works with dbt: it reads `target/manifest.json`, discovers every materialized model, auto-resolves the connection from `profiles.yml`, and profiles each model. CI integration is a single line.
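As a sketch of what that one-line CI integration could look like, here is a hypothetical GitHub Actions step. The exact command is an assumption on my part, not necessarily Scherlok's documented CI entry point:

```yaml
# Hypothetical CI step; the `scherlok investigate` invocation
# is assumed, not taken from Scherlok's docs.
- name: Data quality check
  run: pip install scherlok && scherlok investigate
```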
I would love feedback, and would be glad for help from folks facing data quality issues in scenarios I can't yet imagine.
Good first issues: https://github.com/rbmuller/scherlok/labels/good%20first%20i...
Repo: https://github.com/rbmuller/scherlok
PyPI: `pip install scherlok`

Thanks,
Robson