re_data is an open-source data reliability framework for the modern data stack. re_data detects potential problems in your data, such as anomalies or failing tests, and alerts you on Slack or email and in the re_data UI, so that you can react, investigate, and fix issues quickly. You can even set up more granular alerts for specific groups of people using the re_data_owners setting.
Add the re_data dbt package to your main dbt project. You need to update your packages.yml file with the re_data package like this:
packages.yml
packages:
  - package: re-data/re_data
    version: [">=0.10.0", "<0.11.0"]
For local testing, run dbt deps to install the package added to packages.yml. Then install the re_data Python package and run the re_data dbt models:
pip install re-data
dbt run --models package:re_data
Let's go over the two commands for generating and serving the UI. They work quite similarly to dbt docs: first you create the files by calling re_data overview generate, and then you serve the already existing files with re_data overview serve. For more details on the parameters accepted by these and other re_data commands, check the re_data CLI reference.
To run re_data, you need to configure which tables should be monitored and set up some properties of this monitoring. You may also want or need to update some of the default vars used by re_data, to run it for specific time windows or to compute the types of metrics you need.
re_data's dbt-native configuration follows the same rules as dbt configuration: the config block inside a model has the highest priority, and configuration in dbt_project.yml has the lowest priority.
<model_name>.sql
{{
    config(
        re_data_monitored=true,
        re_data_time_filter='creation_time',
        re_data_columns=['amount', 'status'],
        re_data_metrics_groups=['table_metrics', 'column_metrics'],
        re_data_metrics={'table': ['orders_above_100'], 'column': {'status': ['distinct_values']}},
        re_data_anomaly_detector={'name': 'modified_z_score', 'threshold': 3.0},
        re_data_owners=['datateam']
    )
}}
select ...
You can also add config parameters in a <model_name>.yml file:
<model_name>.yml
version: 2
models:
  - name: pending_orders
    config:
      re_data_monitored: true
      re_data_time_filter: created_at
      re_data_columns:
        - amount
        - status
      re_data_metrics_groups:
        - table_metrics
      re_data_metrics:
        table:
          - orders_above_100
        column:
          status:
            - distinct_values
      re_data_anomaly_detector:
        name: modified_z_score
        threshold: 3
If you need to add re_data parameters for all models, for example, it's better to do it in dbt_project.yml. You can also define re_data parameters for a single model there.
dbt_project.yml
models:
  toy_shop:
    revenue:
      +re_data_monitored: true
      +re_data_time_filter: created_at
      +re_data_anomaly_detector:
        name: modified_z_score
        threshold: 3
      +re_data_metrics_groups:
        - table_metrics
    orders_per_age:
      +re_data_metrics:
        table:
          - orders_above_100
sources:
  toy_shop:
    toy_shop_sources:
      toy_shop_customers:
        +re_data_monitored: true
        +re_data_time_filter: joined_at
seeds:
  toy_shop:
    order_items:
      +re_data_monitored: true
      +re_data_time_filter: added_at
      +re_data_anomaly_detector:
        name: z_score
        threshold: 3
      +re_data_columns:
        - name
        - amount
Set re_data_monitored to true to enable monitoring for a given table or set of tables.
re_data_time_filter is a SQL expression (for example, a column name) used to filter records of the table to a specific time range. It can be set to null if you wish to compute metrics on the whole table. This expression is compared to the re_data:time_window_start and re_data:time_window_end vars during the run (both are described below).
re_data_columns is the set of columns for which re_data should compute metrics. If not specified, re_data computes stats for all columns with either numeric or text types.
re_data_metrics_groups is the list of metrics groups to compute. You can use any re_data:metrics_groups defined in your vars here. If not specified, re_data computes the metrics defined by the re_data:default_metrics variable.
re_data_metrics are additional metrics to be computed for a given table (or set of tables). These can be either whole-table level or column level (check out the metrics section to learn the distinction between the two). You can pass any number of already defined or custom metrics to be computed. Check out the extra metrics section for the available metrics and the custom metrics section for ways to define your own metrics.
In a lot of cases, when you extend the metrics which are computed, we recommend creating a new re_data:metrics_groups entry in your vars, adding your metrics to it, and then setting re_data_metrics_groups to use it for a set of models. This approach is usually more flexible when adding new metrics for a given model.
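For example, a minimal sketch of that pattern might look like this; the group name orders_metrics is illustrative, while toy_shop, revenue, and orders_above_100 come from the examples in this guide:
dbt_project.yml
vars:
  re_data:metrics_groups:
    # defining this var replaces the default groups, so also keep table_metrics /
    # column_metrics here if other models still reference them (their full
    # definition is shown later in this guide)
    orders_metrics:            # hypothetical custom group
      table:
        - row_count
        - orders_above_100     # custom metric used in the examples above
models:
  toy_shop:
    revenue:
      +re_data_metrics_groups:
        - orders_metrics       # compute this group for the revenue model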
re_data_anomaly_detector specifies an alternative anomaly detector, with its parameters, to use when detecting anomalies in a given model or set of models. For details about configuration, look into Anomaly Detection.
re_data_owners is the group or single person which should receive an alert about problems with a given model.
Apart from model-specific config, re_data enables you to edit global configuration for some of the parameters. All of them are optional, so we start with sensible defaults and let you override them if there is a need.
Parameters of re_data configuration in dbt_project.yml:
vars:
  # (optional) if not passed, stats for the last day will be computed
  re_data:time_window_start: '{{ (run_started_at - modules.datetime.timedelta(1)).strftime("%Y-%m-%d 00:00:00") }}'
  re_data:time_window_end: '{{ run_started_at.strftime("%Y-%m-%d 00:00:00") }}'
  # (optional) configuring which models/sources to monitor
  re_data:select:
    - model_name1
    - model_name2
    - source_name1
  # (optional) tells how much history you want to consider when looking for anomalies
  re_data:anomaly_detection_look_back_days: 30
  # (optional) configuring storing tests history
  re_data:save_test_history: true
  # (optional) querying db for failing rows
  re_data:query_test_failures: true
  # (optional) limit the number of failed rows returned per test
  re_data:test_history_failures_limit: 10
  # (optional) configuring storing table samples
  re_data:store_table_samples: true
  # (optional) configuring owners
  re_data:owners_config:
    datateam:
      - type: slack
        identifier: U02FHBSXXXX
        name: user1
    backend:
      - type: email
        identifier: user1@getre.io
        name: user1
re_data:time_window_start and re_data:time_window_end control the time window. re_data metrics are time-based (re_data filters all your table data to a specific time window). In general, we advise setting up the time window so that all new data is monitored. It's also possible to compute metrics on overlapping data, for example the last 7 days. By default, re_data computes daily stats for the last day (it uses exactly the configuration from the example above for that).
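As a sketch, an overlapping 7-day window can be expressed with the same datetime helpers used in the default configuration above:
dbt_project.yml
vars:
  # look at the last 7 days on every run instead of just the last day
  re_data:time_window_start: '{{ (run_started_at - modules.datetime.timedelta(7)).strftime("%Y-%m-%d 00:00:00") }}'
  re_data:time_window_end: '{{ run_started_at.strftime("%Y-%m-%d 00:00:00") }}'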
re_data:select is a list which allows you to additionally restrict re_data to only compute metrics/anomalies for certain models. Each model listed here still needs to have re_data_monitored=true to be monitored. If the list is not passed, re_data computes stats for all re_data_monitored=true models.
re_data:metrics_groups defines the groups of metrics to compute. By default, table_metrics and column_metrics are defined, and this is their definition:
re_data:metrics_groups:
  table_metrics:
    table:
      - row_count
      - freshness
  column_metrics:
    column:
      numeric:
        - min
        - max
        - avg
        - stddev
        - variance
        - nulls_count
        - nulls_percent
      text:
        - min_length
        - max_length
        - avg_length
        - nulls_count
        - missing_count
        - nulls_percent
        - missing_percent
If you remove the table_metrics or column_metrics group, you will then not be able to use them in re_data_metrics_groups settings.
re_data:default_metrics lists the default metrics groups to compute for each model if no re_data_metrics_groups is specified. You can use any of the metrics groups defined in re_data:metrics_groups here. The default re_data configuration is as follows:
re_data:default_metrics:
  - table_metrics
  - column_metrics
re_data:anomaly_detection_look_back_days is the period which re_data considers when looking for anomalies (30 days by default).
re_data:save_test_history enables storing test history. See re_data tests history for more details.
re_data:query_test_failures configures whether re_data should query failed rows (true by default).
re_data:test_history_failures_limit configures how many failed rows to fetch per table (10 by default).
re_data:store_table_samples enables storing sample data of monitored tables.
re_data:owners_config configures owners for your data. See re_data notifications for more details.
re_data metrics are currently just expressions which are added to select statements run automatically by re_data. We recommend that most of your metrics are time-based: data is then filtered by the time filter (re_data_time_filter) specified in the table config. The time filter can be either a date column comparable to a timestamp or a SQL expression that is comparable to a timestamp in your data warehouse. In cases when time-based filtering is not possible, re_data can compute global metrics for a table. Global metrics don't filter by time and work on data from the whole table. You can pass re_data_time_filter: null in the table config to compute global metrics.
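As a small sketch, a table configured for global metrics could look like this (the model name dim_customers is just illustrative):
<model_name>.yml
version: 2
models:
  - name: dim_customers            # hypothetical model
    config:
      re_data_monitored: true
      re_data_time_filter: null    # no time filtering: metrics run on the whole table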
Default metrics are controlled by the re_data:default_metrics variable, which contains the list of metrics groups you would like to compute for all monitored tables. Table-level metrics, like the row_count metric, are computed for the whole table; your custom table-level metrics can use multiple columns when computing the value. The definition of all base metrics is available under the default metrics section, and the default configuration is shown below:
re_data:metrics_groups:
  table_metrics:
    table:
      - row_count
      - freshness
  column_metrics:
    column:
      numeric:
        - min
        - max
        - avg
        - stddev
        - variance
        - nulls_count
        - nulls_percent
      text:
        - min_length
        - max_length
        - avg_length
        - nulls_count
        - missing_count
        - nulls_percent
        - missing_percent
re_data:default_metrics:
  - table_metrics
  - column_metrics

re_data also provides extra metrics which are not computed by default for monitored tables. You can compute them by updating the configuration for a specific table or by adding them to the metrics groups which are computed by default (a sketch follows the links below).
More information about default metrics https://docs.getre.io/latest/docs/re_data/reference/metrics/base_metrics
More information about custom metrics https://docs.getre.io/latest/docs/re_data/reference/metrics/your_own_metric
More information about extra metrics https://docs.getre.io/latest/docs/re_data/reference/metrics/extra_metrics
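As a sketch of the per-table approach mentioned above, an extra metric can be requested through re_data_metrics; distinct_values and pending_orders come from the earlier examples in this guide:
<model_name>.yml
version: 2
models:
  - name: pending_orders
    config:
      re_data_monitored: true
      re_data_metrics:
        column:
          status:
            - distinct_values    # extra metric, not part of the default groups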
re_data supports three types of anomaly detection for monitoring data. There is no one-size-fits-all choice for anomaly detection; it usually depends on the nature of your dataset, so it's worth exploring the methods available. See the anomaly detection reference below for guidance on choosing a method.
More information about types of anomaly detection for monitoring https://docs.getre.io/latest/docs/re_data/reference/anomaly_detection
You can set up anomaly detection by configuring it to monitor specific metrics or KPIs in your data warehouse. This typically involves defining the metrics you want to monitor and specifying the thresholds for detecting anomalies.
You can configure anomaly detection globally in dbt_project.yml:
vars:
  re_data:anomaly_detector:
    name: z_score
    threshold: 3
Or configure it per model in <model_name>.sql:
{{
    config(
        re_data_monitored=true,
        re_data_anomaly_detector={'name': 'z_score', 'threshold': 3}
    )
}}
The next step is to configure alerts or notifications to be triggered whenever an anomaly is detected. This could involve sending emails or Slack messages.
re_data can store dbt test history in your data warehouse and visualize its details in the re_data UI. This lets you see the history of all dbt test runs, and you can filter by table, time, etc. Test history is configured in dbt_project.yml:
vars:
  re_data:save_test_history: true
re_data provides notification functionality to alert users about data quality issues or anomalies detected during monitoring. Notifications can be configured to inform relevant stakeholders via channels such as email and Slack. Setting re_data owners is optional; re_data notifications work without any owners set up.
The mapping of re_data model owners to their identifiers is defined in the re_data:owners_config block in the dbt_project.yml file. Here we can define an individual user or a group of users (a team) with their respective identifiers. Each owner definition consists of a type (for example slack or email), an identifier, and a name. An example configuration is shown below:
dbt_project.yml
vars:
  re_data:owners_config:
    user1:
      - type: slack
        identifier: U02FHBSXXXX
        name: user1
    backend:
      - type: email
        identifier: user1@getre.io
        name: user1
You can then reference the owners in a model file:
models/orders.sql
{{
    config(
        re_data_anomaly_detector={'name': 'z_score', 'threshold': 2.0},
        re_data_owners=['backend']
    )
}}