dbt can be used to validate the accuracy, freshness, and reliability of your data and data models, ideally identifying any issues before they cause downstream impacts on analytics and decision-making.
dbt tests can be categorized into two primary types: generic and singular. Generic tests are pre-defined and can be readily applied to multiple data models. Singular tests, on the other hand, are tailored to the needs of a specific data model.
In dbt, these two types of tests frequently complement each other. You can effortlessly employ generic tests across all your models to conduct broad data quality assessments, while singular tests can be developed to enforce precise business rules for individual models or fields.
Generic tests are put in place by calling a macro. They are defined in a .yaml file within the folder where the model lives.
dbt provides four out-of-the-box generic tests: unique, not_null, accepted_values, and relationships.
With a uniqueness test, you’re testing that all rows for a specific column are unique, keeping your data, your business users, and your spine happy.
You’ll most commonly test primary and surrogate keys for uniqueness in your data warehouse tables. Using some manual SQL, you’d probably write a little query like this every time you make an update to a table:
select
    id
from your_table
group by id
having count(*) > 1
Any non-unique ids will be returned by this query.
As much fun as it is to rewrite and execute this query any time there's a change to your tables, there's a much faster and easier way to sniff out duplicates with dbt: the unique test. You simply add a unique test to any column you want in your dbt models:
models:
  - name: orders
    columns:
      - name: id
        tests:
          - unique
In the example above, the id field will be checked for uniqueness every time you run your tests against your orders table in dbt. Behind the scenes, dbt compiles a test very similar to the original SQL above and runs it against your data warehouse.
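As a rough sketch of what that compiled query looks like for the orders model above (simplified; the exact SQL and fully qualified table name vary by dbt version, adapter, and schema):

```sql
-- Roughly what dbt compiles for a unique test on orders.id
-- (simplified sketch, not the exact compiled output)
select
    id as unique_field,
    count(*) as n_records
from orders
group by id
having count(*) > 1
```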
Non-null tests check that specific column values are never null. Typical candidates are primary keys, foreign keys, and other required fields.
Similar to the unique test, you can define a not-null test in dbt with a single line of YAML:
models:
  - name: orders
    columns:
      - name: id
        tests:
          - unique
          - not_null
Now, the id field is checked for both uniqueness and non-nullness, as any primary key should be. This testing structure is particularly useful considering some modern data warehouses recognize primary keys but don't enforce them.
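The compiled not_null test is even simpler; a sketch of roughly what dbt runs (simplified, with the relation name depending on your schema):

```sql
-- Roughly what dbt compiles for a not_null test on orders.id
select id
from orders
where id is null
```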
Data teams will transform raw data into meaningful business entities by cleaning, joining, and aggregating the data. During the transformation process, as dimensions and measures are updated and created, it’s important to check that column values meet your expectations.
For example, if you run an ecommerce shop, you would expect the following to be true about your data:
- Every order_status is one of placed, shipped, delivered, or returned
- Every order's currency is USD
In a more ad hoc way, you could write some relatively simple queries…
select
    order_id,
    order_status
from orders
where order_status not in ('placed', 'shipped', 'delivered', 'returned')
select
    order_id,
    currency
from orders
where currency != 'USD'
But writing these ad hoc queries over and over again doesn’t make sense as your data grows and changes over time.
Using dbt tests such as accepted_values, not_accepted_values, and accepted_range (the first ships with dbt; the latter two come from the dbt_utils package), you can easily define expectations for column values within a version-controlled YAML file, and have them raise errors when your data deviates from them.
The ad hoc queries from above would simply be replaced with a few lines of YAML:
models:
  - name: orders
    columns:
      - name: order_status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'delivered', 'returned']
      - name: currency
        tests:
          - accepted_values:
              values: ['USD']
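A sketch of how the dbt_utils variants might be applied, assuming dbt_utils is installed via packages.yml (the disallowed status and the order_total column are hypothetical):

```yaml
models:
  - name: orders
    columns:
      - name: order_status
        tests:
          - dbt_utils.not_accepted_values:
              values: ['cancelled']  # hypothetical disallowed status
      - name: order_total             # hypothetical column
        tests:
          - dbt_utils.accepted_range:
              min_value: 0            # totals should never be negative
```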
As you transform your raw data into meaningful models, your data will likely change format with new dimensions, measures, and joins. Each step of the transformation process creates room for the data to deviate from your expectations, which is why it’s important to test for relationships in your data models.
dbt is built on the foundation that data transformations should be modular, referencing each other to ultimately form a DAG—a lineage of your data. In dbt, whenever you make a new data transformation, what we call models, you can create a relationship test, which will check if a specified column in your new model has a valid mapping to an upstream model you specify.
The relationships test is particularly useful if you're joining multiple models together, potentially introducing new rows and changing the relationship between upstream and downstream models.
This test validates that all of the records in a child table have a corresponding record in a parent table. This property is referred to as "referential integrity".
The following example tests that every order's customer_id maps back to a valid customer.
schema.yml
version: 2

models:
  - name: orders
    columns:
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')
              field: id
The to argument accepts a Relation – this means you can pass it a ref to a model (e.g. ref('customers')), or a source (e.g. source('jaffle_shop', 'customers')).
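For example, if customers is loaded as a raw source table rather than modeled in dbt, the same test could point at the source instead (the jaffle_shop source name here is illustrative):

```yaml
models:
  - name: orders
    columns:
      - name: customer_id
        tests:
          - relationships:
              to: source('jaffle_shop', 'customers')
              field: id
```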
dbt ships with Not Null, Unique, Relationships, and Accepted Values generic data tests. Under the hood, these generic data tests are defined as test blocks (like macros).
Generic tests are defined in SQL files. Those files can live in two places:
- tests/generic/: that is, a special subfolder named generic within your test paths (tests/ by default)
- macros/: generic tests work a lot like macros, and historically this was the only place they could be defined. If your generic test depends on complex macro logic, you may find it more convenient to define the macros and the generic test in the same file.

To define your own generic test, simply create a test block called <test_name>. All generic tests should accept one or both of the standard arguments:

- model: the resource on which the test is defined, templated out to its relation name. (Note that the argument is always named model, even when the resource is a source, seed, or snapshot.)
- column_name: the column on which the test is defined. Not all generic tests operate at the column level, but if they do, they should accept column_name as an argument.

Here's an example of an is_even schema test that uses both arguments:
tests/generic/test_is_even.sql
{% test is_even(model, column_name) %}

with validation as (
    select
        {{ column_name }} as even_field
    from {{ model }}
),

validation_errors as (
    select
        even_field
    from validation
    -- if this is true, then even_field is actually odd!
    where (even_field % 2) = 1
)

select *
from validation_errors

{% endtest %}
If this select statement returns zero records, then every record in the supplied model argument is even! If a nonzero number of records is returned instead, then at least one record in model is odd, and the test has failed.
To use this generic test, specify it by name in the tests property of a model, source, snapshot, or seed:
models/<filename>.yml
version: 2

models:
  - name: users
    columns:
      - name: favorite_number
        tests:
          - is_even
In this example, users will be passed to the is_even test as the model argument, and favorite_number will be passed in as the column_name argument. You could add the same line for other columns or other models; each will add a new test to your project, using the same generic test definition.
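For instance, the same generic test definition can be reused across columns and models with no extra SQL (the lucky_number column and dim_products model below are hypothetical):

```yaml
version: 2

models:
  - name: users
    columns:
      - name: favorite_number
        tests:
          - is_even
      - name: lucky_number   # hypothetical column
        tests:
          - is_even
  - name: dim_products       # hypothetical model
    columns:
      - name: units_per_box
        tests:
          - is_even
```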
The is_even test works without needing to specify any additional arguments. Other tests, like relationships, require more than just model and column_name. If your custom test requires more than the standard arguments, include those arguments in the test signature, as field and to are included below:
tests/generic/test_relationships.sql
{% test relationships(model, column_name, field, to) %}

with parent as (
    select
        {{ field }} as id
    from {{ to }}
),

child as (
    select
        {{ column_name }} as id
    from {{ model }}
)

select *
from child
where id is not null
  and id not in (select id from parent)

{% endtest %}
When calling this test from a .yml file, supply the arguments to the test in a dictionary. Note that the standard arguments (model and column_name) are provided by the context, so you do not need to define them again.
models/<filename>.yml
version: 2

models:
  - name: people
    columns:
      - name: account_id
        tests:
          - relationships:
              to: ref('accounts')
              field: id
It is possible to include a config() block in a generic test definition. Values set there act as defaults for all instances of that generic test, unless overridden within a specific instance's .yml properties.
tests/generic/warn_if_odd.sql
{% test warn_if_odd(model, column_name) %}
{{ config(severity = 'warn') }}
select *
from {{ model }}
where ({{ column_name }} % 2) = 1
{% endtest %}
Any time the warn_if_odd test is used, it will have warning-level severity, unless the specific test instance overrides that value:
models/<filename>.yml
version: 2

models:
  - name: users
    columns:
      - name: favorite_number
        tests:
          - warn_if_odd # default 'warn'
      - name: other_number
        tests:
          - warn_if_odd:
              severity: error # overrides
To change the way a built-in generic test works (whether to add additional parameters, rewrite the SQL, or for any other reason), simply add a test block named <test_name> to your own project. dbt will favor your version over the global implementation!
tests/generic/<filename>.sql
{% test unique(model, column_name) %}
-- whatever SQL you'd like!
{% endtest %}

Singular Tests
In situations where you have to create your own custom tests, singular tests come into the picture. Singular tests are SQL definitions that reference a particular model. They are expected to return an empty result set if the test passes. Singular tests give users the freedom to write tests for their custom use cases.
Singular tests allow you to create tests when the default generic ones (or the ones in the packages discussed below) do not meet your needs. These tests are simple SQL queries that express assertions about your data. An example would be a more complex assertion, such as checking that sales for one product are within +/- 10% of sales for another product. The SQL simply needs to return the rows that do not meet this condition.
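As a sketch of that +/- 10% assertion (the fct_product_sales model and the sale_date, product_id, and amount columns are assumptions; adapt them to your project), a singular test might look like:

```sql
-- tests/assert_product_a_sales_close_to_product_b.sql
-- Fails (returns rows) on days when sales of product A deviate from
-- product B's sales by more than 10%.
with daily_sales as (
    select
        sale_date,
        sum(case when product_id = 'A' then amount else 0 end) as a_sales,
        sum(case when product_id = 'B' then amount else 0 end) as b_sales
    from {{ ref('fct_product_sales') }}
    group by sale_date
)

select *
from daily_sales
where a_sales not between b_sales * 0.9 and b_sales * 1.1
```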
Main steps for conducting singular tests in dbt:

- Singular tests are defined in .sql files, typically in your tests directory.
- You can use Jinja (including ref and source) in the test definition, just like you can when creating models.
- Each .sql file contains one select statement, and it defines one test.
- Tests can be placed in the model folder as well.

The workflow for tests is similar to developing models, but is less involved.
tests/assert_total_payment_amount_is_positive.sql
-- Refunds have a negative amount, so the total amount should always be >= 0.
-- Therefore return records where total_amount < 0 to make the test fail.
select
    order_id,
    sum(amount) as total_amount
from {{ ref('fct_payments') }}
group by 1
having total_amount < 0
1. Prepare SQL code. When creating a new test, it’s often best to use a SQL editor to create the initial query and work through any typical SQL issues there. Remember, we're looking for the rows that fail the test. If the test returns rows, then the test is marked as failed.
2. Place the query into the appropriate file, making sure to name the test uniquely. Singular tests are defined in .sql files and saved in your tests directory; that directory is set by the test-paths config in dbt_project.yml.
3. To speed development, you can use the dbt test --select test_name command to run only that specific test. With a large dataset and many tests, the time required to run them all can increase greatly; the --select option cuts this down noticeably.
4. Finally, check any errors in your test and update accordingly.
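Step 2 above mentions the test-paths config; a minimal dbt_project.yml entry might look like this (tests is the default, shown here for illustration):

```yaml
# dbt_project.yml
test-paths: ["tests"]
```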
Normally, a data test query will calculate failures as part of its execution. If you set the optional --store-failures flag, the store_failures config, or the store_failures_as config, dbt will first save the results of the test query to a table in the database, and then query that table to calculate the number of failures.
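A sketch of enabling store_failures for one test instance in YAML (it can also be set project-wide in dbt_project.yml):

```yaml
models:
  - name: orders
    columns:
      - name: id
        tests:
          - unique:
              config:
                store_failures: true
```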
More information is available in the documentation: https://docs.getdbt.com/reference/resource-configs/store_failures