The table sampling process begins when a merge request (MR) is created in Git, which triggers the CI/CD pipeline.
The pipeline runs a macro, create_samples_for_dataset, that creates a sample dataset based on the production BigQuery tables. Instead of replicating full datasets, a sampled version is created to optimize resources and processing time during testing. The sampling percentage is controlled by a parameter (percent_sample), which defaults to "percent_sample": 1 and is set in the CI/CD configuration.
Purpose:
The create_samples_for_dataset macro determines which tables should be sampled based on conditions and the environment. Sample datasets are created only in test-related environments (target equal to e2e, test, or e2e_test); in all other environments, the macro skips execution.
It processes tables listed in the source.yml file, handles different scenarios (e.g., skipping machine learning or partitioned tables), and invokes another macro (set_data_tablesample) to create the sampled tables.
The sampled tables are stored in a new BigQuery dataset (target.dataset + "_" + source_dataset).
The tables are created using SQL TABLESAMPLE syntax, tailored to the database engine (e.g., BigQuery, Snowflake, PostgreSQL).
Input:
A sampling percentage (percent_sample).
Process:
1. Iterate over tables:
{% set source_nodes = graph.sources %}
{% for node_key, node_value in source_nodes.items() %}
    {% set source_dataset = node_key.split('.')[-2] %}
    {% set sample_dataset = target.dataset + "_" + source_dataset %}
    {% set sample_database = target.database %}
    {% set table_name = node_value.get('identifier', node_value.get('name')) %}
    {% set skipped_table_name = node_value.get('loader', '').lower() %}
    ...
{% endfor %}
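For illustration, the name derivation in step 1 can be sketched in Python. This is a hedged sketch, not the macro's code: the node-key format "source.&lt;project&gt;.&lt;source_name&gt;.&lt;table&gt;" is an assumption based on dbt's graph node keys, and derive_sample_names is a hypothetical helper.

```python
# Sketch of step 1's name derivation, assuming dbt node keys of the
# form "source.<project>.<source_name>.<table>" (an assumption).
def derive_sample_names(node_key: str, target_dataset: str) -> tuple:
    source_dataset = node_key.split('.')[-2]            # second-to-last segment
    sample_dataset = target_dataset + "_" + source_dataset
    return source_dataset, sample_dataset

print(derive_sample_names("source.my_project.raw_events.clicks", "e2e"))
# → ('raw_events', 'e2e_raw_events')
```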
2. Dataset creation:
Creates a new BigQuery dataset named target.dataset + "_" + source_dataset by calling the macro:
{{ create_dataset_bigquery(sample_database, sample_dataset) }}
3. Conditional logic:
Skips certain tables: machine learning tables (loader equal to ml) and partitioned/wildcard tables (_* in the table name).

{% if skipped_table_name == 'ml' %}
    {{ log("Skipping sample table creation for ml table " ~ sample_dataset ~ "." ~ table_name,
    info=True) }}
{% elif '_*' in table_name %}
    {{ log("Skipping sample table creation for partitioned table " ~ sample_dataset ~ "." ~ table_name,
    info=True) }}
{% endif %}
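The skip conditions in step 3 can be summarized as a small predicate. This is an illustrative sketch only; should_skip is a hypothetical helper, not part of the macro.

```python
# Hedged sketch of the step-3 skip conditions: tables loaded by "ml"
# and wildcard/partitioned tables ("_*" in the name) are skipped.
def should_skip(table_name: str, loader: str) -> bool:
    if loader.lower() == 'ml':
        return True                  # machine-learning table
    if '_*' in table_name:
        return True                  # partitioned/wildcard table
    return False

print(should_skip("events_*", "airflow"))   # → True
print(should_skip("clicks", "ml"))          # → True
print(should_skip("clicks", "airflow"))     # → False
```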
4. Sample table creation:
{{ set_data_tablesample(sample_database, source_dataset, sample_dataset, table_name,
percent_sample) }}
5. Non-test environments:
In any environment other than e2e, test, or e2e_test, the macro skips execution and no sample datasets are created.
Macro: create_dataset_bigquery
Purpose:
Creates a new BigQuery dataset if it doesn’t already exist.
Input:
A BigQuery project (project_id) and a dataset name (dataset_name).
Process:
Builds a CREATE SCHEMA IF NOT EXISTS SQL statement and executes it:

{% set query %}
    CREATE SCHEMA IF NOT EXISTS `{{ project_id }}.{{ dataset_name }}`;
{% endset %}
{{ log("Creating dataset in BigQuery: " ~ dataset_name, info=True) }}
{{ run_query(query) }}
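For reference, the statement this macro renders looks as follows. The Python helper and the project/dataset names are illustrative only.

```python
# Illustration of the SQL rendered by create_dataset_bigquery
# (render_create_schema and the names below are hypothetical).
def render_create_schema(project_id: str, dataset_name: str) -> str:
    return f"CREATE SCHEMA IF NOT EXISTS `{project_id}.{dataset_name}`;"

print(render_create_schema("my-gcp-project", "e2e_raw_events"))
# → CREATE SCHEMA IF NOT EXISTS `my-gcp-project.e2e_raw_events`;
```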
Macro: set_data_tablesample
Purpose:
This macro generates and executes SQL to create the sampled tables:
Input:
A database (sample_database), the source and sample dataset names (source_dataset, sample_dataset), a table name (table_name), and the sampling percentage (percent_sample).
Process:
The macro detects the target engine from target.type and, for BigQuery, builds a CREATE OR REPLACE TABLE statement using TABLESAMPLE SYSTEM:

{% set engine = target.type -%}
{% if engine == 'bigquery' -%}
{% set query %}
CREATE OR REPLACE TABLE `{{ project_id }}.{{ sample_dataset }}.{{ table_name }}` AS (
SELECT *
FROM `{{ project_id }}.{{ source_dataset }}.{{ table_name }}`
TABLESAMPLE SYSTEM ({{ percent_sample }} PERCENT));
{% endset %}
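The rendered BigQuery statement can be previewed outside of dbt with a simple string template. This is an illustration; render_sample_table and the names passed to it are hypothetical.

```python
# Illustration of the statement the BigQuery branch renders
# (helper name and example identifiers are made up).
def render_sample_table(project_id, source_dataset, sample_dataset,
                        table_name, percent_sample):
    return (
        f"CREATE OR REPLACE TABLE `{project_id}.{sample_dataset}.{table_name}` AS (\n"
        f"  SELECT *\n"
        f"  FROM `{project_id}.{source_dataset}.{table_name}`\n"
        f"  TABLESAMPLE SYSTEM ({percent_sample} PERCENT));"
    )

print(render_sample_table("my-gcp-project", "raw_events",
                          "e2e_raw_events", "clicks", 1))
```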
The sampling clause depends on the engine:
BigQuery uses TABLESAMPLE SYSTEM.
PostgreSQL uses RANDOM() for sampling.
Snowflake uses SAMPLE.
Other engines fall back to TABLESAMPLE.
Finally, the generated query is executed with the run_query function.
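The per-engine dispatch above can be sketched as a small lookup. This is a hedged illustration of the dispatch idea under the clauses listed above, not the macro's actual code; the exact clause syntax per engine is an assumption.

```python
# Sketch of the engine dispatch: map target.type to a sampling clause.
# Clause shapes per engine are assumptions based on the description above.
def sampling_clause(engine: str, percent: float) -> str:
    if engine == 'bigquery':
        return f"TABLESAMPLE SYSTEM ({percent} PERCENT)"
    if engine == 'postgres':
        return f"WHERE RANDOM() < {percent / 100}"   # row-level random filter
    if engine == 'snowflake':
        return f"SAMPLE ({percent})"
    return f"TABLESAMPLE ({percent} PERCENT)"        # generic fallback

print(sampling_clause('bigquery', 1))   # → TABLESAMPLE SYSTEM (1 PERCENT)
```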