dbt Model Engineer GPT: The Ultimate Time-Saver for dbt Users#
Hey fellow data people! I want to share something I’ve been working on that’s seriously streamlined my workflow. It’s a GPT I created called the dbt Model Engineer, and it’s designed to automate the conversion of JSON, Excel, or CSV files into dbt models, specifically for Snowflake databases.
What does it do and how does it work?#
We’ve all spent three hours or more converting data formats into usable dbt models. It’s necessary but monotonous work that eats up time we could be using on more complex data problems. That’s where this GPT comes in. It handles the “monkey work,” churning out SQL models and documentation with ease.
The dbt Model Engineer GPT is a GPT-4-powered AI that converts JSON, Excel, or CSV files into dbt models. It’s designed to work with Snowflake databases, but it can be adapted to work with other databases as well. Let’s take a look at how it works.
For this example, I took a sample API response from the Shopify documentation. It shows customer information and looks like this:
{
"customers": [
{
"id": 207119551,
"email": "bob.norman@hostmail.com",
"accepts_marketing": false,
"created_at": "2021-02-12T13:48:32-05:00",
"updated_at": "2021-02-12T13:48:32-05:00",
"first_name": "Bob",
"last_name": "Norman",
"orders_count": 1,
"state": "disabled",
"total_spent": "199.65",
"last_order_id": 450789469,
"note": null,
"verified_email": true,
"multipass_identifier": null,
"tax_exempt": false,
"phone": "+16136120707",
"tags": "",
"last_order_name": "#1001",
"currency": "USD",
"addresses": [
{
"id": 207119551,
"customer_id": 207119551,
"first_name": null,
"last_name": null,
"company": null,
"address1": "Chestnut Street 92",
"address2": "",
"city": "Louisville",
"province": "Kentucky",
"country": "United States",
"zip": "40202",
"phone": "+1(800)-555-2181",
"name": "",
"province_code": "KY",
"country_code": "US",
"country_name": "United States",
"default": true
}
],
},
],
"accepts_marketing_updated_at": "2005-06-12T11:57:11-04:00",
"marketing_opt_in_level": null,
"tax_exemptions": [],
"admin_graphql_api_id": "gid://shopify/Customer/207119551",
"default_address": {
"id": 207119551,
"customer_id": 207119551,
"first_name": null,
"last_name": null,
"company": null,
"address1": "Chestnut Street 92",
"address2": "",
"city": "Louisville",
"province": "Kentucky",
"country": "United States",
"zip": "40202",
"phone": "+1(800)-555-2181",
"name": "",
"province_code": "KY",
"country_code": "US",
"country_name": "United States",
"default": true
}
}
I then fed this JSON into the GPT, without giving any context or instructions. The GPT then generated the following files:
- SQL model for the Customers Table
- dbt doc YAML for Customers
- SQL model for the Addresses Table
- dbt doc YAML for Addresses
Let’s take a look at the SQL model for the Customers table:
-- models/customers.sql
WITH source_data AS (
SELECT
id::INTEGER,
email::VARCHAR,
accepts_marketing::BOOLEAN,
TO_TIMESTAMP(created_at, 'YYYY-MM-DD HH24:MI:SS-FF') AS created_at,
TO_TIMESTAMP(updated_at, 'YYYY-MM-DD HH24:MI:SS-FF') AS updated_at,
first_name::VARCHAR,
last_name::VARCHAR,
orders_count::INTEGER,
state::VARCHAR,
total_spent::FLOAT,
last_order_id::INTEGER,
note::VARCHAR,
verified_email::BOOLEAN,
multipass_identifier::VARCHAR,
tax_exempt::BOOLEAN,
phone::VARCHAR,
tags::VARCHAR,
last_order_name::VARCHAR,
currency::VARCHAR
FROM {{ source('your_source_name', 'your_raw_table_name') }}
)
SELECT * FROM source_data;
As you can see, the GPT automatically extracted the column names and data types from the JSON. It also converted the date and timestamp fields into the proper format.
The GPT also generated the following dbt doc YAML for the Customers table:
# models/customers.yml
version: 2
models:
- name: customers
description: "Table for customer data."
columns:
- name: id
description: "The customer's unique identifier."
tests:
- unique
- not_null
- name: email
description: "The customer's email address."
- name: accepts_marketing
description: "Indicates if the customer accepts marketing."
- name: created_at
description: "Timestamp of when the customer was created."
- name: updated_at
description: "Timestamp of the last update to the customer's profile."
- name: first_name
description: "The customer's first name."
- name: last_name
description: "The customer's last name."
- name: orders_count
description: "The count of orders made by the customer."
- name: state
description: "The state of the customer."
- name: total_spent
description: "The total amount spent by the customer."
- name: last_order_id
description: "The ID of the last order made by the customer."
- name: note
description: "Additional notes about the customer."
- name: verified_email
description: "Indicates if the customer's email is verified."
- name: multipass_identifier
description: "Multipass identifier for the customer."
- name: tax_exempt
description: "Indicates if the customer is exempt from taxes."
- name: phone
description: "The customer's phone number."
- name: tags
description: "Tags associated with the customer."
- name: last_order_name
description: "The name of the last order made by the customer."
- name: currency
description: "The currency used by the customer."
If the GPT recognizes the data source, it is advised to check the API documentation for field descriptions. If the GPT does not recognize the data source, it will generate a generic description for each field. In this case, the GPT recognized the data source as Shopify, so it was able to generate the descriptions from the Shopify API documentation. Additionally, it also included a test for the id field which should be not_null
and unique
.
It also generated the SQL model and dbt doc YAML for the Addresses table:
-- models/addresses.sql
WITH source_data AS (
SELECT
id::INTEGER,
customer_id::INTEGER,
first_name::VARCHAR,
last_name::VARCHAR,
company::VARCHAR,
address1::VARCHAR,
address2::VARCHAR,
city::VARCHAR,
province::VARCHAR,
country::VARCHAR,
zip::VARCHAR,
phone::VARCHAR,
name::VARCHAR,
province_code::VARCHAR,
country_code::VARCHAR,
country_name::VARCHAR,
default::BOOLEAN
FROM {{ source('your_source_name', 'your_raw_table_name') }},
LATERAL FLATTEN(input => addresses)
)
SELECT * FROM source_data;
# models/addresses.yml
version: 2
models:
- name: addresses
description: "Table for customer addresses."
columns:
- name: id
description: "Unique identifier for the address."
tests:
- unique
- not_null
- name: customer_id
description: "Associated customer's ID."
tests:
- not_null
- name: first_name
description: "First name associated with the address."
- name: last_name
description: "Last name associated with the address."
- name: company
description: "Company associated with the address."
- name: address1
description: "Primary address line."
- name: address2
description: "Secondary address line."
- name: city
description: "City of the address."
- name: province
description: "Province of the address."
- name: country
description: "Country of the address."
- name: zip
description: "ZIP or postal code of the address."
- name: phone
description: "Phone number associated with the address."
- name: name
description: "Full name associated with the address."
- name: province_code
description: "Province code of the address."
- name: country_code
description: "Country code of the address."
- name: country_name
description: "Full country name of the address."
- name: default
description: "Indicates if it is the default address."
Bye bye, monkey work!#
While this GPT is useful and can save you a lot of time, it’s not perfect. It goes without saying that you should always review the generated code and documentation before using it in production. However, it can save you a lot of time and effort, especially if you’re working with large datasets.
Can I try it out?#
Well, yes and no. OpenAI just announced their GPT store and I plan to offer this GPT for free on the store. However, the GPT store is not yet open to the public. I will update this post when it is available.
EDIT: The GPT store is now open to the public! You can find the dbt Model Engineer GPT here: https://chat.openai.com/g/g-lBNhN3Ioj-dbt-model-engineer