dbt Model Engineer GPT: The Ultimate Time-Saver for dbt Users

Falk Zeh, Data Engineer & Humanoid Robotics Student

Hey fellow data people! I want to share something I’ve been working on that’s seriously streamlined my workflow. It’s a GPT I created called the dbt Model Engineer, and it’s designed to automate the conversion of JSON, Excel, or CSV files into dbt models, specifically for Snowflake databases.

dbt Model Engineer GPT

What does it do and how does it work?

We’ve all spent hours converting data formats into usable dbt models. It’s necessary but monotonous work that eats up time we could be spending on more complex data problems. That’s where this GPT comes in. It handles the “monkey work,” churning out SQL models and documentation with ease.

The dbt Model Engineer GPT is a GPT-4-powered AI that converts JSON, Excel, or CSV files into dbt models. It’s designed to work with Snowflake databases, but it can be adapted to work with other databases as well. Let’s take a look at how it works.

For this example, I took a sample API response from the Shopify documentation. It shows customer information and looks like this:

{
  "customers": [
    {
      "id": 207119551,
      "email": "bob.norman@hostmail.com",
      "accepts_marketing": false,
      "created_at": "2021-02-12T13:48:32-05:00",
      "updated_at": "2021-02-12T13:48:32-05:00",
      "first_name": "Bob",
      "last_name": "Norman",
      "orders_count": 1,
      "state": "disabled",
      "total_spent": "199.65",
      "last_order_id": 450789469,
      "note": null,
      "verified_email": true,
      "multipass_identifier": null,
      "tax_exempt": false,
      "phone": "+16136120707",
      "tags": "",
      "last_order_name": "#1001",
      "currency": "USD",
      "addresses": [
        {
          "id": 207119551,
          "customer_id": 207119551,
          "first_name": null,
          "last_name": null,
          "company": null,
          "address1": "Chestnut Street 92",
          "address2": "",
          "city": "Louisville",
          "province": "Kentucky",
          "country": "United States",
          "zip": "40202",
          "phone": "+1(800)-555-2181",
          "name": "",
          "province_code": "KY",
          "country_code": "US",
          "country_name": "United States",
          "default": true
        }
      ],
      "accepts_marketing_updated_at": "2005-06-12T11:57:11-04:00",
      "marketing_opt_in_level": null,
      "tax_exemptions": [],
      "admin_graphql_api_id": "gid://shopify/Customer/207119551",
      "default_address": {
        "id": 207119551,
        "customer_id": 207119551,
        "first_name": null,
        "last_name": null,
        "company": null,
        "address1": "Chestnut Street 92",
        "address2": "",
        "city": "Louisville",
        "province": "Kentucky",
        "country": "United States",
        "zip": "40202",
        "phone": "+1(800)-555-2181",
        "name": "",
        "province_code": "KY",
        "country_code": "US",
        "country_name": "United States",
        "default": true
      }
    }
  ]
}

I then fed this JSON into the GPT without any further context or instructions, and it generated the following files:

  • SQL model for the Customers Table
  • dbt doc YAML for Customers
  • SQL model for the Addresses Table
  • dbt doc YAML for Addresses

Let’s take a look at the SQL model for the Customers table:

-- models/customers.sql

WITH source_data AS (
    SELECT
        id::INTEGER AS id,
        email::VARCHAR AS email,
        accepts_marketing::BOOLEAN AS accepts_marketing,
        -- ISO 8601 strings like '2021-02-12T13:48:32-05:00' parse under Snowflake's AUTO format
        TO_TIMESTAMP_TZ(created_at) AS created_at,
        TO_TIMESTAMP_TZ(updated_at) AS updated_at,
        first_name::VARCHAR AS first_name,
        last_name::VARCHAR AS last_name,
        orders_count::INTEGER AS orders_count,
        state::VARCHAR AS state,
        total_spent::FLOAT AS total_spent,
        last_order_id::INTEGER AS last_order_id,
        note::VARCHAR AS note,
        verified_email::BOOLEAN AS verified_email,
        multipass_identifier::VARCHAR AS multipass_identifier,
        tax_exempt::BOOLEAN AS tax_exempt,
        phone::VARCHAR AS phone,
        tags::VARCHAR AS tags,
        last_order_name::VARCHAR AS last_order_name,
        currency::VARCHAR AS currency
    FROM {{ source('your_source_name', 'your_raw_table_name') }}
)

-- dbt models must not end with a semicolon
SELECT * FROM source_data

As you can see, the GPT automatically extracted the column names and data types from the JSON. It also cast the ISO 8601 timestamp strings into proper timestamp columns.
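
If you want to sanity-check the timestamp handling against the sample values, here is a quick standalone probe (not part of the GPT's output):

SELECT TO_TIMESTAMP_TZ('2021-02-12T13:48:32-05:00') AS created_at_parsed;
-- Snowflake's AUTO format recognizes ISO 8601 strings with a UTC offset.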

The GPT also generated the following dbt doc YAML for the Customers table:

# models/customers.yml

version: 2

models:
  - name: customers
    description: "Table for customer data."
    columns:
      - name: id
        description: "The customer's unique identifier."
        tests:
          - unique
          - not_null
      - name: email
        description: "The customer's email address."
      - name: accepts_marketing
        description: "Indicates if the customer accepts marketing."
      - name: created_at
        description: "Timestamp of when the customer was created."
      - name: updated_at
        description: "Timestamp of the last update to the customer's profile."
      - name: first_name
        description: "The customer's first name."
      - name: last_name
        description: "The customer's last name."
      - name: orders_count
        description: "The count of orders made by the customer."
      - name: state
        description: "The state of the customer."
      - name: total_spent
        description: "The total amount spent by the customer."
      - name: last_order_id
        description: "The ID of the last order made by the customer."
      - name: note
        description: "Additional notes about the customer."
      - name: verified_email
        description: "Indicates if the customer's email is verified."
      - name: multipass_identifier
        description: "Multipass identifier for the customer."
      - name: tax_exempt
        description: "Indicates if the customer is exempt from taxes."
      - name: phone
        description: "The customer's phone number."
      - name: tags
        description: "Tags associated with the customer."
      - name: last_order_name
        description: "The name of the last order made by the customer."
      - name: currency
        description: "The currency used by the customer."

The GPT is instructed to consult the provider’s API documentation for field descriptions whenever it recognizes the data source; if it doesn’t recognize the source, it falls back to a generic description for each field. In this case it recognized the data as Shopify customer data, so the descriptions come straight from the Shopify API documentation. It also added unique and not_null tests for the id field.
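
With both files dropped into your project’s models/ directory, building the model and running its generated tests is plain dbt (assuming a configured dbt project and Snowflake profile):

dbt run --select customers
dbt test --select customers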

It also generated the SQL model and dbt doc YAML for the Addresses table:

-- models/addresses.sql

WITH source_data AS (
    SELECT
        address.value:id::INTEGER AS id,
        address.value:customer_id::INTEGER AS customer_id,
        address.value:first_name::VARCHAR AS first_name,
        address.value:last_name::VARCHAR AS last_name,
        address.value:company::VARCHAR AS company,
        address.value:address1::VARCHAR AS address1,
        address.value:address2::VARCHAR AS address2,
        address.value:city::VARCHAR AS city,
        address.value:province::VARCHAR AS province,
        address.value:country::VARCHAR AS country,
        address.value:zip::VARCHAR AS zip,
        address.value:phone::VARCHAR AS phone,
        address.value:name::VARCHAR AS name,
        address.value:province_code::VARCHAR AS province_code,
        address.value:country_code::VARCHAR AS country_code,
        address.value:country_name::VARCHAR AS country_name,
        -- "default" is quoted in the path and renamed, since DEFAULT is a SQL keyword
        address.value:"default"::BOOLEAN AS is_default
    FROM {{ source('your_source_name', 'your_raw_table_name') }},
    LATERAL FLATTEN(input => addresses) AS address
)

SELECT * FROM source_data

And the matching dbt doc YAML:

# models/addresses.yml

version: 2

models:
  - name: addresses
    description: "Table for customer addresses."
    columns:
      - name: id
        description: "Unique identifier for the address."
        tests:
          - unique
          - not_null
      - name: customer_id
        description: "Associated customer's ID."
        tests:
          - not_null
      - name: first_name
        description: "First name associated with the address."
      - name: last_name
        description: "Last name associated with the address."
      - name: company
        description: "Company associated with the address."
      - name: address1
        description: "Primary address line."
      - name: address2
        description: "Secondary address line."
      - name: city
        description: "City of the address."
      - name: province
        description: "Province of the address."
      - name: country
        description: "Country of the address."
      - name: zip
        description: "ZIP or postal code of the address."
      - name: phone
        description: "Phone number associated with the address."
      - name: name
        description: "Full name associated with the address."
      - name: province_code
        description: "Province code of the address."
      - name: country_code
        description: "Country code of the address."
      - name: country_name
        description: "Full country name of the address."
      - name: is_default
        description: "Indicates if this is the customer's default address."

Bye bye, monkey work!

While this GPT is useful, it isn’t perfect, and it goes without saying that you should review the generated code and documentation before using any of it in production. Even so, it saves a lot of time and effort, especially when you’re working with large or deeply nested datasets.
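
Part of that review is wiring up the source() reference: both models ship with placeholder names, and neither runs until the reference points at a real table. A minimal sources.yml sketch, with hypothetical database, schema, and table names:

# models/sources.yml -- placeholder names, match them to your warehouse
version: 2

sources:
  - name: your_source_name          # e.g. shopify
    database: raw                   # assumption: raw API loads land in a RAW database
    schema: shopify
    tables:
      - name: your_raw_table_name   # e.g. customers_raw, one row per customer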

Can I try it out?

Well, yes and no. OpenAI just announced their GPT Store, and I plan to offer this GPT there for free. However, the store is not yet open to the public; I will update this post once it is.

EDIT: The GPT store is now open to the public! You can find the dbt Model Engineer GPT here: https://chat.openai.com/g/g-lBNhN3Ioj-dbt-model-engineer
