The Rise of Spaghetti SQL

Is your SQL query too long? It’s either genius or a disaster.

Stéphane Derosiaux
10 min read · Mar 2, 2025

In the world of data, it’s not uncommon to stumble upon extremely long SQL queries, some of them thousands of lines long. Such monolithic queries are intimidating, but they often exist for a reason. Here is one example of a user asking for guidance on r/AskProgramming:

Hello, I’m working on a 5,000-line MySQL file that queries Magento tables and creates an internal product & category taxonomy + inventory levels. This file totals 5,000+ lines and it’s growing.

If that sounds abnormal or extreme to you, keep reading to understand why SQL queries grow to 5,000 lines and beyond… and why that often makes sense.

Summary:

  1. How Incremental Complexity Kills SQL
  2. Modularization Madness: Can dbt Save You?
  3. ELT Gone Wrong
  4. Faster, But Uglier
  5. Blame the Regulators

How Incremental Complexity Kills SQL

Business requirements rarely stay static. As applications evolve, business logic grows over time, often incrementally. A report or query that started simple may accumulate dozens of extra calculations and conditions as new rules are added. Each quarter or year, stakeholders might request new metrics, adjustments, or conditional logic. Over years, these additions turn a once-simple query into a sprawling one.

Financial reporting in banks or fintech is a prime example. Consider a profitability report for a bank.

  • Initially, it might just sum up revenues and expenses.
  • But then currency conversions are needed for international transactions.
  • Next, tax calculations must be applied for different jurisdictions.
  • Regulatory rules (for example, classifying accounts or flagging large transactions for compliance) add further conditions.

The result is a single SQL query encoding a complex web of financial business logic. Below is a simplified query for a profitability report. It converts transaction amounts from various currencies to USD, applies tax rates based on country, and aggregates net profit per account. It is only an example, yet it already looks complex at first glance!

-- Example: profitability report with currency conversion and tax logic
WITH currency_rates AS (
    SELECT currency_code, rate_to_usd
    FROM fx_rate
    WHERE date = '2025-01-31'
),
tax_rules AS (
    SELECT country, tax_rate
    FROM tax_config
    WHERE is_current = TRUE
),
profitability AS (
    SELECT
        t.account_id,
        t.transaction_date,
        t.currency_code,
        t.country,
        t.type, -- 'INCOME' or 'EXPENSE'
        t.amount,
        cr.rate_to_usd,
        -- Convert amount to USD
        t.amount * cr.rate_to_usd AS amount_usd,
        -- Separate income and expense for aggregation
        CASE
            WHEN t.type = 'INCOME'
            THEN t.amount * cr.rate_to_usd
            ELSE 0
        END AS income_usd,
        CASE
            WHEN t.type = 'EXPENSE'
            THEN t.amount * cr.rate_to_usd
            ELSE 0
        END AS expense_usd,
        -- Apply tax rate if available
        tr.tax_rate,
        CASE
            WHEN t.type = 'INCOME' AND tr.tax_rate IS NOT NULL
            THEN t.amount * cr.rate_to_usd * tr.tax_rate
            ELSE 0
        END AS tax_due_usd
    FROM transactions t
    JOIN currency_rates cr
        ON t.currency_code = cr.currency_code
    LEFT JOIN tax_rules tr
        ON t.country = tr.country
)
SELECT
    account_id,
    SUM(income_usd) AS total_income_usd,
    SUM(expense_usd) AS total_expense_usd,
    SUM(tax_due_usd) AS total_tax_due_usd,
    (SUM(income_usd) - SUM(expense_usd) - SUM(tax_due_usd)) AS net_profit_usd
FROM profitability
GROUP BY account_id
HAVING SUM(amount_usd) > 0 -- consider only accounts with activity
ORDER BY net_profit_usd DESC;

It started as a simple sum of transactions per account. Over time, developers added the currency join to convert values, then a join to apply tax rules, then CASE expressions for different transaction types, and so on. Each new business requirement (currency conversion, tax policy, compliance classification) made the SQL longer. The incremental complexity of business logic results in a lengthy query that tries to answer all aspects of a business question in one go.
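
For contrast, the very first version of that report could plausibly have fit on one screen. A minimal sketch, assuming the same transactions table and column names as in the query above:

-- Sketch: what the report may have looked like before the extra requirements arrived
SELECT
    account_id,
    SUM(CASE WHEN type = 'INCOME' THEN amount ELSE -amount END) AS net_profit
FROM transactions
GROUP BY account_id
ORDER BY net_profit DESC;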

Modularization Madness: Can dbt Save You?

In fast-paced environments, engineers often opt for quick fixes: rather than refactoring or creating helper views, they append new logic directly to an existing SQL query. This “one query to do it all” approach makes queries grow organically, accumulating joins, subqueries, and repeated logic that could have been separated.

Imagine an initial query that computes total sales per order for an e-commerce site.

  • Later, marketing wants to include promotion discounts in the analysis, so a join is added to bring in coupon data.
  • Then finance asks to factor in refunds, so another join or subquery brings refund data.
  • Perhaps the product team wants product category info in the same report, leading to another join.

Each request is implemented as a quick addition to the same SQL, instead of redesigning the data pipeline or creating intermediate views. The result is a single, ever-growing query touching many tables. It works, but it’s hard to read or maintain.
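
For contrast, the query probably started out close to this minimal sketch (assuming the orders and order_items tables used in the grown-up version below):

-- Sketch: the original total-sales-per-order query, before promotions and refunds crept in
SELECT
    o.order_id,
    SUM(oi.price * oi.quantity) AS total_sales
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY o.order_id;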

-- Example: e-commerce order item report with promotions and refunds integrated
WITH promo_data AS (
    SELECT order_item_id, SUM(discount_amount) AS item_discount
    FROM promotion_usages
    GROUP BY order_item_id
),
refund_data AS (
    SELECT order_item_id, SUM(refund_amount) AS item_refund
    FROM refunds
    GROUP BY order_item_id
)
SELECT
    o.order_id,
    o.customer_id,
    o.order_date,
    oi.order_item_id,
    pr.product_name,
    c.category_name,
    oi.quantity,
    oi.price AS item_price,
    COALESCE(p.item_discount, 0) AS item_discount,
    COALESCE(r.item_refund, 0) AS item_refund,
    (oi.price * oi.quantity - COALESCE(p.item_discount, 0) - COALESCE(r.item_refund, 0)) AS net_item_revenue
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products pr ON oi.product_id = pr.product_id
JOIN categories c ON pr.category_id = c.category_id
LEFT JOIN promo_data p ON oi.order_item_id = p.order_item_id
LEFT JOIN refund_data r ON oi.order_item_id = r.order_item_id
WHERE o.order_date >= '2025-01-01'
ORDER BY o.order_date;

This query has many responsibilities: fetching order details and product info, applying promotions, subtracting refunds… A modular approach might have created a view for promotions and a view for refunds, or split the report into steps (for example, computing net revenue in an intermediate table), as sketched below.
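
A rough sketch of that refactor with plain views (the view names are illustrative; the tables are the ones from the query above):

-- Sketch: extract the promotion and refund aggregations into reusable views
CREATE VIEW item_promotions AS
SELECT order_item_id, SUM(discount_amount) AS item_discount
FROM promotion_usages
GROUP BY order_item_id;

CREATE VIEW item_refunds AS
SELECT order_item_id, SUM(refund_amount) AS item_refund
FROM refunds
GROUP BY order_item_id;

-- The report then joins the views instead of repeating the aggregation logic inline
SELECT
    oi.order_item_id,
    oi.price * oi.quantity
        - COALESCE(p.item_discount, 0)
        - COALESCE(r.item_refund, 0) AS net_item_revenue
FROM order_items oi
LEFT JOIN item_promotions p ON oi.order_item_id = p.order_item_id
LEFT JOIN item_refunds r ON oi.order_item_id = r.order_item_id;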

Due to tight deadlines or concerns about breaking existing logic, developers often add JOINs and calculations directly into queries rather than refactoring. Over time, this leads to complicated SQL that’s hard to read and maintain.

dbt addresses this by encouraging a modular approach. It promotes creating reusable, clearly named models, each handling a single responsibility. Instead of extending one complex query, teams build layers of simpler SQL views or tables. This method helps keep code organized, readable, and easier to debug, making future changes safer and quicker.

Example using dbt:

-- models/orders/base_orders.sql
SELECT
    order_id,
    customer_id,
    order_date
FROM orders
WHERE order_date >= '2025-01-01'

-- models/orders/order_items.sql
SELECT
    oi.order_item_id,
    oi.order_id,
    oi.product_id,
    oi.quantity,
    oi.price
FROM order_items oi

-- models/orders/item_promotions.sql
SELECT
    order_item_id,
    SUM(discount_amount) AS item_discount
FROM promotion_usages
GROUP BY order_item_id

-- models/orders/item_refunds.sql
SELECT
    order_item_id,
    SUM(refund_amount) AS item_refund
FROM refunds
GROUP BY order_item_id

-- models/orders/order_items_enriched.sql
SELECT
    oi.order_item_id,
    oi.order_id,
    oi.product_id,
    oi.quantity,
    oi.price AS item_price,
    COALESCE(p.item_discount, 0) AS item_discount,
    COALESCE(r.item_refund, 0) AS item_refund,
    (oi.price * oi.quantity - COALESCE(p.item_discount, 0) - COALESCE(r.item_refund, 0)) AS net_item_revenue
FROM {{ ref('order_items') }} oi
LEFT JOIN {{ ref('item_promotions') }} p ON oi.order_item_id = p.order_item_id
LEFT JOIN {{ ref('item_refunds') }} r ON oi.order_item_id = r.order_item_id

-- models/orders/order_report.sql
SELECT
    o.order_id,
    o.customer_id,
    o.order_date,
    oi.order_item_id,
    pr.product_name,
    c.category_name,
    oi.quantity,
    oi.item_price,
    oi.item_discount,
    oi.item_refund,
    oi.net_item_revenue
FROM {{ ref('base_orders') }} o
JOIN {{ ref('order_items_enriched') }} oi ON o.order_id = oi.order_id
JOIN products pr ON oi.product_id = pr.product_id
JOIN categories c ON pr.category_id = c.category_id
ORDER BY o.order_date

It’s cleaner, right? Although it can sometimes feel like a scattered bowl of spaghetti: much like complex Helm charts, you may end up chasing references between isolated models, which makes the big picture harder to grasp at once.

ELT Gone Wrong

Modern data engineering often uses ELT workflows, where raw data is loaded into a warehouse (like BigQuery, Snowflake, or PostgreSQL) and then transformed using SQL. dbt again encourages writing transformations as SQL models. It’s powerful but can result in very large SQL queries or scripts, because the entire transformation logic is expressed in SQL. Each transformation step might be a CTE or subquery, and to achieve the final result, developers sometimes chain many steps together in one SQL statement.

The following example shows a single SQL script performing multiple transformation steps for a customer analytics use case. It uses chained CTEs, much like a dbt model would. It ingests raw events, cleans them, aggregates counts, and segments users — all in one query:

-- Example: customer analytics transformation (raw events -> cleaned -> aggregated -> segmented)
WITH raw_data AS (
    SELECT
        event_id,
        user_id,
        event_type,
        event_timestamp
    FROM events_staging
),
cleaned_data AS (
    SELECT
        user_id,
        date_trunc('day', event_timestamp) AS event_date,
        -- Normalize event type into a few categories
        CASE
            WHEN event_type IN ('click', 'view') THEN event_type
            ELSE 'other'
        END AS event_type,
        event_timestamp
    FROM raw_data
    -- Filter out events with missing user_id or future dates
    WHERE user_id IS NOT NULL
      AND event_timestamp <= NOW()
),
aggregated AS (
    SELECT
        user_id,
        event_type,
        COUNT(*) AS event_count
    FROM cleaned_data
    GROUP BY user_id, event_type
),
segmented AS (
    SELECT
        user_id,
        SUM(event_count) AS total_events,
        SUM(CASE WHEN event_type = 'click' THEN event_count ELSE 0 END) AS click_events,
        SUM(CASE WHEN event_type = 'view' THEN event_count ELSE 0 END) AS view_events,
        -- Segment users based on activity
        CASE
            WHEN SUM(event_count) > 100 THEN 'Power User'
            WHEN SUM(event_count) > 10 THEN 'Active User'
            ELSE 'Inactive User'
        END AS user_segment
    FROM aggregated
    GROUP BY user_id
)
-- Final select from the last CTE
SELECT *
FROM segmented
ORDER BY user_segment;

This is doing the work of an entire pipeline. In a traditional ETL setup, you might have separate scripts for each stage (cleanup, aggregation, etc.), but in ELT with SQL, all the stages can be expressed together. Each CTE adds more lines to the query, and complex logic (like those CASE statements for segmentation) adds even more.
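
If the same pipeline were run as a staged process, each step could be materialized on its own, for example with CREATE TABLE ... AS statements. A sketch reusing the CTE logic above; the intermediate table names are illustrative:

-- Sketch: the same transformations as separate, materialized steps
CREATE TABLE cleaned_events AS
SELECT
    user_id,
    date_trunc('day', event_timestamp) AS event_date,
    CASE WHEN event_type IN ('click', 'view') THEN event_type ELSE 'other' END AS event_type
FROM events_staging
WHERE user_id IS NOT NULL
  AND event_timestamp <= NOW();

CREATE TABLE user_event_counts AS
SELECT user_id, event_type, COUNT(*) AS event_count
FROM cleaned_events
GROUP BY user_id, event_type;

-- The segmentation step then reads a small, pre-aggregated table instead of raw events
SELECT
    user_id,
    SUM(event_count) AS total_events,
    CASE
        WHEN SUM(event_count) > 100 THEN 'Power User'
        WHEN SUM(event_count) > 10 THEN 'Active User'
        ELSE 'Inactive User'
    END AS user_segment
FROM user_event_counts
GROUP BY user_id;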

As the scope of transformations expands (say you need to add more event types or another stage of processing), the SQL gets longer and dbt might not be the solution (dbt and the challenges of large-scale data transformation — Datameer).

Faster, But Uglier

It sounds counterintuitive, but sometimes making a query larger can improve performance. Databases are optimized for set-based operations; a single well-crafted query can often outperform multiple smaller queries that accomplish the same tasks. This is partly due to reducing the overhead of multiple round trips between an application and the database, and partly because the database’s optimizer can often find a more efficient plan when it has the whole task in one shot. In scenarios where performance is critical (like real-time dashboards or latency-sensitive applications), engineers might merge logic into one big query intentionally.

Real-time analytics dashboards are where this matters most. Dashboards often display many metrics, and running a separate query for each can be a bad idea: every query hits the database and scans the same large tables again. A single query that computes all the needed metrics in one go, perhaps using subqueries or CTEs to organize the computation, eliminates the extra network calls and lets the database do one pass over the data to produce all the results.
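
For contrast, the “separate queries” approach means at least two round trips, each running on its own. A sketch, using the same user_activity and users tables as the combined query below:

-- Sketch: the same two metrics fetched as two separate round trips
SELECT
    date_trunc('day', activity_time) AS day,
    COUNT(DISTINCT user_id) AS active_users
FROM user_activity
WHERE activity_time >= current_date - INTERVAL '7 days'
GROUP BY 1;

SELECT
    date_trunc('day', signup_date) AS day,
    COUNT(*) AS new_signups
FROM users
WHERE signup_date >= current_date - INTERVAL '7 days'
GROUP BY 1;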

Let’s look at what a single dashboard query might look like: it calculates DAU (daily active users) and daily new signups for the past week and combines them in one result set.

-- Example: single query computing multiple metrics (active users, new signups) for a dashboard
WITH daily_active AS (
    SELECT
        date_trunc('day', activity_time) AS day,
        COUNT(DISTINCT user_id) AS active_users
    FROM user_activity
    WHERE activity_time >= current_date - INTERVAL '7 days'
    GROUP BY date_trunc('day', activity_time)
),
daily_signups AS (
    SELECT
        date_trunc('day', signup_date) AS day,
        COUNT(*) AS new_signups
    FROM users
    WHERE signup_date >= current_date - INTERVAL '7 days'
    GROUP BY date_trunc('day', signup_date)
)
SELECT
    d.day,
    COALESCE(a.active_users, 0) AS active_users,
    COALESCE(s.new_signups, 0) AS new_signups,
    -- Compute an example metric: activation rate = active users / new signups (for that day)
    CASE
        WHEN COALESCE(s.new_signups, 0) = 0 THEN NULL
        ELSE ROUND(COALESCE(a.active_users, 0)::decimal / s.new_signups, 2)
    END AS activation_rate
FROM generate_series(current_date - INTERVAL '6 days', current_date, '1 day') AS d(day) -- generate last 7 days
LEFT JOIN daily_active a ON d.day = a.day
LEFT JOIN daily_signups s ON d.day = s.day
ORDER BY d.day;

We combined what could have been two or three separate queries into one. The use of two CTEs (daily_active, daily_signups) and then a final SELECT means this SQL is longer than any single metric query by itself.

This design avoids scanning the user_activity table and users table multiple times for each metric. It also returns all results with one call to the database. By precomputing related metrics together, we often improve performance at the cost of a more complex SQL script. For a real-time dashboard, that trade-off is justified: the priority is to minimize load time, even if the query is harder to read or maintain.
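
Whether the merged query actually wins is worth measuring rather than assuming. In PostgreSQL (whose syntax the example above uses), you can prefix both the combined query and each per-metric query with EXPLAIN (ANALYZE, BUFFERS) and compare execution times and buffer reads, for example:

-- Sketch: measure one per-metric query, then repeat for the combined query and compare
EXPLAIN (ANALYZE, BUFFERS)
SELECT
    date_trunc('day', activity_time) AS day,
    COUNT(DISTINCT user_id) AS active_users
FROM user_activity
WHERE activity_time >= current_date - INTERVAL '7 days'
GROUP BY 1;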

Blame the Regulators

In highly regulated industries, compliance requirements can drive the creation of extremely detailed SQL queries. Healthcare, finance, and government organizations have strict rules about how data must be processed, reported, and protected.

The SQL used to generate official reports or to audit data may include extensive logic to enforce these rules. Instead of relying on application code for things like data filtering, anonymization, or validation, it’s often done in SQL to ensure it happens within the database (where access can be controlled and logged). This leads to long queries that carefully implement every rule to the letter.

Consider a medical claims audit SQL query. We must protect patient privacy (hide sensitive data) and comply with regulations such as HIPAA or GDPR. That means CASE statements to mask or omit sensitive data, joins to reference tables to verify codes or permissions, and filters to include or exclude records based on complex criteria.

-- Example: Healthcare claims audit query with validation and anonymization logic
SELECT
    cl.claim_id,
    cl.date_of_service,
    p.patient_id,
    CASE
        WHEN p.age < 18 THEN NULL -- do not show age for minors
        ELSE p.age
    END AS patient_age,
    CASE
        WHEN p.consent_to_share = FALSE THEN 'REDACTED'
        ELSE p.zip_code
    END AS patient_region,
    d.doctor_id,
    CASE
        WHEN d.is_resident = TRUE THEN 'Resident Doctor'
        ELSE d.specialty
    END AS doctor_role,
    cl.total_amount,
    cl.status,
    CASE
        WHEN cl.status = 'DENIED' THEN cl.denial_reason
        ELSE NULL
    END AS denial_reason
FROM claims cl
JOIN patients p ON cl.patient_id = p.patient_id
JOIN doctors d ON cl.doctor_id = d.doctor_id
WHERE cl.submission_date >= '2025-01-01'
  AND (p.opt_out = FALSE OR p.opt_out IS NULL); -- exclude patients who opted out of data sharing

Multiple policies are enforced in this single query:

  • hiding data for minors
  • masking data when a patient didn’t consent to share
  • labeling doctor roles differently if they are residents, etc.

In a real scenario, there are many more rules (various insurance types, extensive validation of codes, joins to lookup tables for data normalization, etc.). All of these rules often must be applied in the query itself to ensure the result is compliant and auditable. If part of this logic ran in application code after retrieving the data, you would risk a breach (someone could query the raw data without the mask). The safest route is therefore to implement it in the SQL that generates the report, even if that makes the query very long.
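
A common way to make that enforcement stick is to wrap the audit logic in a database view and grant analysts access to the view only, never to the raw tables. A minimal sketch (the view and role names are illustrative):

-- Sketch: expose only the masked result set; lock down the underlying tables
CREATE VIEW claims_audit_masked AS
SELECT
    cl.claim_id,
    cl.date_of_service,
    p.patient_id,
    CASE WHEN p.age < 18 THEN NULL ELSE p.age END AS patient_age,
    CASE WHEN p.consent_to_share = FALSE THEN 'REDACTED' ELSE p.zip_code END AS patient_region
FROM claims cl
JOIN patients p ON cl.patient_id = p.patient_id
WHERE p.opt_out = FALSE OR p.opt_out IS NULL;

-- Auditors read the view; direct access to the raw tables is revoked
REVOKE SELECT ON claims, patients FROM auditor_role;
GRANT SELECT ON claims_audit_masked TO auditor_role;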

Compliance-heavy queries might also be reviewed and approved by auditors or regulators, so once written, they remain as a single monolithic piece of code that everyone trusts. It’s justified because correctness and auditability are more important than brevity.

Conclusion

Long SQL queries are not always a sign of poor practice, and they aren’t inherently bad. These giant queries emerge because they solve specific problems: keeping complex logic in one place, achieving high performance, and enforcing rules reliably.

The key is knowing when a long query is justified and when it’s a sign of trouble. If all hope is lost, ask ChatGPT.

Written by Stéphane Derosiaux

Co-Founder & CPTO at conduktor.io | Advanced Data Management built on Kafka for modern organizations https://www.linkedin.com/in/stephane-derosiaux/
