Discover expert-level SQL tips and tricks to boost your database querying skills. From advanced window functions to optimizing complex queries, this post covers unique techniques that will enhance your SQL performance and problem-solving capabilities. Perfect for developers and data analysts looking to master SQL!
Table of Contents
- 1. Gaps and Islands Concepts:
- 2. Cumulative Sum with Reset:
- 3. Pivoting Data with Dynamic Columns:
- 4. Finding Missing Dates
- 5. Running Total Using Window Functions
- 6. Detecting Gaps in a Sequential ID
- 7. Finding Overlapping Time Intervals
- 8. Hierarchical Data with Recursive Queries
- 9. Cumulative Distribution Function (CDF) using Window Functions
- 10. Moving Average with Variable Window Size
- 11. Percentile Calculation with FIRST_VALUE and LAG
- Learn more about SQL and other topics
Welcome to our guide on SQL tips and tricks designed to take your database skills to the next level. Whether you’re an experienced developer or just getting started, mastering SQL is essential for efficient data handling and problem-solving. In this blog, we dive into advanced techniques, covering everything from window functions to query optimization. These expert tips will help you write more powerful, efficient queries, improve database performance, and unlock the full potential of SQL. Let’s get started!
1. Gaps and Islands Concepts:
The Gaps and Islands Problem is a common SQL problem that deals with identifying sequences (islands) or gaps in a dataset, typically when working with ranges of values such as dates, IDs, or numbers. It’s often used when you have ordered data, and you want to group together consecutive values or identify missing ones.
- Gaps: These are discontinuities in a sequence. For example, if you have consecutive days of sales data, a day with no sales creates a gap in the sequence.
- Islands: These are consecutive blocks of data without gaps. In the same example, an “island” would be a continuous streak of days where sales were recorded without any missing days in between.
Problem Definition:
Imagine you have a table of transactions with a date column, and you want to identify continuous periods of sales (islands) or gaps in the sales records. The challenge is to recognize which records are part of the same continuous sequence (island) and where there are gaps between these sequences.
Goal:
- Islands: Identify the continuous streaks (islands) of sales.
- Gaps: Find where there are breaks in those streaks (gaps).
Example 1:
Suppose you have a table `sales_data` with the following data:
-- Create and populate sample data
CREATE TABLE sales_data (
sale_date DATE,
amount number
);
--adding rows
INSERT INTO sales_data VALUES(to_date('2024-01-01','YYYY-MM-DD'),100);
INSERT INTO sales_data VALUES(to_date('2024-01-02','YYYY-MM-DD'),150);
INSERT INTO sales_data VALUES(to_date('2024-01-03','YYYY-MM-DD'),200);
INSERT INTO sales_data VALUES(to_date('2024-01-05','YYYY-MM-DD'),180);
INSERT INTO sales_data VALUES(to_date('2024-01-06','YYYY-MM-DD'),220);
INSERT INTO sales_data VALUES(to_date('2024-01-09','YYYY-MM-DD'),130);
sale_date | amount |
---|---|
2024-01-01 | 100 |
2024-01-02 | 150 |
2024-01-03 | 200 |
2024-01-05 | 180 |
2024-01-06 | 220 |
2024-01-09 | 130 |
In this dataset, you can see there are gaps between `2024-01-03` and `2024-01-05`, as well as between `2024-01-06` and `2024-01-09`. The continuous streaks of sales are called “islands.”
Approach:
There are two primary techniques used to solve this problem:
1. Identifying Islands (Continuous Sequences) Using Grouping
The idea is to assign a unique group number to each continuous block of dates.
Step-by-Step Logic:
- ROW_NUMBER(): Assign a row number to each date, ordered by the date itself.
- Create a Group Key: The difference between the actual date and the row number will remain constant for each continuous group of dates, which we can use to identify islands.
Solution:
SQL query to find islands:
WITH sales AS (
SELECT
sale_date,
amount,
ROW_NUMBER() OVER (ORDER BY sale_date) AS rn
FROM sales_data
)
SELECT
MIN(sale_date) AS island_start,
MAX(sale_date) AS island_end
FROM sales
GROUP BY sale_date - rn
ORDER BY island_start;
Explanation:
- `ROW_NUMBER() OVER (ORDER BY sale_date)` assigns a sequential row number based on the `sale_date`.
- `sale_date - rn` creates a constant identifier for consecutive rows: in a continuous sequence, the difference between the actual date and the row number stays the same; when the sequence is interrupted, this difference changes, identifying a new group (island).
- `GROUP BY sale_date - rn` groups the data into “islands” of consecutive dates.
- `MIN(sale_date)` and `MAX(sale_date)` return the start and end dates of each island.
Output:
island_start | island_end |
---|---|
2024-01-01 | 2024-01-03 |
2024-01-05 | 2024-01-06 |
2024-01-09 | 2024-01-09 |
Here, we’ve identified three islands:
- From `2024-01-01` to `2024-01-03`
- From `2024-01-05` to `2024-01-06`
- On `2024-01-09` (a single-day island)
2. Identifying Gaps in the Sequence
To find gaps, you need to look at where there’s a missing day (or other unit of measure) between consecutive records.
Solution:
SQL Query to Find Gaps:
WITH sales AS (
SELECT
sale_date,
LAG(sale_date) OVER (ORDER BY sale_date) AS prev_date
FROM sales_data
)
SELECT
prev_date + 1 AS gap_start,
sale_date - 1 AS gap_end
FROM sales
WHERE sale_date > prev_date + 1;
Explanation:
- `LAG(sale_date)` allows you to access the previous row’s date.
- The `WHERE sale_date > prev_date + 1` condition identifies where a gap starts: if the difference between the current date and the previous date is more than 1 day, there’s a gap.
- `prev_date + 1` and `sale_date - 1` identify the range of missing dates.
Output:
gap_start | gap_end |
---|---|
2024-01-04 | 2024-01-04 |
2024-01-07 | 2024-01-08 |
This identifies two gaps:
- `2024-01-04`
- `2024-01-07` to `2024-01-08`
Example 2:
This example demonstrates how to find consecutive date ranges for employee attendance. It uses a combination of window functions and self-joins to group consecutive dates together.
-- Create and populate sample data
CREATE TABLE employee_attendance (
employee_id INT,
attendance_date DATE
);
INSERT INTO employee_attendance (employee_id, attendance_date) VALUES
(1, '2023-01-01'), (1, '2023-01-02'), (1, '2023-01-03'), (1, '2023-01-05'),
(1, '2023-01-06'), (1, '2023-01-08'), (1, '2023-01-09'), (1, '2023-01-10');
--Some databases do not support the multi-row VALUES syntax above; if so, use the single-row inserts below
/*
INSERT INTO employee_attendance (employee_id, attendance_date) VALUES(1, to_date('2023-01-01','YYYY-MM-DD'));
INSERT INTO employee_attendance (employee_id, attendance_date) VALUES(1, to_date('2023-01-02','YYYY-MM-DD'));
INSERT INTO employee_attendance (employee_id, attendance_date) VALUES(1, to_date('2023-01-03','YYYY-MM-DD'));
INSERT INTO employee_attendance (employee_id, attendance_date) VALUES(1, to_date('2023-01-05','YYYY-MM-DD'));
INSERT INTO employee_attendance (employee_id, attendance_date) VALUES(1, to_date('2023-01-06','YYYY-MM-DD'));
INSERT INTO employee_attendance (employee_id, attendance_date) VALUES(1, to_date('2023-01-08','YYYY-MM-DD'));
INSERT INTO employee_attendance (employee_id, attendance_date) VALUES(1, to_date('2023-01-09','YYYY-MM-DD'));
INSERT INTO employee_attendance (employee_id, attendance_date) VALUES(1, to_date('2023-01-10','YYYY-MM-DD'));
*/
1. Identifying Islands (Continuous Sequences) Using Grouping
The idea is to assign a unique group number to each continuous block of dates. Here we find consecutive date ranges for employee attendance.
-- Query to find consecutive date ranges
WITH numbered_dates AS (
SELECT
employee_id,
attendance_date,
attendance_date - ROW_NUMBER() OVER (PARTITION BY employee_id ORDER BY attendance_date) AS group_date
FROM employee_attendance
)
SELECT
employee_id,
MIN(attendance_date) AS island_start,
MAX(attendance_date) AS island_end,
COUNT(*) AS consecutive_days
FROM numbered_dates
GROUP BY employee_id, group_date
ORDER BY employee_id, gap_start;
Output:
employee_id | island_start | island_end | consecutive_days |
---|---|---|---|
1 | 2023-01-01 | 2023-01-03 | 3 |
1 | 2023-01-05 | 2023-01-06 | 2 |
1 | 2023-01-08 | 2023-01-10 | 3 |
When to Use Gaps and Islands Solutions:
- Gaps: When you need to know when data is missing, such as days with no sales or missing IDs in a sequence.
- Islands: When you want to identify periods of continuous activity, such as streaks of daily sales, user logins, or consecutive event occurrences.
Both methods can be adapted for use in scenarios involving numeric sequences, date ranges, or any ordered data where gaps and sequences need to be identified.
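For instance, here is a minimal sketch of the same island technique applied to a numeric sequence, assuming an integer `id` column such as the `sequential_data` table used in tip 6 below:
-- Islands of consecutive integer ids: id - ROW_NUMBER() is constant
-- within each unbroken run, so grouping by it yields the islands.
WITH numbered AS (
SELECT
id,
id - ROW_NUMBER() OVER (ORDER BY id) AS grp
FROM sequential_data
)
SELECT
MIN(id) AS island_start,
MAX(id) AS island_end
FROM numbered
GROUP BY grp
ORDER BY island_start;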
2. Cumulative Sum with Reset:
The Cumulative Sum with Reset problem involves calculating a running total (cumulative sum) of values in a dataset, but with the added complexity that the sum resets under certain conditions. Typically, this reset happens based on changes in a specific column, such as a category, status, or date change.
This type of problem is commonly seen in scenarios like tracking a running total of sales, but resetting the total when a new month starts, or tracking performance across different categories.
Problem Definition:
Given a table of values, you want to compute the cumulative sum of a column, but you need the sum to restart when a certain condition is met.
Goal: Calculate a cumulative sum of sales (the `amount` column), with the sum resetting each time the `category` changes.
Example 1:
Suppose you have a table of daily sales that looks like this:
-- Create and populate sample data
CREATE TABLE sales_data (
sale_date DATE,
category varchar(2),
amount int
);
--Adding sales data to table
INSERT INTO sales_data VALUES(to_date('2024-01-01','YYYY-MM-DD'),'A',100);
INSERT INTO sales_data VALUES(to_date('2024-01-02','YYYY-MM-DD'),'A',150);
INSERT INTO sales_data VALUES(to_date('2024-01-03','YYYY-MM-DD'),'A',200);
--new category added
INSERT INTO sales_data VALUES(to_date('2024-01-04','YYYY-MM-DD'),'B',300);
INSERT INTO sales_data VALUES(to_date('2024-01-05','YYYY-MM-DD'),'B',100);
--old category again
INSERT INTO sales_data VALUES(to_date('2024-01-06','YYYY-MM-DD'),'A',250);
INSERT INTO sales_data VALUES(to_date('2024-01-07','YYYY-MM-DD'),'A',50);
sale_date | category | amount |
---|---|---|
2024-01-01 | A | 100 |
2024-01-02 | A | 150 |
2024-01-03 | A | 200 |
2024-01-04 | B | 300 |
2024-01-05 | B | 100 |
2024-01-06 | A | 250 |
2024-01-07 | A | 50 |
You want to compute the cumulative sum of sales for each category, but the sum should reset when the category changes.
Approach:
To solve this problem in SQL, we can use window functions with partitioning and conditional logic.
Step-by-Step Logic:
- Window Function: Use `SUM()` as a window function to calculate the cumulative sum.
- Partitioning: Use `PARTITION BY` to group the data by the `category` so that the sum resets for each category.
- Ordering: Use `ORDER BY` to ensure the cumulative sum is calculated in the correct order (e.g., by date or some other column).
Solution:
SELECT
sale_date,
category,
amount,
SUM(amount) OVER (PARTITION BY category ORDER BY sale_date) AS cumulative_sales_amount
FROM sales_data
ORDER BY sale_date;
Explanation:
- `SUM(amount) OVER (PARTITION BY category ORDER BY sale_date)` calculates the cumulative sum within each `category`. The `PARTITION BY` clause ensures a separate running total per category.
- `ORDER BY sale_date` ensures that the cumulative sum is calculated in the correct sequence of dates.
Output:
sale_date | category | amount | cumulative_sales_amount |
---|---|---|---|
2024-01-01 | A | 100 | 100 |
2024-01-02 | A | 150 | 250 |
2024-01-03 | A | 200 | 450 |
2024-01-04 | B | 300 | 300 |
2024-01-05 | B | 100 | 400 |
2024-01-06 | A | 250 | 700 |
2024-01-07 | A | 50 | 750 |
Here’s what’s happening:
- For category `A`, the cumulative sales are computed across the first three rows, summing 100, 150, and 200, giving cumulative values of 100, 250, and 450, respectively.
- When the `category` switches to `B`, a new partition begins, so the cumulative sum starts fresh at 300 and continues to 400 on the next record.
- When the data switches back to `A`, the cumulative sum resumes from A's last calculated value (450): the first value is 700 (450 + 250) on 2024-01-06, and the next day adds 50 to make 750 (on 2024-01-07). Strictly speaking, `PARTITION BY` groups all rows of a category together rather than resetting at every change; if you need a true reset at each change of category, combine it with the gaps-and-islands trick, as shown in the sketch below.
Example 2: Advanced – Cumulative Sum with Reset Based on Another Condition (e.g., Date)
Sometimes, you may want to reset the cumulative sum based on a time-based condition, such as the beginning of a new month or year, instead of a category change.
Let’s assume you want to compute a cumulative sum of sales, but it should reset at the beginning of each month.
--dropping old table
DROP TABLE sales_data;
-- Create and populate sample data
CREATE TABLE sales_data (
sale_date DATE,
category varchar(2),
amount int
);
--Adding sales data to table
INSERT INTO sales_data VALUES(to_date('2024-01-01','YYYY-MM-DD'),'A',100);
INSERT INTO sales_data VALUES(to_date('2024-01-02','YYYY-MM-DD'),'A',150);
INSERT INTO sales_data VALUES(to_date('2024-01-03','YYYY-MM-DD'),'A',200);
/*new month*/
INSERT INTO sales_data VALUES(to_date('2024-02-01','YYYY-MM-DD'),'B',300);
INSERT INTO sales_data VALUES(to_date('2024-02-02','YYYY-MM-DD'),'B',100);
INSERT INTO sales_data VALUES(to_date('2024-02-04','YYYY-MM-DD'),'A',50);
Solution:
SELECT
sale_date,
amount,
SUM(amount) OVER (PARTITION BY YEAR(sale_date), MONTH(sale_date) ORDER BY sale_date) AS cumulative_sales_amount
FROM sales_data
ORDER BY sale_date;
--Some databases (e.g., Oracle) use a different syntax to extract the year and month from a date; use the version below for them
/*
SELECT
sale_date,
amount,
SUM(amount) OVER (PARTITION BY extract(YEAR from sale_date), extract(MONTH from sale_date) ORDER BY sale_date) AS cumulative_sales_amount
FROM sales_data
ORDER BY sale_date;
*/
Output:
sale_date | amount | cumulative_sales_amount |
---|---|---|
2024-01-01 | 100 | 100 |
2024-01-02 | 150 | 250 |
2024-01-03 | 200 | 450 |
2024-02-01 | 300 | 300 |
2024-02-02 | 100 | 400 |
2024-02-04 | 50 | 450 |
Explanation:
- `PARTITION BY YEAR(sale_date), MONTH(sale_date)` groups the rows by each month of each year.
- This query resets the cumulative sum at the start of each month.
Use Cases for Cumulative Sum with Reset:
- Finance: Track cumulative revenue, resetting at the start of each fiscal year or quarter.
- Sales: Calculate running totals for product sales, resetting totals for each new product or category.
- Sports: Accumulate points or scores for players or teams, resetting at the beginning of each season or match.
- Inventory Management: Track inventory stock, resetting the sum when moving to a new product category or warehouse.
Key Takeaways:
- Cumulative sum allows tracking running totals over time or across categories.
- Resetting logic can be applied using the `PARTITION BY` clause, which helps define when the sum should reset (e.g., by category, month, or other criteria).
- Window functions like `SUM()` and `ROW_NUMBER()` are powerful tools for solving problems involving ordered data.
The “Cumulative Sum with Reset” is a crucial pattern for many analytical and reporting tasks, allowing you to handle scenarios where running totals are needed but must restart under specific conditions.
3. Pivoting Data with Dynamic Columns:
This example shows how to create a pivot table for product sales by category, where the category list is read from the data at run time. It uses dynamic SQL and the PostgreSQL-specific crosstab function. Note that a PL/pgSQL function must still declare a fixed output column list, so the technique is dynamic in how it builds the query, not in the function signature.
Example:
Pivoting Data with Dynamic Columns
-- Create a pivot table for product sales by category
-- Create and populate sample data
CREATE TABLE product_sales (
product_id INT,
category VARCHAR(50),
sale_amount DECIMAL(10, 2)
);
INSERT INTO product_sales (product_id, category, sale_amount) VALUES
(1, 'Electronics', 500), (2, 'Clothing', 200), (3, 'Books', 50),
(4, 'Electronics', 300), (5, 'Clothing', 150), (6, 'Food', 100),
(7, 'Books', 75), (8, 'Food', 80), (9, 'Electronics', 600);
-- Dynamic SQL to create pivot table
-- Note: This example uses PostgreSQL-specific syntax and requires the
-- tablefunc extension: CREATE EXTENSION IF NOT EXISTS tablefunc;
-- The declared output columns must match the alphabetical category order
-- produced by the crosstab category query below.
CREATE OR REPLACE FUNCTION pivot_product_sales() RETURNS TABLE (
product_id INT,
Books DECIMAL(10, 2),
Clothing DECIMAL(10, 2),
Electronics DECIMAL(10, 2),
Food DECIMAL(10, 2)
) AS $$
DECLARE
select_cols TEXT; -- quoted column names for the outer SELECT
ct_cols TEXT;     -- "name type" pairs for the crosstab column definition
BEGIN
-- Get distinct categories dynamically, ordered to match crosstab's output
SELECT string_agg(quote_ident(category), ', ' ORDER BY category),
       string_agg(quote_ident(category) || ' DECIMAL(10, 2)', ', ' ORDER BY category)
INTO select_cols, ct_cols
FROM (SELECT DISTINCT category FROM product_sales) c;
-- Execute dynamic SQL
RETURN QUERY EXECUTE format('
SELECT product_id, %s
FROM crosstab(
''SELECT product_id, category, sale_amount
FROM product_sales
ORDER BY 1, 2'',
''SELECT DISTINCT category FROM product_sales ORDER BY 1''
) AS ct(product_id INT, %s)
', select_cols, ct_cols);
END;
$$ LANGUAGE plpgsql;
-- Call the function to get the pivot table
SELECT * FROM pivot_product_sales();
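If you don't need fully dynamic columns, here is a portable sketch of the same pivot using plain conditional aggregation; the category list is hard-coded, but it runs on virtually any SQL engine:
-- Pivot via conditional aggregation: one CASE per category.
-- CASE without ELSE yields NULL, which SUM() ignores.
SELECT
product_id,
SUM(CASE WHEN category = 'Electronics' THEN sale_amount END) AS electronics,
SUM(CASE WHEN category = 'Clothing' THEN sale_amount END) AS clothing,
SUM(CASE WHEN category = 'Books' THEN sale_amount END) AS books,
SUM(CASE WHEN category = 'Food' THEN sale_amount END) AS food
FROM product_sales
GROUP BY product_id
ORDER BY product_id;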
4. Finding Missing Dates
Problem Definition:
Suppose you have a table with daily records and want to find dates that are missing from the sequence.
Goal: Identify missing dates between the minimum and maximum dates in the table.
Example:
Suppose you have a table `daily_sales` with the following data:
-- Drop the old table if it exists, then create and populate sample data
-- (PostgreSQL-style syntax; in Oracle, append FROM dual to each SELECT)
DROP TABLE daily_sales;
CREATE TABLE daily_sales AS
SELECT to_date('2023-01-01','YYYY-MM-DD') AS sales_date, 100 AS sales_amount UNION ALL
SELECT to_date('2023-01-02','YYYY-MM-DD') , 120 UNION ALL
SELECT to_date('2023-01-04','YYYY-MM-DD') , 150 UNION ALL
SELECT to_date('2023-01-05','YYYY-MM-DD') , 110 UNION ALL
SELECT to_date('2023-01-07','YYYY-MM-DD') , 140;
sales_date | sales_amount |
---|---|
2023-01-01 | 100 |
2023-01-02 | 120 |
2023-01-04 | 150 |
2023-01-05 | 110 |
2023-01-07 | 140 |
Solution:
WITH RECURSIVE DateRange (sales_date) AS (
SELECT MIN(sales_date) AS sales_date
FROM daily_sales
UNION ALL
SELECT sales_date + 1 -- date + integer arithmetic; some databases need DATE_ADD or interval syntax
FROM DateRange
WHERE sales_date + 1 <= (SELECT MAX(sales_date) FROM daily_sales)
)
SELECT sales_date AS missing_dates
FROM DateRange
WHERE sales_date NOT IN (SELECT sales_date FROM daily_sales);
Output:
Missing_dates:
2023-01-03
2023-01-06
Explanation:
– The `DateRange` CTE generates a sequence of dates from the minimum date to the maximum date.
– The outer query then finds dates in this sequence that are not present in the `daily_sales` table.
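As an aside, `NOT IN` behaves surprisingly when the subquery can return NULLs; an anti-join sketch of the same query avoids that pitfall (same output on this data):
WITH RECURSIVE DateRange (sales_date) AS (
SELECT MIN(sales_date) FROM daily_sales
UNION ALL
SELECT sales_date + 1
FROM DateRange
WHERE sales_date + 1 <= (SELECT MAX(sales_date) FROM daily_sales)
)
SELECT d.sales_date AS missing_date
FROM DateRange d
LEFT JOIN daily_sales s ON s.sales_date = d.sales_date
WHERE s.sales_date IS NULL; -- keep only generated dates with no matching sales row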
5. Running Total Using Window Functions
Problem Definition:
Compute the daily running total of sales.
Goal: Get the running total of sales by date.
Example:
Suppose you have a table `daily_sales` with the following data:
Date | Sales |
---|---|
2024-09-01 | 100 |
2024-09-02 | 150 |
2024-09-03 | 200 |
2024-09-04 | 250 |
Solution:
SELECT
date,
sales,
SUM(sales) OVER (ORDER BY date) AS running_total
FROM daily_sales;
Explanation:
The `SUM()` window function calculates the running total of the `sales` column, ordered by `date`.
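One subtlety worth knowing: with only `ORDER BY` in the window, the default frame is `RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`, so rows tied on `date` share the same running total. A sketch that pins the frame explicitly to accumulate row by row:
SELECT
date,
sales,
SUM(sales) OVER (
ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW -- row-by-row accumulation, even across ties
) AS running_total
FROM daily_sales;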
6. Detecting Gaps in a Sequential ID
Problem Definition:
You have a table with sequential IDs and want to detect gaps.
Goal: Identify missing IDs in the sequence.
Example:
Suppose you have a table `sequential_data` with the following data:
Id | Value |
---|---|
1 | A |
2 | B |
4 | C |
5 | D |
Solution:
-- The RECURSIVE keyword is required in PostgreSQL and MySQL; SQL Server and Oracle accept plain WITH
WITH RECURSIVE NumberSeries AS (
SELECT MIN(id) AS id
FROM sequential_data
UNION ALL
SELECT id + 1
FROM NumberSeries
WHERE id + 1 <= (SELECT MAX(id) FROM sequential_data)
)
SELECT id
FROM NumberSeries
WHERE id NOT IN (SELECT id FROM sequential_data);
Explanation:
– The `NumberSeries` CTE recursively generates a sequence of IDs from the minimum ID to the maximum ID.
– The outer query then identifies IDs in this sequence that are not present in the `sequential_data` table.
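If your database doesn't support recursive CTEs (or the ID range is huge), here is a sketch of an alternative that reports gap ranges directly with `LAG`:
SELECT
prev_id + 1 AS gap_start,
id - 1 AS gap_end
FROM (
SELECT id, LAG(id) OVER (ORDER BY id) AS prev_id
FROM sequential_data
) t
WHERE id > prev_id + 1; -- a jump larger than 1 means ids are missing in between
On the sample data this returns a single gap row: 3 to 3.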
7. Finding Overlapping Time Intervals
Problem Definition:
Determine if there are overlapping time intervals in a table.
Goal: Find overlapping events.
Example:
Suppose you have a table `events` with the following data:
Event_ID | Start_Time | End_Time |
---|---|---|
1 | 2024-09-01 08:00:00 | 2024-09-01 10:00:00 |
2 | 2024-09-01 09:30:00 | 2024-09-01 11:00:00 |
3 | 2024-09-01 11:30:00 | 2024-09-01 12:30:00 |
Solution:
SELECT e1.event_id AS event1, e2.event_id AS event2
FROM events e1
JOIN events e2
ON e1.event_id < e2.event_id
AND e1.start_time < e2.end_time
AND e1.end_time > e2.start_time;
Explanation:
– The `JOIN` condition ensures that `e1` and `e2` are different events (`e1.event_id < e2.event_id`).
– The overlap condition is checked using the `start_time` and `end_time` of both events.
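Some databases (PostgreSQL, for example) also offer the SQL-standard `OVERLAPS` predicate; a sketch of the same check (boundary semantics for intervals that merely touch can differ slightly between the two formulations, so verify on your engine):
SELECT e1.event_id AS event1, e2.event_id AS event2
FROM events e1
JOIN events e2
ON e1.event_id < e2.event_id
AND (e1.start_time, e1.end_time) OVERLAPS (e2.start_time, e2.end_time);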
8. Hierarchical Data with Recursive Queries
Problem Definition:
Manage hierarchical data, such as an organizational chart.
Goal: Retrieve the hierarchy of employees reporting to a given manager.
Example:
Suppose you have a table `employees` with the following data:
Emp_ID | Emp_Name | Manager_ID |
---|---|---|
1 | Alice | NULL |
2 | Bob | 1 |
3 | Carol | 1 |
4 | Dave | 2 |
Solution:
WITH RECURSIVE EmployeeHierarchy AS (
SELECT emp_id, emp_name, manager_id
FROM employees
WHERE manager_id IS NULL -- Start with top-level managers
UNION ALL
SELECT e.emp_id, e.emp_name, e.manager_id
FROM employees e
INNER JOIN EmployeeHierarchy eh ON e.manager_id = eh.emp_id
)
SELECT *
FROM EmployeeHierarchy;
Explanation:
– The `EmployeeHierarchy` CTE starts with the top-level managers (`manager_id IS NULL`).
– It recursively joins with `employees` to include subordinates.
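Where the hierarchy's shape matters, here is a sketch extending the same recursive CTE with a depth counter and a breadcrumb path (the `lvl` and `path` columns are illustrative additions; the casts keep the anchor and recursive members type-compatible, and `||` concatenation follows PostgreSQL/Oracle, while SQL Server uses `+`):
WITH RECURSIVE EmployeeHierarchy AS (
SELECT emp_id, emp_name, manager_id,
1 AS lvl,                               -- depth in the tree
CAST(emp_name AS VARCHAR(4000)) AS path -- breadcrumb from the root
FROM employees
WHERE manager_id IS NULL
UNION ALL
SELECT e.emp_id, e.emp_name, e.manager_id,
eh.lvl + 1,
CAST(eh.path || ' > ' || e.emp_name AS VARCHAR(4000))
FROM employees e
INNER JOIN EmployeeHierarchy eh ON e.manager_id = eh.emp_id
)
SELECT * FROM EmployeeHierarchy
ORDER BY path;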
9. Cumulative Distribution Function (CDF) using Window Functions
The Cumulative Distribution Function (CDF) is a statistical concept that represents the probability that a random variable will take a value less than or equal to a specific value. In SQL, you can compute the CDF to analyze data distributions, often using window functions. This allows you to determine what proportion of the data lies below or at a certain threshold.
Problem Definition:
Given a dataset of numeric values (e.g., sales, scores, or prices), calculate the CDF for each value. The CDF for a value is the percentage of values in the dataset that are less than or equal to that value.
Goal: You want to calculate the Cumulative Distribution Function (CDF), which shows the percentage of products priced at or below each price point.
Example 1:
Imagine you have a table of product prices, `product_prices`, like this:
product_id | price |
---|---|
1 | 50 |
2 | 60 |
3 | 70 |
4 | 80 |
5 | 90 |
6 | 100 |
Step-by-Step Solution:
To calculate the CDF in SQL, we will follow these steps:
- Sort the data: Arrange the values in ascending order.
- Calculate the row rank: Assign a rank to each row based on the order of values.
- Calculate cumulative count: Calculate the cumulative count of rows that have a value less than or equal to the current row.
- Normalize the cumulative count: Convert the cumulative count into a percentage of the total number of rows, which represents the CDF.
1. Calculate Row Rank Using `ROW_NUMBER()` or `RANK()`:
We first assign a rank to each value. For this example, let’s assume you want to calculate the rank of each product’s price, ordered from lowest to highest.
WITH ranked_prices AS (
SELECT
product_id,
price,
ROW_NUMBER() OVER (ORDER BY price) AS rn,
COUNT(*) OVER () AS total_rows
FROM product_prices
)
SELECT * FROM ranked_prices;
Explanation:
- `ROW_NUMBER() OVER (ORDER BY price)` assigns a unique row number to each product based on its price, starting with the lowest price.
- `COUNT(*) OVER ()` calculates the total number of rows in the dataset, which we need to normalize the cumulative count into a percentage.
2. Calculate the Cumulative Count:
Next, we use the `COUNT()` window function to compute the cumulative count of rows where the price is less than or equal to the current row’s price.
WITH cumulative_count AS (
SELECT
product_id,
price,
COUNT(*) OVER (ORDER BY price ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_count,
total_rows
FROM ranked_prices -- builds on the ranked_prices CTE from step 1
)
SELECT * FROM cumulative_count;
Explanation:
- `COUNT(*) OVER (ORDER BY price ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)` calculates the cumulative count of rows up to and including the current row, ordered by price. This represents how many products have a price less than or equal to the current product’s price.
3. Calculate the CDF:
Now that we have the cumulative count, we can compute the CDF by dividing the cumulative count by the total number of rows. This gives the percentage of rows that have a value less than or equal to the current price.
SELECT
product_id,
price,
cumulative_count,
(cumulative_count * 1.0 / total_rows) AS cdf
FROM cumulative_count
ORDER BY price;
Explanation:
- `(cumulative_count * 1.0 / total_rows)` converts the cumulative count into a fraction of the total rows, which represents the CDF value. Multiplying by `1.0` ensures the division produces a decimal (floating-point) result.
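For reference, the three fragments above chain into a single runnable statement; here is a sketch that includes setup for the assumed `product_prices` table:
-- Assumed setup matching the sample data above
CREATE TABLE product_prices (product_id INT, price INT);
INSERT INTO product_prices VALUES (1, 50), (2, 60), (3, 70), (4, 80), (5, 90), (6, 100);
-- Steps 1-3 combined
WITH ranked_prices AS (
SELECT
product_id,
price,
ROW_NUMBER() OVER (ORDER BY price) AS rn,
COUNT(*) OVER () AS total_rows
FROM product_prices
), cumulative_count AS (
SELECT
product_id,
price,
COUNT(*) OVER (ORDER BY price ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_count,
total_rows
FROM ranked_prices
)
SELECT
product_id,
price,
cumulative_count,
(cumulative_count * 1.0 / total_rows) AS cdf
FROM cumulative_count
ORDER BY price;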
Final Output:
product_id | price | cumulative_count | cdf |
---|---|---|---|
1 | 50 | 1 | 0.167 |
2 | 60 | 2 | 0.333 |
3 | 70 | 3 | 0.500 |
4 | 80 | 4 | 0.667 |
5 | 90 | 5 | 0.833 |
6 | 100 | 6 | 1.000 |
Explanation of the Result:
- For a product with a price of 50, the cumulative count is 1, and since there are 6 rows total, the CDF is `1 / 6 = 0.167` (16.7%).
- For a price of 100, the cumulative count is 6, and the CDF is `6 / 6 = 1.0` (100%), meaning all products are priced at or below 100.
Key SQL Window Functions Explained:
- `ROW_NUMBER()`: assigns a unique number to each row based on the price. This ranks the rows but does not handle ties (i.e., products with the same price will have different ranks). If you want to handle ties, use `RANK()` or `DENSE_RANK()`.
- `COUNT() OVER (ORDER BY price ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)`: calculates the cumulative count, which is the count of all previous rows up to and including the current row. It's the key building block for cumulative metrics like the CDF.
- `COUNT(*) OVER ()`: calculates the total number of rows in the table, allowing us to normalize the cumulative count into a proportion of the whole dataset.
- Division: dividing the cumulative count by the total number of rows yields the cumulative distribution value (0 to 1, or 0 to 100% if multiplied by 100).
Key Considerations:
- Ties: If your data contains ties (i.e., multiple rows with the same value), the `ROW_NUMBER()` function will assign a unique rank to each, meaning that tied values will have slightly different CDF values. If you want ties to share the same rank, use `RANK()` or `DENSE_RANK()` instead.
- Precision: The CDF values are typically represented as decimals between 0 and 1. If you want percentages, multiply the result by 100.
- Performance: CDF calculations using window functions can be resource-intensive for large datasets, so performance optimizations (such as indexing or partitioning) may be necessary depending on your data volume.
Example 2:
This SQL trick calculates the Cumulative Distribution Function (CDF) for product sales.
Solution:
The CDF shows the proportion of values below a certain point. This can be useful for understanding the distribution of sales across products.
WITH sales_data AS (
SELECT 'Product A' AS product, 100 AS sales
UNION ALL SELECT 'Product B', 200
UNION ALL SELECT 'Product C', 150
UNION ALL SELECT 'Product D', 300
UNION ALL SELECT 'Product E', 250
)
SELECT
product,
sales,
sales * 1.0 / SUM(sales) OVER () AS sales_percentage, -- * 1.0 avoids integer division
SUM(sales) OVER (ORDER BY sales) * 1.0 / SUM(sales) OVER () AS cdf
FROM sales_data
ORDER BY sales;
Explanation:
Here’s how it works:
- We create a CTE (`sales_data`) with sample product sales data.
- In the main query, we calculate:
- `sales_percentage`: each product's sales as a percentage of total sales.
- `cdf`: the cumulative sum of sales divided by total sales, ordered by sales amount.
Key Takeaways:
- The Cumulative Distribution Function (CDF) is useful for understanding the distribution of data and determining the proportion of values that lie below a given threshold.
- SQL window functions like `COUNT()`, `ROW_NUMBER()`, and `RANK()` make it possible to calculate the CDF by providing cumulative counts and ranks.
- Normalizing the cumulative count by the total number of rows yields the CDF value, showing the percentage of the dataset that falls below a certain point.
This technique is widely used in analytics and can help provide valuable insights into how values are distributed within a dataset.
10. Moving Average with Variable Window Size
A moving average (also known as a running average) is a technique used to smooth out fluctuations in data by calculating the average of a subset of values (window) that changes as you move through the data. In many scenarios, the window size (the number of rows or time intervals) is fixed. However, in more complex cases, you may want the window size to be variable, meaning the number of rows included in the average can change dynamically.
This SQL problem requires using window functions to calculate a moving average but with flexibility in determining the size of the window used for the calculation.
Problem Definition:
Given a table of data, calculate the moving average of a column (like sales or prices), but with a window size that can vary based on conditions (like date ranges or number of preceding rows).
Example:
Suppose you have a table of daily sales that looks like this:
sale_date | sales |
---|---|
2024-01-01 | 100 |
2024-01-02 | 150 |
2024-01-03 | 200 |
2024-01-04 | 250 |
2024-01-05 | 300 |
2024-01-06 | 350 |
2024-01-07 | 400 |
You want to calculate a moving average of the past 3 days (a fixed window size of 3), but you also want to handle cases where the window size may vary, such as when there are fewer than 3 preceding days or when conditions, like business logic, determine the window size.
Fixed vs Variable Window Size:
- Fixed Window Size: A constant window size is applied uniformly across all rows (e.g., a moving average over the last 3 days).
- Variable Window Size: The window size can change for each row based on conditions, such as a dynamically determined number of previous days or rows.
1. Moving Average with Fixed Window Size (for Reference):
Let’s first calculate the moving average using a fixed window size of 3 days.
Solution: (Fixed Window)
SELECT
sale_date,
sales,
AVG(sales) OVER (ORDER BY sale_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg
FROM sales_data
ORDER BY sale_date;
Explanation:
- `ROWS BETWEEN 2 PRECEDING AND CURRENT ROW`: creates a fixed window of 3 rows, including the current row and the previous two.
- `AVG(sales) OVER (ORDER BY sale_date ...)`: calculates the average of sales within that fixed window.
Output (Fixed Window):
sale_date | sales | moving_avg |
---|---|---|
2024-01-01 | 100 | 100 |
2024-01-02 | 150 | 125 |
2024-01-03 | 200 | 150 |
2024-01-04 | 250 | 200 |
2024-01-05 | 300 | 250 |
2024-01-06 | 350 | 300 |
2024-01-07 | 400 | 350 |
2. Moving Average with Variable Window Size:
Now, let’s calculate the moving average with a variable window size. The window size could depend on different factors, such as:
- A changing number of previous days (e.g., vary the range of dates).
- Dynamic business logic (e.g., only use rows where the sales were above a certain threshold).
Example 1: Moving Average Based on a Variable Number of Previous Days
In this case, we want the moving average to include all sales from the past `N` days dynamically (e.g., based on a specific time period).
Solution: (Variable Date-Based Window)
SELECT
sale_date,
sales,
AVG(sales) OVER (ORDER BY sale_date RANGE BETWEEN INTERVAL '2' DAY PRECEDING AND CURRENT ROW) AS moving_avg
FROM sales_data
ORDER BY sale_date;
Explanation:
- `RANGE BETWEEN INTERVAL '2' DAY PRECEDING AND CURRENT ROW`: instead of using a fixed number of rows, this uses a date-based window. It includes all rows within the last 2 days (plus the current date).
- `AVG(sales)`: calculates the moving average of the sales over that variable window of dates.
Output (Variable Window Based on Days):
sale_date | sales | moving_avg |
---|---|---|
2024-01-01 | 100 | 100 |
2024-01-02 | 150 | 125 |
2024-01-03 | 200 | 150 |
2024-01-04 | 250 | 200 |
2024-01-05 | 300 | 250 |
2024-01-06 | 350 | 300 |
2024-01-07 | 400 | 350 |
In this case, the window is defined by the date range, not the number of rows. For each row, the moving average is calculated over the sales from the current date and the previous 2 days. Because the sample dates are consecutive, the result matches the fixed 3-row window above; the two diverge as soon as the dates have gaps, since missing days simply shrink the date-based window.
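Dialect note: the `INTERVAL '2' DAY` frame above follows Oracle's syntax; PostgreSQL (11+) spells the same date-based frame slightly differently. A sketch:
SELECT
sale_date,
sales,
AVG(sales) OVER (
ORDER BY sale_date
RANGE BETWEEN INTERVAL '2 days' PRECEDING AND CURRENT ROW -- PostgreSQL interval literal
) AS moving_avg
FROM sales_data
ORDER BY sale_date;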
Example 2: Moving Average with Business Logic-Based Variable Window
You might want to compute a moving average but only for the rows that satisfy certain business conditions. For example, you want to include all previous sales in the moving average as long as the sales are above 200.
Solution: (Conditional Window)
This scenario requires filtering rows inside the window function, which standard SQL cannot directly do. However, we can work around this by using conditional aggregation inside a window function.
SELECT
sale_date,
sales,
SUM(CASE WHEN sales > 200 THEN sales ELSE 0 END) OVER (ORDER BY sale_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) /
NULLIF(SUM(CASE WHEN sales > 200 THEN 1 ELSE 0 END) OVER (ORDER BY sale_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), 0) AS conditional_moving_avg
FROM sales_data
ORDER BY sale_date;
Explanation:
- `CASE WHEN sales > 200 THEN sales ELSE 0 END`: ensures that only sales greater than 200 are considered for the sum.
- `NULLIF(SUM(CASE WHEN sales > 200 THEN 1 ELSE 0 END), 0)`: counts how many rows met the condition (`sales > 200`); `NULLIF(..., 0)` ensures we don’t divide by zero.
- `OVER (ORDER BY sale_date ...)`: ensures that the sum and count are computed as running totals across the dataset up to the current row.
Output (Conditional Variable Window):
sale_date | sales | conditional_moving_avg |
---|---|---|
2024-01-01 | 100 | NULL |
2024-01-02 | 150 | NULL |
2024-01-03 | 200 | NULL |
2024-01-04 | 250 | 250 |
2024-01-05 | 300 | 275 |
2024-01-06 | 350 | 300 |
2024-01-07 | 400 | 325 |
Here, the moving average only considers rows where sales are greater than 200, so the moving average starts on 2024-01-04, when the first sales value exceeds 200.
Example 3:
This trick demonstrates a moving average with a variable window size based on whether the current day is a weekend or weekday.
Solution:
This approach allows for different smoothing effects on weekends vs. weekdays, which can be useful when sales patterns differ between these periods.
--Works in Oracle
WITH daily_sales AS (
SELECT to_date('2023-01-01','yyyy-mm-dd') AS sales_date, 100 AS sales_amount from dual
UNION ALL SELECT to_date('2023-01-02','yyyy-mm-dd'), 120 from dual
UNION ALL SELECT to_date('2023-01-03','yyyy-mm-dd'), 80 from dual
UNION ALL SELECT to_date('2023-01-04','yyyy-mm-dd'), 150 from dual
UNION ALL SELECT to_date('2023-01-05','yyyy-mm-dd'), 110 from dual
UNION ALL SELECT to_date('2023-01-06','yyyy-mm-dd'), 130 from dual
UNION ALL SELECT to_date('2023-01-07','yyyy-mm-dd'), 140 from dual
)
SELECT
sales_date,
sales_amount,
AVG(sales_amount) OVER (
ORDER BY sales_date
ROWS BETWEEN CASE WHEN TO_CHAR(sales_date, 'DY') IN ('SUN', 'SAT') THEN 1 ELSE 2 END PRECEDING AND CURRENT ROW
) AS moving_avg
FROM daily_sales;
Output:
SALES_DATE | SALES_AMOUNT | MOVING_AVG |
---|---|---|
01-JAN-23 | 100 | 100 |
02-JAN-23 | 120 | 110 |
03-JAN-23 | 80 | 100 |
04-JAN-23 | 150 | 116.67 |
05-JAN-23 | 110 | 113.33 |
06-JAN-23 | 130 | 130 |
07-JAN-23 | 140 | 135 |
Explanation:
Here’s the breakdown:
- We create a CTE (`daily_sales`) with sample daily sales data.
- In the main query, we use a window function with a CASE expression in the ROWS BETWEEN clause:
- For weekends (Saturday, Sunday), we use a 2-day window (current day and 1 preceding).
- For weekdays, we use a 3-day window (current day and 2 preceding).
- Note that `TO_CHAR(sales_date, 'DY')` returns language-dependent day abbreviations, so the result depends on the session’s NLS_DATE_LANGUAGE setting.
Key Takeaways:
- Fixed Window: Standard moving average with a fixed number of preceding rows.
- Variable Window (Date-Based): The window size can be based on time intervals, allowing you to define the window dynamically based on date ranges.
- Conditional Window (Logic-Based): By using conditional logic within window functions, you can vary which rows contribute to the moving average based on specific conditions.
- Window Functions: SQL window functions like `AVG()` and `SUM()` combined with `ROWS` or `RANGE` clauses allow for flexible moving-average calculations.
The ability to dynamically change the window size based on conditions makes SQL window functions powerful for complex data analysis scenarios.
11. Percentile Calculation with FIRST_VALUE and LAG
Percentile calculations are used in statistical analysis to determine the value below which a given percentage of observations in a dataset falls. In SQL, percentiles can be calculated in several ways, including with window functions like `FIRST_VALUE` and `LAG`. These functions are typically combined with ranking or cumulative-distribution techniques to identify the value at a specific percentile.
In this section, we’ll explore how to calculate percentiles using the `FIRST_VALUE` and `LAG` functions in SQL, along with the logical steps involved.
Problem Definition
Given a dataset of values (e.g., sales, scores, or prices), calculate the value at a specific percentile, such as the 90th percentile. In SQL, this can be done by leveraging window functions like `FIRST_VALUE` (to access values in a defined window) and `LAG` (to get previous rows in an ordered sequence).
Goal: You want to find the value at the 90th percentile, which means 90% of the students have a score below this value.
Example 1:
Imagine you have a table of student scores that looks like this:
WITH student_scores AS (
SELECT 1 AS student_id, 50 AS score UNION ALL
SELECT 2, 60 UNION ALL
SELECT 3, 70 UNION ALL
SELECT 4, 80 UNION ALL
SELECT 5, 90 UNION ALL
SELECT 6, 100
)
select * from student_scores;
student_id | score |
---|---|
1 | 50 |
2 | 60 |
3 | 70 |
4 | 80 |
5 | 90 |
6 | 100 |
Approach Using `FIRST_VALUE` and `LAG`:
To calculate the percentile, we’ll follow these steps:
- Sort the data: Arrange the data in ascending or descending order of scores.
- Assign row numbers: Assign row numbers or ranks to each row in the dataset.
- Calculate the percentile rank: Compute the percentile rank of each row, which indicates what percentage of the dataset falls below that row.
- Use `FIRST_VALUE` to get the percentile value: Once we’ve computed the percentile rank, use `FIRST_VALUE` to select the score that corresponds to the desired percentile (e.g., the 90th percentile).
- Use `LAG` to compare: You can use `LAG` to check how the score changes between consecutive rows and see where the percentile threshold lies.
Step-by-Step SQL Solution:
1. Assign Row Numbers Using `ROW_NUMBER()`:
First, we assign a row number to each row, ordered by the `score`. This will help in determining the position of each score in the dataset.
WITH student_scores AS (
SELECT 1 AS student_id, 50 AS score UNION ALL
SELECT 2, 60 UNION ALL
SELECT 3, 70 UNION ALL
SELECT 4, 80 UNION ALL
SELECT 5, 90 UNION ALL
SELECT 6, 100
),ranked_scores AS (
SELECT
student_id,
score,
ROW_NUMBER() OVER (ORDER BY score) AS rn,
COUNT(*) OVER () AS total_rows
FROM student_scores
)
select * from ranked_scores;
Output:
student_id | score | rn | total_rows |
---|---|---|---|
1 | 50 | 1 | 6 |
2 | 60 | 2 | 6 |
3 | 70 | 3 | 6 |
4 | 80 | 4 | 6 |
5 | 90 | 5 | 6 |
6 | 100 | 6 | 6 |
Explanation:
- `ROW_NUMBER() OVER (ORDER BY score)` assigns a sequential number to each row based on the `score`, ordered from smallest to largest.
- `COUNT(*) OVER ()` calculates the total number of rows in the dataset.
2. Calculate the Percentile Rank:
Next, we calculate the percentile rank for each row. This is done by dividing the row number by the total number of rows.
WITH student_scores AS (
SELECT 1 AS student_id, 50 AS score UNION ALL
SELECT 2, 60 UNION ALL
SELECT 3, 70 UNION ALL
SELECT 4, 80 UNION ALL
SELECT 5, 90 UNION ALL
SELECT 6, 100
), ranked_scores AS (
SELECT
student_id,
score,
ROW_NUMBER() OVER (ORDER BY score) AS rn,
COUNT(*) OVER () AS total_rows
FROM student_scores
), percentile_ranks AS (
SELECT
student_id,
score,
rn,
total_rows,
(rn - 1) * 1.0 / (total_rows - 1) AS percentile_rank
FROM ranked_scores
)
select * from percentile_ranks;
Output:
student_id | score | rn | total_rows | percentile_rank |
---|---|---|---|---|
1 | 50 | 1 | 6 | 0 |
2 | 60 | 2 | 6 | 0.2 |
3 | 70 | 3 | 6 | 0.4 |
4 | 80 | 4 | 6 | 0.6 |
5 | 90 | 5 | 6 | 0.8 |
6 | 100 | 6 | 6 | 1 |
Explanation:
- `(rn - 1) * 1.0 / (total_rows - 1)` computes the percentile rank for each score: the first (lowest) score gets rank 0 (0%) and the last (highest) gets rank 1 (100%).
3. Use `FIRST_VALUE` to Get the 90th Percentile:
Now that we have the percentile rank, we can use `FIRST_VALUE()` to capture the first score where the percentile rank is greater than or equal to 0.9 (the 90th percentile).
WITH student_scores AS (
SELECT 1 AS student_id, 50 AS score UNION ALL
SELECT 2, 60 UNION ALL
SELECT 3, 70 UNION ALL
SELECT 4, 80 UNION ALL
SELECT 5, 90 UNION ALL
SELECT 6, 100
),ranked_scores AS (
SELECT
student_id,
score,
ROW_NUMBER() OVER (ORDER BY score) AS rn,
COUNT(*) OVER () AS total_rows
FROM student_scores
),percentile_ranks AS (
SELECT
student_id,
score,
rn,
total_rows,
(rn - 1) * 1.0 / (total_rows - 1) AS percentile_rank
FROM ranked_scores
) SELECT
student_id,
score,
percentile_rank,
FIRST_VALUE(score) OVER (ORDER BY percentile_rank ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS percentile_90
FROM percentile_ranks
WHERE percentile_rank >= 0.9
LIMIT 1;
Output: Result for 90th Percentile
student_id | score | percentile_rank | percentile_90 |
---|---|---|---|
6 | 100 | 1.00 | 100 |
Since the 90th percentile rank corresponds to the last score (100), the output would be: 90th Percentile Value = 100
Explanation:
- `FIRST_VALUE(score) OVER (ORDER BY percentile_rank)` selects the first score for which the percentile rank is greater than or equal to 0.9 (the 90th percentile).
- The `LIMIT 1` clause ensures that we only return the first row that meets the condition (some databases use `FETCH FIRST 1 ROW ONLY` instead).
4. Use `LAG` to Compare Consecutive Rows:
We can further enhance this calculation by using the `LAG` function to compare the score changes between consecutive rows. This helps confirm whether the desired percentile threshold lies between two scores.
WITH student_scores AS (
SELECT 1 AS student_id, 50 AS score UNION ALL
SELECT 2, 60 UNION ALL
SELECT 3, 70 UNION ALL
SELECT 4, 80 UNION ALL
SELECT 5, 90 UNION ALL
SELECT 6, 100
),ranked_scores AS (
SELECT
student_id,
score,
ROW_NUMBER() OVER (ORDER BY score) AS rn,
COUNT(*) OVER () AS total_rows
FROM student_scores
),percentile_ranks AS (
SELECT
student_id,
score,
rn,
total_rows,
(rn - 1) * 1.0 / (total_rows - 1) AS percentile_rank
FROM ranked_scores
), score_changes AS (
SELECT
student_id,
score,
percentile_rank,
LAG(score) OVER (ORDER BY percentile_rank) AS prev_score
FROM percentile_ranks
)
SELECT
student_id,
score,
prev_score,
CASE
WHEN prev_score IS NULL THEN score
ELSE (prev_score + score) / 2
END AS adjusted_percentile
FROM score_changes
WHERE percentile_rank >= 0.9
LIMIT 1;
Output:
Student_Id | Score | Prev_Score | adjusted_percentile |
---|---|---|---|
6 | 100 | 90 | 95 |
Explanation:
- `LAG(score) OVER (ORDER BY percentile_rank)` allows us to compare the current score with the previous score.
- The `CASE` statement calculates an adjusted percentile by averaging the previous and current scores to smooth the transition between rows where needed.
Explanation of Key Functions:
- `FIRST_VALUE()`: retrieves the first value within a specific window or range. In the context of percentile calculations, it helps us get the score that first crosses the desired percentile threshold (e.g., 90%).
- `LAG()`: provides access to the previous row’s value in a specified order. In percentile calculations, `LAG()` can be used to compare the current score with the previous one, which helps in smoothing transitions between percentiles or calculating exact percentiles when data isn’t perfectly distributed.
- `ROW_NUMBER()`: assigns a unique row number to each row in the result set. It’s essential for ordering and calculating percentile ranks.
- `COUNT()`: calculates the total number of rows, which is crucial for calculating percentile ranks.
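Finally, if you just need the number, many databases provide built-in percentile aggregates that replace all of the machinery above; a sketch using `PERCENTILE_CONT` and `PERCENTILE_DISC` (PostgreSQL/Oracle syntax shown; SQL Server exposes these as window functions with a mandatory OVER clause):
WITH student_scores AS (
SELECT 1 AS student_id, 50 AS score UNION ALL
SELECT 2, 60 UNION ALL
SELECT 3, 70 UNION ALL
SELECT 4, 80 UNION ALL
SELECT 5, 90 UNION ALL
SELECT 6, 100
)
SELECT
PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY score) AS p90_interpolated, -- 95: interpolates between 90 and 100
PERCENTILE_DISC(0.9) WITHIN GROUP (ORDER BY score) AS p90_discrete      -- 100: smallest score at or past the 90th percentile
FROM student_scores;
Note how `PERCENTILE_CONT(0.9)` returns 95 here, matching the adjusted value computed with `LAG` above.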
Example 2:
This SQL trick combines several window functions to provide a comprehensive salary analysis:
Solution:
This query provides a rich set of analytics for each employee’s salary, showing where they stand in relation to their peers and the overall salary range.
WITH employee_salaries AS (
SELECT 'Alice' AS name, 50000 AS salary UNION ALL
SELECT 'Bob', 55000 UNION ALL
SELECT 'Charlie', 60000 UNION ALL
SELECT 'David', 65000 UNION ALL
SELECT 'Eve', 70000
),
ranked_salaries AS (
SELECT
name,
salary,
PERCENT_RANK() OVER (ORDER BY salary) AS percentile,
ROW_NUMBER() OVER (ORDER BY salary) AS row_num
FROM employee_salaries
)
SELECT
name,
salary,
percentile,
FIRST_VALUE(salary) OVER (ORDER BY salary ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS min_salary,
LAG(salary, 1) OVER (ORDER BY salary) AS prev_salary,
LEAD(salary, 1) OVER (ORDER BY salary) AS next_salary,
LAST_VALUE(salary) OVER (ORDER BY salary ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS max_salary
FROM ranked_salaries;
Output:
Name | Salary | Percentile | Min_salary | Prev_salary | Next_salary | Max_salary |
---|---|---|---|---|---|---|
Eve | 70,000 | 1 | 50,000 | 65,000 | null | 70,000 |
David | 65,000 | 0.75 | 50,000 | 60,000 | 70,000 | 70,000 |
Charlie | 60,000 | 0.5 | 50,000 | 55,000 | 65,000 | 70,000 |
Bob | 55,000 | 0.25 | 50,000 | 50,000 | 60,000 | 70,000 |
Alice | 50,000 | 0 | 50,000 | null | 55,000 | 70,000 |
Explanation:
- We start with a CTE (`employee_salaries`) containing sample salary data.
- In the `ranked_salaries` CTE, we calculate the percentile rank and row number for each salary.
- The main query then uses:
- `FIRST_VALUE`: to get the minimum salary.
- `LAG`: to get the previous salary in the ordered list.
- `LEAD`: to get the next salary in the ordered list.
- `LAST_VALUE`: to get the maximum salary.
Key Takeaways:
- Percentile Rank Calculation: You calculate the percentile rank by dividing the row position by the total number of rows.
- `FIRST_VALUE` Usage: This function is great for retrieving the value at a particular percentile based on rank or score.
- `LAG` for Comparison: Use `LAG` to compare consecutive rows, helping to smooth transitions or identify where the percentile changes.
This method provides a flexible and efficient way to calculate percentiles in SQL using window functions.
These SQL tricks demonstrate the power and flexibility of window functions in SQL. They allow for complex calculations and analytics to be performed efficiently within a single query.
Learn more about SQL and other topics
- What are the Most Common SQL Tricks Everyone Should Know?
- Learn SQL by Oracle
- Data Preparation With SQL: An Ultimate Cheat Sheet
- Beginner’s SQL Interview Questions: Get Prepared Now!
- Oracle Interview Questions: Must Know Before Your Big Day
- How To Use SQL? An Ultimate Beginners Guide
- Oracle Live SQL: How To Use Oracle Without Downloading?
- Unmasking The Best of SQL: Test Your Expertise! (2024)