Discover expert-level SQL tips and tricks to boost your database querying skills. From advanced window functions to optimizing complex queries, this post covers unique techniques that will enhance your SQL performance and problem-solving capabilities. Perfect for developers and data analysts looking to master SQL!
Table of Contents
- 1. Gaps and Islands Concepts:
- 2. Cumulative Sum with Reset:
- 3. Pivoting Data with Dynamic Columns:
- 4. Finding Missing Dates
- 5. Running Total Using Window Functions
- 6. Detecting Gaps in a Sequential ID
- 7. Finding Overlapping Time Intervals
- 8. Hierarchical Data with Recursive Queries
- 9. Cumulative Distribution Function (CDF) using Window Functions
- 10. Moving Average with Variable Window Size
- 11. Percentile Calculation with FIRST_VALUE and LAG
- Learn more about SQL and other topics
Welcome to our guide on SQL tips and tricks designed to take your database skills to the next level. Whether you’re an experienced developer or just getting started, mastering SQL is essential for efficient data handling and problem-solving. In this blog, we dive into advanced techniques, covering everything from window functions to query optimization. These expert tips will help you write more powerful, efficient queries, improve database performance, and unlock the full potential of SQL. Let’s get started!
1. Gaps and Islands Concepts:
The Gaps and Islands Problem is a common SQL problem that deals with identifying sequences (islands) or gaps in a dataset, typically when working with ranges of values such as dates, IDs, or numbers. It’s often used when you have ordered data, and you want to group together consecutive values or identify missing ones.
- Gaps: These are discontinuities in a sequence. For example, if you have consecutive days of sales data, a day with no sales creates a gap in the sequence.
- Islands: These are consecutive blocks of data without gaps. In the same example, an “island” would be a continuous streak of days where sales were recorded without any missing days in between.
Problem Definition:
Imagine you have a table of transactions with a date column, and you want to identify continuous periods of sales (islands) or gaps in the sales records. The challenge is to recognize which records are part of the same continuous sequence (island) and where there are gaps between these sequences.
Goal:
- Islands: Identify the continuous streaks (islands) of sales.
- Gaps: Find where there are breaks in those streaks (gaps).
Example 1:
Suppose you have a table `sales_data` with the following data:
-- Create and populate sample data
CREATE TABLE sales_data (
sale_date DATE,
amount number
);
--adding rows
INSERT INTO sales_data VALUES(to_date('2024-01-01','YYYY-MM-DD'),100);
INSERT INTO sales_data VALUES(to_date('2024-01-02','YYYY-MM-DD'),150);
INSERT INTO sales_data VALUES(to_date('2024-01-03','YYYY-MM-DD'),200);
INSERT INTO sales_data VALUES(to_date('2024-01-05','YYYY-MM-DD'),180);
INSERT INTO sales_data VALUES(to_date('2024-01-06','YYYY-MM-DD'),220);
INSERT INTO sales_data VALUES(to_date('2024-01-09','YYYY-MM-DD'),130);
sale_date | amount |
---|---|
2024-01-01 | 100 |
2024-01-02 | 150 |
2024-01-03 | 200 |
2024-01-05 | 180 |
2024-01-06 | 220 |
2024-01-09 | 130 |
In this dataset, you can see there are gaps between `2024-01-03` and `2024-01-05`, as well as between `2024-01-06` and `2024-01-09`. The continuous streaks of sales are called “islands.”
Approach:
There are two primary techniques used to solve this problem:
1. Identifying Islands (Continuous Sequences) Using Grouping
The idea is to assign a unique group number to each continuous block of dates.
Step-by-Step Logic:
- ROW_NUMBER(): Assign a row number to each date, ordered by the date itself.
- Create a Group Key: The difference between the actual date and the row number will remain constant for each continuous group of dates, which we can use to identify islands.
Solution:
SQL query to find islands:
WITH sales AS (
SELECT
sale_date,
amount,
ROW_NUMBER() OVER (ORDER BY sale_date) AS rn
FROM sales_data
)
SELECT
MIN(sale_date) AS island_start,
MAX(sale_date) AS island_end
FROM sales
GROUP BY sale_date - rn
ORDER BY island_start;
Explanation:
- `ROW_NUMBER() OVER (ORDER BY sale_date)` assigns a sequential row number based on the `sale_date`.
- `sale_date - rn` creates a constant identifier for consecutive rows: in a continuous sequence, the difference between the actual date and the row number stays the same; when the sequence is interrupted, this difference changes, identifying a new group (island).
- `GROUP BY sale_date - rn` groups the data into “islands” of consecutive dates.
- `MIN(sale_date)` and `MAX(sale_date)` return the start and end dates of each island.
Output:
island_start | island_end |
---|---|
2024-01-01 | 2024-01-03 |
2024-01-05 | 2024-01-06 |
2024-01-09 | 2024-01-09 |
Here, we’ve identified three islands:
- From `2024-01-01` to `2024-01-03`
- From `2024-01-05` to `2024-01-06`
- On `2024-01-09` (a single-day island)
2. Identifying Gaps in the Sequence
To find gaps, you need to look at where there’s a missing day (or other unit of measure) between consecutive records.
Solution:
SQL Query to Find Gaps:
WITH sales AS (
SELECT
sale_date,
LAG(sale_date) OVER (ORDER BY sale_date) AS prev_date
FROM sales_data
)
SELECT
prev_date + 1 AS gap_start,
sale_date - 1 AS gap_end
FROM sales
WHERE sale_date > prev_date + 1;
Explanation:
- `LAG(sale_date)` allows you to access the previous row’s date.
- The `WHERE sale_date > prev_date + 1` condition identifies where a gap starts: if the difference between the current date and the previous date is more than 1 day, there’s a gap.
- `prev_date + 1` and `sale_date - 1` identify the range of missing dates.
Output:
gap_start | gap_end |
---|---|
2024-01-04 | 2024-01-04 |
2024-01-07 | 2024-01-08 |
This identifies two gaps:
- `2024-01-04`
- `2024-01-07` to `2024-01-08`
Example 2:
This example demonstrates how to find consecutive date ranges for employee attendance. It uses a combination of window functions and self-joins to group consecutive dates together.
-- Create and populate sample data
CREATE TABLE employee_attendance (
employee_id INT,
attendance_date DATE
);
INSERT INTO employee_attendance (employee_id, attendance_date) VALUES
(1, '2023-01-01'), (1, '2023-01-02'), (1, '2023-01-03'), (1, '2023-01-05'),
(1, '2023-01-06'), (1, '2023-01-08'), (1, '2023-01-09'), (1, '2023-01-10');
--Some databases do not support the multi-row VALUES syntax above; if so, use the single-row inserts below
/*
INSERT INTO employee_attendance (employee_id, attendance_date) VALUES(1, to_date('2023-01-01','YYYY-MM-DD'));
INSERT INTO employee_attendance (employee_id, attendance_date) VALUES(1, to_date('2023-01-02','YYYY-MM-DD'));
INSERT INTO employee_attendance (employee_id, attendance_date) VALUES(1, to_date('2023-01-03','YYYY-MM-DD'));
INSERT INTO employee_attendance (employee_id, attendance_date) VALUES(1, to_date('2023-01-05','YYYY-MM-DD'));
INSERT INTO employee_attendance (employee_id, attendance_date) VALUES(1, to_date('2023-01-06','YYYY-MM-DD'));
INSERT INTO employee_attendance (employee_id, attendance_date) VALUES(1, to_date('2023-01-08','YYYY-MM-DD'));
INSERT INTO employee_attendance (employee_id, attendance_date) VALUES(1, to_date('2023-01-09','YYYY-MM-DD'));
INSERT INTO employee_attendance (employee_id, attendance_date) VALUES(1, to_date('2023-01-10','YYYY-MM-DD'));
*/
1. Identifying Islands (Continuous Sequences) Using Grouping
The idea is to assign a unique group number to each continuous block of dates. Here we find consecutive date ranges for employee attendance.
-- Query to find consecutive date ranges
WITH numbered_dates AS (
SELECT
employee_id,
attendance_date,
attendance_date - ROW_NUMBER() OVER (PARTITION BY employee_id ORDER BY attendance_date) AS group_date
FROM employee_attendance
)
SELECT
employee_id,
MIN(attendance_date) AS island_start,
MAX(attendance_date) AS island_end,
COUNT(*) AS consecutive_days
FROM numbered_dates
GROUP BY employee_id, group_date
ORDER BY employee_id, gap_start;
Output:
employee_id | island_start | island_end | consecutive_days |
---|---|---|---|
1 | 2023-01-01 | 2023-01-03 | 3 |
1 | 2023-01-05 | 2023-01-06 | 2 |
1 | 2023-01-08 | 2023-01-10 | 3 |
When to Use Gaps and Islands Solutions:
- Gaps: When you need to know when data is missing, such as days with no sales or missing IDs in a sequence.
- Islands: When you want to identify periods of continuous activity, such as streaks of daily sales, user logins, or consecutive event occurrences.
Both methods can be adapted for use in scenarios involving numeric sequences, date ranges, or any ordered data where gaps and sequences need to be identified.
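For instance, here is a minimal sketch of the same island technique applied to a numeric sequence, assuming an integer `id` column such as the `sequential_data` table used in tip 6 below:
-- Islands of consecutive integer ids: id - ROW_NUMBER() is constant
-- within each unbroken run, so grouping by it yields the islands.
WITH numbered AS (
SELECT
id,
id - ROW_NUMBER() OVER (ORDER BY id) AS grp
FROM sequential_data
)
SELECT
MIN(id) AS island_start,
MAX(id) AS island_end
FROM numbered
GROUP BY grp
ORDER BY island_start;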
2. Cumulative Sum with Reset:
The Cumulative Sum with Reset problem involves calculating a running total (cumulative sum) of values in a dataset, but with the added complexity that the sum resets under certain conditions. Typically, this reset happens based on changes in a specific column, such as a category, status, or date change.
This type of problem is commonly seen in scenarios like tracking a running total of sales, but resetting the total when a new month starts, or tracking performance across different categories.
Problem Definition:
Given a table of values, you want to compute the cumulative sum of a column, but you need the sum to restart when a certain condition is met.
Goal: Calculate a cumulative sum of sales (the `amount` column), with the sum resetting each time the `category` changes.
Example 1:
Suppose you have a table of daily sales that looks like this:
-- Create and populate sample data
CREATE TABLE sales_data (
sale_date DATE,
category varchar(2),
amount int
);
--Adding sales data to table
INSERT INTO sales_data VALUES(to_date('2024-01-01','YYYY-MM-DD'),'A',100);
INSERT INTO sales_data VALUES(to_date('2024-01-02','YYYY-MM-DD'),'A',150);
INSERT INTO sales_data VALUES(to_date('2024-01-03','YYYY-MM-DD'),'A',200);
--new category added
INSERT INTO sales_data VALUES(to_date('2024-01-04','YYYY-MM-DD'),'B',300);
INSERT INTO sales_data VALUES(to_date('2024-01-05','YYYY-MM-DD'),'B',100);
--old category again
INSERT INTO sales_data VALUES(to_date('2024-01-06','YYYY-MM-DD'),'A',250);
INSERT INTO sales_data VALUES(to_date('2024-01-07','YYYY-MM-DD'),'A',50);
sale_date | category | amount |
---|---|---|
2024-01-01 | A | 100 |
2024-01-02 | A | 150 |
2024-01-03 | A | 200 |
2024-01-04 | B | 300 |
2024-01-05 | B | 100 |
2024-01-06 | A | 250 |
2024-01-07 | A | 50 |
You want to compute the cumulative sum of sales for each category, but the sum should reset when the category changes.
Approach:
To solve this problem in SQL, we can use window functions with partitioning and conditional logic.
Step-by-Step Logic:
- Window Function: Use `SUM()` as a window function to calculate the cumulative sum.
- Partitioning: Use `PARTITION BY` to group the data by the `category` so that the sum resets for each category.
- Ordering: Use `ORDER BY` to ensure the cumulative sum is calculated in the correct order (e.g., by date or some other column).
Solution:
SELECT
sale_date,
category,
amount,
SUM(amount) OVER (PARTITION BY category ORDER BY sale_date) AS cumulative_sales_amount
FROM sales_data
ORDER BY sale_date;
Explanation:
- `SUM(amount) OVER (PARTITION BY category ORDER BY sale_date)` calculates the cumulative sum within each `category`. The `PARTITION BY` clause ensures a separate running total per category.
- `ORDER BY sale_date` ensures that the cumulative sum is calculated in the correct sequence of dates.
Output:
sale_date | category | amount | cumulative_sales_amount |
---|---|---|---|
2024-01-01 | A | 100 | 100 |
2024-01-02 | A | 150 | 250 |
2024-01-03 | A | 200 | 450 |
2024-01-04 | B | 300 | 300 |
2024-01-05 | B | 100 | 400 |
2024-01-06 | A | 250 | 700 |
2024-01-07 | A | 50 | 750 |
Here’s what’s happening:
- For category `A`, the cumulative sales are computed across the first three rows, summing 100, 150, and 200, giving cumulative values of 100, 250, and 450, respectively.
- When the `category` switches to `B`, a new partition begins, so the cumulative sum starts fresh at 300 and continues to 400 on the next record.
- When the data switches back to `A`, the cumulative sum resumes from A's last calculated value (450): the first value is 700 (450 + 250) on 2024-01-06, and the next day adds 50 to make 750 (on 2024-01-07). Strictly speaking, `PARTITION BY` groups all rows of a category together rather than resetting at every change; if you need a true reset at each change of category, combine it with the gaps-and-islands trick, as shown in the sketch below.
Example 2: Advanced – Cumulative Sum with Reset Based on Another Condition (e.g., Date)
Sometimes, you may want to reset the cumulative sum based on a time-based condition, such as the beginning of a new month or year, instead of a category change.
Let’s assume you want to compute a cumulative sum of sales, but it should reset at the beginning of each month.
--dropping old table
DROP TABLE sales_data;
-- Create and populate sample data
CREATE TABLE sales_data (
sale_date DATE,
category varchar(2),
amount int
);
--Adding sales data to table
INSERT INTO sales_data VALUES(to_date('2024-01-01','YYYY-MM-DD'),'A',100);
INSERT INTO sales_data VALUES(to_date('2024-01-02','YYYY-MM-DD'),'A',150);
INSERT INTO sales_data VALUES(to_date('2024-01-03','YYYY-MM-DD'),'A',200);
/*new month*/
INSERT INTO sales_data VALUES(to_date('2024-02-01','YYYY-MM-DD'),'B',300);
INSERT INTO sales_data VALUES(to_date('2024-02-02','YYYY-MM-DD'),'B',100);
INSERT INTO sales_data VALUES(to_date('2024-02-04','YYYY-MM-DD'),'A',50);
Solution:
SELECT
sale_date,
amount,
SUM(amount) OVER (PARTITION BY YEAR(sale_date), MONTH(sale_date) ORDER BY sale_date) AS cumulative_sales_amount
FROM sales_data
ORDER BY sale_date;
--Some databases (e.g., Oracle) use a different syntax to extract the year and month from a date; use the version below for them
/*
SELECT
sale_date,
amount,
SUM(amount) OVER (PARTITION BY extract(YEAR from sale_date), extract(MONTH from sale_date) ORDER BY sale_date) AS cumulative_sales_amount
FROM sales_data
ORDER BY sale_date;
*/
Output:
sale_date | amount | cumulative_sales_amount |
---|---|---|
2024-01-01 | 100 | 100 |
2024-01-02 | 150 | 250 |
2024-01-03 | 200 | 450 |
2024-02-01 | 300 | 300 |
2024-02-02 | 100 | 400 |
2024-02-04 | 50 | 450 |
Explanation:
- `PARTITION BY YEAR(sale_date), MONTH(sale_date)` groups the rows by each month of each year.
- This query resets the cumulative sum at the start of each month.
Use Cases for Cumulative Sum with Reset:
- Finance: Track cumulative revenue, resetting at the start of each fiscal year or quarter.
- Sales: Calculate running totals for product sales, resetting totals for each new product or category.
- Sports: Accumulate points or scores for players or teams, resetting at the beginning of each season or match.
- Inventory Management: Track inventory stock, resetting the sum when moving to a new product category or warehouse.
Key Takeaways:
- Cumulative sum allows tracking running totals over time or across categories.
- Resetting logic can be applied using the `PARTITION BY` clause, which helps define when the sum should reset (e.g., by category, month, or other criteria).
- Window functions like `SUM()` and `ROW_NUMBER()` are powerful tools for solving problems involving ordered data.
The “Cumulative Sum with Reset” is a crucial pattern for many analytical and reporting tasks, allowing you to handle scenarios where running totals are needed but must restart under specific conditions.
3. Pivoting Data with Dynamic Columns:
This example shows how to create a pivot table for product sales by category, where the category list is read from the data at run time. It uses dynamic SQL and the PostgreSQL-specific crosstab function. Note that a PL/pgSQL function must still declare a fixed output column list, so the technique is dynamic in how it builds the query, not in the function signature.
Example:
Pivoting Data with Dynamic Columns
-- Create a pivot table for product sales by category
-- Create and populate sample data
CREATE TABLE product_sales (
product_id INT,
category VARCHAR(50),
sale_amount DECIMAL(10, 2)
);
INSERT INTO product_sales (product_id, category, sale_amount) VALUES
(1, 'Electronics', 500), (2, 'Clothing', 200), (3, 'Books', 50),
(4, 'Electronics', 300), (5, 'Clothing', 150), (6, 'Food', 100),
(7, 'Books', 75), (8, 'Food', 80), (9, 'Electronics', 600);
-- Dynamic SQL to create pivot table
-- Note: This example uses PostgreSQL-specific syntax and requires the
-- tablefunc extension: CREATE EXTENSION IF NOT EXISTS tablefunc;
-- The declared output columns must match the alphabetical category order
-- produced by the crosstab category query below.
CREATE OR REPLACE FUNCTION pivot_product_sales() RETURNS TABLE (
product_id INT,
Books DECIMAL(10, 2),
Clothing DECIMAL(10, 2),
Electronics DECIMAL(10, 2),
Food DECIMAL(10, 2)
) AS $$
DECLARE
select_cols TEXT; -- quoted column names for the outer SELECT
ct_cols TEXT;     -- "name type" pairs for the crosstab column definition
BEGIN
-- Get distinct categories dynamically, ordered to match crosstab's output
SELECT string_agg(quote_ident(category), ', ' ORDER BY category),
       string_agg(quote_ident(category) || ' DECIMAL(10, 2)', ', ' ORDER BY category)
INTO select_cols, ct_cols
FROM (SELECT DISTINCT category FROM product_sales) c;
-- Execute dynamic SQL
RETURN QUERY EXECUTE format('
SELECT product_id, %s
FROM crosstab(
''SELECT product_id, category, sale_amount
FROM product_sales
ORDER BY 1, 2'',
''SELECT DISTINCT category FROM product_sales ORDER BY 1''
) AS ct(product_id INT, %s)
', select_cols, ct_cols);
END;
$$ LANGUAGE plpgsql;
-- Call the function to get the pivot table
SELECT * FROM pivot_product_sales();
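If you don't need fully dynamic columns, here is a portable sketch of the same pivot using plain conditional aggregation; the category list is hard-coded, but it runs on virtually any SQL engine:
-- Pivot via conditional aggregation: one CASE per category.
-- CASE without ELSE yields NULL, which SUM() ignores.
SELECT
product_id,
SUM(CASE WHEN category = 'Electronics' THEN sale_amount END) AS electronics,
SUM(CASE WHEN category = 'Clothing' THEN sale_amount END) AS clothing,
SUM(CASE WHEN category = 'Books' THEN sale_amount END) AS books,
SUM(CASE WHEN category = 'Food' THEN sale_amount END) AS food
FROM product_sales
GROUP BY product_id
ORDER BY product_id;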
4. Finding Missing Dates
Problem Definition:
Suppose you have a table with daily records and want to find dates that are missing from the sequence.
Goal: Identify missing dates between the minimum and maximum dates in the table.
Example:
Suppose you have a table `daily_sales` with the following data:
-- Drop the old table if it exists, then create and populate sample data
-- (PostgreSQL-style syntax; in Oracle, append FROM dual to each SELECT)
DROP TABLE daily_sales;
CREATE TABLE daily_sales AS
SELECT to_date('2023-01-01','YYYY-MM-DD') AS sales_date, 100 AS sales_amount UNION ALL
SELECT to_date('2023-01-02','YYYY-MM-DD') , 120 UNION ALL
SELECT to_date('2023-01-04','YYYY-MM-DD') , 150 UNION ALL
SELECT to_date('2023-01-05','YYYY-MM-DD') , 110 UNION ALL
SELECT to_date('2023-01-07','YYYY-MM-DD') , 140;
sales_date | sales_amount |
---|---|
2023-01-01 | 100 |
2023-01-02 | 120 |
2023-01-04 | 150 |
2023-01-05 | 110 |
2023-01-07 | 140 |
Solution:
WITH RECURSIVE DateRange (sales_date) AS (
SELECT MIN(sales_date) AS sales_date
FROM daily_sales
UNION ALL
SELECT sales_date + 1 -- date + integer arithmetic; some databases need DATE_ADD or interval syntax
FROM DateRange
WHERE sales_date + 1 <= (SELECT MAX(sales_date) FROM daily_sales)
)
SELECT sales_date AS missing_dates
FROM DateRange
WHERE sales_date NOT IN (SELECT sales_date FROM daily_sales);
Output:
Missing_dates:
2023-01-03
2023-01-06
Explanation:
– The `DateRange` CTE generates a sequence of dates from the minimum date to the maximum date.
– The outer query then finds dates in this sequence that are not present in the `daily_sales` table.
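As an aside, `NOT IN` behaves surprisingly when the subquery can return NULLs; an anti-join sketch of the same query avoids that pitfall (same output on this data):
WITH RECURSIVE DateRange (sales_date) AS (
SELECT MIN(sales_date) FROM daily_sales
UNION ALL
SELECT sales_date + 1
FROM DateRange
WHERE sales_date + 1 <= (SELECT MAX(sales_date) FROM daily_sales)
)
SELECT d.sales_date AS missing_date
FROM DateRange d
LEFT JOIN daily_sales s ON s.sales_date = d.sales_date
WHERE s.sales_date IS NULL; -- keep only generated dates with no matching sales row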
5. Running Total Using Window Functions
Problem Definition:
Compute the daily running total of sales.
Goal: Get the running total of sales by date.
Example:
Suppose you have a table `daily_sales` with the following data:
Date | Sales |
---|---|
2024-09-01 | 100 |
2024-09-02 | 150 |
2024-09-03 | 200 |
2024-09-04 | 250 |
Solution:
SELECT
date,
sales,
SUM(sales) OVER (ORDER BY date) AS running_total
FROM daily_sales;
Explanation:
The `SUM()` window function calculates the running total of the `sales` column, ordered by `date`.
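One subtlety worth knowing: with only `ORDER BY` in the window, the default frame is `RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`, so rows tied on `date` share the same running total. A sketch that pins the frame explicitly to accumulate row by row:
SELECT
date,
sales,
SUM(sales) OVER (
ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW -- row-by-row accumulation, even across ties
) AS running_total
FROM daily_sales;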
6. Detecting Gaps in a Sequential ID
Problem Definition:
You have a table with sequential IDs and want to detect gaps.
Goal: Identify missing IDs in the sequence.
Example:
Suppose you have a table `sequential_data` with the following data:
Id | Value |
---|---|
1 | A |
2 | B |
4 | C |
5 | D |
Solution:
-- The RECURSIVE keyword is required in PostgreSQL and MySQL; SQL Server and Oracle accept plain WITH
WITH RECURSIVE NumberSeries AS (
SELECT MIN(id) AS id
FROM sequential_data
UNION ALL
SELECT id + 1
FROM NumberSeries
WHERE id + 1 <= (SELECT MAX(id) FROM sequential_data)
)
SELECT id
FROM NumberSeries
WHERE id NOT IN (SELECT id FROM sequential_data);
Explanation:
– The `NumberSeries` CTE recursively generates a sequence of IDs from the minimum ID to the maximum ID.
– The outer query then identifies IDs in this sequence that are not present in the `sequential_data` table.
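If your database doesn't support recursive CTEs (or the ID range is huge), here is a sketch of an alternative that reports gap ranges directly with `LAG`:
SELECT
prev_id + 1 AS gap_start,
id - 1 AS gap_end
FROM (
SELECT id, LAG(id) OVER (ORDER BY id) AS prev_id
FROM sequential_data
) t
WHERE id > prev_id + 1; -- a jump larger than 1 means ids are missing in between
On the sample data this returns a single gap row: 3 to 3.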
7. Finding Overlapping Time Intervals
Problem Definition:
Determine if there are overlapping time intervals in a table.
Goal: Find overlapping events.
Example:
Suppose you have a table `events` with the following data:
Event_ID | Start_Time | End_Time |
---|---|---|
1 | 2024-09-01 08:00:00 | 2024-09-01 10:00:00 |
2 | 2024-09-01 09:30:00 | 2024-09-01 11:00:00 |
3 | 2024-09-01 11:30:00 | 2024-09-01 12:30:00 |
Solution:
SELECT e1.event_id AS event1, e2.event_id AS event2
FROM events e1
JOIN events e2
ON e1.event_id < e2.event_id
AND e1.start_time < e2.end_time
AND e1.end_time > e2.start_time;
Explanation:
– The `JOIN` condition ensures that `e1` and `e2` are different events (`e1.event_id < e2.event_id`).
– The overlap condition is checked using the `start_time` and `end_time` of both events.
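Some databases (PostgreSQL, for example) also offer the SQL-standard `OVERLAPS` predicate; a sketch of the same check (boundary semantics for intervals that merely touch can differ slightly between the two formulations, so verify on your engine):
SELECT e1.event_id AS event1, e2.event_id AS event2
FROM events e1
JOIN events e2
ON e1.event_id < e2.event_id
AND (e1.start_time, e1.end_time) OVERLAPS (e2.start_time, e2.end_time);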
8. Hierarchical Data with Recursive Queries
Problem Definition:
Manage hierarchical data, such as an organizational chart.
Goal: Retrieve the hierarchy of employees reporting to a given manager.
Example:
Suppose you have a table `employees` with the following data:
Emp_ID | Emp_Name | Manager_ID |
---|---|---|
1 | Alice | NULL |
2 | Bob | 1 |
3 | Carol | 1 |
4 | Dave | 2 |
Solution:
WITH RECURSIVE EmployeeHierarchy AS (
SELECT emp_id, emp_name, manager_id
FROM employees
WHERE manager_id IS NULL -- Start with top-level managers
UNION ALL
SELECT e.emp_id, e.emp_name, e.manager_id
FROM employees e
INNER JOIN EmployeeHierarchy eh ON e.manager_id = eh.emp_id
)
SELECT *
FROM EmployeeHierarchy;
Explanation:
– The `EmployeeHierarchy` CTE starts with the top-level managers (`manager_id IS NULL`).
– It recursively joins with `employees` to include subordinates.
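Where the hierarchy's shape matters, here is a sketch extending the same recursive CTE with a depth counter and a breadcrumb path (the `lvl` and `path` columns are illustrative additions; the casts keep the anchor and recursive members type-compatible, and `||` concatenation follows PostgreSQL/Oracle, while SQL Server uses `+`):
WITH RECURSIVE EmployeeHierarchy AS (
SELECT emp_id, emp_name, manager_id,
1 AS lvl,                               -- depth in the tree
CAST(emp_name AS VARCHAR(4000)) AS path -- breadcrumb from the root
FROM employees
WHERE manager_id IS NULL
UNION ALL
SELECT e.emp_id, e.emp_name, e.manager_id,
eh.lvl + 1,
CAST(eh.path || ' > ' || e.emp_name AS VARCHAR(4000))
FROM employees e
INNER JOIN EmployeeHierarchy eh ON e.manager_id = eh.emp_id
)
SELECT * FROM EmployeeHierarchy
ORDER BY path;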
9. Cumulative Distribution Function (CDF) using Window Functions
The Cumulative Distribution Function (CDF) is a statistical concept that represents the probability that a random variable will take a value less than or equal to a specific value. In SQL, you can compute the CDF to analyze data distributions, often using window functions. This allows you to determine what proportion of the data lies below or at a certain threshold.
Problem Definition:
Given a dataset of numeric values (e.g., sales, scores, or prices), calculate the CDF for each value. The CDF for a value is the percentage of values in the dataset that are less than or equal to that value.
Goal: You want to calculate the Cumulative Distribution Function (CDF), which shows the percentage of products priced at or below each price point.
Example 1:
Imagine you have a table of product prices, `product_prices`, like this:
product_id | price |
---|---|
1 | 50 |
2 | 60 |
3 | 70 |
4 | 80 |
5 | 90 |
6 | 100 |
Step-by-Step Solution:
To calculate the CDF in SQL, we will follow these steps:
- Sort the data: Arrange the values in ascending order.
- Calculate the row rank: Assign a rank to each row based on the order of values.
- Calculate cumulative count: Calculate the cumulative count of rows that have a value less than or equal to the current row.
- Normalize the cumulative count: Convert the cumulative count into a percentage of the total number of rows, which represents the CDF.
1. Calculate Row Rank Using `ROW_NUMBER()` or `RANK()`:
We first assign a rank to each value. For this example, let’s assume you want to calculate the rank of each product’s price, ordered from lowest to highest.
WITH ranked_prices AS (
SELECT
product_id,
price,
ROW_NUMBER() OVER (ORDER BY price) AS rn,
COUNT(*) OVER () AS total_rows
FROM product_prices
)
SELECT * FROM ranked_prices;
Explanation:
- `ROW_NUMBER() OVER (ORDER BY price)` assigns a unique row number to each product based on its price, starting with the lowest price.
- `COUNT(*) OVER ()` calculates the total number of rows in the dataset, which we need to normalize the cumulative count into a percentage.
2. Calculate the Cumulative Count:
Next, we use the `COUNT()` window function to compute the cumulative count of rows where the price is less than or equal to the current row’s price.
WITH cumulative_count AS (
SELECT
product_id,
price,
COUNT(*) OVER (ORDER BY price ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_count,
total_rows
FROM ranked_prices -- builds on the ranked_prices CTE from step 1
)
SELECT * FROM cumulative_count;
Explanation:
- `COUNT(*) OVER (ORDER BY price ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)` calculates the cumulative count of rows up to and including the current row, ordered by price. This represents how many products have a price less than or equal to the current product’s price.
3. Calculate the CDF:
Now that we have the cumulative count, we can compute the CDF by dividing the cumulative count by the total number of rows. This gives the percentage of rows that have a value less than or equal to the current price.
SELECT
product_id,
price,
cumulative_count,
(cumulative_count * 1.0 / total_rows) AS cdf
FROM cumulative_count
ORDER BY price;
Explanation:
- `(cumulative_count * 1.0 / total_rows)` converts the cumulative count into a fraction of the total rows, which represents the CDF value. Multiplying by `1.0` ensures the division produces a decimal (floating-point) result.
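For reference, the three fragments above chain into a single runnable statement; here is a sketch that includes setup for the assumed `product_prices` table:
-- Assumed setup matching the sample data above
CREATE TABLE product_prices (product_id INT, price INT);
INSERT INTO product_prices VALUES (1, 50), (2, 60), (3, 70), (4, 80), (5, 90), (6, 100);
-- Steps 1-3 combined
WITH ranked_prices AS (
SELECT
product_id,
price,
ROW_NUMBER() OVER (ORDER BY price) AS rn,
COUNT(*) OVER () AS total_rows
FROM product_prices
), cumulative_count AS (
SELECT
product_id,
price,
COUNT(*) OVER (ORDER BY price ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_count,
total_rows
FROM ranked_prices
)
SELECT
product_id,
price,
cumulative_count,
(cumulative_count * 1.0 / total_rows) AS cdf
FROM cumulative_count
ORDER BY price;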
Final Output:
product_id | price | cumulative_count | cdf |
---|---|---|---|
1 | 50 | 1 | 0.167 |
2 | 60 | 2 | 0.333 |
3 | 70 | 3 | 0.500 |
4 | 80 | 4 | 0.667 |
5 | 90 | 5 | 0.833 |
6 | 100 | 6 | 1.000 |
Explanation of the Result:
- For a product with a price of 50, the cumulative count is 1, and since there are 6 rows total, the CDF is `1 / 6 = 0.167` (16.7%).
- For a price of 100, the cumulative count is 6, and the CDF is `6 / 6 = 1.0` (100%), meaning all products are priced at or below 100.
Key SQL Window Functions Explained:
- `ROW_NUMBER()`: assigns a unique number to each row based on the price. This ranks the rows but does not handle ties (i.e., products with the same price will have different ranks). If you want to handle ties, use `RANK()` or `DENSE_RANK()`.
- `COUNT() OVER (ORDER BY price ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)`: calculates the cumulative count, which is the count of all previous rows up to and including the current row. It's the key building block for cumulative metrics like the CDF.
- `COUNT(*) OVER ()`: calculates the total number of rows in the table, allowing us to normalize the cumulative count into a proportion of the whole dataset.
- Division: dividing the cumulative count by the total number of rows yields the cumulative distribution value (0 to 1, or 0 to 100% if multiplied by 100).
Key Considerations:
- Ties: If your data contains ties (i.e., multiple rows with the same value), the `ROW_NUMBER()` function will assign a unique rank to each, meaning that tied values will have slightly different CDF values. If you want ties to share the same rank, use `RANK()` or `DENSE_RANK()` instead.
- Precision: The CDF values are typically represented as decimals between 0 and 1. If you want percentages, multiply the result by 100.
- Performance: CDF calculations using window functions can be resource-intensive for large datasets, so performance optimizations (such as indexing or partitioning) may be necessary depending on your data volume.
Example 2:
This SQL trick calculates the Cumulative Distribution Function (CDF) for product sales.
Solution:
The CDF shows the proportion of values below a certain point. This can be useful for understanding the distribution of sales across products.
WITH sales_data AS (
SELECT 'Product A' AS product, 100 AS sales
UNION ALL SELECT 'Product B', 200
UNION ALL SELECT 'Product C', 150
UNION ALL SELECT 'Product D', 300
UNION ALL SELECT 'Product E', 250
)
SELECT
product,
sales,
sales * 1.0 / SUM(sales) OVER () AS sales_percentage, -- * 1.0 avoids integer division
SUM(sales) OVER (ORDER BY sales) * 1.0 / SUM(sales) OVER () AS cdf
FROM sales_data
ORDER BY sales;
Explanation:
Here’s how it works:
- We create a CTE (`sales_data`) with sample product sales data.
- In the main query, we calculate:
- `sales_percentage`: each product's sales as a percentage of total sales.
- `cdf`: the cumulative sum of sales divided by total sales, ordered by sales amount.
Key Takeaways:
- The Cumulative Distribution Function (CDF) is useful for understanding the distribution of data and determining the proportion of values that lie below a given threshold.
- SQL window functions like `COUNT()`, `ROW_NUMBER()`, and `RANK()` make it possible to calculate the CDF by providing cumulative counts and ranks.
- Normalizing the cumulative count by the total number of rows yields the CDF value, showing the percentage of the dataset that falls below a certain point.
This technique is widely used in analytics and can help provide valuable insights into how values are distributed within a dataset.
10. Moving Average with Variable Window Size
A moving average (also known as a running average) is a technique used to smooth out fluctuations in data by calculating the average of a subset of values (window) that changes as you move through the data. In many scenarios, the window size (the number of rows or time intervals) is fixed. However, in more complex cases, you may want the window size to be variable, meaning the number of rows included in the average can change dynamically.
This SQL problem requires using window functions to calculate a moving average but with flexibility in determining the size of the window used for the calculation.
Problem Definition:
Given a table of data, calculate the moving average of a column (like sales or prices), but with a window size that can vary based on conditions (like date ranges or number of preceding rows).
Example:
Suppose you have a table of daily sales that looks like this:
sale_date | sales |
---|---|
2024-01-01 | 100 |
2024-01-02 | 150 |
2024-01-03 | 200 |
2024-01-04 | 250 |
2024-01-05 | 300 |
2024-01-06 | 350 |
2024-01-07 | 400 |
You want to calculate a moving average of the past 3 days (a fixed window size of 3), but you also want to handle cases where the window size may vary, such as when there are fewer than 3 preceding days or when conditions, like business logic, determine the window size.
Fixed vs Variable Window Size:
- Fixed Window Size: A constant window size is applied uniformly across all rows (e.g., a moving average over the last 3 days).
- Variable Window Size: The window size can change for each row based on conditions, such as a dynamically determined number of previous days or rows.
1. Moving Average with Fixed Window Size (for Reference):
Let’s first calculate the moving average using a fixed window size of 3 days.
Solution: (Fixed Window)
SELECT
sale_date,
sales,
AVG(sales) OVER (ORDER BY sale_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg
FROM sales_data
ORDER BY sale_date;
Explanation:
- `ROWS BETWEEN 2 PRECEDING AND CURRENT ROW`: creates a fixed window of 3 rows, including the current row and the previous two.
- `AVG(sales) OVER (ORDER BY sale_date ...)`: calculates the average of sales within that fixed window.
Output (Fixed Window):
sale_date | sales | moving_avg |
---|---|---|
2024-01-01 | 100 | 100 |
2024-01-02 | 150 | 125 |
2024-01-03 | 200 | 150 |
2024-01-04 | 250 | 200 |
2024-01-05 | 300 | 250 |
2024-01-06 | 350 | 300 |
2024-01-07 | 400 | 350 |
2. Moving Average with Variable Window Size:
Now, let’s calculate the moving average with a variable window size. The window size could depend on different factors, such as:
- A changing number of previous days (e.g., vary the range of dates).
- Dynamic business logic (e.g., only use rows where the sales were above a certain threshold).
Example 1: Moving Average Based on a Variable Number of Previous Days
In this case, we want the moving average to include all sales from the past `N` days dynamically (e.g., based on a specific time period).
Solution: (Variable Date-Based Window)
SELECT
sale_date,
sales,
AVG(sales) OVER (ORDER BY sale_date RANGE BETWEEN INTERVAL '2' DAY PRECEDING AND CURRENT ROW) AS moving_avg
FROM sales_data
ORDER BY sale_date;
Explanation:
- `RANGE BETWEEN INTERVAL '2' DAY PRECEDING AND CURRENT ROW`: instead of using a fixed number of rows, this uses a date-based window. It includes all rows within the last 2 days (plus the current date).
- `AVG(sales)`: calculates the moving average of the sales over that variable window of dates.
Output (Variable Window Based on Days):
sale_date | sales | moving_avg |
---|---|---|
2024-01-01 | 100 | 100 |
2024-01-02 | 150 | 125 |
2024-01-03 | 200 | 150 |
2024-01-04 | 250 | 200 |
2024-01-05 | 300 | 250 |
2024-01-06 | 350 | 300 |
2024-01-07 | 400 | 350 |
In this case, the window is defined by the date range, not the number of rows. For each row, the moving average is calculated over the sales from the current date and the previous 2 days. Because the sample dates are consecutive, the result matches the fixed 3-row window above; the two diverge as soon as the dates have gaps, since missing days simply shrink the date-based window.
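Dialect note: the `INTERVAL '2' DAY` frame above follows Oracle's syntax; PostgreSQL (11+) spells the same date-based frame slightly differently. A sketch:
SELECT
sale_date,
sales,
AVG(sales) OVER (
ORDER BY sale_date
RANGE BETWEEN INTERVAL '2 days' PRECEDING AND CURRENT ROW -- PostgreSQL interval literal
) AS moving_avg
FROM sales_data
ORDER BY sale_date;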
Example 2: Moving Average with Business Logic-Based Variable Window
You might want to compute a moving average but only for the rows that satisfy certain business conditions. For example, you want to include all previous sales in the moving average as long as the sales are above 200.
Solution: (Conditional Window)
This scenario requires filtering rows inside the window function, which standard SQL cannot directly do. However, we can work around this by using conditional aggregation inside a window function.
SELECT
sale_date,
sales,
SUM(CASE WHEN sales > 200 THEN sales ELSE 0 END) OVER (ORDER BY sale_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) /
NULLIF(SUM(CASE WHEN sales > 200 THEN 1 ELSE 0 END) OVER (ORDER BY sale_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), 0) AS conditional_moving_avg
FROM sales_data
ORDER BY sale_date;
Explanation:
- `CASE WHEN sales > 200 THEN sales ELSE 0 END`: ensures that only sales greater than 200 are considered for the sum.
- `NULLIF(SUM(CASE WHEN sales > 200 THEN 1 ELSE 0 END), 0)`: counts how many rows met the condition (`sales > 200`); `NULLIF(..., 0)` ensures we don’t divide by zero.
- `OVER (ORDER BY sale_date ...)`: ensures that the sum and count are computed as running totals across the dataset up to the current row.
Output (Conditional Variable Window):
sale_date | sales | conditional_moving_avg |
---|---|---|
2024-01-01 | 100 | NULL |
2024-01-02 | 150 | NULL |
2024-01-03 | 200 | NULL |
2024-01-04 | 250 | 250 |
2024-01-05 | 300 | 275 |
2024-01-06 | 350 | 300 |
2024-01-07 | 400 | 325 |
Here, the moving average only considers rows where sales are greater than 200, so the moving average starts on 2024-01-04, when the first sales value exceeds 200.
Example 3:
This trick demonstrates a moving average with a variable window size based on whether the current day is a weekend or weekday.
Solution:
This approach allows for different smoothing effects on weekends vs. weekdays, which can be useful when sales patterns differ between these periods.
--Works in Oracle
WITH daily_sales AS (
SELECT to_date('2023-01-01','yyyy-mm-dd') AS sales_date, 100 AS sales_amount from dual
UNION ALL SELECT to_date('2023-01-02','yyyy-mm-dd'), 120 from dual
UNION ALL SELECT to_date('2023-01-03','yyyy-mm-dd'), 80 from dual
UNION ALL SELECT to_date('2023-01-04','yyyy-mm-dd'), 150 from dual
UNION ALL SELECT to_date('2023-01-05','yyyy-mm-dd'), 110 from dual
UNION ALL SELECT to_date('2023-01-06','yyyy-mm-dd'), 130 from dual
UNION ALL SELECT to_date('2023-01-07','yyyy-mm-dd'), 140 from dual
)
SELECT
sales_date,
sales_amount,
AVG(sales_amount) OVER (
ORDER BY sales_date
ROWS BETWEEN CASE WHEN TO_CHAR(sales_date, 'DY') IN ('SUN', 'SAT') THEN 1 ELSE 2 END PRECEDING AND CURRENT ROW
) AS moving_avg
FROM daily_sales;
Output:
SALES_DATE | SALES_AMOUNT | MOVING_AVG |
---|---|---|
01-JAN-23 | 100 | 100 |
02-JAN-23 | 120 | 110 |
03-JAN-23 | 80 | 100 |
04-JAN-23 | 150 | 116.67 |
05-JAN-23 | 110 | 113.33 |
06-JAN-23 | 130 | 130 |
07-JAN-23 | 140 | 135 |
Explanation:
Here’s the breakdown:
- We create a CTE (`daily_sales`) with sample daily sales data.
- In the main query, we use a window function with a CASE expression in the ROWS BETWEEN clause:
- For weekends (Saturday, Sunday), we use a 2-day window (current day and 1 preceding).
- For weekdays, we use a 3-day window (current day and 2 preceding).
- Note that `TO_CHAR(sales_date, 'DY')` returns language-dependent day abbreviations, so the result depends on the session’s NLS_DATE_LANGUAGE setting.
Key Takeaways:
- Fixed Window: Standard moving average with a fixed number of preceding rows.
- Variable Window (Date-Based): The window size can be based on time intervals, allowing you to define the window dynamically based on date ranges.
- Conditional Window (Logic-Based): By using conditional logic within window functions, you can vary which rows contribute to the moving average based on specific conditions.
- Window Functions: SQL window functions like `AVG()` and `SUM()` combined with `ROWS` or `RANGE` clauses allow for flexible moving-average calculations.
The ability to dynamically change the window size based on conditions makes SQL window functions powerful for complex data analysis scenarios.
11. Percentile Calculation with FIRST_VALUE and LAG
Percentile calculations are used in statistical analysis to determine the value below which a given percentage of observations in a dataset falls. In SQL, percentiles can be calculated in several ways, including with window functions like `FIRST_VALUE` and `LAG`. These functions are typically combined with ranking or cumulative-distribution techniques to identify the value at a specific percentile.
In this section, we’ll explore how to calculate percentiles using the `FIRST_VALUE` and `LAG` functions in SQL, along with the logical steps involved.
Problem Definition
Given a dataset of values (e.g., sales, scores, or prices), calculate the value at a specific percentile, such as the 90th percentile. In SQL, this can be done by leveraging window functions like `FIRST_VALUE` (to access values in a defined window) and `LAG` (to get previous rows in an ordered sequence).
Goal: You want to find the value at the 90th percentile, which means 90% of the students have a score below this value.
Example 1:
Imagine you have a table of student scores that looks like this:
WITH student_scores AS (
SELECT 1 AS student_id, 50 AS score UNION ALL
SELECT 2, 60 UNION ALL
SELECT 3, 70 UNION ALL
SELECT 4, 80 UNION ALL
SELECT 5, 90 UNION ALL
SELECT 6, 100
)
select * from student_scores;
student_id | score |
---|---|
1 | 50 |
2 | 60 |
3 | 70 |
4 | 80 |
5 | 90 |
6 | 100 |
Approach Using `FIRST_VALUE` and `LAG`:
To calculate the percentile, we’ll follow these steps:
- Sort the data: Arrange the data in ascending or descending order of scores.
- Assign row numbers: Assign row numbers or ranks to each row in the dataset.
- Calculate the percentile rank: Compute the percentile rank of each row, which indicates what percentage of the dataset falls below that row.
- Use `FIRST_VALUE` to get the percentile value: Once we’ve computed the percentile rank, use `FIRST_VALUE` to select the score that corresponds to the desired percentile (e.g., the 90th percentile).
- Use `LAG` to compare: You can use `LAG` to check how the score changes between consecutive rows and see where the percentile threshold lies.
Step-by-Step SQL Solution:
1. Assign Row Numbers Using `ROW_NUMBER()`:
First, we assign a row number to each row, ordered by the `score`. This will help in determining the position of each score in the dataset.
WITH student_scores AS (
SELECT 1 AS student_id, 50 AS score UNION ALL
SELECT 2, 60 UNION ALL
SELECT 3, 70 UNION ALL
SELECT 4, 80 UNION ALL
SELECT 5, 90 UNION ALL
SELECT 6, 100
),ranked_scores AS (
SELECT
student_id,
score,
ROW_NUMBER() OVER (ORDER BY score) AS rn,
COUNT(*) OVER () AS total_rows
FROM student_scores
)
select * from ranked_scores;
Output:
student_id | score | rn | total_rows |
---|---|---|---|
1 | 50 | 1 | 6 |
2 | 60 | 2 | 6 |
3 | 70 | 3 | 6 |
4 | 80 | 4 | 6 |
5 | 90 | 5 | 6 |
6 | 100 | 6 | 6 |
Explanation:
- `ROW_NUMBER() OVER (ORDER BY score)` assigns a sequential number to each row based on the `score`, ordered from smallest to largest.
- `COUNT(*) OVER ()` calculates the total number of rows in the dataset.
2. Calculate the Percentile Rank:
Next, we calculate the percentile rank for each row. This is done by dividing the row number by the total number of rows.
WITH student_scores AS (
SELECT 1 AS student_id, 50 AS score UNION ALL
SELECT 2, 60 UNION ALL
SELECT 3, 70 UNION ALL
SELECT 4, 80 UNION ALL
SELECT 5, 90 UNION ALL
SELECT 6, 100
), ranked_scores AS (
SELECT
student_id,
score,
ROW_NUMBER() OVER (ORDER BY score) AS rn,
COUNT(*) OVER () AS total_rows
FROM student_scores
), percentile_ranks AS (
SELECT
student_id,
score,
rn,
total_rows,
(rn - 1) * 1.0 / (total_rows - 1) AS percentile_rank
FROM ranked_scores
)
select * from percentile_ranks;
Output:
student_id | score | rn | total_rows | percentile_rank |
---|---|---|---|---|
1 | 50 | 1 | 6 | 0 |
2 | 60 | 2 | 6 | 0.2 |
3 | 70 | 3 | 6 | 0.4 |
4 | 80 | 4 | 6 | 0.6 |
5 | 90 | 5 | 6 | 0.8 |
6 | 100 | 6 | 6 | 1 |
Explanation:
- `(rn - 1) * 1.0 / (total_rows - 1)` computes the percentile rank for each score: the first (lowest) score gets rank 0 (0%) and the last (highest) gets rank 1 (100%).
3. Use `FIRST_VALUE` to Get the 90th Percentile:
Now that we have the percentile rank, we can use `FIRST_VALUE()` to capture the first score where the percentile rank is greater than or equal to 0.9 (the 90th percentile).
WITH student_scores AS (
SELECT 1 AS student_id, 50 AS score UNION ALL
SELECT 2, 60 UNION ALL
SELECT 3, 70 UNION ALL
SELECT 4, 80 UNION ALL
SELECT 5, 90 UNION ALL
SELECT 6, 100
),ranked_scores AS (
SELECT
student_id,
score,
ROW_NUMBER() OVER (ORDER BY score) AS rn,
COUNT(*) OVER () AS total_rows
FROM student_scores
),percentile_ranks AS (
SELECT
student_id,
score,
rn,
total_rows,
(rn - 1) * 1.0 / (total_rows - 1) AS percentile_rank
FROM ranked_scores
) SELECT
student_id,
score,
percentile_rank,
FIRST_VALUE(score) OVER (ORDER BY percentile_rank ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS percentile_90
FROM percentile_ranks
WHERE percentile_rank >= 0.9
LIMIT 1;
Output: Result for 90th Percentile
student_id | score | percentile_rank | percentile_90 |
---|---|---|---|
6 | 100 | 1.00 | 100 |
Since the 90th percentile rank corresponds to the last score (100), the output would be: 90th Percentile Value = 100
Explanation:
- `FIRST_VALUE(score) OVER (ORDER BY percentile_rank)` selects the first score for which the percentile rank is greater than or equal to 0.9 (the 90th percentile).
- The `LIMIT 1` clause ensures that we only return the first row that meets the condition (some databases use `FETCH FIRST 1 ROW ONLY` instead).
4. Use `LAG` to Compare Consecutive Rows:
We can further enhance this calculation by using the `LAG` function to compare the score changes between consecutive rows. This helps confirm whether the desired percentile threshold lies between two scores.
WITH student_scores AS (
SELECT 1 AS student_id, 50 AS score UNION ALL
SELECT 2, 60 UNION ALL
SELECT 3, 70 UNION ALL
SELECT 4, 80 UNION ALL
SELECT 5, 90 UNION ALL
SELECT 6, 100
),ranked_scores AS (
SELECT
student_id,
score,
ROW_NUMBER() OVER (ORDER BY score) AS rn,
COUNT(*) OVER () AS total_rows
FROM student_scores
),percentile_ranks AS (
SELECT
student_id,
score,
rn,
total_rows,
(rn - 1) * 1.0 / (total_rows - 1) AS percentile_rank
FROM ranked_scores
), score_changes AS (
SELECT
student_id,
score,
percentile_rank,
LAG(score) OVER (ORDER BY percentile_rank) AS prev_score
FROM percentile_ranks
)
SELECT
student_id,
score,
prev_score,
CASE
WHEN prev_score IS NULL THEN score
ELSE (prev_score + score) / 2
END AS adjusted_percentile
FROM score_changes
WHERE percentile_rank >= 0.9
LIMIT 1;
Output:
Student_Id | Score | Prev_Score | adjusted_percentile |
---|---|---|---|
6 | 100 | 90 | 95 |
Explanation:
- `LAG(score) OVER (ORDER BY percentile_rank)` allows us to compare the current score with the previous score.
- The `CASE` statement calculates an adjusted percentile by averaging the previous and current scores to smooth the transition between rows where needed.
Explanation of Key Functions:
- `FIRST_VALUE()`: retrieves the first value within a specific window or range. In the context of percentile calculations, it helps us get the score that first crosses the desired percentile threshold (e.g., 90%).
- `LAG()`: provides access to the previous row’s value in a specified order. In percentile calculations, `LAG()` can be used to compare the current score with the previous one, which helps in smoothing transitions between percentiles or calculating exact percentiles when data isn’t perfectly distributed.
- `ROW_NUMBER()`: assigns a unique row number to each row in the result set. It’s essential for ordering and calculating percentile ranks.
- `COUNT()`: calculates the total number of rows, which is crucial for calculating percentile ranks.
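Finally, if you just need the number, many databases provide built-in percentile aggregates that replace all of the machinery above; a sketch using `PERCENTILE_CONT` and `PERCENTILE_DISC` (PostgreSQL/Oracle syntax shown; SQL Server exposes these as window functions with a mandatory OVER clause):
WITH student_scores AS (
SELECT 1 AS student_id, 50 AS score UNION ALL
SELECT 2, 60 UNION ALL
SELECT 3, 70 UNION ALL
SELECT 4, 80 UNION ALL
SELECT 5, 90 UNION ALL
SELECT 6, 100
)
SELECT
PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY score) AS p90_interpolated, -- 95: interpolates between 90 and 100
PERCENTILE_DISC(0.9) WITHIN GROUP (ORDER BY score) AS p90_discrete      -- 100: smallest score at or past the 90th percentile
FROM student_scores;
Note how `PERCENTILE_CONT(0.9)` returns 95 here, matching the adjusted value computed with `LAG` above.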
Example 2:
This SQL trick combines several window functions to provide a comprehensive salary analysis:
Solution:
This query provides a rich set of analytics for each employee’s salary, showing where they stand in relation to their peers and the overall salary range.
WITH employee_salaries AS (
SELECT 'Alice' AS name, 50000 AS salary UNION ALL
SELECT 'Bob', 55000 UNION ALL
SELECT 'Charlie', 60000 UNION ALL
SELECT 'David', 65000 UNION ALL
SELECT 'Eve', 70000
),
ranked_salaries AS (
SELECT
name,
salary,
PERCENT_RANK() OVER (ORDER BY salary) AS percentile,
ROW_NUMBER() OVER (ORDER BY salary) AS row_num
FROM employee_salaries
)
SELECT
name,
salary,
percentile,
FIRST_VALUE(salary) OVER (ORDER BY salary ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS min_salary,
LAG(salary, 1) OVER (ORDER BY salary) AS prev_salary,
LEAD(salary, 1) OVER (ORDER BY salary) AS next_salary,
LAST_VALUE(salary) OVER (ORDER BY salary ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS max_salary
FROM ranked_salaries;
Output:
Name | Salary | Percentile | Min_salary | Prev_salary | Next_salary | Max_salary |
---|---|---|---|---|---|---|
Eve | 70,000 | 1 | 50,000 | 65,000 | null | 70,000 |
David | 65,000 | 0.75 | 50,000 | 60,000 | 70,000 | 70,000 |
Charlie | 60,000 | 0.5 | 50,000 | 55,000 | 65,000 | 70,000 |
Bob | 55,000 | 0.25 | 50,000 | 50,000 | 60,000 | 70,000 |
Alice | 50,000 | 0 | 50,000 | null | 55,000 | 70,000 |
Explanation:
- We start with a CTE (`employee_salaries`) containing sample salary data.
- In the `ranked_salaries` CTE, we calculate the percentile rank and row number for each salary.
- The main query then uses:
- `FIRST_VALUE`: to get the minimum salary.
- `LAG`: to get the previous salary in the ordered list.
- `LEAD`: to get the next salary in the ordered list.
- `LAST_VALUE`: to get the maximum salary.
Key Takeaways:
- Percentile Rank Calculation: You calculate the percentile rank by dividing the row position by the total number of rows.
- `FIRST_VALUE` Usage: This function is great for retrieving the value at a particular percentile based on rank or score.
- `LAG` for Comparison: Use `LAG` to compare consecutive rows, helping to smooth transitions or identify where the percentile changes.
This method provides a flexible and efficient way to calculate percentiles in SQL using window functions.
These SQL tricks demonstrate the power and flexibility of window functions in SQL. They allow for complex calculations and analytics to be performed efficiently within a single query.
Learn more about SQL and other topics
- What are the Most Common SQL Tricks Everyone Should Know?
- Learn SQL by Oracle
- Data Preparation With SQL: An Ultimate Cheat Sheet
- Beginner’s SQL Interview Questions: Get Prepared Now!
- Oracle Interview Questions: Must Know Before Your Big Day
- How To Use SQL? An Ultimate Beginners Guide
- Oracle Live SQL: How To Use Oracle Without Downloading?
- Unmasking The Best of SQL: Test Your Expertise! (2024)