SQL Analytical Functions: An Ultimate Guide To Data Analysis

Spread the knowledge

Ever wished you could perform complex calculations across your data while preserving every detail? Enter SQL Analytical Functions. These powerhouse features give you the best of both worlds: the ability to analyze groups of related rows without sacrificing the granularity of your data that traditional aggregate functions often require.

Table of Content

Introduction
Understanding Analytical Functions Syntax
- Basic Analytical Functions Structure
- Frame Clause Syntax
Common Analytical Functions
Understanding Aggregate Functions as Analytical Functions
Best Practices for Analytical Functions
Conclusion
FAQs
Learn more about SQL and other topics

Introduction

SQL Analytical Functions, also known as window functions, are powerful tools that perform calculations across a set of rows related to the current row. Unlike regular aggregate functions, analytical functions don’t collapse the result set, allowing you to maintain the individual row details while performing complex calculations.

Understanding Analytical Functions Syntax

Basic Analytical Functions Structure

FUNCION_NAME(<arguments>) 
OVER ([PARTITION BY column1[, column2,...]] 
      [ORDER BY column1 [ASC|DESC][, column2,...]] 
      [ROWS|RANGE frame_clause])

Frame Clause Syntax

[ROWS | RANGE] BETWEEN frame_start AND frame_end

frame_start and frame_end options:
- UNBOUNDED PRECEDING
- n PRECEDING
- CURRENT ROW
- n FOLLOWING
- UNBOUNDED FOLLOWING

Common Analytical Functions

1. Ranking Functions

ROW_NUMBER()

Assigns a unique number to each row within a partition based on the order by clause.

Syntax

ROW_NUMBER()
OVER (
[PARTITION BY partition_expression]
ORDER BY sort_expression [ASC|DESC]
)

Example: Rank employees by salary within each department

-- Example: Rank employees by salary within each department
SELECT 
    employee_id,
    department_id,
    salary,
    ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) as salary_rank
FROM employees;

-- Output:
-- EMPLOYEE_ID  DEPARTMENT_ID  SALARY  SALARY_RANK
-- 100         10            5000     1
-- 101         10            4500     2
-- 102         20            6000     1
-- 103         20            4800     2

Explanation:

PARTITION BY department_id breaks the data into groups by department_id.
ROW_NUMBER() gives a unique number within each partition.

RANK() and DENSE_RANK()

RANK() leaves gaps for ties, while DENSE_RANK() provides consecutive rankings.

Syntax

RANK()|DENSE_RANK() 
OVER (
    [PARTITION BY partition_expression]
    ORDER BY sort_expression [ASC|DESC]
)

Example: Compare RANK vs DENSE_RANK for employee salaries

-- Example: Compare RANK vs DENSE_RANK for employee salaries
SELECT 
    employee_id,
    salary,
    RANK() OVER (ORDER BY salary DESC) as salary_rank,
    DENSE_RANK() OVER (ORDER BY salary DESC) as salary_dense_rank
FROM employees;

-- Output:
-- EMPLOYEE_ID  SALARY  SALARY_RANK  SALARY_DENSE_RANK
-- 102         6000    1            1
-- 100         5000    2            2
-- 103         4800    3            3
-- 101         4500    4            4
-- 104         4500    4            4
-- 105         4200    6            5
-- 106         4000    7            6

Explanation:

RANK() gives the rank of each row within a partition, with gaps for ties.
If two rows have the same salary, they’ll receive the same rank, and the next rank will be skipped.
In case of DENSE_RANK()you will have no gaps between ranks i.e if two rows have the same salary, they’ll receive the same rank but the next rank will not be skipped.
For employee 105, rank is 6 where as dense rank is 5

NTILE(): Breaking Data into Buckets

Divides rows into specified number of groups.

Syntax

NTILE(number_of_buckets) 
OVER (
    [PARTITION BY partition_expression]
    ORDER BY sort_expression [ASC|DESC]
)

Example: Divide employees into 4 salary quartiles

-- Example: Divide employees into 4 salary quartiles
SELECT 
    employee_id,
    salary,
    NTILE(4) OVER (ORDER BY salary DESC) as salary_quartile
FROM employees;

-- Output:
-- EMPLOYEE_ID  SALARY  SALARY_QUARTILE
-- 102         6000    1
-- 100         5000    1
-- 103         4800    2
-- 101         4500    3

Explanation:

NTILE(4) splits the data into four roughly equal parts based on the salary column.
Useful for percentile-based reporting.

2. Value Functions

LAG() and LEAD()

Access data from previous (LAG) or next (LEAD) rows.

Syntax

LAG|LEAD(expression[, offset[, default_value]]) 
OVER (
    [PARTITION BY partition_expression]
    ORDER BY sort_expression [ASC|DESC]
)

Example: Compare current salary with previous employee’s salary

-- Example: Compare current salary with previous employee's salary
SELECT 
    employee_id,
    salary,
    LAG(salary) OVER (ORDER BY employee_id) as prev_salary,
    LEAD(salary) OVER (ORDER BY employee_id) as next_salary
FROM employees;

-- Output:
-- EMPLOYEE_ID  SALARY  PREV_SALARY  NEXT_SALARY
-- 100         5000    NULL         4500
-- 101         4500    5000         6000
-- 102         6000    4500         4800

Explanation:

LAG() retrieves the value from the previous row (previous employee’s salary).
LEAD() retrieves the value from the next row (next employee’s salary).

FIRST_VALUE() and LAST_VALUE()

Retrieve first or last value in a window frame.

Syntax

FIRST_VALUE|LAST_VALUE(expression) 
OVER (
    [PARTITION BY partition_expression]
    ORDER BY sort_expression [ASC|DESC]
    [ROWS|RANGE frame_clause]
)

Example: Compare each salary with highest salary in department. You can use the function LAST_VALUE()to get minimum salary of the department.

-- Example: Compare each salary with highest salary in department
SELECT 
    employee_id,
    department_id,
    salary,
    FIRST_VALUE(salary) OVER (
        PARTITION BY department_id 
        ORDER BY salary DESC
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) as highest_dept_salary,
   LAST_VALUE(salary) OVER (
        PARTITION BY department_id 
        ORDER BY salary DESC
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) as lowest_dept_salary
FROM employees;

-- Output:
-- EMPLOYEE_ID  DEPARTMENT_ID  SALARY  HIGHEST_DEPT_SALARY   LOWEST_DEPT_SALARY
-- 100         10            5000     5000     4500
-- 101         10            4500     5000     4500
-- 102         20            6000     6000     4800
-- 103         20            4800     6000     4800

Explanation:

FIRST_VALUE() fetches the first salary in the window, while LAST_VALUE() fetches the last one.
The range in LAST_VALUE() ensures it looks at all rows in the partition.

3. Statistical Functions

PERCENT_RANK() and CUME_DIST()

Calculate relative rankings within a result set.

Syntax

PERCENT_RANK|CUME_DIST() 
OVER (
    [PARTITION BY partition_expression]
    ORDER BY sort_expression [ASC|DESC]
)

Example: Calculate percentage ranking of salaries

-- Example: Calculate percentage ranking of salaries
SELECT 
    employee_id,
    salary,
    ROUND(PERCENT_RANK() OVER (ORDER BY salary), 2) as percent_rank,
    ROUND(CUME_DIST() OVER (ORDER BY salary), 2) as cume_dist
FROM employees;

-- Output:
-- EMPLOYEE_ID  SALARY  PERCENT_RANK  CUME_DIST
-- 101         4500    0.00          0.20
-- 103         4800    0.25          0.40
-- 100         5000    0.50          0.60
-- 102         6000    0.75          1.00

Advanced Usage: Combining Multiple Analytical Functions

-- Example: Comprehensive employee analysis
SELECT 
    employee_id,
    department_id,
    salary,
    ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) as dept_salary_rank,
    FIRST_VALUE(salary) OVER (
        PARTITION BY department_id 
        ORDER BY salary DESC
    ) as highest_dept_salary,
    ROUND(salary / SUM(salary) OVER (PARTITION BY department_id) * 100, 2) as salary_percentage
FROM employees;

-- Output:
-- EMPLOYEE_ID  DEPARTMENT_ID  SALARY  DEPT_SALARY_RANK  HIGHEST_DEPT_SALARY  SALARY_PERCENTAGE
-- 100         10            5000     1                5000                52.63
-- 101         10            4500     2                5000                47.37
-- 102         20            6000     1                6000                55.56
-- 103         20            4800     2                6000                44.44

Understanding Aggregate Functions as Analytical Functions

When aggregate functions (SUM, COUNT, AVG, MIN, MAX) are used with the OVER clause, they become powerful analytical tools that can perform calculations across specified sets of rows while maintaining the detail-level data.

SUM() as Analytical Function

The SUM analytic function calculates running totals and partitioned sums while preserving row detail.

--Example 1: Calculate running total of sales by department
SELECT 
    department_id,
    employee_id,
    sale_amount,
    SUM(sale_amount) OVER (
        PARTITION BY department_id 
        ORDER BY sale_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) as running_total,
    SUM(sale_amount) OVER (PARTITION BY department_id) as dept_total
FROM sales;

-- Output:
-- DEPT_ID  EMP_ID  SALE_AMOUNT  RUNNING_TOTAL  DEPT_TOTAL
-- 10       101     1000         1000           3500
-- 10       102     1500         2500           3500
-- 10       103     1000         3500           3500
-- 20       201     2000         2000           5000
-- 20       202     3000         5000           5000

-- Example 2: Calculate percentage of department total
SELECT 
    department_id,
    employee_id,
    sale_amount,
    ROUND(
        sale_amount / SUM(sale_amount) OVER (PARTITION BY department_id) * 100,
        2
    ) as pct_of_dept_total
FROM sales;

-- Output:
-- DEPT_ID  EMP_ID  SALE_AMOUNT  PCT_OF_DEPT_TOTAL
-- 10       101     1000         28.57
-- 10       102     1500         42.86
-- 10       103     1000         28.57
-- 20       201     2000         40.00
-- 20       202     3000         60.00

Explanation: For example 1

The SUM() function calculates the cumulative sum of the amount column.
OVER (ORDER BY sale_date) tells SQL to compute the sum in a running order based on sale_date.

COUNT() as Analytical Function

COUNT analytics can track running counts and provide insights about data distribution.

-- Example 1: Running count of orders by customer
SELECT 
    customer_id,
    order_date,
    order_amount,
    COUNT(*) OVER (
        PARTITION BY customer_id 
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) as order_count,
    COUNT(*) OVER (PARTITION BY customer_id) as total_customer_orders
FROM orders;

-- Output:
-- CUST_ID  ORDER_DATE  ORDER_AMOUNT  ORDER_COUNT  TOTAL_CUSTOMER_ORDERS
-- C1       2024-01-01  500          1            3
-- C1       2024-01-15  700          2            3
-- C1       2024-02-01  600          3            3
-- C2       2024-01-05  800          1            2
-- C2       2024-01-20  900          2            2

-- Example 2: Calculate density of orders by date range
SELECT 
    order_date,
    COUNT(*) OVER (
        ORDER BY order_date
        RANGE BETWEEN INTERVAL '7' DAY PRECEDING AND CURRENT ROW
    ) as orders_last_7_days
FROM orders;

-- Output:
-- ORDER_DATE  ORDERS_LAST_7_DAYS
-- 2024-01-01  1
-- 2024-01-05  2
-- 2024-01-15  3
-- 2024-01-20  2

Explanation: For example 1

The COUNT() function calculates the cumulative count of the order for each customer.
OVER (ORDER BY order_date) tells SQL to compute the count in a running order based on order_date.

AVG() as Analytical Function

AVG analytics are perfect for calculating moving averages and comparing values to averages.

-- Example 1: Calculate 3-day moving average of sales
SELECT 
    sale_date,
    daily_sales,
    ROUND(
        AVG(daily_sales) OVER (
            ORDER BY sale_date
            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
        ),
        2
    ) as moving_avg_3day
FROM daily_sales;

-- Output:
-- SALE_DATE   DAILY_SALES  MOVING_AVG_3DAY
-- 2024-01-01  1000        1000.00
-- 2024-01-02  1200        1100.00
-- 2024-01-03  1500        1233.33
-- 2024-01-04  1300        1333.33

-- Example 2: Compare to department average
SELECT 
    department_id,
    employee_id,
    salary,
    ROUND(
        AVG(salary) OVER (PARTITION BY department_id),
        2
    ) as dept_avg_salary,
    ROUND(
        (salary - AVG(salary) OVER (PARTITION BY department_id)) / 
        AVG(salary) OVER (PARTITION BY department_id) * 100,
        2
    ) as pct_diff_from_avg
FROM employees;

-- Output:
-- DEPT_ID  EMP_ID  SALARY  DEPT_AVG_SALARY  PCT_DIFF_FROM_AVG
-- 10       101     5000    4750.00          5.26
-- 10       102     4500    4750.00          -5.26
-- 20       201     6000    5400.00          11.11
-- 20       202     4800    5400.00          -11.11

Explanation: For example 1

The AVG() function calculates the average.
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW specifies a window of three rows (the current row and two preceding ones) over which to calculate the average.

MIN() and MAX() as Analytical Functions

MIN and MAX analytics help identify boundaries and ranges within data sets.

-- Example 1: Track high and low points in each quarter
SELECT 
    quarter,
    sale_date,
    sale_amount,
    MIN(sale_amount) OVER (PARTITION BY quarter) as quarter_low,
    MAX(sale_amount) OVER (PARTITION BY quarter) as quarter_high,
    sale_amount - MIN(sale_amount) OVER (PARTITION BY quarter) as amount_above_low
FROM quarterly_sales;

-- Output:
-- QUARTER  SALE_DATE   SALE_AMOUNT  QUARTER_LOW  QUARTER_HIGH  AMOUNT_ABOVE_LOW
-- Q1       2024-01-15  1000        1000         1500          0
-- Q1       2024-02-01  1500        1000         1500          500
-- Q1       2024-03-15  1200        1000         1500          200
-- Q2       2024-04-01  1800        1300         1800          500
-- Q2       2024-05-15  1300        1300         1800          0

-- Example 2: Rolling high/low comparison
SELECT 
    trade_date,
    stock_price,
    MIN(stock_price) OVER (
        ORDER BY trade_date
        ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
    ) as rolling_low,
    MAX(stock_price) OVER (
        ORDER BY trade_date
        ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
    ) as rolling_high,
    ROUND(
        (stock_price - MIN(stock_price) OVER (
            ORDER BY trade_date
            ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
        )) / MIN(stock_price) OVER (
            ORDER BY trade_date
            ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
        ) * 100,
        2
    ) as pct_above_low
FROM stock_prices;

-- Output:
-- TRADE_DATE  STOCK_PRICE  ROLLING_LOW  ROLLING_HIGH  PCT_ABOVE_LOW
-- 2024-01-01  100.00      100.00       100.00        0.00
-- 2024-01-02  102.50      100.00       102.50        2.50
-- 2024-01-03  101.75      100.00       102.50        1.75
-- 2024-01-04  103.25      100.00       103.25        3.25

Explanation: For example 1

The MIN() function calculates the minimum sale_amount of the each quarter.
The MAX() function calculates the maximum sale_amount of the each quarter.

LISTAGG() as Analytical Functions

LISTAGG is a powerful analytical function that aggregates data from multiple rows into a single concatenated string. When used with the OVER clause, it provides unique capabilities for creating comma-separated lists while maintaining row-level details.

Syntax

LISTAGG(measure_column [, 'delimiter']) WITHIN GROUP (ORDER BY sort_column)
OVER (PARTITION BY partition_column)

Basic Employee Skills Aggregation

-- Create a sample employee skills table
CREATE TABLE employee_skills (
    employee_id NUMBER,
    department_id NUMBER,
    skill_name VARCHAR2(50)
);

-- Sample data
INSERT INTO employee_skills VALUES (101, 10, 'Java');
INSERT INTO employee_skills VALUES (101, 10, 'Python');
INSERT INTO employee_skills VALUES (101, 10, 'SQL');
INSERT INTO employee_skills VALUES (102, 10, 'Python');
INSERT INTO employee_skills VALUES (102, 10, 'R');
INSERT INTO employee_skills VALUES (103, 20, 'JavaScript');
INSERT INTO employee_skills VALUES (103, 20, 'HTML');

-- Basic LISTAGG query
SELECT 
    employee_id,
    department_id,
    LISTAGG(skill_name, ', ') WITHIN GROUP (ORDER BY skill_name) 
        OVER (PARTITION BY employee_id) as employee_skills
FROM employee_skills;

-- Output:
-- EMPLOYEE_ID  DEPARTMENT_ID  EMPLOYEE_SKILLS
-- 101         10             Java, Python, SQL
-- 101         10             Java, Python, SQL
-- 101         10             Java, Python, SQL
-- 102         10             Python, R
-- 102         10             Python, R
-- 103         20             HTML, JavaScript
-- 103         20             HTML, JavaScript

Explanation:

The LISTAGG() function groups the employee skills for each employee and creates a comma-separated lists of skills they have.

Advanced Usage: Combining Multiple Analytical Functions

--Comprehensive sales analysis
SELECT 
    department_id,
    sale_date,
    sale_amount,
    -- Running total
    SUM(sale_amount) OVER (
        PARTITION BY department_id 
        ORDER BY sale_date
    ) as running_total,
    -- Moving average
    ROUND(AVG(sale_amount) OVER (
        PARTITION BY department_id 
        ORDER BY sale_date
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ), 2) as moving_avg,
    -- Comparison to department bounds
    ROUND(
        (sale_amount - MIN(sale_amount) OVER (PARTITION BY department_id)) /
        (MAX(sale_amount) OVER (PARTITION BY department_id) - 
         MIN(sale_amount) OVER (PARTITION BY department_id)) * 100,
        2
    ) as percentile_in_dept
FROM sales;

-- Output:
-- DEPT_ID  SALE_DATE   SALE_AMOUNT  RUNNING_TOTAL  MOVING_AVG  PERCENTILE_IN_DEPT
-- 10       2024-01-01  1000         1000           1000.00     0.00
-- 10       2024-01-15  1500         2500           1250.00     100.00
-- 10       2024-02-01  1200         3700           1233.33     40.00
-- 20       2024-01-05  2000         2000           2000.00     0.00
-- 20       2024-01-20  3000         5000           2500.00     100.00

Explanation:

The example shows multiple analytical functions usage in single SQL for complex calculations.
The SUM() is used to get running total and AVG() is used to calculate moving average.

Best Practices for Analytical Functions

Performance Considerations

Use appropriate indexes for PARTITION BY and ORDER BY columns
Limit window size when possible using ROWS clause
Consider materialized views for frequently accessed calculations
Consider partitioning for large datasets

Window Frame Considerations

Default frame: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
Use ROWS for count-based windows
Use RANGE for value-based windows

Formatting, Readability and Maintainability

Use meaningful column aliases
Break complex analytics into steps using CTEs
Document window frame specifications
Round decimal values appropriately

Common Pitfalls and NULL Handling

Be careful with NULL values in window functions
Pay attention to window frame boundaries
Use COALESCE or NVL when necessary
Consider data type conversions
Consider excluding NULLs with filtering

Conclusion

Oracle SQL Analytical Functions provide powerful tools for complex data analysis. By understanding these functions and their proper usage, you can significantly improve your data analysis capabilities and write more efficient queries. Remember to consider performance implications when working with large datasets and always test your queries thoroughly.

FAQs

Q1. What’s the difference between analytical functions and aggregate functions?

Aggregate functions group rows and return a single result per group
Analytical functions perform calculations across rows while preserving individual row details
Example: SUM as an aggregate collapses rows, while SUM as an analytical function keeps all rows and shows running totals

Q2. What’s the purpose of the PARTITION BY clause?

PARTITION BY divides data into groups for processing
Calculations are performed independently within each partition
Think of it as a way to segment your data (like by department, region, or year) without losing row-level detail

Q3. How does ORDER BY affect analytical function results?

ORDER BY determines the sequence of rows for processing
Critical for functions like LAG, LEAD, FIRST_VALUE, LAST_VALUE
Impacts calculations involving running totals and moving averages

Q4. What’s a window frame and why is it important?

Window frame defines the range of rows used for each calculation
Specified using ROWS or RANGE clause
Examples: ROWS BETWEEN 3 PRECEDING AND CURRENT ROW, or RANGE UNBOUNDED PRECEDING

Q5. Can analytical functions be used in WHERE clauses?

No, analytical functions cannot be used in WHERE clauses
They can only appear in SELECT list or ORDER BY clause
Use subqueries or CTEs if you need to filter based on analytical function results

Q6. What are the most commonly used analytical functions?

ROW_NUMBER(): Assigns unique numbers to rows
LAG/LEAD: Access previous/next row values
RANK/DENSE_RANK: Assign rankings to rows
FIRST_VALUE/LAST_VALUE: Retrieve first/last values in a window
NTILE: Divide rows into specified number of groups

Q7. How do you handle NULL values in analytical functions?

Most analytical functions ignore NULLs by default
IGNORE NULLS or RESPECT NULLS clause can be used with certain functions
NULLs can affect window frame calculations and rankings

Discover the power of technology and learning with TechyBuddy

Table of Content

Introduction

Understanding Analytical Functions Syntax

Basic Analytical Functions Structure

Frame Clause Syntax

Common Analytical Functions

1. Ranking Functions

ROW_NUMBER()

RANK() and DENSE_RANK()

NTILE(): Breaking Data into Buckets

2. Value Functions

LAG() and LEAD()

FIRST_VALUE() and LAST_VALUE()

3. Statistical Functions

PERCENT_RANK() and CUME_DIST()

Advanced Usage: Combining Multiple Analytical Functions

Understanding Aggregate Functions as Analytical Functions

SUM() as Analytical Function

COUNT() as Analytical Function

AVG() as Analytical Function

MIN() and MAX() as Analytical Functions

LISTAGG() as Analytical Functions

Advanced Usage: Combining Multiple Analytical Functions

Best Practices for Analytical Functions

Performance Considerations

Window Frame Considerations

Formatting, Readability and Maintainability

Common Pitfalls and NULL Handling

Conclusion

FAQs

Q1. What’s the difference between analytical functions and aggregate functions?

Q2. What’s the purpose of the PARTITION BY clause?

Q3. How does ORDER BY affect analytical function results?

Q4. What’s a window frame and why is it important?

Q5. Can analytical functions be used in WHERE clauses?

Q6. What are the most commonly used analytical functions?

Q7. How do you handle NULL values in analytical functions?

Learn more about SQL and other topics

Leave a Comment Cancel Reply