Recuperar contraseña

Window functions are what every T-SQL programmer should know. Part 2.

Dmitriy Fedyashov

In the second part of the article we will talk about the functions themselves, which are used to form values. The window function calculates the value from a set of data associated with the current row, that is, data from the same group, if Partition by is used. Conventional aggregate functions for grouping require the grouping of rows, thus losing the necessary unique information from the sample. Therefore, it is necessary to use two instead of a single request in order to have all the necessary data and the sum by groups. Window aggregate functions allow to achieve the same result in one query.

Let me remind you that a window is a set of lines by which a function is calculated. The OVER instruction splits the entire set of rows into separate groups — windows according to the specified condition.

Let's talk about the types of window functions. There are three groups of destination:

• Aggregate functions: SUM (), MAX (), MIN (), AVG (). COUNT (). These functions return the value obtained by arithmetic calculations;

• Ranking functions: RANK (), DENSE_RANK (), ROW_NUMBER (), NTILE (). Allow to get the sequence number of entries in the window;

• Offset functions: LAG (), LEAD (), FIRST_VALUE (), LAST_VALUE (). Return a value from another window row.

To demonstrate how the functions work, I will use a simple table:

1
2
3
4
5
6
7
CREATE TABLE ForWindowFunc (ID INT, GroupId INT, Amount INT)
GO
 
INSERT INTO ForWindowFunc (ID, GroupId, Amount)
VALUES(1, 1, 100), (1, 1, 200), (1, 2, 150),
 (2, 1, 100), (2, 1, 300), (2, 2, 200), (2, 2, 50),
 (3, 1, 150), (3, 2, 200), (3, 2, 10);

 

Aggregate functions

 

SUM ()

The SUM () function works just like a regular aggregate function — it sums up all the values of a given column in a data set. However, thanks to the OVER () instruction, we break the data set into windows. The summation is performed inside the windows according to the order specified in the ORDER BY clause. Let's look at a simple example - the sum of the three groups.

1
2
3
SELECT ID,
 Amount,
 SUM(Amount) OVER (ORDER BY id) AS SUM FROM ForWindowFunc

 

ID

Amount

Sum

1

100

450

1

200

450

1

150

450

2

100

650

2

300

650

2

200

650

2

50

650

3

150

360

3

200

360

3

10

360

For convenience, the windows are highlighted in different colors. All values in the window have the same amount - the sum of all Amount in the window.

Let's add another column to the selection and change the OVER instruction:

1
2
3
4
5
SELECT ID,
GroupId,
Amount,
SUM(Amount) OVER (Partition BY id ORDER BY id, GroupId) AS SUM
FROM ForWindowFunc

 

ID

GroupId

Amount

Sum

1

1

100

300

1

1

200

300

1

2

150

450

2

1

100

400

2

1

300

400

2

2

200

650

2

2

50

650

3

1

150

150

3

2

200

360

3

2

10

360

As you can see, now each window is divided into groups thanks to the GroupId field. Each group now has its own amount.

And now, let's make a cumulative result inside each window:

1
2
3
4
5
SELECT ID,
GroupId,
Amount,
SUM(Amount) OVER (Partition BY id ORDER BY id, GroupId, Amount) AS SUM
FROM ForWindowFunc

 

ID

Amount

Sum

1

100

100

1

200

300

1

150

450

2

100

100

2

300

400

2

50

450

2

200

650

3

150

150

3

10

160

3

200

360

We no longer need the GroupId field, so we removed it from the selection. Now, for each line in the window, a total is calculated, which is the sum of the current value of the Amount and all previous ones.

AVG()

This function calculates the average value. It can be used with the sentences Partition by and Order by.

1
2
3
4
SELECT ID,
Amount,
AVG(Amount) OVER (Partition BY id ORDER BY id) AS AVG
FROM ForWindowFunc

 

ID

Amount

AVG

1

100

150

1

200

150

1

150

150

2

100

162

2

300

162

2

200

162

2

50

162

3

150

120

3

200

120

3

10

120

 

Each line in the window has an average value of Amount, which is calculated by the formula: the sum of all Amount / for the number of lines.

The behavior of this function is similar to SUM ().

MIN()

From the name of the function it is clear that it returns the minimum value in the window.

1
2
3
4
SELECT ID,
Amount,
MIN(Amount) OVER (Partition BY id ORDER BY id) AS MIN
FROM ForWindowFunc

 

ID

Amount

Min

1

100

100

1

200

100

1

150

100

2

100

50

2

300

50

2

200

50

2

50

50

3

150

10

3

200

10

3

10

10

As you can see, in the Min column, the minimum Amount value is displayed in the window.

MAX()

The MAX function works in the same way as MIN, it only gives the maximum value of the field in the window:

1
2
3
4
SELECT ID,
Amount,
MAX(Amount) OVER (Partition BY id ORDER BY id) AS MAX
FROM ForWindowFunc

 

ID

Amount

Max

1

100

200

1

200

200

1

150

200

2

100

300

2

300

300

2

200

300

2

50

300

3

150

200

3

200

200

3

10

200

Everything is very clear. In the first group, the maximum Amount is 200, in the second 300, and in the third - 200.

COUNT()

This function returns the number of lines in a window.

1
2
3
SELECT ID,
 Amount,
 COUNT(Amount) OVER (Partition BY id ORDER BY id) AS COUNT FROM ForWindowFunc

 

ID

Amount

Count

1

100

3

1

200

3

1

150

3

2

100

4

2

300

4

2

200

4

2

50

4

3

150

3

3

200

3

3

10

3

Let’s make the query a bit more complicated and add the GroupId field.

1
2
3
4
5
SELECT ID,
GroupId,
 Amount,
 COUNT(Amount) OVER (Partition BY id ORDER BY id, GroupId) AS COUNT
FROM ForWindowFunc

 

ID

GroupId

Amount

Count

1

1

100

2

1

1

200

2

1

2

150

3

2

1

100

2

2

1

300

2

2

2

200

4

2

2

50

4

3

1

150

1

3

2

200

3

3

2

10

3

In this case, it is more interesting. Let's look at the first window. For the first and second lines, the number of records was 2. But for the third line, the value is already 3. We managed to accumulate the amount in groups like the cumulative amount.

If we still want the number in each group, then GroupId needs to be added to the Partition by clause.

1
2
3
4
5
SELECT ID,
GroupId,
Amount,
COUNT(Amount) OVER (Partition BY id, GroupId) AS COUNT
FROM ForWindowFunc

 

ID

GroupId

Amount

Count

1

1

100

2

1

1

200

2

1

2

150

1

2

1

100

2

2

1

300

2

2

2

200

2

2

2

50

2

3

1

150

1

3

2

200

2

3

2

10

2

 

Ranking functions

 

RANK () / DENSE_RANK ()

The RANK () function returns the sequence number of the current row in the window. However, there is a feature. If the Order By clause contains several equivalent strings for a rule, then all of them will be considered the current string. Thus, the RANK () function should be used for ranking, not rowing. However, if you correctly set the Order by, then you can also number the physical strings. For example:

1
2
3
4
5
SELECT ID,
GroupId,
Amount,
RANK() OVER (Partition BY id ORDER BY id, GroupId, Amount) AS RANK
FROM ForWindowFunc

 

ID

GroupId

Amount

RANK

1

1

100

1

1

1

200

2

1

2

150

3

2

1

100

1

2

1

300

2

2

2

50

3

2

2

200

4

3

1

150

1

3

2

10

2

3

2

200

3

Here is the case with the same lines in the context of Order by:

1
2
3
4
5
SELECT ID,
GroupId,
Amount,
RANK() OVER (Partition BY id ORDER BY id, GroupId) AS RANK
FROM ForWindowFunc

 

ID

GroupId

Amount

RANK

1

1

100

1

1

1

200

1

1

2

150

3

2

1

100

1

2

1

300

1

2

2

200

3

2

2

50

3

3

1

150

1

3

2

200

2

3

2

10

2

It's interesting that the third line in the first window has a rank of 3, although the previous two lines are assigned to the first rank. Not the most understandable logic. In this case, it is better to use DENSE_RANK ().

1
2
3
4
5
SELECT ID,
GroupId,
Amount,
DENSE_RANK() OVER (Partition BY id ORDER BY id, GroupId) AS DENSE_RANK
FROM ForWindowFunc

 

ID

GroupId

Amount

DENSE_RANK

1

1

100

1

1

1

200

1

1

2

150

2

2

1

100

1

2

1

300

1

2

2

200

2

2

2

50

2

3

1

150

1

3

2

200

2

3

2

10

2

Now everything is as it should be. DENSE_RANK () does not skip ranks if the previous rank contains several lines.

The functions RANK () and DENSE_RANK () do not require indication of the field in brackets.

 

ROW_NUMBER()

The ROW_NUMBER () function displays the current row number in the window. Like the previous two functions, ROW_NUMBER () does not require specifying a field in parentheses.

1
2
3
4
5
SELECT ID,
GroupId,
Amount,
ROW_NUMBER() OVER (Partition BY id ORDER BY id, GroupId, Amount) AS ROW_NUMBER
FROM ForWindowFunc

 

ID

GroupId

Amount

ROW_NUMBER

1

1

100

1

1

1

200

2

1

2

150

3

2

1

100

1

2

1

300

2

2

2

50

3

2

2

200

4

3

1

150

1

3

2

10

2

3

2

200

3

In the query, we used Partition by to divide the data set into groups. Everything is clear here and should not cause questions.

If you do not use Partition by, then get through numbering throughout the data set:

1
2
3
4
5
SELECT ID,
GroupId,
Amount,
ROW_NUMBER() OVER (ORDER BY id, GroupId, Amount) AS ROW_NUMBER
FROM ForWindowFunc

 

ID

GroupId

Amount

ROW_NUMBER

1

1

100

1

1

1

200

2

1

2

150

3

2

1

100

4

2

1

300

5

2

2

50

6

2

2

200

7

3

1

150

8

3

2

10

9

3

2

200

10

In fact, the absence of the Partition by clause says that the entire data set is a window.

NTILE()

The NTILE () function allows you to determine which group the current line belongs to. The number of groups is specified in brackets, and the ORDER BY clause determines which column is used to define the group.

For example, this means that if you have 100 lines and you want to create 4 quartiles based on the specified value field, you can easily do this and see how many lines fall into each quartile.

Let's have a look at the example. In the query below, we indicated that we want to create four quartiles based on the order amount. Then we want to see how many orders fall into each quartile.

NTILE creates tiles based on the following formula:

Number of lines in each group = number of lines in the set / number of specified groups

Here is our example: the request contains only 10 lines and 4 tiles, so the number of lines in each plate will be 2.5 (10/4). Because the number of lines must be an integer, not decimal. The SQL engine will assign 3 rows for the first two groups and 2 rows for the remaining two groups.

1
2
3
SELECT Amount,
NTILE(4) OVER(ORDER BY amount) AS Ntile
FROM ForWindowFunc

 

Amount

Ntile

10

1

50

1

100

1

100

2

150

2

150

2

200

3

200

3

200

4

300

4

This is a very simple example, but it demonstrates the function well. All Amount values are sorted in ascending order and divided into 4 groups.

Offset functions

LAG() and LEAD()

These two functions allow you to get the previous and next value, respectively. Quite often it is necessary to compare the current value with the previous or next in the calculated columns.

As parameters, you can pass to the function the name of the field and the number of lines that you need to deviate from the current one and take the value. As in SUBSTRING (), we specify the position from which to take characters, and here we indicate the position from which to take the value. If you do not specify the number of values, then the default is one.

So, the LAG function allows you to access data from the previous line in one window.

1
2
3
SELECT id, Amount,
LAG(Amount) OVER(ORDER BY id, amount) AS Lag
FROM ForWindowFunc

 

id

Amount

Lag

1

100

NULL

1

150

100

1

200

150

2

50

200

2

100

50

2

200

100

2

300

200

3

10

300

3

150

10

3

200

150

In the first line, the value of the Lag field is definitely Null because there is no previous Amount value for this line. For all subsequent lines, the value of Amount from the previous line is taken.

The LEAD function works in the same way, but in the other direction - it takes the value from the next line.

1
2
3
SELECT id, Amount,
LEAD(Amount,2) OVER(ORDER BY id, amount) AS Lag
FROM ForWindowFunc

 

id

Amount

Lag

1

100

200

1

150

50

1

200

100

2

50

200

2

100

300

2

200

10

2

300

150

3

10

200

3

150

NULL

3

200

NULL

As you can see, in the query we pass the parameter 2 to the LEAD function. This means that we get the second value from the current Amount. For the last two lines, the value is Null, since for them there are no following values.

FIRST_VALUE() и LAST_VALUE()

With these functions, we can get the first and last value in the window. If the Partition by clause is not specified, then the functions will return the first and last value of the entire data set.

1
2
3
SELECT id, Amount,
FIRST_VALUE(Amount) OVER(Partition BY Id ORDER BY Id, amount) AS FIRST
FROM ForWindowFunc

 

id

Amount

First

1

100

100

1

150

100

1

200

100

2

50

50

2

100

50

2

200

50

2

300

50

3

10

10

3

150

10

3

200

10

Here we got the first value for each window.

And now we get the first value across the entire data set:

1
2
3
SELECT id, Amount,
FIRST_VALUE(Amount) OVER(ORDER BY Id, amount) AS FIRST
FROM ForWindowFunc

 

id

Amount

First

1

100

100

1

150

100

1

200

100

2

50

100

2

100

100

2

200

100

2

300

100

3

10

100

3

150

100

3

200

100

We removed the Partition clause from the query, thus we defined the entire data set as a window.

And now let's look at the work of the LAST_VALUE function:

1
2
3
SELECT id, Amount,
LAST_VALUE(Amount) OVER(ORDER BY id) AS LAST
FROM ForWindowFunc

 

id

Amount

Last

1

100

150

1

200

150

1

150

150

2

100

50

2

300

50

2

200

50

2

50

50

3

150

10

3

200

10

The query is almost the same as the previous one, but the result is completely different. Since we do not have a unique identifier in the table, we cannot sort the dataset by it. Sorting by field Id actually divided the data into three groups. And the function returned the last value for each of them – this is a feature of the function.

With this we will complete consideration of window functions. The examples given are greatly simplified specifically for a better understanding of the way the function works. Practical tasks are often more difficult, so it is important to understand well the behavior of a function depending on the sentences in the OVER instruction.

Similar articles:

Volver