Forgot password?
  • Home
  • /
  • White Papers
  • /
  • Window functions are what every T-SQL programmer should know. Part 1.

Window functions are what every T-SQL programmer should know. Part 1.

Dmitriy Fedyashov

Long time ago an interesting functionality appeared in Microsoft SQL Server 2005- window functions. These functions allow you to perform calculations in a given range of lines within a Select clause. For those who have not encountered these functions, the question arises - “What does window means?”. A window means a row set within which calculations are performed. The window function allows you to break the entire data set into such windows.

Of course, all that window functions can do is also possible without them. However, window functions have a great advantage over regular aggregate functions: there is no need to group the data set for calculations, which allows you to save all the rows in the set with their unique identifiers. At the same time, the results of the work of window functions are simply added to the resulting sample as another field.

The main advantage of using window functions over regular aggregate functions is as follows: window functions do not lead to the grouping of lines into one line of output, the lines retain their separate identifiers, and the aggregated value is added to each line.

The window is defined using the OVER () instruction. Let's look at the syntax of this instruction:

Window function (column for calculations) OVER ([PARTITION BY column for grouping] [ORDER BY column for sorting] [ROWS or RANGE expression for restricting rows within a group])

The range of functions we will consider in the second part of the article. Let me just say that they are divided into: aggregating, ranking and bias.

To demonstrate the operation of window functions, I suggest on a test table:

1
2
3
4
5
6
7
CREATE TABLE ForWindowFunc (ID INT, GroupId INT, Amount INT)
GO
 
INSERT INTO ForWindowFunc (ID, GroupId, Amount)
 VALUES(1, 1, 100), (1, 1, 200), (1, 2, 150),
 (2, 1, 100), (2, 1, 300), (2, 2, 200), (2, 2, 50),
 (3, 1, 150), (3, 2, 200), (3, 2, 10);

 

ID

GroupId

Amount

1

1

100

1

1

200

1

2

150

2

1

100

2

1

300

2

2

200

2

2

50

3

1

150

3

2

200

As you can see, there are three groups with different number of elements in the group.

The summation function is most oftenly used, so we will conduct the demonstration on it. Let's see how the OVER instruction works:

1
2
3
 SELECT ID,
 Amount,
 SUM(Amount) OVER() AS SUM FROM ForWindowFunc

 

ID

Amount

Sum

1

100

1310

1

200

1310

2

100

1310

2

300

1310

2

200

1310

2

50

1310

3

150

1310

3

200

1310

3

10

1310

We used the OVER () instruction without sentences. In this embodiment, the window will be the entire data set and no sorting is applied. We were just lucky that the data was output in the same order in which it was inserted into the table, but SQL Server can change the display order if there is no explicitly defined sorting. Therefore, the OVER () instruction is almost never used without suggestions. But let's turn our attention to the new column SUM. For each row, the same value of 1310 is output. This is the cumulative sum of all the values in the Amount column.

PARTITION BY clause

The PARTITION BY clause defines the column by which it is needed to group, and it is key in splitting the row set into windows.

Let's change our query written earlier this way:

1
2
3
SELECT ID, 
 Amount, 
 SUM(Amount) OVER(PARTITION BY ID) AS SUM FROM ForWindowFunc

 

ID

Amount

Sum

1

100

300

1

200

300

2

100

650

2

300

650

2

200

650

2

50

650

3

150

360

3

200

360

3

10

360

The PARTITION BY clause grouped strings by the ID field. Now for each group is calculated its own sum of the Amount values. You can create windows by several fields. Then in PARTITION BY you need to write fields for grouping separated by commas (for example, PARTITION BY ID, Amount).

ORDER BY clause

Together with PARTITION BY, the ORDER BY clause can be used. ORDER BY clause determines the sorting order within the window. The sort order is very important, because the window function will process the data according to this particular order. If you do not use the PARTITION BY clause, but only ORDER BY, then the window will represent the entire data set.

1
2
3
4
 SELECT ID,
 GroupId,
 Amount,
 SUM(Amount) OVER(PARTITION BY id ORDER BY Amount) AS SUM FROM ForWindowFunc

 

ID

GroupId

Amount

Sum

1

1

100

100

1

2

150

250

1

1

200

450

2

2

50

50

2

1

100

150

2

2

200

350

2

1

300

650

3

2

10

10

3

1

150

160

3

2

200

360

To the PARTITION BY clause an ORDER BY was added across the Amount field. Thus, we indicated that we want to see the sum of not all Amount values in the window, but for each Amount value, the sum with all the previous ones. Such summation is often referred to as a increscent total or cumulative total.

You have noticed that the GpoupId field appeared in the sample. This field will allow you to show how the cumulative total will change, depending on the sorting. Change the query:

1
2
3
4
SELECT ID, 
 GroupId, 
 Amount, 
 SUM(Amount) OVER(Partition BY id ORDER BY GroupId, Amount) AS SUM FROM ForWindowFunc

 

ID

GroupId

Amount

Sum

1

1

100

100

1

1

200

300

1

2

150

450

2

1

100

100

2

1

300

400

2

2

50

450

2

2

200

650

3

1

150

150

3

2

10

160

3

2

200

360

And we get a completely different behavior. And although in the end for the last value in the window, the values converge with the previous example, but the sum for all the others is different. Therefore, it is important to clearly understand what you want to get in the end.

ROWS / RANG clause

Two more ROWS and RANGE clauses are used in the OVER instruction. This functionality appeared in MS SQL Server 2012.

The ROWS clause limits the rows in the window, indicating a fixed number of rows preceding or following the current one. Both ROWS and RANGE clauses are used with ORDER BY.

The ROWS clause can be specified using the methods:

• CURRENT ROW - display the current row;

• UNBOUNDED FOLLOWING - all records after the current one;

• UNBOUNDED PRECEDING - all previous entries;

• <integer> PRECEDING - specified number of previous lines;

• <integer> FOLLOWING — The specified number of subsequent entries.

You can combine these functions to achieve the desired result, for example:

ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING - the current and next one entry will appear in the window;

1
2
3
4
SELECT ID, 
 GroupId, 
 Amount, 
 SUM(Amount) OVER(Partition BY id ORDER BY GroupId, Amount ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING ) AS SUM FROM ForWindowFunc

 

ID

GroupId

Amount

Sum

1

1

100

300

1

1

200

350

1

2

150

150

2

1

100

400

2

1

300

350

2

2

50

250

2

2

200

200

3

1

150

160

3

2

10

210

3

2

200

200

Here, the amount is calculated by the current and next cell in the window. And the last line in the window has the same meaning as Amount. Let's look at the first window highlighted in blue. The sum 300 is calculated by adding 100 and 200. For the next value the situation is similar. And the last amount in the window is 150, because the current Amount has nothing more to add.

ROWS BETWEEN 1 PRECEDING AND CURRENT ROW - one previous and current record

1
2
3
4
SELECT ID, 
 GroupId, 
 Amount, 
 SUM(Amount) OVER(Partition BY id ORDER BY GroupId, Amount ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS SUM FROM ForWindowFunc

 

ID

GroupId

Amount

Sum

1

1

100

100

1

1

200

300

1

2

150

350

2

1

100

100

2

1

300

400

2

2

50

350

2

2

200

250

3

1

150

150

3

2

10

160

3

2

200

210

In this query, we get the sum by adding the current value of the Amount and the previous one. The first line is 100, because there is no previous Amount value.

The RANGE clause is also intended to limit the row set. Unlike ROWS, it does not work with physical strings, but with a range of rows in the ORDER BY clause. This means that the same-ranked rows in the context of the ORDER BY clause will be counted as one current row for the CURRENT ROW function. And in the ROWS clause, the current row is the current row of the dataset.

The RANGE clause can only be used with the CURRENT ROW, UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING options.

The RANGE clause can use options:

• CURRENT ROW - display the current row;

• UNBOUNDED FOLLOWING - all records after the current one;

• UNBOUNDED PRECEDING — All previous entries.

And can not:

• <integer> PRECEDING - specified number of previous lines;

• <integer> FOLLOWING — The specified number of subsequent entries.

Examples:

RANGE CURRENT ROW

1
2
3
4
SELECT ID,
 GroupId,
 Amount,
 SUM(Amount) OVER(Partition BY id ORDER BY GroupId RANGE CURRENT ROW) AS SUM FROM ForWindowFunc

 

ID

GroupId

Amount

Sum

1

1

100

300

1

1

200

300

1

2

150

150

2

1

100

400

2

1

300

400

2

2

200

250

2

2

50

250

3

1

150

150

3

2

200

210

3

2

10

210

The Range clause is set to the current line. But, as we remember, for the Range, the current line is all the lines corresponding to the same sort value. Sorting in this case by the GroupId field. The first two lines of the first window have a GroupId value of 1 - therefore both of these values satisfy the RANGE CURRENT ROW constraint. Therefore, Sum for each of these lines is equal to the total Amount on them - 300.

  • RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
1
2
3
4
SELECT ID,
 GroupId,
 Amount,
 SUM(Amount) OVER(Partition BY id ORDER BY GroupId RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS SUM FROM ForWindowFunc

 

ID

GroupId

Amount

Sum

1

1

100

300

1

1

200

300

1

2

150

450

2

1

100

400

2

1

300

400

2

2

200

650

2

2

50

650

3

1

150

150

3

2

200

360

3

2

10

360

In this case, the restriction is on all previous lines and the current one. For the first and second lines, this rule works like the previous one (remember CURRENT ROW), and for the third as the sum of the Amount of previous lines with the current one.

  • RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
1
2
3
4
SELECT ID, 
 GroupId, 
 Amount, 
 SUM(Amount) OVER(Partition BY id ORDER BY GroupId RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS SUM FROM ForWindowFunc

 

ID

GroupId

Amount

Sum

1

2

150

150

1

1

100

450

1

1

200

450

2

2

200

250

2

2

50

250

2

1

100

650

2

1

300

650

3

2

200

210

3

2

10

210

3

1

150

360

This restriction allowed us to get the amount from the current line and all previous ones within one window. Since the second and third row are in the same GroupId, these values are Current Row. Therefore, they are summed up immediately.

This concludes the first part of the article. And finally - an example of using the window function from real practice.

The best way to understand the essence of window functions is by an example. Suppose you have data about payments subscribers. Payment arrives on the contract. But this contract has affiliated contracts on which there is negative balance. And we want to distribute the funds received to repay the debt of the subsidiaries.

Thus, we need to find out how much money we will write off from the account of the main contract and how much we will transfer to the child. Let's look at the table:

ContractId

ChildContractId

PayId

CustAccount

PayAmount

1000000000

1000000002

1000000752

-200,00

800,00

1000000000

1000000003

1000000753

-1000,00

800,00

Where, ContractId is the identifier of the main contract,

ChildContractId - child contract identifier,

PayId - payment ID,

CustAccount - child contract balance,

PayAmount - payment.

From the table it is clear that for each child contract the amount of payment is 800. This is due to the fact that the payment is on the parent contract.

So our task is to calculate the amount of money transfers from the parent to the child contract.

To do this, sum up CustAccount and PayAmount. However, a simple amount of balance and payment does not suit us. After all, to repay the debt on the second subsidiary contract we must take into account the remaining balance of the first contract and payment.

How can we act in this situation? We could select:

1
2
3
4
5
6
7
8
SELECT
ContractId,
ChildContractId,
PayId,
CustAccount,
PayAmount,
PayAmount + (SELECT SUM(CustAccount) FROM dbo.Pays p2 WHERE p1.PayId = p2.PayId AND p2.ChildContractId <= p1.ChildContractId) AS [SUM]
FROM dbo.Pays p1

This query solves the task, but the subquery spoils the whole picture - it increases the query execution time. Let's apply the window function of addition:

1
2
3
4
5
6
7
8
SELECT
ContractId,
ChildContractId,
PayId,
CustAccount,
PayAmount,
PayAmount + SUM(CustAccount) OVER (ORDER BY ChildContractId) AS [SUM]
FROM dbo.Pays p1

This option is faster and more concise. In our case, we get the sum for the CustAccount field in the window, which is formed by the ChildContractId field.

The result of these queries will be the table:

ContractId

ChildContractId

PayId

CustAccount

PayAmount

Sum

1000000000

1000000002

1000000752

-200,00

800,00

600

1000000000

1000000003

1000000753

-1000,00

800,00

-400

Based on the data obtained in the Sum column, we determine the amount to be transferred from the parent contract to the child contract. For the contract 1000000002, we repaid the debt in full, so the payment amount is 200. For the agreement 1000000003, the debt is partially repaid - the payment amount is equal to the sum of the balance and the balance of the payment after calculation for the first record (-1000 + 600 = -400).

Similar articles:

back