[中文]
Author:
fuyuncat
Source:
www.HelloDBA.com
Date:
2009-08-01 15:03:35
3.4.4. Filter column position
In fact, this can not be considered separately, because there may be a number of filter columns, each filter column has its own location. We first consider the case of only one filter column.
(select /*+full(a)*/a, b, c, d, e from T_PEEKING11 a where col < :v2;)
(TABROWS: 1000000; TABBLKS: 1000; QRYCOLNUMS: 5)
FLTCOLPOS
COST_CPU
CPU_B
cpu_A
1
211121440
7121440
204000000
2
230121440
7121440
223000000
3
249121440
7121440
242000000
4
268121440
7121440
261000000
Well, change is linear, and in accordance with the assumption 4. Therefore, we need to derive the FLTCOLPOS in the formula of TYPFAC. In the formula 6, there exists two uncertain factors, so we have to solve a system of linear equations of four unknowns. Here we just modified one uncertain factor, and therefore can only come to two equations, it is not sufficient to solve:
(X1*1+Z1)*5 + (X2*1+Z2) = 204000000/1000000 … … (1)
(X1*2+Z1)*5 + (X2*2+Z2) = 223000000/1000000 … … (2)
Then we modify another factor, QRYCOLNUMS:
(select /*+full(a)*/a, b, c, d from T_PEEKING11 a where col < :v2;)
(TABROWS: 1000000; TABBLKS: 1000; QRYCOLNUMS: 4)
FLTCOLPOS
COST_CPU
CPU_B
cpu_A
1
210121440
7121440
203000000
2
229121440
7121440
222000000
3
248121440
7121440
241000000
4
267121440
7121440
260000000
We get the other 2 equations:
(X1*1+Z1)*4 + (X2*1+Z2) = 203000000/1000000 … … (3)
(X1*2+Z1)*4 + (X2*2+Z2) = 222000000/1000000 … … (4)
(2) – (1), (4) – (3) =>
X1*5 + X2 = 19
X1*4 + X2 = 19
=>
X1 = 0
=>
X2 = 19
Then solve the system of linear equations of two unknowns:
Z1*5 + 19 + Z2 = 204
Z1*4 + 19 + Z2 = 203
=>
Z1 = 1
Z2 = 180
We got the new formula of TYPFAC
Formula 7: TYPFAC = QRYCOLNUMS + 19*FLTCOLPOS + 180
3.4.5. Effective columns number
This concept maybe hard to understand. It comes from the special cases that mismatched the formula I ever got. Look at the following test data:
(select /*+full(a)*/a from T_PEEKING11 a where c < :v2;)
COST_CPU
CPU_B
cpu_A
247121440
7121440
240000000
(select /*+full(a)*/a, b from T_PEEKING11 a where c < :v2;)
COST_CPU
CPU_B
cpu_A
247121440
7121440
240000000
(select /*+full(a)*/a, b, d from T_PEEKING11 a where c < :v2;)
COST_CPU
CPU_B
cpu_A
248121440
7121440
241000000
The first test data queried one column, and the second data queried two columns, but their results are the same! The third one queried three columns, but its result is not the same as the previous 2! Such a result is initially confused me. It seems that formula 4&6 are incorrect. And when I try these three data into the formula 7, found the results are incorrect:
1 +19 * 3 +180 = 238! = 240000000/1000000
2 +19 * 3 +180 = 239! = 240000000/1000000
3 +19 * 3 +180 = 240! = 241000000/1000000
OK! The process seems to have erred, or the one of our assumption is invalid. Let us carefully look at these special cases to see if we can find any rules.
Look at the first and second figures - the number of the query column has been changed, but it did not change COST_CPU. Here I have noticed one commonality between them: the location position of query columns are smaller than the filter column position (a : 1; b: 2; c: 3)! Whether can we draw such an assumption 5: the query columns that whose location position is less than or equal to the location position of the query filter columns will not be included in the effective columns. Many random testing data show that this assumption is valid, (can not find any anti-case, testing data will no longer be listed).
Then, where is the QEYCOLNUMS in formula 4 come from? Based on the formula 7, we can calculate the QEYCOLNUMS of first & second testing data:
240000000/1000000 - 180 - 19 * 3 = 3
3, what does this figure mean? I found the location position of the filter (FLTCOLPOS) is also 3, is a coincidence (or perhaps the finding is a coincidenceJ). Base on this finding, I set up assumption 6: when the query column is not effective column (or the location of query column is less than or equal to filter column), the number of query columns (we will call it effective column number) is the query column position (that is, the number of the query column and the column located before it). Similarly, I also adopted a number of random data to prove the assumption (can not find any anti-case).
However, these two assumptions are still not enough to explain a third data. In accordance with the formula 7, the QRYCOLNUMS = 241000000/1000000 - 180 - 19 * 3 = 4, does not equal to its column position (3). Don’t worry, according to the assumption 5, a, b two columns can not be calculated in the effective column, but the d is not in the scope of the assumption. If it is considered as an effective column, together with the assumption 6 (effective column from the filter column is 3), come to (3 +1) = 4. So we can draw an assumption 7: the columns whose location position is greater than the query filter will also be included in the effective columns. We can also prove the assumption by a number of random data.
OK, is it all of the impact of filter columns position? Let us now look at another group of testing data:
select /*+full(a)*/X from T_PEEKING11 a where c < :v3;
X
COST_CPU
CPU_B
CPU_A
d
512873940
7121440
505752500
e
512876440
7121440
505755000
f
512878940
7121440
505757500
g
512881440
7121440
505760000
d, e
512876440
7121440
505755000
d, f
512878940
7121440
505757500
f, g
512881440
7121440
505760000
d, g ,h
512883940
7121440
505762500
From the calculated EFFCOLNUMS in the set of data, we can easily find the rule: The max query column location position decides the number of effective query filed. This can also become one of our assumptions (8), at the same time, a groups of data can prove it.
From the above assumptions can be drawn:
Formula 8: EFFCOLNUMS = EFFQRYCOLNUM + FLTCOLPOS
EFFCOLNUMS:effective column number
FLTCOLPOS:filter column number
EFFQRYCOLNUM:effective query column. If there is any query column whose position greater than the filter column position, it will be the result of they subtraction max query column position and filter column postion, otherwise, it will be 0.
The QRYCOLNUMS in formula 7 has become EFFCOLNUMS,
Formula 9:TYPFAC = EFFCOLNUMS + 19*FLTCOLPOS + 180
3.4.6. Max filter column position
Note: In case of multi-filter columns, we will first derived the case with only AND, and will finally involve the OR.
After complete of the previous derivation, I should deduced the impact of the number of filter column in the formula. However, the specific changes in the test data let me note that a factor under the conditions of multi-filter columns: the max filter column position.
Look at following set of data: two filter columns, with the same max filter column position:
(all columns has same data type, TABROWS: 1000000; TABBLKS: 1000; EFFCOLNUMS: 5)
COLA
COLB
columns
COST_CPU
CPU_B
cpu_A
1
5
A&E
289621440
7121440
282500000
2
5
B&E
289621440
7121440
282500000
3
5
C&E
289621440
7121440
282500000
4
5
D&E
289621440
7121440
282500000
We can found, the change of position of the columns whose position is less than the max filter column position will not affect the results.
Look at another set of data: two filter columns, which different max filter column position:(all columns has same data type, TABROWS: 1000000; TABBLKS: 1000; EFFCOLNUMS: 5)
COLA
COLB
columns
COST_CPU
CPU_B
cpu_A
1
2
A&B
229771440
7121440
222650000
1
3
A&C
249721440
7121440
242600000
1
4
A&D
269671440
7121440
262550000
1
5
A&E
289621440
7121440
282500000
It’s linear change. We can draw such a conclusion: with multiple AND filter columns, only the max filter column position will affect the COST_CPU, the other columns location will not affect COST_CPU. Based on this conclusion, we first get assumption 8: the FILCOLPOS in formula 8 & 9 is the max filter column position, MAXFLTCOLPOS. However, due to the existence of two uncertain factors: MAXFLTCOLPOS and the number of filter columns, FLTCOLNUM, and the change of FLTCOLNUM is not linear (come from other testing data, will be given later), we can not directly draw the equations to solve. Well, I would like to see based on this assumption, is there a way come out a equation on FLTCOLNUM.
First we get a new formula base on assumption 8:Formula 8-1:EFFCOLNUMS = EFFQRYCOLNUM + MAXFLTCOLPOS
Formula 10:TYPFAC = EFFCOLNUMS + 19*MAXFLTCOLPOS + 180
3.4.7. filter filed number
Filter filed number in the formula is the most special. It can not draw a simple equation. However, by observing-assuming-disproving, finally I find the answers. Look at its impact on COST_CPU:
(all columns has same data type, TABROWS: 1000000; TABBLKS: 1000; EFFCOLNUMS: 5; MAXFILPOS: 5)
FLTCOLNUM
COST_CPU
CPU_B
CPU_A
1
287121440
7121440
280000000
2
289621440
7121440
282500000
3
289746440
7121440
282625000
4
289752690
7121440
282631250
5
289753003
7121440
282631563
We can find the change of this data is not linear. It’s hard for me to get an equation. However, we can try other method to find the rule. We can first calculate the TYPFAC according to the formula 5:
FLTCOLNUM
COST_CPU
CPU_B
CPU_A
TYPFAC
1
287121440
7121440
280000000
280
2.5
2
289621440
7121440
282500000
282.5
0.125
3
289746440
7121440
282625000
282.625
0.00625
4
289752690
7121440
282631250
282.63125
0.000313
5
289753003
7121440
282631563
282.63156
Then subtract in sequence (see the result in blue words). Bingo! There is obvious rule among them: the difference is 20 times! 2.5/20 = 0.125; 0.125/20 = 0.00625; 0.00625/20 = 0.0003125 ~= 0.000313. Following this rule, can get another formula:
Formula 10-1: TYPFAC = 130 + EFFCOLNUMS + 19*MAXFLTCOLPOS + 50*(20^(1 – FLTCOLNUM) + 20^(2 – FLTCOLNUM) + … + 20^( FLTCOLNUM- FLTCOLNUM))
= 130 + EFFCOLNUMS + 19*MAXFLTCOLPOS + 50*(1/20)^(FLTCOLNUM-1) + 50*(1/20)^(FLTCOLNUM-2) + … + 50(1/20)^(FLTCOLNUM- FLTCOLNUM)
= 130 + EFFCOLNUMS + 19*MAXFLTCOLPOS + 50*(1/20)^0 + 50*(1/20)^1 + … + 50*(1/20)^(FLTCOLNUM-1)
From other random data, we can prove this formula, which also proved the assumption 8.
3.4.8. Data type of filter columns
During previous deriving process, in order to eliminate the impact of data types, I have set all the columns as VARCHAR2 (50). Now, it’s time to derive the effect of the data types.
First, it’s easily proved that the data type will not impact on the query columns. (Not list the test data).
Then what we need to derive is the impact of data type filter columns. However, the data types change is not linear. I assume the change is come from CACE ... WHEN:Restricted all of the data type to be CHARS; we found the result is same as VARCHAR2.
Then modify all the data type to be NUMBER,
(all columns has same data type, TABROWS: 1000000; TABBLKS: 1000; EFFCOLNUMS: 5; MAXFILPOS: 5)
FLTCOLNUM
COST_CPU
CPU_B
CPU_A
TYPFAC
1
387121440
7121440
380000000
380
7.5
2
394621440
7121440
387500000
387.5
0.375
3
394996440
7121440
387875000
387.875
0.01875
4
395015190
7121440
387893750
387.89375
0.000938
5
395016128
7121440
387894688
387.89469
Compare the data, it’s easily get the new version of formula 10-1:
Formula 10-2: TYPFAC = 130 + EFFCOLNUMS + 19*MAXFLTCOLPOS + 150*(1/20)^0 + 150*(1/20)^1 + … + 150*(1/20)^(FLTCOLNUM-1)
Then deduce the DATE:
FLTCOLNUM
COST_CPU
CPU_B
CPU_A
TYPFAC
1
537121440
7121440
530000000
530
15
2
552121440
7121440
545000000
545
0.75
3
552871440
7121440
545750000
545.75
0.0375
4
552908940
7121440
545787500
545.7875
0.001875
5
552910815
7121440
545789375
545.78938
The new formula become:
Formula 10-3: TYPFAC = 130 + EFFCOLNUMS + 19*MAXFLTCOLPOS + 300*(1/20)^0 + 300*(1/20)^1 + … + 300*(1/20)^(FLTCOLNUM-1)
I stopped at here, not deduce other data types. Here we get the new formula:
Formula 11: TYPFAC = 130 + EFFCOLNUMS + 19*MAXFLTCOLPOS + COLTYPEFAC*(1/20)^0 + COLTYPEFAC*(1/20)^1 + … + COLTYPEFAC*(1/20)^(FLTCOLNUM-1)
Data Type
COLTYPEFAC
CHAR
50
VARCHAR
50
NUMBER
150
DATE
300
3.4.9. Data length of the filed
The test data tell us the data length of the filed will not impact the COST_CPU.
3.4.10. Computing sequence of filter columns
In the previous derivation, the data types of all columns are uniform, and we also see that the different data type has different coefficient, COLTYPEFAC. What will happen if there is mixture of data types? Start from the special circumstances, the filter column with the largest location position is set to be NUMBER, the other filter columns set to be DATE (It’s my intentional arrangement. If I change the sequence in turn, the result it is hard to be drawn. I will demonstrate this issue in following derivation). Observing the changes:
FLTCOLNUM
COST_CPU
CPU_B
CPU_A
TYPFAC
1
387121440
7121440
380000000
380
15
2
402121440
7121440
395000000
395
0.75
3
402871440
7121440
395750000
395.75
0.0375
4
402908940
7121440
395787500
395.7875
0.001875
5
402910815
7121440
395789375
395.78938
We can be seen, COLTYPEFAC change is based on DATE. I try to calculate the COLTYPEFAC base on DATE formula, and compare data:
FLTCOLNUM
TYPFAC_CAL
1
380
2
387.5
0.375
3
387.875
0.01875
4
387.89375
0.0009375
5
387.8946875
Data change were found before the in power permutation, whose base is the difference between COLTYPEFAC of DATE and NUMBER (through similar methods, we can find the rule between DATE & VARCHAR2 and NUMBER & VARCHAR2). Therefore, we can try to change the formula 11:
Formula 12: TYPFAC = 130 + EFFCOLNUMS + 19*MAXFLTCOLPOS + COLTYPEFAC1*(1/20)^0 + COLTYPEFAC2*(1/20)^1 + … + COLTYPEFACn*(1/20)^(FLTCOLNUM-1)
Base on the new formula, calculate the testing data, they are correct.
We will discuss the filter column process sequence in further in 3.4.13.