[中文]
Author:
fuyuncat
Source:
www.HelloDBA.com
Date:
2010-01-07 09:47:08
Statistics data and histogram data are the basic data for Cost Base Optimizer to calculate the cost. They are correct or not, decide if optimizer can get the best execution plan. I will analyze the statistics data gathering internal process and arithmetic in following papers.
1. Will column's Histogram data be gathtered?
During table statistics data gathering, there are several factors will affect whether oracle will gather and store column's histogram data.
The first one, of course, is that have the columns be specified in analyaze statement or METHOD_OPT of dbms_stats procedure.
- FOR ALL COLUMNS;
- FOR ALL INDEXED COLUMNS;
- FOR ALL HIDDEN COLUMNS;
- FOR COLUMNS <COL1> <COL2> ... ;
For those specified columns, oracle need get below 2 properties for following judgement.
- Unique Column: The column is the ONLY ONE column in unique key;
- Indexed Column: The column is the LEADING column in at least one index.
SQL代码
- -- Here check the column in table/cluster is unique or indexed
- HELLODBA.COM>SELECT COUNT(UNQ) UNQ, COUNT(PFX) PFX FROM
- 2 (SELECT /*+ first_rows(1) leading(cc) */
- 3 CD.TYPE# UNQ, NULL PFX
- 4 FROM SYS.CCOL$ CC, SYS.CDEF$ CD
- 5 WHERE CC.OBJ# = 20
- 6 AND CC.INTCOL# = 1
- 7 AND CD.CON# = CC.CON#
- 8 AND CD.OBJ# = CC.OBJ#
- 9 AND CD.ENABLED IS NOT NULL
- 10 AND CD.INTCOLS = 1
- 11 AND CD.TYPE# IN (2,3)
- 12 AND BITAND(CD.DEFER, 2+4) = 4
- 13 AND ROWNUM < 2
- 14 UNION ALL
- 15 SELECT /*+ first_rows(1) leading(i) */
- 16 CASE WHEN I.INTCOLS = 1 AND BITAND(I.PROPERTY,1) = 1 THEN 3
- 17 ELSE NULL END UNQ,
- 18 CASE WHEN IC.POS# = 1 THEN 1 ELSE NULL END PFX
- 19 FROM SYS.IND$ I, SYS.ICOL$ IC
- 20 WHERE I.BO# = 20
- 21 AND IC.INTCOL# = 1
- 22 AND I.BO# = IC.BO#
- 23 AND I.OBJ# = IC.OBJ#
- 24 AND BITAND(I.FLAGS,1025) = 0
- 25 AND ROWNUM < 2
- 26 );
- UNQ PFX
- ---------- ----------
- 0 1
Here following the important factor it the bucket size setting. Oracle will set 2 kinds of bit flag to decide whether gather histogram data, according to requirement in different setting.
- -AUTO: Oracle determines the columns to collect histograms based on data distribution and the workload of the columns
- The columns be used for equality prediction or equality join prediction in recent 3 minutes, will set EUAL_PRED bit;
- The columns be used for range prediction or like prediction in recent 3 minutes, will set RANGE_PRED bit;
- The maximum bucket number is the distinct value number from sample data, and should not larger than 254.
- -SKEWONLY: Oracle determines the columns to collect histograms based on the data distribution of the columns
- The Non-unqiue columns will set EUAL_PRED bit;
- All of the columns will set RANGE_PRED bit;
- The maximum bucket number is the distinct value number from sample data, and should not larger than 254.
- -REPEAT: Collects histograms only on the columns that already have histograms.
- All of the columns that already have histograms will set EUAL_PRED and RANGE_PRED bits;
- The maximum bucket number is original bucket number;
- -<Number>: User specified size
- All of the columns will set EUAL_PRED and RANGE_PRED bits;
- The maximum bucket number is the specified size;
- -Not set: means use default setting
- All of the columns will set EUAL_PRED and RANGE_PRED bits;
- The maximum bucket number is the default size, 75;
Only when EUAL_PRED and RANGE_PRED bits are set, the column will set GATHER_HISTOGRAM bit.
Note: we can get the column recent usage timestamp and prediction times by querying COL_USAGE$.
SQL代码
- HELLODBA.COM>select timestamp last_usage_time, equality_preds, equijoin_preds, range_preds, like_preds from COL_USAGE$ w;
- LAST_USAG EQUALITY_PREDS EQUIJOIN_PREDS RANGE_PREDS LIKE_PREDS
- --------- -------------- -------------- ----------- ----------
- 07-JAN-10 42745 1713 0 0
For those columns has set GATHER_HISTOGRAM bit, oracle will create their histogram data. However, that not means the histogram data will be stored correpondly, which should be decided by following factors.
There are 2 methods to create histogram data, Frequency and Height Balanced. We will discuss how does oracle choose which one in next paper.
There are the differences of these 2 kinds of histogram data.
- Frequency Histogram: Each value of the column will repet frequently. Oracle create a bucket for each value. Data will be groupped by values. It's normally for those columns with low number of distinct values;
- Height Balanced Histogram: Height of each bucket is close, means balanced. Data will be groupped by bucket number. It's normally for those columns wiht high number of distinct values. However, there may be few values with high frequency in such column, and they may take multiple buckets. Those buckets will be merged (compressed), and such values are named popular values.
For the frequency histogram, oracle will store the histogram data under below conditions:
- The column has set EUAL_PRED bit, and it's actual bucket number less than sample size;
- The column has set RANGE_PRED bit, and oracle will check the data distribution by internal arithmetic. I will demo the arithmetic by a perl script later.
For the Height Balanced histogram, oracle will store the histogram data under below conditions:
- The column has set EUAL_PRED bit, there is popular values in this column;
- The column has set RANGE_PRED bit, and oracle will check the data distribution by internal arithmetic, which is the same arithmetic in frequency histogram.
SQL代码
- HELLODBA.COM>host perl Range_Skew_test.pl
- dif: 16
- th: 16
- The 0 (0) bucket:
- rc: 0
- The 1 (30) bucket:
- In step 16 ... 32
- Totc: 4
- rc: 0
- The 2 (33) bucket:
- In step 32 ... 48
- Totc: 68
- rc: 0
- The 3 (38) bucket:
- rc: 13
- Final totc: 68;
- Check value(totc/arrcnt): 13.6
- Checking passed, need be kept.
This script could be download at here: http://www.Hellodba.com/Download/Range_Skew_test.zip
From user perspective, we treat those column with histogram data be stored finally as the ones with histogram data be gathered.
--- Fuyuncat TBC ---