Author: fuyuncat
Source: www.HelloDBA.com
Date: 2010-01-20 07:34:44
Sampling
When analyzing objects, we can reduce the workload by gathering statistics from a sample of the data and estimating the full statistics from it. This may reduce accuracy, but it saves a lot of resources and reduces the load on the server, which matters when analyzing huge tables.
When setting the sample size, we have three choices (see the example after this list):
- Set a specific sample percentage, from 0.000001 to 100;
- Set it to DBMS_STATS.AUTO_SAMPLE_SIZE or 0, which lets Oracle choose the sample size automatically;
- Leave it unset (NULL) or set it to 100, which means no sampling.
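For example, with DBMS_STATS the three choices map onto the estimate_percent parameter; a sketch, where DEMO.T_TEST1 is the demo table used later in this article:
SQL code
-- 1. a specific sample percentage (here 10%)
exec DBMS_STATS.GATHER_TABLE_STATS('DEMO', 'T_TEST1', estimate_percent => 10)
-- 2. let Oracle choose the sample size automatically
exec DBMS_STATS.GATHER_TABLE_STATS('DEMO', 'T_TEST1', estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE)
-- 3. no sampling: compute statistics over all rows
exec DBMS_STATS.GATHER_TABLE_STATS('DEMO', 'T_TEST1', estimate_percent => NULL)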
If set to AUTO, Oracle first makes a quick estimate of the table's row count based on the number of data blocks, and then sets an initial sampling percentage depending on that row count. To get the row count, Oracle scans the sample data iteratively, adjusting the sampling percentage as it goes. Here is the pseudocode.
SQL code
set the initial sample percentage (NPCT) from the number of blocks (NBS):
    NPCT = LEAST(100, GREATEST(0.000001, (100 * 100 / NBS)))
loop {
    if NPCT > 25 then
        set NPCT = 100
    if NPCT >= 100 then
        set the query SQL to "select count(*) from table"
    else
        set the query SQL to "select count(*) from table sample(NPCT)"
    execute the SQL and get the number of sampled records (SN)
    if SN > 5000 or NPCT >= 100 then
        exit loop
    else
        if SN >= 441 then
            NPCT = GREATEST((NPCT*7500/SN), (NPCT*10))
        else
            NPCT = NPCT*100
}
get the estimated record count: N = SN*100/NPCT
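To make the loop concrete, here is a minimal PL/SQL sketch of the pseudocode above; the function name is mine, and the dynamic SQL with the SAMPLE clause only imitates what Oracle does internally:
SQL code
CREATE OR REPLACE FUNCTION estimate_rowcnt (
    p_table  IN VARCHAR2,   -- table to estimate
    p_blocks IN NUMBER)     -- number of blocks (NBS)
RETURN NUMBER IS
    npct NUMBER := LEAST(100, GREATEST(0.000001, 100 * 100 / p_blocks));
    sn   NUMBER;
BEGIN
    LOOP
        IF npct > 25 THEN
            npct := 100;
        END IF;
        IF npct >= 100 THEN
            EXECUTE IMMEDIATE 'select count(*) from ' || p_table INTO sn;
        ELSE
            EXECUTE IMMEDIATE 'select count(*) from ' || p_table ||
                              ' sample(' || TO_CHAR(npct) || ')' INTO sn;
        END IF;
        EXIT WHEN sn > 5000 OR npct >= 100;
        IF sn >= 441 THEN
            npct := GREATEST(npct * 7500 / sn, npct * 10);
        ELSE
            npct := npct * 100;
        END IF;
    END LOOP;
    RETURN sn * 100 / npct;   -- N = SN*100/NPCT
END;
/
With the table from the trace that follows, estimate_rowcnt('T_TEST1', 680) would start at 100*100/680, about 14.71 percent, see roughly 4359 rows, escalate past 25 percent to 100, and return the full count of 47582.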
We can see this loop in the trace.
SQL code
...
quick_estimate_rowcnt: 4359 rows seen at 14.7058823529 percent (aprx. 100 blocks)
quick_estimate_rowcnt: 47582 rows seen at 100 percent (aprx. 680 blocks)
...
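(These traces come from the DBMS_STATS tracing facility. On 10g it can be switched on as below; the TRACE parameter and its bit-mask values are undocumented, so both the call shown and the chosen value are an assumption that may vary by version.)
SQL code
-- assumption: undocumented DBMS_STATS trace switch; on 11g the
-- equivalent is DBMS_STATS.SET_GLOBAL_PREFS('TRACE', ...)
exec DBMS_STATS.SET_PARAM('TRACE', TO_CHAR(4+8+16))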
With the estimated row count, Oracle sets the initial percentage using the formula below. Notice that under AUTO sampling, a table with fewer than 5500 records will effectively not be sampled, since the computed percentage reaches 100.
PCT = 5500*100/N
For the previous example, the initial percentage is:
PCT = 5500*100/47582 = 11.558992896473456349039552772057
However, the statistics gathered with the initial percentage (including a manually specified percentage) may not be accurate enough to adopt. In that case, Oracle adjusts the percentage and re-analyzes the data. We can see the re-analysis iterations in the trace.
SQL code
Iteration 1, percentage 11.55899289647345634903955277205666008154
NNV NDV AVG MMX HST EP RP NNNP IND CNDV HSTN HSTR COLNAME
Y Y Y Y Y Y Y Y Y Y SYS_NC00015$
Y Y Y Y Y Y Y Y Y Y OWNER
Y Y Y Y Y Y Y Y Y Y OBJECT_NAME
Y Y Y Y Y Y Y Y SUBOBJECT_NAME
Y Y Y Y Y Y Y Y Y OBJECT_ID
Y Y Y Y Y Y Y Y DATA_OBJECT_ID
Y Y Y Y Y Y Y Y OBJECT_TYPE
Y Y Y Y Y Y Y CREATED
Y Y Y Y Y Y Y LAST_DDL_TIME
Y Y Y Y Y Y Y Y TIMESTAMP
Y Y Y Y Y Y Y Y Y STATUS
Y Y Y Y Y Y Y Y TEMPORARY
Y Y Y Y Y Y Y Y GENERATED
Y Y Y Y Y Y Y Y SECONDARY
...
Iteration 2, percentage 100
NNV NDV AVG MMX HST EP RP NNNP IND CNDV HSTN HSTR COLNAME
Y Y Y Y Y Y Y Y Y SYS_NC00015$
Y Y Y Y Y Y OWNER
Y Y Y Y Y Y Y Y Y OBJECT_NAME
Y Y Y SUBOBJECT_NAME
Y Y Y Y Y Y Y Y OBJECT_ID
Y Y Y Y Y Y Y Y DATA_OBJECT_ID
Y Y Y OBJECT_TYPE
Y Y Y Y CREATED
Y Y Y Y LAST_DDL_TIME
Y Y Y TIMESTAMP
Y Y Y Y STATUS
Y Y Y TEMPORARY
Y Y Y GENERATED
Y Y Y SECONDARY
...
At the beginning of each iteration, the trace prints the gathering flag bits. Oracle uses those bits to decide which statistics should be analyzed; by comparing the bits of adjacent iterations, we can tell which statistics will be re-analyzed.
Values such as the sample size (SSIZE), the sampled number of not-null values (SNNV) and the calculated number of distinct values (NDV) are checked to decide whether the percentage should be adjusted. Below is the pseudocode, followed by a small runnable sketch of the adjustment rule.
SQL code
get the initial sampling percentage, set it as CURR_PCT
loop
{
    gather the basic statistics
    set NEW_PCT = CURR_PCT * 10
    for each column that requires a histogram
    {
        if the column's SNNV < 441 then
        {
            needs re-analysis
            NEW_PCT = GREATEST(PCT*441*1.25/GREATEST(SNNV, 1), NEW_PCT)
        }
        else
        {
            gather the column histogram with NEW_PCT
            if it is a height-balanced histogram and the NDV cannot be scaled then
                needs re-analysis
        }
    }
    if the sample size of the table < 441 then
    {
        needs re-analysis
        NEW_PCT = GREATEST(PCT*441*1.25/GREATEST(SNNV, 1), NEW_PCT)
    }
    evaluate each column's basic statistics one by one
    {
        if the column requires average size or max/min values then
            if the original PCT is auto and the column's SNNV < 919 then
            {
                needs re-analysis
                NEW_PCT = GREATEST(PCT*441*1.25/GREATEST(SNNV, 1), NEW_PCT)
            }
        if the column requires NDV then
            if the NDV cannot be scaled then
                needs re-analysis
    }
    exit if this is not auto sampling
    if re-analysis is needed then
        if the current sample size >= 441 and NEW_PCT > 25 and NEW_PCT < 50 and CURR_PCT <= 5 then
            NEW_PCT = 25
        PREV_PCT = CURR_PCT
        CURR_PCT = NEW_PCT
    else
        exit
}
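The escalation arithmetic at the heart of this loop can be sketched as a small PL/SQL function; the name and parameters are mine, and it covers only the NEW_PCT calculation, not the full gathering logic:
SQL code
CREATE OR REPLACE FUNCTION next_sample_pct (
    p_curr_pct    IN NUMBER,   -- percentage used in the current iteration
    p_snnv        IN NUMBER,   -- sampled number of not-null values
    p_sample_size IN NUMBER)   -- sample size of the current iteration
RETURN NUMBER IS
    new_pct NUMBER := p_curr_pct * 10;   -- default: grow ten-fold
BEGIN
    IF p_snnv < 441 THEN
        -- too few not-null values sampled: target roughly 441*1.25 of them
        new_pct := GREATEST(p_curr_pct * 441 * 1.25 / GREATEST(p_snnv, 1),
                            new_pct);
    END IF;
    -- damping rule from the pseudocode: do not jump far past 25% at once
    IF p_sample_size >= 441 AND new_pct > 25 AND new_pct < 50
       AND p_curr_pct <= 5 THEN
        new_pct := 25;
    END IF;
    RETURN LEAST(new_pct, 100);   -- cap at 100 (full scan); the cap is assumed
END;
/
For instance, next_sample_pct(11.56, 380, 5401) escalates straight to 100 percent, which is consistent with iteration 2 in the trace above running at 100 percent.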
We can also see in the trace which statistics must be re-analyzed.
SQL code
...
Scaling NDV of OBJECT_NAME
Need larger sample: sndv = 5128, snnvdv = 5401, nnv = 46726, ndv = 24986
...
The sample size is recorded in the dictionary at both the table level (tab$) and the column level (hist_head$).
SQL code
HELLODBA.COM>select o.name, t.samplesize
  2  from tab$ t, obj$ o, user$ u
  3  where t.obj# = o.obj#
  4  and o.owner# = u.user#
  5  and u.name = 'DEMO'
  6  and o.name = 'T_TEST1';
NAME                           SAMPLESIZE
------------------------------ ----------
T_TEST1                             47582
HELLODBA.COM>select u.name, o.name, c.name, h.sample_size
  2  from hist_head$ h, sys.col$ c, obj$ o, user$ u
  3  where h.obj# = c.obj#
  4  and h.intcol# = c.intcol#
  5  and h.obj# = o.obj#
  6  and o.owner# = u.user#
  7  and u.name = 'DEMO'
  8  and o.name = 'T_TEST1';
NAME                           NAME                           NAME                           SAMPLE_SIZE
------------------------------ ------------------------------ ------------------------------ -----------
DEMO                           T_TEST1                        SYS_NC00015$                         47582
DEMO                           T_TEST1                        OWNER                                 5401
DEMO                           T_TEST1                        OBJECT_NAME                          47582
DEMO                           T_TEST1                        SUBOBJECT_NAME                        5401
DEMO                           T_TEST1                        OBJECT_ID                            47582
DEMO                           T_TEST1                        DATA_OBJECT_ID                        5618
DEMO                           T_TEST1                        OBJECT_TYPE                           5401
DEMO                           T_TEST1                        CREATED                               5401
DEMO                           T_TEST1                        LAST_DDL_TIME                         5401
DEMO                           T_TEST1                        TIMESTAMP                             5401
DEMO                           T_TEST1                        STATUS                                5401
DEMO                           T_TEST1                        TEMPORARY                             5401
DEMO                           T_TEST1                        GENERATED                             5401
DEMO                           T_TEST1                        SECONDARY                             5401
14 rows selected.
The table's sample size is the sample size used in the last iteration, while each column's sample size is the sample size of the statistics analyzed for that column, including its histogram.
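On a live system the same figures are easier to read from the documented views; a sketch equivalent to the base-table queries above:
SQL code
select table_name, sample_size, num_rows
from   dba_tables
where  owner = 'DEMO' and table_name = 'T_TEST1';

select column_name, sample_size, num_distinct
from   dba_tab_col_statistics
where  owner = 'DEMO' and table_name = 'T_TEST1';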
--- Fuyuncat ---