Date: 2010-01-18 06:23:13
Calculating NDV and Density
Density is a import statistics data to affect the selectivity. It's normally derived from Number of Distinct Value (NDV), whose calculation is impacted by histogram with sampling.
When there is Frequency Histogram on the column, the calculation is simple, from the number of rows that the column value is not null, or says, the column rows SIZE.
Density = 1/(2*SIZE)
If the rows SIZE is from the sample data, means it's a sample size (SSZIE). It needs estimate the real SIZE first.
SIZE = SSIZE*100/PCT
And the formula of Density become,
Density = 1/(2*SIZE) = 1/(2*SSIZE*100/PCT) = PCT/(SSIZE*200)
Height Balanced Histogram
With Height Balanced Histogram, the Density is calculated from Bucket sample size and popular value size, as well as their sum of sequal.
To choose which arithmetic, decided by Sample Number of Not null Values (SNNV) and smapling percent. For each range SNNV, if its scale to SDNV is in special limitation, Oracle will use a internal arithmetic named KKESDV scaling to estimate the NDV; Otherwise, if the estimated NNV and SNNV are in special limiatation, it will linear scale the NDV. If both kinds of limitation are exceeded, that means the histogram should be Height Balanced, need be converted to Frequency or adjust the sample percent.
I demo the density calculation of height balanced histogram by perl script.
- HELLODBA.COM>host perl cal_height_balanced_dens.pl
- ssize: 5401; ssizesq: 83601; popcnt: 0; popcntsq: 0; pct: 14.7058823529
- sndv: 348.928852525687; snnv: 5401; nnv: 36726.8000001028
- kksesdv scaling, density=0.00286532951289398
Download the script: http://www.Hellodba.com/Download/cal_height_balanced_dens.zip
KKESDV Scaling arithmetic is a process to adjust the formula in loop. While we can get the linear scaling formula by replacing the factors with their own formula.
$den = 1/($sndv*100/$pct);
= 1/($ssize*(($ssize-$popcnt)/($ssizesq-$popcntsq))*100/$pct) # replace $sndv by formula $ssize*(($ssize-$popcnt)/($ssizesq-$popcntsq))
= 1/($size*(($ssize-$popcnt)/($ssizesq-$popcntsq))) # $size = $ssize*100/$pct
When no histogram, the density is calculated from NDV.
However, if stats data was analyzed from sample data (just get the SNDV), it needs scaling the NDV.
- -If it's an unique column, NDV could be linear scaled.
NDV = ROUND(SNDV*100/PCT) => Density=1/ROUND(SNDV*100/PCT)
- -If it's not an unique column, should check which scaling arithmetic it's suitable for.
- If there is Height Balanced Histogram, the scaling will be same as Height Balanced (NDV = 1/Density).
- If there is Frequency Histogram, will use the kkesdv scaling, the result must be same as SNDV;
- If no histogram, the SNDV and SNNV is count() from the sample data, not calculatd from SSZIE and POPCNT.
Note, all of columns with sampling data will scale the NDV, but just those columns without calculated Density will calculate this data by 1/NDV. Since the histogram data is analyzed before scaling NDV, the Density may have calculated in the histogram analyzing process.
We can find the NDV scaling process from the Trace.
- Scaling NDV of CREATED
- kkesdv_p: sndv=909,snnv=5401,nnv=46726,hind=49,slsv=0
- kkesdv scaling or variant, ndv=911
For those data calculated from sample data, oracle will finally double check them and revise them. For density, it will compare the calculated NDV and table row number (NROWS), if NDV larger than NROWS, it means it's unreasonable, should revise the density as 1/NROWS if it's calculated by 1/NDV.
Note: It will not calculate density&NDV for BLOB, CLOB, BFILE, CFILE, LONG, LONG RAW columns.
--- Fuyuncat ---