[中文]
Author:
fuyuncat
Source:
www.HelloDBA.com
Date:
2010-01-11 06:13:34
2. How to choose histogram, Frequency or Height Balanced?
Frequency and Height Balanced are two methods to create histogram correspond to different data values distribution. Their grouping keys are different, endpoint number meaning are different and other dependent stats data such as density are different.
When choose frenquency or height balanced histogram, oracle relyes on some data, such as Number of Distinct Values(NDV), sample size, density and the finally buckets number. However, to the finally buckets number, it requires to create the histogram. So, oracle will first try to create histogram using one kind of method dependending on some simple conditions, then check it fits its requirements or not. If not matched, oracle will convert the histogram to the other kind of histogram or create new histogram using the other kind of method. We can use below pseudocode descript the process.
SQL代码
- Estimate NumberofDistinctValues (NDV)
- Get the MaxNumberofBucket(MNB) correspond to Bucket SIZE setting
- if NDV < 0.9*2000 then
- Try to create frequency histogram
- if frequency histogram is ok then
- adopt frequency histogram
- if Not created frequency histogram or the created frequency histogram is not ok then
- Try to create height balanced histogram
- if height balanced histogram is ok then
- if not tried frequency histogram and sample NDV less than MNB - 1 then
- Try to create frequency histogram
- if frequency histogram is ok then
- adopt frequency histogram
- else
- adopt height balanced histogram
- If no histogram be adopted then
- adjust sample PCT and repeat the process again
Note: we will discuss the Sample PCT in another paper.
Estimated NDV depends on sample NDV or Numberof Notnull Values(NNV), and the Maximum Number of Bucket(MNB) is calculated from bucket size setting. I use perl script demo the calculation of these 2 values:
SQL代码
- HELLODBA.COM>host perl get_MNB.pl
- Sample percent is 30%
- Sample number of Not Null Values is 100, calculated number of Not Null Values is 333.333333333333
- Sample Number of Distinct Values is 50, Estimated Number of Distinct Values is 50
- SIZE setting is , calculated Maximum Number of Bucket is 75
Perl code download:http://www.Hellodba.com/Download/get_MNB.zip
Then, let's further investigate the histogram creating process and see which one finally be adopted.
Frequency Histogram
Oracle first construct a query to get the Frequency Histogram data. We can find such query by tracing.
SQL代码
- HELLODBA.COM>set serveroutput on
- HELLODBA.COM>begin
- 2 dbms_output.enable(1000000);
- 3 dbms_stats.set_param('TRACE',2049);
- 4 dbms_stats.gather_table_stats('DEMO','T_TEST7',NULL,0,FALSE,'FOR COLUMNS OBJECT_TYPE');
- 5 end;
- 6 /
- select /*+ no_parallel(t) no_parallel_index(t) dbms_stats cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring */ count(*)
- ,count("OBJECT_NAME"),sum(sys_op_opnsize("OBJECT_NAME"))
- ,count("SUBOBJECT_NAME"),sum(sys_op_opnsize("SUBOBJECT_NAME"))
- ,count("OBJECT_ID"),sum(sys_op_opnsize("OBJECT_ID"))
- ,count("DATA_OBJECT_ID"),sum(sys_op_opnsize("DATA_OBJECT_ID"))
- ,count("OBJECT_TYPE"),sum(sys_op_opnsize("OBJECT_TYPE"))
- ,count("CREATED")
- ,count("LAST_DDL_TIME")
- ,count("TIMESTAMP"),sum(sys_op_opnsize("TIMESTAMP"))
- ,count("STATUS"),sum(sys_op_opnsize("STATUS"))
- ,count("TEMPORARY"),sum(sys_op_opnsize("TEMPORARY"))
- ,count("GENERATED"),sum(sys_op_opnsize("GENERATED"))
- ,count("SECONDARY"),sum(sys_op_opnsize("SECONDARY"))
- from "DEMO"."T_TEST7" t
- select substrb(dump(val, 16, 0, 32), 1, 120) ep, cnt
- from (select /*+ no_parallel(t) no_parallel_index(t) dbms_stats cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring */
- "OBJECT_TYPE" val, count(*) cnt
- from "DEMO"."T_TEST7" t
- where "OBJECT_TYPE" is not null
- group by "OBJECT_TYPE")
- order by nlssort(val, 'NLS_SORT = binary')
- PL/SQL procedure successfully completed.
The 1st query is used to get the average size of the columns, and the 2nd query is the right query to get the Histogram. We can find the frequency histogram buckets are groupped by data values (group by "OBJECT_TYPE") from the query.
The key to judge if it fits to using frequency histogram is buckets number.
- If the buckets number not exceed to the MNB limitation, it's confirmed ok for frequency histogram;
Tips: If we trace the process in 16383 level, we can see it will not try other method after created frequency histogram if matched.
SQL代码
- ...
- Building Histogram for STATUS
- Trying frequency histogram
- select substrb(dump(val,16,0,32),1,120) ep, cnt from (select /*+ no_parallel(t) no_parallel_index(t) dbms_stats cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring */"STATUS" val,count(*) cnt from sys.ora_temp_1_ds_7861 t where "STATUS" is not null group by "STATUS") order by nlssort(val,'NLS_SORT = binary')
- Building Histogram for TEMPORARY
- ...
- If the bucket number exceed to MNB, but no larger than 2000, then it will try to convert the histogram to Height Balanced Histogram. The conversion procedure is that, it first create new buckets in average size with MNB buckets number, then fill the new buckets with the old buckets in sequence, once a new bucket is full or more, it will fill the next one. Since the size of the new buckets are in average, they looks like "balanced". We can image such a scenario. We have thousands of apple need be boxed up. First, they were put into hundreds of buckets with different size. But we think it's hard to manage those buckets, we want the apples could be put into same size buckets with less buckets. So, we calculate the new buckets size by using apples number divide buckets number. Then get the buckets in calculated size from store, and move the apples from the old buckets to new buckets one by one.
I also used a perl script demo the conversion.
SQL代码
- HELLODBA.COM>host perl conv_fre_to_height.pl
- The 0 Frequency bucket: 0, 5 will be in the next balanced bucket.
- The 0 Height Bbalanced bucket: 0, 1
- The 1 Frequency bucket: 30, 7 will be in the next balanced bucket.
- The 2 Frequency bucket: 33, 15 will be in the next balanced bucket.
- The 1 Height Bbalanced bucket: 33, 2
- The 3 Frequency bucket: 38, 28 will be in the next balanced bucket.
- The 2 Height Bbalanced bucket: 38, 3
- The 4 Frequency bucket: 80, 50 will be in the next balanced bucket.
- The 3 Height Bbalanced bucket: 80, 6
- ssize: 50; cbktsz: 10; popcnt: 0; popcntsq: 0
- HELLODBA.COM>
TIPS: If histogram be converted, we can find below entries in the trace contents.
SQL代码
- ...
- Building Histogram for TIMESTAMP
- Trying frequency histogram
- select substrb(dump(val,16,0,32),1,120) ep, cnt from (select /*+ no_parallel(t) no_parallel_index(t) dbms_stats cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring */"TIMESTAMP" val,count(*) cnt from sys.ora_temp_1_ds_7861 t where "TIMESTAMP" is not null group by "TIMESTAMP") order by nlssort(val,'NLS_SORT = binary')
- Switching to height-balanced histogram: buckets=75,ssize = 5401, popcnt = 0, ssizesq = 83579, popcntsq = 0
- ...
After histogram be converted, it needs to scale the density, which is also to check it's real height balanced or not. If density could be scaled by special methods, the checking passed.
Note: we will discuss the density in another paper.
- If buckets number exceed to 2000, the histogram could not be created as frequncy, should be created as height balanced.
Tips: we can also see the frequency histogram be discarded from the trace.
SQL代码
- ...
- Building Histogram for DATA_OBJECT_ID
- Trying frequency histogram
- select substrb(dump(val,16,0,32),1,120) ep, cnt from (select /*+ no_parallel(t) no_parallel_index(t) dbms_stats cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring */"DATA_OBJECT_ID" val,count(*) cnt from "DEMO"."T_TEST1" t where "DATA_OBJECT_ID" is not null group by "DATA_OBJECT_ID") order by val
- Discarding histogram: number of values = 2001, max # of buckects = 75, ssize = 2037
- ...
Height Balanced Histogram
Just like frequency, it also construct a query to generate the basic histogram data. We can also get the query by tracing.
SQL代码
- HELLODBA.COM>set serveroutput on
- HELLODBA.COM>begin
- 2 dbms_output.enable(1000000);
- 3 dbms_stats.set_param('TRACE',2049);
- 4 dbms_stats.gather_table_stats('DEMO','T_TEST1',NULL,0,FALSE,'FOR COLUMNS OBJECT_ID');
- 5 end;
- 6 /
- ...
- select min(minbkt),
- maxbkt,
- substrb(dump(min(val), 16, 0, 32), 1, 120) minval,
- substrb(dump(max(val), 16, 0, 32), 1, 120) maxval,
- sum(rep) sumrep,
- sum(repsq) sumrep sq,
- max(rep) maxrep,
- count(*) bktndv,
- sum(case when rep = 1 then 1 else 0 end) unqrep
- from (select val,
- min(bkt) minbkt,
- max(bkt) maxbkt,
- count(val) rep,
- count(val) * count(val) repsq
- from (select /*+ no_parallel(t) no_parallel_index(t) dbms_stats cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring */
- "OBJECT_ID" val, ntile(75) over(order by "OBJECT_ID") bkt
- from sys.ora_temp_1_ds_7862 t
- where "OBJECT_ID" is not null)
- group by val)
- group by maxbkt
- order by maxbkt
- PL/SQL procedure successfully completed.
This query is made up of 3 level query.
- The inner query is to use the ntile() function to define MNB buckets (we call such buckets as average buckets) in average size. Each records will be labeled the bucket number. The average size is depend on records number, not consider the column data values. So, multiple records with same data values may be labled with different bucket number;
- The middle query to group the inner query result by data values. As we said, multiple records with same data values may be labled with different bucket number, here will also get the maximum and minimum bucket number of the data values. Besides, records repetitions of a values and sequal of the repetitions will also be goetten.
- The outer query will group the middle query result by values's maximum bucket number, means, each value will be fallen under the maximum bucket if it's records be labeled multiple buckets. And the maximum values of each bucket is called as endpoint value.
There may be such a special value in Height Balanced Histogram. It have so many repetition records that will occupy multiple buckets, whose endpoint values are that value. Such value will be called as popular value. And the duplicated buckets (the buckets whose values are assigned to another bucket with larger number) will be merged, say, compressed (Which will make the histogram not absolute balanced).
Note: In height balanced histogram, it finally need create an additional buckt to hold the minimum values as endpoint value, we call such bucket as 0# bucket.
Popular value is the key to judge if the histogram really balanced or not. If all of data values are popular, the histogram must be converted to Frequency histogram (In such case, each value will occupy a bucket after compressed. The conversion is simple, just need to convert the endpoint number --- It's the accumulative records number of the data values in frequency histogram, while is the average bucket number in height balanced histogram. The size of average bucket is the average size of total records number. So it's no hard to convert the endpoint number from height balanced to frequency). Otherwise, it need to scale the density to double check.
I demostrated the popular value checking by perl script.
SQL代码
- HELLODBA.COM>host perl check_popular.pl
- Height Balanced Histogram:
- minbkt maxbkt minval maxval sumrep sumrepsq maxrep
- 1 1 1 2 2 2 1
- 1 4 3 4 9 65 8
- 5 5 5 5 3 9 3
- 6 6 6 8 4 6 2
- 7 8 9 10 6 20 4
- Value 3 is a popular value
- HELLODBA.COM>
Download the script: http://www.Hellodba.com/Download/check_popular.zip
From above analysis, we know it will create frequncy histogram under two kinds conditions, otherwise, it will create height balanced histogram.
- We can get the conclusion that Height Balanced histogram buckets number is limited by MNB from above process, and the MNB could never be larger than 254 (even when you special the bucket size in option, it will raise exception when it be set larger than this value) from its calculation. Plus with the 0# bucket, the maximum bucket number of height balanced could never be larger than 255. That is to say, if the buckets number is larger than 255, it should not be height balanced, but frequency.
- When mached 3 requirements, it will definately create frequency histogram. The 1st is than the Number of Distint Values less than MNB; the 2nd is that NDV is equal to actual buckets number (we can see it from the basic query); The 3rd one is that the density is calculated by frequency formula.
- The others are height balanced histograms.
Notes: It is a special case when NDV equal to MNB. It matched the requirments of frequency and height balanced both. Oracle will class it as height balanced. We can see it's created as frequency from the trace, but be defined as height balaned in related view.
We can find the judgement from the defination of DBA_TAB_COLS.
SQL代码
- HELLODBA.COM>select to_char(substr(definition,adapt
- 2 instr(definition, 'case when nvl(h.row_cnt,0)'),
- 3 313)) histogram_def
- 4 from view_definition
- 5 where view_name = 'DBA_TAB_COLS';
- HISTOGRAM_DEF
- -----------------------------------------------------------------------------
- case when nvl(h.row_cnt,0) = 0 then 'NONE'
- when (h.bucket_cnt > 255
- or
- (h.bucket_cnt > h.distcnt and h.row_cnt = h.distcnt
- and h.density*h.bucket_cnt*2 <= 1))
- then 'FREQUENCY'
- else 'HEIGHT BALANCED'
- end,
Note: In data dictionary hist_head$, bucket_cnt is MNB, and row_cnt is the actual bucket number, which are easily confused.
Conclusion
Frequency histogram is more suitable for those columns with small distinct values (<= 2000) with data value uniformally distributed; or the colums of tiny table. Otherwise, it must adopt Height Balanced histogram.
Note: One of requirement to create histogram is that null value records should not be too many, which will be discussed in another paper.
--- Fuyuncat TBC ---