Author: fuyuncat
Source: www.HelloDBA.com
Date: 2010-01-20 07:34:44
Sampling
When analyzing objects, we can reduce the workload by gathering statistics from a sample of the data and estimating the full statistics from it. This may reduce accuracy, but it saves a lot of resources and reduces the load on the server, which matters when analyzing huge tables.
When setting the sample size, we have three choices (see the example after this list):
- Set a specific sample percentage, from 0.000001 to 100;
- Set it to DBMS_STATS.AUTO_SAMPLE_SIZE or 0, which lets Oracle choose the sample size automatically;
- Leave it unset (NULL) or set it to 100, which means no sampling.
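For example, with DBMS_STATS the three choices map onto the estimate_percent parameter; a sketch, where DEMO.T_TEST1 is the demo table used later in this article:
SQL code
-- 1. a specific sample percentage (here 10%)
exec DBMS_STATS.GATHER_TABLE_STATS('DEMO', 'T_TEST1', estimate_percent => 10)
-- 2. let Oracle choose the sample size automatically
exec DBMS_STATS.GATHER_TABLE_STATS('DEMO', 'T_TEST1', estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE)
-- 3. no sampling: compute statistics over all rows
exec DBMS_STATS.GATHER_TABLE_STATS('DEMO', 'T_TEST1', estimate_percent => NULL)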
If set to AUTO, Oracle first makes a quick estimate of the table's row count based on the number of data blocks, and then sets an initial sampling percentage depending on that row count. To get the row count, Oracle scans the sample data iteratively, adjusting the sampling percentage as it goes. Here is the pseudocode.
SQL code
set the initial sample percentage (NPCT) from the number of blocks (NBS):
    NPCT = LEAST(100, GREATEST(0.000001, (100 * 100 / NBS)))
loop {
    if NPCT > 25 then
        set NPCT = 100
    if NPCT >= 100 then
        set the query SQL to "select count(*) from table"
    else
        set the query SQL to "select count(*) from table sample(NPCT)"
    execute the SQL and get the number of sampled records (SN)
    if SN > 5000 or NPCT >= 100 then
        exit loop
    else
        if SN >= 441 then
            NPCT = GREATEST((NPCT*7500/SN), (NPCT*10))
        else
            NPCT = NPCT*100
}
get the estimated record count: N = SN*100/NPCT
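To make the loop concrete, here is a minimal PL/SQL sketch of the pseudocode above; the function name is mine, and the dynamic SQL with the SAMPLE clause only imitates what Oracle does internally:
SQL code
CREATE OR REPLACE FUNCTION estimate_rowcnt (
    p_table  IN VARCHAR2,   -- table to estimate
    p_blocks IN NUMBER)     -- number of blocks (NBS)
RETURN NUMBER IS
    npct NUMBER := LEAST(100, GREATEST(0.000001, 100 * 100 / p_blocks));
    sn   NUMBER;
BEGIN
    LOOP
        IF npct > 25 THEN
            npct := 100;
        END IF;
        IF npct >= 100 THEN
            EXECUTE IMMEDIATE 'select count(*) from ' || p_table INTO sn;
        ELSE
            EXECUTE IMMEDIATE 'select count(*) from ' || p_table ||
                              ' sample(' || TO_CHAR(npct) || ')' INTO sn;
        END IF;
        EXIT WHEN sn > 5000 OR npct >= 100;
        IF sn >= 441 THEN
            npct := GREATEST(npct * 7500 / sn, npct * 10);
        ELSE
            npct := npct * 100;
        END IF;
    END LOOP;
    RETURN sn * 100 / npct;   -- N = SN*100/NPCT
END;
/
With the table from the trace that follows, estimate_rowcnt('T_TEST1', 680) would start at 100*100/680, about 14.71 percent, see roughly 4359 rows, escalate past 25 percent to 100, and return the full count of 47582.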
We can see this loop in the trace.
SQL code
...
quick_estimate_rowcnt: 4359 rows seen at 14.7058823529 percent (aprx. 100 blocks)
quick_estimate_rowcnt: 47582 rows seen at 100 percent (aprx. 680 blocks)
...
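(These traces come from the DBMS_STATS tracing facility. On 10g it can be switched on as below; the TRACE parameter and its bit-mask values are undocumented, so both the call shown and the chosen value are an assumption that may vary by version.)
SQL code
-- assumption: undocumented DBMS_STATS trace switch; on 11g the
-- equivalent is DBMS_STATS.SET_GLOBAL_PREFS('TRACE', ...)
exec DBMS_STATS.SET_PARAM('TRACE', TO_CHAR(4+8+16))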
With the estimated row count, Oracle sets the initial percentage using the formula below. Notice that under AUTO sampling, a table with fewer than 5500 records will effectively not be sampled, since the computed percentage reaches 100.
PCT = 5500*100/N
For the previous example, the initial percentage is:
PCT = 5500*100/47582 = 11.558992896473456349039552772057
However, the statistics gathered with the initial percentage (including a manually specified percentage) may not be accurate enough to adopt. In that case, Oracle adjusts the percentage and re-analyzes the data. We can see the re-analysis iterations in the trace.
SQL code
Iteration 1, percentage 11.55899289647345634903955277205666008154
NNV NDV AVG MMX HST EP RP NNNP IND CNDV HSTN HSTR COLNAME
Y Y Y Y Y Y Y Y Y Y SYS_NC00015$
Y Y Y Y Y Y Y Y Y Y OWNER
Y Y Y Y Y Y Y Y Y Y OBJECT_NAME
Y Y Y Y Y Y Y Y SUBOBJECT_NAME
Y Y Y Y Y Y Y Y Y OBJECT_ID
Y Y Y Y Y Y Y Y DATA_OBJECT_ID
Y Y Y Y Y Y Y Y OBJECT_TYPE
Y Y Y Y Y Y Y CREATED
Y Y Y Y Y Y Y LAST_DDL_TIME
Y Y Y Y Y Y Y Y TIMESTAMP
Y Y Y Y Y Y Y Y Y STATUS
Y Y Y Y Y Y Y Y TEMPORARY
Y Y Y Y Y Y Y Y GENERATED
Y Y Y Y Y Y Y Y SECONDARY
...
Iteration 2, percentage 100
NNV NDV AVG MMX HST EP RP NNNP IND CNDV HSTN HSTR COLNAME
Y Y Y Y Y Y Y Y Y SYS_NC00015$
Y Y Y Y Y Y OWNER
Y Y Y Y Y Y Y Y Y OBJECT_NAME
Y Y Y SUBOBJECT_NAME
Y Y Y Y Y Y Y Y OBJECT_ID
Y Y Y Y Y Y Y Y DATA_OBJECT_ID
Y Y Y OBJECT_TYPE
Y Y Y Y CREATED
Y Y Y Y LAST_DDL_TIME
Y Y Y TIMESTAMP
Y Y Y Y STATUS
Y Y Y TEMPORARY
Y Y Y GENERATED
Y Y Y SECONDARY
...
At the beginning of each iteration, the trace prints the gathering flag bits. Oracle uses those bits to decide which statistics should be analyzed; by comparing the bits of adjacent iterations, we can tell which statistics will be re-analyzed.
Values such as the sample size (SSIZE), the sampled number of not-null values (SNNV) and the calculated number of distinct values (NDV) are checked to decide whether the percentage should be adjusted. Below is the pseudocode, followed by a small runnable sketch of the adjustment rule.
SQL code
get the initial sampling percentage, set it as CURR_PCT
loop
{
    gather the basic statistics
    set NEW_PCT = CURR_PCT * 10
    for each column that requires a histogram
    {
        if the column's SNNV < 441 then
        {
            needs re-analysis
            NEW_PCT = GREATEST(PCT*441*1.25/GREATEST(SNNV, 1), NEW_PCT)
        }
        else
        {
            gather the column histogram with NEW_PCT
            if it is a height-balanced histogram and the NDV cannot be scaled then
                needs re-analysis
        }
    }
    if the sample size of the table < 441 then
    {
        needs re-analysis
        NEW_PCT = GREATEST(PCT*441*1.25/GREATEST(SNNV, 1), NEW_PCT)
    }
    evaluate each column's basic statistics one by one
    {
        if the column requires average size or max/min values then
            if the original PCT is auto and the column's SNNV < 919 then
            {
                needs re-analysis
                NEW_PCT = GREATEST(PCT*441*1.25/GREATEST(SNNV, 1), NEW_PCT)
            }
        if the column requires NDV then
            if the NDV cannot be scaled then
                needs re-analysis
    }
    exit if this is not auto sampling
    if re-analysis is needed then
        if the current sample size >= 441 and NEW_PCT > 25 and NEW_PCT < 50 and CURR_PCT <= 5 then
            NEW_PCT = 25
        PREV_PCT = CURR_PCT
        CURR_PCT = NEW_PCT
    else
        exit
}
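The escalation arithmetic at the heart of this loop can be sketched as a small PL/SQL function; the name and parameters are mine, and it covers only the NEW_PCT calculation, not the full gathering logic:
SQL code
CREATE OR REPLACE FUNCTION next_sample_pct (
    p_curr_pct    IN NUMBER,   -- percentage used in the current iteration
    p_snnv        IN NUMBER,   -- sampled number of not-null values
    p_sample_size IN NUMBER)   -- sample size of the current iteration
RETURN NUMBER IS
    new_pct NUMBER := p_curr_pct * 10;   -- default: grow ten-fold
BEGIN
    IF p_snnv < 441 THEN
        -- too few not-null values sampled: target roughly 441*1.25 of them
        new_pct := GREATEST(p_curr_pct * 441 * 1.25 / GREATEST(p_snnv, 1),
                            new_pct);
    END IF;
    -- damping rule from the pseudocode: do not jump far past 25% at once
    IF p_sample_size >= 441 AND new_pct > 25 AND new_pct < 50
       AND p_curr_pct <= 5 THEN
        new_pct := 25;
    END IF;
    RETURN LEAST(new_pct, 100);   -- cap at 100 (full scan); the cap is assumed
END;
/
For instance, next_sample_pct(11.56, 380, 5401) escalates straight to 100 percent, which is consistent with iteration 2 in the trace above running at 100 percent.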
We can also see in the trace which statistics must be re-analyzed.
SQL code
...
Scaling NDV of OBJECT_NAME
Need larger sample: sndv = 5128, snnvdv = 5401, nnv = 46726, ndv = 24986
...
The sample size is recorded in the dictionary at both the table level (tab$) and the column level (hist_head$).
SQL code
HELLODBA.COM>select o.name, t.samplesize
  2  from tab$ t, obj$ o, user$ u
  3  where t.obj# = o.obj#
  4  and o.owner# = u.user#
  5  and u.name = 'DEMO'
  6  and o.name = 'T_TEST1';
NAME                           SAMPLESIZE
------------------------------ ----------
T_TEST1                             47582
HELLODBA.COM>select u.name, o.name, c.name, h.sample_size
  2  from hist_head$ h, sys.col$ c, obj$ o, user$ u
  3  where h.obj# = c.obj#
  4  and h.intcol# = c.intcol#
  5  and h.obj# = o.obj#
  6  and o.owner# = u.user#
  7  and u.name = 'DEMO'
  8  and o.name = 'T_TEST1';
NAME                           NAME                           NAME                           SAMPLE_SIZE
------------------------------ ------------------------------ ------------------------------ -----------
DEMO                           T_TEST1                        SYS_NC00015$                         47582
DEMO                           T_TEST1                        OWNER                                 5401
DEMO                           T_TEST1                        OBJECT_NAME                          47582
DEMO                           T_TEST1                        SUBOBJECT_NAME                        5401
DEMO                           T_TEST1                        OBJECT_ID                            47582
DEMO                           T_TEST1                        DATA_OBJECT_ID                        5618
DEMO                           T_TEST1                        OBJECT_TYPE                           5401
DEMO                           T_TEST1                        CREATED                               5401
DEMO                           T_TEST1                        LAST_DDL_TIME                         5401
DEMO                           T_TEST1                        TIMESTAMP                             5401
DEMO                           T_TEST1                        STATUS                                5401
DEMO                           T_TEST1                        TEMPORARY                             5401
DEMO                           T_TEST1                        GENERATED                             5401
DEMO                           T_TEST1                        SECONDARY                             5401
14 rows selected.
The table's sample size is the sample size used in the last iteration, while each column's sample size is the sample size of the statistics analyzed for that column, including its histogram.
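On a live system the same figures are easier to read from the documented views; a sketch equivalent to the base-table queries above:
SQL code
select table_name, sample_size, num_rows
from   dba_tables
where  owner = 'DEMO' and table_name = 'T_TEST1';

select column_name, sample_size, num_distinct
from   dba_tab_col_statistics
where  owner = 'DEMO' and table_name = 'T_TEST1';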
--- Fuyuncat ---