Chinar Aliyev`s blog

November 6, 2018

Understanding Hybrid Histogram

Filed under: CBO,Hybrid Histogram,statistics — Chinar Aliyev @ 11:19 am

In this blog post we are going to explore important properties of hybrid histogram (HH). The concept of bucket and average bucket sizes will be explained. These will help us to understand the creation and nature of HH. We will test same table (with same data) in Oracle Database 12.2.0.1 and 18.3.0.0. In addition, we will review how the defect has been solved in the construction of HH. We will clear how the fix “25994960: CARDINALITY MISESTIMATE FROM HYBRID HISTOGRAM” improves the efficiency of HH. It actually, prevents losing the most popular value(s) from the histogram, we will see how it has been done. The Oracle Database 18c contains the fix but 12.2 does not. Therefore, we are performing the following test case in 12.2 and 18c
Let`s see the case:


CREATE TABLE t1 AS
SELECT
       CASE
         WHEN level =960 and level <=997 then 997
         ELSE level
       END AS col
FROM   dual 
CONNECT BY level <=1000; --> comment to avoid WordPress format issue 

The table data

 SELECT   col, COUNT ( * ), SUM (COUNT ( * )) OVER () cnt
    FROM   t1
GROUP BY   col
  HAVING   COUNT ( * ) > 1
ORDER BY   1;

       COL   COUNT(*)        CNT
---------- ---------- ----------
         1         23        638
         2         17        638
         3         24        638
         4         24        638
         5         24        638
         6         18        638
         7         16        638
         8         24        638
         9         14        638
        10         37        638
        11         14        638
        12         13        638
        13         16        638
        14         24        638
        15         21        638
        16         15        638
        17         20        638
        18         19        638
        19         25        638
        20         29        638
        21         25        638
        22         21        638
        23         19        638
        24         23        638
        25         15        638
        26         23        638
        27         23        638
        28         19        638
        29         15        638
       997         38        638

30 rows selected.

SQL>   SELECT   MIN (col), MAX (col), cnt
    FROM   (  SELECT   col, COUNT ( * ), SUM (COUNT ( * )) OVER () cnt
                FROM   t1
            GROUP BY   col
              HAVING   COUNT ( * ) = 1
            ORDER BY   1)
GROUP BY   cnt;

  2    3    4    5    6    7
  MIN(COL)   MAX(COL)        CNT
---------- ---------- ----------
       601       1000        362

SQL> 

As you see the table contains 638 rows (number of distinct values for these rows is 30) with frequencies greater than one and 362 rows (number of distinct values for these rows is 362) with one frequency. So there is 392 distinct value in the table column. We are using same table data in both 12.2 and 18.3 Oracle Databases to generate HH.

begin
        dbms_stats.gather_table_stats(
                ownname          => null,
                tabname          => 't1',
                method_opt       => 'for columns col size 80'
        );
end;
/

The histogram data from Oracle DB 12.2

SQL> select
  2          endpoint_value                                                            value,
        endpoint_number,
        endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) bucket_size,
        endpoint_repeat_count
from
        user_tab_histograms
where
        table_name  = 'T1'
and     column_name = 'COL'
order by
        endpoint_value
; 
     VALUE ENDPOINT_NUMBER BUCKET_SIZE ENDPOINT_REPEAT_COUNT
---------- --------------- ----------- ---------------------
         1              23          23                    23
         2              40          17                    17
         3              64          24                    24
         4              88          24                    24
         5             112          24                    24
         6             130          18                    18
         7             146          16                    16
         8             170          24                    24
         9             184          14                    14
        10             221          37                    37
        11             235          14                    14
        12             248          13                    13
        13             264          16                    16
        14             288          24                    24
        15             309          21                    21
        16             324          15                    15
        17             344          20                    20
        18             363          19                    19
        19             388          25                    25
        20             417          29                    29
        21             442          25                    25
        22             463          21                    21
        23             482          19                    19
        24             505          23                    23
        25             520          15                    15
        26             543          23                    23
        27             566          23                    23
        28             585          19                    19
        29             600          15                    15
       607             607           7                     1
       614             614           7                     1
       621             621           7                     1
       628             628           7                     1
       635             635           7                     1
       642             642           7                     1
       649             649           7                     1
       655             655           6                     1
       662             662           7                     1
       669             669           7                     1
       676             676           7                     1
       683             683           7                     1
       690             690           7                     1
       697             697           7                     1
       704             704           7                     1
       711             711           7                     1
       718             718           7                     1
       725             725           7                     1
       732             732           7                     1
       739             739           7                     1
       745             745           6                     1
       752             752           7                     1
       759             759           7                     1
       766             766           7                     1
       773             773           7                     1
       780             780           7                     1
       787             787           7                     1
       794             794           7                     1
       801             801           7                     1
       808             808           7                     1
       815             815           7                     1
       822             822           7                     1
       828             828           6                     1
       835             835           7                     1
       842             842           7                     1
       849             849           7                     1
       856             856           7                     1
       863             863           7                     1
       870             870           7                     1
       877             877           7                     1
       884             884           7                     1
       891             891           7                     1
       898             898           7                     1
       905             905           7                     1
       911             911           6                     1
       918             918           7                     1
       925             925           7                     1
       932             932           7                     1
       939             939           7                     1
       946             946           7                     1
      1000            1000          54                     1

80 rows selected.

SQL> SELECT   num_distinct, num_buckets, density,histogram
  FROM   user_tab_col_statistics
 WHERE   table_name = 'T1' AND column_name = 'COL'  2    3
  4  ;

NUM_DISTINCT NUM_BUCKETS    DENSITY HISTOGRAM
------------ ----------- ---------- ---------------
         392          80    .001969 HYBRID

Now the histogram data from Oracle DB 18.3

SQL> select
  2          endpoint_value                                                            value,
        endpoint_number,
        endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) bucket_size,
        endpoint_repeat_count
from
        user_tab_histograms
where
        table_name  = 'T1'
and     column_name = 'COL'
order by
        endpoint_value  3    4    5    6    7    8    9   10   11   12
 13  ;

     VALUE ENDPOINT_NUMBER BUCKET_SIZE ENDPOINT_REPEAT_COUNT
---------- --------------- ----------- ---------------------
         1              23          23                    23
         2              40          17                    17
         3              64          24                    24
         4              88          24                    24
         5             112          24                    24
         6             130          18                    18
         7             146          16                    16
         8             170          24                    24
         9             184          14                    14
        10             221          37                    37
        11             235          14                    14
        12             248          13                    13
        13             264          16                    16
        14             288          24                    24
        15             309          21                    21
        16             324          15                    15
        17             344          20                    20
        18             363          19                    19
        19             388          25                    25
        20             417          29                    29
        21             442          25                    25
        22             463          21                    21
        23             482          19                    19
        24             505          23                    23
        25             520          15                    15
        26             543          23                    23
        27             566          23                    23
        28             585          19                    19
        29             600          15                    15
       608             608           8                     1
       615             615           7                     1
       622             622           7                     1
       629             629           7                     1
       637             637           8                     1
       644             644           7                     1
       651             651           7                     1
       658             658           7                     1
       666             666           8                     1
       673             673           7                     1
       680             680           7                     1
       687             687           7                     1
       695             695           8                     1
       702             702           7                     1
       709             709           7                     1
       716             716           7                     1
       724             724           8                     1
       731             731           7                     1
       738             738           7                     1
       745             745           7                     1
       752             752           7                     1
       760             760           8                     1
       767             767           7                     1
       774             774           7                     1
       781             781           7                     1
       789             789           8                     1
       796             796           7                     1
       803             803           7                     1
       810             810           7                     1
       818             818           8                     1
       825             825           7                     1
       832             832           7                     1
       839             839           7                     1
       847             847           8                     1
       854             854           7                     1
       861             861           7                     1
       868             868           7                     1
       876             876           8                     1
       883             883           7                     1
       890             890           7                     1
       897             897           7                     1
       905             905           8                     1
       912             912           7                     1
       919             919           7                     1
       926             926           7                     1
       933             933           7                     1
       941             941           8                     1
       948             948           7                     1
       955             955           7                     1
       997             997          42                    38
      1000            1000           3                     1

80 rows selected.

SQL> SELECT   num_distinct, num_buckets, density,histogram
  FROM   user_tab_col_statistics
 WHERE   table_name = 'T1' AND column_name = 'COL'
  2    3    4  ;

NUM_DISTINCT NUM_BUCKETS    DENSITY HISTOGRAM
------------ ----------- ---------- ---------------
         392          80      .0019 HYBRID

As you see the 12c HH does not include the most frequent value (997) but 18c HH does. Although the table data are same the histogram data are different in the both databases and that is why the densities also are different. To understand the problem we need to trace statistics gathering process in both database.
Trace files in both database shows that the same SQL statement and same data were used to create HH

12c trace:

DBMS_STATS: Building Histogram for COL
DBMS_STATS:  bktnum=80, nnv=1000, snnv=1000, sndv=392, est_ndv=392, mnb=80
DBMS_STATS:  Trying hybrid histogram 
DBMS_STATS: select substrb(dump(val,16,0,64),1,240) ep,  freq, cdn, ndv, (sum(pop) over()) popcnt,  (sum(pop*freq) over()) popfreq,  substrb(dump(max(val) over(),16,0,64),1,240) maxval,  substrb(dump(min(val) over(),16,0,64),1,240) minval  from (select val, freq,  (sum(freq) over()) cdn, (count(*) over()) ndv,  (case when freq > ((sum(freq) over())/80)  then 1  else 0 end) pop from  (select /*+  no_parallel(t) no_parallel_index(t) dbms_stats cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring xmlindex_sel_idx_tbl opt_param('optimizer_inmemory_aware' 'false') no_substrb_pad  */ "COL"  val, count("COL") freq  from "SYSTEM"."T1" t  where "COL" is not null  group by "COL")) order by val
DBMS_STATS:  > cdn 1000, popFreq 638, popCnt 30, bktSize 6.91836734693877551020408163265306122449, bktSzFrc .91836734693877551020408163265306122449
DBMS_STATS:  Evaluating hybrid histogram:  cht.count 80, mnb 80, ssize 1000, min_ssize 2500, appr_ndv  TRUE, ndv 392, selNdv 15, selFreq 343, pct 100, avg_bktsize 13, csr.hreq TRUE, normalize TRUE
DBMS_STATS:   Histogram gathering flags: 7
DBMS_STATS:  Accepting histogram 

18c trace:

DBMS_STATS: Building Histogram for COL
DBMS_STATS:  bktnum=80, nnv=1000, snnv=1000, sndv=392, est_ndv=392, mnb=80
DBMS_STATS:  Trying hybrid histogram 
DBMS_STATS: select substrb(dump(val,16,0,64),1,240) ep,  freq, cdn, ndv, (sum(pop) over()) popcnt,  (sum(pop*freq) over()) popfreq,  substrb(dump(max(val) over(),16,0,64),1,240) maxval,  substrb(dump(min(val) over(),16,0,64),1,240) minval  from (select val, freq,  (sum(freq) over()) cdn, (count(*) over()) ndv,  (case when freq > ((sum(freq) over())/80)  then 1  else 0 end) pop from  (select /*+  no_parallel(t) no_parallel_index(t) dbms_stats cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring xmlindex_sel_idx_tbl opt_param('optimizer_inmemory_aware' 'false') no_substrb_pad  */ "COL"  val, count("COL") freq  from "CHINAR"."T1" t  where "COL" is not null  group by "COL")) order by val

(SQL_HH)

DBMS_STATS:  > cdn 1000, popFreq 638, popCnt 30, bktSize 7.24, bktSzFrc .24
DBMS_STATS:  Evaluating hybrid histogram:  cht.count 80, mnb 80, ssize 1000, min_ssize 2500, appr_ndv  TRUE, ndv 392, selNdv 16, selFreq 381, pct 100, avg_bktsize 13, csr.hreq TRUE, normalize TRUE
DBMS_STATS:   Histogram gathering flags: 7
DBMS_STATS:  Accepting histogram

It is clearly seen that both trace files reports that there cardinality (cdn), popular frequency (popFreq), popular value count (popCnt) and average bucket size are same except the bucket size (original). The bucket size was calculated 6.9183 in 12c and 7.24 in 18c. And this is the answer that is why 12c histogram does not contain the most frequent value(s).

Now, what is the bucket size and how is it calculated?

In order to create HH oracle firstly calculates the bucket size and then uses it against the data returned by the (SQL_HH) SQL statement. Let’s quick review that data. In both Oracle database version the same data is retrieved:

  SELECT   val,
           freq,
           cdn,
           ndv,
           (SUM (pop) OVER ()) popcnt,
           (SUM (pop * freq) OVER ()) popfreq
    FROM   (SELECT   val,
                     freq,
                     (SUM (freq) OVER ()) cdn,
                     (COUNT ( * ) OVER ()) ndv,
                     (CASE
                          WHEN freq > ( (SUM (freq) OVER ()) / 80) THEN 1
                          ELSE 0
                      END)
                         pop
              FROM   (  SELECT"COL"
                                     val,
                                 COUNT ("COL") freq
                          FROM   "CHINAR"."T1" t
                         WHERE   "COL" IS NOT NULL
                      GROUP BY   "COL"))
ORDER BY   val;

       VAL       FREQ        CDN        NDV     POPCNT    POPFREQ
---------- ---------- ---------- ---------- ---------- ----------
         1         23       1000        392         30        638
         2         17       1000        392         30        638
         3         24       1000        392         30        638
         4         24       1000        392         30        638
         5         24       1000        392         30        638
         6         18       1000        392         30        638
         7         16       1000        392         30        638
         8         24       1000        392         30        638
         9         14       1000        392         30        638
        10         37       1000        392         30        638
        11         14       1000        392         30        638
        12         13       1000        392         30        638
        13         16       1000        392         30        638
        14         24       1000        392         30        638
        15         21       1000        392         30        638
        16         15       1000        392         30        638
        17         20       1000        392         30        638
        18         19       1000        392         30        638
        19         25       1000        392         30        638
        20         29       1000        392         30        638
        21         25       1000        392         30        638
        22         21       1000        392         30        638
        23         19       1000        392         30        638
        24         23       1000        392         30        638
        25         15       1000        392         30        638
        26         23       1000        392         30        638
        27         23       1000        392         30        638
        28         19       1000        392         30        638
        29         15       1000        392         30        638
       601          1       1000        392         30        638
       602          1       1000        392         30        638
       603          1       1000        392         30        638
       604          1       1000        392         30        638
       605          1       1000        392         30        638
       606          1       1000        392         30        638
       607          1       1000        392         30        638
       608          1       1000        392         30        638
.................................................................
       947          1       1000        392         30        638
       948          1       1000        392         30        638
       949          1       1000        392         30        638
       950          1       1000        392         30        638
       951          1       1000        392         30        638
       952          1       1000        392         30        638
       953          1       1000        392         30        638
       954          1       1000        392         30        638
       955          1       1000        392         30        638
       956          1       1000        392         30        638
       957          1       1000        392         30        638
       958          1       1000        392         30        638
       959          1       1000        392         30        638
       997         38       1000        392         30        638
       998          1       1000        392         30        638
       999          1       1000        392         30        638
      1000          1       1000        392         30        638

392 rows selected.

The bucket size should be calculated properly in order to include the most frequent values as much as possible. Where does the bucket and its size come from? We firstly will explain it.

There we have CDN (number of rows in the table or histogram cardinality when using sample data), number of popular values (PopCnt) and their frequencies – popular frequencies (PopFreq). So, it means There (NDV-PopCnt) number of distinct values of unpopular values and (CDN-PopFreq) unpopular rows (or their total frequencies). Also, we need to create N (in our case 80) number of buckets and it should contain all top-frequent values. Each popular value should be located in a separate bucket, therefore, we have to reserve PopCnt number of buckets for popular values. It means there are (N-PopCnt) number of buckets for unpopular rows. We actually distribute the table (column) rows into two types of buckets: popular and unpopular buckets. So, each unpopular bucket will contain:

 Unpopular_rows/unpopular_buckets = (CDN-PopFreq) / (N-PopCnt)     (F.1)

This formula explains the concept of the bucket size. So, the bucket size is calculated based on the formula (F.1).
But the minimum size of the popular bucket will be:

  PopbktSize = min (〖PopValue〗_i) , i = [1..PopCnt]   

Now, let see the above two histograms and try to understand how bucket size have been identified of them.

 BktSize_18c =  (CDN-PopFreq) / (N-PopCnt)=(100-638)/(80-30)=7.24 

As you see it is the same with the number reported in the 18c trace file. But why the 12c bucket is different?
The answer is that the formula (F.1) identifies the bucket size for HH properly, however there is an important thing need to be considered. The (SQL_HH) returns us a dataset to create HH, but the firstly, the minimum value of the column has to be inserted to the first bucket (as a separate bucket) of HH regardless its popularity and size. So, the minimum value is the first bucket of our HH and its size equal to the frequency of that minimum value. It means we have to consider it in our formula (F.1). In addition the last bucket of the HH is the maximum column value and its size is determined based on size of previous buckets and number of rows in the table.
So, if the minimum column value is popular then we actually, already included it in the formula (F.1), otherwise if it is not a popular value then we have to exclude it in the formula (F.1). Thus, the final bucket size will be:

  If the minimum value is popular then BktSize = ((CDN-PopFreq))/((N-PopCnt) ) else
      BktSize = ((CDN-PopFreq- freq of the minimum value))/((N-PopCnt-1) )

(F.2)

In our case the minimum value is 1 and its frequency is 23, as you know average bucket size is 13 that is why it is a popular value and therefore bucket size should be 7.24 based on the formula (F.2).

But 12c does not consider that fact, it locates the minimum value in the first bucket of the HH and then reduces the bucket size by subtracting its frequency, although it is already included. So:

BktSize_12c =  (CDN-PopFreq-freq of the minimum value) / (N-PopCnt-1)=(1000-638-23)/(80-30-1)= 6.91836…

The undersized bucket prevents to be reached (included) the most popular values, So that, smaller (un-proper) bucket size causes to create 80 (N) number of buckets before reaching the most popular value. If you see the histogram data in 12c, the 79 number of buckets has been created when database reached the value 946. It stopped the process due to reaching the maximum bucket count.

It should be noted that the formula (F.2) is used to identify the initial bucket size, but unpopular bucket size could be even less than that bucket size defined by the (F.2). You can see that if you look at the blog post by Jonathan Lewis. In this example, the bucket size calculated as same in the both version of Oracle Database:

DBMS_STATS:  > cdn 800, popFreq 434, popCnt 5, bktSize 52.14285714285714285714285714285714285714, bktSzFrc .14285714285714285714285714285714285714
DBMS_STATS:  Evaluating hybrid histogram:  cht.count 13, mnb 13, ssize 800, min_ssize 2500, appr_ndv  TRUE, ndv 22, selNdv 1, selFreq 109, pct 100, avg_bktsize 62, csr.hreq TRUE, normalize TRUE
DBMS_STATS:   Histogram gathering flags: 527

The bucket size 52.14 and histograms contain buckets that size less than 52.14, this is because there are very less distinct values after creating 4 buckets (value=19 in 18c histogram). So 13-4=9 buckets had to be created but, we had 22-13 (until 19 there are 13 distinct value) = 9 distinct value after creating the 4 buckets, thus all 9 distinct values should be included in the HH as separate bucket. The another differences of the two histogram is that in 18c Oracle Database preserves last bucket for the maximum value but 12c does not. It was another drawback of the 12c that initial preserving may help database to include most popular values, otherwise merging/removing last bucket can cause the losing of the popular value.

Note: The formula (F.1) also helps us to understand the concept of selectivity for the HH. Each unpopular buckets could have different size, for each unpopular distinct value we will have:

   Rows per unpopular distinct value=unpopular_rows/unpopular_ndv=(CDN-PopFreq) / (NDV-PopCnt)

Summary

The bucket size for hybrid histogram is calculated based on (F.2). It is initial bucket size based on this and the dataset (returned by SQL_HH) the database determines hybrid histogram. But the average bucket size is used to identify popularity of the distinct value. The minimum and maximum values are located in separate buckets. The first bucket contains only minimum column value and its size is the frequency of the minimum value.

Advertisements

Leave a Comment »

No comments yet.

RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

Blog at WordPress.com.

%d bloggers like this: