In this blog post we are going to explore important properties of the hybrid histogram (HH). The concepts of bucket size and average bucket size will be explained; these help us understand how an HH is created. We will test the same table (with the same data) in Oracle Database 12.2.0.1 and 18.3.0.0, and we will review how a defect in the construction of the HH has been solved. We will show how the fix **"25994960: CARDINALITY MISESTIMATE FROM HYBRID HISTOGRAM"** improves the quality of the HH: it prevents the most popular value(s) from being lost from the histogram, and we will see how this is done. Oracle Database 18c contains the fix but 12.2 does not; therefore we perform the following test case in both 12.2 and 18c.

Let's see the case:

CREATE TABLE t1
AS
SELECT CASE
         WHEN level <= 600 THEN TRUNC (DBMS_RANDOM.value (1, 30))
         WHEN level >= 960 AND level <= 997 THEN 997
         ELSE level
       END AS col
FROM dual
CONNECT BY level <= 1000;

The table data:

SELECT col, COUNT (*), SUM (COUNT (*)) OVER () cnt
FROM t1
GROUP BY col
HAVING COUNT (*) > 1
ORDER BY 1;

       COL   COUNT(*)        CNT
---------- ---------- ----------
         1         23        638
         2         17        638
         3         24        638
         4         24        638
         5         24        638
         6         18        638
         7         16        638
         8         24        638
         9         14        638
        10         37        638
        11         14        638
        12         13        638
        13         16        638
        14         24        638
        15         21        638
        16         15        638
        17         20        638
        18         19        638
        19         25        638
        20         29        638
        21         25        638
        22         21        638
        23         19        638
        24         23        638
        25         15        638
        26         23        638
        27         23        638
        28         19        638
        29         15        638
       997         38        638

30 rows selected.

SQL> SELECT MIN (col), MAX (col), cnt
     FROM (SELECT col, COUNT (*), SUM (COUNT (*)) OVER () cnt
           FROM t1
           GROUP BY col
           HAVING COUNT (*) = 1
           ORDER BY 1)
     GROUP BY cnt;

  MIN(COL)   MAX(COL)        CNT
---------- ---------- ----------
       601       1000        362

SQL>

As you can see, the table contains 638 rows with frequencies greater than one (the number of distinct values among these rows is 30) and 362 rows with a frequency of one (362 distinct values). So there are 392 distinct values in the table column. We use the same table data in both the 12.2 and 18.3 Oracle Databases to generate the HH.
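As a cross-check, the CASE expression above can be mirrored outside the database. The following Python sketch is my own re-implementation of the CONNECT BY generator (with a fixed seed in place of DBMS_RANDOM), not anything Oracle runs; it reproduces the same distribution of popular and singleton values:

```python
import random
from collections import Counter

random.seed(42)  # fixed seed so the counts are reproducible

# Mirror of the CASE expression: levels 1..600 -> random value 1..29,
# levels 960..997 -> the constant 997, everything else -> the level itself.
col = []
for level in range(1, 1001):
    if level <= 600:
        col.append(int(random.uniform(1, 30)))  # TRUNC(DBMS_RANDOM.value(1,30))
    elif 960 <= level <= 997:
        col.append(997)
    else:
        col.append(level)

freq = Counter(col)
popular_rows = sum(f for f in freq.values() if f > 1)     # rows whose value repeats
singleton_rows = sum(f for f in freq.values() if f == 1)  # rows with unique values

print(len(freq), popular_rows, singleton_rows)  # 392 638 362 for this seed
```

With 600 rows drawn from only 29 values, every value 1..29 repeats, so the 638/362 split matches the GROUP BY output above.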

begin
  dbms_stats.gather_table_stats(
    ownname    => null,
    tabname    => 't1',
    method_opt => 'for columns col size 80');
end;
/

The histogram data from Oracle DB 12.2:

SQL> select endpoint_value value, endpoint_number,
            endpoint_number - lag(endpoint_number,1,0) over (order by endpoint_number) bucket_size,
            endpoint_repeat_count
     from user_tab_histograms
     where table_name = 'T1' and column_name = 'COL'
     order by endpoint_value;

     VALUE ENDPOINT_NUMBER BUCKET_SIZE ENDPOINT_REPEAT_COUNT
---------- --------------- ----------- ---------------------
         1              23          23                    23
         2              40          17                    17
         3              64          24                    24
         4              88          24                    24
         5             112          24                    24
         6             130          18                    18
         7             146          16                    16
         8             170          24                    24
         9             184          14                    14
        10             221          37                    37
        11             235          14                    14
        12             248          13                    13
        13             264          16                    16
        14             288          24                    24
        15             309          21                    21
        16             324          15                    15
        17             344          20                    20
        18             363          19                    19
        19             388          25                    25
        20             417          29                    29
        21             442          25                    25
        22             463          21                    21
        23             482          19                    19
        24             505          23                    23
        25             520          15                    15
        26             543          23                    23
        27             566          23                    23
        28             585          19                    19
        29             600          15                    15
       607             607           7                     1
       614             614           7                     1
       621             621           7                     1
       628             628           7                     1
       635             635           7                     1
       642             642           7                     1
       649             649           7                     1
       655             655           6                     1
       662             662           7                     1
       669             669           7                     1
       676             676           7                     1
       683             683           7                     1
       690             690           7                     1
       697             697           7                     1
       704             704           7                     1
       711             711           7                     1
       718             718           7                     1
       725             725           7                     1
       732             732           7                     1
       739             739           7                     1
       745             745           6                     1
       752             752           7                     1
       759             759           7                     1
       766             766           7                     1
       773             773           7                     1
       780             780           7                     1
       787             787           7                     1
       794             794           7                     1
       801             801           7                     1
       808             808           7                     1
       815             815           7                     1
       822             822           7                     1
       828             828           6                     1
       835             835           7                     1
       842             842           7                     1
       849             849           7                     1
       856             856           7                     1
       863             863           7                     1
       870             870           7                     1
       877             877           7                     1
       884             884           7                     1
       891             891           7                     1
       898             898           7                     1
       905             905           7                     1
       911             911           6                     1
       918             918           7                     1
       925             925           7                     1
       932             932           7                     1
       939             939           7                     1
       946             946           7                     1
      1000            1000          54                     1

80 rows selected.

SQL> SELECT num_distinct, num_buckets, density, histogram
     FROM user_tab_col_statistics
     WHERE table_name = 'T1' AND column_name = 'COL';

NUM_DISTINCT NUM_BUCKETS    DENSITY HISTOGRAM
------------ ----------- ---------- ---------------
         392          80    .001969 HYBRID

Now the histogram data from Oracle DB 18.3:

SQL> select endpoint_value value, endpoint_number,
            endpoint_number - lag(endpoint_number,1,0) over (order by endpoint_number) bucket_size,
            endpoint_repeat_count
     from user_tab_histograms
     where table_name = 'T1' and column_name = 'COL'
     order by endpoint_value;

     VALUE ENDPOINT_NUMBER BUCKET_SIZE ENDPOINT_REPEAT_COUNT
---------- --------------- ----------- ---------------------
         1              23          23                    23
         2              40          17                    17
         3              64          24                    24
         4              88          24                    24
         5             112          24                    24
         6             130          18                    18
         7             146          16                    16
         8             170          24                    24
         9             184          14                    14
        10             221          37                    37
        11             235          14                    14
        12             248          13                    13
        13             264          16                    16
        14             288          24                    24
        15             309          21                    21
        16             324          15                    15
        17             344          20                    20
        18             363          19                    19
        19             388          25                    25
        20             417          29                    29
        21             442          25                    25
        22             463          21                    21
        23             482          19                    19
        24             505          23                    23
        25             520          15                    15
        26             543          23                    23
        27             566          23                    23
        28             585          19                    19
        29             600          15                    15
       608             608           8                     1
       615             615           7                     1
       622             622           7                     1
       629             629           7                     1
       637             637           8                     1
       644             644           7                     1
       651             651           7                     1
       658             658           7                     1
       666             666           8                     1
       673             673           7                     1
       680             680           7                     1
       687             687           7                     1
       695             695           8                     1
       702             702           7                     1
       709             709           7                     1
       716             716           7                     1
       724             724           8                     1
       731             731           7                     1
       738             738           7                     1
       745             745           7                     1
       752             752           7                     1
       760             760           8                     1
       767             767           7                     1
       774             774           7                     1
       781             781           7                     1
       789             789           8                     1
       796             796           7                     1
       803             803           7                     1
       810             810           7                     1
       818             818           8                     1
       825             825           7                     1
       832             832           7                     1
       839             839           7                     1
       847             847           8                     1
       854             854           7                     1
       861             861           7                     1
       868             868           7                     1
       876             876           8                     1
       883             883           7                     1
       890             890           7                     1
       897             897           7                     1
       905             905           8                     1
       912             912           7                     1
       919             919           7                     1
       926             926           7                     1
       933             933           7                     1
       941             941           8                     1
       948             948           7                     1
       955             955           7                     1
       997             997          42                    38
      1000            1000           3                     1

80 rows selected.

SQL> SELECT num_distinct, num_buckets, density, histogram
     FROM user_tab_col_statistics
     WHERE table_name = 'T1' AND column_name = 'COL';

NUM_DISTINCT NUM_BUCKETS    DENSITY HISTOGRAM
------------ ----------- ---------- ---------------
         392          80      .0019 HYBRID

As you can see, the 12c HH does not include the most frequent value (**997**) but the 18c HH does. Although the table data are the same, the histogram data are different in the two databases, and that is why the densities also differ. To understand the problem we need to trace the statistics gathering process in both databases.

The trace files in both databases show that the same SQL statement and the same data were used to create the HH.

**12c trace:**

DBMS_STATS: Building Histogram for COL
DBMS_STATS: bktnum=80, nnv=1000, snnv=1000, sndv=392, est_ndv=392, mnb=80
DBMS_STATS: Trying hybrid histogram
DBMS_STATS: select substrb(dump(val,16,0,64),1,240) ep, freq, cdn, ndv, (sum(pop) over()) popcnt, (sum(pop*freq) over()) popfreq, substrb(dump(max(val) over(),16,0,64),1,240) maxval, substrb(dump(min(val) over(),16,0,64),1,240) minval from (select val, freq, (sum(freq) over()) cdn, (count(*) over()) ndv, (case when freq > ((sum(freq) over())/80) then 1 else 0 end) pop from (select /*+ no_parallel(t) no_parallel_index(t) dbms_stats cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring xmlindex_sel_idx_tbl opt_param('optimizer_inmemory_aware' 'false') no_substrb_pad */ "COL" val, count("COL") freq from "SYSTEM"."T1" t where "COL" is not null group by "COL")) order by val
DBMS_STATS: > cdn 1000, popFreq 638, popCnt 30, bktSize 6.91836734693877551020408163265306122449, bktSzFrc .91836734693877551020408163265306122449
DBMS_STATS: Evaluating hybrid histogram: cht.count 80, mnb 80, ssize 1000, min_ssize 2500, appr_ndv TRUE, ndv 392, selNdv 15, selFreq 343, pct 100, avg_bktsize 13, csr.hreq TRUE, normalize TRUE
DBMS_STATS: Histogram gathering flags: 7
DBMS_STATS: Accepting histogram

**18c trace:**

DBMS_STATS: Building Histogram for COL
DBMS_STATS: bktnum=80, nnv=1000, snnv=1000, sndv=392, est_ndv=392, mnb=80
DBMS_STATS: Trying hybrid histogram
DBMS_STATS: select substrb(dump(val,16,0,64),1,240) ep, freq, cdn, ndv, (sum(pop) over()) popcnt, (sum(pop*freq) over()) popfreq, substrb(dump(max(val) over(),16,0,64),1,240) maxval, substrb(dump(min(val) over(),16,0,64),1,240) minval from (select val, freq, (sum(freq) over()) cdn, (count(*) over()) ndv, (case when freq > ((sum(freq) over())/80) then 1 else 0 end) pop from (select /*+ no_parallel(t) no_parallel_index(t) dbms_stats cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring xmlindex_sel_idx_tbl opt_param('optimizer_inmemory_aware' 'false') no_substrb_pad */ "COL" val, count("COL") freq from "CHINAR"."T1" t where "COL" is not null group by "COL")) order by val (SQL_HH)
DBMS_STATS: > cdn 1000, popFreq 638, popCnt 30, bktSize 7.24, bktSzFrc .24
DBMS_STATS: Evaluating hybrid histogram: cht.count 80, mnb 80, ssize 1000, min_ssize 2500, appr_ndv TRUE, ndv 392, selNdv 16, selFreq 381, pct 100, avg_bktsize 13, csr.hreq TRUE, normalize TRUE
DBMS_STATS: Histogram gathering flags: 7
DBMS_STATS: Accepting histogram

It is clearly seen that both trace files report the same cardinality (**cdn**), popular frequency (**popFreq**), popular value count (**popCnt**) and average bucket size; only the original **bucket size** differs. The bucket size was calculated as **6.9183** in 12c and as **7.24** in 18c. And this is the answer to why the 12c histogram does not contain the most frequent value(s).

Now, what is the bucket size and how is it calculated?

In order to create the HH, Oracle first calculates the bucket size and then applies it to the data returned by the (**SQL_HH**) statement. Let's quickly review that data. In both Oracle Database versions the same data is retrieved:

SELECT val, freq, cdn, ndv,
       (SUM (pop) OVER ()) popcnt,
       (SUM (pop * freq) OVER ()) popfreq
FROM (SELECT val, freq,
             (SUM (freq) OVER ()) cdn,
             (COUNT (*) OVER ()) ndv,
             (CASE WHEN freq > ((SUM (freq) OVER ()) / 80) THEN 1 ELSE 0 END) pop
      FROM (SELECT "COL" val, COUNT ("COL") freq
            FROM "CHINAR"."T1" t
            WHERE "COL" IS NOT NULL
            GROUP BY "COL"))
ORDER BY val;

       VAL       FREQ        CDN        NDV     POPCNT    POPFREQ
---------- ---------- ---------- ---------- ---------- ----------
         1         23       1000        392         30        638
         2         17       1000        392         30        638
         3         24       1000        392         30        638
         4         24       1000        392         30        638
         5         24       1000        392         30        638
         6         18       1000        392         30        638
         7         16       1000        392         30        638
         8         24       1000        392         30        638
         9         14       1000        392         30        638
        10         37       1000        392         30        638
        11         14       1000        392         30        638
        12         13       1000        392         30        638
        13         16       1000        392         30        638
        14         24       1000        392         30        638
        15         21       1000        392         30        638
        16         15       1000        392         30        638
        17         20       1000        392         30        638
        18         19       1000        392         30        638
        19         25       1000        392         30        638
        20         29       1000        392         30        638
        21         25       1000        392         30        638
        22         21       1000        392         30        638
        23         19       1000        392         30        638
        24         23       1000        392         30        638
        25         15       1000        392         30        638
        26         23       1000        392         30        638
        27         23       1000        392         30        638
        28         19       1000        392         30        638
        29         15       1000        392         30        638
       601          1       1000        392         30        638
       602          1       1000        392         30        638
       603          1       1000        392         30        638
       604          1       1000        392         30        638
       605          1       1000        392         30        638
       606          1       1000        392         30        638
       607          1       1000        392         30        638
       608          1       1000        392         30        638
.................................................................
       947          1       1000        392         30        638
       948          1       1000        392         30        638
       949          1       1000        392         30        638
       950          1       1000        392         30        638
       951          1       1000        392         30        638
       952          1       1000        392         30        638
       953          1       1000        392         30        638
       954          1       1000        392         30        638
       955          1       1000        392         30        638
       956          1       1000        392         30        638
       957          1       1000        392         30        638
       958          1       1000        392         30        638
       959          1       1000        392         30        638
       997         38       1000        392         30        638
       998          1       1000        392         30        638
       999          1       1000        392         30        638
      1000          1       1000        392         30        638

392 rows selected.

The bucket size should be calculated properly in order to include as many of the most frequent values as possible. Where do the bucket and its size come from? We will explain that first.

Here we have the **CDN** (the number of rows in the table, or the histogram cardinality when sample data is used), the number of popular values (**PopCnt**) and their total frequency, the popular frequency (**PopFreq**). This means there are (**NDV-PopCnt**) unpopular distinct values and (**CDN-PopFreq**) unpopular rows (their total frequency). Also, we need to create **N** (in our case 80) buckets, and the histogram should contain all of the top-frequency values. Each popular value should be located in a separate bucket; therefore we have to reserve **PopCnt** buckets for the popular values. This leaves **(N-PopCnt)** buckets for the unpopular rows. In effect we distribute the table (column) rows into two types of buckets, popular and unpopular. So each unpopular bucket will contain:
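The popularity test itself (freq > CDN/N, exactly the `case when freq > cdn/80` expression in the traced SQL) can be verified with a few lines of Python. The frequency list below is transcribed from the GROUP BY output earlier in the post; the variable names are mine:

```python
# Frequencies of the 30 repeating values (1..29 and 997), transcribed from
# the GROUP BY output above; the remaining 362 values each occur once.
repeating = [23, 17, 24, 24, 24, 18, 16, 24, 14, 37, 14, 13, 16, 24, 21,
             15, 20, 19, 25, 29, 25, 21, 19, 23, 15, 23, 23, 19, 15, 38]
freqs = repeating + [1] * 362

N = 80               # requested number of buckets
cdn = sum(freqs)     # histogram cardinality
threshold = cdn / N  # a value is "popular" if its frequency exceeds CDN/N

pop_cnt = sum(1 for f in freqs if f > threshold)
pop_freq = sum(f for f in freqs if f > threshold)

print(cdn, pop_cnt, pop_freq)  # 1000 30 638, matching the trace files
```

Since the smallest repeating frequency (13) already exceeds the threshold of 12.5, all 30 repeating values are popular, which is exactly what both traces report (popCnt 30, popFreq 638).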

Unpopular_rows / Unpopular_buckets = (CDN - PopFreq) / (N - PopCnt)    (F.1)

This formula explains the concept of the bucket size: the bucket size is calculated based on formula (F.1).

But **the minimum size of a popular bucket** will be:

PopBktSize = min(PopValue_i), i = 1..PopCnt

Now, let's look at the two histograms above and try to understand how their bucket sizes were identified.

BktSize_18c = (CDN - PopFreq) / (N - PopCnt) = (1000 - 638) / (80 - 30) = 7.24

As you can see, this is the same number reported in the 18c trace file. But why is the 12c bucket size different?

The answer is that formula (F.1) identifies the bucket size for the HH properly; however, there is an important detail to consider. The (**SQL_HH**) statement returns the dataset used to create the HH, but first the minimum value of the column has to be placed into the first bucket of the HH (as a separate bucket), regardless of its popularity and size. So the minimum value forms the first bucket of our HH, and its size equals the frequency of that minimum value. This means we have to account for it in formula (F.1). In addition, the last bucket of the HH holds the maximum column value, and its size is determined by the sizes of the previous buckets and the number of rows in the table.

So, if the minimum column value is popular, then it is already accounted for in formula (F.1); otherwise, if it is not a popular value, we have to exclude it from formula (F.1). Thus, the final bucket size will be:

If the minimum value is popular then
    BktSize = (CDN - PopFreq) / (N - PopCnt)
else
    BktSize = (CDN - PopFreq - freq of the minimum value) / (N - PopCnt - 1)    (F.2)

In our case the minimum value is **1** and its frequency is **23**. Since the average bucket size is **13**, it is a popular value, and therefore the bucket size should be **7.24** according to formula **(F.2)**.

But 12c does not consider this fact: it places the minimum value in the first bucket of the HH and then reduces the bucket size by subtracting its frequency, even though it is already included. So:

BktSize_12c = (CDN - PopFreq - freq of the minimum value) / (N - PopCnt - 1) = (1000 - 638 - 23) / (80 - 30 - 1) = 6.91836...
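Both calculations can be reproduced directly from the trace values. The sketch below is just formula (F.2) and the 12c variant written out in Python; the variable names are mine, not Oracle's:

```python
cdn, pop_freq, pop_cnt, n = 1000, 638, 30, 80  # values from the trace files
min_val_freq = 23       # frequency of the minimum column value (1)
avg_bktsize = cdn / n   # 12.5, reported (rounded) as 13 in the traces

min_val_is_popular = min_val_freq > avg_bktsize  # True: 23 > 12.5

# 18c applies (F.2) correctly: the popular minimum is already accounted for.
bkt_size_18c = (cdn - pop_freq) / (n - pop_cnt)

# 12c subtracts the minimum value's frequency again, shrinking the bucket.
bkt_size_12c = (cdn - pop_freq - min_val_freq) / (n - pop_cnt - 1)

print(bkt_size_18c)  # 7.24
print(bkt_size_12c)  # 6.918367... as in the 12c trace
```

The two results match the bktSize values reported in the 18c and 12c trace files respectively.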

The undersized bucket prevents the most popular values from being reached (included): the smaller (improper) bucket size causes all 80 (**N**) buckets to be created before the most popular value is reached. If you look at the 12c histogram data, **79** buckets had already been created by the time the database reached the value **946**. It then stopped the process because the maximum bucket count had been reached.

It should be noted that formula **(F.2)** identifies the initial bucket size; an unpopular bucket can end up even smaller than the size defined by (F.2). You can see this in the blog post by Jonathan Lewis. In his example, the bucket size is calculated the same in both versions of Oracle Database:

DBMS_STATS: > cdn 800, popFreq 434, popCnt 5, bktSize 52.14285714285714285714285714285714285714, bktSzFrc .14285714285714285714285714285714285714
DBMS_STATS: Evaluating hybrid histogram: cht.count 13, mnb 13, ssize 800, min_ssize 2500, appr_ndv TRUE, ndv 22, selNdv 1, selFreq 109, pct 100, avg_bktsize 62, csr.hreq TRUE, normalize TRUE
DBMS_STATS: Histogram gathering flags: 527
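This trace is also consistent with formula (F.2). Assuming the minimum value in that dataset is unpopular with frequency 1 (an assumption on my part, since the underlying data is not shown here), the else branch of (F.2) reproduces the traced bktSize exactly:

```python
from fractions import Fraction

cdn, pop_freq, pop_cnt, n = 800, 434, 5, 13  # values from the trace above
min_val_freq = 1  # assumption: the minimum value in that dataset occurs once

# The minimum value is unpopular (1 < 800/13), so the else branch of (F.2)
# applies: subtract its frequency and reserve one bucket for it.
bkt_size = Fraction(cdn - pop_freq - min_val_freq, n - pop_cnt - 1)

print(float(bkt_size))  # 52.142857..., matching bktSize in the trace
```

365/7 = 52.142857..., with the repeating fraction .142857 matching the bktSzFrc in the trace; the naive (F.1) value (800-434)/(13-5) = 45.75 does not match, which supports the assumption.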

The bucket size is 52.14, yet the histogram contains buckets smaller than 52.14. This is because very few distinct values remain after the first 4 buckets are created (up to value=19 in the 18c histogram). So 13-4=9 buckets still had to be created, but only 22-13=9 distinct values remained (up to value 19 there are 13 distinct values), so all 9 remaining distinct values had to be included in the HH as separate buckets. Another difference between the two histograms is that in 18c Oracle Database reserves the last bucket for the maximum value but 12c does not. This was another drawback of 12c: reserving the last bucket up front helps the database include the most popular values, whereas merging/removing the last bucket can cause a popular value to be lost.

**Note:** Formula (F.1) also helps us understand the concept of selectivity for the HH. Each unpopular bucket can have a different size, and for each unpopular distinct value we will have:

Rows per unpopular distinct value = Unpopular_rows / Unpopular_NDV = (CDN - PopFreq) / (NDV - PopCnt)
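For our test case this formula gives exactly one row per unpopular distinct value, which is what we would expect, since every unpopular value here occurs exactly once. A quick Python check using the figures from the trace:

```python
cdn, pop_freq = 1000, 638  # cardinality and popular frequency from the trace
ndv, pop_cnt = 392, 30     # distinct values and popular-value count

# Average number of rows per unpopular distinct value.
rows_per_unpopular_value = (cdn - pop_freq) / (ndv - pop_cnt)

print(rows_per_unpopular_value)  # 1.0, each unpopular value occurs once
```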

**Summary**

The bucket size for a hybrid histogram is calculated based on **(F.2)**. This is the initial bucket size; using it and the dataset returned by SQL_HH, the database builds the hybrid histogram. The average bucket size, in contrast, is used to determine the popularity of a distinct value. The minimum and maximum values are placed in separate buckets: the first bucket contains only the minimum column value, and its size is the frequency of that minimum value.

Hi, this is a very interesting article. I was trying to duplicate the test case described at the beginning, specifically:

CREATE TABLE t1 AS
SELECT
  CASE
    WHEN level =960 and level <=997 then 997
    ELSE level
  END AS col
FROM dual
CONNECT BY level ” character added

CREATE TABLE t1 AS
SELECT
  CASE
    WHEN level >=960 and level <=997 then 997 /* ie "level BETWEEN 960 and 997" */
    ELSE level
  END AS col
FROM dual
CONNECT BY level <=1000;

Even so, the table as created above will have 1000 rows, each of them having a distinct and unique value, with the exception of 997, which would be duplicated 38 times.

There is no way we would get the skewed data as described further below in your article, by using that simple "connect by" statement.

Please post the SQL statement used to create the skewed data which was used in your test case. Thank you and best regards, I am looking forward to your next articles and posts.

Comment by Tibi — January 22, 2019 @ 2:24 pm |

Hello, it appears that wordpress is removing certain characters from comments, especially when encountering the “greater than” symbol. My previous post was somehow partially truncated because of that character. Cheers.

Comment by Tibi — January 22, 2019 @ 2:31 pm |

Thanks Tibi for the comment. Yes, you are right, there was a WordPress issue when I posted the SQL. I have just corrected it, so the data I was trying to generate is produced by the SQL now shown in the post.

Comment by Chinar Aliyev — January 23, 2019 @ 6:20 am |

Thank you for the additional information.

Comment by Tibi — January 23, 2019 @ 10:01 am |