Chinar Aliyev`s blog

December 7, 2021

Disjunctive predicates and column correlation

Filed under: Cardinality Estimation,statistics,theory,Tuning — Chinar Aliyev @ 7:32 am

Jonathan Lewis has written an interesting article talks about the OR predicate and column correlation. It is mentioned that if there are correlation between the columns and disjunctive predicate (OR) then optimizer does not use column group statistics and causes improper cardinality estimation. So, as per the blog post, the created column group statistics could have been used to calculate selectivity for the given predicate.

In this blog post we will review the case , share my thoughts about it and provide a solution for that.

The test case:

create table t1
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4    -- > 
)
select
        mod(rownum,100)                 n1,
        mod(rownum,100)                 n2,
        lpad(rownum,10,'0')             v1,
        lpad('x',100,'x')               padding
from
        generator 
;

begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1 for columns (n1, n2) size 1'
        );
end;
/

select
        column_name, num_distinct, data_default
from
        user_tab_cols
where
        table_name = 'T1'
/

COLUMN_NAME                      NUM_DISTINCT DATA_DEFAULT
-------------------------------- ------------ ----------------------------------------
PADDING                                     1
SYS_STUBZH0IHA7K$KEBJVXO5LOHAS            100 SYS_OP_COMBINED_HASH("N1","N2")
N1                                        100
N2                                        100
V1                                      10000

Now the query and its explain plan:

(Q1)
select  *
from    t1
where
        n1 = 50
or      n2 = 50;

----------------------------------------------------------
Plan hash value: 3617692013

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   199 | 23482 |    49   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |   199 | 23482 |    49   (0)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("N1"=50 OR "N2"=50)

First of all, to calculate selectivity for (n1=50 or n2=50) predicate optimizer did not use column group statistic (NDV). In my opinion, it is an expected behavior and even optimizer could not use it. By definition, when we try to create column group-CG (n1, n2), then it can be used only for equal and conjunctive predicates (AND condition), not for disjunctive one (OR condition). It comes from its default definition, and it is actually designed for AND predicate(s).

So, when CG is created, the below SQL is executed to add a virtual column to the table:

Alter table t1 add (SYS_STUBZH0IHA7K$KEBJVXO5LOHAS as (sys_op_combined_hash (n1, n2)) virtual BY USER for  statistics)

Based on values of n1 and n2 columns, the unique hash values are produced, logically it could be thought as grouping of the given columns values like:

SELECT COUNT (*) ndv
FROM (SELECT n1, n2
FROM t1
GROUP BY n1, n2); 

       NDV
----------
       100

Grouping brings meaning of AND predicate. Of course, just having the NDV for the group is not enough therefore, optimizer should know the frequency of each member of that group, so histogram of that virtual column. Means the pair of (sys_op_combined_hash (n1, n2), Freq), which is enough to calculate better selectivity for (n1=c1 and n2=c2) predicate.

Now, lets think about a possible solution for OR predicate: (n1=50 or n2=50). How can we solve it and how optimizer could have solved it by existing precomputed statistics, if we are able to provide such information for the query engine?

It could be solved by extended statistics as below:


BEGIN
    DBMS_STATS.gather_table_stats (
        NULL,
        't1',
        method_opt   => 'FOR ALL COLUMNS SIZE 1 FOR COLUMNS SIZE 1 (case when n1=50 or n2=50 then SYS_OP_COMBINED_HASH("N1","N2") else (-1)  END )');
END;
/

PL/SQL procedure successfully completed.

SELECT   column_name, num_distinct, density,data_default
  FROM   user_tab_cols
 WHERE   table_name = 'T1';


COLUMN_NAME                         NUM_DISTINCT    DENSITY DATA_DEFAULT
----------------------------------- ------------ ---------- ----------------------------------------------------------------------------------------------------
PADDING                                        1          1
SYS_STUBZH0IHA7K$KEBJVXO5LOHAS               100        .01 SYS_OP_COMBINED_HASH("N1","N2")
SYS_STUK23LXW5QQR4EUT#YPS1TXFO                 2         .5 CASE  WHEN ("N1"=50 OR "N2"=50) THEN SYS_OP_COMBINED_HASH("N1","N2") ELSE (-1) E
N1                                           100        .01
N2                                           100        .01
V1                                         10000      .0001

and explain plan:

EXPLAIN PLAN
    FOR

        SELECT   *
          FROM   t1
         WHERE   n1 = 50 OR n2 = 50;

SELECT   * FROM table (DBMS_XPLAN.display); 

Explained.

SQL>

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------
Plan hash value: 3617692013

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   199 | 23482 |    49   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |   199 | 23482 |    49   (0)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("N1"=50 OR "N2"=50)

the estimation did not change. To take advantage of the new CG, we should adapt our query (Q1) to the below one:



EXPLAIN PLAN
    FOR

        SELECT   *
          FROM   t1
         WHERE   "SYS_STUK23LXW5QQR4EUT#YPS1TXFO" =    
                     sys_op_combined_hash (50, 50);
(Q2)
SELECT   * FROM table (DBMS_XPLAN.display);

----------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 3617692013

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |  5000 |   600K|    49   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |  5000 |   600K|    49   (0)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(CASE  WHEN ("N1"=50 OR "N2"=50) THEN
              SYS_OP_COMBINED_HASH("N1","N2") ELSE (-1) END =8211477593589776077)

Even worst estimation! Due to number of distinct values of the SYS_STUK23LXW5QQR4EUT#YPS1TXFO column is 2 (so density 0.5) and we got 10000*0.5=5000! What to do?

To fix this, we need to gather a histogram for the SYS_STUK23LXW5QQR4EUT#YPS1TXFO virtual column as:



BEGIN
    DBMS_STATS.gather_table_stats (
        NULL,
        't1',
        method_opt   => 'FOR ALL COLUMNS SIZE 1 FOR COLUMNS SIZE 254 (case when n1=50 or n2=50 then SYS_OP_COMBINED_HASH("N1","N2") else (-1)  END )');
END;
/

PL/SQL procedure successfully completed.

SELECT   column_name,
         num_distinct,
         data_default,
         density,
         histogram
  FROM   user_tab_cols
 WHERE   table_name = 'T1';

COLUMN_NAME                         NUM_DISTINCT DATA_DEFAULT                                                                                            DENSITY HISTOGRAM
----------------------------------- ------------ ---------------------------------------------------------------------------------------------------- ---------- ---------------
PADDING                                        1                                                                                                               1 NONE
SYS_STUBZH0IHA7K$KEBJVXO5LOHAS               100 SYS_OP_COMBINED_HASH("N1","N2")                                                                             .01 NONE
SYS_STUK23LXW5QQR4EUT#YPS1TXFO                 2 CASE  WHEN ("N1"=50 OR "N2"=50) THEN SYS_OP_COMBINED_HASH("N1","N2") ELSE (-1) E                         .00005 FREQUENCY
N1                                           100                                                                                                             .01 NONE
N2                                           100                                                                                                             .01 NONE
V1                                         10000                                                                                                           .0001 NONE

And now:

EXPLAIN PLAN
    FOR
        SELECT   *
          FROM   t1
         WHERE   "SYS_STUK23LXW5QQR4EUT#YPS1TXFO" =
                     sys_op_combined_hash (50, 50);
SELECT   * FROM table (DBMS_XPLAN.display);

----------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 3617692013

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   100 | 12300 |    49   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |   100 | 12300 |    49   (0)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(CASE  WHEN ("N1"=50 OR "N2"=50) THEN
              SYS_OP_COMBINED_HASH("N1","N2") ELSE (-1) END =8211477593589776077)

We got exactly what we wanted, Correct cardinality estimation! But my purpose here is not to change the query, instead give an idea how to deal with it.

So, till now we are able to fix the problem for specific conditions, means for specific column values. Now we want to give a fix for general problem, we don’t want to create a virtual column for each pair of values.

How to estimate the cardinality of (N1=C1 OR N2=C2) predicate, here C1 and C2 are given any values.

To solve this problem, we can start/perform the similar approach what we did above, firstly need to create below CG:


BEGIN
    DBMS_STATS.gather_table_stats (
        NULL,
        't1',
        method_opt   => 'FOR ALL COLUMNS SIZE 1 FOR COLUMNS SIZE 254 (case when n1=n1 or n2=n2 then SYS_OP_COMBINED_HASH("N1","N2") else (-1)  END )');
END; 
/

PL/SQL procedure successfully completed.

SELECT   column_name,
         num_distinct,
         data_default,
         density,
         histogram
  FROM   user_tab_cols
 WHERE   table_name = 'T1';

HISTOGRAM
----------------------------------- ------------ ---------------------------------------------------------------------------------------------------- ---------- ---------------
PADDING                                        1                                                                                                               1 NONE
SYS_STUBZH0IHA7K$KEBJVXO5LOHAS               100 SYS_OP_COMBINED_HASH("N1","N2")                                                                             .01 NONE
SYS_STUK23LXW5QQR4EUT#YPS1TXFO                 2 CASE  WHEN ("N1"=50 OR "N2"=50) THEN SYS_OP_COMBINED_HASH("N1","N2") ELSE (-1) E                             .5 NONE
SYS_STUE6IT0GYR7ZIWPS_N3EJJJND               100 CASE  WHEN ("N1"="N1" OR "N2"="N2") THEN SYS_OP_COMBINED_HASH("N1","N2") ELSE (-                         .00005 FREQUENCY
N1                                           100                                                                                                             .01 NONE
N2                                           100                                                                                                             .01 NONE
V1                                         10000                                                                                                           .0001 NONE

So, it will give us correct estimation any (N1=C OR N2=C) predicates, here C is constant value for our case/data. The above virtual column (SYS_STUE6IT0GYR7ZIWPS_N3EJJJND)- CG2 has a FREQUENCY histogram with 100 buckets, then we will get:


EXPLAIN PLAN
    FOR
        SELECT   *
          FROM   t1
         WHERE   "SYS_STUE6IT0GYR7ZIWPS_N3EJJJND" =  --> this is the new virtual column
                     sys_op_combined_hash (10, 10);
SELECT   * FROM table (DBMS_XPLAN.display);

----------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 3617692013

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   100 | 13000 |    49   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |   100 | 13000 |    49   (0)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(CASE  WHEN ("N1"="N1" OR "N2"="N2") THEN
              SYS_OP_COMBINED_HASH("N1","N2") ELSE (-1) END =13703473795421408760)

But our solution should cover all (N1=C1 OR N2=C2) predicates, which C1/C2 are any given values (so, C1<>C2 is also possible).

In this case we can use below simple formula:

There i and j indicate position of the column values within the column group. For example, in our case (10,10) or (50,50) are in the same group and that is why their selectivity for OR predicate could be calculated based on the above CG2(SYS_STUE6IT0GYR7ZIWPS_N3EJJJND), because these column values are truly correlated. if you select other pairs like (10,50), in this case these column values are not correlated at all.

Now, how can we identify for any given (c1, c2) values their i and j are same, in other words they are truly correlated? The answer comes from histogram data of the CG2. So, if there any data for that given values in the histogram, then they are correlated, and we could check it as below:


SELECT   *
  FROM   user_tab_histograms
 WHERE       table_name = 'T1'
         AND column_name = 'SYS_STUE6IT0GYR7ZIWPS_N3EJJJND'  --> the new virtual column
         AND endpoint_actual_value = sys_op_combined_hash (c1[i], c2[j]);

Optimizer can use histogram data to identify that.

In summary, we have been able to provide the precomputed statistics using column groups to calculate OR predicate selectivity properly, which is possible to implement the same/similar approach to be done by optimizer automatically.

There are several other blogs posts I have written demonstrate how to use extended statistics to solve join cardinality problem when filter condition causes skew.

Join cardinality Estimation Methods – How histograms (Frequency, Top-Frequency, Hybrid and their combinations) are used to estimate Join Size, but the effect of the filter predicate was not investigated in this paper.

Join Skew Handling Subject to Filter Conditions – Gives an idea how to use extended statistics to solve join size estimation in case of skew. In this case it is required to create an extended statistics per filter conditions/values.

Join Skew Handling Subject to Filter Conditions2 – Gives an idea and demonstrates how to use extended statistics to solve the problem without creating CG for each filter conditions/values. So, this is the generalized solution.

Leave a Comment

May 23, 2020

Serial Hash Join Performance

Filed under: CBO,Join — Chinar Aliyev @ 12:42 pm

A while ago I was involved to look at the Hash Join (HJ) performance and its properties deeply to take some measurements. To demonstrate HJ performance characteristics and understand how things interact with each other, five tests have been given. In this blog post the following questions are trying to be addressed:

• How does Oracle DB partition the input relations (build, probe)? What factors does the number of partitions depend on? How the relations should be partitioned to improve performance of the HJ?
• How does Oracle calculate size of cluster(slot)? How much is it optimal?
• How can we identify the minimum memory requirements for one-pass execution based on the initial given parameters (num_rows, num_blocks, available memory, block size)?
• Why one-pass HJ execution is slower than multi-pass (in some cases)?, or why two one-pass join can give different response time? (we are talking about time for HASH JOIN operator only, not for time to acquire data from its inputs)

Annotations:

• T1, T2 are tables to be joined. |T1|, |T2|: size of the tables in blocks
• |M|: available memory size in blocks for HJ. (it is sub part of hash area like 0.9*HASH_AREA_SIZE, so after preserving some space for hash table overhead, bit vector, partition table, blocks for input/output etc.)
• NP: actual number of partitions identified by database
• |Pi|: Size of one partition in blocks
• C: cluster size in blocks
• HF1, HF2: hash functions

We are looking at the following simple query:

SELECT   *
  FROM   t1, t2
 WHERE   t1.id = t2.id;

And it is assumed that there is no skew on join columns. In fact, only skew of build table`s column influence the HJ performance, so it doesn’t matter there is skew on join column of probe table. Also, we consider |M|<|T1|< |T2|. It means One-pass or Multi-pass HJ join will be performed. In this case the T1 table is partitioned with using a hash function (HF1), the database should partition T1 table in a way that each partition should not exceeds available memory and should try to keep many partitions in the memory as much as possible.
The maximum possible number of partitions is |M| but, Oracle uses multiblock read to write and read operation efficiently. In a single I/O request, C (cluster size) number of blocks are read or written at times. Therefore, maximum number of partitions will be:

MAX(NP)=|M|/C (F.1)

and it is called fan-out [2]. As it is seen it depends on cluster size and we need to understand how cluster size and factual number of partitions are identified by the database before starting join itself. So, we have upper limit for NP, we also can identify lower limit for NP:

MIN(NP)=|T1|/|M| (F.2)

So, NP should be identified in a way that the following condition should be true.

|T1|/|M|< NP < |M|/C (F.3)

Although, the C is still unknown, according to (F.3) the database tries to identify optimal/possible NP, because we have upper (even C=1 can be considered) and lower limits based on estimated size of build table and available memory. As it is mentioned NP is always is power of two [1], so that:

|T1|/|M|< NP = power(2, x) < |M|/C => log(|T1|/|M|,2) < |M|/C (F.4)

The NP could only be identified based on (F.4) if we have enough memory. So, the following condition (based on F.2) should be true to be able to use the formula (F.4)

|T1|/|M|<|M|/C (F.5)

Otherwise it leads to multi-pass execution. For example, if |M|=16, C=1, |T1|=506 then |T1|/|M|=31>16 exceeds MAX(NP) which leads to Multi-pass execution. That is why from (F.5) we get:

|T1|C < M*M => M > SQRT(|T1|C) (F.6)

If the (F.6) condition does not meet, then database selects cluster size smaller like C=1 and small number of partitions with large sizes to perform the join.
Now it is time to talk about the optimal cluster size for the HJ. Until that time we have identified minimum memory for hash join to perform One-pass execution via (F.6) and been able to identify possible NP via (F.4) – even until that time C=1 could be considered.
The optimal cluster size should be identified as big as possible to meet the condition (F.4). It means:

C<|M|/log(|T1|/|M|,2) <=> C<|M|/NP (F.7)

The following section we will use these formulas to check the database`s decisions and optimal values

The Test cases
The environment and data:

OS: Red Hat EL 7.3 64bit
DB: Oracle DB Enterprise Edition 19.3.0.0
The tables:
create table t1
tablespace  users
as
select  1000+rownum as id
      ,trunc(dbms_random.value(0,1000)) c1
      , rpad('x', 10000) as f1
      , rpad('y', 10000) as f2
from dual
connect by
        level <= 3e4
;

create table t2
tablespace  users
as
select  10001+rownum as id
      , trunc(sysdate)-trunc(level/1000) dat
      , round(level/10,2) amount
      , rpad('a', 10000) as f1
      , rpad('b', 10000) as f2
from dual
connect by
        level <= 6e4
;

SELECT   segment_name, blocks, bytes / 1024 / 1024 mb
  FROM   user_segments
 WHERE   segment_name IN ('T1', 'T2');

SEGMENT     BLOCKS         MB
------- ---------- ----------
T1           60416        472
T2          120832        944

execute dbms_stats.gather_table_stats(null,'t1');
execute dbms_stats.gather_table_stats(null,'t2');

SELECT   *
  FROM   t1, t2
 WHERE   t1.id = t2.id;

Before execution of the all test cases, the database buffer cache and Linux cache had been flushed out.

The TEST1

We are setting PGA as 20M and look at the execution statistics.

alter system set pga_aggregate_target=20M;
alter session set tracefile_identifier='PGA_20M_CASE1';

alter session set events '10053 trace name context forever';
alter session set events '10104 trace name context forever';

SELECT   *
  FROM   t1, t2
 WHERE   t1.id = t2.id;

alter session set events '10104 trace name context off';
alter session set events '10053 trace name context off';

SQL execution stats:
SQL_ID  a8dmqqcbak2uu, child number 0
-------------------------------------
    SELECT   *       FROM   t1, t2      WHERE   t1.id = t2.id                                                                               

Plan hash value: 1838229974                                                                                                                 

--------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem | Used-Tmp|
--------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |  20999 |00:00:10.23 |     270K|    269K|  89418 |       |       |          |         |
|*  1 |  HASH JOIN         |      |      1 |  30000 |  20999 |00:00:10.23 |     270K|    269K|  89418 |   244M|    14M|   12M (1)|     710M|
|   2 |   TABLE ACCESS FULL| T1   |      1 |  30000 |  30000 |00:00:00.29 |   90004 |  60002 |      0 |       |       |          |         |
|   3 |   TABLE ACCESS FULL| T2   |      1 |  60000 |  60000 |00:00:00.51 |     180K|    120K|      0 |       |       |          |         |
--------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------                                                                                         

   1 - access("T1"."ID"="T2"."ID")

As it is seen, the SQL has been done with One-pass execution. Let’s look at the HJ trace file:

kxhfInit(HJ[1]): enter
kxhfInit(HJ[1]): exit
*** RowSrcId: 1 HASH JOIN STATISTICS (INITIALIZATION) ***
Join Type: INNER join
Original hash-area size: 7760702
Memory for slot table: 7569408
Calculated overhead for partitions and row/slot managers: 191294
Hash-join fanout: 128
Number of partitions: 128
Number of slots: 132
Multiblock IO: 7
Block size(KB): 8
Cluster (slot) size(KB): 56
Minimum number of bytes per block: 8160
Bit vector memory allocation(KB): 128
Per partition bit vector length(KB): 1
Maximum possible row length: 8071
Estimated build size (KB): 229
Estimated Build Row Length (includes overhead): 8031
# Immutable Flags:
  Evaluate Left Input Row Vector
  Evaluate Right Input Row Vector
# Mutable Flags:
  Not BUFFER(execution) output of the join for PQ
  IO sync
kxhfSetPhase(HJ[1]): phase=BUILD
kxhfAddChunk(HJ[1]): add chunk 0 (sz=256) to slot table
kxhfAddChunk(HJ[1]): chunk 0 (lbs=0x7f052684cb70, slotTab=0x7f052684cd90) added
kxhfWrite(HJ[1]): hash-join is spilling to disk

*** 2020-05-16T06:33:09.460663-04:00
kxhfSetPhase(HJ[1]): phase=PROBE_1
qerhnFetch(HJ[1]): max build row length (mbl=8023)
*** RowSrcId: 1 END OF BUILD (PHASE 1) ***
  Revised row length: 8023
  Revised build size: 235036KB

......
--the first pair of partitions
*** RowSrcId: 1 HASH JOIN GET FLUSHED PARTITIONS (PHASE 2) ***
Getting a pair of flushed partitions.
 BUILD PARTITION: nrows:193 size=(28 slots, 1568K)
 PROBE PARTITION: nrows:406 size=(58 slots, 3248K)
kxhfAddChunk(HJ[1]): add chunk 0 (sz=256) to slot table
kxhfAddChunk(HJ[1]): chunk 0 (lbs=0x7f6e33d37398, slotTab=0x7f6e33d375b8) added
*** RowSrcId: 1 HASH JOIN BUILD HASH TABLE (PHASE 2) ***
Number of blocks that may be used to build the hash table 196:
Number of rows left to be iterated over (start of function): 193
Number of rows iterated over this function call: 193
Number of rows left to be iterated over (end of function): 0
### Hash table ###

.........................
--the last pair of partitions
Getting a pair of flushed partitions.
 BUILD PARTITION: nrows:274 size=(39 slots, 2184K)
 PROBE PARTITION: nrows:491 size=(69 slots, 3864K)
*** RowSrcId: 1 HASH JOIN BUILD HASH TABLE (PHASE 2) ***
Number of blocks that may be used to build the hash table 273:
Number of rows left to be iterated over (start of function): 274
Number of rows iterated over this function call: 274
Number of rows left to be iterated over (end of function): 0
### Hash table ###

According to (F.4):

235036/ (7569408/1024)<NP < 7569408/1024/8/C  => 32 < NP< 924/C (upper limit is defined by size of C).

Although NP = 64 could have been enough to perform One-pass execution, the NP has been selected as 128. In general, large number of partitions allows DB to keep some of them in memory at the end of the build phase and helps to deal with skew problem efficiently. But in the test, it is not a such case.
In contrast, there is a problem with large number of partitions and small amount of memory, we will look at it on the next sections.

According to (F.7) cluster size will be:

C < (7760702/1024/8)/128 => C< 7.4

and optimal C will be 7. In older versions the number of clusters was controlled by _hash_multiblock_io_count parameter. Even if you set this parameter, it will not influence the join process by default (in case of CPU costing and automatic work area management is enabled)

The TEST2

In this case we have reduced PGA to 15M and performed same test.

SQL_ID  a8dmqqcbak2uu, child number 0
-------------------------------------
    SELECT   *       FROM   t1, t2      WHERE   t1.id = t2.id                                                                               

Plan hash value: 1838229974                                                                                                                 

--------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem | Used-Tmp|
--------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |  20999 |00:00:10.13 |     270K|    310K|  88914 |       |       |          |         |
|*  1 |  HASH JOIN         |      |      1 |  30000 |  20999 |00:00:10.13 |     270K|    310K|  88914 |   244M|    14M| 6576K (2)|     706M|
|   2 |   TABLE ACCESS FULL| T1   |      1 |  30000 |  30000 |00:00:00.28 |   90004 |  60002 |      0 |       |       |          |         |
|   3 |   TABLE ACCESS FULL| T2   |      1 |  60000 |  60000 |00:00:00.51 |     180K|    120K|      0 |       |       |          |         |
--------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------                                                                                         

   1 - access("T1"."ID"="T2"."ID")

HJ stats:

Join Type: INNER join
Original hash-area size: 4061596
Memory for slot table: 3899392
Calculated overhead for partitions and row/slot managers: 162204
Hash-join fanout: 64
Number of partitions: 64
Number of slots: 68
Multiblock IO: 7
Block size(KB): 8
Cluster (slot) size(KB): 56
Minimum number of bytes per block: 8160
Bit vector memory allocation(KB): 128
Per partition bit vector length(KB): 2
Maximum possible row length: 8071
Estimated build size (KB): 229
Estimated Build Row Length (includes overhead): 8031
# Immutable Flags:
  Evaluate Left Input Row Vector
  Evaluate Right Input Row Vector
# Mutable Flags:
  Not BUFFER(execution) output of the join for PQ
  IO sync
kxhfSetPhase(HJ[1]): phase=BUILD
kxhfAddChunk(HJ[1]): add chunk 0 (sz=128) to slot table
kxhfAddChunk(HJ[1]): chunk 0 (lbs=0x7fcdbe6a0908, slotTab=0x7fcdbe6a0b28) added
kxhfWrite(HJ[1]): hash-join is spilling to disk

*** 2020-05-16T07:31:50.750203-04:00
kxhfSetPhase(HJ[1]): phase=PROBE_1
qerhnFetch(HJ[1]): max build row length (mbl=8023)
*** RowSrcId: 1 END OF BUILD (PHASE 1) ***
  Revised row length: 8023
  Revised build size: 235036KB

According to (F.4) :

235036/(3899392/1024)< NP< (3899392/1024/8/C)=> 62< NP < 476/C.

So, NP was selected as 64. Also, from (f.7) :

C < (3899392/1024/8/64) = 7.4

C=7 was selected.

As you noticed that the SQL was executed in two-pass mode, but it was supposed to be executed in one-pass mode. Because the condition (F.6) is ture, so that:

M=3899392 < SQRT(235036*8*7)= 3628KB= 3715072!

Why did that happen? We have had enough memory (68 slots) to store each of 64 partitions.
There are the following correlated reasons to lead multi-pass join:

1) The rows between partitions were not evenly distributed

*** RowSrcId: 1 HASH JOIN BUILD HASH TABLE (PHASE 1) ***
Total number of partitions: 64
Number of partitions left in memory: 0
Total number of rows in in-memory partitions: 0
   (used as preliminary number of buckets in hash table)
Estimated max # of build rows that can fit in avail memory: 2716
### Partition Distribution ###
Partition:0    rows:486        clusters:0      slots:0      kept=0
Partition:1    rows:488        clusters:0      slots:0      kept=0
Partition:2    rows:490        clusters:0      slots:0      kept=0
Partition:3    rows:504        clusters:0      slots:0      kept=0
Partition:4    rows:468        clusters:0      slots:0      kept=0
Partition:5    rows:465        clusters:0      slots:0      kept=0
Partition:6    rows:447        clusters:0      slots:0      kept=0
Partition:7    rows:455        clusters:0      slots:0      kept=0
Partition:8    rows:437        clusters:0      slots:0      kept=0
Partition:9    rows:484        clusters:0      slots:0      kept=0
Partition:10   rows:461        clusters:0      slots:0      kept=0
Partition:11   rows:456        clusters:0      slots:0      kept=0
Partition:12   rows:483        clusters:0      slots:0      kept=0
Partition:13   rows:485        clusters:0      slots:0      kept=0
Partition:14   rows:442        clusters:0      slots:0      kept=0
Partition:15   rows:479        clusters:0      slots:0      kept=0
Partition:16   rows:507        clusters:0      slots:0      kept=0
Partition:17   rows:460        clusters:0      slots:0      kept=0
Partition:18   rows:494        clusters:0      slots:0      kept=0
Partition:19   rows:467        clusters:0      slots:0      kept=0
Partition:20   rows:463        clusters:0      slots:0      kept=0
Partition:21   rows:508        clusters:0      slots:0      kept=0
Partition:22   rows:493        clusters:0      slots:0      kept=0
Partition:23   rows:448        clusters:0      slots:0      kept=0
Partition:24   rows:482        clusters:0      slots:0      kept=0
Partition:25   rows:447        clusters:0      slots:0      kept=0
Partition:26   rows:467        clusters:0      slots:0      kept=0
Partition:27   rows:503        clusters:0      slots:0      kept=0
Partition:28   rows:419        clusters:0      slots:0      kept=0
Partition:29   rows:476        clusters:0      slots:0      kept=0
Partition:30   rows:490        clusters:0      slots:0      kept=0
Partition:31   rows:437        clusters:0      slots:0      kept=0
Partition:32   rows:434        clusters:0      slots:0      kept=0
Partition:33   rows:448        clusters:0      slots:0      kept=0
Partition:34   rows:462        clusters:0      slots:0      kept=0
Partition:35   rows:470        clusters:0      slots:0      kept=0
Partition:36   rows:486        clusters:0      slots:0      kept=0
Partition:37   rows:451        clusters:0      slots:0      kept=0
Partition:38   rows:457        clusters:0      slots:0      kept=0
Partition:39   rows:476        clusters:0      slots:0      kept=0
Partition:40   rows:508        clusters:0      slots:0      kept=0
Partition:41   rows:454        clusters:0      slots:0      kept=0
Partition:42   rows:460        clusters:0      slots:0      kept=0
Partition:43   rows:529        clusters:0      slots:0      kept=0
Partition:44   rows:474        clusters:0      slots:0      kept=0
Partition:45   rows:465        clusters:0      slots:0      kept=0
Partition:46   rows:486        clusters:0      slots:0      kept=0
Partition:47   rows:423        clusters:0      slots:0      kept=0
Partition:48   rows:467        clusters:0      slots:0      kept=0
Partition:49   rows:454        clusters:0      slots:0      kept=0
Partition:50   rows:459        clusters:0      slots:0      kept=0
Partition:51   rows:474        clusters:0      slots:0      kept=0
Partition:52   rows:446        clusters:0      slots:0      kept=0
Partition:53   rows:429        clusters:0      slots:0      kept=0
Partition:54   rows:478        clusters:0      slots:0      kept=0
Partition:55   rows:479        clusters:0      slots:0      kept=0
Partition:56   rows:476        clusters:0      slots:0      kept=0
Partition:57   rows:447        clusters:0      slots:0      kept=0
Partition:58   rows:440        clusters:0      slots:0      kept=0
Partition:59   rows:486        clusters:0      slots:0      kept=0
Partition:60   rows:501        clusters:0      slots:0      kept=0
Partition:61   rows:447        clusters:0      slots:0      kept=0
Partition:62   rows:464        clusters:0      slots:0      kept=0
Partition:63   rows:479        clusters:0      slots:0      kept=0

The smaller partition has 419 rows (Partition:28) and the largest has 529 rows (Partition:43). There is an enough gap between the partitions. Despite fact that there is no skew on join column, but we have skew after partitioned the rows. It is due to hash collision of the function which has partitioned the table rows. It indicates that it is not easy to attain the perfect hash function to distribute rows evenly. This is one big problem in HJ processing. Now think about if your data (join column) is skewed or build table is an intermediate result set or join skew take place then how would be the result? To deal with such cases optimizer should produce more partitions (lots of) as it has been mentioned above.

2) Multi-Pass (2 pass) join was not used for all partitions, so that 13 out of 64 partitions were joined with one-pass, for example:

*** RowSrcId: 1 HASH JOIN GET FLUSHED PARTITIONS (PHASE 2) ***
Getting a pair of flushed partitions.
 BUILD PARTITION: nrows:419 size=(59 slots, 3304K)
 PROBE PARTITION: nrows:870 size=(123 slots, 6888K)
kxhfAddChunk(HJ[1]): add chunk 0 (sz=128) to slot table
kxhfAddChunk(HJ[1]): chunk 0 (lbs=0x7fcdbdea11c8, slotTab=0x7fcdbdea13e8) added
*** RowSrcId: 1 HASH JOIN BUILD HASH TABLE (PHASE 2) ***
Number of blocks that may be used to build the hash table 413:
Number of rows left to be iterated over (start of function): 419
Number of rows iterated over this function call: 419
Number of rows left to be iterated over (end of function): 0

kxhfResetIter(HJ[1]): 0x7fcdbe5eac70
qerhnFetchPhase2(HJ[1]): building a hash table
kxhfResize(HJ[1]): enter, resize to 65 slots (numAlloc=62, max=64)
kxhfResize(HJ[1]): exit, resized to 65 slots (numAlloc=62, max=65)
*** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
Reusing hash-table. Number of buckets: 256
*** RowSrcId: 1 HASH JOIN GET FLUSHED PARTITIONS (PHASE 2) ***
Getting a pair of flushed partitions.
 BUILD PARTITION: nrows:423 size=(60 slots, 3360K)
 PROBE PARTITION: nrows:887 size=(125 slots, 7000K)
*** RowSrcId: 1 HASH JOIN BUILD HASH TABLE (PHASE 2) ***
Number of blocks that may be used to build the hash table 420:
Number of rows left to be iterated over (start of function): 423
Number of rows iterated over this function call: 423
Number of rows left to be iterated over (end of function): 0

The first two-pass join was:

*** RowSrcId: 1 HASH JOIN GET FLUSHED PARTITIONS (PHASE 2) ***
Getting a pair of flushed partitions.
 BUILD PARTITION: nrows:459 size=(65 slots, 3640K)
 PROBE PARTITION: nrows:941 size=(133 slots, 7448K)
*** RowSrcId: 1 HASH JOIN BUILD HASH TABLE (PHASE 2) ***
Number of blocks that may be used to build the hash table 448:
Number of rows left to be iterated over (start of function): 459
Number of rows iterated over this function call: 456
Number of rows left to be iterated over (end of function): 3

And the last two-pass join:

*** RowSrcId: 1 HASH JOIN GET FLUSHED PARTITIONS (PHASE 2) ***
Getting a pair of flushed partitions.
 BUILD PARTITION: nrows:529 size=(75 slots, 4200K)
 PROBE PARTITION: nrows:939 size=(132 slots, 7392K)
*** RowSrcId: 1 HASH JOIN BUILD HASH TABLE (PHASE 2) ***
Number of blocks that may be used to build the hash table 448:
Number of rows left to be iterated over (start of function): 529
Number of rows iterated over this function call: 456
Number of rows left to be iterated over (end of function): 73

As you noted that even only the few rows (73) from the biggest partition were not be able to be included to the first iteration. The reason is that Oracle considered 68 slots for hash table but, in fact it used only 64 slots (448 blocks) for hash table. Even 68 slots would not be enough due to 75 slots should have been required to perform one-pass join. It indicates that the memory calculated based on F.6 should be increased a little bit to provide one-pass join.

3) It is not directly related to that case; the point is optimizer uses average row length for estimating build table. Different rows in different partitions can have different row sizes. If some of partitions can have more rows than other and their row lengths are greater than average, then these partitions will not fit the memory (calculated by optimizer).
Note: Although, only little memory needs to be increased for one-pass execution, but it is not simple to get it from auto PGA mode. For example, with PGA target = 16M,17M,18M and 19M, the size of HASH_AREA_SIZE was changed only for PGA=19M significantly, set to 7760702, the size of hash table (memory slot) was 7569408. For 16M,17M,18M HASH_AREA_SIZE was constant as 4061596 and hash table size was 3899392.

The TEST3

In this case, we are performing the same test but with manual work area mode. According the TEST1, the database used HASH_AREA_SIZE as 7760702 and the same amount of memory is going to be used for that case also.

alter session set tracefile_identifier='HA_1';

alter session set hash_area_size=7760702;

alter session set workarea_size_policy=manual;

alter session set events '10053 trace name context forever';
alter session set events '10104 trace name context forever';

SELECT   *
  FROM   t1, t2
 WHERE   t1.id = t2.id;

SQL_ID  a8dmqqcbak2uu, child number 0
-------------------------------------
    SELECT   *       FROM   t1, t2      WHERE   t1.id = t2.id                                                                               

Plan hash value: 1838229974                                                                                                                 

--------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem | Used-Tmp|
--------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |  20999 |00:00:05.44 |     270K|    632K|  89110 |       |       |          |         |
|*  1 |  HASH JOIN         |      |      1 |  30000 |  20999 |00:00:05.44 |     270K|    632K|  89110 |   244M|    14M| 7839K (8)|    1330M|
|   2 |   TABLE ACCESS FULL| T1   |      1 |  30000 |  30000 |00:00:00.26 |   90004 |  60002 |      0 |       |       |          |         |
|   3 |   TABLE ACCESS FULL| T2   |      1 |  60000 |  60000 |00:00:00.51 |     180K|    120K|      0 |       |       |          |         |
--------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------                                                                                         

   1 - access("T1"."ID"="T2"."ID")                                                                                                          

HJ statistics:
kxhfInit(HJ[1]): enter
kxhfInit(HJ[1]): exit
*** RowSrcId: 1 HASH JOIN STATISTICS (INITIALIZATION) ***
Join Type: INNER join
Original hash-area size: 7429892
Memory for slot table: 7135232
Calculated overhead for partitions and row/slot managers: 294660
Hash-join fanout: 8
Number of partitions: 8
Number of slots: 13
Multiblock IO: 67
Block size(KB): 8
Cluster (slot) size(KB): 536
Minimum number of bytes per block: 8160
Bit vector memory allocation(KB): 256
Per partition bit vector length(KB): 32
Maximum possible row length: 8071
Estimated build size (KB): 229
Estimated Build Row Length (includes overhead): 8031
# Immutable Flags:
  Evaluate Left Input Row Vector
  Evaluate Right Input Row Vector
# Mutable Flags:
  Not BUFFER(execution) output of the join for PQ
  IO sync
kxhfSetPhase(HJ[1]): phase=BUILD
kxhfAddChunk(HJ[1]): add chunk 0 (sz=13) to slot table
kxhfAddChunk(HJ[1]): chunk 0 (lbs=0x7fed50509270, slotTab=0x7fed50509138) added
kxhfWrite(HJ[1]): hash-join is spilling to disk

*** 2020-05-18T08:34:56.297588-04:00
kxhfSetPhase(HJ[1]): phase=PROBE_1
qerhnFetch(HJ[1]): max build row length (mbl=8023)
*** RowSrcId: 1 END OF BUILD (PHASE 1) ***
  Revised row length: 8023
  Revised build size: 235036KB

*** RowSrcId: 1 HASH JOIN GET FLUSHED PARTITIONS (PHASE 2) ***
Getting a pair of flushed partitions.
 BUILD PARTITION: nrows:3727 size=(55 slots, 29480K)
 PROBE PARTITION: nrows:7519 size=(111 slots, 59496K)
*** RowSrcId: 1 HASH JOIN BUILD HASH TABLE (PHASE 2) ***
Number of blocks that may be used to build the hash table 536:
Number of rows left to be iterated over (start of function): 3727
Number of rows iterated over this function call: 545
Number of rows left to be iterated over (end of function): 3182
### Hash table ###

.........

kxhfResetIter(HJ[1]): 0x7fed507e8fd0
qerhnFetchPhase2(HJ[1]): building a hash table
*** RowSrcId: 1 HASH JOIN GET FLUSHED PARTITIONS (PHASE 2) ***
Getting a pair of flushed partitions.
 BUILD PARTITION: nrows:3840 size=(57 slots, 30552K)
 PROBE PARTITION: nrows:7639 size=(113 slots, 60568K)
*** RowSrcId: 1 HASH JOIN BUILD HASH TABLE (PHASE 2) ***
Number of blocks that may be used to build the hash table 536:
Number of rows left to be iterated over (start of function): 3840
Number of rows iterated over this function call: 545
Number of rows left to be iterated over (end of function): 3295

The performance of the TEST3 is much better than TEST1. Although the TEST3 was done in multi-pass (8) mode, it gave us ~2 times better response time. As execution plans tell us the TEST3 did lots of reads (due to 8 passes for probe table) than TEST1 (write requests is comparable) but time spent is very less. Of course, cluster size influenced to get that result. In TEST1 small cluster size and large number of partitions were used, but in TEST3 large cluster size but small number of partitions were used, that is why response time reduced.
Before continuing thinking lets have a quick look at the SQL trace files:

For TEST1:

SELECT   *
      FROM   t1, t2
     WHERE   t1.id = t2.id

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch     1401      2.53      10.26     269423     270009          0       20999
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total     1403      2.53      10.27     269423     270009          0       20999

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 106
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
     20999      20999      20999  HASH JOIN  (cr=270065 pr=269434 pw=89418 time=10231586 us starts=1 cost=81831 size=480930000 card=30000)
     30000      30000      30000   TABLE ACCESS FULL T1 (cr=90004 pr=60002 pw=0 time=291160 us starts=1 cost=10474 size=240330000 card=30000)
     60000      60000      60000   TABLE ACCESS FULL T2 (cr=180005 pr=120003 pw=0 time=510333 us starts=1 cost=20929 size=481200000 card=60000)

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                    1401        0.00          0.00
  db file sequential read                         5        0.00          0.00
  PGA memory operation                           72        0.00          0.00
  direct path read                             1428        0.00          0.36
  direct path write temp                      12774        0.01          8.41
  direct path read temp                       12774        0.00          0.11
  SQL*Net more data to client                     4        0.00          0.00
  SQL*Net message from client                  1401        0.00          0.18
********************************************************************************

For TEST3:

    SELECT   *
      FROM   t1, t2
     WHERE   t1.id = t2.id

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch     1401      2.50       5.49     632858     270009          0       20999
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total     1403      2.50       5.49     632858     270009          0       20999

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 106
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
     20999      20999      20999  HASH JOIN  (cr=270009 pr=632858 pw=89110 time=5443216 us starts=1 cost=75687 size=480930000 card=30000)
     30000      30000      30000   TABLE ACCESS FULL T1 (cr=90004 pr=60002 pw=0 time=264528 us starts=1 cost=10474 size=240330000 card=30000)
     60000      60000      60000   TABLE ACCESS FULL T2 (cr=180005 pr=120003 pw=0 time=514099 us starts=1 cost=20929 size=481200000 card=60000)

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                    1401        0.00          0.00
  Disk file operations I/O                        2        0.00          0.00
  db file sequential read                         5        0.00          0.00
  PGA memory operation                           17        0.00          0.00
  direct path read                             1428        0.00          0.41
  direct path write temp                       1330        0.02          3.61
  direct path read temp                        6759        0.00          0.36
  SQL*Net more data to client                     4        0.00          0.00
  SQL*Net message from client                  1401        0.00          0.18
********************************************************************************

As you see, problem is not related to (re)reading of the data (rereading probe table), in contrast it is related to writing corresponding partition data to their partitions on the temporary tablespace.
From the SQL trace files it is easily could be seen how the process was done (how the events appeared in sequence)

For TEST1:

-- writing build partitions to the disk in the build phase
WAIT #140201128746576: nam='PGA memory operation' ela= 7 p1=262144 p2=1 p3=0 obj#=79274 tim=105618054693
kxhfWrite(HJ[1]): hash-join is spilling to disk
WAIT #140201128746576: nam='direct path write temp' ela= 4189 file number=201 first dba=65408 block cnt=7 obj#=79274 tim=105618059045
WAIT #140201128746576: nam='direct path write temp' ela= 605 file number=201 first dba=65422 block cnt=7 obj#=79274 tim=105618059915
WAIT #140201128746576: nam='direct path write temp' ela= 410 file number=201 first dba=65436 block cnt=7 obj#=79274 tim=105618060373
WAIT #140201128746576: nam='direct path read' ela= 118 file number=7 first dba=1410 block cnt=126 obj#=79274 tim=105618060605
WAIT #140201128746576: nam='direct path write temp' ela= 433 file number=201 first dba=65450 block cnt=7 obj#=79274 tim=105618061192
WAIT #140201128746576: nam='direct path write temp' ela= 432 file number=201 first dba=65464 block cnt=7 obj#=79274 tim=105618061697
WAIT #140201128746576: nam='direct path write temp' ela= 421 file number=201 first dba=65478 block cnt=7 obj#=79274 tim=105618062212
WAIT #140201128746576: nam='direct path write temp' ela= 416 file number=201 first dba=65492 block cnt=7 obj#=79274 tim=105618062749
WAIT #140201128746576: nam='direct path write temp' ela= 428 file number=201 first dba=65506 block cnt=7 obj#=79274 tim=105618063245
WAIT #140201128746576: nam='direct path write temp' ela= 417 file number=201 first dba=65520 block cnt=7 obj#=79274 tim=105618063806
WAIT #140201128746576: nam='direct path write temp' ela= 760 file number=201 first dba=68096 block cnt=7 obj#=79274 tim=105618064613
WAIT #140201128746576: nam='direct path write temp' ela= 439 file number=201 first dba=68110 block cnt=7 obj#=79274 tim=105618065146
WAIT #140201128746576: nam='direct path read' ela= 424 file number=7 first dba=1538 block cnt=126 obj#=79274 tim=105618065604
WAIT #140201128746576: nam='direct path write temp' ela= 3772 file number=201 first dba=65429 block cnt=7 obj#=79274 tim=105618069473
WAIT #140201128746576: nam='direct path write temp' ela= 416 file number=201 first dba=68131 block cnt=7 obj#=79274 tim=105618070027
WAIT #140201128746576: nam='direct path write temp' ela= 458 file number=201 first dba=68145 block cnt=7 obj#=79274 tim=105618070523
WAIT #140201128746576: nam='direct path write temp' ela= 444 file number=201 first dba=68159 block cnt=7 obj#=79274 tim=105618071097
WAIT #140201128746576: nam='direct path write temp' ela= 787 file number=201 first dba=68173 block cnt=7 obj#=79274 tim=105618071946
WAIT #140201128746576: nam='direct path write temp' ela= 503 file number=201 first dba=68187 block cnt=7 obj#=79274 tim=105618072529
WAIT #140201128746576: nam='direct path write temp' ela= 435 file number=201 first dba=68201 block cnt=7 obj#=79274 tim=105618073058
WAIT #140201128746576: nam='direct path read' ela= 120 file number=7 first dba=1666 block cnt=126 obj#=79274 tim=105618073258
.....

-- writing probe partitions to the disk in the parttioning phase
qerhnProbeChooseRowP (rwsid=1): inmem=0,mm=0,nf=0
WAIT #140201128746576: nam='direct path read' ela= 15 file number=7 first dba=60577 block cnt=15 obj#=79276 tim=105621371741
WAIT #140201128746576: nam='direct path read' ela= 14 file number=7 first dba=60593 block cnt=15 obj#=79276 tim=105621371820
WAIT #140201128746576: nam='PGA memory operation' ela= 5 p1=262144 p2=1 p3=0 obj#=79276 tim=105621371877
WAIT #140201128746576: nam='direct path read' ela= 14 file number=7 first dba=60609 block cnt=15 obj#=79276 tim=105621371939
WAIT #140201128746576: nam='PGA memory operation' ela= 3 p1=262144 p2=1 p3=0 obj#=79276 tim=105621371957
WAIT #140201128746576: nam='PGA memory operation' ela= 4 p1=262144 p2=1 p3=0 obj#=79276 tim=105621372055
WAIT #140201128746576: nam='direct path read' ela= 15 file number=7 first dba=60625 block cnt=15 obj#=79276 tim=105621372095
WAIT #140201128746576: nam='PGA memory operation' ela= 3 p1=262144 p2=1 p3=0 obj#=79276 tim=105621372144
WAIT #140201128746576: nam='PGA memory operation' ela= 2 p1=262144 p2=1 p3=0 obj#=79276 tim=105621372197
WAIT #140201128746576: nam='direct path read' ela= 14 file number=7 first dba=60641 block cnt=15 obj#=79276 tim=105621372229
WAIT #140201128746576: nam='PGA memory operation' ela= 3 p1=262144 p2=1 p3=0 obj#=79276 tim=105621372290
WAIT #140201128746576: nam='direct path read' ela= 14 file number=7 first dba=60657 block cnt=15 obj#=79276 tim=105621372328
WAIT #140201128746576: nam='PGA memory operation' ela= 3 p1=262144 p2=1 p3=0 obj#=79276 tim=105621372401
WAIT #140201128746576: nam='direct path read' ela= 1996 file number=7 first dba=60674 block cnt=126 obj#=79276 tim=105621374414
WAIT #140201128746576: nam='PGA memory operation' ela= 5 p1=262144 p2=1 p3=0 obj#=79276 tim=105621374513
WAIT #140201128746576: nam='direct path read' ela= 789 file number=7 first dba=60802 block cnt=126 obj#=79276 tim=105621375334
WAIT #140201128746576: nam='PGA memory operation' ela= 5 p1=262144 p2=1 p3=0 obj#=79276 tim=105621375421
WAIT #140201128746576: nam='PGA memory operation' ela= 3 p1=262144 p2=1 p3=0 obj#=79276 tim=105621375501
WAIT #140201128746576: nam='PGA memory operation' ela= 1 p1=262144 p2=1 p3=0 obj#=79276 tim=105621375573
WAIT #140201128746576: nam='PGA memory operation' ela= 2 p1=262144 p2=1 p3=0 obj#=79276 tim=105621375649
WAIT #140201128746576: nam='direct path read' ela= 402 file number=7 first dba=60930 block cnt=126 obj#=79276 tim=105621376707
WAIT #140201128746576: nam='direct path read' ela= 85 file number=7 first dba=61058 block cnt=126 obj#=79276 tim=105621377463
WAIT #140201128746576: nam='direct path read' ela= 922 file number=7 first dba=61186 block cnt=126 obj#=79276 tim=105621378868
WAIT #140201128746576: nam='direct path read' ela= 82 file number=7 first dba=61314 block cnt=126 obj#=79276 tim=105621379505
WAIT #140201128746576: nam='direct path read' ela= 857 file number=7 first dba=61442 block cnt=126 obj#=79276 tim=105621380924
WAIT #140201128746576: nam='direct path write temp' ela= 3811 file number=201 first dba=100571 block cnt=7 obj#=79276 tim=105621385152
WAIT #140201128746576: nam='direct path write temp' ela= 439 file number=201 first dba=100585 block cnt=7 obj#=79276 tim=105621385712
WAIT #140201128746576: nam='direct path read' ela= 109 file number=7 first dba=61570 block cnt=126 obj#=79276 tim=105621385865
WAIT #140201128746576: nam='direct path write temp' ela= 741 file number=201 first dba=100599 block cnt=7 obj#=79276 tim=105621386954
WAIT #140201128746576: nam='direct path write temp' ela= 403 file number=201 first dba=104199 block cnt=7 obj#=79276 tim=105621387473
WAIT #140201128746576: nam='direct path read' ela= 365 file number=7 first dba=61698 block cnt=126 obj#=79276 tim=105621387866
WAIT #140201128746576: nam='direct path write temp' ela= 2993 file number=201 first dba=104213 block cnt=7 obj#=79276 tim=105621390923
WAIT #140201128746576: nam='direct path write temp' ela= 396 file number=201 first dba=104227 block cnt=7 obj#=79276 tim=105621391431
WAIT #140201128746576: nam='direct path write temp' ela= 428 file number=201 first dba=104241 block cnt=7 obj#=79276 tim=105621391938
WAIT #140201128746576: nam='direct path write temp' ela= 385 file number=201 first dba=104255 block cnt=7 obj#=79276 tim=105621392361
WAIT #140201128746576: nam='direct path write temp' ela= 434 file number=201 first dba=104269 block cnt=7 obj#=79276 tim=105621392869
WAIT #140201128746576: nam='direct path write temp' ela= 836 file number=201 first dba=104283 block cnt=7 obj#=79276 tim=105621393742
WAIT #140201128746576: nam='direct path write temp' ela= 396 file number=201 first dba=104297 block cnt=7 obj#=79276 tim=105621394223
WAIT #140201128746576: nam='direct path write temp' ela= 396 file number=201 first dba=104311 block cnt=7 obj#=79276 tim=105621394727
WAIT #140201128746576: nam='direct path write temp' ela= 386 file number=201 first dba=107783 block cnt=7 obj#=79276 tim=105621395223
WAIT #140201128746576: nam='direct path read' ela= 103 file number=7 first dba=61826 block cnt=126 obj#=79276 tim=105621395391

For TEST3

-- writing build partitions to the disk in the build phase
kxhfWrite(HJ[1]): hash-join is spilling to disk
WAIT #139718328414072: nam='Disk file operations I/O' ela= 17 FileOperation=2 fileno=201 filetype=6 obj#=79274 tim=107049866623
WAIT #139718328414072: nam='direct path write temp' ela= 3020 file number=201 first dba=52992 block cnt=67 obj#=79274 tim=107049869662
WAIT #139718328414072: nam='direct path read' ela= 130 file number=7 first dba=1410 block cnt=126 obj#=79274 tim=107049870022
WAIT #139718328414072: nam='direct path write temp' ela= 1596 file number=201 first dba=60288 block cnt=67 obj#=79274 tim=107049871744
WAIT #139718328414072: nam='direct path write temp' ela= 1440 file number=201 first dba=115968 block cnt=67 obj#=79274 tim=107049873676
WAIT #139718328414072: nam='direct path read' ela= 401 file number=7 first dba=1538 block cnt=126 obj#=79274 tim=107049874260
WAIT #139718328414072: nam='direct path write temp' ela= 3495 file number=201 first dba=46464 block cnt=67 obj#=79274 tim=107049878141
WAIT #139718328414072: nam='direct path read' ela= 117 file number=7 first dba=1666 block cnt=126 obj#=79274 tim=107049878477
WAIT #139718328414072: nam='direct path read' ela= 395 file number=7 first dba=1794 block cnt=126 obj#=79274 tim=107049879366
WAIT #139718328414072: nam='direct path read' ela= 101 file number=7 first dba=1922 block cnt=126 obj#=79274 tim=107049879897
WAIT #139718328414072: nam='direct path read' ela= 608 file number=7 first dba=2050 block cnt=126 obj#=79274 tim=107049880970
WAIT #139718328414072: nam='direct path read' ela= 105 file number=7 first dba=2178 block cnt=126 obj#=79274 tim=107049881641
WAIT #139718328414072: nam='direct path write temp' ela= 3701 file number=201 first dba=50560 block cnt=67 obj#=79274 tim=107049885625
WAIT #139718328414072: nam='direct path read' ela= 430 file number=7 first dba=2306 block cnt=126 obj#=79274 tim=107049886431
WAIT #139718328414072: nam='direct path write temp' ela= 4135 file number=201 first dba=56192 block cnt=67 obj#=79274 tim=107049890712
WAIT #139718328414072: nam='direct path write temp' ela= 1642 file number=201 first dba=302464 block cnt=67 obj#=79274 tim=107049892638
WAIT #139718328414072: nam='direct path read' ela= 113 file number=7 first dba=2434 block cnt=126 obj#=79274 tim=107049893046
WAIT #139718328414072: nam='direct path write temp' ela= 1705 file number=201 first dba=345472 block cnt=67 obj#=79274 tim=107049894940
WAIT #139718328414072: nam='direct path write temp' ela= 2164 file number=201 first dba=347520 block cnt=67 obj#=79274 tim=107049897494
WAIT #139718328414072: nam='direct path write temp' ela= 1802 file number=201 first dba=48640 block cnt=67 obj#=79274 tim=107049899455
WAIT #139718328414072: nam='direct path read' ela= 382 file number=7 first dba=2562 block cnt=126 obj#=79274 tim=107049899940
WAIT #139718328414072: nam='direct path write temp' ela= 3674 file number=201 first dba=130688 block cnt=67 obj#=79274 tim=107049903779
WAIT #139718328414072: nam='direct path read' ela= 132 file number=7 first dba=2690 block cnt=126 obj#=79274 tim=107049904267
WAIT #139718328414072: nam='direct path read' ela= 393 file number=7 first dba=2818 block cnt=126 obj#=79274 tim=107049905009
WAIT #139718328414072: nam='direct path read' ela= 102 file number=7 first dba=2946 block cnt=126 obj#=79274 tim=107049905460
WAIT #139718328414072: nam='direct path write temp' ela= 3827 file number=201 first dba=102272 block cnt=67 obj#=79274 tim=107049909502
WAIT #139718328414072: nam='direct path read' ela= 423 file number=7 first dba=3074 block cnt=126 obj#=79274 tim=107049910232
WAIT #139718328414072: nam='direct path read' ela= 104 file number=7 first dba=3202 block cnt=126 obj#=79274 tim=107049910697
WAIT #139718328414072: nam='direct path write temp' ela= 3154 file number=201 first dba=80000 block cnt=67 obj#=79274 tim=107049913967
WAIT #139718328414072: nam='direct path read' ela= 367 file number=7 first dba=3330 block cnt=126 obj#=79274 tim=107049914727
WAIT #139718328414072: nam='direct path write temp' ela= 4121 file number=201 first dba=128640 block cnt=67 obj#=79274 tim=107049919272
WAIT #139718328414072: nam='direct path read' ela= 134 file number=7 first dba=3458 block cnt=126 obj#=79274 tim=107049919497
...

-- writing probe partitions to the disk in the parttioning phase
qerhnProbeChooseRowP (rwsid=1): inmem=0,mm=0,nf=0
WAIT #139718328414072: nam='direct path read' ela= 12 file number=7 first dba=60577 block cnt=15 obj#=79276 tim=107051448147
WAIT #139718328414072: nam='direct path read' ela= 13 file number=7 first dba=60593 block cnt=15 obj#=79276 tim=107051448269
WAIT #139718328414072: nam='direct path read' ela= 12 file number=7 first dba=60609 block cnt=15 obj#=79276 tim=107051448400
WAIT #139718328414072: nam='direct path read' ela= 12 file number=7 first dba=60625 block cnt=15 obj#=79276 tim=107051448457
WAIT #139718328414072: nam='direct path read' ela= 11 file number=7 first dba=60641 block cnt=15 obj#=79276 tim=107051448523
WAIT #139718328414072: nam='direct path read' ela= 12 file number=7 first dba=60657 block cnt=15 obj#=79276 tim=107051448581
WAIT #139718328414072: nam='direct path read' ela= 1942 file number=7 first dba=60674 block cnt=126 obj#=79276 tim=107051450573
WAIT #139718328414072: nam='direct path read' ela= 742 file number=7 first dba=60802 block cnt=126 obj#=79276 tim=107051451384
WAIT #139718328414072: nam='direct path read' ela= 1449 file number=7 first dba=60930 block cnt=126 obj#=79276 tim=107051453208
WAIT #139718328414072: nam='direct path read' ela= 115 file number=7 first dba=61058 block cnt=126 obj#=79276 tim=107051453698
WAIT #139718328414072: nam='direct path read' ela= 1077 file number=7 first dba=61186 block cnt=126 obj#=79276 tim=107051455171
WAIT #139718328414072: nam='direct path read' ela= 306 file number=7 first dba=61314 block cnt=126 obj#=79276 tim=107051455845
WAIT #139718328414072: nam='direct path read' ela= 992 file number=7 first dba=61442 block cnt=126 obj#=79276 tim=107051457246
WAIT #139718328414072: nam='direct path write temp' ela= 5045 file number=201 first dba=198144 block cnt=67 obj#=79276 tim=107051462570
WAIT #139718328414072: nam='direct path read' ela= 134 file number=7 first dba=61570 block cnt=126 obj#=79276 tim=107051462975
WAIT #139718328414072: nam='direct path write temp' ela= 1591 file number=201 first dba=193664 block cnt=67 obj#=79276 tim=107051465002
WAIT #139718328414072: nam='direct path read' ela= 356 file number=7 first dba=61698 block cnt=126 obj#=79276 tim=107051465458
WAIT #139718328414072: nam='direct path write temp' ela= 3392 file number=201 first dba=189568 block cnt=67 obj#=79276 tim=107051469260
WAIT #139718328414072: nam='direct path read' ela= 111 file number=7 first dba=61826 block cnt=126 obj#=79276 tim=107051469503
WAIT #139718328414072: nam='direct path write temp' ela= 1699 file number=201 first dba=185600 block cnt=67 obj#=79276 tim=107051471362
WAIT #139718328414072: nam='direct path write temp' ela= 1534 file number=201 first dba=180992 block cnt=67 obj#=79276 tim=107051473045
WAIT #139718328414072: nam='direct path write temp' ela= 1866 file number=201 first dba=177280 block cnt=67 obj#=79276 tim=107051475071
WAIT #139718328414072: nam='direct path write temp' ela= 1486 file number=201 first dba=173824 block cnt=67 obj#=79276 tim=107051476803
WAIT #139718328414072: nam='direct path read' ela= 349 file number=7 first dba=61954 block cnt=126 obj#=79276 tim=107051477330
WAIT #139718328414072: nam='direct path write temp' ela= 3576 file number=201 first dba=169856 block cnt=67 obj#=79276 tim=107051481211
WAIT #139718328414072: nam='direct path read' ela= 108 file number=7 first dba=62082 block cnt=126 obj#=79276 tim=107051481564
WAIT #139718328414072: nam='direct path read' ela= 371 file number=7 first dba=62210 block cnt=126 obj#=79276 tim=107051482291
WAIT #139718328414072: nam='direct path read' ela= 95 file number=7 first dba=62338 block cnt=126 obj#=79276 tim=107051482745
WAIT #139718328414072: nam='direct path read' ela= 650 file number=7 first dba=62466 block cnt=126 obj#=79276 tim=107051483723
WAIT #139718328414072: nam='direct path read' ela= 254 file number=7 first dba=62594 block cnt=126 obj#=79276 tim=107051484330
WAIT #139718328414072: nam='direct path write temp' ela= 5174 file number=201 first dba=191616 block cnt=67 obj#=79276 tim=107051489630
WAIT #139718328414072: nam='direct path write temp' ela= 1524 file number=201 first dba=196224 block cnt=67 obj#=79276 tim=107051491352
qerhnUpdateFilterStats(HJ[1]): Disabled bitmap filtering...

The smaller cluster size caused large number of small partitions, we had 128 partitions and each of them at times in memory had almost one slot (in total we had 132 slots). It means, when reading data from the tables (build and probe) with multi block read (as trace file says 126 blocks) was distributed 128 partitions and these blocks filled quickly then these slots had to be written to disk (each of them to separate disk partitions) very often. The DB performed lots of I/O requests with small chunks (7 blocks) and it took time. In this case even the platform supports asynchronous I/O, it is impossible to perform writing asynchronously, because at least 2 slots are required to do that. Same thing is true for TEST3 also. As we have pointed out for the TEST1 the number of partitions could have been selected as 64 blocks instead of 128 and cluster size would be 14 for the improvement.
If you noticed, even TEST2 is better than TEST1, despite of two-pass join. Because the size of the partitions is big and it restricted to perform lots of small write requests to the disk at certain percent.
The TEST1 was true one-pass on disk HJ. So, to get larger cluster size and reduce number of partitions properly as per (F.3) we should increase the memory.
For example if you set pga_aggregate_target=200M and repeat the TEST1 again then:

Join Type: INNER join
Original hash-area size: 14346840
Memory for slot table: 12189696
Calculated overhead for partitions and row/slot managers: 2157144
Hash-join fanout: 32
Number of partitions: 32
Number of slots: 48
Multiblock IO: 31
Block size(KB): 8
Cluster (slot) size(KB): 248
Minimum number of bytes per block: 8160
Bit vector memory allocation(KB): 2048
Per partition bit vector length(KB): 64
Maximum possible row length: 8071
Estimated build size (KB): 229
Estimated Build Row Length (includes overhead): 8031
# Immutable Flags:
  Evaluate Left Input Row Vector
  Evaluate Right Input Row Vector
# Mutable Flags:
  Not BUFFER(execution) output of the join for PQ
  IO sync
kxhfSetPhase(HJ[1]): phase=BUILD
kxhfAddChunk(HJ[1]): add chunk 0 (sz=64) to slot table
kxhfAddChunk(HJ[1]): chunk 0 (lbs=0x7f938e7037e0, slotTab=0x7f938e703a00) added
kxhfWrite(HJ[1]): hash-join is spilling to disk

*** 2020-05-23T02:52:36.290573-04:00
kxhfSetPhase(HJ[1]): phase=PROBE_1
qerhnFetch(HJ[1]): max build row length (mbl=8023)
*** RowSrcId: 1 END OF BUILD (PHASE 1) ***
  Revised row length: 8023
  Revised build size: 235036KB
SQL_ID  a8dmqqcbak2uu, child number 0
-------------------------------------
    SELECT   *       FROM   t1, t2      WHERE   t1.id = t2.id                                                                               

Plan hash value: 1838229974                                                                                                                 

--------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem | Used-Tmp|
--------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |  20999 |00:00:06.31 |     270K|    269K|  89559 |       |       |          |         |
|*  1 |  HASH JOIN         |      |      1 |  30000 |  20999 |00:00:06.31 |     270K|    269K|  89559 |   244M|    14M|   14M (1)|     723M|
|   2 |   TABLE ACCESS FULL| T1   |      1 |  30000 |  30000 |00:00:00.38 |   90004 |  60002 |      0 |       |       |          |         |
|   3 |   TABLE ACCESS FULL| T2   |      1 |  60000 |  60000 |00:00:00.64 |     180K|    120K|      0 |       |       |          |         |
--------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------                                                                                         

   1 - access("T1"."ID"="T2"."ID")

It is still one-pass and on disk HJ, but with larger cluster size, so small number of larger partitions. None of partitions were kept in the memory but performance improved significantly.

The TEST4

alter system set pga_aggregate_target=600M;
alter session set tracefile_identifier='PGA_600M';
alter session set events '10104 trace name context forever';

SQL_ID  a8dmqqcbak2uu, child number 0
-------------------------------------
    SELECT   *       FROM   t1, t2      WHERE   t1.id = t2.id                                                                               

Plan hash value: 1838229974                                                                                                                 

--------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem | Used-Tmp|
--------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |  20999 |00:00:05.65 |     270K|    246K|  66526 |       |       |          |         |
|*  1 |  HASH JOIN         |      |      1 |  30000 |  20999 |00:00:05.65 |     270K|    246K|  66526 |   244M|    14M|  106M (1)|     540M|
|   2 |   TABLE ACCESS FULL| T1   |      1 |  30000 |  30000 |00:00:00.30 |   90004 |  60002 |      0 |       |       |          |         |
|   3 |   TABLE ACCESS FULL| T2   |      1 |  60000 |  60000 |00:00:00.56 |     180K|    120K|      0 |       |       |          |         |
--------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------                                                                                         

   1 - access("T1"."ID"="T2"."ID")                                                                                                          

    SELECT   *
      FROM   t1, t2
     WHERE   t1.id = t2.id

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch     1401      1.78       5.69     246531     270533          0       20999
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total     1403      1.79       5.70     246531     270533          0       20999

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 106
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
     20999      20999      20999  HASH JOIN  (cr=270589 pr=246542 pw=66526 time=5650978 us starts=1 cost=65570 size=480930000 card=30000)
     30000      30000      30000   TABLE ACCESS FULL T1 (cr=90004 pr=60002 pw=0 time=300684 us starts=1 cost=10474 size=240330000 card=30000)
     60000      60000      60000   TABLE ACCESS FULL T2 (cr=180529 pr=120003 pw=0 time=561738 us starts=1 cost=20929 size=481200000 card=60000)

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                    1401        0.00          0.00
  PGA memory operation                          202        0.00          0.00
  Disk file operations I/O                        2        0.00          0.00
  db file sequential read                         5        0.00          0.00
  direct path read                             1428        0.01          0.42
  asynch descriptor resize                        9        0.00          0.00
  direct path write temp                       2146        0.01          4.00
  direct path read temp                        2146        0.00          0.10
  SQL*Net more data to client                     4        0.00          0.00
  SQL*Net message from client                  1401        0.00          0.20
********************************************************************************

This test was also done in one-pass mode, but we had 16M for HASH_AREA_SIZE and as it is seen in this case asynch I/O was possible. But still writing partitions to the disk is the dominant part of the process.


Join Type: INNER join
Original hash-area size: 16443992
Memory for slot table: 12189696
Calculated overhead for partitions and row/slot managers: 4254296
Hash-join fanout: 32
Number of partitions: 32
Number of slots: 48
Multiblock IO: 31
Block size(KB): 8
Cluster (slot) size(KB): 248
Minimum number of bytes per block: 8160
Bit vector memory allocation(KB): 4096
Per partition bit vector length(KB): 128
Maximum possible row length: 8071
Estimated build size (KB): 229
Estimated Build Row Length (includes overhead): 8031
# Immutable Flags:
  Evaluate Left Input Row Vector
  Evaluate Right Input Row Vector
# Mutable Flags:
  Not BUFFER(execution) output of the join for PQ
  IO sync
kxhfSetPhase(HJ[1]): phase=BUILD
…
kxhfSetPhase(HJ[1]): phase=PROBE_1
qerhnFetch(HJ[1]): max build row length (mbl=8023)
*** RowSrcId: 1 END OF BUILD (PHASE 1) ***
  Revised row length: 8023
  Revised build size: 235036KB

The number of partitions reduced and cluster size increased. According to the (F.4):

235036/(12189696/1024)<NP<12189696/1024/8/C => 19.7< NP < 1488/C => NP=32
--and then from (F.7) we get:
C < 12189696/1024/8/32 = 46.5.

Although it should have been possible to select cluster size as 40 cluster, but oracle selected it as 31. Also, quite enough memory allowed the database to keep some partitions completely in memory the end of build phase and gave the improvement.

*** RowSrcId: 1 HASH JOIN BUILD HASH TABLE (PHASE 1) ***
Total number of partitions: 32
Number of partitions left in memory: 12
Total number of rows in in-memory partitions: 11225
   (used as preliminary number of buckets in hash table)
Estimated max # of build rows that can fit in avail memory: 49352
### Partition Distribution ###
Partition:0    rows:974        clusters:0      slots:0      kept=0
Partition:1    rows:994        clusters:0      slots:0      kept=0
Partition:2    rows:933        clusters:0      slots:0      kept=0
Partition:3    rows:902        clusters:0      slots:0      kept=0
Partition:4    rows:921        clusters:0      slots:0      kept=0
Partition:5    rows:917        clusters:0      slots:0      kept=0
Partition:6    rows:968        clusters:0      slots:0      kept=0
Partition:7    rows:921        clusters:0      slots:0      kept=0
Partition:8    rows:967        clusters:0      slots:0      kept=0
Partition:9    rows:961        clusters:0      slots:0      kept=0
Partition:10   rows:971        clusters:0      slots:0      kept=0
Partition:11   rows:941        clusters:0      slots:0      kept=0
Partition:12   rows:929        clusters:0      slots:0      kept=0
Partition:13   rows:970        clusters:0      slots:0      kept=0
Partition:14   rows:895        clusters:0      slots:0      kept=0
Partition:15   rows:927        clusters:0      slots:0      kept=0
Partition:16   rows:882        clusters:0      slots:0      kept=0
Partition:17   rows:932        clusters:0      slots:0      kept=0
Partition:18   rows:937        clusters:0      slots:0      kept=0
Partition:19   rows:933        clusters:0      slots:0      kept=0
Partition:20   rows:962        clusters:31     slots:1      kept=1
Partition:21   rows:989        clusters:32     slots:1      kept=1
Partition:22   rows:939        clusters:30     slots:1      kept=1
Partition:23   rows:909        clusters:29     slots:1      kept=1
Partition:24   rows:921        clusters:30     slots:1      kept=1
Partition:25   rows:933        clusters:30     slots:1      kept=1
Partition:26   rows:875        clusters:28     slots:1      kept=1
Partition:27   rows:957        clusters:31     slots:1      kept=1
Partition:28   rows:923        clusters:30     slots:1      kept=1
Partition:29   rows:926        clusters:30     slots:1      kept=1
Partition:30   rows:948        clusters:31     slots:3      kept=1
Partition:31   rows:943        clusters:30     slots:14     kept=1
*** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
Revised number of hash buckets (after flushing): 11225

The TEST5

Of course, Oracle does not always follow our above assumption. This test we will see that high number of partitions are selected.


create table t4
tablespace  users
as
select  10001+rownum as id
      , trunc(sysdate)-trunc(level/1000) dat
      , round(level/10,2) amount
      , rpad('x', 10000) as f1
      , rpad('y', 10000) as f2
      , rpad('x', 10000) as f3
      , rpad('y', 10000) as f4
      , rpad('x', 10000) as f5
      , rpad('y', 10000) as f6
from dual
connect by
        level <= 6e4
;

create table t5 tablespace users parallel 2 as  select * from t4 ;
alter table t5 parallel 1;

alter system set pga_aggregate_target=1g;

SQL_ID  a1ajqys4at9sj, child number 0
-------------------------------------
    SELECT   *       FROM   t4, t5      WHERE   t4.id = t5.id                                                                               

Plan hash value: 3579674918                                                                                                                 

--------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem | Used-Tmp|
--------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |    960K|00:03:46.81 |    3198K|   3022K|   1420K|       |       |          |         |
|*  1 |  HASH JOIN         |      |      1 |    954K|    960K|00:03:46.81 |    3198K|   3022K|   1420K|  5692M|    56M|  103M (1)|      11G|
|   2 |   TABLE ACCESS FULL| T4   |      1 |    240K|    240K|00:00:04.39 |    1599K|    800K|      0 |       |       |          |         |
|   3 |   TABLE ACCESS FULL| T5   |      1 |    240K|    240K|00:00:04.50 |    1599K|    800K|      0 |       |       |          |         |
--------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------                                                                                         

   1 - access("T4"."ID"="T5"."ID")                                                                                                          

*** RowSrcId: 1 HASH JOIN STATISTICS (INITIALIZATION) ***
Join Type: INNER join
Original hash-area size: 101918155
Memory for slot table: 97517568
Calculated overhead for partitions and row/slot managers: 4400587
Hash-join fanout: 256
Number of partitions: 256
Number of slots: 384
Multiblock IO: 31
Block size(KB): 8
Cluster (slot) size(KB): 248
Minimum number of bytes per block: 8160
Bit vector memory allocation(KB): 4096
Per partition bit vector length(KB): 16
Maximum possible row length: 24079
Estimated build size (KB): 5505
Estimated Build Row Length (includes overhead): 24054
# Immutable Flags:
  Evaluate Left Input Row Vector
  Evaluate Right Input Row Vector
# Mutable Flags:
  Not BUFFER(execution) output of the join for PQ
  IO sync
kxhfSetPhase(HJ[1]): phase=BUILD
WAIT #140460853627488: nam='Disk file operations I/O' ela= 19 FileOperation=2 fileno=7 filetype=2 obj#=79792 tim=1728278840
WAIT #140460853627488: nam='db file sequential read' ela= 451 file#=7 block#=181378 blocks=1 obj#=79792 tim=1728279303
....
kxhfSetPhase(HJ[1]): phase=PROBE_1
qerhnFetch(HJ[1]): max build row length (mbl=24041)
*** RowSrcId: 1 END OF BUILD (PHASE 1) ***
  Revised row length: 24041
  Revised build size: 5634579KB

********************************************************************************

    SELECT   *
      FROM   t4, t5
     WHERE   t4.id = t5.id

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch    64001     39.30     240.09    3022397    3198089          0      960000
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total    64003     39.31     240.10    3022397    3198089          0      960000

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 106
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
    960000     960000     960000  HASH JOIN  (cr=3198145 pr=3022407 pw=1420978 time=232736486 us starts=1 cost=980133 size=45875317488 card=954781)
    240000     240000     240000   TABLE ACCESS FULL T4 (cr=1599058 pr=800668 pw=0 time=5110328 us starts=1 cost=217453 size=5765760000 card=240000)
    240000     240000     240000   TABLE ACCESS FULL T5 (cr=1599031 pr=800751 pw=0 time=5080920 us starts=1 cost=217190 size=5765760000 card=240000)

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                   64001        0.00          0.05
  PGA memory operation                          252        0.00          0.00
  Disk file operations I/O                        2        0.00          0.00
  db file sequential read                      2360        0.02          0.15
  direct path read                            12528        0.01          4.16
  asynch descriptor resize                        3        0.00          0.00
  direct path write temp                      45838        0.06        178.93
  direct path read temp                       45838        0.03         26.99
  SQL*Net more data to client                    11        0.00          0.00
  SQL*Net message from client                 64001        0.02         28.75
********************************************************************************

NP could have been calculated as:

NP > 5634579/(101918155/1024)=56.61, so NP=64.

But high number of partitions caused small cluster size. if database decides to produce large number of small partitions then the NP should be large enough for keep some of them into the memory at the end of the build phase. In our last case none of 256 partitions was not stored in the memory, in contrast C was small and got inefficient i/o. if NP was selected as 64 then C would be big and i/o time would be reduced.
To improve one-pass HJ performance the minimum memory defined by (F.6) should be increased. And this increasing will cause two things:

1) If the increased memory not sufficiently big but it would allow Oracle to select bigger cluster size and small number of partitions and as a result HJ performance will be improved.
2) If the memory sufficiently increased, then oracle will be able to define optimal cluster size and number of partitions and some of them will be kept in the memory (the writing partitions to the disk part will be reduced), as a result significant improvement will be gain.

To increase memory the following could be considered (roughly):

MEM = SQRT(T1*C) + k*|Pi| = SQRT(T1*C) + k|T1|/NP
--If you want to keep k number of partitions in the memory
Or
MEM = SQRT(T1*C) +p*|T1|
--If you want to keep p percent of |T1| in the memory

Summary

The One-pass HJ is effective if there is enough memory to store some partitions in it. Multi-pass join might give better response time than pure One-pass join (which is on disk) in certain situation. Manual work area management is not good of course, it does not use available memory efficiently, but depending on conditions in some cases, it could be much efficient than pure one-pass auto join. Oracle tries to produce large number of partitions with small sizes in general. It is good if we have enough memory to keep some partitions in memory and reduce writing of full build/probe tables to the temporary segment. Otherwise as we see, it caused lots of small I/O requests to write data to the disk. Re-reading might not be the main problem of the HJ. Because in this case small number of but large partitions could be used to take advantage of I/O. We saw that the build/partition phase might be most dominant part of the HJ not the join phase itself.
Finally, number of partitions and cluster size are identified by runtime engine during the HJ, but available memory depends on activities in the DB (active work areas) so memory pressure plays important role during the HJ execution.

Useful materials:

[1] Lewis, Jonathan “Cost-Based Oracle Fundamentals”, 2006
[2] Prabhaker “GP” Gongloor, Sameer Patkar. Oracle Technical Report
“Hash Joins Implementation and Tuning Release 7.3” ACTA March 1997

Leave a Comment

April 29, 2019

Parallel Shared Hash Join

Filed under: CBO,Join,Parallel Execution,Skew,Tuning — Chinar Aliyev @ 5:36 am

Starting with Oracle Database 18c the new Parallel Hash Join mechanism has been implemented -Parallel Shared Hash Join (PSHJ), but not documented yet. It can be controlled by PQ_DISTRIBUTE hint. Let`s see the following example:

OS: REL7 64bit
DB version: 18.3.0.0

create table t1
compress
as
select  rownum as id
      , 2*rownum id1
      , rpad('x', 100) as f1
from dual
connect by
        level &lt;= 1.5e6
;
create table t2
compress
as
select
        case when rownum &lt;= 1e5 then 1
         when rownum &lt;= 7e5 then 2 else rownum end as id
      , rpad('x', 100) as f1
from  dual
connect by
        level &lt;= 2e6;

execute dbms_stats.gather_table_stats(null,'t2',method_opt=&gt;'for all columns size skewonly');

The query and its explain plan:
explain plan for
select count(t_2f) from (
select /*+  parallel(2)
            leading(t1 t2)
            use_hash(t1 t2)
            pq_distribute(t2 hash hash) */
            t1.f1 t1_f1,t2.f1 t2_f1
            , regexp_replace(t2.f1, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') || regexp_replace(t2.f1, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i') as t_2f
 from t1,t2
where t1.id=t2.id)
;
(Query1)
select * from table(dbms_xplan.display());
Plan hash value: 2644066362                                                                                           

----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |          |     1 |   112 |   954   (2)| 00:00:01 |        |      |            |
|   1 |  SORT AGGREGATE              |          |     1 |   112 |            |          |        |      |            |
|   2 |   PX COORDINATOR             |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)       | :TQ10002 |     1 |   112 |            |          |  Q1,02 | P-&gt;S | QC (RAND)  |
|   4 |     SORT AGGREGATE           |          |     1 |   112 |            |          |  Q1,02 | PCWP |            |
|*  5 |      HASH JOIN               |          |  2014K|   215M|   954   (2)| 00:00:01 |  Q1,02 | PCWP |            |
|   6 |       PX RECEIVE             |          |  1500K|  8789K|   497   (1)| 00:00:01 |  Q1,02 | PCWP |            |
|   7 |        PX SEND HYBRID HASH   | :TQ10000 |  1500K|  8789K|   497   (1)| 00:00:01 |  Q1,00 | P-&gt;P | HYBRID HASH|
|   8 |         STATISTICS COLLECTOR |          |       |       |            |          |  Q1,00 | PCWC |            |
|   9 |          PX BLOCK ITERATOR   |          |  1500K|  8789K|   497   (1)| 00:00:01 |  Q1,00 | PCWC |            |
|  10 |           TABLE ACCESS FULL  | T1       |  1500K|  8789K|   497   (1)| 00:00:01 |  Q1,00 | PCWP |            |
|  11 |       PX RECEIVE             |          |  2000K|   202M|   451   (2)| 00:00:01 |  Q1,02 | PCWP |            |
|  12 |        PX SEND HYBRID HASH   | :TQ10001 |  2000K|   202M|   451   (2)| 00:00:01 |  Q1,01 | P-&gt;P | HYBRID HASH|
|  13 |         PX BLOCK ITERATOR    |          |  2000K|   202M|   451   (2)| 00:00:01 |  Q1,01 | PCWC |            |
|  14 |          TABLE ACCESS FULL   | T2       |  2000K|   202M|   451   (2)| 00:00:01 |  Q1,01 | PCWP |            |
----------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------                                                                   

   5 - access("T1"."ID"="T2"."ID")                                                                                    

Note
-----
   - Degree of Parallelism is 2 because of hint
(EP1)

But if we change the PQ_DISTRIBUTE hint parameter with following parameters then we will get:
explain plan for
select count(t_2f) from (
select /*+  parallel(2)
            leading(t1 t2)
            use_hash(t1 t2)
            pq_distribute(t2 shared none) */
            t1.f1 t1_f1,t2.f1 t2_f1
            , regexp_replace(t2.f1, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') || regexp_replace(t2.f1, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i') as t_2f
 from t1,t2
where t1.id=t2.id)
;
 (Query2)
select * from table(dbms_xplan.display());
Plan hash value: 1703361066                                                                                      

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |     1 |   112 |   954   (2)| 00:00:01 |        |      |            |
|   1 |  SORT AGGREGATE         |          |     1 |   112 |            |          |        |      |            |
|   2 |   PX COORDINATOR        |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)  | :TQ10000 |     1 |   112 |            |          |  Q1,00 | P-&gt;S | QC (RAND)  |
|   4 |     SORT AGGREGATE      |          |     1 |   112 |            |          |  Q1,00 | PCWP |            |
|*  5 |      HASH JOIN SHARED   |          |  2014K|   215M|   954   (2)| 00:00:01 |  Q1,00 | PCWP |            |
|   6 |       PX BLOCK ITERATOR |          |  1500K|  8789K|   497   (1)| 00:00:01 |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS FULL| T1       |  1500K|  8789K|   497   (1)| 00:00:01 |  Q1,00 | PCWP |            |
|   8 |       PX BLOCK ITERATOR |          |  2000K|   202M|   451   (2)| 00:00:01 |  Q1,00 | PCWC |            |
|   9 |        TABLE ACCESS FULL| T2       |  2000K|   202M|   451   (2)| 00:00:01 |  Q1,00 | PCWP |            |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------                                                              

   5 - access("T1"."ID"="T2"."ID")                                                                               

Note
-----
   - Degree of Parallelism is 2 because of hint
(EP2)

As you see there is a new operator in the plan – HASH JOIN SHARED and the new additional parameters for the PQ_DISTRIBUTE hint. The hint tells optimizer that t2 table data is not going to be distributed but will be joined a row-source that is shared. In the plan (EP2) there only one PX Slave Set that going to scan of its inputs and to perform the join operation itself. As it is known in the single table replicate mode (PQ_REPLICATE) there is also one PX Slave set which performs scan and join operations but each PX slave scans and gets the whole build table via buffer cache. But in our case which the plan (EP2) describes clearly that each PX slave going to scan only an appropriate part of the build table not whole of the table. But how the join is performed by the same slaves concurrently?

The answer is that after scanning the build table (t1) the hash table is shared between PX slaves – as the HASH JOIN SHARED operation includes the SHARED key word. So according plan (EP2) firstly, each slave scans a part of t1 table and global(shared) hash table is created based on that parts. If it is shared then each PX Slave can safely scan the second table (t2) and can do probe with global hash table in order to complete join.

Before the explain how the global hash table is shared, I want to post here the execution statistics of the (Query2). I have enabled SQL and HASH Join traces together for the Parallel Join. So:
alter session set tracefile_identifier='SHJ';
alter session set events '10104 trace name context forever';
alter session set events '10046 TRACE NAME CONTEXT FOREVER, LEVEL 12';
--- executing Query2

Execution stats(Some parts of TKPROF result) for first slave (p000):
Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         0          0          0  SORT AGGREGATE (cr=0 pr=0 pw=0 time=0 us starts=0)
         0          0          0   PX COORDINATOR  (cr=0 pr=0 pw=0 time=0 us starts=0)
         0          0          0    PX SEND QC (RANDOM) :TQ10000 (cr=0 pr=0 pw=0 time=0 us starts=0)
         1          1          1     SORT AGGREGATE (cr=2847 pr=0 pw=0 time=25892349 us starts=1)
    772910     772910     772910      HASH JOIN SHARED (cr=2847 pr=0 pw=0 time=935130 us starts=1 cost=954 size=225651216 card=2014743)
    696267     696267     696267       PX BLOCK ITERATOR (cr=1560 pr=0 pw=0 time=31983 us starts=1 cost=497 size=9000000 card=1500000)
    696267     696267     696267        TABLE ACCESS FULL T1 (cr=1560 pr=0 pw=0 time=29498 us starts=12 cost=497 size=9000000 card=1500000)
    846853     846853     846853       PX BLOCK ITERATOR (cr=1287 pr=0 pw=0 time=80051 us starts=1 cost=451 size=212000000 card=2000000)
    846853     846853     846853        TABLE ACCESS FULL T2 (cr=1287 pr=0 pw=0 time=62276 us starts=11 cost=451 size=212000000 card=2000000)

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  PX Deq: Execution Msg                          25        0.01          0.01
  latch: MGA shared context root latch            1        0.00          0.00

Execution stats for second slave (p002):
Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         0          0          0  SORT AGGREGATE (cr=0 pr=0 pw=0 time=0 us starts=0)
         0          0          0   PX COORDINATOR  (cr=0 pr=0 pw=0 time=0 us starts=0)
         0          0          0    PX SEND QC (RANDOM) :TQ10000 (cr=0 pr=0 pw=0 time=0 us starts=0)
         1          1          1     SORT AGGREGATE (cr=3559 pr=0 pw=0 time=25907098 us starts=1)
    727090     727090     727090      HASH JOIN SHARED (cr=3559 pr=0 pw=0 time=616521 us starts=1 cost=954 size=225651216 card=2014743)
    803733     803733     803733       PX BLOCK ITERATOR (cr=1812 pr=0 pw=0 time=38598 us starts=1 cost=497 size=9000000 card=1500000)
    803733     803733     803733        TABLE ACCESS FULL T1 (cr=1812 pr=0 pw=0 time=35324 us starts=14 cost=497 size=9000000 card=1500000)
   1153147    1153147    1153147       PX BLOCK ITERATOR (cr=1747 pr=0 pw=0 time=99336 us starts=1 cost=451 size=212000000 card=2000000)
   1153147    1153147    1153147        TABLE ACCESS FULL T2 (cr=1747 pr=0 pw=0 time=77883 us starts=15 cost=451 size=212000000 card=2000000)

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  PX Deq: Execution Msg                          31        0.00          0.00
  latch: MGA shared context latch                 1        0.00          0.00

As it is seen that each slave has scanned some part of t1 table (P000=> 696267 rows, p001=> 803733). But the slaves have created a global hash table in order to perform join and produce correct result. If we look at the trace file, for example p000 slave:
*** RowSrcId: 5 HASH JOIN STATISTICS (INITIALIZATION) ***
Join Type: INNER join
Original hash-area size: 86608574
Memory for slot table: 19808256
Calculated overhead for partitions and row/slot managers: 66800318
Hash-join fanout: 16
Number of partitions: 16
Number of slots: 78
Multiblock IO: 31
Block size(KB): 8
Cluster (slot) size(KB): 248
Minimum number of bytes per block: 8160
Bit vector memory allocation(KB): 2048
Per partition bit vector length(KB): 128
Maximum possible row length: 136
Estimated build size (KB): 14
Estimated Build Row Length (includes overhead): 20
# Immutable Flags:
  Evaluate Left Input Row Vector
  Evaluate Right Input Row Vector
# Mutable Flags:
  Not BUFFER(execution) output of the join for PQ
  IO sync
kxhfSetPhase(HJ[5]): phase=BUILD
kxhfAddChunk(HJ[5]): add chunk 0 (sz=128) to slot table
kxhfAddChunk(HJ[5]): chunk 0 (lbs=0x400000ffb748, slotTab=0x400000ffb968) added
WAIT #139660050842104: nam='PX Deq: Execution Msg' ela= 137 sleeptime/senderid=268566527 passes=1 p3=1829166000 obj#=-1 tim=32602485651
WAIT #139660050842104: nam='PX Deq: Execution Msg' ela= 151 sleeptime/senderid=268566527 passes=1 p3=1829166000 obj#=-1 tim=32602494563
WAIT #139660050842104: nam='PX Deq: Execution Msg' ela= 212 sleeptime/senderid=268566527 passes=1 p3=1829166000 obj#=-1 tim=32602505096
WAIT #139660050842104: nam='PX Deq: Execution Msg' ela= 183 sleeptime/senderid=268566527 passes=1 p3=1829166000 obj#=-1 tim=32602514524
WAIT #139660050842104: nam='PX Deq: Execution Msg' ela= 189 sleeptime/senderid=268566527 passes=1 p3=1829166000 obj#=-1 tim=32602520436
WAIT #139660050842104: nam='PX Deq: Execution Msg' ela= 132 sleeptime/senderid=268566527 passes=1 p3=1829166000 obj#=-1 tim=32602530425
.....................
WAIT #139660050842104: nam='PX Deq: Execution Msg' ela= 142 sleeptime/senderid=268566527 passes=1 p3=1829166000 obj#=-1 tim=32602588668
kxhfSetPhase(HJ[5]): phase=PROBE_1
qerhnFetch(HJ[5]): max build row length (mbl=14)
*** RowSrcId: 5 END OF BUILD (PHASE 1) ***
  Revised row length: 14
  Revised build size: 8982KB
kxhfResize(HJ[5]): enter, resize to 52 slots (numAlloc=48, max=78)
kxhfResize(HJ[5]): exit, resized to 52 slots (numAlloc=48, max=52)
  qerhnResizeBuild(HJ[5]):
  Slot table resized: old=78 wanted=52 got=52 unload=0
*** RowSrcId: 5 HASH JOIN RESIZE BUILD (PHASE 1) ***
Total number of partitions: 16
Number of partitions which could fit in memory: 16
Number of partitions left in memory: 16
Total number of slots in in-memory partitions: 48
kxhfResize(HJ[5]): enter, resize to 54 slots (numAlloc=48, max=52)
kxhfResize(HJ[5]): exit, resized to 54 slots (numAlloc=48, max=54)
  set work area size to: 77932K (54 slots)
*** RowSrcId: 5 HASH JOIN BUILD HASH TABLE (PHASE 1) ***
Total number of partitions: 16
Number of partitions left in memory: 16
Total number of rows in in-memory partitions: 696267
   (used as preliminary number of buckets in hash table)
Estimated max # of build rows that can fit in avail memory: 1951884

The trace file shows that the slave started to acquire the build table in order to create hash table and consumed rows was 696267 as reported (Total number of rows in in-memory partitions: 696267).

But the next lines indicates that the slave has identified that there are 1500000 (the whole table) number of rows of the build table!:
*** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
Revised number of hash buckets (after flushing): 696267
Allocating new hash table.
*** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
Requested size of hash table: 2097152
Actual size of hash table: 2097152
Number of buckets: 32768
*** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
Total number of rows (may have changed): 1500000

Now the following lines tell us that before creating global hash table the build table is shared between PX Slaves, so both of them started to build the global hash table simultaneously (p000):
qerhnBuildHashTable(): done hash-table on partition=0, index=0 last_slot#=41 rows=43360 total_rows=43360
kxhfIterate(HJ[5]): end_iterate, numAlloc=48, maxSlots=54
qerhnBuildHashTable(): done hash-table on partition=0, index=0 last_slot#=38 rows=50129 total_rows=93489
kxhfIterate(HJ[5]): end_iterate, numAlloc=48, maxSlots=54
qerhnBuildHashTable(): done hash-table on partition=0, index=0 last_slot#=37 rows=43549 total_rows=137038
kxhfIterate(HJ[5]): end_iterate, numAlloc=48, maxSlots=54
qerhnBuildHashTable(): done hash-table on partition=0, index=0 last_slot#=35 rows=50569 total_rows=187607
kxhfIterate(HJ[5]): end_iterate, numAlloc=48, maxSlots=54
qerhnBuildHashTable(): done hash-table on partition=0, index=0 last_slot#=38 rows=43635 total_rows=231242
kxhfIterate(HJ[5]): end_iterate, numAlloc=48, maxSlots=54
qerhnBuildHashTable(): done hash-table on partition=0, index=0 last_slot#=42 rows=49901 total_rows=281143
kxhfIterate(HJ[5]): end_iterate, numAlloc=48, maxSlots=54
qerhnBuildHashTable(): done hash-table on partition=0, index=0 last_slot#=43 rows=43292 total_rows=324435
kxhfIterate(HJ[5]): end_iterate, numAlloc=48, maxSlots=54
qerhnBuildHashTable(): done hash-table on partition=0, index=0 last_slot#=32 rows=50666 total_rows=375101
kxhfIterate(HJ[5]): end_iterate, numAlloc=48, maxSlots=54
qerhnBuildHashTable(): done hash-table on partition=0, index=0 last_slot#=32 rows=44040 total_rows=419141
kxhfIterate(HJ[5]): end_iterate, numAlloc=48, maxSlots=54
qerhnBuildHashTable(): done hash-table on partition=0, index=0 last_slot#=46 rows=49977 total_rows=469118
kxhfIterate(HJ[5]): end_iterate, numAlloc=48, maxSlots=54
qerhnBuildHashTable(): done hash-table on partition=0, index=0 last_slot#=42 rows=43273 total_rows=512391
kxhfIterate(HJ[5]): end_iterate, numAlloc=48, maxSlots=54
qerhnBuildHashTable(): done hash-table on partition=0, index=0 last_slot#=47 rows=49702 total_rows=562093
kxhfIterate(HJ[5]): end_iterate, numAlloc=48, maxSlots=54
qerhnBuildHashTable(): done hash-table on partition=0, index=0 last_slot#=44 rows=43348 total_rows=605441
kxhfIterate(HJ[5]): end_iterate, numAlloc=48, maxSlots=54
qerhnBuildHashTable(): done hash-table on partition=0, index=0 last_slot#=40 rows=50177 total_rows=655618
kxhfIterate(HJ[5]): end_iterate, numAlloc=48, maxSlots=54
qerhnBuildHashTable(): done hash-table on partition=0, index=0 last_slot#=47 rows=43177 total_rows=698795
kxhfIterate(HJ[5]): end_iterate, numAlloc=48, maxSlots=54
qerhnBuildHashTable(): done hash-table on partition=0, index=0 last_slot#=33 rows=50728 total_rows=749523
kxhfIterate(HJ[5]): end_iterate, numAlloc=48, maxSlots=54
*** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
……
### Hash table overall statistics ###
Total buckets: 32768 Empty buckets: 30103 Non-empty buckets: 2665
Total number of rows: 1500000
Maximum number of keys in a bucket: 14
Average number of rows in non-empty buckets: 562.851807
qerhnProbeChooseRowP (rwsid=5): inmem=1,mm=0,nf=1

Although the p000 slave has scanned 696267 rows but it latter processed 749523 rows to build (a part of) global hash table. The trace files of the both PX slaves show that they have already got the same global hash table and now they can perform probe phase.

To share the hash table there is a new shared memory region has been introduced (its definition/metadata is stored in shared pool) and is called  Managed Global Area. Global hash table is created in this area and it is available during the join process. While performing the join operation we can query v$sgastat to see some stuff:
select pool,name,bytes from v$sgastat
where upper(name) like '%MGA%';

pool         name            byes
shared pool  ksm mga segarr2  73728   &lt;-- visible only during the join
shared pool  mga fixtab off   48
shared pool  MGA service      287864
shared pool  ksmmga_hpds1     600     &lt;-- visible only during the join 


This is a new method which is used to share memory between processes. In addition new stats names appeared in v$pgastat (MGA allocated (under PGA),maximum MGA allocated). Also, to protect the new shared memory region new latches and appropriate wait events (like latch: MGA.. ) have been introduced.

There is another question, what about RAC and new PSHJ implementation? Unfortunately, the mechanism does not work when PX slaves are distributed cluster wide because the global hash table could not be shared between cluster nodes yet. It is a part of Shared Pool. So, optimizer can generate the PSHJ plan only when you force to generate PX slaves locally (i.e parallel_force_local=true).

Now, couple of words about the performance of PSHJ. First of all, in this case parallel hash join operation will not suffer its BUFFERED variant. Secondly, there will not be skew problem (my initial thought) even there several tables and different kind of filter predicates are involved in a join. If you look at the t2 table data (id column), you can see that there is significant skew , however optimizer does not apply the automatic skew handling mechanism to the join. Because there is a relationship between DOP and number of top frequent values (NTFV), so that if DOP <= NTFV then optimizer thinks (mostly) that all TFV will already be distributed among all PX Slaves and BROADCASTING TFV is not necessary. But it does not make sense how the hash function is clever the join anyway will suffer of skew due to initial data skew. That is why in my test system the (Query1) is completed within 35 seconds but the (Query2) within 27 seconds (due to no skew problem).

Let`s go back again to the previous post which I explained that in some cases disabling skew detection can be useful. So that, with Automatic Skew handling the mentioned query in the blog post executed within 82 seconds and without the feature execution time was 18 seconds. But with the new PSHJ implementation the query just completed within 6 seconds in the same system due to no processing skew (scan and join) and no distribution (only one slave set performs all operations). The SQL Monitor report is:
SQL Monitoring Report

SQL Text
------------------------------
SELECT COUNT (ff1), COUNT (f2) FROM (SELECT /*+ parallel(10) monitor pq_distribute(t2 shared none)ssq1 */ t1 .f1 ff1, t2.* FROM t1, t2 WHERE t1.c1 = t2.c1 AND chrtp NOT IN ('A', 'E', 'D', 'C'))

Global Information
------------------------------
 Status              :  DONE (ALL ROWS)
 Instance ID         :  1
 SQL ID              :  ftx8ugp677rr6
 SQL Execution ID    :  16777216
 Execution Started   :  04/28/2019 17:48:32
 First Refresh Time  :  04/28/2019 17:48:32
 Last Refresh Time   :  04/28/2019 17:48:38
 Duration            :  6s
 Service             :  SYS$USERS
 Program             :  sqlplus@localhost.localdomain (TNS V1-V3)
 Fetch Calls         :  1                                           

Global Stats
==========================================================================================
| Elapsed |   Cpu   |    IO    | Concurrency |  Other   | Fetch | Buffer | Read  | Read  |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   | Waits(s) | Calls |  Gets  | Reqs  | Bytes |
==========================================================================================
|      62 |      49 |       13 |        0.02 |     0.01 |     1 |     8M | 52192 |  50GB |
==========================================================================================

Parallel Execution Details (DOP=10 , Servers Allocated=10)
===========================================================================================================================================
|      Name      | Type  | Server# | Elapsed |   Cpu   |    IO    | Concurrency |  Other   | Buffer | Read | Read  |     Wait Events      |
|                |       |         | Time(s) | Time(s) | Waits(s) |  Waits(s)   | Waits(s) |  Gets  | Reqs | Bytes |      (sample #)      |
===========================================================================================================================================
| PX Coordinator | QC    |         |    0.03 |    0.02 |          |             |     0.01 |     14 |      |     . |                      |
| p000           | Set 1 |       1 |    6.12 |    4.84 |     1.29 |        0.00 |          |   849K | 5337 |   5GB | direct path read (1) |
| p001           | Set 1 |       2 |    7.43 |    6.11 |     1.31 |        0.00 |          |   844K | 5279 |   5GB | direct path read (3) |
| p002           | Set 1 |       3 |    5.84 |    4.62 |     1.21 |        0.00 |          |   791K | 4908 |   5GB | direct path read (2) |
| p003           | Set 1 |       4 |    5.70 |    4.48 |     1.21 |        0.00 |          |   770K | 4983 |   5GB |                      |
| p004           | Set 1 |       5 |    6.18 |    4.90 |     1.28 |        0.00 |          |   840K | 5380 |   5GB |                      |
| p005           | Set 1 |       6 |    6.05 |    4.80 |     1.24 |        0.00 |          |   846K | 5322 |   5GB |                      |
| p006           | Set 1 |       7 |    6.07 |    4.83 |     1.24 |        0.00 |          |   792K | 5063 |   5GB | direct path read (1) |
| p007           | Set 1 |       8 |    6.12 |    4.82 |     1.30 |        0.00 |          |   811K | 5295 |   5GB | direct path read (2) |
| p008           | Set 1 |       9 |    6.02 |    4.76 |     1.25 |        0.00 |          |   818K | 5311 |   5GB |                      |
| p009           | Set 1 |      10 |    6.29 |    5.01 |     1.28 |        0.00 |          |   843K | 5314 |   5GB |                      |
===========================================================================================================================================

SQL Plan Monitoring Details (Plan Hash Value=1703361066)
=================================================================================================================================================================
| Id |         Operation         |   Name   |  Rows   | Cost  |   Time    | Start  | Execs |   Rows   | Read  | Read  |  Mem  | Activity |   Activity Detail    |
|    |                           |          | (Estim) |       | Active(s) | Active |       | (Actual) | Reqs  | Bytes | (Max) |   (%)    |     (# samples)      |
=================================================================================================================================================================
|  0 | SELECT STATEMENT          |          |         |       |         1 |     +6 |     1 |        1 |       |       |     . |          |                      |
|  1 |   SORT AGGREGATE          |          |       1 |       |         1 |     +6 |     1 |        1 |       |       |     . |          |                      |
|  2 |    PX COORDINATOR         |          |         |       |         1 |     +6 |    11 |       10 |       |       |     . |          |                      |
|  3 |     PX SEND QC (RANDOM)   | :TQ10000 |       1 |       |         2 |     +5 |    10 |       10 |       |       |     . |          |                      |
|  4 |      SORT AGGREGATE       |          |       1 |       |         2 |     +5 |    10 |       10 |       |       |     . |          |                      |
|  5 |       HASH JOIN SHARED    |          |     77G |  274K |         7 |     +1 |    10 |       1M |       |       |   6GB |    61.54 | Cpu (32)             |
|  6 |        PX BLOCK ITERATOR  |          |      1M | 78333 |         3 |     +2 |    10 |       1M |       |       |     . |          |                      |
|  7 |         TABLE ACCESS FULL | T1       |      1M | 78333 |         4 |     +1 |   134 |       1M | 20472 |  20GB |     . |    15.38 | Cpu (2)              |
|    |                           |          |         |       |           |        |       |          |       |       |       |          | direct path read (6) |
|  8 |        PX BLOCK ITERATOR  |          |      2M |  121K |         3 |     +4 |    10 |       2M |       |       |     . |          |                      |
|  9 |         TABLE ACCESS FULL | T2       |      2M |  121K |         3 |     +4 |   132 |       2M | 31720 |  31GB |     . |    23.08 | Cpu (9)              |
|    |                           |          |         |       |           |        |       |          |       |       |       |          | direct path read (3) |
=================================================================================================================================================================

All mentioned notes about PSHJ in this post are initial thoughts. It still is not documented and supported.

Finally, I want to say that it seems the PSHJ is true and native approach for Oracle DBMS than previous/conventional parallel join implementations. There is not principal difference between Oracle DB`s conventional parallel Joins, parallel query optimization techniques and that of shared nothing architecture (like Teradata`s Px Query Optimization) but Oracle has flexibility due to shared storage. It did not used to take advantage of shared memory. However, in current releases it is not allowed to use the technique over cluster wide, but I expect that in future, oracle DB`s distributed shared memory architecture will allow it to use.

Comments (5)

April 10, 2019

Automatic Parallel Outer Join Null Handling

Filed under: CBO,Join,Parallel Execution — Chinar Aliyev @ 5:29 am

The problem was described by Randolf in his article. He used two table and demonstrated that if majority of foreign key values are NULL then Parallel Outer Join (POJ) can suffer the skew related to the NULL values. Generally, in practice foreign keys (majority) are not null values or we can see the symptom very occasionally. But the true NULL value skew is caused by outer join itself when more than two tables are involved. In this case null values are automatically generated due to definition of the join. Starting from Oracle Database 18c the mentioned problem is solved automatically. I did a little change on Randolf`s test case in order to demonstrate the null values handling for his (two table) test case as well as three table join.
The test case:

-- OS:OEL 7.3 64bit, DB:Oracle 18c
create table t1
compress
as
select  /*+ use_nl(a b) */
        rownum as id
      , 2*rownum id1
      , rpad('x', 100) as filler
from
        (select /*+ cardinality(1e5) */ * from dual
connect by
        level <= 2e5) a, (select /*+ cardinality(20) */ * from dual connect by level <= 20) b
;

create table t2
compress
as
select
        rownum as id
      , case when rownum <= 1e5 then 1 
         when rownum <= 5e5 then 2 else rownum end as fk_id_skew
      , rownum as fk_id_uniform
      , case when rownum <= 5e5 then mod(rownum, 2e6) + 1 else null end as fk_id_sparse
      , rpad('x', 100) as filler
from
        (select /*+ cardinality(1e5) */ * from dual
connect by
        level <= 2e5) a, (select /*+ cardinality(20) */ * from dual connect by level <= 20) b
;


create table t3
compress
as
select
        rownum as id
      ,4*rownum id3
      , rpad('x', 100) as filler
from
        (select /*+ cardinality(1e5) */ * from dual
connect by
        level <= 2e5) a, (select /*+ cardinality(20) */ * from dual connect by level <= 20) b
;

Firstly, let`s see the following query and explain plan when OFE = 12.2.0.1 as:

alter session set optimizer_features_enable='12.2.0.1';

explain plan for
select count(t_2_filler), count(t_3_filler) from
(
select /*+ monitor no_merge parallel(4) leading(t2 t1)
*/ t1.id, t1.filler t1f1,t2.id,t2.filler t2f1
      , regexp_replace(t2.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') || regexp_replace(t2.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i') as t_2_filler
      , regexp_replace(t3.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') || regexp_replace(t3.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i') as t_3_filler
      from 
t2 left outer join t1 on t1.id=t2.fk_id_sparse
 left outer join t3 on t1.id1=t3.id3
 )
 ;

(Q1)

select * from table(dbms_xplan.display(null,null,null));

----------------------------------------------------------
Plan hash value: 614020325

--------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |          |     1 |  4004 |  2235   (2)| 00:00:01 |        |      |            |
|   1 |  SORT AGGREGATE                  |          |     1 |  4004 |            |          |        |      |            |
|   2 |   PX COORDINATOR                 |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)           | :TQ10004 |     1 |  4004 |            |          |  Q1,04 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE               |          |     1 |  4004 |            |          |  Q1,04 | PCWP |            |
|   5 |      VIEW                        |          |  4008K|    14G|  2235   (2)| 00:00:01 |  Q1,04 | PCWP |            |
|*  6 |       HASH JOIN RIGHT OUTER      |          |  4008K|   848M|  2235   (2)| 00:00:01 |  Q1,04 | PCWP |            |
|   7 |        PX RECEIVE                |          |  4000K|   408M|   687   (1)| 00:00:01 |  Q1,04 | PCWP |            |
|   8 |         PX SEND HASH             | :TQ10002 |  4000K|   408M|   687   (1)| 00:00:01 |  Q1,02 | P->P | HASH       |
|   9 |          PX BLOCK ITERATOR       |          |  4000K|   408M|   687   (1)| 00:00:01 |  Q1,02 | PCWC |            |
|  10 |           TABLE ACCESS FULL      | T3       |  4000K|   408M|   687   (1)| 00:00:01 |  Q1,02 | PCWP |            |
|  11 |        PX RECEIVE                |          |  4000K|   438M|  1542   (2)| 00:00:01 |  Q1,04 | PCWP |            |
|  12 |         PX SEND HASH             | :TQ10003 |  4000K|   438M|  1542   (2)| 00:00:01 |  Q1,03 | P->P | HASH       |
|* 13 |          HASH JOIN OUTER BUFFERED|          |  4000K|   438M|  1542   (2)| 00:00:01 |  Q1,03 | PCWP |            |
|  14 |           PX RECEIVE             |          |  4000K|   392M|   850   (1)| 00:00:01 |  Q1,03 | PCWP |            |
|  15 |            PX SEND HASH          | :TQ10000 |  4000K|   392M|   850   (1)| 00:00:01 |  Q1,00 | P->P | HASH       |
|  16 |             PX BLOCK ITERATOR    |          |  4000K|   392M|   850   (1)| 00:00:01 |  Q1,00 | PCWC |            |
|  17 |              TABLE ACCESS FULL   | T2       |  4000K|   392M|   850   (1)| 00:00:01 |  Q1,00 | PCWP |            |
|  18 |           PX RECEIVE             |          |  4000K|    45M|   685   (1)| 00:00:01 |  Q1,03 | PCWP |            |
|  19 |            PX SEND HASH          | :TQ10001 |  4000K|    45M|   685   (1)| 00:00:01 |  Q1,01 | P->P | HASH       |
|  20 |             PX BLOCK ITERATOR    |          |  4000K|    45M|   685   (1)| 00:00:01 |  Q1,01 | PCWC |            |
|  21 |              TABLE ACCESS FULL   | T1       |  4000K|    45M|   685   (1)| 00:00:01 |  Q1,01 | PCWP |            |
--------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   6 - access("T1"."ID1"="T3"."ID3"(+))
  13 - access("T1"."ID"(+)="T2"."FK_ID_SPARSE")

Note
-----
   - Degree of Parallelism is 4 because of hint

In my test system it took 103 second to complete. And If we look at the V$PQ_TQSTAT view:

DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES ROW_SHARE DATA_SHARE      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- --------- ---------- ---------- ---------- -----------
         1          0 Producer               1 P000                972383  101673907     24.31      24.29         21          0           0
                                             1 P001                940880   98214810     23.52      23.46         25          2           0
                                             1 P002               1018596  106684695     25.46      25.49         23          1           0
                                             1 P003               1068141  112036474     26.70      26.76         29          5           0
                      Consumer               1 P004                124986   13514973      3.12       3.23       1903       1899           0
                                             1 P005                124403   13451856      3.11       3.21       1903       1899           0
                                             1 P006                125324   13551533      3.13       3.24       1903       1899           0
                                             1 P007               3625287  378091524     90.63      90.32       1903       1899           0

                    1 Producer               1 P000                968234   13352477     24.21      24.52         20          1           0
                                             1 P001               1020964   13665164     25.52      25.10         24          2           0
                                             1 P002                972210   13338121     24.31      24.50         23          5           0
                                             1 P003               1038592   14089338     25.96      25.88         23          2           0
                      Consumer               1 P004               1000886   13623627     25.02      25.02       2713       2046           0
                                             1 P005                998520   13592045     24.96      24.96       2710       2053           0
                                             1 P006               1000610   13619103     25.02      25.01       2735       2069           0
                                             1 P007                999984   13610325     25.00      25.00       2080       1985           0

                    2 Producer               1 P004                970775  105866492     24.27      24.27         26          4           0
                                             1 P005                993664  108303963     24.84      24.83         24          5           0
                                             1 P006               1027726  112089217     25.69      25.69         27          5           0
                                             1 P007               1007835  109977478     25.20      25.21         26          6           0
                      Consumer               1 P000               1001034  109172059     25.03      25.03       1964       1913           0
                                             1 P001               1000896  109157355     25.02      25.02       1971       1915           0
                                             1 P002               1000071  109067176     25.00      25.00       1966       1915           0
                                             1 P003                997999  108840560     24.95      24.95       1972       1918           0

                    3 Producer               1 P004                124986   13514946      3.12       3.23          0          0           0
                                             1 P005                124403   13451852      3.11       3.21          0          0           0
                                             1 P006                125324   13551519      3.13       3.24          0          0           0
                                             1 P007               3625287  378091546     90.63      90.32          0          0           0
                      Consumer               1 P000                125012   13517675      3.13       3.23       3256       2469           0
                                             1 P001                124715   13485635      3.12       3.22       4159       2819           0
                                             1 P002                124813   13496259      3.12       3.22       4288       2887           0
                                             1 P003               3625460  378110294     90.64      90.33       3669       2649           0

                    4 Producer               1 P000                     1         44     25.00      25.00       2003        968           0
                                             1 P001                     1         44     25.00      25.00       2765       1285           0
                                             1 P002                     1         44     25.00      25.00       3010       1399           0
                                             1 P003                     1         44     25.00      25.00       1748        752           0
                      Consumer               1 QC                       4        176    100.00     100.00          6          0           0

As you see there is a significant skew between PX slaves due to initial and null values caused by outer join. But if we change the OFE to 18.1.0 and hinted as below then we will get the following plan:

select count(t_2_filler), count(t_3_filler) from
(
select /*+ monitor no_merge parallel(4)
          leading(t2 t1)
          no_swap_join_inputs(T3)
*/ t1.id, t1.filler t1f1,t2.id,t2.filler t2f1
      , regexp_replace(t2.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') || regexp_replace(t2.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i') as t_2_filler
      , regexp_replace(t3.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') || regexp_replace(t3.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i') as t_3_filler
      from 
t2 left outer join t1 on t1.id=t2.fk_id_sparse
 left outer join t3 on t1.id1=t3.id3
 )
 ;
 (Q2)
select * from table(dbms_xplan.display(null,null,null));

Plan hash value: 3676694016                                                                                                   
                                                                                                                              
------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |          |     1 |  4004 |  2235   (2)| 00:00:01 |        |      |            |
|   1 |  SORT AGGREGATE                      |          |     1 |  4004 |            |          |        |      |            |
|   2 |   PX COORDINATOR                     |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)               | :TQ10004 |     1 |  4004 |            |          |  Q1,04 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE                   |          |     1 |  4004 |            |          |  Q1,04 | PCWP |            |
|   5 |      VIEW                            |          |  4008K|    14G|  2235   (2)| 00:00:01 |  Q1,04 | PCWP |            |
|*  6 |       HASH JOIN OUTER                |          |  4008K|   848M|  2235   (2)| 00:00:01 |  Q1,04 | PCWP |            |
|   7 |        PX RECEIVE                    |          |  4000K|   438M|  1542   (2)| 00:00:01 |  Q1,04 | PCWP |            |
|   8 |         PX SEND HASH (NULL RANDOM)   | :TQ10002 |  4000K|   438M|  1542   (2)| 00:00:01 |  Q1,02 | P->P | HASH       |
|*  9 |          HASH JOIN OUTER BUFFERED    |          |  4000K|   438M|  1542   (2)| 00:00:01 |  Q1,02 | PCWP |            |
|  10 |           PX RECEIVE                 |          |  4000K|   392M|   850   (1)| 00:00:01 |  Q1,02 | PCWP |            |
|  11 |            PX SEND HASH (NULL RANDOM)| :TQ10000 |  4000K|   392M|   850   (1)| 00:00:01 |  Q1,00 | P->P | HASH       |
|  12 |             PX BLOCK ITERATOR        |          |  4000K|   392M|   850   (1)| 00:00:01 |  Q1,00 | PCWC |            |
|  13 |              TABLE ACCESS FULL       | T2       |  4000K|   392M|   850   (1)| 00:00:01 |  Q1,00 | PCWP |            |
|  14 |           PX RECEIVE                 |          |  4000K|    45M|   685   (1)| 00:00:01 |  Q1,02 | PCWP |            |
|  15 |            PX SEND HASH              | :TQ10001 |  4000K|    45M|   685   (1)| 00:00:01 |  Q1,01 | P->P | HASH       |
|  16 |             PX BLOCK ITERATOR        |          |  4000K|    45M|   685   (1)| 00:00:01 |  Q1,01 | PCWC |            |
|  17 |              TABLE ACCESS FULL       | T1       |  4000K|    45M|   685   (1)| 00:00:01 |  Q1,01 | PCWP |            |
|  18 |        PX RECEIVE                    |          |  4000K|   408M|   687   (1)| 00:00:01 |  Q1,04 | PCWP |            |
|  19 |         PX SEND HASH                 | :TQ10003 |  4000K|   408M|   687   (1)| 00:00:01 |  Q1,03 | P->P | HASH       |
|  20 |          PX BLOCK ITERATOR           |          |  4000K|   408M|   687   (1)| 00:00:01 |  Q1,03 | PCWC |            |
|  21 |           TABLE ACCESS FULL          | T3       |  4000K|   408M|   687   (1)| 00:00:01 |  Q1,03 | PCWP |            |
------------------------------------------------------------------------------------------------------------------------------
                                                                                                                      
Predicate Information (identified by operation id):                                                                           
---------------------------------------------------                                                                           
                                                                                                                              
   6 - access("T1"."ID1"="T3"."ID3"(+))                                                                                       
   9 - access("T1"."ID"(+)="T2"."FK_ID_SPARSE")                                                                               
                                                                                                                              
Note                                                                                                                          
-----                                                                                                                         
   - Degree of Parallelism is 4 because of hint

It is seen that there is a new (modified) version of true HASH distribution (not hybrid) in the plan shape. The no_swap_join_inputs hint is not necessary to be specified here, If we remove the hint, we will get HASH JOIN RIGHT OUTER instead of HASH JOIN OUTER. The key point here is that “Preserved” rowsource (initially it can be a table, but latter can be a result of join) of the join will be checked and the NULL values are distributed RANDOMLY in order to attain uniform distribution, in other words to avoid the join skew. The plan lines 11 and 8 indicate that NULL values accordingly consumed from T2 table and the first outer join will be distributed randomly (other values will be distributed via HASH distribution)
. The query (Q2) took 49 second to complete in my test system.
Let’s display the result of v$PQ_TQSTAT:

DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES ROW_SHARE DATA_SHARE      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- --------- ---------- ---------- ---------- -----------
         1          0 Producer               1 P000                973412  101972494     24.34      24.36         32         11           0
                                             1 P001               1029942  108261300     25.75      25.86         32          8           0
                                             1 P002               1013635  105773077     25.34      25.27         32         10           0
                                             1 P003                983011  102602967     24.58      24.51         31          9           0
                      Consumer               1 P004                999987  104651085     25.00      25.00        115        111           0
                                             1 P005                999403  104587864     24.99      24.98        116        112           0
                                             1 P006               1000324  104687541     25.01      25.01        115        111           0
                                             1 P007               1000286  104683348     25.01      25.01        116        112           0

                    1 Producer               1 P000                975402   13325594     24.39      24.48         22          1           0
                                             1 P001               1007163   13717028     25.18      25.19         19          1           0
                                             1 P002               1048553   14051184     26.21      25.81         23          2           0
                                             1 P003                968882   13351294     24.22      24.52         20          3           0
                      Consumer               1 P004               1000886   13623627     25.02      25.02       3118       1458           0
                                             1 P005                998520   13592045     24.96      24.96       2213       1077           0
                                             1 P006               1000610   13619103     25.02      25.01       3460       1582           0
                                             1 P007                999984   13610325     25.00      25.00       3900       1809           0

                    2 Producer               1 P004                999987  104651082     25.00      25.00          0          0           0
                                             1 P005                999403  104587812     24.99      24.98          0          0           0
                                             1 P006               1000324  104687551     25.01      25.01          0          0           0
                                             1 P007               1000286  104683370     25.01      25.01          0          0           0
                      Consumer               1 P000               1000013  104653811     25.00      25.00        175        125           0
                                             1 P001                999715  104621619     24.99      24.99        171        121           0
                                             1 P002                999813  104632267     25.00      25.00        175        124           0
                                             1 P003               1000459  104702118     25.01      25.01        170        123           0

                    3 Producer               1 P004               1042597  113581472     26.06      26.04         24          4           0
                                             1 P005                972928  106168339     24.32      24.34         22          5           0
                                             1 P006               1011547  110318955     25.29      25.29         23          4           0
                                             1 P007                972928  106168360     24.32      24.34         23          2           0
                      Consumer               1 P000               1001034  109172059     25.03      25.03        841        452           0
                                             1 P001               1000896  109157355     25.02      25.02       3606       1840           0
                                             1 P002               1000071  109067176     25.00      25.00       3849       1772           0
                                             1 P003                997999  108840536     24.95      24.95       3844       1792           0

                    4 Producer               1 P000                     1         44     25.00      25.00       2827       1263           0
                                             1 P001                     1         44     25.00      25.00       6032       2857           0
                                             1 P002                     1         44     25.00      25.00       5604       2524           0
                                             1 P003                     1         44     25.00      25.00       5728       2596           0
                      Consumer               1 QC                       4        176    100.00     100.00        103         99           0

As you see there is not skew anymore and the query completed very fast than previous one. The automatic null value handling for outer join is controlled by _px_join_skew_null_handling parameter and its default value is true.

Leave a Comment

April 8, 2019

The Performance of Parallel Hash Join and Automatic Skew Handling

Filed under: CBO,Join,Parallel Execution,Skew,Tuning — Chinar Aliyev @ 6:32 am

As it is known that HYBRID HASH distribution method (DM) is designed to prevent broadcasting “huge” number of rows from the left row source of Parallel Join (PJ). Although Oracle favors to redistribute only few numbers of rows via BROADCAST DM but the HYBRID HASH SKEW DM distributes “popular” values via BROADCAST in the case of skew detection in data of probe table. So, if the popular values consume significant portion of build table and high degree of parallelism is requested then depending on your data(volume) and SQL statement the Parallel Hash Join (PHJ) performance can be worse. Before introducing the test case I want to share some new changes have been done in the recent Oracle Database versions that related to Parallel Join Skew Handling (PJSH).
First of all, If there is a histogram for join column (probe table), based on statistics and hidden parameters (_px_join_skew_minfreq, _px_join_skew_ratio see Randolf`s blog post, the article is based on 12cR1) the database identifies that there is not “enough” skew to enable skew handling, then you cannot enable PJSH even via PQ_SKEW.
But there can be cases which statistics might not describe the data properly and you want to enable PJSH. For this purpose starting from Oracle Database 12cR2 there is a new parameter _px_join_skewed_values_count . The parameter allows us to set top-n frequent values (not popular values, in contrast to decision based on histogram) for optimizer to consider them as skewed values. The default value is 0. If we set the value different than 0, then Oracle will generate top-n frequent values using dynamic sampling queries (DSQ) and will distribute the values via BROADCAST DM for build table and will read the values ROUND-ROBIN fashion from PROBE table. In addition, the mentioned enhancement has been improved in Oracle 18c, so that we are allowed even to set sampling percent (and also time restriction in order to reduce parse time when required) for DSQ (_px_join_skew_sampling_percent, _px_join_skew_sampling_time_limit).

As Randolf has explained in his blog post after skew detection the database does execute dynamic sampling query in order to find exact values. But Oracle Database 18c afterward it is not always needed to generate DSQ to find the exact skewed values. Oracle finds them from histogram when it is possible (mostly possible, in the case of the full values are stored in the histograms) it is controlled by the _px_join_skew_use_histogram parameter and default value is TRUE.
Now let`s see the below test case:

-- Test system:
-- OS: OEL7 64bit,RAM=187GB, CPU=48
-- DB: Oracle 18.0.0, SGA=20GB, PGA=10G

create table t1 
as
select
        case when level=1 then 1  
             when level>1 and  level<=10 then 2
             when level>10 and level<=300000 then 3
             when level>300000 and level<=650000 then 4
             when level>650000 and level<=1e6 then 5
             else level end c1,
             trunc(sysdate)+mod(level, 2) dat
      , case when mod(level, 2)=0 then 'X' else 'Y' end typ,
      lpad('x',1e4,'*') f1
      ,lpad('x',1e4,'x') f2
from
        dual
connect by
        level <= 1.3e6
;

create table t2 
as
select
        case when level<=100000 then 2
             when level>100000 and level<=200000 then 3
             when level>200000 and level<=300000 then 4
             when level>300000 and level<=400000 then 5
             when level>400000 and level<=500000 then 1
             else level end c1
            ,case when level<=100000 then 'B'
             when level>100000 and level<=200000 then 'C'
             when level>200000 and level<=300000 then 'D'
             when level>300000 and level<=400000 then 'E'
             when level>400000 and level<=500000 then 'A'
             else '0' end chrtp
             ,trunc(sysdate)+mod(level, 2) dat
            ,case when mod(level, 2)=0 then 'X' else 'Y' end typ
            ,lpad('x',1e4,'*') f1
            ,lpad('x',1e4,'x') f2
from
        dual
connect by
        level <= 2e6
;


-- In fact statistics for t1 table may have not been gathered but histogram is necessary for t2 table
execute dbms_stats.gather_table_stats(user,'t1',method_opt=>'for all columns size 1',degree=>6);
execute dbms_stats.gather_table_stats(user,'t2',method_opt=>'for all columns size 1 for columns c1 size skewonly',degree=>6);

select table_name tab ,column_name col, histogram hist from user_tab_col_statistics
where  table_name in ('T2','T1')
and column_name='C1'
order by 1;
TAB        COL        HIST
---------- ---------- ----------
T1         C1         NONE
T2         C1         HYBRID

The size of segments:

select segment_name,bytes/1024/1024/1024 gb from user_segments where segment_name in ('T1','T2');

SEGMENT_NA         GB
---------- ----------
T1         19.8789063
T2            30.5625

Now our query and explain plan :

EXPLAIN PLAN    FOR
        SELECT   COUNT (ff1), COUNT (f2)
          FROM   (SELECT /*+ parallel(10) */
                        t1 .f1 ff1, t2.*
                    FROM   t1, t2
                   WHERE   t1.c1 = t2.c1
                           AND chrtp NOT IN ('A', 'E', 'D', 'C'));

SELECT   * FROM table (DBMS_XPLAN.display);
Plan hash value: 1705792393
 
----------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name     | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |          |     1 |  8013 |       |   252K  (1)| 00:00:10 |        |      |            |
|   1 |  SORT AGGREGATE                  |          |     1 |  8013 |       |            |          |        |      |            |
|   2 |   PX COORDINATOR                 |          |       |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)           | :TQ10002 |     1 |  8013 |       |            |          |  Q1,02 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE               |          |     1 |  8013 |       |            |          |  Q1,02 | PCWP |            |
|*  5 |      HASH JOIN                   |          |  1545K|    11G|   498M|   252K  (1)| 00:00:10 |  Q1,02 | PCWP |            |
|   6 |       PX RECEIVE                 |          |  1300K|  4965M|       | 78333   (1)| 00:00:04 |  Q1,02 | PCWP |            |
|   7 |        PX SEND HYBRID HASH       | :TQ10000 |  1300K|  4965M|       | 78333   (1)| 00:00:04 |  Q1,00 | P->P | HYBRID HASH|
|   8 |         STATISTICS COLLECTOR     |          |       |       |       |            |          |  Q1,00 | PCWC |            |
|   9 |          PX BLOCK ITERATOR       |          |  1300K|  4965M|       | 78333   (1)| 00:00:04 |  Q1,00 | PCWC |            |
|  10 |           TABLE ACCESS FULL      | T1       |  1300K|  4965M|       | 78333   (1)| 00:00:04 |  Q1,00 | PCWP |            |
|  11 |       PX RECEIVE                 |          |  1508K|  5767M|       |   120K  (1)| 00:00:05 |  Q1,02 | PCWP |            |
|  12 |        PX SEND HYBRID HASH (SKEW)| :TQ10001 |  1508K|  5767M|       |   120K  (1)| 00:00:05 |  Q1,01 | P->P | HYBRID HASH|
|  13 |         PX BLOCK ITERATOR        |          |  1508K|  5767M|       |   120K  (1)| 00:00:05 |  Q1,01 | PCWC |            |
|* 14 |          TABLE ACCESS FULL       | T2       |  1508K|  5767M|       |   120K  (1)| 00:00:05 |  Q1,01 | PCWP |            |
----------------------------------------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   5 - access("T1"."C1"="T2"."C1")
  14 - filter("CHRTP"<>'A' AND "CHRTP"<>'E' AND "CHRTP"<>'D' AND "CHRTP"<>'C')
 
Note
-----
   - dynamic statistics used: dynamic sampling (level=6)
   - Degree of Parallelism is 10 because of hint

As it is seen by default the skew has been detected and PJSH enabled. If we look at the optimizer trace file we will see following lines:

kkoBloomFilter: join ndv:0 reduction:0.202548 (limit:0.050000)  rejected because distinct value ratio
skewRatio:10, skewMinFreq:30, minNDV:254, skewThreshold:0.039370
ind:0, csel:0.050916, skew count:1
ind:1, csel:0.048718, skew count:2
ind:2, csel:0.048718, skew count:3
ind:3, csel:0.047802, skew count:4
ind:4, csel:0.046337, skew count:5
ind:5, csel:0.000001, skew count:5
ind:6, csel:0.000001, skew count:5
ind:7, csel:0.000001, skew count:5
ind:8, csel:0.000001, skew count:5
ind:9, csel:0.000001, skew count:5
.................................
ind:253, csel:0.000001, skew count:5
Skewed value count:5 scaling:0 degree:10
kkopqSaveJoinSkewDecision:: Saving join information : qbcname:SEL$F5BB74E1, table name:T2, srcQbcName:SEL$2
kkopqSkewInfo: Set of 5 skewed values retrieved from: histogram
AP: Computing costs for inflection point at min value 0.00

So, in our case the skewed values are extracted from the histogram(hybrid), did not have to be executed DSQ to find them. As it is seen from the trace file, Oracle has looped over the histogram endpoint values and has tried to identify skew values. If selectivity of an endpoint values greater than skewThreshold then the value considered as skewed (or contains _px_join_skew_minfreq percent of total rows). It supports an idea that optimizer can treat non-popular value as skewed in case of small skewThreshold and we can control it via hidden parameters as above mentioned. That is why optimizer calculates selectivity for each endpoint value (ind:0- ind:253).
Let`s see the execution statistics:

SQL Monitoring Report

SQL Text
------------------------------
select count(ff1),count(f2) from ( select /*+ parallel(10) */ t1.f1 ff1,t2.* from t1,t2 where t1.c1=t2.c1 and chrtp not in ('A','E','D','C') )

Global Information
------------------------------
 Status              :  DONE (ALL ROWS)                             
 Instance ID         :  1                                           
 Session             :  .                            
 SQL ID              :  8q3ms7t6cptcz                               
 SQL Execution ID    :  16777217                                    
 Execution Started   :  03/31/2019 17:19:31                         
 First Refresh Time  :  03/31/2019 17:19:31                         
 Last Refresh Time   :  03/31/2019 17:20:53                         
 Duration            :  82s                                         
 Module/Action       :  . 
 Service             :  SYS$USERS                                   
 Program             :  .   
 Fetch Calls         :  1                                           

Global Stats
==============================================================================================
| Elapsed |   Cpu   |    IO    | Concurrency | Fetch | Buffer | Read | Read  | Write | Write |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   | Calls |  Gets  | Reqs | Bytes | Reqs  | Bytes |
==============================================================================================
|     852 |     320 |      490 |          42 |     1 |     8M | 220K |  90GB |  168K |  40GB |
==============================================================================================

Parallel Execution Details (DOP=10 , Servers Allocated=20)
============================================================================================================================================================
|      Name      | Type  | Server# | Elapsed |   Cpu   |    IO    | Concurrency | Buffer | Read  | Read  | Write | Write |           Wait Events           |
|                |       |         | Time(s) | Time(s) | Waits(s) |  Waits(s)   |  Gets  | Reqs  | Bytes | Reqs  | Bytes |           (sample #)            |
============================================================================================================================================================
| PX Coordinator | QC    |         |    0.09 |    0.09 |     0.00 |             |     14 |     4 | 32768 |       |     . |                                 |
| p000           | Set 1 |       1 |      66 |      23 |       39 |        4.16 |   2664 | 16828 |   4GB | 16816 |   4GB | buffer busy waits (4)           |
|                |       |         |         |         |          |             |        |       |       |       |       | latch: cache buffers chains (2) |
|                |       |         |         |         |          |             |        |       |       |       |       | PX Deq: Table Q Normal (1)      |
|                |       |         |         |         |          |             |        |       |       |       |       | direct path read temp (8)       |
|                |       |         |         |         |          |             |        |       |       |       |       | direct path write temp (28)     |
|                |       |         |         |         |          |             |        |       |       |       |       | local write wait (1)            |
| p001           | Set 1 |       2 |      67 |      23 |       39 |        5.08 |   2652 | 16818 |   4GB | 16812 |   4GB | buffer busy waits (3)           |
|                |       |         |         |         |          |             |        |       |       |       |       | latch: cache buffers chains (2) |
|                |       |         |         |         |          |             |        |       |       |       |       | direct path read temp (11)      |
|                |       |         |         |         |          |             |        |       |       |       |       | direct path write temp (24)     |
|                |       |         |         |         |          |             |        |       |       |       |       | local write wait (3)            |
| p002           | Set 1 |       3 |      66 |      23 |       39 |        4.35 |   2654 | 16835 |   4GB | 16818 |   4GB | buffer busy waits (3)           |
|                |       |         |         |         |          |             |        |       |       |       |       | latch: cache buffers chains (1) |
|                |       |         |         |         |          |             |        |       |       |       |       | direct path read temp (11)      |
|                |       |         |         |         |          |             |        |       |       |       |       | direct path write temp (29)     |
| p003           | Set 1 |       4 |      65 |      22 |       39 |        3.70 |   2672 | 16842 |   4GB | 16829 |   4GB | buffer busy waits (3)           |
|                |       |         |         |         |          |             |        |       |       |       |       | latch: cache buffers chains (3) |
|                |       |         |         |         |          |             |        |       |       |       |       | direct path read temp (12)      |
|                |       |         |         |         |          |             |        |       |       |       |       | direct path write temp (20)     |
|                |       |         |         |         |          |             |        |       |       |       |       | local write wait (1)            |
| p004           | Set 1 |       5 |      65 |      23 |       39 |        4.05 |   2637 | 16814 |   4GB | 16814 |   4GB | buffer busy waits (4)           |
|                |       |         |         |         |          |             |        |       |       |       |       | direct path read temp (12)      |
|                |       |         |         |         |          |             |        |       |       |       |       | direct path write temp (22)     |
|                |       |         |         |         |          |             |        |       |       |       |       | local write wait (4)            |
| p005           | Set 1 |       6 |      67 |      23 |       39 |        4.73 |   2646 | 16816 |   4GB | 16810 |   4GB | buffer busy waits (3)           |
|                |       |         |         |         |          |             |        |       |       |       |       | direct path read temp (9)       |
|                |       |         |         |         |          |             |        |       |       |       |       | direct path write temp (25)     |
| p006           | Set 1 |       7 |      66 |      23 |       40 |        4.07 |   2636 | 16811 |   4GB | 16811 |   4GB | buffer busy waits (3)           |
|                |       |         |         |         |          |             |        |       |       |       |       | latch: cache buffers chains (1) |
|                |       |         |         |         |          |             |        |       |       |       |       | direct path read temp (11)      |
|                |       |         |         |         |          |             |        |       |       |       |       | direct path write temp (29)     |
|                |       |         |         |         |          |             |        |       |       |       |       | local write wait (2)            |
| p007           | Set 1 |       8 |      66 |      23 |       39 |        4.19 |   2658 | 16827 |   4GB | 16821 |   4GB | buffer busy waits (1)           |
|                |       |         |         |         |          |             |        |       |       |       |       | latch: cache buffers chains (2) |
|                |       |         |         |         |          |             |        |       |       |       |       | direct path read temp (11)      |
|                |       |         |         |         |          |             |        |       |       |       |       | direct path write temp (23)     |
|                |       |         |         |         |          |             |        |       |       |       |       | local write wait (3)            |
| p008           | Set 1 |       9 |      65 |      22 |       39 |        3.90 |   2712 | 16860 |   4GB | 16831 |   4GB | buffer busy waits (4)           |
|                |       |         |         |         |          |             |        |       |       |       |       | direct path read temp (11)      |
|                |       |         |         |         |          |             |        |       |       |       |       | direct path write temp (27)     |
| p009           | Set 1 |      10 |      66 |      23 |       39 |        4.27 |   2673 | 16838 |   4GB | 16826 |   4GB | buffer busy waits (3)           |
|                |       |         |         |         |          |             |        |       |       |       |       | direct path read temp (10)      |
|                |       |         |         |         |          |             |        |       |       |       |       | direct path write temp (21)     |
|                |       |         |         |         |          |             |        |       |       |       |       | local write wait (1)            |
| p00a           | Set 2 |       1 |      19 |    9.31 |       10 |             |     1M |  6574 |   6GB |       |     . | direct path read (11)           |
| p00b           | Set 2 |       2 |      19 |    8.90 |       11 |             |   729K |  4590 |   4GB |       |     . | direct path read (12)           |
| p00c           | Set 2 |       3 |      20 |    9.44 |       10 |             |   795K |  5137 |   5GB |       |     . | direct path read (10)           |
| p00d           | Set 2 |       4 |      19 |    8.80 |       10 |             |   688K |  4275 |   4GB |       |     . | direct path read (12)           |
| p00e           | Set 2 |       5 |      20 |    9.11 |       11 |             |   777K |  5003 |   5GB |       |     . | direct path read (13)           |
| p00f           | Set 2 |       6 |      20 |      10 |       10 |             |   868K |  5360 |   5GB |       |     . | direct path read (13)           |
| p00g           | Set 2 |       7 |      19 |      10 |     9.26 |             |   961K |  6047 |   6GB |       |     . | direct path read (12)           |
| p00h           | Set 2 |       8 |      20 |      10 |       10 |             |   880K |  5804 |   6GB |       |     . | direct path read (7)            |
| p00i           | Set 2 |       9 |      19 |    8.69 |       10 |             |   697K |  4398 |   4GB |       |     . | direct path read (14)           |
| p00j           | Set 2 |      10 |      20 |      10 |       10 |             |   789K |  5004 |   5GB |       |     . | direct path read (12)           |
============================================================================================================================================================

SQL Plan Monitoring Details (Plan Hash Value=1705792393)
==============================================================================================================================================================================================================
| Id |             Operation              |   Name   |  Rows   | Cost  |   Time    | Start  | Execs |   Rows   | Read  | Read  | Write | Write |  Mem  | Temp  | Activity |         Activity Detail          |
|    |                                    |          | (Estim) |       | Active(s) | Active |       | (Actual) | Reqs  | Bytes | Reqs  | Bytes | (Max) | (Max) |   (%)    |           (# samples)            |
==============================================================================================================================================================================================================
|  0 | SELECT STATEMENT                   |          |         |       |        52 |    +31 |     1 |        1 |       |       |       |       |     . |     . |    10.37 | buffer busy waits (31)           |
|    |                                    |          |         |       |           |        |       |          |       |       |       |       |       |       |          | latch: cache buffers chains (11) |
|    |                                    |          |         |       |           |        |       |          |       |       |       |       |       |       |          | Cpu (8)                          |
|    |                                    |          |         |       |           |        |       |          |       |       |       |       |       |       |          | local write wait (15)            |
|  1 |   SORT AGGREGATE                   |          |       1 |       |         1 |    +82 |     1 |        1 |       |       |       |       |     . |     . |          |                                  |
|  2 |    PX COORDINATOR                  |          |         |       |         1 |    +82 |    21 |       10 |     4 | 32768 |       |       |     . |     . |          |                                  |
|  3 |     PX SEND QC (RANDOM)            | :TQ10002 |       1 |       |         2 |    +81 |    10 |       10 |       |       |       |       |     . |     . |          |                                  |
|  4 |      SORT AGGREGATE                |          |       1 |       |        29 |    +54 |    10 |       10 |       |       |       |       |     . |     . |          |                                  |
|  5 |       HASH JOIN                    |          |      2M |  252K |        83 |     +0 |    10 |       1M |  168K |  40GB |  168K |  40GB | 759MB |  41GB |    61.40 | Cpu (31)                         |
|    |                                    |          |         |       |           |        |       |          |       |       |       |       |       |       |          | direct path read temp (106)      |
|    |                                    |          |         |       |           |        |       |          |       |       |       |       |       |       |          | direct path write temp (248)     |
|  6 |        PX RECEIVE                  |          |      1M | 78333 |        53 |     +2 |    10 |      10M |       |       |       |       |     . |     . |     1.28 | Cpu (8)                          |
|  7 |         PX SEND HYBRID HASH        | :TQ10000 |      1M | 78333 |        55 |     +0 |    10 |      10M |       |       |       |       |     . |     . |     2.71 | Cpu (17)                         |
|  8 |          STATISTICS COLLECTOR      |          |         |       |        53 |     +2 |    10 |       1M |       |       |       |       |     . |     . |          |                                  |
|  9 |           PX BLOCK ITERATOR        |          |      1M | 78333 |        53 |     +2 |    10 |       1M |       |       |       |       |     . |     . |          |                                  |
| 10 |            TABLE ACCESS FULL       | T1       |      1M | 78333 |        55 |     +0 |   134 |       1M | 20472 |  20GB |       |       |     . |     . |     4.47 | Cpu (15)                         |
|    |                                    |          |         |       |           |        |       |          |       |       |       |       |       |       |          | direct path read (13)            |
| 11 |        PX RECEIVE                  |          |      2M |  121K |        17 |    +54 |    10 |       2M |       |       |       |       |     . |     . |     0.32 | Cpu (1)                          |
|    |                                    |          |         |       |           |        |       |          |       |       |       |       |       |       |          | PX Deq: Table Q Normal (1)       |
| 12 |         PX SEND HYBRID HASH (SKEW) | :TQ10001 |      2M |  121K |        16 |    +54 |    10 |       2M |       |       |       |       |     . |     . |     0.16 | Cpu (1)                          |
| 13 |          PX BLOCK ITERATOR         |          |      2M |  121K |        16 |    +54 |    10 |       2M |       |       |       |       |     . |     . |          |                                  |
| 14 |           TABLE ACCESS FULL        | T2       |      2M |  121K |        17 |    +53 |   132 |       2M | 31720 |  31GB |       |       |     . |     . |    19.30 | Cpu (18)                         |
|    |                                    |          |         |       |           |        |       |          |       |       |       |       |       |       |          | direct path read (103)           |
==============================================================================================================================================================================================================

The duration of the SQL was 82 second. But significant part (more than 50%) of elapsed time was consumed by I/O waits. In the table T1 the number of skewed values is 1M and therefore if you look at the plan line 7, the PX SEND HYBRID HASH operation BROADCASTED these values and we have got 10M rows in that plan line. That huge number of rows (and the volume) caused a problem for the PHJ in the line 5. The database was forced to read (from disk/temp) and write (to temp) huge amount of data (read 90GB/wrote 40GB) in order to complete the HASH JOIN.
Now if we disable Automatic Skew Handling, we will get the followings:

SQL Monitoring Report

SQL Text
------------------------------
select count(ff1),count(f2) from ( select /*+ parallel(10) no_pq_skew(t2) */ t1.f1 ff1,t2.* from t1,t2 where t1.c1=t2.c1 and chrtp not in ('A','E','D','C') )

Global Information
------------------------------
 Status              :  DONE (ALL ROWS)                             
 Instance ID         :  1                                           
 SQL ID              :  7f7u503r8m4pb                               
 SQL Execution ID    :  16777216                                    
 Execution Started   :  03/31/2019 17:22:13                         
 First Refresh Time  :  03/31/2019 17:22:13                         
 Last Refresh Time   :  03/31/2019 17:22:31                         
 Duration            :  18s                                         
.
Global Stats
===============================================================================================
| Elapsed |   Cpu   |    IO    | Concurrency | Fetch | Buffer | Read  | Read  | Write | Write |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   | Calls |  Gets  | Reqs  | Bytes | Reqs  | Bytes |
===============================================================================================
|     188 |      70 |      118 |        0.00 |     1 |     8M | 74845 |  56GB | 22650 |   5GB |
===============================================================================================

Parallel Execution Details (DOP=10 , Servers Allocated=20)
=======================================================================================================================================================
|      Name      | Type  | Server# | Elapsed |   Cpu   |    IO    | Concurrency | Buffer | Read | Read  | Write | Write |         Wait Events         |
|                |       |         | Time(s) | Time(s) | Waits(s) |  Waits(s)   |  Gets  | Reqs | Bytes | Reqs  | Bytes |         (sample #)          |
=======================================================================================================================================================
| PX Coordinator | QC    |         |    0.05 |    0.04 |     0.00 |             |     14 |    2 | 16384 |       |     . |                             |
| p000           | Set 1 |       1 |    1.68 |    1.18 |     0.50 |             |      1 |  415 | 101MB |   415 | 101MB | direct path write temp (1)  |
| p001           | Set 1 |       2 |    1.66 |    1.16 |     0.49 |             |      1 |  415 | 101MB |   415 | 101MB |                             |
| p002           | Set 1 |       3 |    1.63 |    1.13 |     0.50 |             |      1 |  412 | 100MB |   411 | 100MB |                             |
| p003           | Set 1 |       4 |    1.63 |    1.16 |     0.47 |        0.00 |      1 |  417 | 101MB |   417 | 101MB | direct path write temp (1)  |
| p004           | Set 1 |       5 |    1.64 |    1.18 |     0.46 |        0.00 |      1 |  417 | 101MB |   417 | 101MB |                             |
| p005           | Set 1 |       6 |      10 |    4.52 |     5.41 |             |      1 | 6326 |   1GB |  6326 |   1GB | direct path write temp (3)  |
| p006           | Set 1 |       7 |    1.62 |    1.18 |     0.44 |        0.00 |      1 |  415 | 101MB |   415 | 101MB |                             |
| p007           | Set 1 |       8 |      10 |    4.42 |     5.41 |        0.00 |      1 | 5548 |   1GB |  5548 |   1GB | direct path read temp (1)   |
|                |       |         |         |         |          |             |        |      |       |       |       | direct path write temp (4)  |
| p008           | Set 1 |       9 |    1.69 |    1.18 |     0.51 |             |      1 |  419 | 101MB |   419 | 101MB | direct path write temp (1)  |
| p009           | Set 1 |      10 |      11 |    5.38 |     5.57 |        0.00 |      1 | 7867 |   2GB |  7867 |   2GB | direct path read temp (1)   |
|                |       |         |         |         |          |             |        |      |       |       |       | direct path write temp (6)  |
| p00a           | Set 2 |       1 |      14 |    4.49 |       10 |             |   725K | 4832 |   5GB |       |     . | direct path read (6)        |
| p00b           | Set 2 |       2 |      15 |    5.03 |       10 |             |   873K | 5692 |   6GB |       |     . | db file sequential read (1) |
|                |       |         |         |         |          |             |        |      |       |       |       | direct path read (8)        |
| p00c           | Set 2 |       3 |      14 |    4.81 |       10 |             |   805K | 5214 |   5GB |       |     . | direct path read (9)        |
| p00d           | Set 2 |       4 |      14 |    4.27 |     9.40 |             |   719K | 4487 |   4GB |       |     . | direct path read (8)        |
| p00e           | Set 2 |       5 |      16 |    4.80 |       11 |             |   800K | 5063 |   5GB |       |     . | direct path read (11)       |
| p00f           | Set 2 |       6 |      14 |    4.94 |     8.73 |             |   915K | 5871 |   6GB |       |     . | direct path read (8)        |
| p00g           | Set 2 |       7 |      14 |    4.76 |     9.26 |             |   893K | 5345 |   5GB |       |     . | direct path read (9)        |
| p00h           | Set 2 |       8 |      16 |    4.85 |       11 |             |   795K | 5148 |   5GB |       |     . | direct path read (8)        |
| p00i           | Set 2 |       9 |      14 |    4.63 |       10 |             |   815K | 5084 |   5GB |       |     . | direct path read (8)        |
| p00j           | Set 2 |      10 |      16 |    4.97 |       11 |             |   864K | 5456 |   5GB |       |     . | direct path read (10)       |
=======================================================================================================================================================

SQL Plan Monitoring Details (Plan Hash Value=2644066362)
====================================================================================================================================================================================================
| Id |           Operation           |   Name   |  Rows   | Cost  |   Time    | Start  | Execs |   Rows   | Read  | Read  | Write | Write |  Mem  | Temp  | Activity |       Activity Detail       |
|    |                               |          | (Estim) |       | Active(s) | Active |       | (Actual) | Reqs  | Bytes | Reqs  | Bytes | (Max) | (Max) |   (%)    |         (# samples)         |
====================================================================================================================================================================================================
|  0 | SELECT STATEMENT              |          |         |       |         1 |    +18 |     1 |        1 |       |       |       |       |     . |     . |          |                             |
|  1 |   SORT AGGREGATE              |          |       1 |       |         1 |    +18 |     1 |        1 |       |       |       |       |     . |     . |          |                             |
|  2 |    PX COORDINATOR             |          |         |       |         1 |    +18 |    21 |       10 |     2 | 16384 |       |       |     . |     . |          |                             |
|  3 |     PX SEND QC (RANDOM)       | :TQ10002 |       1 |       |         3 |    +16 |    10 |       10 |       |       |       |       |     . |     . |          |                             |
|  4 |      SORT AGGREGATE           |          |       1 |       |         7 |    +12 |    10 |       10 |       |       |       |       |     . |     . |          |                             |
|  5 |       HASH JOIN               |          |      2M |  252K |        18 |     +1 |    10 |       1M | 22650 |   5GB | 22650 |   5GB |   1GB |   6GB |    15.56 | Cpu (3)                     |
|    |                               |          |         |       |           |        |       |          |       |       |       |       |       |       |          | direct path read temp (2)   |
|    |                               |          |         |       |           |        |       |          |       |       |       |       |       |       |          | direct path write temp (16) |
|  6 |        PX RECEIVE             |          |      1M | 78333 |        11 |     +2 |    10 |       1M |       |       |       |       |     . |     . |     2.22 | Cpu (3)                     |
|  7 |         PX SEND HYBRID HASH   | :TQ10000 |      1M | 78333 |        11 |     +2 |    10 |       1M |       |       |       |       |     . |     . |     4.44 | Cpu (6)                     |
|  8 |          STATISTICS COLLECTOR |          |         |       |        11 |     +2 |    10 |       1M |       |       |       |       |     . |     . |     0.74 | Cpu (1)                     |
|  9 |           PX BLOCK ITERATOR   |          |      1M | 78333 |        11 |     +2 |    10 |       1M |       |       |       |       |     . |     . |          |                             |
| 10 |            TABLE ACCESS FULL  | T1       |      1M | 78333 |        12 |     +1 |   134 |       1M | 20472 |  20GB |       |       |     . |     . |    39.26 | Cpu (5)                     |
|    |                               |          |         |       |           |        |       |          |       |       |       |       |       |       |          | direct path read (48)       |
| 11 |        PX RECEIVE             |          |      2M |  121K |         7 |    +12 |    10 |       2M |       |       |       |       |     . |     . |          |                             |
| 12 |         PX SEND HYBRID HASH   | :TQ10001 |      2M |  121K |         5 |    +12 |    10 |       2M |       |       |       |       |     . |     . |     2.22 | Cpu (3)                     |
| 13 |          PX BLOCK ITERATOR    |          |      2M |  121K |         5 |    +12 |    10 |       2M |       |       |       |       |     . |     . |          |                             |
| 14 |           TABLE ACCESS FULL   | T2       |      2M |  121K |         8 |    +10 |   132 |       2M | 31720 |  31GB |       |       |     . |     . |    35.56 | Cpu (10)                    |
|    |                               |          |         |       |           |        |       |          |       |       |       |       |       |       |          | db file sequential read (1) |
|    |                               |          |         |       |           |        |       |          |       |       |       |       |       |       |          | direct path read (37)       |
====================================================================================================================================================================================================

The Join has been completed within 18 second without skew handling. So, four times faster. Oracle needed only 5GB data to write to disk due to true HASH distribution. There can be a question that if we can reduce DOP but keep Auto Skew Handling then read/write (from/to disk) will be reduced and it can improve the PHJ. That is right, but problem is that if we reduce the DOP then we will have less PX producer slaves to scan PHJ inputs (Build, Probe tables) and it also will increase to acquire the data from these tables and will affect the duration of SQL. So, in such situation disabling Auto Skew Handling may give better performance.To add more, if high DOP is requested in this case redistribution cost can be very high.
Lastly, in the query I have added a filter condition as chrtp not in (‘A’,’E’,’D’,’C’). There is a strong correlation between c1 and chrtp columns. Although, the database identified the values c1=1,2,3,4,5 as skewed but they did not need to be redistributed from T1 table. The database currently does not consider filter predicates during the skew detection process. Thus, you know your data well and it may help you to reduce the data that going to be redistributed/joined by adding additional predicate(s) to the probe/build tables.

Comments (1)

February 27, 2019

The Hash Join Buffered Operation

Filed under: CBO,Join,Parallel Execution — Chinar Aliyev @ 10:16 am

Several years ago Randolf Geist wrote an article about the HASH JOIN BUFFERED operation and precisely demonstrated that only the second row source is buffered (and only proper part of it). But why is it required in the execution plan?. At most a single PX SEND operation can be active concurrently was mentioned as a reason of it. I do not think it is the complete answer. That is right, only single PX SEND operation can be active concurrently, it means(depending on plan shape) only pair of DFO can be active at the time for per DFO Tree. So, it is a general restriction of producer/consumer model but not of Oracle database`s. Regardless of that fact in some cases execution plan contains BUFFERED variant of HASH JOIN but in other cases does not. Let`s see the following SQLs and explain plans. For simplicity we are going to use simple HASH and BROADCAST distribution only.

DB version : 11.2.0.1
create table t1 
as
select
        level as id
      , case when mod(level,2)=0 then TRUNC(DBMS_RANDOM.value(1,10)) 
        else TRUNC(DBMS_RANDOM.value(11,13)) end tp
      , rpad('x', level) as filter
from
        dual
connect by
        level <= 1000
;

create table t2 
as
select
        level as id
       ,trunc(sysdate) dat
       ,round(level/10,2) amount
       ,rpad('x', level) as filter
from
        dual
connect by
        level <= 1500
;

insert into t2 
select id,trunc(sysdate-1) dat,round(id/5,2) amount,filter   from t2;

commit;

execute dbms_stats.gather_table_stats(null,'t1');
execute dbms_stats.gather_table_stats(null,'t2');

explain plan for
select /*+ parallel(2) pq_distribute(t2 hash hash) */ * from t1 , t2
where t1.id=t2.id;

select * from table(dbms_xplan.display());
Plan hash value: 3683239666                                                                                      
                                                                                                                 
-----------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |  2000 |  2494K|    67   (2)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR         |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)   | :TQ10002 |  2000 |  2494K|    67   (2)| 00:00:01 |  Q1,02 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN BUFFERED   |          |  2000 |  2494K|    67   (2)| 00:00:01 |  Q1,02 | PCWP |            |
|   4 |     PX RECEIVE          |          |  1000 |   497K|    12   (0)| 00:00:01 |  Q1,02 | PCWP |            |
|   5 |      PX SEND HASH       | :TQ10000 |  1000 |   497K|    12   (0)| 00:00:01 |  Q1,00 | P->P | HASH       |
|   6 |       PX BLOCK ITERATOR |          |  1000 |   497K|    12   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS FULL| T1       |  1000 |   497K|    12   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|   8 |     PX RECEIVE          |          |  3000 |  2250K|    54   (0)| 00:00:01 |  Q1,02 | PCWP |            |
|   9 |      PX SEND HASH       | :TQ10001 |  3000 |  2250K|    54   (0)| 00:00:01 |  Q1,01 | P->P | HASH       |
|  10 |       PX BLOCK ITERATOR |          |  3000 |  2250K|    54   (0)| 00:00:01 |  Q1,01 | PCWC |            |
|  11 |        TABLE ACCESS FULL| T2       |  3000 |  2250K|    54   (0)| 00:00:01 |  Q1,01 | PCWP |            |
-----------------------------------------------------------------------------------------------------------------
                                                                                                                                                                                                                                  
Predicate Information (identified by operation id):                                                              
---------------------------------------------------                                                              
                                                                                                                 
   3 - access("T1"."ID"="T2"."ID")                                                                               
                                                                                                                 
Note                                                                                                             
-----                                                                                                            
   - Degree of Parallelism is 2 because of hint 
(EP1)

As you see in the line of 3 the hash join is considered as BUFFERED. But if we change the query as below then explain plan will be:

explain plan for
select count(*) from (
select /*+ no_merge parallel(2) pq_distribute(t2 hash hash) */ * from t1 , t2
where t1.id=t2.id);

Plan hash value: 1086837495                                                                                         
                                                                                                                    
--------------------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |          |     1 |       |    67   (2)| 00:00:01 |        |      |            |
|   1 |  SORT AGGREGATE            |          |     1 |       |            |          |        |      |            |
|   2 |   PX COORDINATOR           |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)     | :TQ10002 |     1 |       |            |          |  Q1,02 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE         |          |     1 |       |            |          |  Q1,02 | PCWP |            |
|   5 |      VIEW                  |          |  2000 |       |    67   (2)| 00:00:01 |  Q1,02 | PCWP |            |
|*  6 |       HASH JOIN            |          |  2000 | 16000 |    67   (2)| 00:00:01 |  Q1,02 | PCWP |            |
|   7 |        PX RECEIVE          |          |  1000 |  4000 |    12   (0)| 00:00:01 |  Q1,02 | PCWP |            |
|   8 |         PX SEND HASH       | :TQ10000 |  1000 |  4000 |    12   (0)| 00:00:01 |  Q1,00 | P->P | HASH       |
|   9 |          PX BLOCK ITERATOR |          |  1000 |  4000 |    12   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|  10 |           TABLE ACCESS FULL| T1       |  1000 |  4000 |    12   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|  11 |        PX RECEIVE          |          |  3000 | 12000 |    54   (0)| 00:00:01 |  Q1,02 | PCWP |            |
|  12 |         PX SEND HASH       | :TQ10001 |  3000 | 12000 |    54   (0)| 00:00:01 |  Q1,01 | P->P | HASH       |
|  13 |          PX BLOCK ITERATOR |          |  3000 | 12000 |    54   (0)| 00:00:01 |  Q1,01 | PCWC |            |
|  14 |           TABLE ACCESS FULL| T2       |  3000 | 12000 |    54   (0)| 00:00:01 |  Q1,01 | PCWP |            |
--------------------------------------------------------------------------------------------------------------------
                                                                                                                    
Predicate Information (identified by operation id):                                                                 
---------------------------------------------------                                                                 
                                                                                                                    
   6 - access("T1"."ID"="T2"."ID")                                                                                  
                                                                                                                    
Note                                                                                                                
-----                                                                                                               
   - Degree of Parallelism is 2 because of hint                                                                     
(EP2)

In this case HASH JOIN is not buffered. Also, the behavior of the HASH JOIN is not related to the HASH distribution only. If we use BROADCAST distribution for the Second Row Source then the buffering will be required:
11g Plan shape:

explain plan for
select /*+  parallel(2) pq_distribute(t2 none broadcast) */ * from t1 , t2
where t1.id=t2.id;

Plan hash value: 869483324                                                                                        
                                                                                                                  
------------------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |          |  2000 |  2494K|    67   (2)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR          |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)    | :TQ10001 |  2000 |  2494K|    67   (2)| 00:00:01 |  Q1,01 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN             |          |  2000 |  2494K|    67   (2)| 00:00:01 |  Q1,01 | PCWP |            |
|   4 |     PX BLOCK ITERATOR    |          |  1000 |   497K|    12   (0)| 00:00:01 |  Q1,01 | PCWC |            |
|   5 |      TABLE ACCESS FULL   | T1       |  1000 |   497K|    12   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|   6 |     BUFFER SORT          |          |       |       |            |          |  Q1,01 | PCWC |            |
|   7 |      PX RECEIVE          |          |  3000 |  2250K|    54   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|   8 |       PX SEND BROADCAST  | :TQ10000 |  3000 |  2250K|    54   (0)| 00:00:01 |  Q1,00 | P->P | BROADCAST  |
|   9 |        PX BLOCK ITERATOR |          |  3000 |  2250K|    54   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|  10 |         TABLE ACCESS FULL| T2       |  3000 |  2250K|    54   (0)| 00:00:01 |  Q1,00 | PCWP |            |
------------------------------------------------------------------------------------------------------------------
                                                                                                                  
Predicate Information (identified by operation id):                                                               
---------------------------------------------------                                                               
                                                                                                                  
   3 - access("T1"."ID"="T2"."ID")                                                                                
                                                                                                                  
Note                                                                                                              
-----                                                                                                             
   - Degree of Parallelism is 2 because of hint
(EP 3)

But the explain plan form 12c (18c) displays exact HASH JOIN BUFFERED operation rather than separate BUFFER SORT and HASH JOIN operation in the plan shape.

EXPLAIN PLAN
    FOR
        SELECT /*+ no_pq_replicate(@"SEL$1" "T2"@"SEL$1") parallel(2) pq_distribute(t2 none broadcast) */
              *
          FROM   t1, t2
         WHERE   t1.id = t2.id;
(SQL4)
Plan hash value: 1233681586                                                                                              
                                                                                                                         
-------------------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |   197K|  1488M|       | 96556   (1)| 00:00:04 |        |      |            |
|   1 |  PX COORDINATOR         |          |       |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)   | :TQ10001 |   197K|  1488M|       | 96556   (1)| 00:00:04 |  Q1,01 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN BUFFERED   |          |   197K|  1488M|   187M| 96556   (1)| 00:00:04 |  Q1,01 | PCWP |            |
|   4 |     PX BLOCK ITERATOR   |          |   100K|   374M|       | 14663   (0)| 00:00:01 |  Q1,01 | PCWC |            |
|   5 |      TABLE ACCESS FULL  | T1       |   100K|   374M|       | 14663   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|   6 |     PX RECEIVE          |          |   300K|  1134M|       | 44374   (1)| 00:00:02 |  Q1,01 | PCWP |            |
|   7 |      PX SEND BROADCAST  | :TQ10000 |   300K|  1134M|       | 44374   (1)| 00:00:02 |  Q1,00 | P->P | BROADCAST  |
|   8 |       PX BLOCK ITERATOR |          |   300K|  1134M|       | 44374   (1)| 00:00:02 |  Q1,00 | PCWC |            |
|   9 |        TABLE ACCESS FULL| T2       |   300K|  1134M|       | 44374   (1)| 00:00:02 |  Q1,00 | PCWP |            |
-------------------------------------------------------------------------------------------------------------------------

(EP4)

It is fact that in both plans (EP3, EP4) the Second Row Source must be redistributed, and it forced to perform HASH JOIN in BUFFERED mode. Now, let`s see another query and its plan :

explain plan for
select /*+  parallel(2) pq_distribute(t2 hash hash) */t1.tp,count(*) from t1 , t2
where t1.id=t2.id
group by t1.tp;
(SQL 5)

Plan hash value: 2482649283                                                                                          
                                                                                                                     
---------------------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
---------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |          |    11 |   121 |    68   (3)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR             |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)       | :TQ10003 |    11 |   121 |    68   (3)| 00:00:01 |  Q1,03 | P->S | QC (RAND)  |
|   3 |    HASH GROUP BY            |          |    11 |   121 |    68   (3)| 00:00:01 |  Q1,03 | PCWP |            |
|   4 |     PX RECEIVE              |          |    11 |   121 |    68   (3)| 00:00:01 |  Q1,03 | PCWP |            |
|   5 |      PX SEND HASH           | :TQ10002 |    11 |   121 |    68   (3)| 00:00:01 |  Q1,02 | P->P | HASH       |
|   6 |       HASH GROUP BY         |          |    11 |   121 |    68   (3)| 00:00:01 |  Q1,02 | PCWP |            |
|*  7 |        HASH JOIN            |          |  2000 | 22000 |    67   (2)| 00:00:01 |  Q1,02 | PCWP |            |
|   8 |         PX RECEIVE          |          |  1000 |  7000 |    12   (0)| 00:00:01 |  Q1,02 | PCWP |            |
|   9 |          PX SEND HASH       | :TQ10000 |  1000 |  7000 |    12   (0)| 00:00:01 |  Q1,00 | P->P | HASH       |
|  10 |           PX BLOCK ITERATOR |          |  1000 |  7000 |    12   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|  11 |            TABLE ACCESS FULL| T1       |  1000 |  7000 |    12   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|  12 |         PX RECEIVE          |          |  3000 | 12000 |    54   (0)| 00:00:01 |  Q1,02 | PCWP |            |
|  13 |          PX SEND HASH       | :TQ10001 |  3000 | 12000 |    54   (0)| 00:00:01 |  Q1,01 | P->P | HASH       |
|  14 |           PX BLOCK ITERATOR |          |  3000 | 12000 |    54   (0)| 00:00:01 |  Q1,01 | PCWC |            |
|  15 |            TABLE ACCESS FULL| T2       |  3000 | 12000 |    54   (0)| 00:00:01 |  Q1,01 | PCWP |            |
---------------------------------------------------------------------------------------------------------------------
(EP5)

But if we add the following hint to the same SQL then explain plan will contain again HASH JOIN BUFFERED operation:

explain plan for
select /*+ parallel(2) no_gby_pushdown(@"SEL$1")  pq_distribute(t2 hash hash) */t1.tp,count(*) from t1 , t2
where t1.id=t2.id
group by t1.tp;

Plan hash value: 548441744                                                                                          
                                                                                                                    
--------------------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |          |    11 |   121 |    68   (3)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR            |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)      | :TQ10003 |    11 |   121 |    68   (3)| 00:00:01 |  Q1,03 | P->S | QC (RAND)  |
|   3 |    HASH GROUP BY           |          |    11 |   121 |    68   (3)| 00:00:01 |  Q1,03 | PCWP |            |
|   4 |     PX RECEIVE             |          |  2000 | 22000 |    67   (2)| 00:00:01 |  Q1,03 | PCWP |            |
|   5 |      PX SEND HASH          | :TQ10002 |  2000 | 22000 |    67   (2)| 00:00:01 |  Q1,02 | P->P | HASH       |
|*  6 |       HASH JOIN BUFFERED   |          |  2000 | 22000 |    67   (2)| 00:00:01 |  Q1,02 | PCWP |            |
|   7 |        PX RECEIVE          |          |  1000 |  7000 |    12   (0)| 00:00:01 |  Q1,02 | PCWP |            |
|   8 |         PX SEND HASH       | :TQ10000 |  1000 |  7000 |    12   (0)| 00:00:01 |  Q1,00 | P->P | HASH       |
|   9 |          PX BLOCK ITERATOR |          |  1000 |  7000 |    12   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|  10 |           TABLE ACCESS FULL| T1       |  1000 |  7000 |    12   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|  11 |        PX RECEIVE          |          |  3000 | 12000 |    54   (0)| 00:00:01 |  Q1,02 | PCWP |            |
|  12 |         PX SEND HASH       | :TQ10001 |  3000 | 12000 |    54   (0)| 00:00:01 |  Q1,01 | P->P | HASH       |
|  13 |          PX BLOCK ITERATOR |          |  3000 | 12000 |    54   (0)| 00:00:01 |  Q1,01 | PCWC |            |
|  14 |           TABLE ACCESS FULL| T2       |  3000 | 12000 |    54   (0)| 00:00:01 |  Q1,01 | PCWP |            |
--------------------------------------------------------------------------------------------------------------------
(EP6)

So, the SQL is same, same PX slaves and same PX distributions, we just added the no_gby_pushdown hint to prevent group by push down optimization and demonstrate that when the BUFFERED variant of HASH JOIN is required.
All above examples clearly indicate that the appearing of HASH JOIN BUFFERED operation in execution plan is depending on parent operation of parallel HASH JOIN. It means, if the result of parallel HASH JOIN is required to be produced directly in this case BUFFERED is needed, otherwise if the result of PARALLEL HASH JOIN will be processed by other parent operation before producing (redistribute) in this case no BUFFERED is needed. Question is that what is the reason? Before answer the question let`s change the (SQL4) and force optimizer to use BROADCAST distribution for the second row source.

explain plan for
select /*+  parallel(2) pq_distribute(t2 none broadcast) */ t1.tp,count(*) from t1 , t2
where t1.id=t2.id
group by t1.tp;

Plan hash value: 1458162008                                                                                          
                                                                                                                     
---------------------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
---------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |          |    11 |   121 |    68   (3)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR             |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)       | :TQ10002 |    11 |   121 |    68   (3)| 00:00:01 |  Q1,02 | P->S | QC (RAND)  |
|   3 |    HASH GROUP BY            |          |    11 |   121 |    68   (3)| 00:00:01 |  Q1,02 | PCWP |            |
|   4 |     PX RECEIVE              |          |    11 |   121 |    68   (3)| 00:00:01 |  Q1,02 | PCWP |            |
|   5 |      PX SEND HASH           | :TQ10001 |    11 |   121 |    68   (3)| 00:00:01 |  Q1,01 | P->P | HASH       |
|   6 |       HASH GROUP BY         |          |    11 |   121 |    68   (3)| 00:00:01 |  Q1,01 | PCWP |            |
|*  7 |        HASH JOIN            |          |  2000 | 22000 |    67   (2)| 00:00:01 |  Q1,01 | PCWP |            |
|   8 |         PX BLOCK ITERATOR   |          |  1000 |  7000 |    12   (0)| 00:00:01 |  Q1,01 | PCWC |            |
|   9 |          TABLE ACCESS FULL  | T1       |  1000 |  7000 |    12   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|  10 |         PX RECEIVE          |          |  3000 | 12000 |    54   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|  11 |          PX SEND BROADCAST  | :TQ10000 |  3000 | 12000 |    54   (0)| 00:00:01 |  Q1,00 | P->P | BROADCAST  |
|  12 |           PX BLOCK ITERATOR |          |  3000 | 12000 |    54   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|  13 |            TABLE ACCESS FULL| T2       |  3000 | 12000 |    54   (0)| 00:00:01 |  Q1,00 | PCWP |            |
---------------------------------------------------------------------------------------------------------------------

As it is seen, even the second row source is distributed via BROADCAST the buffered variant is not required. Because, the result of the parallel hash join is not directly sent to the parent operation. There is HASH GROUP BY and after that the rows are distributed via HASH distribution.
If you look at the EP1, there are two PX Server set (PX SS). One is producer set (PX SS2) and another one is consumer set (PX SS1). The Parallel HASH JOIN operation in the line 3 is performed by PX SS1 so by consumer slaves. At the same time the parent operation (in the line 2 – PX SEND QC (RANDOM)) is required to produce the result of the HASH JOIN. So, this is consumer salve set and the same time it can not produce rows (it is consumer slave set, consumer salves can not produce rows unless become again producers), in other words they can not play as “PRODUCERS” at the same time. That is why, firstly, the second row source is consumed and buffered then the consumer slaves can become producers in order to produce rows. But, the line 4 in explain plan EP2 allows consumers to change their status and become producers. So, it means, PX SS1 will complete its work as consumer in the line 6 in EP2 and then becomes producer starting from the line 4. So, additional operation after parallel hash join allows consumers to become producers and that is why buffered is not required.
Finally let` see the below simple plan:

explain plan for
select /*+  parallel(2) pq_distribute(t2 broadcast none) */ * from t1 , t2
where t1.id=t2.id;
Plan hash value: 3350059367                                                                                      
                                                                                                                 
-----------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |  2000 |  2494K|    67   (2)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR         |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)   | :TQ10001 |  2000 |  2494K|    67   (2)| 00:00:01 |  Q1,01 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN            |          |  2000 |  2494K|    67   (2)| 00:00:01 |  Q1,01 | PCWP |            |
|   4 |     PX RECEIVE          |          |  1000 |   497K|    12   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|   5 |      PX SEND BROADCAST  | :TQ10000 |  1000 |   497K|    12   (0)| 00:00:01 |  Q1,00 | P->P | BROADCAST  |
|   6 |       PX BLOCK ITERATOR |          |  1000 |   497K|    12   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS FULL| T1       |  1000 |   497K|    12   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|   8 |     PX BLOCK ITERATOR   |          |  3000 |  2250K|    54   (0)| 00:00:01 |  Q1,01 | PCWC |            |
|   9 |      TABLE ACCESS FULL  | T2       |  3000 |  2250K|    54   (0)| 00:00:01 |  Q1,01 | PCWP |            |
-----------------------------------------------------------------------------------------------------------------

As you see, in this case the PX SS1 will perform its work as consumer at once (in the line 4) when creating in memory hash cluster (build table) and becomes producer slave set (does not need to consume the second row source) then performs parallel hash join and can produce the result.

Summary

We saw that HASH JOIN BUFFERED is needed when result of parallel hash join is required to be produced directly and the second row source is placed in different DFO than HASH JOIN itself. It depends on the parent operation. If the result of HASH JOIN is followed by other operator (like aggregating, grouping, joining) then this operation gives a chance to the consumer slave set to become producer slave set. The mentioned notes are also true for Oracle database 12c afterward.

Comments (7)

December 17, 2018

The Efficiency of Hybrid Histogram

Filed under: CBO,Hybrid Histogram,theory — Chinar Aliyev @ 11:53 am

It is not a practical blog post, just wanted to share my thoughts about the hybrid histogram in this quick post.
If column distribution is expressed by frequency or top-frequency histogram in Oracle database, then it means database has enough information about your data. But there can be a problem if column statistics expressed by HH. Because, it is created via sampling and in this case, there is a chance losing of the most frequent values (MFV). The frequency of MFV might be inaccurate even in the case of sampling catches them. But, if HH is created with 100% sampling then there not doubt about its efficiency for me. This is only histogram that even gives us an opportunity to identify bucket level distribution.
HH is created if the following condition is true:
TOPNFREQ < p*NUM_ROWS (C 1)
where:

p=(1-1/N),
N – number of requested histogram buckets
NUM_ROWS - number of non-null rows in the table  
TOPNFREQ – total frequency of top N frequent values.

There another question arises, if TOPNFREQ < p*NUM_ROWS condition is true, but the numbers TOPNFREQ and p*NUM_ROWS are very close so :

| TOPNFREQ - p*NUM_ROWS| <= 0.013 (C 2)

like that what will happen? So, there is very little difference, but the condition will force to be created HH instead of TFH. May be in this case TFH should have been created and it may have expressed the data very accurately than HH? This is the infliction point I wanted to mention.
Because with very little differences oracle database will create HH but using sampling! In this case the efficiency of HH might be low due to losing of TFV.
The conditions (C 1) and (C 2) contain frequency of TFV but the result HH histogram does not contain it due to sampling (or their frequency will be inaccurate). So, there is inconsistency.
In my opinion, HH is comparable with TFH only when it contains all TFV (top-n). Thus, even the condition (C 2) is true and if the HH contains all TFV then the HH will express the data accurately than TFH.
Another question is how the HH can include all TFV (top-n)? It still is not created as a part of approximate NDV Algorithm. The database uses the following SQL to produce a data set for creating HH:

SELECT   SUBSTRB (DUMP (val,
                          16,
                          0,
                          64), 1, 240)
               ep,
           freq,
           cdn,
           ndv,
           (SUM (pop) OVER ()) popcnt,
           (SUM (pop * freq) OVER ()) popfreq,
           SUBSTRB (DUMP (MAX (val) OVER (),
                          16,
                          0,
                          64), 1, 240)
               maxval,
           SUBSTRB (DUMP (MIN (val) OVER (),
                          16,
                          0,
                          64), 1, 240)
               minval
    FROM   (SELECT   val,
                     freq,
                     (SUM (freq) OVER ()) cdn,
                     (COUNT ( * ) OVER ()) ndv,
                     (CASE
                          WHEN freq > ( (SUM (freq) OVER ()) / num_bucket) THEN 1
                          ELSE 0
                      END)
                         pop
              FROM   (  SELECT /*+  lots of hints  */
                              "COL"
                                     val,
                                 COUNT ("COL") freq
                          FROM   "T1" SAMPLE (.abc...) t
                         WHERE   "COL" IS NOT NULL
                      GROUP BY   "COL"))
ORDER BY   val
(SQL 1)
So, the main part is:
  SELECT /*+ lots of hints */
        "COL" val, COUNT ("COL") freq
    FROM    "T1" SAMPLE (.abc...) t
   WHERE   "COL" IS NOT NULL
GROUP BY   "COL"
(SQL 2)

My suggestion is modifying the (SQL 2) by adding the information about TFV, because during the gathering basic column statistics (approximate NDV) the database already creates (or could create) and stores TFV in a ROWID table. We can see the information about the ROWID table only in the case of creation of TFH. For example:

originally requested bucket num (via method_opt): 20
Computing topn from NDV structures...
TopN Histogram Feasible?: YES
 |-->tot_freq: 9701 tot_nonnull_rows: 10000 bucketnum >> actual: 20                requested: 20 ndv: 319
 tot_freq/((double)tot_nonnull_rows): 0.970100,               (1 - 1/((double)requested_bucketnum)): 0.950000
 --> requested bucket num: 20
-------------------------------------------------------------------------
 Rank          Rowid                     Frequency
-------------------------------------------------------------------------
    1  AAAFzGAABAAANepAAF                   551
    2  AAAFzGAABAAANepAAI                   534
    3  AAAFzGAABAAANepAAE                   532
    4  AAAFzGAABAAANepAAG                   529
    5  AAAFzGAABAAANepAAC                   527
    6  AAAFzGAABAAANepAAj                   521
    7  AAAFzGAABAAANepAAA                   519
    8  AAAFzGAABAAANepAAJ                   511
    9  AAAFzGAABAAANepAAi                   510
   10  AAAFzGAABAAANepAAH                   510
   11  AAAFzGAABAAANepAAt                   509
   12  AAAFzGAABAAANepAAc                   508
   13  AAAFzGAABAAANepAAM                   504
   14  AAAFzGAABAAANepAAO                   500
   15  AAAFzGAABAAANepAAB                   499
   16  AAAFzGAABAAANepAAL                   497
   17  AAAFzGAABAAANepAAD                   494
   18  AAAFzGAABAAANepAAS                   491
   19  AAAFzGAABAAANepAAY                   454
   20  AAAFzGAABAAANe4AAP                     1

Also, it can be seen

DBMS_STATS: Approximate NDV Result 
DBMS_STATS: <process_result>
  <select_list_item>
    <pos>0</pos>
    <value>319</value>
    <rowcnt>10000</rowcnt>
    <qcmsg>0</qcmsg>
    <split>0</split>
    <ndv>319</ndv>
    <nonnulls>10000</nonnulls>
    <rsize>20297</rsize>
    <topncnt>20</topncnt>
    <topn_values>AAAFzGAABAAANepAAF,551,AAAFzGAABAAANepAAI,534,AAAFzGAABAAANepAAE,532,AAAFzGAABAAANepAAG,529,AAAFzGAABAAANepAAC,527,AAAFzGAABAAANepAAj,521,AAAFzGAABAAANepAAA,519,AAAFzGAABAAANepAAJ,511,AAAFzGAABAAANepAAi,510,AAAFzGAABAAANepAAH,510,AAAFzGAABAAANepAAt,509,AAAFzGAABAAANepAAc,508,AAAFzGAABAAANepAAM,504,AAAFzGAABAAANepAAO,500,AAAFzGAABAAANepAAB,499,AAAFzGAABAAANepAAL,497,AAAFzGAABAAANepAAD,494,AAAFzGAABAAANepAAS,491,AAAFzGAABAAANepAAY,454,AAAFzGAABAAANe4AAP,1,</topn_values>
  </select_list_item>
  <select_list_item>
    <pos>1</pos>
    <value>Typ=2 Len=2: c1,2</value>
  </select_list_item>

For the case of HH creation:
originally requested bucket num (via method_opt): 10
Computing approximate topn from native topn structures...
TopN Histogram Feasible?: NO
 |-->tot_freq: 5919 tot_nonnull_rows: 800000 bucketnum >> actual: 10                requested: 10 ndv: 92816
 tot_freq/((double)tot_nonnull_rows): 0.007399,               (1 - 1/((double)requested_bucketnum)): 0.900000

For the case of HH creation:

originally requested bucket num (via method_opt): 10
Computing approximate topn from native topn structures...
TopN Histogram Feasible?: NO
 |-->tot_freq: 5919 tot_nonnull_rows: 800000 bucketnum >> actual: 10                requested: 10 ndv: 92816
 tot_freq/((double)tot_nonnull_rows): 0.007399,               (1 - 1/((double)requested_bucketnum)): 0.900000

So, we do not see the ROWID table explicitly.
It means we have TVF and their frequencies in the ROWID table. For demonstration I used symbolic ROWID_TAB in the following SQLs.
So, the information in the ROWID table can be added to the (SQL 2) as:

So, the information in the ROWID table can be added to the (SQL 2) as:
  SELECT   val, ROUND (freq * 100 / .abc...) freq
  FROM   (  SELECT   "COL" val, COUNT ("COL") freq
              FROM   "T1" SAMPLE (.abc...) t
             WHERE   "COL" IS NOT NULL
                     AND "COL" NOT IN (SELECT val FROM rowid_tab)
          GROUP BY   "COL")
UNION
SELECT   val, freq FROM rowid_tab

(SQL 3)

There we assume the ROWID table is maintained and there .abc… is a sample percent that identified by Oracle Database.
Thus, the (SQL 1) could have been:

SELECT   ...
    FROM   (SELECT   val,
                     freq,
                     (SUM (freq) OVER ()) cdn,
                     (COUNT ( * ) OVER ()) ndv,
                     (CASE
                          WHEN freq > ( (SUM (freq) OVER ()) / num_bucket) THEN 1
                          ELSE 0
                      END)
                         pop
              FROM   (   SELECT   val, ROUND (freq * 100 / .abc...) freq
  FROM   (  SELECT  /*+ lots of hints */ "COL" val, COUNT ("COL") freq
              FROM   "T1" SAMPLE (.abc...) t
             WHERE   "COL" IS NOT NULL
                     AND "COL" NOT IN (SELECT   val FROM rowid_tab)
          GROUP BY   "COL")
UNION
SELECT   val, freq FROM rowid_tab
))
(SQL 4)

In this case we could include all TFV into the histogram with their exact frequencies.
To add more, it is obvious that the popular values are among the TFV. It means all other values included by the sampling (in sampling part of the (SQL 4)) will not contain popular values. To be more accurate, the popular values should be identified using ROWID_TAB.freq value, so, If ROWID_TAB.freq > num_rows/num_bucket then the value is popular.
Thus:


SELECT  ...
    FROM   (SELECT   val,
                     freq,
                     (SUM (freq) OVER ()) cdn,
                     (COUNT ( * ) OVER ()) ndv, 
                         pop
              FROM   (   SELECT   val, ROUND (freq * 100 / .abc...) freq,0 pop
  FROM   (  SELECT  /*+ lots of hints */ "COL" val, COUNT ("COL") freq
              FROM   "T1" SAMPLE (.abc...) t
             WHERE   "COL" IS NOT NULL
                     AND "COL" NOT IN (  SELECT   val FROM rowid_tab)
          GROUP BY   "COL")
UNION
SELECT   val, freq, (CASE
                          WHEN freq > (num_rows / num_bucket) THEN 1
                          ELSE 0
                      END)
                         pop FROM rowid_tab
))

Note: It is just my opinion about the HH, ofcourse, it could be very good if HH also is created as a part of approximate NDV algorithm. Maybe we can see it in a future Oracle Database version.
Several months ago, I talked to Nigel Bayliss(Product manager for Oracle Optimizer) and thanks to him for opening an enhancement request for that:
Bug 27838336 : CREATE HYBRID HISTOGRAMS FROM FULL TABLE SCAN RATHER THAN A SAMPLE

Leave a Comment

November 6, 2018

Understanding Hybrid Histogram

Filed under: CBO,Hybrid Histogram,statistics — Chinar Aliyev @ 11:19 am

In this blog post we are going to explore important properties of hybrid histogram (HH). The concept of bucket and average bucket sizes will be explained. These will help us to understand the creation and nature of HH. We will test same table (with same data) in Oracle Database 12.2.0.1 and 18.3.0.0. In addition, we will review how the defect has been solved in the construction of HH. We will clear how the fix “25994960: CARDINALITY MISESTIMATE FROM HYBRID HISTOGRAM” improves the efficiency of HH. It actually, prevents losing the most popular value(s) from the histogram, we will see how it has been done. The Oracle Database 18c contains the fix but 12.2 does not. Therefore, we are performing the following test case in 12.2 and 18c
Let`s see the case:

CREATE TABLE t1 AS
SELECT
       CASE
         WHEN level <=600 then TRUNC(DBMS_RANDOM.value(1,30))
         when level >=960 and level <=997 then 997
         ELSE level
       END AS col
FROM   dual 
CONNECT BY level <=1000;

The table data

 SELECT   col, COUNT ( * ), SUM (COUNT ( * )) OVER () cnt
    FROM   t1
GROUP BY   col
  HAVING   COUNT ( * ) > 1
ORDER BY   1;

       COL   COUNT(*)        CNT
---------- ---------- ----------
         1         23        638
         2         17        638
         3         24        638
         4         24        638
         5         24        638
         6         18        638
         7         16        638
         8         24        638
         9         14        638
        10         37        638
        11         14        638
        12         13        638
        13         16        638
        14         24        638
        15         21        638
        16         15        638
        17         20        638
        18         19        638
        19         25        638
        20         29        638
        21         25        638
        22         21        638
        23         19        638
        24         23        638
        25         15        638
        26         23        638
        27         23        638
        28         19        638
        29         15        638
       997         38        638

30 rows selected.

SQL>   SELECT   MIN (col), MAX (col), cnt
    FROM   (  SELECT   col, COUNT ( * ), SUM (COUNT ( * )) OVER () cnt
                FROM   t1
            GROUP BY   col
              HAVING   COUNT ( * ) = 1
            ORDER BY   1)
GROUP BY   cnt;

  2    3    4    5    6    7
  MIN(COL)   MAX(COL)        CNT
---------- ---------- ----------
       601       1000        362

SQL>

As you see the table contains 638 rows (number of distinct values for these rows is 30) with frequencies greater than one and 362 rows (number of distinct values for these rows is 362) with one frequency. So there is 392 distinct value in the table column. We are using same table data in both 12.2 and 18.3 Oracle Databases to generate HH.

begin
        dbms_stats.gather_table_stats(
                ownname          => null,
                tabname          => 't1',
                method_opt       => 'for columns col size 80'
        );
end;
/

The histogram data from Oracle DB 12.2

SQL> select
  2          endpoint_value                                                            value,
        endpoint_number,
        endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) bucket_size,
        endpoint_repeat_count
from
        user_tab_histograms
where
        table_name  = 'T1'
and     column_name = 'COL'
order by
        endpoint_value
; 
     VALUE ENDPOINT_NUMBER BUCKET_SIZE ENDPOINT_REPEAT_COUNT
---------- --------------- ----------- ---------------------
         1              23          23                    23
         2              40          17                    17
         3              64          24                    24
         4              88          24                    24
         5             112          24                    24
         6             130          18                    18
         7             146          16                    16
         8             170          24                    24
         9             184          14                    14
        10             221          37                    37
        11             235          14                    14
        12             248          13                    13
        13             264          16                    16
        14             288          24                    24
        15             309          21                    21
        16             324          15                    15
        17             344          20                    20
        18             363          19                    19
        19             388          25                    25
        20             417          29                    29
        21             442          25                    25
        22             463          21                    21
        23             482          19                    19
        24             505          23                    23
        25             520          15                    15
        26             543          23                    23
        27             566          23                    23
        28             585          19                    19
        29             600          15                    15
       607             607           7                     1
       614             614           7                     1
       621             621           7                     1
       628             628           7                     1
       635             635           7                     1
       642             642           7                     1
       649             649           7                     1
       655             655           6                     1
       662             662           7                     1
       669             669           7                     1
       676             676           7                     1
       683             683           7                     1
       690             690           7                     1
       697             697           7                     1
       704             704           7                     1
       711             711           7                     1
       718             718           7                     1
       725             725           7                     1
       732             732           7                     1
       739             739           7                     1
       745             745           6                     1
       752             752           7                     1
       759             759           7                     1
       766             766           7                     1
       773             773           7                     1
       780             780           7                     1
       787             787           7                     1
       794             794           7                     1
       801             801           7                     1
       808             808           7                     1
       815             815           7                     1
       822             822           7                     1
       828             828           6                     1
       835             835           7                     1
       842             842           7                     1
       849             849           7                     1
       856             856           7                     1
       863             863           7                     1
       870             870           7                     1
       877             877           7                     1
       884             884           7                     1
       891             891           7                     1
       898             898           7                     1
       905             905           7                     1
       911             911           6                     1
       918             918           7                     1
       925             925           7                     1
       932             932           7                     1
       939             939           7                     1
       946             946           7                     1
      1000            1000          54                     1

80 rows selected.

SQL> SELECT   num_distinct, num_buckets, density,histogram
  FROM   user_tab_col_statistics
 WHERE   table_name = 'T1' AND column_name = 'COL'  2    3
  4  ;

NUM_DISTINCT NUM_BUCKETS    DENSITY HISTOGRAM
------------ ----------- ---------- ---------------
         392          80    .001969 HYBRID

Now the histogram data from Oracle DB 18.3

SQL> select
  2          endpoint_value                                                            value,
        endpoint_number,
        endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) bucket_size,
        endpoint_repeat_count
from
        user_tab_histograms
where
        table_name  = 'T1'
and     column_name = 'COL'
order by
        endpoint_value  3    4    5    6    7    8    9   10   11   12
 13  ;

     VALUE ENDPOINT_NUMBER BUCKET_SIZE ENDPOINT_REPEAT_COUNT
---------- --------------- ----------- ---------------------
         1              23          23                    23
         2              40          17                    17
         3              64          24                    24
         4              88          24                    24
         5             112          24                    24
         6             130          18                    18
         7             146          16                    16
         8             170          24                    24
         9             184          14                    14
        10             221          37                    37
        11             235          14                    14
        12             248          13                    13
        13             264          16                    16
        14             288          24                    24
        15             309          21                    21
        16             324          15                    15
        17             344          20                    20
        18             363          19                    19
        19             388          25                    25
        20             417          29                    29
        21             442          25                    25
        22             463          21                    21
        23             482          19                    19
        24             505          23                    23
        25             520          15                    15
        26             543          23                    23
        27             566          23                    23
        28             585          19                    19
        29             600          15                    15
       608             608           8                     1
       615             615           7                     1
       622             622           7                     1
       629             629           7                     1
       637             637           8                     1
       644             644           7                     1
       651             651           7                     1
       658             658           7                     1
       666             666           8                     1
       673             673           7                     1
       680             680           7                     1
       687             687           7                     1
       695             695           8                     1
       702             702           7                     1
       709             709           7                     1
       716             716           7                     1
       724             724           8                     1
       731             731           7                     1
       738             738           7                     1
       745             745           7                     1
       752             752           7                     1
       760             760           8                     1
       767             767           7                     1
       774             774           7                     1
       781             781           7                     1
       789             789           8                     1
       796             796           7                     1
       803             803           7                     1
       810             810           7                     1
       818             818           8                     1
       825             825           7                     1
       832             832           7                     1
       839             839           7                     1
       847             847           8                     1
       854             854           7                     1
       861             861           7                     1
       868             868           7                     1
       876             876           8                     1
       883             883           7                     1
       890             890           7                     1
       897             897           7                     1
       905             905           8                     1
       912             912           7                     1
       919             919           7                     1
       926             926           7                     1
       933             933           7                     1
       941             941           8                     1
       948             948           7                     1
       955             955           7                     1
       997             997          42                    38
      1000            1000           3                     1

80 rows selected.

SQL> SELECT   num_distinct, num_buckets, density,histogram
  FROM   user_tab_col_statistics
 WHERE   table_name = 'T1' AND column_name = 'COL'
  2    3    4  ;

NUM_DISTINCT NUM_BUCKETS    DENSITY HISTOGRAM
------------ ----------- ---------- ---------------
         392          80      .0019 HYBRID

As you see the 12c HH does not include the most frequent value (997) but 18c HH does. Although the table data are same the histogram data are different in the both databases and that is why the densities also are different. To understand the problem we need to trace statistics gathering process in both database.
Trace files in both database shows that the same SQL statement and same data were used to create HH

12c trace:

DBMS_STATS: Building Histogram for COL
DBMS_STATS:  bktnum=80, nnv=1000, snnv=1000, sndv=392, est_ndv=392, mnb=80
DBMS_STATS:  Trying hybrid histogram 
DBMS_STATS: select substrb(dump(val,16,0,64),1,240) ep,  freq, cdn, ndv, (sum(pop) over()) popcnt,  (sum(pop*freq) over()) popfreq,  substrb(dump(max(val) over(),16,0,64),1,240) maxval,  substrb(dump(min(val) over(),16,0,64),1,240) minval  from (select val, freq,  (sum(freq) over()) cdn, (count(*) over()) ndv,  (case when freq > ((sum(freq) over())/80)  then 1  else 0 end) pop from  (select /*+  no_parallel(t) no_parallel_index(t) dbms_stats cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring xmlindex_sel_idx_tbl opt_param('optimizer_inmemory_aware' 'false') no_substrb_pad  */ "COL"  val, count("COL") freq  from "SYSTEM"."T1" t  where "COL" is not null  group by "COL")) order by val
DBMS_STATS:  > cdn 1000, popFreq 638, popCnt 30, bktSize 6.91836734693877551020408163265306122449, bktSzFrc .91836734693877551020408163265306122449
DBMS_STATS:  Evaluating hybrid histogram:  cht.count 80, mnb 80, ssize 1000, min_ssize 2500, appr_ndv  TRUE, ndv 392, selNdv 15, selFreq 343, pct 100, avg_bktsize 13, csr.hreq TRUE, normalize TRUE
DBMS_STATS:   Histogram gathering flags: 7
DBMS_STATS:  Accepting histogram

18c trace:

DBMS_STATS: Building Histogram for COL
DBMS_STATS:  bktnum=80, nnv=1000, snnv=1000, sndv=392, est_ndv=392, mnb=80
DBMS_STATS:  Trying hybrid histogram 
DBMS_STATS: select substrb(dump(val,16,0,64),1,240) ep,  freq, cdn, ndv, (sum(pop) over()) popcnt,  (sum(pop*freq) over()) popfreq,  substrb(dump(max(val) over(),16,0,64),1,240) maxval,  substrb(dump(min(val) over(),16,0,64),1,240) minval  from (select val, freq,  (sum(freq) over()) cdn, (count(*) over()) ndv,  (case when freq > ((sum(freq) over())/80)  then 1  else 0 end) pop from  (select /*+  no_parallel(t) no_parallel_index(t) dbms_stats cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring xmlindex_sel_idx_tbl opt_param('optimizer_inmemory_aware' 'false') no_substrb_pad  */ "COL"  val, count("COL") freq  from "CHINAR"."T1" t  where "COL" is not null  group by "COL")) order by val

(SQL_HH)

DBMS_STATS:  > cdn 1000, popFreq 638, popCnt 30, bktSize 7.24, bktSzFrc .24
DBMS_STATS:  Evaluating hybrid histogram:  cht.count 80, mnb 80, ssize 1000, min_ssize 2500, appr_ndv  TRUE, ndv 392, selNdv 16, selFreq 381, pct 100, avg_bktsize 13, csr.hreq TRUE, normalize TRUE
DBMS_STATS:   Histogram gathering flags: 7
DBMS_STATS:  Accepting histogram

It is clearly seen that both trace files reports that there cardinality (cdn), popular frequency (popFreq), popular value count (popCnt) and average bucket size are same except the bucket size (original). The bucket size was calculated 6.9183 in 12c and 7.24 in 18c. And this is the answer that is why 12c histogram does not contain the most frequent value(s).

Now, what is the bucket size and how is it calculated?

In order to create HH oracle firstly calculates the bucket size and then uses it against the data returned by the (SQL_HH) SQL statement. Let’s quick review that data. In both Oracle database version the same data is retrieved:

  SELECT   val,
           freq,
           cdn,
           ndv,
           (SUM (pop) OVER ()) popcnt,
           (SUM (pop * freq) OVER ()) popfreq
    FROM   (SELECT   val,
                     freq,
                     (SUM (freq) OVER ()) cdn,
                     (COUNT ( * ) OVER ()) ndv,
                     (CASE
                          WHEN freq > ( (SUM (freq) OVER ()) / 80) THEN 1
                          ELSE 0
                      END)
                         pop
              FROM   (  SELECT"COL"
                                     val,
                                 COUNT ("COL") freq
                          FROM   "CHINAR"."T1" t
                         WHERE   "COL" IS NOT NULL
                      GROUP BY   "COL"))
ORDER BY   val;

       VAL       FREQ        CDN        NDV     POPCNT    POPFREQ
---------- ---------- ---------- ---------- ---------- ----------
         1         23       1000        392         30        638
         2         17       1000        392         30        638
         3         24       1000        392         30        638
         4         24       1000        392         30        638
         5         24       1000        392         30        638
         6         18       1000        392         30        638
         7         16       1000        392         30        638
         8         24       1000        392         30        638
         9         14       1000        392         30        638
        10         37       1000        392         30        638
        11         14       1000        392         30        638
        12         13       1000        392         30        638
        13         16       1000        392         30        638
        14         24       1000        392         30        638
        15         21       1000        392         30        638
        16         15       1000        392         30        638
        17         20       1000        392         30        638
        18         19       1000        392         30        638
        19         25       1000        392         30        638
        20         29       1000        392         30        638
        21         25       1000        392         30        638
        22         21       1000        392         30        638
        23         19       1000        392         30        638
        24         23       1000        392         30        638
        25         15       1000        392         30        638
        26         23       1000        392         30        638
        27         23       1000        392         30        638
        28         19       1000        392         30        638
        29         15       1000        392         30        638
       601          1       1000        392         30        638
       602          1       1000        392         30        638
       603          1       1000        392         30        638
       604          1       1000        392         30        638
       605          1       1000        392         30        638
       606          1       1000        392         30        638
       607          1       1000        392         30        638
       608          1       1000        392         30        638
.................................................................
       947          1       1000        392         30        638
       948          1       1000        392         30        638
       949          1       1000        392         30        638
       950          1       1000        392         30        638
       951          1       1000        392         30        638
       952          1       1000        392         30        638
       953          1       1000        392         30        638
       954          1       1000        392         30        638
       955          1       1000        392         30        638
       956          1       1000        392         30        638
       957          1       1000        392         30        638
       958          1       1000        392         30        638
       959          1       1000        392         30        638
       997         38       1000        392         30        638
       998          1       1000        392         30        638
       999          1       1000        392         30        638
      1000          1       1000        392         30        638

392 rows selected.

The bucket size should be calculated properly in order to include the most frequent values as much as possible. Where does the bucket and its size come from? We firstly will explain it.

There we have CDN (number of rows in the table or histogram cardinality when using sample data), number of popular values (PopCnt) and their frequencies – popular frequencies (PopFreq). So, it means There (NDV-PopCnt) number of distinct values of unpopular values and (CDN-PopFreq) unpopular rows (or their total frequencies). Also, we need to create N (in our case 80) number of buckets and it should contain all top-frequent values. Each popular value should be located in a separate bucket, therefore, we have to reserve PopCnt number of buckets for popular values. It means there are (N-PopCnt) number of buckets for unpopular rows. We actually distribute the table (column) rows into two types of buckets: popular and unpopular buckets. So, each unpopular bucket will contain:

 Unpopular_rows/unpopular_buckets = (CDN-PopFreq) / (N-PopCnt)     (F.1)

This formula explains the concept of the bucket size. So, the bucket size is calculated based on the formula (F.1).
But the minimum size of the popular bucket will be:

  PopbktSize = min (〖PopValue〗_i) , i = [1..PopCnt]

Now, let see the above two histograms and try to understand how bucket size have been identified of them.

 BktSize_18c =  (CDN-PopFreq) / (N-PopCnt)=(100-638)/(80-30)=7.24

As you see it is the same with the number reported in the 18c trace file. But why the 12c bucket is different?
The answer is that the formula (F.1) identifies the bucket size for HH properly, however there is an important thing need to be considered. The (SQL_HH) returns us a dataset to create HH, but the firstly, the minimum value of the column has to be inserted to the first bucket (as a separate bucket) of HH regardless its popularity and size. So, the minimum value is the first bucket of our HH and its size equal to the frequency of that minimum value. It means we have to consider it in our formula (F.1). In addition the last bucket of the HH is the maximum column value and its size is determined based on size of previous buckets and number of rows in the table.
So, if the minimum column value is popular then we actually, already included it in the formula (F.1), otherwise if it is not a popular value then we have to exclude it in the formula (F.1). Thus, the final bucket size will be:

  If the minimum value is popular then BktSize = ((CDN-PopFreq))/((N-PopCnt) ) else
      BktSize = ((CDN-PopFreq- freq of the minimum value))/((N-PopCnt-1) )

(F.2)

In our case the minimum value is 1 and its frequency is 23, as you know average bucket size is 13 that is why it is a popular value and therefore bucket size should be 7.24 based on the formula (F.2).

But 12c does not consider that fact, it locates the minimum value in the first bucket of the HH and then reduces the bucket size by subtracting its frequency, although it is already included. So:

BktSize_12c =  (CDN-PopFreq-freq of the minimum value) / (N-PopCnt-1)=(1000-638-23)/(80-30-1)= 6.91836…

The undersized bucket prevents to be reached (included) the most popular values, So that, smaller (un-proper) bucket size causes to create 80 (N) number of buckets before reaching the most popular value. If you see the histogram data in 12c, the 79 number of buckets has been created when database reached the value 946. It stopped the process due to reaching the maximum bucket count.

It should be noted that the formula (F.2) is used to identify the initial bucket size, but unpopular bucket size could be even less than that bucket size defined by the (F.2). You can see that if you look at the blog post by Jonathan Lewis. In this example, the bucket size calculated as same in the both version of Oracle Database:

DBMS_STATS:  > cdn 800, popFreq 434, popCnt 5, bktSize 52.14285714285714285714285714285714285714, bktSzFrc .14285714285714285714285714285714285714
DBMS_STATS:  Evaluating hybrid histogram:  cht.count 13, mnb 13, ssize 800, min_ssize 2500, appr_ndv  TRUE, ndv 22, selNdv 1, selFreq 109, pct 100, avg_bktsize 62, csr.hreq TRUE, normalize TRUE
DBMS_STATS:   Histogram gathering flags: 527

The bucket size 52.14 and histograms contain buckets that size less than 52.14, this is because there are very less distinct values after creating 4 buckets (value=19 in 18c histogram). So 13-4=9 buckets had to be created but, we had 22-13 (until 19 there are 13 distinct value) = 9 distinct value after creating the 4 buckets, thus all 9 distinct values should be included in the HH as separate bucket. The another differences of the two histogram is that in 18c Oracle Database preserves last bucket for the maximum value but 12c does not. It was another drawback of the 12c that initial preserving may help database to include most popular values, otherwise merging/removing last bucket can cause the losing of the popular value.

Note: The formula (F.1) also helps us to understand the concept of selectivity for the HH. Each unpopular buckets could have different size, for each unpopular distinct value we will have:

   Rows per unpopular distinct value=unpopular_rows/unpopular_ndv=(CDN-PopFreq) / (NDV-PopCnt)

Summary

The bucket size for hybrid histogram is calculated based on (F.2). It is initial bucket size based on this and the dataset (returned by SQL_HH) the database determines hybrid histogram. But the average bucket size is used to identify popularity of the distinct value. The minimum and maximum values are located in separate buckets. The first bucket contains only minimum column value and its size is the frequency of the minimum value.

Comments (4)

October 19, 2018

Join Skew Handling Subject to Filter Conditions-2

Filed under: Cardinality Estimation,CBO,Join,Skew,Tuning — Chinar Aliyev @ 11:33 am

In previous blog post I described a way how to estimate join size correctly. I used two virtual columns in order to avoid processing of the filter predicates explicitly. But a drawback of the method is that for each filter value we are required to create a virtual column. For example, if there are following types of filter predicates like:

dt.object_type = 'TABLE'
dt.object_type = 'INDEX'
dt.object_type in ('TABLE','INDEX')

For each such filter predicates we need to (re)create virtual columns to allow optimizer for computing Join Cardinality properly.
In this blog post I am going to demonstrate another possible way to solve this problem without creating several virtual columns. So, we are going to generalize the method.
I did a little change in the test case:

create table dim_table (type_code number, object_type varchar2(10));
insert into dim_table values (1,'TABLE');
insert into dim_table values (2,'INDEX');
insert into dim_table values (3,'VIEW');
insert into dim_table values (4,'SYNONYM');
insert into dim_table values (5,'OTHER');
commit;
exec dbms_stats.gather_table_stats(null, 'dim_table');
drop table t1;

create table t1 
nologging
as
select
        decode(owner, 'SYS',1,'MDSYS',2,'SYSTEM',3,'ORDDATA',4,5) owner_code ,object_id, object_name, 
        decode(object_type, 'TABLE',1,'INDEX',2,'VIEW',3,'SYNONYM',4,5) type_code 
from
        dba_objects
where
        rownum <= 200000;
 
exec dbms_stats.gather_table_stats(null, 't1', cascade=>true,method_opt=>'for all columns size 254');

We are going to interpret the following SQL and its execution plan

select 
       count(*)
from
        t1, dim_table dt
where
        t1.type_code   = dt.type_code 
and     dt.object_type in ('TABLE','INDEX')
and     t1.owner_code in (2,3);

(SQL 1)

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

Plan hash value: 3591903594                                                                                           
                                                                                                                      
----------------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |           |      1 |        |      1 |00:00:00.02 |     526 |       |       |          |
|   1 |  SORT AGGREGATE     |           |      1 |      1 |      1 |00:00:00.02 |     526 |       |       |          |
|*  2 |   HASH JOIN         |           |      1 |   1219 |    644 |00:00:00.01 |     526 |  2546K|  2546K|  788K (0)|
|*  3 |    TABLE ACCESS FULL| DIM_TABLE |      1 |      2 |      2 |00:00:00.01 |       6 |       |       |          |
|*  4 |    TABLE ACCESS FULL| T1        |      1 |   3048 |   3048 |00:00:00.01 |     519 |       |       |          |
----------------------------------------------------------------------------------------------------------------------
                                                                                                                      
Predicate Information (identified by operation id):                                                                   
---------------------------------------------------                                                                   
                                                                                                                      
   2 - access("T1"."TYPE_CODE"="DT"."TYPE_CODE")                                                                      
   3 - filter(("DT"."OBJECT_TYPE"='INDEX' OR "DT"."OBJECT_TYPE"='TABLE'))                                             
   4 - filter(("T1"."OWNER_CODE"=2 OR "T1"."OWNER_CODE"=3)) 

(EP1)

Now I am going to create the following extended statistics

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS( null,'dim_table',
  METHOD_OPT =>'FOR ALL COLUMNS SIZE 254 ' ||
                'FOR COLUMNS SIZE 254 (type_code,object_type)' );
DBMS_STATS.GATHER_TABLE_STATS( null,'t1',
  METHOD_OPT =>'FOR ALL COLUMNS SIZE 254 ' ||
                'FOR COLUMNS SIZE 254 (type_code,owner_code)' );
END;
/

Instead of creating virtual columns (extended stats) that include values of filter predicates, we have created extended statistics for join and filter columns without specifying their exact values. Because we are trying to generalize our approach. Let`s see column statistics.

SELECT   table_name,
           column_name,
           num_distinct,
           histogram
    FROM   user_tab_col_statistics
   WHERE   table_name IN ('DIM_TABLE', 'T1')
ORDER BY   table_name

TABLE_NAME   COLUMN_NAME                              NUM_DISTINCT          HISTOGRAM
-----------  -------------------------------------    ----------------     --------------
DIM_TABLE    SYS_STUAA5GDQVG143JAEWWJJ8DEHC              5                  FREQUENCY
DIM_TABLE    OBJECT_TYPE                                 5                  FREQUENCY
DIM_TABLE    TYPE_CODE                                   5                  FREQUENCY
T1           OBJECT_NAME                             60608                  HYBRID
T1           OBJECT_ID                               73673                  HYBRID
T1           OWNER_CODE                                  5                  FREQUENCY
T1           TYPE_CODE                                   5                  FREQUENCY
T1           SYS_STUYID1$318FLYXAF1AITPI_H9             23                  FREQUENCY

The only join and filter columns separately do not allow us to estimate join size. The reason is that after applying filters optimizer does not have any idea about value range of join columns. But the extended statistics(ES) help us to solve this problem. As you see we have two extended statistics here: DIM_TABLE.SYS_STUAA5GDQVG143JAEWWJJ8DEHC, T1.SYS_STUYID1$318FLYXAF1AITPI_H9 their data distribution (there are frequency histograms for each extended statistic) give us an opportunity to find join columns value ranges after applying filter predicates and estimate join size accurately.
Let`s see their histogram data:

SELECT   endpoint_value val,
         endpoint_number - NVL (prev_endpoint, 0) freq,
         endpoint_number ep
  FROM   (SELECT   endpoint_number,
                   NVL (
                       LAG (endpoint_number, 1)
                           OVER (ORDER BY endpoint_number),
                       0)
                       prev_endpoint,
                   endpoint_value
            FROM   user_tab_histograms
           WHERE   table_name = 'DIM_TABLE'
                   AND column_name = 'SYS_STUAA5GDQVG143JAEWWJJ8DEHC')

       VAL       FREQ         EP
---------- ---------- ----------
1091848468          1          1
3944801947          1          2
6488719988          1          3
7817432188          1          4
8929074769          1          5

SELECT   endpoint_value val,
         endpoint_number - NVL (prev_endpoint, 0) freq,
         endpoint_number ep
  FROM   (SELECT   endpoint_number,
                   NVL (
                       LAG (endpoint_number, 1)
                           OVER (ORDER BY endpoint_number),
                       0)
                       prev_endpoint,
                   endpoint_value
            FROM   user_tab_histograms
           WHERE   table_name = 'T1'
                   AND column_name = 'SYS_STUYID1$318FLYXAF1AITPI_H9')

       VAL       FREQ         EP
---------- ---------- ----------
 331414609       4616       4616
1394367859         99       4715
1598248696       1520       6235
1716676118       2123       8358
1977102303       1650      10008
2498387478       6625      16633
2724409473        205      16838
2767490637        164      17002
3007640327         25      17027
3277363539          8      17035
3306944241         22      17057
3437869372        156      17213
3763665789        474      17687
3897101470        703      18390
6005663600         90      18480
7155777133      42233      60713
7410806291      12187      72900
7713001505        155      73055
7792531372         10      73065
7894566276        147      73212
8044813761        136      73348
9156127441        310      73658
9960197936         16      73674

23 rows selected.

But the column values are not useful directly until we use MOD(sys_op_combined_hash (val1,val2)) expression. For example, if we are interested in to find join column values of t1.type_code after applying owner_code in (2,3) filter then we can use following query:

SELECT   *
                      FROM   (SELECT   endpoint_value t1_type_code,
                                       endpoint_number
                                       - NVL (prev_endpoint, 0)
                                           t1_type_code_freq,
                                       endpoint_number ep_t1_type_code
                                FROM   (SELECT   endpoint_number,
                                                 NVL (
                                                     LAG (
                                                         endpoint_number,
                                                         1)
                                                         OVER (
                                                             ORDER BY
                                                                 endpoint_number),
                                                     0)
                                                     prev_endpoint,
                                                 endpoint_value
                                          FROM   user_tab_histograms
                                         WHERE   table_name = 'T1'
                                                 AND column_name =
                                                        'TYPE_CODE'))
                             t1_hist_type_code,
                             (SELECT   endpoint_value t1_vc,
                                       endpoint_number
                                       - NVL (prev_endpoint, 0)
                                           t1_vc_freq,
                                       endpoint_number ep_t1_vc
                                FROM   (SELECT   endpoint_number,
                                                 NVL (
                                                     LAG (
                                                         endpoint_number,
                                                         1)
                                                         OVER (
                                                             ORDER BY
                                                                 endpoint_number),
                                                     0)
                                                     prev_endpoint,
                                                 endpoint_value
                                          FROM   user_tab_histograms
                                         WHERE   table_name = 'T1'
                                                 AND column_name =
                                                        'SYS_STUYID1$318FLYXAF1AITPI_H9'))
                             t1_hist_vc
                     WHERE   t1_hist_vc.t1_vc IN
                                     (MOD (
                                          sys_op_combined_hash (
                                              t1_hist_type_code.t1_type_code,
                                              2),
                                          9999999999),
                                      MOD (
                                          sys_op_combined_hash (
                                              t1_hist_type_code.t1_type_code,
                                              3),
                                          9999999999)
                                          )
T1_TYPE_CODE T1_TYPE_CODE_FREQ            EP      T1_VC         T1_VC_FREQ   EP_T1_VC
------------ -----------------      ----------   ----------     ----------   ----------
           1              2203            2203   7894566276     147          73212
           2              2869            5072   2724409473     205          16838
           3              7233           12305   1394367859     99           4715
           5             49158           73674   1716676118     2123         8358
           1              2203            2203   8044813761     136          73348
           2              2869            5072   3437869372     156          17213
           3              7233           12305   7792531372     10           73065
           4             12211           24516   3277363539     8            17035
           5             49158           73674   2767490637     164          17002

What does the SQL give us? It identifies the join column values after applying filter predicate and their frequencies that exactly we was going to find it in order to estimate join cardinality. There are T1_TYPE_CODE join column values and T1_VC_FREQ indicates their frequencies. In the same way we can find the appropriate data for dim_table and finally we will be able to produce correct join cardinality using the following SQL:

---JoinCard.sql
SELECT   SUM (dm_vc_freq * t1_vc_freq) "JoinCard"
  FROM   (SELECT   *
            FROM   (SELECT   *
                      FROM   (SELECT   endpoint_number ep_dm_type_code ,endpoint_number-
                                                 NVL (
                                                     LAG (
                                                         endpoint_number,
                                                         1)
                                                         OVER (
                                                             ORDER BY
                                                                 endpoint_number),
                                                     0)
                                                     dm_type_code_freq,
                                                 endpoint_value dm_type_code
                                          FROM   user_tab_histograms
                                         WHERE   table_name = 'DIM_TABLE'
                                                 AND column_name =
                                                        'TYPE_CODE')
                             dm_hist_type_code,
                             (SELECT   endpoint_number ep_dm_vc,
                                                 endpoint_number-NVL (
                                                     LAG (
                                                         endpoint_number,
                                                         1)
                                                         OVER (
                                                             ORDER BY
                                                                 endpoint_number),
                                                     0)
                                                     dm_vc_freq,
                                                 endpoint_value dm_vc
                                          FROM   user_tab_histograms
                                         WHERE   table_name = 'DIM_TABLE'
                                                 AND column_name =
                                                        'SYS_STUAA5GDQVG143JAEWWJJ8DEHC')
                             dm_hist_vc
                     WHERE  /* this part contains values of filter predicates*/   
                          dm_hist_vc.dm_vc IN
                                     (MOD (
                                          sys_op_combined_hash (
                                              dm_hist_type_code.dm_type_code,
                                              'TABLE'),
                                          9999999999),
                                          MOD (
                                          sys_op_combined_hash (
                                              dm_hist_type_code.dm_type_code,
                                              'INDEX'),
                                          9999999999)
                                          )) dm_data,
                                          
                   (SELECT   *
                      FROM   ( SELECT   endpoint_number ep_t1_type_code,
                                                endpoint_number- NVL (
                                                     LAG (
                                                         endpoint_number,
                                                         1)
                                                         OVER (
                                                             ORDER BY
                                                                 endpoint_number),
                                                     0)
                                                     t1_type_code_freq,
                                                 endpoint_value t1_type_code
                                          FROM   user_tab_histograms
                                         WHERE   table_name = 'T1'
                                                 AND column_name =
                                                        'TYPE_CODE') 
                             t1_hist_type_code,
                             ( SELECT   endpoint_number ep_t1_vc,
                                                endpoint_number- NVL (
                                                     LAG (
                                                         endpoint_number,
                                                         1)
                                                         OVER (
                                                             ORDER BY
                                                                 endpoint_number),
                                                     0)
                                                     t1_vc_freq,
                                                 endpoint_value t1_vc
                                          FROM   user_tab_histograms
                                         WHERE   table_name = 'T1'
                                                 AND column_name =
                                                        'SYS_STUYID1$318FLYXAF1AITPI_H9' )
                             t1_hist_vc
                     WHERE   /* this part contains values of filter predicates*/
t1_hist_vc.t1_vc IN
                                     (MOD (
                                          sys_op_combined_hash (
                                              t1_hist_type_code.t1_type_code,
                                              2),
                                          9999999999),
                                      MOD (
                                          sys_op_combined_hash (
                                              t1_hist_type_code.t1_type_code,
                                              3),
                                          9999999999)
                                          )) t1_data
           WHERE   dm_data.dm_type_code = t1_data.t1_type_code)
  JoinCard
----------
       644

SQL>

(SQL2)

As it is seen from (EP1) execution plan actual rows was 644 since we also calculated it as 644 using SQL2.
If we want to execute the SQL for different filter values then we just need to specify the values in SQL2 and we will get the proper cardinality for the join. It means in this case the creation of additional virtual columns will not be required for those filter values.
But, our work has not completed. We just got a number for the join cardinality, so optimizer is not aware of that still. In this case I recommend using OPT_ESTIMATE hint in order to influence the optimizer. Yes you can say that it has not been documented yet. Right, but I cannot say that the hint is not usable. If you look at the “How To Use Optimizer Hints To Specify Cardinality For Join Operation (Doc ID 2402821.1)” MOS note then you can recognize that support actually recommends using OPT_ESTIMATE hint.
As a result:

SELECT   /*+ opt_estimate(join, (t1,dt), rows=644) */
       *
  FROM   t1, dim_table dt
 WHERE       t1.type_code = dt.type_code
         AND dt.object_type IN ('TABLE', 'INDEX')
         AND t1.owner_code IN (2, 3);

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

Plan hash value: 1679886631                                                                                          
                                                                                                                     
---------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           |      1 |        |    644 |00:00:00.01 |     527 |       |       |          |
|*  1 |  HASH JOIN         |           |      1 |    644 |    644 |00:00:00.01 |     527 |  1856K|  1856K|  895K (0)|
|*  2 |   TABLE ACCESS FULL| DIM_TABLE |      1 |      2 |      2 |00:00:00.01 |       6 |       |       |          |
|*  3 |   TABLE ACCESS FULL| T1        |      1 |   3048 |   3048 |00:00:00.01 |     520 |       |       |          |
---------------------------------------------------------------------------------------------------------------------
                                                                                                                     
Predicate Information (identified by operation id):                                                                  
---------------------------------------------------                                                                  
                                                                                                                     
   1 - access("T1"."TYPE_CODE"="DT"."TYPE_CODE")                                                                     
   2 - filter(("DT"."OBJECT_TYPE"='INDEX' OR "DT"."OBJECT_TYPE"='TABLE'))                                            
   3 - filter(("T1"."OWNER_CODE"=2 OR "T1"."OWNER_CODE"=3))

Summary

In the blog post we saw another strategy to estimate join size without creating additional virtual columns per filter value. In our example, there are frequency histogram for both join columns, so we can generalize the SQL2 for different types of histogram combinations. This approach completely solves the cardinality estimation for two table joins. In addition, we saw the power of extended statistics and their usability. As you paid attention, in this and previous cases we avoided to use general formula of join cardinality estimation. Actually, we avoided join selectivity factor. It should be noted that if there are more than two tables are involved in a join then it is not easy to achieve correct join cardinality and that is why the join selectivity plays important role. So, in this case optimizer calculates JS and uses it during the join processing. I will write about an approach to estimate better join cardinality for more than two table joins in a future blog post.

Comments (3)

October 2, 2018

Join Skew Handling Subject to Filter Conditions

Filed under: Cardinality Estimation,CBO,Join,Skew,Tuning — Chinar Aliyev @ 1:52 pm

I have written about the problem in my previous post but re-raising the mentioned problem by Jonathan Lewis enforced me to think about other possible solution of it. A suggested solution in Jonathan`s blog is applying fact table`s column distribution to the dimension table`s column. In my opinion, if a column distribution will not express its data by faked statistics, then it is not a good idea to apply the method to production systems. As he said the method does not work as expected from 12 R2 afterward. For some predicate values it works, but for other does not.
In this blog post I am going to describe another possible and supported way to correct the cardinality estimation of the join.
Let`s create the test case again

// tested in Oracle Database version: 18.3.0.0.0, 12.2.0.1.0, 12.1.0.2.0

create table dim_table (type_code number, object_type varchar2(10));
insert into dim_table values (1,'TABLE');
insert into dim_table values (2,'INDEX');
insert into dim_table values (3,'VIEW');
insert into dim_table values (4,'SYNONYM');
insert into dim_table values (5,'OTHER');
commit;
-- I do not create primary index for dim_table, because we consider the case for general purpose

exec dbms_stats.gather_table_stats(user,'dim_table');

create table t1 
nologging
as 
select 
        object_id, object_name, 
        decode(object_type, 'TABLE',1,'INDEX',2,'VIEW',3,'SYNONYM',4,5) type_code 
from 
        all_objects
where
where
        rownum <= 50000;
  
insert into t1 select * from t1;
insert into t1 select * from t1;
insert into t1 select * from t1;
  
exec dbms_stats.gather_table_stats(null, 't1', cascade=>true,method_opt=>'for all columns size 254');

The value of type_code is 1 for object_type=’TABLE’ then

explain plan for
select  *
from
        t1
where
        t1.type_code = 1; 

(SQL 1)

select * from table(dbms_xplan.display(null,null,null));

Plan hash value: 3617692013                                               
                                                                          
--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      | 14336 |   602K|   726   (1)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   | 14336 |   602K|   726   (1)| 00:00:01 |
--------------------------------------------------------------------------
                                                                          
Predicate Information (identified by operation id):                       
---------------------------------------------------                       
                                                                          
   1 - filter("T1"."TYPE_CODE"=1)   
  EP1

It means in my case, the cardinality of the following join should be calculated as 14336, but:

explain plan for
select 
        t1.object_id
from
        t1, dim_table dt
where
        t1.type_code   = dt.type_code 
and     dt.object_type = 'TABLE';

(SQL 2)

select * from table(dbms_xplan.display(null,null,null));
Plan hash value: 1679886631                                                     
                                                                                
--------------------------------------------------------------------------------
| Id  | Operation          | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           | 80000 |  1328K|   728   (1)| 00:00:01 |
|*  1 |  HASH JOIN         |           | 80000 |  1328K|   728   (1)| 00:00:01 |
|*  2 |   TABLE ACCESS FULL| DIM_TABLE |     1 |     9 |     2   (0)| 00:00:01 |
|   3 |   TABLE ACCESS FULL| T1        |   400K|  3125K|   725   (1)| 00:00:01 |
--------------------------------------------------------------------------------
                                                                                
Predicate Information (identified by operation id):                             
---------------------------------------------------                             
                                                                                
   1 - access("T1"."TYPE_CODE"="DT"."TYPE_CODE")                                
   2 - filter("DT"."OBJECT_TYPE"='TABLE')                                       
EP2

As in the previous blog posts was mentioned that in this case optimizer doesn’t estimate cardinality properly due to several of factors. First of all, we cannot say that in the case optimizer doesn`t consider the column distribution for estimating the join size. In contrast, it calculates join selectivity (JS) based on join columns distribution, so histograms data of the join columns are used to calculate JS, in other words these are used for the join cardinality. Let’s prove it then we will return our base query.

create table dim_table2 (type_code number, object_type varchar2(10));
 
insert into dim_table2 values (1,'TABLE');
insert into dim_table2 values (1,'TABLE');
insert into dim_table2 values (2,'INDEX');
insert into dim_table2 values (3,'VIEW');
insert into dim_table2 values (4,'SYNONYM');
insert into dim_table2 values (5,'OTHER');

commit;
 

exec dbms_stats.gather_table_stats(user,'dim_table2',method_opt=&gt;'for all columns size skewonly');
explain plan for
select 
        t1.object_id
from
        t1, dim_table2 dt
where
        t1.type_code   = dt.type_code 
and     dt.object_type = 'TABLE';

(SQL 3)

select * from table(dbms_xplan.display(null,null,null));

Plan hash value: 1964253718                                                      
                                                                                 
---------------------------------------------------------------------------------
| Id  | Operation          | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |            |   138K|  2292K|   728   (1)| 00:00:01 |
|*  1 |  HASH JOIN         |            |   138K|  2292K|   728   (1)| 00:00:01 |
|*  2 |   TABLE ACCESS FULL| DIM_TABLE2 |     2 |    18 |     2   (0)| 00:00:01 |
|   3 |   TABLE ACCESS FULL| T1         |   400K|  3125K|   725   (1)| 00:00:01 |
---------------------------------------------------------------------------------
                                                                                 
Predicate Information (identified by operation id):                              
---------------------------------------------------                              
                                                                                 
   1 - access("T1"."TYPE_CODE"="DT"."TYPE_CODE")                                 
   2 - filter("DT"."OBJECT_TYPE"='TABLE')
EP3

As you see, although number of distinct values of the columns (NDV) has not been changed but Join cardinality has been changed. NDV is 5 for both columns, it means Join Cardinality should be

JCARD=Filtered_CARD(DIM_TABLE2)* Filtered_CARD(T1)/5=2*400k/5=160000.

But it is not 160K, it was calculated 138K! The reason is that JS is calculated based on frequency information but not just based on NDVs in this case due to histograms.
Now if optimizer considers column distribution then what was wrong with (SQL 1) statement? The answer is the filter predicates. These directly influence the join cardinality and JS doesn`t depend on filter predicates. I pointed out it two years ago at Hotsos symposium 2016 – in this the paper. It means, if we find a way to remove filter predicates from the SQL1 in order to completely eliminate the influencing the join cardinality estimation then we will attain the desired result.
But how we could achieve it? The first thing come to my mind was using extended statistics and rewriting SQL statement to get the result. I thought it should work. So, what I have done? I have added following virtual column to the table and gathered histogram for that.

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS( null,'dim_table',
  METHOD_OPT =&gt; 'FOR ALL COLUMNS SIZE 254 ' ||
                'FOR COLUMNS SIZE 254 (CASE "OBJECT_TYPE" WHEN ''TABLE'' THEN "TYPE_CODE"  END )' );
END;
/

select column_name,
       num_distinct,
       num_buckets,
       histogram from user_tab_col_statistics
where table_name='DIM_TABLE'

COLUMN_NAME                          NUM_DISTINCT NUM_BUCKETS  HISTOGRAM
------------------------------------ ------------ -----------  -----------
TYPE_CODE                            5            5            FREQUENCY
OBJECT_TYPE                          5            5            FREQUENCY
SYS_STU#S#WF25Z#QAHIHE#MOFFMM_       1            1            FREQUENCY

As you see we have frequency histogram for each column including our virtual column. It is very important having histogram of the columns to take advantage of them in order to calculate JC correctly.
Now we can rewrite the original SQL1 with one following SQL statements and can see the estimation.

SELECT   t1.object_id
  FROM   t1, dim_table dt
 WHERE   t1.type_code =
             CASE dt.object_type 
             WHEN 'TABLE' THEN dt.type_code 
             END;
 (SQL4)

OR 

SELECT   t1.object_id
  FROM   t1, dim_table dt
 WHERE   t1.type_code = dt."SYS_STU#S#WF25Z#QAHIHE#MOFFMM1_"; 
 (SQL5)

These SQL2, SQL4, SQL5 are semantically equivalent queries. Explain plans of both SQL4 and SQL5 are same:

Plan hash value: 1679886631                                                     
                                                                                
--------------------------------------------------------------------------------
| Id  | Operation          | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           | 80000 |   781K|   728   (1)| 00:00:01 |
|*  1 |  HASH JOIN         |           | 80000 |   781K|   728   (1)| 00:00:01 |
|*  2 |   TABLE ACCESS FULL| DIM_TABLE |     1 |     2 |     2   (0)| 00:00:01 |
|   3 |   TABLE ACCESS FULL| T1        |   400K|  3125K|   725   (1)| 00:00:01 |
--------------------------------------------------------------------------------
                                                                                
Predicate Information (identified by operation id):                             
---------------------------------------------------                             
                                                                                
   1 - access("T1"."TYPE_CODE"=CASE "OBJECT_TYPE" WHEN 'TABLE' THEN             
              "TYPE_CODE" END )                                                 
   2 - filter(CASE "OBJECT_TYPE" WHEN 'TABLE' THEN "TYPE_CODE" END  IS          
              NOT NULL)   
EP4

Unfortunately, we have not got the expected cardinality. What is the reason? Is something wrong with this method?
Let`s look at the optimizer trace file to catch the problem.

***************************************
BASE STATISTICAL INFORMATION
***********************
Table Stats::
  Table: DIM_TABLE  Alias: DT
  #Rows: 5  SSZ: 0  LGR: 0  #Blks:  1  AvgRowLen:  10.00  NEB: 0  ChainCnt:  0.00  ScanRate:  0.00  SPC: 0  RFL: 0  RNF: 0  CBK: 0  CHR: 0  KQDFLG: 1
  #IMCUs: 0  IMCRowCnt: 0  IMCJournalRowCnt: 0  #IMCBlocks: 0  IMCQuotient: 0.000000
  Column (#3): SYS_STU#S#WF25Z#QAHIHE#MOFFMM_(NUMBER)
    AvgLen: 2 NDV: 1 Nulls: 4 Density: 0.500000 Min: 1.000000 Max: 1.000000
***********************
Table Stats::
  Table: T1  Alias: T1
  #Rows: 400000  SSZ: 0  LGR: 0  #Blks:  2664  AvgRowLen:  43.00  NEB: 0  ChainCnt:  0.00  ScanRate:  0.00  SPC: 0  RFL: 0  RNF: 0  CBK: 0  CHR: 0  KQDFLG: 1
  #IMCUs: 0  IMCRowCnt: 0  IMCJournalRowCnt: 0  #IMCBlocks: 0  IMCQuotient: 0.000000
  Column (#3): TYPE_CODE(NUMBER)
    AvgLen: 3 NDV: 5 Nulls: 0 Density: 0.017180 Min: 1.000000 Max: 5.000000
    Histogram: Freq  #Bkts: 5  UncompBkts: 400000  EndPtVals: 5  ActualVal: yes
try to generate single-table filter predicates from ORs for query block SEL$1 (#0)
finally: "T1"."TYPE_CODE"=CASE "DT"."OBJECT_TYPE" WHEN 'TABLE' THEN "DT"."TYPE_CODE" END  AND CASE "DT"."OBJECT_TYPE" WHEN 'TABLE' THEN "DT"."TYPE_CODE" END  IS NOT NULL

It is seen that optimizer detected virtual column for join processing but there is something missing. The trace file displays all basic column statistics for the virtual column but doesn`t talk about its frequency histogram! And why? The answer is that there is one bucket and there is only one number of distinct value for the virtual column. Therefore optimizer thought it is not necessary to look at the histogram data.
So, what is the next? What can we do for the next step?
We have one distinct value and four nulls for the column, let`s replace the null values with the other values (which not included in the table) in order to create more buckets for histogram. In this case, such histogram will trigger optimizer to consider frequency information from the histogram:

BEGIN
    DBMS_STATS.drop_extended_stats (
        NULL,
        'dim_table',
        '(CASE "OBJECT_TYPE" WHEN ''TABLE'' THEN "TYPE_CODE" END )');
END;
/
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS( null,'dim_table',
  METHOD_OPT =&gt; 'FOR ALL COLUMNS SIZE 254 ' ||
                'FOR COLUMNS SIZE 254 (CASE "OBJECT_TYPE" WHEN ''TABLE'' THEN "TYPE_CODE" ELSE (-1) END )' );
END;
/

select column_name,
       num_distinct,
       num_buckets,
       histogram from user_tab_col_statistics
where table_name='DIM_TABLE'

COLUMN_NAME                          NUM_DISTINCT NUM_BUCKETS  HISTOGRAM
------------------------------------ ------------ -----------  -----------
TYPE_CODE                            5            5            FREQUENCY
OBJECT_TYPE                          5            5            FREQUENCY
SYS_STUE_0VSFXQ_4T7QQLFLMZRDPV       2            2            FREQUENCY

Now, let`s see the execution plan of :

explain plan for
SELECT   t1.object_id
  FROM   t1, dim_table dt
 WHERE   t1.type_code =
             CASE dt.object_type 
             WHEN 'TABLE' THEN dt.type_code else -1 
             END;
select * from table(dbms_xplan.display);
Plan hash value: 1679886631                                                     
                                                                                
--------------------------------------------------------------------------------
| Id  | Operation          | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           | 14336 |   168K|   728   (1)| 00:00:01 |
|*  1 |  HASH JOIN         |           | 14336 |   168K|   728   (1)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| DIM_TABLE |     5 |    20 |     2   (0)| 00:00:01 |
|   3 |   TABLE ACCESS FULL| T1        |   400K|  3125K|   725   (1)| 00:00:01 |
--------------------------------------------------------------------------------
                                                                                
Predicate Information (identified by operation id):                             
---------------------------------------------------                             
                                                                                
   1 - access("T1"."TYPE_CODE"=CASE "OBJECT_TYPE" WHEN 'TABLE' THEN             
              "TYPE_CODE" ELSE (-1) END )                                       
  EP5
explain plan for
SELECT   t1.object_id
  FROM   t1, dim_table dt
 WHERE   t1.type_code = dt." SYS_STUE_0VSFXQ_4T7QQLFLMZRDPV"; 
 select * from table(dbms_xplan.display);
Plan hash value: 1679886631                                                     
                                                                                
--------------------------------------------------------------------------------
| Id  | Operation          | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           | 14336 |   168K|   728   (1)| 00:00:01 |
|*  1 |  HASH JOIN         |           | 14336 |   168K|   728   (1)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| DIM_TABLE |     5 |    20 |     2   (0)| 00:00:01 |
|   3 |   TABLE ACCESS FULL| T1        |   400K|  3125K|   725   (1)| 00:00:01 |
--------------------------------------------------------------------------------
                                                                                
Predicate Information (identified by operation id):                             
---------------------------------------------------                             
                                                                                
   1 - access("T1"."TYPE_CODE"=CASE "OBJECT_TYPE" WHEN 'TABLE' THEN             
              "TYPE_CODE" ELSE (-1) END )

As you see the last explain plans do not contain filter predicate anymore and we have got the correct cardinality. And the trace file shows that optimizer was going to use histogram of the virtual column for join processing:

***************************************
BASE STATISTICAL INFORMATION
***********************
Table Stats::
  Table: DIM_TABLE  Alias: DT
  #Rows: 5  SSZ: 0  LGR: 0  #Blks:  1  AvgRowLen:  13.00  NEB: 0  ChainCnt:  0.00  ScanRate:  0.00  SPC: 0  RFL: 0  RNF: 0  CBK: 0  CHR: 0  KQDFLG: 1
  #IMCUs: 0  IMCRowCnt: 0  IMCJournalRowCnt: 0  #IMCBlocks: 0  IMCQuotient: 0.000000
  Column (#3): SYS_STUE_0VSFXQ_4T7QQLFLMZRDPV (NUMBER)
    AvgLen: 4 NDV: 2 Nulls: 0 Density: 0.100000 Min: -1.000000 Max: 1.000000
    Histogram: Freq  #Bkts: 2  UncompBkts: 5  EndPtVals: 2  ActualVal: yes
***********************
Table Stats::
  Table: T1  Alias: T1
  #Rows: 400000  SSZ: 0  LGR: 0  #Blks:  2664  AvgRowLen:  43.00  NEB: 0  ChainCnt:  0.00  ScanRate:  0.00  SPC: 0  RFL: 0  RNF: 0  CBK: 0  CHR: 0  KQDFLG: 1
  #IMCUs: 0  IMCRowCnt: 0  IMCJournalRowCnt: 0  #IMCBlocks: 0  IMCQuotient: 0.000000
  Column (#3): TYPE_CODE(NUMBER)
    AvgLen: 3 NDV: 5 Nulls: 0 Density: 0.017180 Min: 1.000000 Max: 5.000000
    Histogram: Freq  #Bkts: 5  UncompBkts: 400000  EndPtVals: 5  ActualVal: yes
try to generate single-table filter predicates from ORs for query block SEL$1 (#0)
finally: "T1"."TYPE_CODE"=CASE "DT"."OBJECT_TYPE" WHEN 'TABLE' THEN "DT"."TYPE_CODE" ELSE (-1) END

Note: there is another way to solve the problem and will be published in a future blog post, stay tuned 🙂

Comments (6)

	Chinar Aliyev on Join Skew Handling Subject to…
	Andrew Markiewicz on Join Skew Handling Subject to…
	Disjunctive predicat… on Join Skew Handling Subject to…
	Disjunctive predicat… on Join Skew Handling Subject to…
	Vasily on A reason why SQL Plan Baseline…

Chinar Aliyev`s blog

December 7, 2021

Disjunctive predicates and column correlation

May 23, 2020

Serial Hash Join Performance

April 29, 2019

Parallel Shared Hash Join

April 10, 2019

Automatic Parallel Outer Join Null Handling

April 8, 2019

The Performance of Parallel Hash Join and Automatic Skew Handling

February 27, 2019

The Hash Join Buffered Operation

December 17, 2018

The Efficiency of Hybrid Histogram

November 6, 2018

Understanding Hybrid Histogram

October 19, 2018

Join Skew Handling Subject to Filter Conditions-2

October 2, 2018

Join Skew Handling Subject to Filter Conditions

Recent Posts

Recent Comments

Categories

Archives

Follow Blog via Email