Monday, December 30, 2013

Avoiding product joins

Sponsored by PRISE Ltd.
www.prisetools.com

How to eliminate product joins

What is product join?

Product join is one of the implementation methods of an SQL JOIN operation.
Do not confuse it with a cross join (Cartesian product), which is one of the SQL join types.

SQL join types, eg.: inner join, left outer join, full outer join, cross (Cartesian) join
Join implementation types, eg.: nested join, merge join, hash join, product join.

Product join (of tables A and B) is the simplest join implementation method:
  • Produce every <A;B> record combination: take each record of A and match it with each record of B one by one.
  • Test the join condition on each produced <A;B> pair, and drop those combinations where the condition fails.
The two steps are usually combined: the test is executed right after a record combination is generated, and the invalid combinations are dropped immediately. This saves a lot of temporary (spool) space.

Why don't we like it?

Well, it has a really bad reputation: it is slow, it eats CPU, etc.
Yes, it usually is and does. It is the brute force method of executing a join, with a cost in the order of N*M (where N and M are the record counts of the joined tables).

Indeed there are situations when it is the best choice, or the only feasible way.

When is it good/necessary?

Please note that product join is the one method that is always applicable, regardless of the circumstances.

Good

Product join is typically a simple, dumb and slow algorithm - this is why we do not like it - but it has a very important advantage: it requires no pre-processing.* This is why we LIKE IT:)
If we have to join a really large table to a very small one (a couple of records), product join is by far the most effective method, since sorting a very large table (in the order of N*logN) costs a lot, while joining each record to 1-2 records is really not a big deal.

Necessary

There are join situations when the only way to go is the product join. Why? Because of the join condition. The "clever" joins (merge, hash) require some information and/or condition that enables cheating the full A x B comparison: reducing it to the pairs that are really necessary, and doing it in a more effective manner.

* OK, in Teradata this only requires that the matchable records of both tables reside on the same AMP. This typically implies that the "small" table is duplicated to all AMPs.

Merge join example

from A
join  B on A.customer_id = B.customer_id
         and A.trx_dt between B.eff_dt and B.exp_dt

  • The customer_id predicate is ANDed with the other conditions
  • Customer_id is selective enough that hash(customer_id) can reduce the comparisons reasonably
  • Note that A and B must be sorted (or re-sorted) by the hash of customer_id

Product join example

from A
join   B on substr(A.telephone_no,1,B.prefix_length) = B.telephone_no_prefix

  • There is no partial condition that could reduce the number of comparisons
  • Note that neither table is required to be sorted in a specific order.
Unavoidable product joins
  • Non-equality condition
  • Function used (eg. substr())
  • Dependent expression is used (eg. A.x+B.y = A.z)
  • Cross join: intentional Cartesian product

Avoidable product joins

Data type mismatch

The merge join example above works only if customer_id in tables A and B has the same "style" of data type, since their hash values will match only in this case. Say hash(13674) <> hash('13674'); however, integer is compatible with decimal, and char is compatible with varchar.
Pay attention to data type consistency during physical data modeling.
  • Use domains to eliminate the possibility of a mismatch
  • Align to the used data types when defining temp tables, or use "create table as ..." statements
  • If you cannot avoid the mismatch, relocate the necessary data to temp tables with the proper data types during processing (see the sketch below).
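
A minimal sketch of the relocation approach, assuming table B stores customer_id as VARCHAR while A uses INTEGER (the volatile table and its column list are illustrative only):

CREATE VOLATILE TABLE B_fix AS
(
  SELECT CAST(customer_id AS INTEGER) AS customer_id
       , eff_dt
       , exp_dt
  FROM B
) WITH DATA
PRIMARY INDEX (customer_id)
ON COMMIT PRESERVE ROWS;

SELECT ...
FROM  A
JOIN  B_fix ON A.customer_id = B_fix.customer_id
           AND A.trx_dt BETWEEN B_fix.eff_dt AND B_fix.exp_dt;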

OR condition

Let's assume the following join condition:
select ...
from A
join  B on A.col1 = B.Col1
        OR A.Col2 = B.Col2

This can be rewritten without the compulsory product join (note that UNION also eliminates duplicate result rows):

    select ... 
    from A
    join  B on A.col1 = B.Col1 

    UNION 
    select ...
    from A
    join  B on A.Col2 = B.Col2


    Missing/stale statistics

As I mentioned before, product join is the most effective method for joining a very large table to a really small one (a couple of records). If the optimizer thinks a table is pretty small while in fact it is not, it may choose a product join in good faith, misled by stale or missing statistics.
Define and keep fresh the statistics that let the optimizer determine the size of the joined record sets properly.
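
A minimal sketch, reusing the A and B tables of the example above (adjust to your real tables and columns):

collect statistics on A column (customer_id);
collect statistics on B column (customer_id);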

    How to find avoidable product joins

It is not trivial to list the avoidable product joins. Practically every product join has to be examined one by one and judged to be avoidable or not - and if avoidable, what to do about it.
I strongly recommend using PRISE Tuning Assistant both for finding the product joins and for analyzing the possibility and necessity of their elimination (a do-it-yourself DBQL sketch also follows the list below):

    • List top consuming queries with product join(s)
    • Check the PROD JOIN steps: which tables are processed that way
    • Check those join conditions for cases described above
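
If you prefer a do-it-yourself check, here is a rough DBQL sketch in the spirit of the other queries on this blog (it assumes EXPLAIN logging is switched on; adapt the date filter and the CPU threshold to your site):

select top 50
  a.ProcID
, a.QueryID
, a.AMPCPUTime
, a.TotalIOCount
, substr(a.QueryText,1,500) QueryText
from dbc.DBQLogTbl a
join dbc.DBQLExplainTbl e on a.ProcID = e.ProcID and a.QueryID = e.QueryID
where cast(cast(a.StartTime as char(10)) as date) = '2013-12-30' (date)
  and e.ExplainText like '%product join%'
  and a.AMPCPUTime > 10 /* adapt to your system */
group by 1,2,3,4,5      /* one row per query, even if several explain rows match */
order by a.AMPCPUTime desc;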

    What to do if cannot be avoided?

In this case I recommend trying the decomposition technique described here.
It can help reduce the number of comparisons, saving CPU and runtime.

    Have a successful optimization and happy new year!


    Sponsored by PRISE Ltd.
    www.prisetools.com

    Thursday, December 19, 2013

    DBQL analysis II. - Monitor "Top skewed/impact queries"

    Sponsored by PRISE Ltd.
    www.prisetools.com

    Find Skewed CPU and I/O in DBQL

Please note that the solutions in this article work on the DBQL logs, which cover only the activity of those users for whom logging is switched on.

    About Skew

    If you are not comfortably familiar with Skewness or DBQL, please read the corresponding posts before:

    TeradataTuning - Skewness
    TeradataTuning - DBQL I
As you can see, the difference between Impact[Resource] and Total[Resource] is a net technological loss, which should be minimized.

Skewness is a good indicator for highlighting workload that wastes a lot of valuable CPU and I/O resources because of inefficient parallelism.

    Find bad queries in DBQL

With this SQL you can list the top impacting queries (replace the date value, and you may have to adjust the date filtering to your local settings):

    select  top 50
      ProcID
    , QueryID
    , AMPCPUTime
    , MaxAMPCPUTime * (hashamp () + 1) CPUImpact
    , CAST (100 - ((AmpCPUTime / (hashamp () + 1)) * 100 / NULLIFZERO (MaxAMPCPUTime)) AS INTEGER) "CPUSkew%"
    , TotalIOCount
    , MaxAMPIO * (hashamp () + 1) IOImpact
    , CAST (100 - ((TotalIOCount / (hashamp () + 1)) * 100 / NULLIFZERO (MaxAMPIO) ) AS INTEGER) "IOSkew%"
    , AMPCPUTime * 1000 / nullifzero (TotalIOCount) LHR
    , TotalIOCount / nullifzero (AMPCPUTime * 1000) RHL
    , ParserCPUTime
    , Queryband
    , Substr(QueryText,1,2000) QueryText
    from
    /* For archived DBQL
        dbql_arch.dbqlogtbl_hst where logdate=1131201
    and ampcputime>0
    */
    /* For online DBQL*/
        dbc.dbqlogtbl where
        cast(cast(starttime as char(10)) as date) = '2013-12-18' (date)
        and ampcputime>0
    order by CPUImpact desc


    Explanation of extra fields:
ParserCPUTime: the time the parser spent on producing the execution plan. This can be high if the SQL is too complex or too much random-AMP sampling has to be done.
LHR/RHL: Larry Higa (and inverse Larry Higa) index. An empirical index that shows the CPU vs. I/O rate. By experience it should usually be around one (the exact value depends on your system configuration, but it is a constant for the system). If it is far from 1, that indicates CPU or I/O dominance, which means unbalanced resource consumption - but that is a different dimension than skew.
QueryBand: labels that sessions use to identify themselves within the DBQL logs
QueryText: the first characters of the query (how many are logged depends on the DBQL settings)

OK, we've listed the terrible top consumers, but what's next?
We have to identify those queries. If your ETL and analytics software is configured to use QueryBand properly (this area deserves a separate post...), you can find out which job or report issued the SQL; in any case, you can check the QueryText field.
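
For reference, a query band is set by the session (or transaction) itself before it issues its SQL; a minimal sketch with made-up label names:

SET QUERY_BAND = 'JobName=daily_load;ReportID=r_4711;' FOR SESSION;

The labels then appear in the Queryband column of DBQLogTbl for that session's queries.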

    If you want to get the full SQL text, select it from the DBQLSQLTbl (SQL logging needs to be switched on), replace the appropriate <procid> and <queryid> values:

    select SQLTextInfo from dbc.dbqlsqltbl where procid=<procid> and queryid=<queryid>
    order by SQLRowNo asc


You will get the SQL in several records, broken up into 30K blocks; simply concatenate them. Unfortunately the SQL will be poorly formatted; you can use PRISE Tuning Assistant to beautify and highlight it for easy reading.

    System level Skewness

    Totals

We've found those bad queries, nice. But what can we say about the whole system? What is the total parallel efficiency? Can we report how much resource was wasted due to bad parallel efficiency?
The answer is: yes, we can estimate it quite closely. We cannot calculate the exact value, because DBQL does not log AMP-level information for the query execution, only the most important metrics.
We also cannot account for the situation when several skewed queries run at the same time but peak on different AMPs. This reduces the system-level resource waste, but it is hard to calculate with, and its probability and effect are negligible here.

    select
      sum(AMPCPUTime) AMPCPUTimeSum
    , sum(MaxAMPCPUTime * (hashamp () + 1)) CPUImpactSum
    , sum(TotalIOCount) TotalIOCountSum
    , sum(MaxAMPIO * (hashamp () + 1)) IOImpactSum
    , cast(100 - (AMPCPUTimeSum / CPUImpactSum) * 100 as integer) "CPUSkew%"
    , cast(100 - (TotalIOCountSum / IOImpactSum) * 100 as integer) "IOSkew%"
    from
    /* For archived DBQL
        dbql_arch.dbqlogtbl_hst where logdate = '2013-12-18' (date)     

        and (ampcputime>0 or TotalIOCount > 0)
    */
    /* For online DBQL*/
        dbc.dbqlogtbl where
        cast(cast(starttime as char(10)) as date) = '2013-12-18' (date)
        and (ampcputime>0 or TotalIOCount > 0)


Look at the last two columns. That percentage of your CPU and I/O goes down the drain...

    Top bad guys

OK, let's check how many queries accumulate 5%, 10%, 25%, 50%, 75%, 90% of this loss.
Here you are (CPU version; transform it for I/O analogously):

    select 'How many queries?' as "_",min(limit5) "TOP5%Loss",min(limit10) "TOP10%Loss",min(limit25) "TOP25%Loss",min(limit50) "TOP50%Loss",min(limit75) "TOP75%Loss",min(limit90) "TOP90%Loss", max(rnk) TotalQueries, sum(ResourceTotal) "TotalResource", sum(ResourceImpact) "ImpactResource"
    from
    (
    select
     case when ResRatio < 5.00 then null else rnk end limit5
    ,case when ResRatio < 10.00 then null else rnk end limit10
    ,case when ResRatio < 25.00 then null else rnk end limit25
    ,case when ResRatio < 50.00 then null else rnk end limit50
    ,case when ResRatio < 75.00 then null else rnk end limit75
    ,case when ResRatio < 90.00 then null else rnk end limit90
    ,rnk
    , ResourceTotal
    , ResourceImpact
    from
    (
    select
      sum(ResourceLoss) over (order by ResourceLoss desc ) totalRes
    , sum(ResourceLoss) over (order by ResourceLoss desc  rows unbounded preceding) subtotalRes
    , subtotalRes *100.00 / totalRes Resratio
    , sum(1) over (order by ResourceLoss desc  rows unbounded preceding) rnk
    , ResourceTotal
    , ResourceImpact
    from
    (
    select
      AMPCPUTime ResourceTotal
    , (MaxAMPCPUTime * (hashamp () + 1)) ResourceImpact
    ,  ResourceImpact - ResourceTotal ResourceLoss
    /* For archived DBQL
    from dbql_arch.dbqlogtbl_hst where logdate=1131207
    and ampcputime>0
    */
    /* For online DBQL*/
    from dbc.dbqlogtbl where
    cast(cast(starttime as char(10)) as date) = '2013-12-18' (date)
    and ampcputime>0
    ) x
    ) y
    ) z
    group by 1



I expect you are a bit shocked now by how few queries waste how much golden resource.
I think we will agree that it is worth tuning those dozen queries - you can save on the order of 100K USD up to millions for your company annually, am I right?

    PRISE Tuning Assistant helps you to find those queries and to get the hang of how to accelerate them.

    Typical reasons of skewness - in a nutshell

    • Skewed tables: Bad choice of PI, Skewed data 
    • Bad execution plans (typically skewed redistributions)
      • Bad data model (normalization,data types,PI, etc.)
      • Missing or stale statistics
      • Too many joins (break up the query!)
    • Hash collision (load time problem)

    Sponsored by PRISE Ltd.
    www.prisetools.com

    Interpreting Skewness

    What does Skew metric mean?

    Overview

You can see the words "Skewness" or "Skew factor" in a lot of places regarding Teradata: documents, applications, etc. Skewed table, skewed CPU. It means something is wrong, but what does it mean exactly? How should we interpret it?

Let's do some explaining and a bit of simple math.

Teradata is a massively parallel system, where uniform units (AMPs) do the same tasks on the portion of data they are responsible for. In an ideal world all AMPs share the work equally; no one has to work more than the average. The reality is far colder: it is a rare situation when this equality (called "even distribution") exists.
It is obvious that uneven distribution causes poor efficiency in using the parallel infrastructure.

But how bad is the situation? That is exactly what skewness characterizes.

    Definitions

    Let "RESOURCE" mean the amount of resource (CPU, I/O, PERM space) consumed by an AMP.
Let AMPno be the number of AMPs in the Teradata system.

    Skew factor := 100 - ( AVG ( "RESOURCE" ) / NULLIFZERO ( MAX ("RESOURCE") ) * 100 )

    Total[Resource] := SUM("RESOURCE")

    Impact[Resource] := MAX("RESOURCE") * AMPno

    Parallel Efficiency := Total[Resource] / Impact[Resource] * 100

    or with some transformation:

    Parallel Efficiency := 100 - Skew factor
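
A quick worked example with made-up numbers: a 10-AMP system, where one AMP consumed 100 CPU seconds and the other nine consumed 25 CPU seconds each.

AVG("CPU") = (100 + 9 * 25) / 10 = 32.5
MAX("CPU") = 100
Skew factor = 100 - ( 32.5 / 100 * 100 ) = 67.5
Total[CPU] = 325
Impact[CPU] = 100 * 10 = 1000
Parallel Efficiency = 325 / 1000 * 100 = 32.5 ( = 100 - Skew factor )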

    Analysis

    Codomain

    0 <= "Skew factor" < 100

    "Total[Resource]" <= "Impact[Resource]"

    0<"Parallel Efficiency"<=100

    Meaning

Skew factor: this percentage of the consumed real (Impact) resources is wasted.
E.g. a 1 Gbyte table with a skew factor of 75 will allocate 4 Gbytes*

Total[Resource]: virtual resource consumption, the simple sum of the individual AMPs' consumption, measured as if the AMPs were independent systems

Impact[Resource]: the real resource consumption imposed on the parallel infrastructure

Parallel Efficiency: as the name says. E.g. a Skew factor of 80 means 20% parallel efficiency

* Theoretically a complementary workload (one that consumes less resource exactly on the AMP where my load has its excess) could compensate for the parallel inefficiency from the system's point of view, but its probability tends to zero.

    Illustration



    Skew := Yellow / (Yellow + Green) * 100 [percent]


    The "Average" level indicates the mathematical average of AMP level resource consumptions (Total[Resource]), while "Peak" is maximum of AMP level resource consumptions: the real consumption from "parallel system view" (Impact[Resource])

I will write a later post on finding skewed tables.
PRISE Tuning Assistant helps you find the top CPU and I/O consuming queries and get rid of their skewness.


    Tuesday, December 17, 2013

    Using Partitioned Primary Index

    How to use partitioned primary index (PPI)

    This post is about row partitioning and will not discuss columnar.

    What is partitioning?

    To explain it correctly, let's get back to the basics...
Each Teradata table (except the NoPI type) has a "Primary Index", aka PI, which is not a physical index but rather a logical construct: one or more columns of the table whose values are the input of the hashing method. The hash value determines two things:
• Which AMP will store the record
• The storing order of the records within the AMP
If the PI is non-partitioned, the records are stored in the order of the hash value of the PI.

If you use (row) partitioning, you define it at the Primary Index.
In this case Teradata associates a 2-byte (or 2/8-byte from V14.10) "partition code" with each record*, and the storing order is <partition code>,<hash_value> (aka RowKey).
That way partitions are not sub-tables or other physical objects; they only influence the record storing order.

    * This implies that no more than 64k(2bytes)/9Q(8bytes) partitions can exist. For details read the appropriate Teradata version's documentation.

    What is the difference between PPI and NUSI?

A NUSI (Non Unique Secondary Index) can serve similar purposes, but it is absolutely different.
A NUSI is a separate subtable, distributed analogously to the base table, but with a different (value) ordering.
    For details please read Teradata documentation.

    How to define?

    Non partitioned table:
    create table tablePI
    (
      Trx_id Integer
    , Trx_dt Date
    )
    PRIMARY INDEX (Trx_id)


Partitioned table:
create table tablePPI
(
  Trx_id Integer
, Trx_dt Date
)
PRIMARY INDEX (Trx_id, Trx_dt**)
PARTITION BY RANGE_N(Trx_dt BETWEEN DATE '2010-01-01' AND DATE '2013-12-31' EACH INTERVAL '1' DAY , NO RANGE, UNKNOWN)


    Highlights
    • **Partitioning key (Trx_dt here) can be part of the PI or not. This is very important, see below.
    • Partitioning can be single or multiple (MLPPI) levels***
    • RANGE_N or CASE_N functions can be used for determining partition code
• In a RANGE_N function the interval endpoints and the partition length must be constants.
    • NO RANGE and UNKNOWN partitions will store the out-of-intervals and null value records respectively
***MLPPI is a technique where multiple (nested) levels of partitioning are defined on the table. Logically it looks like sub-partitions, but in practice it only influences the calculation of the partition code values, which remain a single linear 2/8-byte value across the whole table.

    Pros - Cons of using PPI

PPI is a very useful feature, but not a silver bullet to be used everywhere. Consider the trade-offs:
    • (+) Partition elimination
      Only the relevant partitions are scanned while accessing data
    • (+) Interval filtering is supported
    • (+) Accelerates INSERTs
  If we load incremental data into a populated table, very likely fewer data blocks are affected, since only a few partitions are involved (if date is the partitioning basis)
    • (-) 2 or 8 bytes extra space allocation per record
    • (-) Compression is not allowed on PartKey column
    • (-) PartKey inclusion problem (see below)
    • (-) Partition elimination works only with literals
      Subselects cause full table scans

    Design aspects

    RANGE_N or CASE_N

These functions are used to define the partitioning. RANGE_N maps date (integer) intervals to partitions, while CASE_N works like a CASE-WHEN-THEN expression, where the outcome determines the partition.

Typically RANGE_N is used when we partition a transaction table by its date or timestamp, while CASE_N is popular in special cases like categorizing. You can use more columns in the logical expression, but take care: all of them must appear in the filter condition to enable partition elimination.

    RANGE_N: what interval size?

It depends on the granularity of the data, the granularity of the filtering and how long an interval must be stored in the table. Daily partitioning is usually ideal.

    RANGE_N: interval extension or intervals in advance?

If we load transactional data into our partitioned table, the date column we use as partitioning key is populated with later and later dates, while we have a finite partition range definition.
Partition ranges can be added to the RANGE_N definition periodically (the available method depends on the version; see the sketch below), or we can define partitions far in advance (365 partitions are required per year, and 65k partitions cover ~180 years, which is more than enough). Note that empty partitions do not allocate space.

    One of the methods above should be applied, otherwise the NO RANGE partition will grow extensively, which will cause performance degradation due to less effective partition elimination.
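
For the periodic extension, something along these lines worked in the V13.10/V14 era - treat the exact syntax and its restrictions as version dependent and check your documentation before using it:

ALTER TABLE tablePPI
MODIFY PRIMARY INDEX
ADD RANGE BETWEEN DATE '2014-01-01' AND DATE '2014-12-31' EACH INTERVAL '1' DAY;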

    Partitioning Key: include in PI or not?

This is the tricky point.
The partitioning key is the column (or columns) that determines the partition, i.e. the one used in the RANGE_N/CASE_N definition. We can include it in the Primary Index or not - we decide.

    Let's take an example. We have a master-detail pair of tables, nicely "equi-PI"-ed for effective join:

    CREATE TABLE ORDER_HEAD
    (
      ORDER_NO INTEGER
    , ORDER_DT DATE
    ) UNIQUE PRIMARY INDEX (ORDER_NO);

    CREATE TABLE ORDER_ITEM
    (
      ORDER_NO INTEGER
, ORDER_ITEM_NO INTEGER
    , PROD_NO INTEGER
    ) PRIMARY INDEX (ORDER_NO);


    We modify ORDER_HEAD's PI:
    UNIQUE PRIMARY INDEX (ORDER_NO, ORDER_DT)

    PARTITION BY RANGE_N(ORDER_DT BETWEEN DATE '2010-01-01' AND DATE '2013-12-31' EACH INTERVAL '1' DAY , NO RANGE, UNKNOWN)

    Should we include ORDER_DT or not? Which is better, what is the difference?
• Not include
  ORDER_HEAD and ORDER_ITEM will have similar AMP distribution, but a different physical order within the AMPs.
  Each join operation either requires sorting the selected ORDER_HEAD records in spool, or the ORDER_ITEM table is merge joined sequentially against each selected non-empty partition of ORDER_HEAD (a "sliding-window" merge join)
• Include
  ORDER_HEAD and ORDER_ITEM will have different AMP distributions, so each join operation requires a redistribution. Why don't we use the same PI on ORDER_ITEM? Because we do not have that column there.
Neither of the above is acceptable in many cases. What should we do? In this case I would copy ORDER_DT to the ORDER_ITEM table as well, and use the same "included" version of the PI. It requires some more space and some load-time logic, but gives a great gain when accessing the data (see the sketch below).
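
A sketch of that layout, building on the example above (the copied ORDER_DT column and the date range are assumptions to be adapted to your data):

CREATE TABLE ORDER_ITEM
(
  ORDER_NO      INTEGER
, ORDER_ITEM_NO INTEGER
, PROD_NO       INTEGER
, ORDER_DT      DATE  /* copied from ORDER_HEAD at load time */
) PRIMARY INDEX (ORDER_NO, ORDER_DT)
PARTITION BY RANGE_N(ORDER_DT BETWEEN DATE '2010-01-01' AND DATE '2013-12-31' EACH INTERVAL '1' DAY , NO RANGE, UNKNOWN);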

    Use cases

    Filtering

    This select will eliminate all partitions except those three:
    select * from ORDER_HEAD where order_dt between '2013-12-12' (date) and '2013-12-14' (date);

This select will generate an all-rows scan:
    select * from ORDER_HEAD where cast( order_dt as char(7)) = '2013-12';

This select will also* generate an all-rows scan (sub-query):
    select * from ORDER_HEAD  where order_dt in (select max(calendar_date) from sys_calendar.calendar  where year_of_calendar=2013 and month_of_year=5);
Why? The optimizer has to determine which partitions to access at the time the execution plan is generated. At that time it cannot know the result of the subquery. That is it.

* I got a proper comment asking to double-check this item. Yes, right, this information is out of date. With current versions of Teradata (V13.10..V14.10) I experienced 3 different results:
    • Full scan
      Eg. sub-query contains a "group by"
    • Dynamic partition elimination
      Sub-query is simple, indicates "enhanced by dynamic partition elimination" section in the plan
• Plan-time partition elimination
  Literal condition or very simple sub-query. Parse-time evaluation enables the optimizer to determine which partitions to scan. Plan: "...We do an all-AMPs ... step from 3 partitions of...". I do not really know exactly what decides between full scan, dynamic or plan-time elimination... Explanations are welcome.

    Join

We join two tables, T1 and T2. The cases below show what happens depending on whether the tables are partitioned and whether the partitioning key (b) is included in the PI. Only one direction is listed for each combination, since the cases are symmetric in T1 and T2:

T1 PI:(a) x T2 PI:(a)
  Join: T1.a=T2.a
  RowHash match merge join

T1 PI:(a) PART(b) x T2 PI:(a)
  Join: T1.a=T2.a
  T1 sorted by hash(a), or sliding-window MJ

T1 PI:(a) PART(b) x T2 PI:(a) PART(b)
  Join: T1.a=T2.a
  T1&T2 sorted by hash(a), or sliding-window MJ (NxM partition combinations)
  Join: T1.a=T2.a and T1.b=T2.b
  T1&T2 are already sorted by RowKey: RowKey based MJ

T1 PI:(a,b) PART(b) x T2 PI:(a)
  Join: T1.a=T2.a
  T1 redistributed & sorted by hash(a)

T1 PI:(a,b) PART(b) x T2 PI:(a) PART(b)
  Join: T1.a=T2.a
  T1 redistributed by hash(a), T2 sorted by hash(a), then MJ
  Join: T1.a=T2.a and T1.b=T2.b
  T2 redistributed and sorted by RowKey: RowKey based MJ

T1 PI:(a,b) PART(b) x T2 PI:(a,b) PART(b)
  Join: T1.a=T2.a and T1.b=T2.b
  RowKey based MJ


    Insert

Let's take a transaction table like ORDERS. In practice we load it periodically (e.g. daily) with a new increment, which is typically focused on a short interval of transaction date/time. If the ORDERS table is not partitioned, the excellent hashing algorithm will spread the new records evenly over all the data blocks of the table, therefore Teradata has to modify far more data blocks than the increment itself occupies.

But if the ORDERS table is partitioned, the physical order of the records is primarily determined by the partitioning key. This means that the increment will reside in very few partitions, close together, and the insert operation has to write approximately the same number of blocks that the increment occupies.

For more details on PPIs please refer to the documentation of the appropriate Teradata version.

    To be continued...

    Tuesday, December 10, 2013

    DBQL analysis I. - Monitor "Top CPU consumers"

    Sponsored by PRISE Ltd.
    www.prisetools.com

    CPU usage distribution

    About DBQL

    What is it?


    DataBase Query Logging.
It is a nice feature of the Teradata RDBMS that comprehensively logs the execution of the issued queries - if it is switched on.

The configuration can be checked/administered e.g. in the Teradata tools or in DBC.DBQLRuleTbl.
Logging can be set at global/user level, and with respect to the level of detail (see DBQL tables).

For detailed information please refer to the Teradata documentation of your version.

    DBQL tables


Table - Content
DBQLogTbl - Central table, one record for each query.
DBQLSQLTbl - The whole SQL command, broken up into 30K blocks
DBQLStepTbl - Execution steps of the query, one row for each step.
DBQLObjTbl - Objects participating in the query. Logged on different levels (db, table, column, index, etc.)
DBQLExplainTbl - English explain text, broken up into 30K blocks
DBQLXMLTbl - Explain in XML format, broken up into 30K blocks
DBQLSummaryTbl - The PEs' aggregate table, which accounts at the desired level.

The DBQL tables are logically organized into a 1:N structure, where DBQLogTbl is the master entity and the others (except DBQLSummaryTbl) are its children.
The join fields are ProcID and QueryID together, e.g.:
    ...
    from DBQLogTbl a
    join   DBQLStepTbl b on a.ProcID=b.ProcID and a.QueryID = b.QueryID
    ...
Unfortunately the PIs of the DBQL tables are not in sync with this logical PK-FK relation (even in the latest V14.10), therefore JOINed selects against the online DBQL tables are not optimal.

    Cost of using DBQL

DBQL basically consumes a negligible amount of processing resources, since it uses cached, batched writes and generates data proportional to the issued queries (the flush rate is a DBSControl parameter).
It is important to regularly purge/archive the logs from the DBC tables; Teradata has a recommendation for it. This keeps the PERM space consumption of DBQL low.
In an environment where ~1M SQLs are issued a day, comprehensive logging generates ~8..10GB of DBQL data daily without XML and Summary. Fewer SQLs generate proportionally less data.

It is worth switching on every option except XML and Summary, since the former generates a huge data volume (it roughly doubles it), and the latter is similar to the Acctg info. If you want to utilize them, switch them on too, of course. (A sketch of enabling the logging follows.)
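
A minimal sketch of switching comprehensive logging on for all users (the option list is an assumption - the available options and limits vary by version, see the documentation):

BEGIN QUERY LOGGING WITH SQL, EXPLAIN, OBJECTS, STEPINFO LIMIT SQLTEXT=2000 ON ALL;
/* ...and END QUERY LOGGING ON ALL; switches it off */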

    What is it good for?

    It contains:
    • Query run time, duration
    • Consumed resources
    • Environment info (user, default db, etc)
    • SQL text
    • Explain
    • Step resource info
    • Objects involved
    • Etc.
One can get a lot of useful aggregated and query-specific tuning information, some of which I will share on this blog.

    CPU usage distribution info

    (Everything applies to I/O also, just replace CPU with I/O, AMPCPUTime with TotalIOCount...)

    Do you think Query optimization is rewarding?


Yes, I know it is hard work to find out why ONE query runs sub-optimally, and what to do about it.

But guess how many queries consume what percentage of the processing resources (CPU) within a whole day's workload.
Take a guess and write down the number of queries you expect behind each of these CPU% levels: 5%, 10%, 25% and 50%.

Now run the query below, which will give you the answer (replace the date value, and you may have to adjust the date filtering to your local settings):

    select 'How many queries?' as "_",min(limit5) "TOP5%CPU",min(limit10) "TOP10%CPU",min(limit25) "TOP25%CPU",min(limit50) "TOP50%CPU", max(rnk) TotalQueries
    from
    (
    select
    case when CPURatio < 5.00 then null else rnk end limit5
    ,case when CPURatio < 10.00 then null else rnk end limit10
    ,case when CPURatio < 25.00 then null else rnk end limit25
    ,case when CPURatio < 50.00 then null else rnk end limit50
    ,rnk
    from
    (
    select
      sum(ampcputime) over (order by ampcputime desc ) totalCPU
    , sum(ampcputime) over (order by ampcputime desc  rows unbounded preceding) subtotalCPU
    , subtotalCPU *100.00 / totalCPU CPUratio
    , sum(1) over (order by ampcputime desc  rows unbounded preceding) rnk
    from
    (
    select *

    /* For archived DBQL
    from dbql_arch.dbqlogtbl_hst where logdate=1131201 

    and ampcputime>0
    */
    /* For online DBQL*/
    from dbc.dbqlogtbl where
    cast(cast(starttime as char(10)) as date) = '2013-12-10' (date) 

    and ampcputime>0
    ) x
    ) y
    ) z
    group by 1



    Are you surprised?
    I bet:
    • Less than 10 queries will consume 5% of the CPU
    • Less than  1% of the queries will consume 50% of the CPU
    Let's calculate.
How much does your Teradata system cost a year? You pay all of that for storage and processing capacity.
If you can save e.g. X% of CPU & I/O and X% of storage (e.g. by MVC optimization), you have saved X% of the price of the Teradata system, through:
    • Improved user experience (earlier load, faster responses)
    • Resources for additional reports and applications
• The possibility of postponing a very expensive Teradata hardware upgrade

    PRISE Tuning Assistant helps you to find those queries and to get the hang of how to accelerate them.

    Sponsored by PRISE Ltd.
    www.prisetools.com

    Friday, November 29, 2013

    Accelerate skewed joins

    Sponsored by PRISE Ltd.
    www.prisetools.com

    How to "re-parallelize" skewed joins

    Case description

Assume that we have 1M customers and 4M transactions, and our top customer produces 2.5% of all transactions. The others produce the remaining 97.5% of the transactions approximately evenly.
Scroll down to the bottom of the post for the sample table and data generator SQL.

Our task is to join the "Customer" and "Transact" tables on Customer_id.

    The join

    SELECT Customer_name, count(*)
    FROM Customer c
    JOIN Transact t ON c.Customer_id = t.Customer_id
    GROUP BY 1;


    We experience a pretty slow execution.
    On the ViewPoint we see that only one AMP is working, while others are not.

    What is the problem?
There are two separate subsets of the Transact table from a "joinability" point of view:
• The "Peak" part (records of the top customer(s))
  Very few customers have very many Transact records. A product join would be cost effective here.
• The "Even" part (records of the other customers)
  Many customers have few Transact records each, quite evenly distributed. A merge join would be ideal here.
Unfortunately the Optimizer has to decide: only one operation type can be chosen. It will choose the merge join, which consumes far less CPU time.

    Execution plan looks like this:

     This query is optimized using type 2 profile T2_Linux64, profileid 21.
      1) First, we lock a distinct D_DB_TMP."pseudo table" for read on a
         RowHash to prevent global deadlock for D_DB_TMP.t.
      2) Next, we lock a distinct D_DB_TMP."pseudo table" for read on a
         RowHash to prevent global deadlock for D_DB_TMP.c.
      3) We lock D_DB_TMP.t for read, and we lock D_DB_TMP.c for read.
      4) We do an all-AMPs RETRIEVE step from D_DB_TMP.t by way of an
         all-rows scan with a condition of ("NOT (D_DB_TMP.t.Customer_ID IS
         NULL)") into Spool 4 (all_amps), which is redistributed by the
         hash code of (D_DB_TMP.t.Customer_ID) to all AMPs.  Then we do a
         SORT to order Spool 4 by row hash.  The size of Spool 4 is
         estimated with low confidence to be 125 rows (2,125 bytes).  The
         estimated time for this step is 0.01 seconds.
      5) We do an all-AMPs JOIN step from Spool 4 (Last Use) by way of a
         RowHash match scan, which is joined to D_DB_TMP.c by way of a
         RowHash match scan.  Spool 4 and D_DB_TMP.c are joined using a
         merge join, with a join condition of ("D_DB_TMP.c.Customer_ID =
         Customer_ID").  The result goes into Spool 3 (all_amps), which is
         built locally on the AMPs.  The size of Spool 3 is estimated with
         index join confidence to be 125 rows (10,375 bytes).  The
         estimated time for this step is 0.02 seconds.
      6) We do an all-AMPs SUM step to aggregate from Spool 3 (Last Use) by
         way of an all-rows scan , grouping by field1 (
         D_DB_TMP.c.Customer_name).  Aggregate Intermediate Results are
         computed globally, then placed in Spool 5.  The size of Spool 5 is
         estimated with no confidence to be 94 rows (14,758 bytes).  The
         estimated time for this step is 0.02 seconds.
      7) We do an all-AMPs RETRIEVE step from Spool 5 (Last Use) by way of
         an all-rows scan into Spool 1 (all_amps), which is built locally
         on the AMPs.  The size of Spool 1 is estimated with no confidence
         to be 94 rows (8,742 bytes).  The estimated time for this step is
         0.02 seconds.
      8) Finally, we send out an END TRANSACTION step to all AMPs involved
         in processing the request.
      -> The contents of Spool 1 are sent back to the user as the result of
         statement 1.  The total estimated time is 0.07 seconds.

    How to identify


If you experience an extremely asymmetric AMP load, you can suspect this case.
    Find highly skewed JOIN steps in the DBQL (set all logging options on):

    select top 50
    a.MaxAMPCPUTime * (hashamp()+1) / nullifzero(a.CPUTime) Skw,a.CPUTime,a.MaxAMPCPUTime * (hashamp()+1) CoveringCPUTime,
    b.*
    from dbc.dbqlsteptbl a
    join dbc.dbqlogtbl b on a.procid=b.procid and a.queryid=b.queryid
    where
    StepName='JIN'
    and CPUtime > 100
    and Skw > 2
    order by CoveringCPUTime desc;




(Note: the covering CPU time is <No-of-AMPs> * <Max AMP's CPU time>. Virtually this amount of CPU is consumed, because of the asymmetric load on the system.)

    Or if you suspect a specific query, check the demography of the join field(s) in the "big" table:

    SELECT TOP 100 <Join_field>, count(*) Nbr
    FROM <Big_table> GROUP BY 1 ORDER BY 2 DESC;


If the top occurrences are spectacularly larger than the others (or than the average), the idea likely applies.


    Solution

Break the query into two parts: join the top customer(s) separately, and then all the others. Finally, union the results. (Sometimes additional modification is also required if the embedding operation(s) - the group by here - is/are not decomposable on the same parameter.)
    First we have to identify the top customer(s):

    SELECT TOP 5 Customer_id, count(*) Nbr
    FROM Transact GROUP BY 1 ORDER BY 2 DESC;

    Customer_id          Nbr
    ------------------------------
              345       100004
         499873                4
         677423                4
         187236                4
           23482                4
         
Replace the original query with this one:

    SELECT Customer_name, count(*)
    FROM Customer c
    JOIN Transact t ON c.Customer_id = t.Customer_id
    where t.Customer_id in (345)  

    /*
       ID of the top Customer(s). 
       If more customers are salient, list them, but max ~5
    */
    GROUP BY 1
    UNION ALL
    SELECT Customer_name, count(*)
    FROM Customer c
    JOIN Transact t ON c.Customer_id = t.Customer_id
    where t.Customer_id not in (345)  -- Same customer(s)
    GROUP BY 1
    ;

Be sure that Customer.Customer_id, Transact.Transaction_id and Transact.Customer_id have statistics!

This query is more complex, has more steps and scans the Transact table twice, but it runs much faster - you can check it.
But why? And how do we determine which "top" customers are worth handling separately?
Read ahead.

    Explanation

    Calculation


    Let's do some maths:
    Assume that we are on a 125 AMP system.
    Customer table contains 1M records with unique ID.
    We have ~4.1M records in the Transact table, 100k for the top customer (ID=345), and 4 for each other customers. This matches the 2.5% we assumed above.

If the Transact table is redistributed on hash(Customer_id), we will get ~33k records on each AMP, except AMP(hash(345)), where we'll get ~133k (33k + 100k).
That means this AMP will process ~4x more data than the others, therefore it runs 4x longer.
In other words, for 75% of this JOIN step's time 124 AMPs will DO NOTHING with the query.

Moreover, the preparation and the subsequent steps are also problematic: the JOIN is prepared by a redistribution that produces a strongly skewed spool, and the JOIN's result, built locally on the AMPs, is skewed as well.

    Optimized version

This query consumes moderately more CPU, but the load is distributed evenly across the AMPs, utilizing Teradata's full parallel capability.
It also contains a product join, but that is no problem: it joins 1 record to the selected ~100k Transact records, which is lightning fast.

Look at the execution plan of the broken-up query:


     This query is optimized using type 2 profile T2_Linux64, profileid 21.
      1) First, we lock a distinct D_DB_TMP."pseudo table" for read on a
         RowHash to prevent global deadlock for D_DB_TMP.t.
      2) Next, we lock a distinct D_DB_TMP."pseudo table" for read on a
         RowHash to prevent global deadlock for D_DB_TMP.c.
      3) We lock D_DB_TMP.t for read, and we lock D_DB_TMP.c for read.
      4) We do a single-AMP RETRIEVE step from D_DB_TMP.c by way of the
         unique primary index "D_DB_TMP.c.Customer_ID = 345" with no
         residual conditions into Spool 4 (all_amps), which is duplicated
         on all AMPs.  The size of Spool 4 is estimated with high
         confidence to be 125 rows (10,625 bytes).  The estimated time for
         this step is 0.01 seconds.
      5) We do an all-AMPs JOIN step from Spool 4 (Last Use) by way of an
         all-rows scan, which is joined to D_DB_TMP.t by way of an all-rows
         scan with a condition of ("D_DB_TMP.t.Customer_ID = 345").  Spool
         4 and D_DB_TMP.t are joined using a product join, with a join
         condition of ("Customer_ID = D_DB_TMP.t.Customer_ID").  The result
         goes into Spool 3 (all_amps), which is built locally on the AMPs.
         The size of Spool 3 is estimated with low confidence to be 99,670
         rows (8,272,610 bytes).  The estimated time for this step is 0.09
         seconds.
      6) We do an all-AMPs SUM step to aggregate from Spool 3 (Last Use) by
         way of an all-rows scan , grouping by field1 (
         D_DB_TMP.c.Customer_name).  Aggregate Intermediate Results are
         computed globally, then placed in Spool 5.  The size of Spool 5 is
         estimated with no confidence to be 74,753 rows (11,736,221 bytes).
         The estimated time for this step is 0.20 seconds.
      7) We execute the following steps in parallel.
           1) We do an all-AMPs RETRIEVE step from Spool 5 (Last Use) by
              way of an all-rows scan into Spool 1 (all_amps), which is
              built locally on the AMPs.  The size of Spool 1 is estimated
              with no confidence to be 74,753 rows (22,052,135 bytes).  The
              estimated time for this step is 0.02 seconds.
           2) We do an all-AMPs RETRIEVE step from D_DB_TMP.t by way of an
              all-rows scan with a condition of ("D_DB_TMP.t.Customer_ID <>
              3454") into Spool 9 (all_amps), which is redistributed by the
              hash code of (D_DB_TMP.t.Customer_ID) to all AMPs.  The size
              of Spool 9 is estimated with high confidence to be 4,294,230
              rows (73,001,910 bytes).  The estimated time for this step is
              1.80 seconds.
      8) We do an all-AMPs JOIN step from D_DB_TMP.c by way of an all-rows
         scan with a condition of ("D_DB_TMP.c.Customer_ID <> 3454"), which
         is joined to Spool 9 (Last Use) by way of an all-rows scan.
         D_DB_TMP.c and Spool 9 are joined using a single partition hash
         join, with a join condition of ("D_DB_TMP.c.Customer_ID =
         Customer_ID").  The result goes into Spool 8 (all_amps), which is
         built locally on the AMPs.  The size of Spool 8 is estimated with
         low confidence to be 4,294,230 rows (356,421,090 bytes).  The
         estimated time for this step is 0.72 seconds.
      9) We do an all-AMPs SUM step to aggregate from Spool 8 (Last Use) by
         way of an all-rows scan , grouping by field1 (
         D_DB_TMP.c.Customer_name).  Aggregate Intermediate Results are
         computed globally, then placed in Spool 10.  The size of Spool 10
         is estimated with no confidence to be 3,220,673 rows (505,645,661
         bytes).  The estimated time for this step is 8.46 seconds.
     10) We do an all-AMPs RETRIEVE step from Spool 10 (Last Use) by way of
         an all-rows scan into Spool 1 (all_amps), which is built locally
         on the AMPs.  The size of Spool 1 is estimated with no confidence
         to be 3,295,426 rows (972,150,670 bytes).  The estimated time for
         this step is 0.32 seconds.
     11) Finally, we send out an END TRANSACTION step to all AMPs involved
         in processing the request.
      -> The contents of Spool 1 are sent back to the user as the result of
         statement 1.  The total estimated time is 11.60 seconds.


    Sample structures

    The table structures (simplified for the example):


    CREATE TABLE Customer
    (
      Customer_ID   INTEGER
    , Customer_name VARCHAR(200)
    )
    UNIQUE PRIMARY INDEX (Customer_id)
    ;

    insert into Customer values (1,'Cust-1');
    Run 20x: 

    insert into Customer select mx + sum(1) over (order by Customer_id rows unbounded preceding) id, 'Cust-' || trim(id) from Customer cross join (select max(Customer_id) mx from Customer) x;

    collect statistics using sample on customer column (Customer_id);

    CREATE TABLE Transact
    (
      Transaction_ID   INTEGER
    , Customer_ID      INTEGER
    )
    UNIQUE PRIMARY INDEX (Transaction_id)
    ;

    insert into Transact values (1,1);
    Run 22x: 

    insert into Transact select mx + sum(1) over (order by Transaction_id rows unbounded preceding) id, id mod 1000000 from Transact cross join (select max(Transaction_id) mx from Transact) x;

    insert into Transact select mx + sum(1) over (order by Transaction_id rows unbounded preceding) id, 345 from Transact t cross join (select max(Transaction_id) mx from Transact) x where t.Transaction_id < 100000;

    collect statistics using sample on Transact column (Customer_id);

    collect statistics using sample on Transact column (Transaction_id) ;

    Sponsored by PRISE Ltd.
    www.prisetools.com

    Thursday, November 28, 2013

    Optimizing Multi Value Compression

    Sponsored by PRISE Ltd.
    www.prisetools.com

    Teradata MVC optimization
    Techniques and effects

    What is Multi Value Compression (MVC)?

The Teradata RDBMS supports a nice feature: multi value compression. It reduces the storage space allocated by the tables in the database, while - this is incredible - processing compressed data usually requires less resource (CPU and I/O) than processing uncompressed data.
The feature needs no additional license or hardware components.

    How does MVC work?

Here is a short summary; if you are interested in the details, please refer to the Teradata documentation.

MVC can be defined in the CREATE TABLE DDL or added/modified later by ALTER TABLE statements. The user must define a list of 1..255 values for each compressible column. Those values will be stored compressed, while the others remain uncompressed.
If a column is compressed, each row has an additional area of 1..8 bits allocated (if N values are listed, ceiling(log2(N+1)) bits are allocated). One bit combination means that the value is uncompressed (and allocates its regular space within the row layout), while all the others identify a compressed value, which does not allocate the value's place in the row.
The compress bits are allocated in every row, regardless of whether the actual value is compressed or not.
Compress bits are "compacted", e.g. 3 + 8 + 4 = 15 compress bits will allocate 2 bytes with only 1 wasted bit, instead of 3 byte-aligned values.
The values belonging to the bit combinations are stored in the table header.
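
A minimal sketch of the column-level definition (the table, the columns and the value lists are made-up examples; the ALTER form and its restrictions vary by Teradata version):

CREATE TABLE customer_address
(
  customer_id INTEGER
, city        VARCHAR(30) COMPRESS ('Budapest','Wien','London')
, zip_code    CHAR(4)     COMPRESS ('1011','1121','9400')
) PRIMARY INDEX (customer_id);

/* Later adjustment of the value list on an existing column - check your version's ALTER TABLE rules */
ALTER TABLE customer_address ADD city COMPRESS ('Budapest','Wien','London','Paris');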

    Multi Value Compression is:
    • Column level
      Have to be defined on each applicable columns of a table separately
    • "Manual"
  You have to calculate which values are worth compressing - Teradata gives you no automation
    • Static
  Once you have defined the values, the compression will not adapt to changing conditions by itself
It is obvious that the currently optimal compression settings depend on the data demography and the applied data types. The optimal settings may be different later, when the data demography has changed.

    Summary of most important properties of MVC once again:
    • Can be defined in the CREATE TABLE statement
    • Can be applied or modified later in an ALTER TABLE statement
    • Must be set on COLUMN level
    • Optimal value list must be calculated by you
    • Optimal settings may change in time. Optimize regularly.

    Storage effects

Using MVC, tables allocate less PERM space - that much is simple to calculate.
What about the increment?
Table sizes usually grow over time as more and more data is generated. The change in the growth speed depends on the volatility of the data demography. If it is stable, the growth speed drops by the compression ratio. If the typical values change over time, the growth will not drop, or may even speed up in extreme cases. However, those are exactly the cases when regular optimization is necessary.

The growth looks like this in the stable-demography case:




    Performance effects

It is a key question: what do we have to pay for less storage?

It is obvious that compression requires resources during both the compress and the decompress phase. However, there are processing gains as well, which usually outweigh the costs. How?

A compressed table resides in proportionally fewer data blocks, therefore fetching its data requires fewer I/O operations. In addition, moving the data in memory (during processing) requires fewer CPU cycles.
A SELECT usually touches only a small fragment of each row, and the unused columns are not decompressed at all.
Caching is also a CPU-intensive operation, which is more effective if fewer data blocks have to be processed.
Compression helps tables to be treated as "small enough to be cached 100% into memory", which results in more effective execution plans.

    Summary:
• INSERTs into a compressed table usually consume 10..50% more CPU (in the final step only!)
• SELECTs usually cost no more, and often less, CPU than on uncompressed tables
• SELECTs and INSERTs usually cost proportionally less I/O, in line with the compression ratio
    • System level CPU and I/O usage usually drops by 5..10% (!) when compressing the medium and big tables of the system (caused by more effective caching)

    How to set MVC?

    Setting up the MVC compression on a single table should consist of the following 4 steps:
1. Analyze the data demography of each compressible column of the table.*1
2. Calculate the optimal compress settings for the columns. Notice that
  •   The optimum should be calculated not per column but at table level, since the compress bits are packed into whole bytes.
  •   The more values are listed for compression, the more overhead compression carries. A proper mathematical formula should be used to calculate the optimum.*2
  •   Take care of the exceptions: PI / FK / etc. columns and some data types are not compressible (this varies between Teradata versions).
    3. Assemble the corresponding scripts
      CREATE TABLE DDL + INSERT SELECT + RENAME / ALTER TABLE DDL
    4. Implement the compression by running the script
   Take good care of data protection: backups, locking, documentation.
*1 Simplified sample (run it for each column):
     select top 256 <columnX>, count(*), avg(<length of columnX>) from <table> group by 1 order by 2 desc;
*2 About the algorithm: it is a maximum search in n, based on the expression of the gain achieved when the TOP ((2^n)-1) most frequent values are compressed. The expression is too complex to discuss here, because of the different data types, exceptions and internal storage structures.

     One time or regular?

The optimal MVC setting is valid for a specific point in time, since your data changes along with your business. The daily change is usually negligible, but it accumulates.
Practice shows that it is worth reviewing the compress settings every 3..6 months, and continuously optimizing new tables a couple of weeks after they go into production.


    Estimate how much space and processing capacity is lost if compress optimization is neglected!

     

    Solution in practice

    There are "magic excels" on the net, which can calculate the optimal settings if you load the data demography, but it requires lots of manual work in addition (Running the calculations, DDL assembling, transformation script writing, testing, etc.)
     
    If you want a really simple solution, try PRISE Compress Wizard , that supplies a comprehensive solution:
    • Assists to collect good candidate tables to compress
    • Automatically analyses the tables, and gives feedback:
      • How much space can be saved by compress
      • What is the current compress ratio (if there is compress already applied)
      • How much resources were used for analysis
      • What is the optimal structure
    • Generates transforming script (+ checks, lock, logging) along with
      • Backup (arcmain)
      • Revert process (for safety and documentation)
      • Reverse engineering (for E/R documentation update)
    • Log implementation
      •  Reflect achieved space saving: success measurement
      •  Report used CPU and I/O resources for transformation

    Sponsored by PRISE Ltd.
    www.prisetools.com

    Thursday, November 14, 2013

    Curing slow INSERTs and CREATE TABLEs I.

    Sponsored by PRISE Ltd.
    www.prisetools.com

    Eliminating hash collisions

    Case description

We have an INSERT or CREATE TABLE operation that runs an unreasonably long time compared to the number of affected rows, despite the table not being skewed.

    What is hash collision?

    Hash collision is when two or more records in a table have the same hash value.

SET type tables ensure that there are no two records with exactly the same content within a table. How does Teradata do that?

Teradata stores the records in a hash filesystem, where each record has a hash value calculated from its Primary Index (PI) value. If several records have the same PI value, they will surely have the same hash value as well.

When INSERTing a record, Teradata only has to compare the new record with those existing records that have the same hash value, since records with a different hash value surely differ at least in the PI columns.
If we have to INSERT N records with the same hash value into an empty table, Teradata has to do N*(N-1)/2 - very CPU demanding - full-record comparisons.

    How to identify

Hash collisions can easily be found with the PRISE Tuning Assistant tool, or by following this method:

    DBQL filtering for qualifying queries:
The Merge (MRG) phase of the INSERT/CREATE TABLE operation consumes a lot of CPU.
    Look for high CPU consuming 'MRG' steps in the dbc.DBQLStepTbl:

    sel a.cputime,a.MaxAmpCPUTime * (hashamp() +1) CoveringCPUTIme,  a.stepname,a.RowCount,b.* from
         dbc.DBQLStepTbl a
    join dbc.DBQLogTbl   b on a.ProcId=b.ProcId and a.QueryId=b.QueryId
    where
        a.StepName in ('MRG' /*, 'MRU' for UPDATEs also*/)
    and a.CPUTime > 100 /* Performance boost: eliminates most of the records (small cpu seconds) at low processing cost. Adapt number to your site */
    qualify sum(1) over (order by a.cputime desc rows unbounded preceding) <= 100;

     

For a specific SQL statement (INSERT or CREATE TABLE) you have to check the PI of the target table for the level of hash collisions (the number of records sharing the same hash value).

How to make sure that hash collision is the reason? Let the target table be TableA, with primary index ColA,ColB,ColC (it can be any number of columns in practice):

    select top 100 hashrow(ColA,ColB,ColC), count(*) from TableA group by 1 order by 2 desc;


The top rows show the most frequent hash values. Count values >> 1 mean significant hash collisions: each high-frequency hash value forms a collision group that causes comparisons in the order of N*N.

If the table does not exist yet, embed the producing "SELECT" statement into the script above, and count the field values that would go into the PI columns (see the sketch below).
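
A sketch of that check, assuming the planned PI of the new table would be (ColA, ColB):

select top 100 hashrow(ColA, ColB), count(*)
from ( select ... /* the producing SELECT of the planned CREATE TABLE / INSERT */ ) x
group by 1 order by 2 desc;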


    Explanation

    If we use "SET" type of table (this is the default setting), Teradata ensures that there will be no perfectly alike records in the table. This can be ensured by comparing the inserted/updated record with the existing ones.
    Teradata's "hash filesystem" gives a very effective trick: only those records must be compared, whose RowID (hash) equals, otherwise at least the PI fields must differ.
    If we've chosen the Primary Index for UNIQUE, or non-UNIQUE, but on field(s) that are almost unique, then the "SET comparison"
    restricts to zero or one records in most cases.


    Solution

For a good solution, unfortunately, we have to modify the table structure (sketches below).
• Option 1: Change the table type to MULTISET. This eliminates the duplicate check, but also the duplicate protection: if the process relied on the de-duplication of the SET table, you have to replace it with programmed de-duplication (group by, left join, ...).
• Option 2: Change the table's PI to a unique or nearly unique column set. Be prudent and consider the workload as well (joins, where conditions, group by expressions, etc.).
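
Minimal sketches of the two options on a made-up detail table (the names and columns are purely illustrative):

/* Option 1: MULTISET - no duplicate check at INSERT time */
CREATE MULTISET TABLE trx_detail
(
  trx_id   INTEGER
, item_seq INTEGER
, amount   DECIMAL(12,2)
) PRIMARY INDEX (trx_id);

/* Option 2: keep SET, but choose a (nearly) unique PI, so each hash value identifies very few rows */
CREATE SET TABLE trx_detail_v2
(
  trx_id   INTEGER
, item_seq INTEGER
, amount   DECIMAL(12,2)
) PRIMARY INDEX (trx_id, item_seq);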

    Summary

Tables with a strongly non-unique PI are in great danger of hash-collision-slowed INSERTs/CREATE TABLEs (or UPDATEs), even if they are not skewed. Use a "more unique" PI, or a MULTISET table.

A typical mistake: a CREATE TABLE ... AS SELECT ... that lacks the PRIMARY INDEX () clause. In this case Teradata chooses the first column as PI, which often causes terrible performance.

    What's next

    Next post will discuss Multi Value Compress (MVC) optimization.

    Sponsored by PRISE Ltd.
    www.prisetools.com

    Wednesday, September 25, 2013

    Boost slow (LEFT/RIGHT) OUTER JOINs

    Sponsored by PRISE Ltd.
    www.prisetools.com

    How to optimize slow OUTER JOINs

    Case description


We have a (LEFT or RIGHT) OUTER JOIN that runs a long time and causes skewed CPU / spool usage. In practice, for most of the query execution time only 1 AMP works while the others have nothing to do, causing poor parallel efficiency.

    How to identify

The query typically runs a long time, contains a "MERGE JOIN" step in the Explain description, and that step shows skewed CPU consumption (MaxAMPCPUTime * number-of-AMPs >> AMPCPUTime) and lasts long.

    In DBQL you should look for skewed, high CPU usage queries (
    dbc.DBQLogtbl.MaxAMPCPUTime * (hashamp()+1) / nullifzero(dbc.DBQLogtbl.AMPCPUTime) > 1.2 and dbc.DBQLogtbl.AMPCPUTime > 1000, where the thresholds depend on system size) which also have a "left outer join ... merge join" expression in the execution plan text (dbc.DBQLExplainTbl.ExplainText like '%left outer joined using a merge join%'), as sketched below.
    This is only an approximation, since the step causing the skew may be a different one.
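
    A minimal DBQL search sketch assembled from the conditions above (the thresholds are examples only and depend on your system size; DBQL explain logging must be enabled for dbc.DBQLExplainTbl to be populated):

    select l.ProcID, l.QueryID, l.AMPCPUTime, l.MaxAMPCPUTime
    from dbc.DBQLogtbl l
    join dbc.DBQLExplainTbl e
      on l.ProcID = e.ProcID and l.QueryID = e.QueryID
    where l.MaxAMPCPUTime * (hashamp()+1) / nullifzero(l.AMPCPUTime) > 1.2   -- skewness filter
      and l.AMPCPUTime > 1000                                                -- high CPU filter
      and e.ExplainText like '%left outer joined using a merge join%';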

    PRISE Tuning Assistant supplies easy-to-use GUI based search function.

    Explanation


    Let's assume that we outer join Table1 and Table2 on a condition that causes no product join (merge join instead), eg.:

    select Table1.x,Table2.y
    from Table1
    LEFT JOIN Table2 on Table1.x = Table2.x
    ...


    If Table2 is not a "small" table, the Teradata optimizer will choose to "equi-distribute" the two tables (place matchable records on the same AMP) on the join field(s), in our case Table1.x and Table2.x respectively.
    If Table1.x contains a significant percentage of NULLs, the distribution will be skewed, since all "x is NULL" records land on the same AMP.
    We know that a NULL value never produces a join match, so these records are useless to examine, but they have to appear in the result set, since it is an OUTER JOIN.


    Solution

    Let's split Table1 into two separate subsets, NULL(x) and NOT NULL(x), and modify the select this way:

    select Table1.x,Table2.y
    from Table1
    LEFT JOIN Table2 on Table1.x = Table2.x
    where Table1.x is not null -- This will eliminate skewed spool
    UNION ALL
    select Table1.x,NULL
    from Table1
    where Table1.x IS NULL;



    A practical example:
    Some of our transactions are contributed by an operator; in this case OpID is filled, otherwise it is NULL. We would like to list the transactions together with the contributing operator's name, including the non-contributed ones. Most of the transactions are non-contributed (OpID is NULL).

    select
      a.Transaction_id, b.OperatorName as ContribOpName
    from TransactionT a
    LEFT JOIN OperatorT b on a.OpID = b.OpID;


    Optimized form:

    select
      a.Transaction_id, b.OperatorName as ContribOpName
    from TransactionT a
    LEFT JOIN OperatorT b on a.OpID = b.OpID
    where a.OpID is not null
    UNION ALL
    select
      a.Transaction_id, NULL as ContribOpName
    from TransactionT a
    where a.OpID is null;



    The execution will not cause skewed CPU / spool usage, because those records of Table1 that caused the peak (x is NULL) are excluded from the join processing.
    The second part supplies the "x is NULL" records to the result set without any join processing.

    The tradeoff is two full scans and a UNION ALL operation, which cost far less than a strongly skewed redistribution and join processing.

    What's next

    Next post will discuss unexpectedly slow INSERTs (hash collision).

    Sponsored by PRISE Ltd.
    www.prisetools.com

    Monday, September 23, 2013

    Accelerate PRODUCT JOIN by decomposition

    Sponsored by PRISE Ltd.
    www.prisetools.com

    How to optimize slow product joins

    Case description

    There are SQL queries that cannot be executed any other way than with the product join method, because of the content of the join condition. E.g.:
    • OR predicate
        Example:
          ON (a.x = b.x OR a.y = b.y)
    • BETWEEN / LIKE operator
        Examples:
          ON (a.x LIKE b.y)
          ON (a.x LIKE b.y || '%')
          ON (a.x between b.y and b.z)
      
    • Comparison (=) of different datatype fields
        Example (a.x is INTEGER, b.x is VARCHAR)
          ON (a.x = b.x)
    • Arithmetic expression usage
        Example
          ON (a.x = b.x + 1)
    • SQL function (eg. substr() ) or UDF usage
        Example
          ON (substr(a.x,1,characters(b.x)) = b.x) 
    Product join is a technique where the execution matches every record combination of the two joined tables and evaluates the join condition on each of them. Product join usually causes huge CPU consumption and long response times.


    How to identify

    The query typically runs for a long time, contains a "PRODUCT JOIN" step in the Explain description, and that step consumes high AMPCPUTime and lasts long. Such queries usually have an LHR index (Larry Higa Ratio, showing the CPU-to-I/O rate) >> 1, typically in the 10s, 100s or more.

    In DBQL you should look for high CPU usage queries (dbc.DBQLogtbl.AMPCPUTime > 1000, where the threshold depends on system size) which also have a "product join" expression in the execution plan text (dbc.DBQLExplainTbl.ExplainText like '%product join%'), as sketched below.
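
    A minimal DBQL search sketch for this case (the CPU threshold is an example only; adjust it to your system size, and note that DBQL explain logging must be enabled):

    select l.ProcID, l.QueryID, l.AMPCPUTime, l.QueryText
    from dbc.DBQLogtbl l
    join dbc.DBQLExplainTbl e
      on l.ProcID = e.ProcID and l.QueryID = e.QueryID
    where l.AMPCPUTime > 1000                     -- high CPU filter
      and e.ExplainText like '%product join%'
    order by l.AMPCPUTime desc;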

    PRISE Tuning Assistant supplies easy-to-use GUI based search function.

    Explanation of product join execution

    Let's assume that we join Table1 (N records, the bigger table) and Table2 (M records, the smaller table). Join processing requires that matchable record pairs reside on the same AMP. Since product join compares each Table1 record to each Table2 record, all records of one of the tables must reside on every AMP, therefore the PRODUCT JOIN is preceded by a "Duplicated to all AMPs" step on the smaller table.
    Each record pair is evaluated: if the JOIN condition is satisfied, the pair gets into the result spool, otherwise it is discarded.
    The number of required comparisons is N x M, and the cost (approximately the required CPU time) of one comparison depends on the complexity of the join expression.

    Solution

    In most cases the JOIN condition of the product join is satisfied by only a small fraction of all possible combinations. In practice we can identify a frequent situation:
    a significant subset of the bigger table's records can only match a small subset of the smaller table's records.
    Telco example: number analysis. Most of the call records are directed to national number areas (>90%), while the number area description table contains dominantly international number regions (>80..95%). We can declare that national calls will never match international areas. In addition, it is very simple to decide for both a "Number" and a "Number area" whether it is national or international.
    The base query looks like this:

    select Table1.Number,Table2.Area_code
    from Table1
    join Table2 ON Table1.Number BETWEEN Table2.Area_Start_number and Table2.Area_end_number;



    Let's decompose the query into two parts:

    select Table1.Number,Table2.Area_code
    from Table1
    join Table2 ON Table1.Number BETWEEN Table2.Area_Start_number and Table2.Area_end_number
    where substr(Table1.Number,1,1) = '1'   -- USA area code
    and substr(Table2.Area_Start_number,1,1) = '1' -- USA area code
    and substr(Table2.Area_end_number,1,1)   = '1' -- USA area code
    UNION ALL
    select Table1.Number,Table2.Area_code
    from Table1
    join Table2 ON Table1.Number BETWEEN Table2.Area_Start_number and Table2.Area_end_number
    where
    NOT (substr(Table1.Number,1,1) = '1')   -- non-USA area code
    and NOT
    (
        substr(Table2.Area_Start_number,1,1) = '1' -- non-USA area code
    and substr(Table2.Area_end_number,1,1)   = '1' -- non-USA area code
    );



    This way we added some low-cost operations (full scans on the tables to identify national/international numbers) and the cost of UNIONing the results, but we eliminated lots of trivially non-satisfying comparisons.

    The following figures illustrate the processing cost; the red area represents the number of comparisons, and therefore the cost:
    Figure 1: Original case
    Figure 2: Decomposed case


    Let's do some maths, with imagined combinations:
    N1: International calls
    N2: National calls
    M1: International area descriptions
    M2: National area descriptions
    90% of the calls (N) are national (N2), so 10% are international (N1)
    90% of the area descriptions (M) are international (M1), so 10% are national (M2)
    Originally we have to do N x M comparisons.
    The decomposed query must do only
    ((0.9 x N) x (0.1 x M)) + ((0.1 x N) x (0.9 x M)) = 0.09 x N x M + 0.09 x N x M = 0.18 x N x M

    The optimized query does only 18% of the original comparisons, with the tradeoff of two full scans (I/O intensive) of the base tables and one UNION ALL (low cost) of the results.

    In this case we get a roughly 4 times faster, CPU-saving execution.


    Sometimes it is worth decomposing into 3 or more phases, depending on the data.

    It is important that any further joins or transformations on the result data should be done on the UNION ALL-ed result, and should not be duplicated in the decomposed phases, for code management reasons.

    Why can't the Teradata PE do the same?
    The decomposition requires knowledge of the data and varies from query to query, which is currently beyond the scope and intelligence of an automatic optimizer.

    Summary

    Eliminate trivially invalid record pairs from the PRODUCT JOIN by breaking the query into more parts.

    What's next

    Next post will discuss slow OUTER (LEFT/RIGHT) JOIN.

    Sponsored by PRISE Ltd.
    www.prisetools.com

    Introduction

    Sponsored by PRISE Ltd.
    www.prisetools.com

    Who am I?

    My name is Ákos Lévai, and I have been working in Teradata environments since 2000.
    I am a Teradata V2R5 master, and my focus area is Teradata performance optimization.

    Goal of this blog

    My experience is that Teradata is a mysterious black box for lots of users and even developers: sometimes it is a lightning-fast rocket, but sometimes as slow as a snail.
    And they really do not know why.

    If we look behind the scenes, everything becomes clear. My goal is to demystify Teradata and help my fellows write effective SQL easily, instead of suffering from a lack of resources or keeping on buying new hardware.

    This blog is not a theoretical lecture and will not explain algorithms; it discusses practical problems and solutions in the following structure:
    • What is the problem?
    • How to identify it?
    • What is the solution?
    This blog is not a structured tutorial. The topics are in no particular order, just as they came to my mind.
    Search if you are interested in a specific topic, or read through just to pick up useful tricks and ideas.

    Comments are welcome!

    So let's uncover the mysteries...

    Sponsored by PRISE Ltd.
    www.prisetools.com