Sponsored by PRISE Ltd.
www.prisetools.com
www.prisetools.com
How to eliminate product joins
What is product join?
Product join is one of the implementation methods of an SQL JOIN operation.Do not mix up with cross join (Cartesian product), which is one type of SQL joins.
SQL join types, eg.: inner join, left outer join, full outer join, cross (Cartesian) join
Join implementation types, eg.: nested join, merge join, hash join, product join.
Product join (of tables A and B ) is the most simple method of join implementation:
- Produce each of <A;B> record combinations, say take each records from A singly, and match it with each records of B one-by-one.
- Test the join condition on each produced <A;B> record pairs, and eliminate those combinations where the condition fails.
The two steps are often combined, and the "testing phase" is executed right after a record combination is generated, and the non valid combinations right after dropped. This saves a lot of temp space.
Why don't we like it?
Well, it has a really bad reputation. It is slow, stuffs CPU, etc.Yes, it usually is, does. It is the brute force method for executing a join, with costs in order of N*M (where N, M are the record numbers of the joinable tables)
Indeed there are situations when it is the best choice, or the only feasible way.
When is it good/necessary?
Please note that product join is the method what is always applicable, independently of all circumstances.
Good
Product join is typically simple, dumb and slow algorithm, this is why we do not like it, but has a very important advantage: requires no pre-processing.* This is why we LIKE IT:)If we have to join a really large table to a very small table (couple of records) product join is far the most effective method, since the sort of a very large table ( order of N*logN ) can cost a lot, while joining to 1-2 records is really not a big deal.
Necessary
There are join situations when the only way to go is the product join. Why? Because of the join condition. The "clever joins" (merge, hash) require some information and/or condition that somehow enables to cheat the A x B comparisons: reduce them to the ones that really necessary, and be done in a more effective manner.* OK, in Teradata this means: only requires that the matchable records from both tables must be on the same AMP. This implies the "small" table to be duplicated to all AMPs.
from A
join B on A.customer_id = B.customer_id
and A.trx_dt between B.eff_dt and B.exp_dt
- Customer_id clause is in AND condition with the others
- Customer_id is selective enough that hash(customer_id) can reduce the comparisons reasonably
- Note that A and B must be sorted (re-sorted) by the hash of customer_id
Product join example
from A
join B on substr(A.telephone_no,1,B.prefix_length) = B.telephone_no_prefix
- There is no comparison reducing partial-condition
- Note that neither of the tables required to be sorted in a specific order.
Unavoidable product joins
- Non-eqality condition
- Function used (eg. substr())
- Dependent expression is used (eg. A.x+B.y = A.z)
- Cross join: intentional Cartesian product
Avoidable product joins
Data type mismatch
The merge join example above works only if customer_no in A and B tables have the same "style" data types, since their hash value will match only in this case. Say hash(13674) <> hash('13674'), however integer is compatible with decimal, and char is compatible with varchar.
Pay attention on data type consistence during physical data modeling.
- Use domains to eliminate the possibility of mismatch
- Align to used data types when defining temp tables, or use "create table as ..." statements
- If you cannot avoid mismatch, relocate the necessary data to temp tables with proper data types during processing.
OR condition
Let's assume the following join condition:
select ...
from A
join B on A.col1 = B.Col1
OR
A.Col2 = B.Col2
select ...
from A
join B on A.col1 = B.Col1
UNION
select ...
from A
join B on A.Col2 = B.Col2
Missing/stale statistics
As I mentioned before product join is the most effective join between a very large and a really small (couple of records) table. If the optimizer thinks that a table is pretty small, but it is not indeed, it may choose a product join in all good faith, misleaded by a stale or missing statistics.
Define and keep fresh those statistics by the optimizer can determine the size of the joinable record sets properly.
How to find avoidable product joins
It is not trivial to list the avoidable product joins. Practically all product joins are required to be examined one-by-one and judged to be avoidable or not. And if avoidable, what to do for.
I strongly recommend to use PRISE Tuning Assistant for both finding the product joins and analyzing the possibility and necessity of elimination:
In this case I recommend to try the decomposition, described here.
It can help reducing the number of comparisons, saving CPU and runtime.
- List top consuming queries with product join(s)
- Check the PROD JOIN steps: which tables are processed that way
- Check those join conditions for cases described above
What to do if cannot be avoided?
It can help reducing the number of comparisons, saving CPU and runtime.
Have a successful optimization and happy new year!
Sponsored by PRISE Ltd.
www.prisetools.com
www.prisetools.com
Hi Akos,
ReplyDeleteCan you please explain what is the differences between product join and hash join, and benchmark for the optimizer to choose between the product join and hash join, which I believe in both product join and hash join the large table and smaller tables are not pre-processed.
"A Hash Join can only take place if one or both of the tables on each AMP can fit completely inside the AMP’s memory"
When you say very small rows in smaller table, then doesn't that fit on AMP's memory leading to HASH JON? Thank you for your time on this.
Hi Cheeli,
ReplyDeleteIt deserves a dedicated post, maybe I will write it later, but here is a short summary:
Product join is a simple brute force: try each records of table A to each records of table B. Works in all cases, Costs O(NxM), expensive, excluding the smaller table is really small.
The benchmark for decision is the expected COST, calculated by the statistics. Typically skewed content of join fields (in both small and big table) will reduce the probability of hash join.
Hash join is applicable in case of "equi-joins" ('=' conditions with ANDs ), like "Merge join". For Hash join some pre-processing (with linear costs) is required (hash table build, occassional fanning), but much lighter than Merge join: no need for a SORT of O(N*logN)!
The "AMP memory fitting condition" is false that way you cited.
The key area of hash joins: medium table joined to large table.
The SMALLER (called Build) table should fit into memory. Or if not, it can be partitioned into max. 50 partitions, where one partition must fit into the memory at a time.
Teradata may run more sub-types of hash joins (classic, dynamic, etc)
Hash join workslike this (seriously simplified): The SMALL table is supplemented by a join-hash value, transformed to a hash-table and duplicated to all AMPs (usually). The BIG (called Probe) table is scanned through record-by-record. For each BIG table records a "join-hash" value is calculated. All records in the SMALL table having that join-hash value are matched with that BIG table record, and qualifying ones will result the joined rows.
If the SMALL table is too big to fit into memory, it is divided into max of 50 "partitions" by hash range, and the BIG table is fanned out the same way (prepare process, with linear cost), in order to avoid multiple full-scans: the matching partitions' records will be matched.
It is a bit dense and inaccurate, a whole post would be required to explain all cases, details and "why"s. I hope it helped:)
@Akos thanks a lot,I was looking for fanned out into 9 hash
ReplyDeletejoin partitions, which is built locally on the AMPs.