Sponsored by PRISE Ltd.
www.prisetools.com
www.prisetools.com
How to use Join Index (JI) in Teradata
About JIs
Join index is most like a "materialized view", say it is a stored result of an SQL SELECT , like a table: you can define the primary index (PI) of the stored result.What are the main differences between a JI and a Secondary Index?
- Different internal structure and purposes
- Appears as separate object (not a sub-table) in the DBC tables/views
- Can reside in different database than the base table(s)
- Maintenance is logged in separate DBQL STEP, therefore cost can be easily measured
What is the difference between JI and a regular table in which I can store the same query result?
- You can't access it directly from an SQL query
- Optimizer automatically decides to be used or not:
- By logic: if the JI can support the query with its data content
- By cost: if usage of JI will result a "cheaper" execution (always collect statistics!)
- Maintained automatically if the content of the base table(s) change
- "Locks" the base table(s) against DROP, RENAME, MLOAD, FASTLOAD, RESTORE and ALTER the indexed column(s).
JI types
The Join Index types I will list below are not differentiated by SQL phrase, but the structure of the SQL SELECT used in the JI definition.
They can be combined also in reasonable ways, eg. <single table - aggregate - sparse> or <multi table - aggregate>, etc.Let's take these base tables for our examples:
CREATE TABLE TBL_A
(
Col1 integer
, Col2 integer
, Col3 integer
) PRIMARY INDEX (Col1)
;
CREATE TABLE TBL_B
(
Col1 integer
, Col2 integer
) PRIMARY INDEX (Col1)
;
Single table JI
This is the most simple case of a join index. We include only one table, and typically choose different PI than the base table has. There are two significantly different kinds of usage:- Non-covering
We select only the filtering column(s) (those will be the PI of the JI also) and the "Rowid" pseudo column in the JI definition. In this case the filter is strongly selective, and the rowids will be put to a spool to be joined to the base table's appropriate records. The JI can be very small this way, but note that we have an additional "join phase". - Covering
The JI is selecting all columns (or all columns required by the SQL to be supported) . That means that the base table is not required to satisfy the query at all. This will result very fast operation. This case is typically used for eliminating frequent table redistributions of some "central" tables.
This example shows a non-covering index for the query below:
create join index JI_1
as
SELECT Col1,Col2
FROM TBL_A
PRIMARY INDEX(Col2)
;
select Col3 from TBL_A where Col2=1;
as
SELECT Col1,Col2
FROM TBL_A
PRIMARY INDEX(Col2)
;
select Col3 from TBL_A where Col2=1;
Multi table JI
This kind of JI is for accelerating frequent join statements. Technically it stores the result of a join. It can cover all the columns of just store the key-pairs, or somewhere between.
create join index JI_2
as
SELECT a.Col3,b.Col1
FROM TBL_A a
join TBL_B b on a.Col2=b.Col2
PRIMARY INDEX(Col3);
create join index JI_2
as
SELECT a.Col3,b.Col1
FROM TBL_A a
join TBL_B b on a.Col2=b.Col2
PRIMARY INDEX(Col3);
Aggregate JI
The JI's SELECT contains GROUP BY clause. This case is for caching frequent aggregations. Typically can be very useful for supporting those OLAP applications, that do not have internal aggregation-caching methods Teradata's optimizer is quite clever, because it can recognize "intermediate" aggregate JIs for further aggregation instead using the base table. Example:create join index JI_3
as
SELECT Col1,Col2,sum(Col3) X
FROM TBL_A
GROUP BY 1,2
PRIMARY INDEX(Col2);
All three SELECTs can be served from the JI_3:
SQL1: select Col1 , sum(X) from TBL_A group by 1;
SQL1: select Col2 , sum(X) from TBL_A group by 1;
SQL1: select Col1,Col2, sum(X) from TBL_A group by 1,2;
Sparse JI
Thees JIs contain where clause, say the indexing is not for the whole table. If the where condition of the JI is a logical subset of the supported SQL's where condition than the JI can support the query.
Typical usage is on the transactional tables: we have frequent accessed on a transaction table by Customer_id, but have PI of Trx_id. We can observe that 90% of the SELECTs fall onto the last 30 days.
We can put a sparse non covering single table JI on the table with PI:Customer_id
90% of the selects will finish in no time, and 10% of the queries will result in a full table scan, meanwhile our JI size remains pretty small: we index only 30 days instead of eg. 365 days.
Please note that where condition of a sparse JI can not contain "changing" values like current_date. That means JI must be regularly redefined if a moving window sparse JI is to be maintained.
create join index JI_4
as
SELECT Col1,Col2,Col3 X
FROM TBL_A
WHERE Col3 between 1 and 10
PRIMARY INDEX(Col2);
Serveable query:
select Col1 from TBL_A where Col2=20 and Col3=4;
Non serveable query:
select Col1 from TBL_A where Col2=20 and Col3=11;
When will join index help?
Join indices typically serve two purposes (or the combination of them):- Store the result of frequently produced (part-)results
- Provide alternative access path to the data
Practical examples:
- Frequent aggregation or join operation
- Frequent redistribution of a table
- Non primary index based filtering on a table (causing full table scan)
- Non primary index based join of a table (causing redistribution)
Important things about JIs
Statistics
Always collect statistics on the PI of the JIs, it is essential for pushing optimizer to use themChoosing PI
Te PI of the JI will be the most effective access path to the JI's records, like in case of tables, use the regular PI choosing methodology. Mind frequent access (bí which column(s) are filtered or joined), distribution(skewness) and hash collision.
Sponsored by PRISE Ltd.
www.prisetools.com
www.prisetools.com
This comment has been removed by the author.
ReplyDeleteHi, Thanks for the post.
ReplyDeleteCould you please clarify this scenario?
I have a SQl like below,
Sel
a.col1,
a.col2
sum(a.col3)
b.col1
from a
join b ….
Now, the aggregation from table a is very huge and daily the table get loaded by 1 million records.
If the code logic validate history of records till date, how we could use JI here ?
Hi Agilan,
ReplyDeleteI do not have a lot of informations to give you exact answer. It is very important that you have batch load, or trickle feed (TPump). In the second case you may have significant overhead on JI maintenance.
On the other hand, you have to accept, that JI will not save space of the aggregation, but provides an aggregated (or joined or redistributed, etc...) data that can be automatically used by the optimizer, without changing the original SQL query.
Answering your question, you can define a JI, that is exactly equals to your query. You can measure the performance impact on the load job, and the size of the JI itself. Or - if the join condition allws - just store the aggregation of table 'a' and the join will be perfromed against the JI, instaed of table 'a'. You have to make tries and measurements.
Hi Akos,
ReplyDeleteI would like to have a clarification about JI.
As per the explain plan, my sql takes JI table and result in spool issue.
whereas if i remove JI, the SQl runs under my spool
P:S:- the main table of JI has SI,PPI and stats at the tables are 3 days old.
stats collected on the PI of JI.
Hi Agilan,
ReplyDeleteI should see the whole picture give exact answer. There are cases when optimizer is misleaded (by statistics or its internal rules and weights) and does not make the optimal decision. I am pretty sure that the out-of-spool step is not the retrive from the table/JI but some other subsequent step the results in redistribution or duplication-to-all-amps.
If you contact me on the LinkedIN I try to help you. I suggest you to download the trial of PRISE Tuning Assistant from https://prisetools.com/free-trial/tuning-assistant and visualize your query execution. You will see the difference instantly.
Ákos