Monday, January 13, 2014

DBQL analysis III - Monitor "collect statistics"

Sponsored by PRISE Ltd.
www.prisetools.com

Analyze "collect statistics" in DBQL

Please note that the solutions found in the article works on the DBQL logs, which covers only that users' activity, for whom the logging is switched on. "Object" and "SQL" option in DBQL is required "on" to use the scripts provided.
This article is applicable up to V13.10 w/o modifications, statistics handling changed from V14. 

About Statistics

"Statistics" is a descriptive object in the Teradata database that are used by the optimizer for transforming SQLs to effective execution plans.
Statistics reflect the key data demographic information of one or more table column(s).
These objects should be created and maintained, the RDBMS will not do it by itself.
Statistics internally contain value histogram, which needs the table data (or sample) to be analyzed, which is an expensive task.

Summarized: appropriate statistics are required for getting good and effective executon plans for SQLs, but statistics consume resources to be collected or refreshed.

"Statistics" footprint in DBQL

When a "statistics" is created or refreshed it is executed by an SQL command: collect statistics....
This command will create a log entry into the DBQL if the logging is switched on.

One can track when, which "statistics" was collected, consuming how much CPU and I/O.
Those statements are very easy to identify in the central table:

select * from dbc.DBQLogTbl where StatementType='Collect statistics'

Analyzing DBQL data 

Prepare data

You may need to modify the script:
  • Date (interval)
  • Online/archived: use commented section
  • QueryBand: "JOB" variable is used, modify according to your ETL settings
create volatile table DBQLStat_tmp1
as
(
sel a.procId,a.QueryId,a.StartTime,a.AMPCpuTime,a.TotalIOCount

,case when a.querytext like '% sample %' then 'S' else 'F' end Full_Sample
,UserName,(FirstRespTime - StartTime) DAY(4) TO SECOND(4) AS RUNINTERVAL      
,(EXTRACT(DAY FROM RUNINTERVAL) * 86400 + EXTRACT(HOUR FROM RUNINTERVAL)  * 3600 + EXTRACT(MINUTE FROM RUNINTERVAL)  * 60 + EXTRACT(SECOND FROM RUNINTERVAL) ) (decimal(10,1)) Duration
,b.ObjectDatabaseName DatabaseName,b.ObjectTableName TableName,c.ObjectColumnName ColumnName
,case when d.SQLTextInfo like any ('%"PARTITION"%', '%,PARTITION %', '%,PARTITION,%', '% PARTITION,%', '% PARTITION %', '%(PARTITION,%', '%(PARTITION %', '%,PARTITION)%', '% PARTITION)%', '%(PARTITION)%') then 'Y' else 'N' end inclPartition
,CAST((case when index(queryband,'JOB=') >0 then  substr(queryband,index(queryband,'JOB=') ) else '' end) AS VARCHAR(500)) tmp_Q
,case when queryband = '' then 'N/A'
         when tmp_q = '' then '-Other'
else CAST( (substr(tmp_Q,characters('JOB=')+1, nullifzero(index(tmp_Q,';'))-characters('JOB=')-1)) AS VARCHAR(500)) end QB_info
,sum(1) over (partition by a.procid,a.Queryid order by c.ObjectColumnName, a.QueryID rows unbounded preceding) Rnk
from
/* For achived tables
     dbql_arch.DBQLogTbl_hst       a
join dbql_arch.DBQLObjTbl_hst      b on b.ObjectType='Tab' and a.procid=b.procid and a.QueryID=b.QueryID and a.logDate=b.logDate
left join dbql_arch.DBQLObjTbl_hst c on c.ObjectType='Col' and a.procid=c.procid and a.QueryID=c.QueryID and a.logDate=c.logDate
join dbql_arch.DBQLSQLTbl_hst      d on d.SQLRowNo=1       and a.procid=d.procid and a.QueryID=d.QueryID and a.logDate=d.logDate
where a.logDate=1140113
*/
/*end*/
/* For online tables */
     dbc.DBQLogTbl       a
join dbc.DBQLObjTbl      b on b.ObjectType='Tab' and a.procid=b.procid and a.QueryID=b.QueryID
left join dbc.DBQLObjTbl c on c.ObjectType='Col' and a.procid=c.procid and a.QueryID=c.QueryID
join dbc.DBQLSQLTbl      d on d.SQLRowNo=1       and a.procid=d.procid and a.QueryID=d.QueryID
where cast(cast(a.starttime as char(10)) as date) = '2014-01-13' (date)
/*end*/
and a.StatementType='Collect statistics'
) with data
primary index (procId,QueryId)
on commit preserve rows
;

create volatile table DBQLStat
as
(
WITH RECURSIVE rec_tbl
(
 procId,QueryId,StartTime,AMPCpuTime,TotalIOCount,Duration,Full_Sample,UserName,DatabaseName,TableName,QB_info,inclPartition,ColumnName,Rnk,SColumns
)
AS
(
select
 procId,QueryId,StartTime,AMPCpuTime,TotalIOCount,Duration,Full_Sample,UserName,DatabaseName,TableName,QB_info,inclPartition,ColumnName,Rnk,cast(case when ColumnName is null and inclPartition='Y' then '' else '('||ColumnName end as varchar(10000)) SColumns
from DBQLStat_tmp1 where Rnk=1
UNION ALL
select
  a.procId,a.QueryId,a.StartTime,a.AMPCpuTime,a.TotalIOCount,a.Duration,a.Full_Sample,a.UserName,a.DatabaseName,a.TableName,a.QB_info,a.inclPartition,a.ColumnName,a.Rnk,b.SColumns ||','||a.ColumnName
from DBQLStat_tmp1     a
join rec_tbl b on a.procId=b.ProcId and a.QueryId=b.QueryID and a.Rnk=b.Rnk+1
)
select   procId,QueryId,StartTime,AMPCpuTime,TotalIOCount,Duration,Full_Sample,UserName,DatabaseName,TableName,QB_info,Rnk NumOfColumns
        ,case when SColumns = '' then '(PARTITION)' else SColumns || case when inclPartition='Y' then ',PARTITION)' else ')' end end StatColumns
from rec_tbl qualify sum(1) over (partition by procid,queryid order by Rnk desc, QueryID rows unbounded preceding) = 1
) with data
primary index (procid,queryid)
on commit preserve rows
;

Reports

  • How many statistics has been collected for how much resources?

select
  UserName /*Or: DatabaseName*//*Or: Full_sample*/
, count(*) Nbr
, sum(AMPCpuTIme) CPU
, sum(TotalIOCount) IO
from DBQLStat
group by 1
order by 1
;
  • Which statistics has been collected multiple times?
    (If more days are in preapred data, frequency can be determined, erase "qualify")
select a.*,
sum(1) over (partition by databasename,tablename,statcolumns)  Repl
from DBQLStat a

/* Comment for frequency report*/
qualify sum(1) over (partition by databasename,tablename,statcolumns) > 1

/*end*/
order by repl desc, databasename,tablename,statcolumns
;


Sponsored by PRISE Ltd.
www.prisetools.com

Thursday, January 02, 2014

PRISE Tuning Assistant use cases

Sponsored by PRISE Ltd.
www.prisetools.com

What is PRISE Tuning Assistant (PTA) good for?

PRISE developed this tool to help DBAs and developers solve performance issues quickly and easily. In this post I describe a couple of typical situations when it can assist you.
A full functioning trial version of PRISE Tuning Assistant is available for download.

In general, PTA application consist of two main functions:
  • Query finder
    Enables applying lots of search criteria like: thresholds, skewness, queryband, SQL text, Explain text, used objects, time window, etc.
  • Query analyzer
    Unfolds the execution process of one selected query, helps understanding the query and its performance bottlenecks. Collects all required supplementary information like table sizes, DDLs, statistics, etc. in one comprehensive dashboard, and provides a "walking tour" feature, you can roam in the execution plan with.
All this enables even an average skilled Teradata expert to find which queries to be optimized and figure out the weak points of even a complex query in minutes.

PTA's key architectural properties:
  • Supports active DBQL (dbc.dbqlogtbl and friends...) and also archived DBQL tables.
  • Optimized DBQL querying (fast response times and low resource usage)
  • Requires read only access to database
  • Portable, multi platform Java application, requires no installation
Let's see typical use cases of the application.

Find the TOP consuming queries

The optimization's "low hanging fruits" are the top consuming queries, I mean CPU or I/O  stuffing SQL statements. They are often small in number, however consume significant percent of the total system resources, as I mentioned in this post.

How to find them?
Simply set your focused time interval and select the "CPU", "I/O" or "Duration" (for long time running ones) as TOP criteria, and click the "Get DBQL data" button. You will be given the list of the qualifying queries in descending order of your TOP criteria. After you can sort it by the other KPIs, like:
  • Skewness
  • Spool usage
  • LHR/RHL (Larry Higa ratio)
  • QueryBand (and specific variables of it)
  • Etc.

Skewed CPU or I/O usage

Teradata is a massive parallel system. It can utilize its full power if the work is spread evenly across the AMPs, otherwise (skewed queries) the response time increases dramatically while virtual resource consumption can remain low.

How to find the bad guys?
Start same as "TOP queries" case, but choose "CPU Impact" or "I/O Impact" as TOP criteria. This will result those queries that have the highest "Impact" (skew-normalized resource consumption) on the system.

"No more spool space"

Teradata uses Spool limiting in fact for cancelling those queries that have bad plans. Therefore "Failure 2646: No more spool space error" means a bad query execution plan, and the solution is usually not spool barrier raising. But often those spool limits are raised, until query is able to finish, and bad queries throttle the system... Why does a query usually run out of spool?
  • Duplicates a large spool/table to all AMPS
    Thinks it is small
  • Redistributes spool/table by a skewed value
    Thinks it is non-skewed
  • Processes (copies to spool) unreasonably large number of records
First two cases are usually caused by missing or stale statistics that can be easily found and supplied.

How to select?
Start same as "TOP queries" case, but choose "Spool" as TOP criteria. This will result those queries that have the highest spool consumption, and SQL dashboard will help you to figure out which statistics should be collected. In addition you can filter for "Error 2646" with Error code option, which will result those queries that has reached the spool limit.

Find a specific query

The DBA gets a question: "My query runs slow, please help me!" - occasionally an SQL is attached in the email. PTA lets figuring out which query was it in the DBQL. You can filter on:
  • Username
  • Session ID
  • Full query text (with LIKE)
  • QueryID (internal DBQL ID)
  • QueryBand (with LIKE)
  • Error code

Search disliked plan elements

PTA enables text search in Explain text of the queries. Eg. find quickly:
  • Product joins
  • Sliding-window joins
  • "Redistributed by"
If it is combined with CPU and/or I/O limitation (CPU >=; I/O >= criteria) then we can get those queries where they really cause problem.

Object usage

We have a table or index in the database, and we look for all the queries that utilize that object.
It is very useful for:
  • Preparing for a PI change
    PI change usually triggered by skewed data distribution, but the more important factor is the access path it provides. Before you change a PI, you need to examine the workload, accessing that table. What is the typical filter/join criteria, does the current/new PI support good access path? PTA enables to do it quickly and easily.
  • Determine usage of a table
    How many times a table is accessed a day? Is it used extensively, or just kept loaded for nothing?
  • Determine usage of a hash/join index
    A hash/join index requires processing and storage resources. Is it pretty utilized, or not?
How?
Simply set the "Object used" filter citeria. In addition you can use the "Thru join/hash index" option for including the join/hash index based data accesses.

Recurring queries

Which query to be optimized? Maybe there are a medium consuming queries, but run lot of times a day, consuming much resource in total. Use the "Recurring query finder" function of the PTA to see aggregated information: total and average Duration, CPU and I/O.

What was a query run by?

If your ETL and OLAP systems use QueryBand properly, you can set specific variables (max of 6) to be parsed into separate columns in the result query list. Sort the result by it, and find which queries of a load job should be optimized.


Sponsored by PRISE Ltd.
www.prisetools.com