Monday, January 13, 2014

DBQL analysis III - Monitor "collect statistics"

Sponsored by PRISE Ltd.
www.prisetools.com

Analyze "collect statistics" in DBQL

Please note that the solutions in this article work on the DBQL logs, which cover only the activity of those users for whom logging is switched on. The "Object" and "SQL" DBQL logging options must be switched on to use the scripts provided.
This article is applicable up to V13.10 without modification; statistics handling changed in V14.
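
Both options can be enabled with BEGIN QUERY LOGGING; a minimal sketch (the ALL scope is an illustrative choice, adjust it to your site's logging standards):

-- Log accessed objects and full SQL text for all users
BEGIN QUERY LOGGING WITH OBJECTS, SQL ON ALL;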

About Statistics

"Statistics" is a descriptive object in the Teradata database that are used by the optimizer for transforming SQLs to effective execution plans.
Statistics reflect the key data demographic information of one or more table column(s).
These objects should be created and maintained, the RDBMS will not do it by itself.
Statistics internally contain value histogram, which needs the table data (or sample) to be analyzed, which is an expensive task.

Summarized: appropriate statistics are required to get good and effective execution plans for SQL statements, but collecting or refreshing statistics consumes resources.
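
To check what statistics a table currently has and when they were refreshed, HELP STATISTICS can be used (the database and table names here are hypothetical):

-- Lists collection date/time, number of unique values and the columns involved
HELP STATISTICS sandbox.customer;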

"Statistics" footprint in DBQL

When a "statistics" is created or refreshed it is executed by an SQL command: collect statistics....
This command will create a log entry into the DBQL if the logging is switched on.
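
For example, a full and a sampled single-column collection look like this (object names are hypothetical; pre-V14 syntax):

-- Full-scan statistics on one column
COLLECT STATISTICS ON sandbox.customer COLUMN (customer_id);
-- Sampled statistics: cheaper to collect, less accurate
COLLECT STATISTICS USING SAMPLE ON sandbox.customer COLUMN (customer_id);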

One can track when each statistic was collected, and how much CPU and I/O the collection consumed.
These statements are very easy to identify in the central table:

select * from dbc.DBQLogTbl where StatementType='Collect statistics'
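
To spot the most expensive collections at a glance, the same filter can be extended with the resource columns (a simple variant of the query above):

select StartTime, UserName, AMPCpuTime, TotalIOCount, QueryText
from dbc.DBQLogTbl
where StatementType='Collect statistics'
order by AMPCpuTime desc;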

Analyzing DBQL data 

Prepare data

You may need to modify the script:
  • Date (interval)
  • Online/archived tables: use the appropriate commented section
  • QueryBand: the "JOB" variable is used; modify it according to your ETL settings (see the sketch below)
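For the QB_info column to be meaningful, the ETL sessions must tag themselves with a JOB query band; a minimal sketch (the job name is made up):

-- Tag the session so DBQL records the job name in its QueryBand column
SET QUERY_BAND = 'JOB=daily_customer_load;' FOR SESSION;

The preparation script: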
create volatile table DBQLStat_tmp1
as
(
sel a.procId,a.QueryId,a.StartTime,a.AMPCpuTime,a.TotalIOCount

/* 'S': sampled collection, 'F': full-scan collection */
,case when a.querytext like '% sample %' then 'S' else 'F' end Full_Sample
,UserName,(FirstRespTime - StartTime) DAY(4) TO SECOND(4) AS RUNINTERVAL      
,(EXTRACT(DAY FROM RUNINTERVAL) * 86400 + EXTRACT(HOUR FROM RUNINTERVAL)  * 3600 + EXTRACT(MINUTE FROM RUNINTERVAL)  * 60 + EXTRACT(SECOND FROM RUNINTERVAL) ) (decimal(10,1)) Duration
,b.ObjectDatabaseName DatabaseName,b.ObjectTableName TableName,c.ObjectColumnName ColumnName
/* Flag whether the statistic includes the PARTITION pseudo-column */
,case when d.SQLTextInfo like any ('%"PARTITION"%', '%,PARTITION %', '%,PARTITION,%', '% PARTITION,%', '% PARTITION %', '%(PARTITION,%', '%(PARTITION %', '%,PARTITION)%', '% PARTITION)%', '%(PARTITION)%') then 'Y' else 'N' end inclPartition
,CAST((case when index(queryband,'JOB=') >0 then  substr(queryband,index(queryband,'JOB=') ) else '' end) AS VARCHAR(500)) tmp_Q
,case when queryband = '' then 'N/A'
         when tmp_q = '' then '-Other'
else CAST( (substr(tmp_Q,characters('JOB=')+1, nullifzero(index(tmp_Q,';'))-characters('JOB=')-1)) AS VARCHAR(500)) end QB_info
/* Sequence number of the column within the statistic, used by the recursion below */
,sum(1) over (partition by a.procid,a.Queryid order by c.ObjectColumnName, a.QueryID rows unbounded preceding) Rnk
from
/* For archived tables
     dbql_arch.DBQLogTbl_hst       a
join dbql_arch.DBQLObjTbl_hst      b on b.ObjectType='Tab' and a.procid=b.procid and a.QueryID=b.QueryID and a.logDate=b.logDate
left join dbql_arch.DBQLObjTbl_hst c on c.ObjectType='Col' and a.procid=c.procid and a.QueryID=c.QueryID and a.logDate=c.logDate
join dbql_arch.DBQLSQLTbl_hst      d on d.SQLRowNo=1       and a.procid=d.procid and a.QueryID=d.QueryID and a.logDate=d.logDate
where a.logDate=1140113
*/
/*end*/
/* For online tables */
     dbc.DBQLogTbl       a
join dbc.DBQLObjTbl      b on b.ObjectType='Tab' and a.procid=b.procid and a.QueryID=b.QueryID
left join dbc.DBQLObjTbl c on c.ObjectType='Col' and a.procid=c.procid and a.QueryID=c.QueryID
join dbc.DBQLSQLTbl      d on d.SQLRowNo=1       and a.procid=d.procid and a.QueryID=d.QueryID
where cast(cast(a.starttime as char(10)) as date) = '2014-01-13' (date)
/*end*/
and a.StatementType='Collect statistics'
) with data
primary index (procId,QueryId)
on commit preserve rows
;

create volatile table DBQLStat
as
(
WITH RECURSIVE rec_tbl
(
 procId,QueryId,StartTime,AMPCpuTime,TotalIOCount,Duration,Full_Sample,UserName,DatabaseName,TableName,QB_info,inclPartition,ColumnName,Rnk,SColumns
)
AS
(
select
 procId,QueryId,StartTime,AMPCpuTime,TotalIOCount,Duration,Full_Sample,UserName,DatabaseName,TableName,QB_info,inclPartition,ColumnName,Rnk,cast(case when ColumnName is null and inclPartition='Y' then '' else '('||ColumnName end as varchar(10000)) SColumns
from DBQLStat_tmp1 where Rnk=1 /* Seed: the first column of each statistic */
UNION ALL
/* Recursive step: append the next column of the statistic to the list */
select
  a.procId,a.QueryId,a.StartTime,a.AMPCpuTime,a.TotalIOCount,a.Duration,a.Full_Sample,a.UserName,a.DatabaseName,a.TableName,a.QB_info,a.inclPartition,a.ColumnName,a.Rnk,b.SColumns ||','||a.ColumnName
from DBQLStat_tmp1     a
join rec_tbl b on a.procId=b.ProcId and a.QueryId=b.QueryID and a.Rnk=b.Rnk+1
)
select   procId,QueryId,StartTime,AMPCpuTime,TotalIOCount,Duration,Full_Sample,UserName,DatabaseName,TableName,QB_info,Rnk NumOfColumns
        ,case when SColumns = '' then '(PARTITION)' else SColumns || case when inclPartition='Y' then ',PARTITION)' else ')' end end StatColumns
from rec_tbl qualify sum(1) over (partition by procid,queryid order by Rnk desc, QueryID rows unbounded preceding) = 1
) with data
primary index (procid,queryid)
on commit preserve rows
;
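
Once DBQLStat is populated, a quick sanity check lists the most expensive collections (a simple usage example):

select top 10 DatabaseName, TableName, StatColumns, AMPCpuTime, Duration
from DBQLStat
order by AMPCpuTime desc;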

Reports

  • How many statistics have been collected, and how much resource did they consume?

select
  UserName /*Or: DatabaseName*//*Or: Full_sample*/
, count(*) Nbr
, sum(AMPCpuTIme) CPU
, sum(TotalIOCount) IO
from DBQLStat
group by 1
order by 1
;
  • Which statistics have been collected multiple times?
    (If several days are in the prepared data, the collection frequency can be determined; remove the "qualify" clause, see the sketch after this report.)
select a.*,
sum(1) over (partition by databasename,tablename,statcolumns)  Repl
from DBQLStat a

/* Comment out the next line for the frequency report */
qualify sum(1) over (partition by databasename,tablename,statcolumns) > 1

/*end*/
order by repl desc, databasename,tablename,statcolumns
;
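
If the prepared data spans several days, a possible frequency report counts the collections and the distinct days per statistic (a sketch, assuming multiple days were loaded into DBQLStat):

select DatabaseName, TableName, StatColumns
, count(*) Collections
, count(distinct cast(StartTime as date)) CollectionDays
from DBQLStat
group by 1,2,3
order by Collections desc;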


Sponsored by PRISE Ltd.
www.prisetools.com

3 comments:

  1. Hi,
    Which is efficient: collecting stats on all the columns of a table, or only on the indexed columns?

    Replies
    1. Hi Agilan,

      Stats on 'all columns' is definitely not efficient.
      I recommend a different approach: collect statistics on the "important" columns. Which are they?
      - Primary index
      - PARTITION and the partitioning column (if PPI)
      - Secondary index fields (composite, if the index is composite)
      - Significant columns in the workload:
        - Join column(s)
        - Column(s) in "where" conditions
        - Columns in "group by" (composite)

      Use 'SAMPLE' where advisable (read the Teradata recommendations); a minimal sketch follows below.
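
      For illustration, on a hypothetical PPI table this could look like (object names are made up):

      -- Primary index and partitioning information
      COLLECT STATISTICS ON sandbox.sales COLUMN (sales_id);
      COLLECT STATISTICS ON sandbox.sales COLUMN (PARTITION);
      COLLECT STATISTICS ON sandbox.sales COLUMN (sales_date);
      -- Heavily joined and filtered column, sampled to save resources
      COLLECT STATISTICS USING SAMPLE ON sandbox.sales COLUMN (customer_id);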
