Thursday, December 19, 2013

DBQL analysis II. - Monitor "Top skewed/impact queries"

Sponsored by PRISE Ltd.
www.prisetools.com

Find Skewed CPU and I/O in DBQL

Please note that the solutions found in the article works on the DBQL logs, which covers only that users' activity, for whom the logging is switched on.

About Skew

If you are not comfortably familiar with Skewness or DBQL, please read the corresponding posts before:

TeradataTuning - Skewness
TeradataTuning - DBQL I
As you can see the difference between Impact[Resource] and Total[Resource] is a net technological loss, which should be minimized.

Skewness is a good indicator for highlight those workload that wastes lots of valuable CPU and I/O resources because of inefficient parallelism.

Find bad queries in DBQL

With this SQL you can filter top impacting queries  (replace the date value or maybe you have to adjust the date filtering according to local settings):

select  top 50
  ProcID
, QueryID
, AMPCPUTime
, MaxAMPCPUTime * (hashamp () + 1) CPUImpact
, CAST (100 - ((AmpCPUTime / (hashamp () + 1)) * 100 / NULLIFZERO (MaxAMPCPUTime)) AS INTEGER) "CPUSkew%"
, TotalIOCount
, MaxAMPIO * (hashamp () + 1) IOImpact
, CAST (100 - ((TotalIOCount / (hashamp () + 1)) * 100 / NULLIFZERO (MaxAMPIO) ) AS INTEGER) "IOSkew%"
, AMPCPUTime * 1000 / nullifzero (TotalIOCount) LHR
, TotalIOCount / nullifzero (AMPCPUTime * 1000) RHL
, ParserCPUTime
, Queryband
, Substr(QueryText,1,2000) QueryText
from
/* For archived DBQL
    dbql_arch.dbqlogtbl_hst where logdate=1131201
and ampcputime>0
*/
/* For online DBQL*/
    dbc.dbqlogtbl where
    cast(cast(starttime as char(10)) as date) = '2013-12-18' (date)
    and ampcputime>0
order by CPUImpact desc


Explanation of extra fields:
ParserCPUTime: Time parser spent on producing the execution plan. This can be high if SQL is too complex or too many random AMP sampling has to be done.
LHR/RHL: Larry Higa ( inverse Larry Higa) index. Empirical index that shows the CPU vs I/O rate. By experience it should be usually around one (can be different depending on your system configuration, but is a constant). If it is far from 1, that indicates CPU or I/O dominance, which means unbalanced resource consumption, but it is a different dimension that skew.
QueryBand: Labels that sessions use to identify themselves within the DBQL logs
QueryText: First 200 characters of the query (depending on DBQL log settings)

OK, we've listed the terrible top consumers, but what's next?
Have to identify those queries. If your ETL and Analytics software is configured to user QueryBand properly (this area deserves a separate post...) , you can find which job or report issued that SQL, anyway, you can see the QueryText field.

If you want to get the full SQL text, select it from the DBQLSQLTbl (SQL logging needs to be switched on), replace the appropriate <procid> and <queryid> values:

select SQLTextInfo from dbc.dbqlsqltbl where procid=<procid> and queryid=<queryid>
order by SQLRowNo asc


You will get the SQL in several records, broken up to 30K blocks, simply concatenate them. Unfortunately the SQL will have very ugly make up, you can use PRISE Tuning Assistant to beautify and highlight it for easy reading.

System level Skewness

Totals

We've found those bad queries, nice. But what can we say about the whole system? What is the total parallel efficiency? Can we report how much resources were wasted due to bad parallel efficiency?
The answer is: yes, we can estimate quite closely. The exact value we cannot calculate because DBQL does not log AMP information for the query execution, but the most important metrics.
We can not calculate that situation when more skewed queries run the same time, but have peaks on different AMPs. This reduces the system level resource wasting, but is hard to calculate with, however its probability and effect is negligible now.

select
  sum(AMPCPUTime) AMPCPUTimeSum
, sum(MaxAMPCPUTime * (hashamp () + 1)) CPUImpactSum
, sum(TotalIOCount) TotalIOCountSum
, sum(MaxAMPIO * (hashamp () + 1)) IOImpactSum
, cast(100 - (AMPCPUTimeSum / CPUImpactSum) * 100 as integer) "CPUSkew%"
, cast(100 - (TotalIOCountSum / IOImpactSum) * 100 as integer) "IOSkew%"
from
/* For archived DBQL
    dbql_arch.dbqlogtbl_hst where logdate = '2013-12-18' (date)     

    and (ampcputime>0 or TotalIOCount > 0)
*/
/* For online DBQL*/
    dbc.dbqlogtbl where
    cast(cast(starttime as char(10)) as date) = '2013-12-18' (date)
    and (ampcputime>0 or TotalIOCount > 0)


Look at the last two columns. That percent of your CPU and I/O goes to the sink...

Top bad guys

OK, let's check how many queries accumulate 5%,10%,25%,50%,75%,90% of this loss?
Here you are (CPU version, transform for I/O implicitly):

select 'How many queries?' as "_",min(limit5) "TOP5%Loss",min(limit10) "TOP10%Loss",min(limit25) "TOP25%Loss",min(limit50) "TOP50%Loss",min(limit75) "TOP75%Loss",min(limit90) "TOP90%Loss", max(rnk) TotalQueries, sum(ResourceTotal) "TotalResource", sum(ResourceImpact) "ImpactResource"
from
(
select
 case when ResRatio < 5.00 then null else rnk end limit5
,case when ResRatio < 10.00 then null else rnk end limit10
,case when ResRatio < 25.00 then null else rnk end limit25
,case when ResRatio < 50.00 then null else rnk end limit50
,case when ResRatio < 75.00 then null else rnk end limit75
,case when ResRatio < 90.00 then null else rnk end limit90
,rnk
, ResourceTotal
, ResourceImpact
from
(
select
  sum(ResourceLoss) over (order by ResourceLoss desc ) totalRes
, sum(ResourceLoss) over (order by ResourceLoss desc  rows unbounded preceding) subtotalRes
, subtotalRes *100.00 / totalRes Resratio
, sum(1) over (order by ResourceLoss desc  rows unbounded preceding) rnk
, ResourceTotal
, ResourceImpact
from
(
select
  AMPCPUTime ResourceTotal
, (MaxAMPCPUTime * (hashamp () + 1)) ResourceImpact
,  ResourceImpact - ResourceTotal ResourceLoss
/* For archived DBQL
from dbql_arch.dbqlogtbl_hst where logdate=1131207
and ampcputime>0
*/
/* For online DBQL*/
from dbc.dbqlogtbl where
cast(cast(starttime as char(10)) as date) = '2013-12-18' (date)
and ampcputime>0
) x
) y
) z
group by 1



I expect you are a bit shocked now, how few queries waste how much golden resources.
I think we will agree that it is worth to tune those dozen of queries, and you save in orders of 100K..MUSD for your company annually, am I right?

PRISE Tuning Assistant helps you to find those queries and to get the hang of how to accelerate them.

Typical reasons of skewness - in a nutshell

  • Skewed tables: Bad choice of PI, Skewed data 
  • Bad execution plans (typically skewed redistributions)
    • Bad data model (normalization,data types,PI, etc.)
    • Missing or stale statistics
    • Too many joins (break up the query!)
  • Hash collision (load time problem)

Sponsored by PRISE Ltd.
www.prisetools.com

2 comments:

  1. Looking for those queries to find bad queries, thank you Akos

    ReplyDelete
  2. I appreciate if you can share the statistics you've found, private or public, thanks Cheeli

    ReplyDelete