Sponsored by PRISE Ltd.
www.prisetools.com
www.prisetools.com
Find Skewed CPU and I/O in DBQL
Please note that the solutions found in the article works on the DBQL logs, which covers only that users' activity, for whom the logging is switched on.
About Skew
If you are not comfortably familiar with Skewness or DBQL, please read the corresponding posts before:TeradataTuning - Skewness
TeradataTuning - DBQL I
As you can see the difference between Impact[Resource] and Total[Resource] is a net technological loss, which should be minimized.
Skewness is a good indicator for highlight those workload that wastes lots of valuable CPU and I/O resources because of inefficient parallelism.
Explanation of extra fields:
ParserCPUTime: Time parser spent on producing the execution plan. This can be high if SQL is too complex or too many random AMP sampling has to be done.
LHR/RHL: Larry Higa ( inverse Larry Higa) index. Empirical index that shows the CPU vs I/O rate. By experience it should be usually around one (can be different depending on your system configuration, but is a constant). If it is far from 1, that indicates CPU or I/O dominance, which means unbalanced resource consumption, but it is a different dimension that skew.
QueryBand: Labels that sessions use to identify themselves within the DBQL logs
QueryText: First 200 characters of the query (depending on DBQL log settings)
OK, we've listed the terrible top consumers, but what's next?
Have to identify those queries. If your ETL and Analytics software is configured to user QueryBand properly (this area deserves a separate post...) , you can find which job or report issued that SQL, anyway, you can see the QueryText field.
If you want to get the full SQL text, select it from the DBQLSQLTbl (SQL logging needs to be switched on), replace the appropriate <procid> and <queryid> values:
You will get the SQL in several records, broken up to 30K blocks, simply concatenate them. Unfortunately the SQL will have very ugly make up, you can use PRISE Tuning Assistant to beautify and highlight it for easy reading.
The answer is: yes, we can estimate quite closely. The exact value we cannot calculate because DBQL does not log AMP information for the query execution, but the most important metrics.
We can not calculate that situation when more skewed queries run the same time, but have peaks on different AMPs. This reduces the system level resource wasting, but is hard to calculate with, however its probability and effect is negligible now.
Look at the last two columns. That percent of your CPU and I/O goes to the sink...
Here you are (CPU version, transform for I/O implicitly):
I expect you are a bit shocked now, how few queries waste how much golden resources.
I think we will agree that it is worth to tune those dozen of queries, and you save in orders of 100K..MUSD for your company annually, am I right?
PRISE Tuning Assistant helps you to find those queries and to get the hang of how to accelerate them.
Skewness is a good indicator for highlight those workload that wastes lots of valuable CPU and I/O resources because of inefficient parallelism.
Find bad queries in DBQL
With this SQL you can filter top impacting queries (replace the date value or maybe you have to adjust the date filtering according to local settings):
select top 50
ProcID
, QueryID
, AMPCPUTime
, MaxAMPCPUTime * (hashamp () + 1) CPUImpact
, CAST (100 - ((AmpCPUTime / (hashamp () + 1)) * 100 / NULLIFZERO (MaxAMPCPUTime)) AS INTEGER) "CPUSkew%"
, TotalIOCount
, MaxAMPIO * (hashamp () + 1) IOImpact
, CAST (100 - ((TotalIOCount / (hashamp () + 1)) * 100 / NULLIFZERO (MaxAMPIO) ) AS INTEGER) "IOSkew%"
, AMPCPUTime * 1000 / nullifzero (TotalIOCount) LHR
, TotalIOCount / nullifzero (AMPCPUTime * 1000) RHL
, ParserCPUTime
, Queryband
, Substr(QueryText,1,2000) QueryText
from
/* For archived DBQL
dbql_arch.dbqlogtbl_hst where logdate=1131201
and ampcputime>0
*/
/* For online DBQL*/
dbc.dbqlogtbl where
cast(cast(starttime as char(10)) as date) = '2013-12-18' (date)
and ampcputime>0
order by CPUImpact desc
Explanation of extra fields:
ParserCPUTime: Time parser spent on producing the execution plan. This can be high if SQL is too complex or too many random AMP sampling has to be done.
LHR/RHL: Larry Higa ( inverse Larry Higa) index. Empirical index that shows the CPU vs I/O rate. By experience it should be usually around one (can be different depending on your system configuration, but is a constant). If it is far from 1, that indicates CPU or I/O dominance, which means unbalanced resource consumption, but it is a different dimension that skew.
QueryBand: Labels that sessions use to identify themselves within the DBQL logs
QueryText: First 200 characters of the query (depending on DBQL log settings)
OK, we've listed the terrible top consumers, but what's next?
Have to identify those queries. If your ETL and Analytics software is configured to user QueryBand properly (this area deserves a separate post...) , you can find which job or report issued that SQL, anyway, you can see the QueryText field.
If you want to get the full SQL text, select it from the DBQLSQLTbl (SQL logging needs to be switched on), replace the appropriate <procid> and <queryid> values:
select SQLTextInfo from dbc.dbqlsqltbl where procid=<procid> and queryid=<queryid>
order by SQLRowNo asc
You will get the SQL in several records, broken up to 30K blocks, simply concatenate them. Unfortunately the SQL will have very ugly make up, you can use PRISE Tuning Assistant to beautify and highlight it for easy reading.
System level Skewness
Totals
We've found those bad queries, nice. But what can we say about the whole system? What is the total parallel efficiency? Can we report how much resources were wasted due to bad parallel efficiency?The answer is: yes, we can estimate quite closely. The exact value we cannot calculate because DBQL does not log AMP information for the query execution, but the most important metrics.
We can not calculate that situation when more skewed queries run the same time, but have peaks on different AMPs. This reduces the system level resource wasting, but is hard to calculate with, however its probability and effect is negligible now.
select
sum(AMPCPUTime) AMPCPUTimeSum
, sum(MaxAMPCPUTime * (hashamp () + 1)) CPUImpactSum
, sum(TotalIOCount) TotalIOCountSum
, sum(MaxAMPIO * (hashamp () + 1)) IOImpactSum
, cast(100 - (AMPCPUTimeSum / CPUImpactSum) * 100 as integer) "CPUSkew%"
, cast(100 - (TotalIOCountSum / IOImpactSum) * 100 as integer) "IOSkew%"
from
/* For archived DBQL
dbql_arch.dbqlogtbl_hst where logdate = '2013-12-18' (date)
and (ampcputime>0 or TotalIOCount > 0)
*/
/* For online DBQL*/
dbc.dbqlogtbl where
cast(cast(starttime as char(10)) as date) = '2013-12-18' (date)
and (ampcputime>0 or TotalIOCount > 0)
Look at the last two columns. That percent of your CPU and I/O goes to the sink...
Top bad guys
OK, let's check how many queries accumulate 5%,10%,25%,50%,75%,90% of this loss?Here you are (CPU version, transform for I/O implicitly):
select 'How many queries?' as "_",min(limit5) "TOP5%Loss",min(limit10) "TOP10%Loss",min(limit25) "TOP25%Loss",min(limit50) "TOP50%Loss",min(limit75) "TOP75%Loss",min(limit90) "TOP90%Loss", max(rnk) TotalQueries, sum(ResourceTotal) "TotalResource", sum(ResourceImpact) "ImpactResource"
from
(
select
case when ResRatio < 5.00 then null else rnk end limit5
,case when ResRatio < 10.00 then null else rnk end limit10
,case when ResRatio < 25.00 then null else rnk end limit25
,case when ResRatio < 50.00 then null else rnk end limit50
,case when ResRatio < 75.00 then null else rnk end limit75
,case when ResRatio < 90.00 then null else rnk end limit90
,rnk
, ResourceTotal
, ResourceImpact
from
(
select
sum(ResourceLoss) over (order by ResourceLoss desc ) totalRes
, sum(ResourceLoss) over (order by ResourceLoss desc rows unbounded preceding) subtotalRes
, subtotalRes *100.00 / totalRes Resratio
, sum(1) over (order by ResourceLoss desc rows unbounded preceding) rnk
, ResourceTotal
, ResourceImpact
from
(
select
AMPCPUTime ResourceTotal
, (MaxAMPCPUTime * (hashamp () + 1)) ResourceImpact
, ResourceImpact - ResourceTotal ResourceLoss
/* For archived DBQL
from dbql_arch.dbqlogtbl_hst where logdate=1131207
and ampcputime>0
*/
/* For online DBQL*/
from dbc.dbqlogtbl where
cast(cast(starttime as char(10)) as date) = '2013-12-18' (date)
and ampcputime>0
) x
) y
) z
group by 1
I expect you are a bit shocked now, how few queries waste how much golden resources.
I think we will agree that it is worth to tune those dozen of queries, and you save in orders of 100K..MUSD for your company annually, am I right?
PRISE Tuning Assistant helps you to find those queries and to get the hang of how to accelerate them.
Typical reasons of skewness - in a nutshell
- Skewed tables: Bad choice of PI, Skewed data
- Bad execution plans (typically skewed redistributions)
- Bad data model (normalization,data types,PI, etc.)
- Missing or stale statistics
- Too many joins (break up the query!)
- Hash collision (load time problem)
Sponsored by PRISE Ltd.
www.prisetools.com
www.prisetools.com
Looking for those queries to find bad queries, thank you Akos
ReplyDeleteI appreciate if you can share the statistics you've found, private or public, thanks Cheeli
ReplyDelete