Friday, February 26, 2016

Monitor ongoing SQLs - progress indication

Sponsored by PRISE Ltd.
www.prisetools.com

How can I monitor the progress of my SQL?

We often have SQL operations that last hours or more, and we do want to know when will they expectedly finish. There is no progress bar or something similar tool in Teradata, but we have some useful information parcels to be used to make our estimations.
Let's see how.

Execution steps

SQL statements are broken down to execution steps by the Parsing Engine (PE), which steps are executed sometimes sequentially, sometimes parallel to each other.
This level of monitoring can easily be done by either Teradata Viewpoint or DB_MON tools. It will show you which step(s) was/is/will be executed, and some information about them:
  • What is the estimated runtime (for all steps)
  • How long did the finished ones run (only for finished ones)
  • For how long the currently executed is/are running (only for the running ones)
First we can compare the estimations and the fact runtimes for the finished steps. It can give a guess rate we can use for the further estimations. Naturally it is far from exact calculation, but is just a guideline. The difference comes from estimation bias and server load/priority conditions.

So at the end we can use the estimations' sum multiplied by the finished steps' fact/estimation rate as a very rough guess

Within a step

Typically a long running SQL statement contains a few number of long running steps besides more lightweight ones. So the runtime is dominated by some few heavy steps, and in this case the interesting question is: what is the progress rate of the actually running step?
The answer is not that very simple, one must know what happens behind - more or less.
Before examining  exact cases, I provide some basic information and tricks.

Tricks to know

Typically used SQL scripts by myself

Current spool used by a user (1)

We can query the data dictionary for the actually allocated spool space by a user. This is very important that this is related to a user, not a session (sum of all logged in sessions of that specific user). 
The current spool covers all objects currently stored in spool: 
  • used by the currently executed query step
  • all the other spools generated by previous steps and still not dropped
  • all existing volatile tables either. 
So be careful with it.

The current spool information is usually interesting from two aspects:
  • Total spool usage of a user, now (a):
    SELECT cast(sum(currentspool) /1024/1024/1024 as decimal(15,3)) Current_spool,
    Current_spool / (hashamp()+1) Average_spool, cast(max(currentspool) /1024/1024/1024 as decimal(15,3)) MaxAMP_spool,
    cast(100.0 - ((Average_spool / nullifzero(MaxAMP_spool)) * 100.0) as decimal(4,1)) Spool_skew
    FROM dbc.diskspace WHERE databasename = '
    <user>';
  • Spool usage by AMP of a user, now (b):
    SELECT vproc, cast(currentspool /1024/1024 as decimal(15,3)) Current_AMP_spool_MB
    FROM dbc.diskspace WHERE databasename = '
    <user>'
    ORDER BY 2 DESC;

Current size of a table (2)

  • Total size of a table (a):
    SELECT databasename,tablename,cast(sum(currentperm) /1024/1024/1024 as decimal(15,1)) Current_perm_GB
    FROM dbc.allspace
    WHERE databasename = '
    <db_name>' and tablename = '<table_name>'
    GROUP BY 1,2;
  • Table size by AMP (b)
    SELECT databasename,tablename,vproc,cast(currentperm /1024/1024/1024 as decimal(15,3)) Current_perm_MB
    FROM dbc.allspace
    WHERE databasename = '
    <db_name>' and tablename = '<table_name>'
    ORDER BY 4 DESC;

Current allocated and free space in a database (3)

  • Total allocated/free space in a database, currently (a):
    SELECT databasename,
    cast(sum(currentperm) /1024/1024/1024 as decimal(15,1)) Used_perm_GB,
    cast(sum(maxperm-currentperm)/1024/1024/1024 as decimal(15,1)) Free_perm_GB
    FROM dbc.diskspace
    WHERE databasename = '
    <db_name>'
    GROUP BY 1;
  • Sizes in a database by AMP (b)
    Run before the operation:
    CREATE VOLATILE TABLE CPERM as
    (SELECT vproc,currentperm, maxperm-currentperm freeperm FROM dbc.diskspace WHERE databasename = '<db_name>') with data PRIMARY INDEX( vproc) on commit preserve rows;

    Run during the operation any time of interest:
    SELECT databasename,
    cast(sum(a.currentperm-b.currentperm) /1024/1024/1024 as decimal(15,1)) Used_perm_GB,
    cast(sum(a.maxperm-a.currentperm-b.freeperm)/1024/1024/1024 as decimal(15,1)) Free_perm_GB
    FROM dbc.diskspace a join CPERM b on a.vproc=b.vproc
    WHERE databasename = '<db_name>' GROUP BY 1

Transient journal size and free space in DBC (4)

When running an SQL statement which modifies table data, it will generate more or less transaction log (called TransientJournal in Teradata). Its size sometimes helps to determine the progress of an operation. Please note that this information is very-very informative, since we do not know how much journal will be written, and in addition all the transactions use the same journal table, and the purge of the released journal area is an asynchronous task. It is important, that the journal must fit into DBC, so the following query shows the DBC's free space along with the current journal size. Please note: this query examines the conditions on total level instead of AMP level, so if you run out of DBC space on one AMP the system will fail, in spite you see lots of total free spaces.
  • How's going with the transient journal:
    SELECT tj.Current_perm_GB TransientJournal_current_GB, dbspc.Free_perm_GB DBC_free_perm_GB FROM
    (SELECT databasename,tablename,cast(sum(currentperm) /1024/1024/1024 as decimal(15,1)) Current_perm_GB
    FROM dbc.allspace WHERE databasename = 'DBC' and tablename = 'TransientJournal' GROUP BY 1,2 ) tj cross join
    (SELECT cast(sum(maxperm-currentperm)/1024/1024/1024 as decimal(15,1)) Free_perm_GB FROM dbc.diskspace WHERE databasename = 'DBC') dbspc;

Within a step, continued...

The progress examination method will differ depending on the type of the operation - unfortunately this is how it goes. The reasons are hidden behind the internal processes of Teradata... 

INSERT INTO xxx SELECT ... FROM ...

This type of operation typically consists of the following steps:

  1. Retrieve data from the source into a SpoolX
    This step can be examined by checking the executing user's spool space (query 1.a/1.b). If the data is skewed, it is worth to check the spool on AMP level (1.b), mainly after the growth of the total will get slower. This is because most of the AMPs are finished already, but a few of them must process the "peak" 
  2. Merge SpoolX into "xxx" table
    I monitor this step in two sections: first check the target table's size (query 2.a/2.b), which will increase as the merge process advances. The target table's size will increase by the size of the source spool, (except in some special cases like SET table duplication elimination) so if the target table was not empty before the insert, just watch the difference between initial and current size. It goes linearly along the time up to a specific point, when the first AMP finishes its work.When an AMP has been finished the merge, it will release the corresponding source spool area ( which was produced by step 1. ). So when the growth of the target slows down, switch to check the executing user's spool space by AMP (query 1.b), and watch how many AMPs of the all will have zero or minimal spool occupied (the non-finished ones show the full spool size and do not decrement until they finish). This phase is quite short if the data is not skewed AND the system is well balanced (co-existing node generations...) AND another sessions do not cause skewed load to the system, otherwise the first and last finishing AMP may differ significantly. In addition, the last AMP can run extremely long when the hash collision problem escalates. Unfortunately the progress within a single AMP is not visible* (see note below).
    When the target table was non empty when the INSERT started, we usually experience significant increase in the size of the Transient Journal (query 4). This is not a good progress indicator, since we do not have a guideline how big is the target journal size, and other transactions use the same journal. In addition the release of the unused journal area is an asynchronous process, which is may be delayed from the SQL operations finish. The purpose of monitoring this object - together with the DBC free space - is to ensure that we do not eat up all the free DBC space. Of course it is a rare situation, but is is a very unpleasant situation when half-a-day run is rolled back because we run out of DBC space...
  3. Commit transaction
    This step usually lasts a very short time, except when the table is very skewed. Usually I do not monitor this step.

CREATE TABLE AS ...

The basic execution structure (from our aspect) will not differ in major things, except one. The "preparation step(s)" eg. retrieve into a spool can be measured exactly the same way as we did at the "INSERT... SELECT..." case, but the MERGE is different: at the time of execution we do not have the target object in the data dictionary, which size we could measure. However we are not absolutely lost: the free space in the target database (query 3.a/3.b) is kept maintained during the MERGE, so if we measure it, and compare to the amount when the process started, we get a good progress indicator of the MERGE step. Take care, it only works if nobody modifies the Maxperm (size) of the target database neither other process puts or removes data into/from it during our operation.

  1. Save the initial condition into a volatile table (it is very small spool)
  2. Query the current increment globally
When the size increasing slows down we can go back to check the spool size at AMP level, and watch how many AMPs has already been finished.




* Note: The progress of a single AMP is not measurable using the provided methods, but we can have some indirect deductions at the MERGE step. If we know the expected size or record number ratio of the smallest and biggest AMP, and we can measure the finish time of the smallest AMP (when the spool size on an AMPs begin to fall to zero or very low value (the MERGE is finished on that AMP), we can extrapolate the runtime on the biggest AMP (this will be the total time). But only if hash collision is not significant: MULTISET TABLE, or low "high-mode-frequency" of the PI columns, otherwise the runtime is dominantly determined by the hash collision level (where Nx is the top mode-frequency values of the PI; O(sum(Nx*Nx)) ), not the number of records
Sponsored by PRISE Ltd.
www.prisetools.com