Monday, March 03, 2014

No more spool space

Sponsored by PRISE Ltd.
www.prisetools.com

Why do I get "No more spool space" error?

This is the most familiar error message in Teradata world:
"Failure 2646 No more spool space"

What does it really mean, what is it caused by?
Let's get back to the basics.

What is spool space?

Spool space is a temporary area, that can be used to store the part-results during query processing, as well as volatile tables. All free space in the database that are not allocated by PERM data, technically can be used for spool area, since a PERM data does not want to allocate that area.

Each database users may have a "spool limit" that restricts the user to allocate more spool area at a time, than its limit. Keep in mind that all active sessions of a username must share the spool limit together.

Teradata is a massive parallel system, therefore the spool limit must be interpreted on AMP level:
Eg: 100AMP system, a user with 10G spool limit means: 100M spool/AMP

What is spool space limit good for?

This limitation method is a quite simple way to cut those queries from the system that would suck too much resources. There is no exact relationship between high spool usage and ineffective query, but statistically the correlation is high.
Practically: a bad query is being kicked off before is consumes too much resources unnecessarily.

No more spool space scenarios

System ran out of spool space

This is the most rare situation, forget about. There are too few free space on the system, but this situation used to be avoided by defining a "SpoolReserve" database, where no objects are created, this way that area is always available for spool.
If many "big spool limit" users run high spool queries parallel, then this rare situation can yet occure.

Multiple session of the user are active together

This is a quite rare situation also. Check the active users from dbc.sessioninfo.

Volatile tables

All existing volatile tables reside in your available spool space, reducing the available. If you create many, and even with skewed distribution, you can stuff your spool up. Choose "primary index" carefully, when defining volatile tables also.

Improper execution plan

These are the >90% of cases that cause the "No more spool space" errors. Let' see how:
  • "Duplication to all AMPs" of a non-small set of records
    The root cause is typically missing or stale statistics. Either system thinks that much less records will be duplicated than the real (sometimes billions of records get in this kind of spools), or knows this exactly, but on the other branch of the query there are more low quality estimations, and this execution seems to be cheaper. 
  • Redistribution of records by a hash that causes skewed distribution
    Check the corresponding blog post: Accelerate skewed joins
  • Retrieve huge amount of records into spool (locally or redistributed onto the AMPs)
    Specific query structures imply this execution, like: join to a view that "union all"-s big tables.
I suggest to use PRISE Tuning Assistant to identify what is the problem. It spectacularly displays which execution step falls in the problems above.
Increasing the spool limit will not solve the problems in the most cases. 

Too big task

Sometimes a given SQL query requires big spool area to be performed, even with the best execution plan.
This is the only case when raising spool limit is the solution. But first you have to understand that the task is really big. PRISE Tuning Assistant is a good tool for identify this in a minute.

Sponsored by PRISE Ltd.
www.prisetools.com

2 comments:

  1. Thank you Akos for your wonderful posts. I am interested to seek your thoughts on the following.

    Can you please explain whether the following can be achieved or not.

    We are having 137 AMPS, joining 2 tables and one of the table_a has 98,016,938 rows. In explain plan, I could see that table_a is duplicated across all amps
    (98,016,938 rows * 137 AMPs = 13428320506 rows)

    Instead, if we have a central disk (only for reading -- like reserved database with not objects in it i.e., no PERM space utilized) shared by all amps (all amps has its individual disks, besides assume a new small central disk which is shared by all amps, here only 98,016,938 rows can be duplicated instead of 98,016,938 * 137 AMPS which might help in CPU, I/O.

    Now the joins can also be performed as the AMP's has the table_b data scattered on its AMPs and entire table_a data in a shared disk and hence can perform join.

    It CONTRADICTs the Teradata's shared nothing architecture.

    But what are the pros and cons of this approach? (like reduce the number of records duplicated, waste of space as having a disk only for sharing {duplicated data} etc)

    Thanks,
    Cheeli

    ReplyDelete
    Replies
    1. Thanks for your feedbacks!
      Regarding your question, I do not really have experience with virtualized storage, I can approach your question from SQL optimizing point of view.
      Duplication of such a big table is part of a bad plan, I'm sure. My opinion is that the big gain will not come from placing the data on shared or dedicated disks, but rather eliminating that huge duplication and do something more effective.
      I should see the query and table info to give a correct answer. If you contact me personally on LinkedIn I gladly help you with your questions.

      regards, Ákos

      Delete