Re: Very big insert/join performance problem (bacula)

Andres Freund <andres@xxxxxxxxxxx> · Mon, 13 Jul 2009 17:06:07 +0200

Hi Marc,

I don't have really extensive comments, but I found two small things...

On Monday 13 July 2009 15:40:18 Marc Cousin wrote:
> I'm trying to solve big performance issues with PostgreSQL + bacula while
> inserting very big sets of records.
>
> I'm sorry, this email will be a long one, as I've already spent quite a lot
> of time on the issue, I don't want to waste your time speculating on things
> I may already have done, and the problem is (or seems to me) a bit complex.
> The other problem is that I don't have the explain plans to provide with
> the email right now. I'll try to use this as a way to push 8.4 in this
> setup, to dump all these plans with autoexplain (queries are on temporary
> tables, so a bit tricky to get).
>
> Let me first explain or remind how this works. Bacula is a backup solution
> and is trying to insert its metadatas at the end of backups (file name,
> directory name, size, etc ...)
> For what we are interested in, there are 3 tables :
> - file
> - filename
> - path
>
> file is the one containing most records. It's the real metadata. filename
> and path just contain an id and the real file or directory name (to save
> some space with redundant names).
>
> Before explaining the issue, just some information about sizing here :
>
> file is 1.1 billion records for 280GB (with indexes).
>
>    Column   |  Type   |                       Modifiers
> ------------+---------+----------------------------------------------------
>--- fileid     | bigint  | not null default
> nextval('file_fileid_seq'::regclass) fileindex  | integer | not null
> default 0
>  jobid      | integer | not null
>  pathid     | integer | not null
>  filenameid | integer | not null
>  markid     | integer | not null default 0
>  lstat      | text    | not null
>  md5        | text    | not null
> Indexes:
>     "file_pkey" UNIQUE, btree (fileid)
>     "file_fp_idx" btree (filenameid, pathid)
>     "file_jpfid_idx" btree (jobid, pathid, filenameid)
>
>
> path is 17 million for 6 GB
>
>  Column |  Type   |                       Modifiers
> --------+---------+-------------------------------------------------------
>  pathid | integer | not null default nextval('path_pathid_seq'::regclass)
>  path   | text    | not null
> Indexes:
>     "path_pkey" PRIMARY KEY, btree (pathid)
>     "path_name_idx" UNIQUE, btree (path)
>
> filename is 80 million for 13GB
>
>    Column   |  Type   |                           Modifiers
> ------------+---------+----------------------------------------------------
>----------- filenameid | integer | not null default
> nextval('filename_filenameid_seq'::regclass)
>  name       | text    | not null
> Indexes:
>     "filename_pkey" PRIMARY KEY, btree (filenameid)
>     "filename_name_idx" UNIQUE, btree (name)
>
>
> There are several queries for each job despooling :
>
> First we fill a temp table with the raw data (filename, pathname,
> metadata), using COPY (no problem here)
>
> Then we insert missing filenames in file, and missing pathnames in path,
> with this query (and the same for file) :
>
> INSERT INTO Path (Path)
>   SELECT a.Path FROM (SELECT DISTINCT Path FROM batch) AS a
>      WHERE NOT EXISTS (SELECT Path FROM Path WHERE Path = a.Path)
>
> These do nested loops and work very well (after a sort on batch to get rid
> from duplicates). They work reasonably fast (what one would expect when
> looping on millions of records... they do their job in a few minutes).
While this is not your questions, I still noticed you seem to be on 8.3 - it 
might be a bit faster to use GROUP BY instead of DISTINCT.

> The problem occurs with the final query, which inserts data in file,
> joining the temp table to both file and filename
>
> INSERT INTO File (FileIndex, JobId, PathId, FilenameId, LStat, MD5)
>   SELECT batch.FileIndex,
>               batch.JobId,
>               Path.PathId,
>               Filename.FilenameId,
>               batch.LStat,
>               batch.MD5
>   FROM batch
>   JOIN Path ON (batch.Path = Path.Path)
>   JOIN Filename ON (batch.Name = Filename.Name)
>
> This one has two split personnalities, depending on how many records are in
> batch.
> For small batch tables, it does nested loops.
> For big batch tables (more than around one million initially) it decides to
> hash join path (still ok, it's reasonably small) and then filename to batch
> before starting. And that's when the problems begin The behaviour seems
> logicial to me, it should go to hash join when batch gets bigger, but it
> seems to be much too early here, considering the size of filename.
>
> First of all, performance remains much better on nested loops, except for
> extremely big batches (i'd say over 30 million, extrapolating from the
> times I'm seeing with 10 millions records), so if I disable hash/merge
> joins, I get my performance back on these queries (they execute in around
> the same time as the searches in path and filename above). So I found a way
> to make most of my queries do nested loops (I'll come back to this later)
>
> Second, If there is more than one of these big sorts, performance degrades
> drastically (we had 2 of them this weekend, they both took 24 hours to
> complete). This is probably due to our quite bad disk setup (we didn't have
> a big budget for this). There was no swapping of linux
>
>
> So all of this makes me think there is a cost evaluation problem in this
> setup : with the default values, postgresql seems to underestimate the cost
> of sorting here (the row estimates were good, no problem with that).
> PostgreSQL seems to think that at around 1 million records in file it
> should go with a hash join on filename and path, so we go on hashing the 17
> million records of path, the 80 millions of filename, then joining and
> inserting into file (we're talking about sorting around 15 GB for each of
> these despools in parallel).
>
> Temporarily I moved the problem at a bit higher sizes of batch by changing
> random_page_cost to 0.02 and seq_page_cost to 0.01, but I feel like an
> apprentice sorcerer with this, as I told postgreSQL that fetching rows from
> disk are much cheaper than they are. These values are, I think, completely
> abnormal. Doing this, I got the change of plan at around 8 million. And had
> 2 of them at 9 millions at the same time this weekend, and both of the took
> 24 hours, while the nested loops before the join (for inserts in path and
> filename) did their work in minutes...
>
> So, finally, to my questions :
> - Is it normal that PostgreSQL is this off base on these queries (sorry I
> don't have the plans, if they are required I'll do my best to get some, but
> they really are the two obvious plans for this kind of query). What could
> make it choose the hash join for too small batch tables ?
> - Is changing the 2 costs the way to go ?
> - Is there a way to tell postgreSQL that it's more costly to sort than it
> thinks ? (instead of telling it that fetching data from disk doesn't cost
> anything).

> Here are the other non-default values from my configuration :
>
> shared_buffers = 2GB
> work_mem = 64MB
> maintenance_work_mem = 256MB
> max_fsm_pages = 15000000 # There are quite big deletes with bacula ...
> effective_cache_size = 800MB
> default_statistics_target = 1000
Your effective_cache_size is really small for the system you seem to have - its 
the size of IO caching your os is doing and uses no resources itself. And 
800MB of that on a system with that amount of data seems a bit unlikely ;-)

Using `free` you can see the amount of io caching your OS is doing atm. in the 
'cached' column.

That possibly might tip some plans in a direction you prefer.

What kind of machine are you running this on?

Andres

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance