Re: Very big insert/join performance problem (bacula)


 



Hi,
just a remark, as the number of entries seems to be very high:
have you ever run bacula's dbcheck program, option 16?

Regards

Reiner


Marc Cousin wrote:
Hi,

I'm trying to solve big performance issues with PostgreSQL + bacula while inserting very big sets of records.

I'm sorry, this email will be a long one: I've already spent quite a lot of time on the issue, I don't want to waste your time speculating on things I may already have tried, and the problem is (or seems to me) a bit complex. The other problem is that I don't have the explain plans to provide with this email right now. I'll try to use this as a way to push 8.4 into this setup, so I can dump all these plans with auto_explain (the queries are on temporary tables, so they're a bit tricky to capture).
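For the record, once this runs on 8.4, capturing the plans would presumably look something like this (a sketch only, I don't have it in place yet; the settings come from the auto_explain documentation):

LOAD 'auto_explain';
SET auto_explain.log_min_duration = 0;  -- log the plan of every statement in this session
SET auto_explain.log_analyze = true;    -- include actual row counts and timings
-- The despooling queries run in this session, including the ones on the
-- temporary batch table, would then have their plans written to the server log.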

Let me first explain, or recall, how this works. Bacula is a backup solution; at the end of each backup it inserts the backup's metadata (file name, directory name, size, etc.) into the database.
For what we are interested in, there are 3 tables:
- file
- filename
- path

file is the one containing most of the records; it holds the real metadata. filename and path just contain an id and the actual file or directory name (to save space, since names are highly redundant).

Before explaining the issue, just some information about sizing:

file is 1.1 billion records for 280GB (with indexes).

   Column   |  Type   |                       Modifiers
------------+---------+-------------------------------------------------------
 fileid     | bigint  | not null default nextval('file_fileid_seq'::regclass)
 fileindex  | integer | not null default 0
 jobid      | integer | not null
 pathid     | integer | not null
 filenameid | integer | not null
 markid     | integer | not null default 0
 lstat      | text    | not null
 md5        | text    | not null
Indexes:
    "file_pkey" UNIQUE, btree (fileid)
    "file_fp_idx" btree (filenameid, pathid)
    "file_jpfid_idx" btree (jobid, pathid, filenameid)


path is 17 million records for 6 GB

 Column |  Type   |                       Modifiers
--------+---------+-------------------------------------------------------
 pathid | integer | not null default nextval('path_pathid_seq'::regclass)
 path   | text    | not null
Indexes:
    "path_pkey" PRIMARY KEY, btree (pathid)
    "path_name_idx" UNIQUE, btree (path)

filename is 80 million records for 13 GB

   Column   |  Type   |                           Modifiers
------------+---------+---------------------------------------------------------------
 filenameid | integer | not null default nextval('filename_filenameid_seq'::regclass)
 name       | text    | not null
Indexes:
    "filename_pkey" PRIMARY KEY, btree (filenameid)
    "filename_name_idx" UNIQUE, btree (name)


There are several queries for each job's despooling:

First we fill a temp table with the raw data (file name, path name, metadata) using COPY (no problem here).
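Roughly, this step looks like the following (the exact definition of the batch table is an assumption on my part, inferred from the columns used in the queries below):

-- Sketch of the spooling step; column names inferred from the later queries
CREATE TEMPORARY TABLE batch (
    FileIndex integer,
    JobId     integer,
    Path      text,
    Name      text,
    LStat     text,
    MD5       text
);
COPY batch (FileIndex, JobId, Path, Name, LStat, MD5) FROM STDIN;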

Then we insert the missing path names into path and the missing file names into filename,
with this query (shown here for path; the one for filename is analogous, see the sketch below):

INSERT INTO Path (Path) SELECT a.Path FROM (SELECT DISTINCT Path FROM batch) AS a WHERE NOT EXISTS (SELECT Path FROM Path WHERE Path = a.Path)

These do nested loops and work very well (after a sort on batch to get rid of the duplicates). They are reasonably fast (what one would expect when looping over millions of records): they do their job in a few minutes.
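For reference, the query for filename is the analogous one (reconstructed here, so treat it as a sketch rather than a verbatim copy of what bacula sends):

INSERT INTO Filename (Name)
  SELECT a.Name
  FROM (SELECT DISTINCT Name FROM batch) AS a
  WHERE NOT EXISTS (SELECT Name FROM Filename WHERE Name = a.Name);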

The problem occurs with the final query, which inserts the data into file, joining the temp table to both path and filename:

INSERT INTO File (FileIndex, JobId, PathId, FilenameId, LStat, MD5)
  SELECT batch.FileIndex,
         batch.JobId,
         Path.PathId,
         Filename.FilenameId,
         batch.LStat,
         batch.MD5
  FROM batch
  JOIN Path ON (batch.Path = Path.Path)
  JOIN Filename ON (batch.Name = Filename.Name)

This one has two split personalities, depending on how many records are in batch.
For small batch tables, it does nested loops.
For big batch tables (initially, more than around one million records) it decides to hash join path to batch (still ok, path is reasonably small) and then filename to batch before starting. And that's when the problems begin. The behaviour seems logical to me: it should switch to hash joins when batch gets bigger, but here the switch seems to happen much too early, considering the size of filename.
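For what it's worth, checking which of the two plans a given batch gets is just a matter of running EXPLAIN on the SELECT part from within the despooling session (it has to be the same session, since batch is a temporary table):

EXPLAIN
  SELECT batch.FileIndex, batch.JobId, Path.PathId,
         Filename.FilenameId, batch.LStat, batch.MD5
  FROM batch
  JOIN Path ON (batch.Path = Path.Path)
  JOIN Filename ON (batch.Name = Filename.Name);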

First of all, performance remains much better with nested loops, except for extremely big batches (I'd say over 30 million records, extrapolating from the times I'm seeing with 10 million), so if I disable hash/merge joins I get my performance back on these queries (they execute in around the same time as the searches in path and filename above). So I found a way to make most of my queries do nested loops (I'll come back to this later; the test itself is sketched just below).
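For the sake of completeness, the test was nothing more than session-level switches before the big INSERT (nothing permanent):

SET enable_hashjoin = off;
SET enable_mergejoin = off;
-- ... run the INSERT INTO File ... SELECT ... shown above ...
RESET enable_hashjoin;
RESET enable_mergejoin;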

Second, if there is more than one of these big sorts running at the same time, performance degrades drastically (we had two of them this weekend, and they both took 24 hours to complete). This is probably due to our rather poor disk setup (we didn't have a big budget for it). Linux was not swapping.


So all of this makes me think there is a cost-estimation problem in this setup: with the default values, PostgreSQL seems to underestimate the cost of sorting here (the row estimates were good, no problem with that). PostgreSQL seems to think that at around 1 million records in batch it should switch to a hash join on filename and path, so we end up hashing the 17 million records of path and the 80 million of filename, then joining and inserting into file (we're talking about sorting around 15 GB for each of these despools, in parallel).

Temporarily, I moved the problem to somewhat larger batch sizes by changing random_page_cost to 0.02 and seq_page_cost to 0.01, but I feel like a sorcerer's apprentice with this, as I'm telling PostgreSQL that fetching rows from disk is much cheaper than it really is. These values are, I think, completely abnormal. With this change, the plan switches at around 8 million records. We had two batches of around 9 million at the same time this weekend, and both took 24 hours, while the nested loops before the join (for the inserts into path and filename) did their work in minutes...
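For completeness, the current workaround amounts to no more than this (shown here as session-level SETs; the same values could of course go into postgresql.conf):

-- Make random I/O look almost free so the planner keeps choosing the
-- nested-loop / index plan for bigger batch tables. Clearly not realistic
-- values for this hardware.
SET seq_page_cost = 0.01;
SET random_page_cost = 0.02;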

So, finally, to my questions:
- Is it normal that PostgreSQL is this far off base on these queries? (Sorry, I don't have the plans; if they are required I'll do my best to get some, but they really are the two obvious plans for this kind of query.) What could make it choose the hash join for batch tables that are too small for it?
- Is changing the two costs the way to go?
- Is there a way to tell PostgreSQL that sorting is more costly than it thinks, instead of telling it that fetching data from disk costs almost nothing?
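(As far as I understand, the only knobs that go in that direction are the CPU cost parameters, since sort comparisons are costed from cpu_operator_cost. Something like the values below, perhaps, but I have no idea whether pushing these up is any saner than pulling the page costs down, so take it purely as an illustration:)

-- Purely illustrative values: raise the CPU cost parameters so sorting
-- and hashing look more expensive relative to page fetches.
SET cpu_operator_cost = 0.01;   -- default 0.0025
SET cpu_tuple_cost = 0.04;      -- default 0.01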




Here are the other non-default values from my configuration:

shared_buffers = 2GB
work_mem = 64MB
maintenance_work_mem = 256MB
max_fsm_pages = 15000000 # There are quite big deletes with bacula ...
effective_cache_size = 800MB
default_statistics_target = 1000

PostgreSQL is 8.3.5 on Debian Lenny



I'm sorry for this very long email. I tried to be as precise as I could, but don't hesitate to ask for more details.

Thanks for helping.

Marc Cousin



--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
