Hi,

I'm trying to solve big performance issues with PostgreSQL + bacula while inserting very big sets of records.

I'm sorry, this email will be a long one: I've already spent quite a lot of time on the issue, I don't want to waste your time making you speculate on things I may already have done, and the problem is (or seems to me) a bit complex. The other problem is that I don't have the explain plans to provide with this email right now. I'll try to use this as a way to push 8.4 into this setup, so I can dump all these plans with auto_explain (the queries run on temporary tables, so the plans are a bit tricky to get).

Let me first explain (or remind you) how this works. Bacula is a backup solution; it inserts its metadata at the end of each backup (file name, directory name, size, etc.). For what we are interested in, there are 3 tables:

- file
- filename
- path

file is the one containing most records: it holds the real metadata. filename and path just contain an id and the real file or directory name (to save some space on redundant names).

Before explaining the issue, some sizing information.

file is 1.1 billion records for 280GB (with indexes):

   Column   |  Type   |                       Modifiers
------------+---------+-------------------------------------------------------
 fileid     | bigint  | not null default nextval('file_fileid_seq'::regclass)
 fileindex  | integer | not null default 0
 jobid      | integer | not null
 pathid     | integer | not null
 filenameid | integer | not null
 markid     | integer | not null default 0
 lstat      | text    | not null
 md5        | text    | not null
Indexes:
    "file_pkey" UNIQUE, btree (fileid)
    "file_fp_idx" btree (filenameid, pathid)
    "file_jpfid_idx" btree (jobid, pathid, filenameid)

path is 17 million records for 6 GB:

 Column |  Type   |                       Modifiers
--------+---------+-------------------------------------------------------
 pathid | integer | not null default nextval('path_pathid_seq'::regclass)
 path   | text    | not null
Indexes:
    "path_pkey" PRIMARY KEY, btree (pathid)
    "path_name_idx" UNIQUE, btree (path)

filename is 80 million records for 13GB:

   Column   |  Type   |                           Modifiers
------------+---------+---------------------------------------------------------------
 filenameid | integer | not null default nextval('filename_filenameid_seq'::regclass)
 name       | text    | not null
Indexes:
    "filename_pkey" PRIMARY KEY, btree (filenameid)
    "filename_name_idx" UNIQUE, btree (name)

There are several queries for each job despooling.

First, we fill a temp table (batch) with the raw data (filename, pathname, metadata) using COPY. No problem here.

Then we insert the missing pathnames into path and the missing filenames into filename, with this query (and its equivalent for filename):

INSERT INTO Path (Path)
SELECT a.Path FROM (SELECT DISTINCT Path FROM batch) AS a
WHERE NOT EXISTS (SELECT Path FROM Path WHERE Path = a.Path)

These do nested loops and work very well (after a sort on batch to get rid of the duplicates). They are reasonably fast (what one would expect when looping over millions of records): they do their job in a few minutes.

The problem occurs with the final query, which inserts the data into file, joining the temp table to both path and filename:

INSERT INTO File (FileIndex, JobId, PathId, FilenameId, LStat, MD5)
SELECT batch.FileIndex, batch.JobId, Path.PathId, Filename.FilenameId,
       batch.LStat, batch.MD5
FROM batch
JOIN Path ON (batch.Path = Path.Path)
JOIN Filename ON (batch.Name = Filename.Name)

This query has two split personalities, depending on how many records are in batch. For small batch tables, it does nested loops.
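For clarity, when I say further down that I "disable hash/merge joins" while testing, I mean nothing more exotic than the session-level planner switches, roughly as below. This is only a sketch of what I do by hand (batch is a temp table, so it only exists in the despooling session, which is also why I haven't managed to capture the real plans yet); it is not a captured plan:

-- force the nested loop plan, for this session only
SET enable_hashjoin = off;
SET enable_mergejoin = off;

-- the plan I'd like to capture, with the switches off and then back on
EXPLAIN
INSERT INTO File (FileIndex, JobId, PathId, FilenameId, LStat, MD5)
SELECT batch.FileIndex, batch.JobId, Path.PathId, Filename.FilenameId,
       batch.LStat, batch.MD5
FROM batch
JOIN Path ON (batch.Path = Path.Path)
JOIN Filename ON (batch.Name = Filename.Name);

With the two switches back on, the plan flips to the hash joins described next.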
For big batch tables (more than around one million records, with the original settings), it decides to hash join path to batch (still OK, path is reasonably small) and then filename to batch, before starting the insert. And that's when the problems begin.

The behaviour seems logical to me: it should switch to a hash join when batch gets bigger. But it switches much too early here, considering the size of filename.

First of all, performance remains much better with nested loops, except for extremely big batches (I'd say over 30 million records, extrapolating from the times I'm seeing with 10 million). So if I disable hash/merge joins (the session switches sketched above), I get my performance back on these queries: they execute in around the same time as the searches in path and filename above. So I found a way to make most of my queries do nested loops (I'll come back to this later).

Second, if more than one of these big hash joins runs at the same time, performance degrades drastically (we had 2 of them this weekend, and they both took 24 hours to complete). This is probably due to our quite bad disk setup (we didn't have a big budget for it). Linux was not swapping, though.

So all of this makes me think there is a cost evaluation problem in this setup: with the default values, PostgreSQL seems to underestimate the cost of hashing/sorting here (the row estimates were good, no problem with those). PostgreSQL seems to think that at around 1 million records in batch it should go with a hash join on filename and path, so it goes on hashing the 17 million records of path and the 80 million of filename, then joining and inserting into file (we're talking about hashing/sorting around 15 GB for each of these despools, in parallel).

Temporarily, I pushed the problem to somewhat larger batch sizes by changing random_page_cost to 0.02 and seq_page_cost to 0.01, but I feel like the sorcerer's apprentice with this, as I'm telling PostgreSQL that fetching rows from disk is much cheaper than it really is. These values are, I think, completely abnormal. Doing this, I got the change of plan at around 8 million records. And we had 2 jobs at 9 million records at the same time this weekend; both of them took 24 hours, while the nested loops before this join (for the inserts into path and filename) did their work in minutes...

So, finally, to my questions:

- Is it normal for PostgreSQL to be this far off base on these queries? (Sorry, I don't have the plans; if they are required I'll do my best to get some, but they really are the two obvious plans for this kind of query.) What could make it choose the hash join for batch tables that are still this small?
- Is changing the two costs the way to go?
- Is there a way to tell PostgreSQL that sorting/hashing is more costly than it thinks, instead of telling it that fetching data from disk costs almost nothing?

Here are the other non-default values from my configuration:

shared_buffers = 2GB
work_mem = 64MB
maintenance_work_mem = 256MB
max_fsm_pages = 15000000        # There are quite big deletes with bacula ...
effective_cache_size = 800MB
default_statistics_target = 1000

PostgreSQL is 8.3.5 on Debian Lenny.

I'm sorry for this very long email; I tried to be as precise as I could, but don't hesitate to ask me for more.

Thanks for helping.

Marc Cousin
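PS: in case it helps to picture the batch table, it's a plain temporary table filled in one go with COPY. I don't have Bacula's exact DDL in front of me, so the following is only reconstructed from the queries above; the column list is the one the final insert uses, and the types are my guess (taken from the destination tables), so don't treat it as authoritative:

CREATE TEMPORARY TABLE batch (
    fileindex integer,  -- batch.FileIndex in the final insert
    jobid     integer,
    path      text,     -- joined against path.path
    name      text,     -- joined against filename.name
    lstat     text,
    md5       text
);

-- raw metadata from the spool file follows the COPY
COPY batch (fileindex, jobid, path, name, lstat, md5) FROM STDIN;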