Hi Marc, I don't have really extensive comments, but I found two small things... On Monday 13 July 2009 15:40:18 Marc Cousin wrote: > I'm trying to solve big performance issues with PostgreSQL + bacula while > inserting very big sets of records. > > I'm sorry, this email will be a long one, as I've already spent quite a lot > of time on the issue, I don't want to waste your time speculating on things > I may already have done, and the problem is (or seems to me) a bit complex. > The other problem is that I don't have the explain plans to provide with > the email right now. I'll try to use this as a way to push 8.4 in this > setup, to dump all these plans with autoexplain (queries are on temporary > tables, so a bit tricky to get). > > Let me first explain or remind how this works. Bacula is a backup solution > and is trying to insert its metadatas at the end of backups (file name, > directory name, size, etc ...) > For what we are interested in, there are 3 tables : > - file > - filename > - path > > file is the one containing most records. It's the real metadata. filename > and path just contain an id and the real file or directory name (to save > some space with redundant names). > > Before explaining the issue, just some information about sizing here : > > file is 1.1 billion records for 280GB (with indexes). > > Column | Type | Modifiers > ------------+---------+---------------------------------------------------- >--- fileid | bigint | not null default > nextval('file_fileid_seq'::regclass) fileindex | integer | not null > default 0 > jobid | integer | not null > pathid | integer | not null > filenameid | integer | not null > markid | integer | not null default 0 > lstat | text | not null > md5 | text | not null > Indexes: > "file_pkey" UNIQUE, btree (fileid) > "file_fp_idx" btree (filenameid, pathid) > "file_jpfid_idx" btree (jobid, pathid, filenameid) > > > path is 17 million for 6 GB > > Column | Type | Modifiers > --------+---------+------------------------------------------------------- > pathid | integer | not null default nextval('path_pathid_seq'::regclass) > path | text | not null > Indexes: > "path_pkey" PRIMARY KEY, btree (pathid) > "path_name_idx" UNIQUE, btree (path) > > filename is 80 million for 13GB > > Column | Type | Modifiers > ------------+---------+---------------------------------------------------- >----------- filenameid | integer | not null default > nextval('filename_filenameid_seq'::regclass) > name | text | not null > Indexes: > "filename_pkey" PRIMARY KEY, btree (filenameid) > "filename_name_idx" UNIQUE, btree (name) > > > There are several queries for each job despooling : > > First we fill a temp table with the raw data (filename, pathname, > metadata), using COPY (no problem here) > > Then we insert missing filenames in file, and missing pathnames in path, > with this query (and the same for file) : > > INSERT INTO Path (Path) > SELECT a.Path FROM (SELECT DISTINCT Path FROM batch) AS a > WHERE NOT EXISTS (SELECT Path FROM Path WHERE Path = a.Path) > > These do nested loops and work very well (after a sort on batch to get rid > from duplicates). They work reasonably fast (what one would expect when > looping on millions of records... they do their job in a few minutes). While this is not your questions, I still noticed you seem to be on 8.3 - it might be a bit faster to use GROUP BY instead of DISTINCT. > The problem occurs with the final query, which inserts data in file, > joining the temp table to both file and filename > > INSERT INTO File (FileIndex, JobId, PathId, FilenameId, LStat, MD5) > SELECT batch.FileIndex, > batch.JobId, > Path.PathId, > Filename.FilenameId, > batch.LStat, > batch.MD5 > FROM batch > JOIN Path ON (batch.Path = Path.Path) > JOIN Filename ON (batch.Name = Filename.Name) > > This one has two split personnalities, depending on how many records are in > batch. > For small batch tables, it does nested loops. > For big batch tables (more than around one million initially) it decides to > hash join path (still ok, it's reasonably small) and then filename to batch > before starting. And that's when the problems begin The behaviour seems > logicial to me, it should go to hash join when batch gets bigger, but it > seems to be much too early here, considering the size of filename. > > First of all, performance remains much better on nested loops, except for > extremely big batches (i'd say over 30 million, extrapolating from the > times I'm seeing with 10 millions records), so if I disable hash/merge > joins, I get my performance back on these queries (they execute in around > the same time as the searches in path and filename above). So I found a way > to make most of my queries do nested loops (I'll come back to this later) > > Second, If there is more than one of these big sorts, performance degrades > drastically (we had 2 of them this weekend, they both took 24 hours to > complete). This is probably due to our quite bad disk setup (we didn't have > a big budget for this). There was no swapping of linux > > > So all of this makes me think there is a cost evaluation problem in this > setup : with the default values, postgresql seems to underestimate the cost > of sorting here (the row estimates were good, no problem with that). > PostgreSQL seems to think that at around 1 million records in file it > should go with a hash join on filename and path, so we go on hashing the 17 > million records of path, the 80 millions of filename, then joining and > inserting into file (we're talking about sorting around 15 GB for each of > these despools in parallel). > > Temporarily I moved the problem at a bit higher sizes of batch by changing > random_page_cost to 0.02 and seq_page_cost to 0.01, but I feel like an > apprentice sorcerer with this, as I told postgreSQL that fetching rows from > disk are much cheaper than they are. These values are, I think, completely > abnormal. Doing this, I got the change of plan at around 8 million. And had > 2 of them at 9 millions at the same time this weekend, and both of the took > 24 hours, while the nested loops before the join (for inserts in path and > filename) did their work in minutes... > > So, finally, to my questions : > - Is it normal that PostgreSQL is this off base on these queries (sorry I > don't have the plans, if they are required I'll do my best to get some, but > they really are the two obvious plans for this kind of query). What could > make it choose the hash join for too small batch tables ? > - Is changing the 2 costs the way to go ? > - Is there a way to tell postgreSQL that it's more costly to sort than it > thinks ? (instead of telling it that fetching data from disk doesn't cost > anything). > Here are the other non-default values from my configuration : > > shared_buffers = 2GB > work_mem = 64MB > maintenance_work_mem = 256MB > max_fsm_pages = 15000000 # There are quite big deletes with bacula ... > effective_cache_size = 800MB > default_statistics_target = 1000 Your effective_cache_size is really small for the system you seem to have - its the size of IO caching your os is doing and uses no resources itself. And 800MB of that on a system with that amount of data seems a bit unlikely ;-) Using `free` you can see the amount of io caching your OS is doing atm. in the 'cached' column. That possibly might tip some plans in a direction you prefer. What kind of machine are you running this on? Andres -- Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance