On 4:55 pm 07/18/08 Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
> The only thing I can think of is that you had a huge number of rows
> with all the same hash value, so that there wasn't any way to split
> the batch into smaller sections.  What are the join keys exactly in
> this query, and what can you tell us about their data distributions?

I can't post the actual table or column names, so here are the actual
select and explain with all names changed.

insert into customer_transactions (record_id, date, type, amount, ids, groupid)
select ca.record_id, coh.date, coh.type, coh.amount, coh.ids, ids.groupid
from customer_original_historical coh,
     cards ca,
     customer_ids ids
where ca.natural_key = coh.natural_key
  and ids.ids = coh.ids
  and coh.yearmo > '200703';

Hash Join  (cost=712213.57..27293913.33 rows=234402352 width=24)
  Hash Cond: (coh.id = ids.id)
  ->  Hash Join  (cost=551387.26..18799378.16 rows=234402352 width=22)
        Hash Cond: (coh.user_id = ca.user_id)
        ->  Seq Scan on customer_original_historical coh
              (cost=0.00..6702501.40 rows=234402352 width=47)
              Filter: (yearmo > '200703'::bpchar)
        ->  Hash  (cost=268355.67..268355.67 rows=14637567 width=32)
              ->  Seq Scan on cards ca  (cost=0.00..268355.67 rows=14637567 width=32)
  ->  Hash  (cost=77883.25..77883.25 rows=5055525 width=6)
        ->  Seq Scan on customer_ids ids  (cost=0.00..77883.25 rows=5055525 width=6)

There was a single table, customer_original_historical, which used a
natural key stored in a text field.  Most queries used
customer_original_historical by itself or joined it against a single
other table, which we should call "area".

The new schema I am testing splits that single table into 12 tables per
month.  In addition, I replaced the natural keys with a synthetic
integer key, and I replaced the "area" table with a customer_ids table
that has only two columns: the synthetic key for the historical rows
and a region.  In order to have 12 tables per month, I grouped all the
regions into 12 groups.

Queries are usually within a single region, so what I am trying to
benchmark is whether dividing 24 months of data into 24 sets of 12
region groups will perform better than a single large table.

The distribution of the joins is as follows: there are about 1,000,000
unique natural keys, and each natural key has on average 15 rows per
month.  The ids are the regions the natural_keys belong to; figure tens
of thousands of natural_keys per id.

Is that along the lines of what you were looking for?
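
For reference, here is a rough sketch of the layout I am benchmarking.
The table and column names and the types below are placeholders, not
the real DDL, but the shape is the same: one customer_ids lookup table
plus one table per month per region group.

-- sketch only; real names and types differ
create table customer_ids (
    ids     integer primary key,  -- synthetic integer key replacing the natural text key
    groupid integer not null      -- region group (1..12)
);

-- one of the 12 tables for a given month; each table holds a single
-- region group
create table customer_transactions_200801_g01 (
    record_id integer,
    date      date,
    type      integer,
    amount    numeric,
    ids       integer references customer_ids (ids),
    groupid   integer check (groupid = 1)
);

A query that stays within one region then only has to read the 24
monthly tables for that region group instead of one table holding all
24 months.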