On 4:55 pm 07/18/08 Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
> The only thing I can think of is that you had a huge number of rows
> with all the same hash value, so that there wasn't any way to split
> the batch into smaller sections.  What are the join keys exactly in
> this query, and what can you tell us about their data distributions?

I can't post the actual table or column names, so here are the actual
select and explain with all names changed.

insert into customer_transactions (record_id, date, type, amount, ids, groupid)
select ca.record_id, coh.date, coh.type, coh.amount, coh.ids, ids.groupid
from customer_original_historical coh,
     cards ca,
     customer_ids ids
where ca.natural_key = coh.natural_key
  and ids.ids = coh.ids
  and coh.yearmo > '200703';

Hash Join  (cost=712213.57..27293913.33 rows=234402352 width=24)
  Hash Cond: (coh.id = ids.id)
  ->  Hash Join  (cost=551387.26..18799378.16 rows=234402352 width=22)
        Hash Cond: (coh.user_id = ca.user_id)
        ->  Seq Scan on customer_original_historical coh
              (cost=0.00..6702501.40 rows=234402352 width=47)
              Filter: (yearmo > '200703'::bpchar)
        ->  Hash  (cost=268355.67..268355.67 rows=14637567 width=32)
              ->  Seq Scan on cards ca  (cost=0.00..268355.67 rows=14637567 width=32)
  ->  Hash  (cost=77883.25..77883.25 rows=5055525 width=6)
        ->  Seq Scan on customer_ids ids  (cost=0.00..77883.25 rows=5055525 width=6)

There was a single table, customer_original_historical, which used a
natural key stored in a text field.  Most queries used
customer_original_historical by itself or joined it against a single
other table, which we should call "area".

The new schema I am testing splits that single table into 12 tables per
month.  In addition, I replaced the natural keys with a synthetic
integer key, and I replaced the "area" table with a customer_ids table
that has only two columns: the synthetic key for the historical rows
and a region.  In order to have 12 tables per month, I grouped all the
regions into 12 groups.

Queries are usually within a single region, so what I am trying to
benchmark is whether dividing 24 months of data into 24 sets of 12
region groups will perform better than a single large table.

The distribution of the joins is as follows: there are about 1,000,000
unique natural keys, and each natural key has on average 15 rows per
month.  The ids are the regions the natural_keys belong to; figure tens
of thousands of natural_keys per id.

Is that along the lines of what you were looking for?
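
For reference, here is a rough sketch of the layout I am benchmarking.
The table and column names and the types below are placeholders, not
the real DDL, but the shape is the same: one customer_ids lookup table
plus one table per month per region group.

-- sketch only; real names and types differ
create table customer_ids (
    ids     integer primary key,  -- synthetic integer key replacing the natural text key
    groupid integer not null      -- region group (1..12)
);

-- one of the 12 tables for a given month; each table holds a single
-- region group
create table customer_transactions_200801_g01 (
    record_id integer,
    date      date,
    type      integer,
    amount    numeric,
    ids       integer references customer_ids (ids),
    groupid   integer check (groupid = 1)
);

A query that stays within one region then only has to read the 24
monthly tables for that region group instead of one table holding all
24 months.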