Would it be possible to just update the summary table, instead of recreating
it from scratch every night?
Hrm, I believe it's probably less work for the computer if it's rebuilt. Any
number of rows may be changed during an update, not to mention additions, so
I'd have to pull out what's changed and sync it with what's already in the
summary table. It'd be a lot more selects and program-side computation to save
the big copy; it might work out, but I'd say this would be my last-ditch
option. :)
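If we ever did go incremental, I imagine it would boil down to something like
the sketch below: figure out which groups were touched, re-aggregate just
those, and swap the rows into the summary table. All the table and column
names here (detail, summary, summarizer_state, updated_at, etc.) are made up
for illustration, not our real schema, and it assumes change tracking we don't
necessarily have today.

    BEGIN;

    -- Find the groups touched since the last run (assumes some kind of
    -- change tracking, e.g. an updated_at column on the detail data).
    CREATE TEMP TABLE changed_groups AS
        SELECT DISTINCT group_id
          FROM detail
         WHERE updated_at > (SELECT last_run FROM summarizer_state);

    -- Re-aggregate just those groups from the detail data.
    CREATE TEMP TABLE changed_summary AS
        SELECT d.group_id, count(*) AS row_count, sum(d.amount) AS total
          FROM detail d
          JOIN changed_groups c ON c.group_id = d.group_id
         GROUP BY d.group_id;

    -- Swap the recomputed rows into the summary table.
    DELETE FROM summary
     WHERE group_id IN (SELECT group_id FROM changed_groups);

    INSERT INTO summary (group_id, row_count, total)
        SELECT group_id, row_count, total FROM changed_summary;

    UPDATE summarizer_state SET last_run = now();

    COMMIT;

That's a lot of extra selects and bookkeeping just to avoid the one big copy,
which is why I'd keep it as a last resort.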
Building these indexes takes forever! It's a long grind through the inserts,
and then building the indexes takes a hefty amount of time too (about 9
hours). Now, the application is likely partly at fault, and we're working to
make it more efficient, but that has nothing to do with the index build time.
I'm wondering what, if anything, we can do to make this better: would it be
better to leave the indexes on? It doesn't seem to be. Would it be better to
use INSERTs instead of COPYs? Doesn't seem to be.
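For context, the rebuild already follows the usual bulk-load pattern: drop the
indexes, load the data, recreate the indexes at the end, so the builds run
against the full 9 million rows in one shot. Roughly like this, with
placeholder index and table names rather than our real schema:

    -- Drop the indexes so the bulk load itself stays fast.
    DROP INDEX big_summary_col_a_idx;
    DROP INDEX big_summary_col_b_idx;

    -- ... bulk load of the freshly summarized data goes here (COPY, described below) ...

    -- Recreate the indexes afterwards; this is where the hours go.
    CREATE INDEX big_summary_col_a_idx ON big_summary (col_a);
    CREATE INDEX big_summary_col_b_idx ON big_summary (col_b);
    ANALYZE big_summary;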
Would it help if you created multiple indexes simultaneously? You have enough
CPU to do it. Is the index creation CPU or I/O bound? 9 million rows should
fit in 16 GB of memory, right?
This is a very very interesting idea. It looks like we're
probably not fully utilizing the machine for the index build, and this
could be the ticket for us. I'm going to go ahead and set up a test for
this and we'll see how it goes.
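The test will basically be to fire off the CREATE INDEX statements from
separate connections at the same time instead of one after another, something
like this (index and column names are placeholders):

    -- session 1
    CREATE INDEX big_summary_col_a_idx ON big_summary (col_a);

    -- session 2, a separate connection started at the same time
    CREATE INDEX big_summary_col_b_idx ON big_summary (col_b);

    -- session 3, ditto
    CREATE INDEX big_summary_col_c_idx ON big_summary (col_c);

One thing I'll have to watch: each session sorts with its own
maintenance_work_mem, so the per-session setting may need to come down to keep
the combined total inside RAM.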
Can you describe the load process in more detail? What's it doing with the 6
gigs?
There are two halves to the load process: the loader and the
summarizer. The loader is the part that takes 6 gigs; the summarizer only
takes a few hundred megs.
Basically we have these COBOL files that vary in size but
are usually in the hundreds-of-megs range. These files contain new data
OR updates to existing data. We load this data from the COBOL files in
chunks, so that's not a place where we're burning a lot of memory.
The first thing we do is cache the list of COBOL ID codes that are
already in the DB; the COBOL ID codes are really long numeric strings, so
we use a sequenced integer primary key. The cache translates COBOL IDs to
primary keys, and this takes most of our memory nowadays. Our cache is
fast, but it's kind of a memory hog. We're working on trimming that down,
but it's definitely faster than making a query for each COBOL ID.
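To make that concrete, the database side of it looks roughly like this (table
and column names are stand-ins for the real schema). The cache is just an
in-memory copy of this mapping, loaded with one big select at startup instead
of a query per incoming COBOL ID:

    -- Long numeric COBOL IDs mapped to a compact surrogate key; the rest
    -- of the schema references the integer id, not the long string.
    CREATE TABLE cobol_id_map (
        id       serial PRIMARY KEY,            -- sequenced integer key
        cobol_id varchar(64) NOT NULL UNIQUE    -- the long numeric string from the files
    );

    -- What the loader pulls into its in-memory cache at startup:
    SELECT cobol_id, id FROM cobol_id_map;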
The load is relatively fast, is considered "acceptable", and
has stayed fairly constant in speed. It's the summarizer that's brutal.
The summarizer produces 3 main summary tables and a few
secondaries that don't take much time to make. Two of them are smallish
and not that big a deal, and the last one is the biggie that's 9 mil rows
and growing. To produce the 9 mil row table, we query out the data in
groups, do our processing, and save that data to a series of text files
that are in blocks of 10,000 rows as I recall. We then copy each file
into the DB (there were some issues with copying in an entire 9 mil row
file in the past, which is why we don't use just one file -- those issues
have been fixed, but we didn't undo the change).
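So the tail end of the summarizer is just a loop of COPYs, one per chunk file;
the paths and table name below are made up:

    -- Server-side COPY of each 10,000-row chunk file the summarizer wrote
    -- (psql's \copy would be the client-side equivalent).
    COPY big_summary FROM '/data/summarizer/chunk_0001.txt';
    COPY big_summary FROM '/data/summarizer/chunk_0002.txt';
    -- ... and so on for each chunk file ...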
What's your maintenance_work_mem setting? It can make a big difference in
sorting the data for indexes.
6 gigs currently. :)
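For reference, that's just the standard knob; it can live in postgresql.conf
or be bumped per-session right before the big index builds, e.g.:

    SET maintenance_work_mem = '6GB';   -- per-session override
    SHOW maintenance_work_mem;          -- confirm what the session is actually using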
If you could post the schema including the indexes, people might have more
ideas...
I'll have to ask first, but I'll see if I can :)
Talk to you later, and thanks for the info!
Steve