Re: Performace Optimization for Dummies

Matthew Nuzum <matthew.nuzum@xxxxxxxxxxxxx> · Thu, 28 Sep 2006 13:47:26 -0500

Carlo Stonebanks wrote:
>> are you using the 'copy' interface?
> 
> Straightforward inserts - the import data has to transformed, normalised and 
> de-duped by the import program. I imagine the copy interface is for more 
> straightforward data importing. These are - buy necessity - single row 
> inserts.
> 

I know this is an answer to a question you didn't ask, but here it is. I
was once doing stuff where I processed log files and had to do many
lookups to normalize the data before insertion.

I started out doing everything in SQL and using postgresql tables and it
took a little over 24 hours to process 24 hours worth of data. Like you,
it was single process, many lookups.

I found a better way. I rewrote it (once in c#, again in python) and
used hashtables/dictionaries instead of tables for the lookup data. For
example, I'd start by loading the data into hash tables (yes, this took
a *lot* of ram) then for each row I did something like:
 1. is it in the hash table?
   1. If not, insert it into the db
   1. Insert it into the hashtable
 2. Get the lookup field out of the hash table
 3. Output normalized data

This allow me to create text files containing the data in COPY format
which can then be inserted into the database at dramatically increased
speeds.

My first version in C# (mono) cut the time down to 6 hours for 24 hours
worth of data. I tweaked the algorithms and rewrote it in Python and got
it down to 45 min. (Python can't take all the credit for the performance
boost, I used an improved technique that could have been done in C# as
well) This time included the time needed to do the copy and update the
indexes.

I created a version that also used gdb databases instead of hash tables.
It increased the time from 45 min to a little over an hour but decreased
the memory usage to something like 45MB (vs dozens or hundreds of MB per
hashtable)
-- 
Matthew Nuzum
newz2000 on freenode