Robert,

* Robert Schnabel (schnabelr@xxxxxxxxxxxx) wrote:
> Once the bulk data is inserted into the tables I generally
> do some updates on columns to set values which characterize the
> data.

Please tell me you're not running actual full-table UPDATE statements...
You would be *much* better off either:

a) munging the data on the way in (if possible/reasonable)

b) loading the data into temp tables first, and then using INSERT
   statements to move the data into the 'final' tables WITH the new
   columns/info you want

c) considering whether you can normalize the data into multiple tables
   and/or cut the columns down to only what you need as you go through
   the above, too

A full-table UPDATE means you're basically making the table twice as
big as it needs to be, and it'll never get smaller.

> These columns then get indexed. Basically once the initial
> manipulation is done the table is then static and what I'm looking
> for is query speed.

Sadly, this is the same type of DW need that I've got (though with
telecom data and phone calls, not genetic stuff ;), and PG ends up
being limited by the fact that it can only use one core/thread to go
through the data.

You might consider investing some time in figuring out how to
parallelize your queries. My approach has been to partition the data
(probably something you're doing already) into multiple tables, have
shell/perl scripts run a given query against all of the tables, dump
the results of that aggregation/analysis into other tables, and then
run a final 'merge' query.

> The data is sorted by snp_number, sample_id. So if I want the data
> for a given sample_id it would be a block of ~58k rows. The size of
> the table depends on how many sample_id's there are. My largest has
> ~30k sample_id by 58k snp_number per sample. The other big table
> (with children) is "mutations" and is set up similarly so that I can
> access individual tables (samples) based on constraints. Each of
> these children have between 5-60M records.

Understand that indexes are typically only going to be used (or
useful) if the number of records being returned is small relative to
the size of the table (e.g., 5%).

> This is all direct attach storage via SAS2 so I'm guessing it's
> probably limited to the single port link between the controller and
> the expander. Again, geneticist here not computer scientist. ;-)

That link certainly isn't going to help things... You might consider
how, or whether, you can improve that.

> All of the data could be reloaded. Basically, once I get the data
> into the database and I'm done manipulating it I create a backup
> copy/dump which then gets stored at a couple different locations.

You might consider turning fsync off while you're doing these massive
data loads, and make sure that you issue your 'CREATE TABLE' and your
'COPY' statements in the same transaction. Again, I suggest loading
into temporary (CREATE TEMPORARY TABLE) tables first, then doing the
CREATE TABLE/INSERT statement for the 'real' table. Make sure that you
create *both* your constraints *and* your indexes *after* the table is
populated. And if you turn fsync off, make sure you turn it back on. :)
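As a rough sketch of what I mean (the table and column names here are
invented, and the is_het column is just a stand-in for whatever
characterizing values you actually derive):

    BEGIN;

    -- Staging table: no indexes, no constraints, gone at commit.
    CREATE TEMPORARY TABLE snp_load (
        sample_id   integer,
        snp_number  integer,
        allele1     char(1),
        allele2     char(1)
    ) ON COMMIT DROP;

    -- Server-side COPY shown here; \copy from psql works the same way
    -- if the file lives on the client.
    COPY snp_load FROM '/path/to/chip_data.txt';

    -- Create the real table and populate it in one pass, computing the
    -- derived column during the INSERT instead of with a later
    -- full-table UPDATE.
    CREATE TABLE genotypes (
        sample_id   integer NOT NULL,
        snp_number  integer NOT NULL,
        allele1     char(1),
        allele2     char(1),
        is_het      boolean
    );

    INSERT INTO genotypes
    SELECT sample_id, snp_number, allele1, allele2,
           (allele1 IS DISTINCT FROM allele2) AS is_het
    FROM snp_load;

    -- Constraints and indexes only after the data is in.
    ALTER TABLE genotypes ADD PRIMARY KEY (sample_id, snp_number);
    CREATE INDEX genotypes_snp_idx ON genotypes (snp_number);

    COMMIT;

(fsync itself is a postgresql.conf setting, not something you can flip
per-session, so that part happens outside the SQL.)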
> My goal is to 1) have a fairly robust system so that I don't have to
> spend my time rebuilding things and 2) be able to query the data
> quickly. Most of what I do are ad hoc queries. I have an idea...
> "how many X have Y in this set of Z samples" and write the query to
> get the answer. I can wait a couple minutes to get an answer but
> waiting an hour is becoming tiresome.

Have you done any analysis to see what the bottleneck actually is?
When you run top, is your PG process constantly in 'D' state, or is it
in 'R' state, or what? That might help figure some of this out. Note
that parallelizing the query will help regardless of whether it's
disk-bound or CPU-bound, given the kind of hardware you're talking
about (lots of spindles, multiple CPUs, etc.).

Thanks,

Stephen