Hi Kevin, I'm not deeply knowledgeable about PostgreSQL, but my guess is that 2 things are doing you in: (1) scanning all those nulls during SELECTs (even though PostgreSQL is efficient at nulls, there are still tens or hundreds of billions of them) (2) All those single-field indexes, and aggregations between them Both (1) and (2) make it very difficult for PG to cache effectively. You have the advantage that your data is read-only at query time, so I'd suggest this (fairly lengthy) experiment: Go back to a split-table format, where you have: - one rectangular table containing those fields which are always, or nearly always, filled (the "Full" table.) You talk about there being "some knowns". Does that mean that you know some combinations of two or more fields will be selected on very frequently? Optimise any of these combinations in the "full" table fields using multi-field indexes across these combinations (e.g. (state,age).) Put full single-field indexes on all fields in the "Full" table. - one or more tables (one initially, see below) containing the fields that are mostly null (the "Sparse" table.) Store the sparse data in a "depivoted" form. Explanation follows: "Sparse" Table --- Instead of (the original "sparse" table): Id,field1,field2,field3,field4,...field500 Rec1,... Rec2,... ... RecM,... store the sparse data as: Id,fieldname,value Rec1,field1name,field1value(Rec1) Rec1,field2name,field1value(Rec2) ... Rec1,field500name,field500value(Rec1) Rec2,field1name,field1value(Rec2) ... RecM,field500name,field500value(RecM) I.e. one row per cell in the "sparse" table, and an Id to link to the "Full" table. For null values, don't store a depivoted record at all. I'd estimate this would give you a few billion rows at most in this table. (If anyone has a better name for this process than "depivoting", please pass it along!) In the depivoted table, put single-field indexes and multi-field indexes on every combination of Id, fieldname, value in the depivoted data: (Id) (fieldname) (value) (Id,fieldname) (Id,value) (fieldname,value) You might eventually have to keep a different depivoted table for each type of field value (boolean, integer, character varying etc) but you could do a dirty experiment by converting all values to CHARACTER VARYING and having a look at query performance using the new structure before doing further work. Rationale --- "Depivoting" makes sense to me because your data is so sparse. The huge number of nulls may be causing severe aggregation bottlenecks and many cache misses. All those indexes will absolutely kill the query performance of any database in terms of hard disk seeks, data cache limits and aggregation time. I'm not surprise that both MySQL and PostgreSQL struggle. Too many indexes probably gave you a diminishing return on a table this big because 15G RAM on your server is way too small a cache for so many fields and indexes. "Depivoting" eliminates the need to scan the huge number of nulls in the dataset. Even if nulls are very efficiently handled in PostgreSQL, you're talking about hundreds of billions of them, and hundreds of billions of anything is never going to be quick. Better not to process them at all. "De-pivoting" will (of course) eventually mean rewriting all your querying code, and you'll probably need to "rebuild" the sparse data into a wide table format at some point. But if the result set is small, this should be a small price to pay for better SELECT query performance. If you want to do a quick and dirty experiment, I have an MS Access app that depivots arbitrarily wide tables. I'd be glad to pass it along, although there is a risk it's too small a gun for the job. But if Access manages okay with the depivoted table, it might be worth a try. Based on 500 fields, 250M records, 2% filled it looks like it might depivot your table overnight, or better. You'd finish with about 2.5 billion rows. Best wishes, Nick > -----Original Message----- > From: pgsql-general-owner@xxxxxxxxxxxxxx [mailto:pgsql-general- > owner@xxxxxxxxxxxxxx] On Behalf Of Kevin Galligan > Sent: Thursday, 30 October 2008 7:18 AM > To: pgsql-general@xxxxxxxxxxxxxx > Subject: Slow query performance > > I'm approaching the end of my rope here. I have a large database. > 250 million rows (ish). Each row has potentially about 500 pieces of > data, although most of the columns are sparsely populated. > > What I'm trying to do is, essentially, search for sub-sets of that > data based on arbitrary queries of those data columns. the queries > would be relatively simple ("dirbtl is not null and qqrq between 20 > and 40"). After the database is built, it is read only. > > So, I started with maybe 10-15 fields in a main table, as most records > have values for those fields. Then had individual tables for the > other values. The idea is that the percentage of rows with values > drops off significantly after those main tables. That, an each > individual query looks at probably 3 or 4 fields in total. The > performance of those queries was pretty bad. Its got to join large > numbers of values, which didn't really work out well. > > So, went the other direction completely. I rebuilt the database with > a much larger main table. Any values with 5% or greater filled in > rows were added to this table. Maybe 130 columns. Indexes applied to > most of these. Some limited testing with a smaller table seemed to > indicate that queries on a single table without a join would work much > faster. > > So, built that huge table. now query time is terrible. Maybe a > minute or more for simple queries. > > I'm running vacuum/analyze right now (which is also taking forever, BTW). > > The box has 15 g of ram. I made the shared_buffers setting to 8 or 9 > gig. My first question, what would be better to bump up to increase > the performance? I thought that was the field to jack up to improve > query time or index caching, but I've read conflicting data. The 15 > ram is available. > > I originally had this in mysql. Also bad performance. I understand > how to optimize that much better, but it just wasn't cutting it. > > Anyway, help with tuning the settings would be greatly appreciated. > Advice on how best to lay this out would also be helpful (I know its > difficult without serious detail). > > Thanks in advance, > -Kevin > > -- > Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general
BEGIN:VCARD VERSION:2.1 N:Mellor;Nick FN:Nick Mellor (nick.mellor.groups@xxxxxxxxx) TITLE:.NET Programmer, Python/Django/Apache Web Programmer TEL;WORK;VOICE:+61 (03) 54234292 TEL;HOME;VOICE:+61 (03) 54234292 TEL;CELL;VOICE:+61 (0458) 192321 ADR;WORK:;;455 Goldfields Road;Kyneton;VIC;3444;Australia LABEL;WORK;ENCODING=QUOTED-PRINTABLE:455 Goldfields Road=0D=0AKyneton, VIC 3444=0D=0AAustralia ADR;HOME:;;455 Goldfields Road;Kyneton;VIC;3444;Australia LABEL;HOME;ENCODING=QUOTED-PRINTABLE:455 Goldfields Road=0D=0AKyneton, VIC 3444=0D=0AAustralia URL;WORK:http://www.back-pain-self-help.com; http://www.behandlerlisten.dk EMAIL;PREF;INTERNET:nick.mellor.groups@xxxxxxxxx REV:20081010T022249Z END:VCARD
-- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general