On Fri, Nov 27, 2009 at 4:47 PM, Faheem Mitha <faheem@xxxxxxxxxxxxx> wrote: >>>> If not, you might want to look at some way of pre-marking the >>>> non-duplicate rows so that you don't have to recompute that each time. >>> >>> What are the options re pre-marking? >> >> Well, what I usually do is - if I'm going to do the same >> distinct-ification frequently, I add an extra column (say, a boolean) >> and set it to true for all and only those rows which will pass the >> distinct-ification filter. Then I can just say WHERE <that column >> name>. > > Yes, I see. The problem with is premarking is that the selection is somewhat > dynamic, in the sense that this depends on the idlink table, which depends > on patient data, which can change. Yeah. For things like this I find you have to think hard about how to organize your schema so that you can optimize the queries you care about. There are no "just do this and it works" solutions to performance problems of this type. Still, many of them are solvable by making the right decisions elsewhere. Sometimes you can use triggers to recompute your premarks when the data in the other table changes. Another strategy is to keep a cache of precomputed results somewhere. When the underlying data changes, you use triggers to invalidate anything in the cache that might now be wrong, and set things up so that it will be recomputed when next it is used. But in either case you have to figure out the right place to do the computation so that it gains you more than it saves you, and adjusting your schema is often necessary. >>>> Then you might be able to use the underlying table directly in the next >>>> CTE, which will usually permit better optimization, more use of indices, >>>> etc. It seems pretty unfortunate that dedup_patient_anno joins against >>>> geno >>>> and then patient_geno does what appears to be the same join again. Is >>>> there >>>> some way to eliminate that? If so it will probably help. >>> >>> You don't say whether you are looking at the PED or TPED query, so I'll >>> assume PED. They are similar anyway. >>> >>> I see your point re the joins. You mean >>> >>> anno INNER JOIN geno >>> >>> followed by >>> >>> geno INNER JOIN dedup_patient_anno >>> >>> ? I think the point of the first join is to reduce the anno table based >>> on >>> information from the geno table. The result is basically a subset of the >>> anno table with some potential duplication removed, which is then >>> re-joined >>> to the geno table. I agree this seems a bit suboptimal, and there might >>> be a >>> better way to do this. >> >> Yeah, I didn't think about it in detail, but it looks like it should >> be possible. Eliminating joins can sometimes have *dramatic* effects >> on query performance, and it never hurts. > > Failing all else, couldn't I smoosh together the two queries and do a triple > join? For reference, the two CTEs in question, from the PED query, are as > follows. > > dedup_patient_anno AS > ( SELECT * > FROM > (SELECT *, > row_number() OVER(PARTITION BY anno.rsid ORDER BY > anno.id) > FROM anno > INNER JOIN geno > ON anno.id = geno.anno_id > WHERE idlink_id = > (SELECT MIN(id) > FROM idlink > ) > ) AS s > WHERE row_number = '1' > ), > patient_geno AS > ( SELECT geno.idlink_id AS idlink_id, > geno.anno_id AS anno_id, > geno.snpval_id AS snpval_id, > allelea_id, alleleb_id > FROM geno > INNER JOIN dedup_patient_anno > ON geno.anno_id = dedup_patient_anno.id > ), If that will give the same results, which I'm not immediately certain about, then I highly recommend it. In general I would recommend only using CTEs to express concepts that can't sensibly be expressed in other ways, not to beautify your queries. Keep in mind that joins can be reordered and/or executed using different methods but most other operations can't be, so trying to get your joins together in one place is usually a good strategy, in my experience. And of course if that lets you reduce the total number of joins, that's even better. ...Robert -- Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance