To be sure about the performance of index scan, try forcing the planner to use it instead of seq scan. A way might be to force the planner to use index scan on your table by using a dummy where clause. Try using a condition in your where clause which holds true for all rows. --Imad www.EnterpriseDB.com On 1/30/07, Richard Huxton <dev@xxxxxxxxxxxx> wrote:
Igor Lobanov wrote: > > > Richard Huxton wrote: >>> I have rather large table with about 5 millions of rows and a dozen >>> of columns. Let's suppose that columns are named 'a', 'b', 'c' etc. I >>> need to query distinct pairs of ('a';'b') from this table. > > >> What version of PostgreSQL is it? > > 8.1.4 Current release is 8.1.6 - probably worth upgrading when you've got time. It should be a simple matter of replacing the binaries but do check the release notes. >> How many distinct values are you getting back from your 5 million >> rows? If there are too many, an index isn't going to help. > > No more than 10,000. OK. Should be possible to do something then. >> Can you share the EXPLAIN ANALYSE output? You might want to try >> increasing work_mem for this one query to speed any sorting. > > Real table and colum names are obfuscated because of NDA, sorry. Fair enough. > explain analyze select distinct a, b from tbl > > EXPLAIN ANALYZE output is: > > Unique (cost=500327.32..525646.88 rows=1848 width=6) (actual > time=52719.868..56126.356 rows=5390 loops=1) > -> Sort (cost=500327.32..508767.17 rows=3375941 width=6) (actual > time=52719.865..54919.989 rows=3378864 loops=1) > Sort Key: a, b > -> Seq Scan on tbl (cost=0.00..101216.41 rows=3375941 > width=6) (actual time=16.643..20652.610 rows=3378864 loops=1) > Total runtime: 57307.394 ms Hmm - am I right in thinking (a,b) are the only two columns on this table? That means you'll have a lot of rows per page and an index scan could end up fetching lots of pages to check the rows are visible. Still - I'd expect it to be better than a seq scan. The first thing to try is to put the index back and run "SET enable_seqscan=off" before the explain analyse. That should force it to use the index. Then we'll see what costs it's expecting. >> How often is the table updated? Clustering might buy you some >> improvements (but not a huge amount I suspect). > > It is updated once per 3-5 seconds. OK - forget clustering then. > And one more thing. I don't know if it helps, but column 'a' can have > value from a limited set: 0, 1 or 2. Column 'b' is also an integer > (foreign key, actually). Hmm - might be worth trying distinct on (b,a) with an index that way around - that would give you greater selectivity at the top-level of the btree. Can you repeat the EXPLAIN ANALYSE with that too please. -- Richard Huxton Archonet Ltd ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org