Re: Querying distinct values from a large table

Richard Huxton <dev@xxxxxxxxxxxx> · Tue, 30 Jan 2007 10:02:33 +0000

Igor Lobanov wrote:

Richard Huxton wrote:
I have rather large table with about 5 millions of rows and a dozen 
of columns. Let's suppose that columns are named 'a', 'b', 'c' etc. I 
need to query distinct pairs of ('a';'b') from this table.
 >
What version of PostgreSQL is it?

8.1.4

Current release is 8.1.6 - probably worth upgrading when you've got 
time. It should be a simple matter of replacing the binaries but do 
check the release notes.

How many distinct values are you getting back from your 5 million 
rows? If there are too many, an index isn't going to help.

No more than 10,000.

OK. Should be possible to do something then.

Can you share the EXPLAIN ANALYSE output? You might want to try 
increasing work_mem for this one query to speed any sorting.

Real table and colum names are obfuscated because of NDA, sorry.

Fair enough.

explain analyze select distinct a, b from tbl

EXPLAIN ANALYZE output is:

 Unique  (cost=500327.32..525646.88 rows=1848 width=6) (actual 
time=52719.868..56126.356 rows=5390 loops=1)
   ->  Sort  (cost=500327.32..508767.17 rows=3375941 width=6) (actual 
time=52719.865..54919.989 rows=3378864 loops=1)
         Sort Key: a, b
         ->  Seq Scan on tbl  (cost=0.00..101216.41 rows=3375941 
width=6) (actual time=16.643..20652.610 rows=3378864 loops=1)
 Total runtime: 57307.394 ms

Hmm - am I right in thinking (a,b) are the only two columns on this 
table? That means you'll have a lot of rows per page and an index scan 
could end up fetching lots of pages to check the rows are visible. Still 
- I'd expect it to be better than a seq scan.

The first thing to try is to put the index back and run "SET 
enable_seqscan=off" before the explain analyse. That should force it to 
use the index. Then we'll see what costs it's expecting.

How often is the table updated? Clustering might buy you some 
improvements (but not a huge amount I suspect).

It is updated once per 3-5 seconds.

OK - forget clustering then.

And one more thing. I don't know if it helps, but column 'a' can have 
value from a limited set: 0, 1 or 2. Column 'b' is also an integer 
(foreign key, actually).

Hmm - might be worth trying distinct on (b,a) with an index that way 
around - that would give you greater selectivity at the top-level of the 
btree. Can you repeat the EXPLAIN ANALYSE with that too please.

--
  Richard Huxton
  Archonet Ltd