Re: Querying distinct values from a large table

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Igor Lobanov wrote:


Richard Huxton wrote:
I have rather large table with about 5 millions of rows and a dozen of columns. Let's suppose that columns are named 'a', 'b', 'c' etc. I need to query distinct pairs of ('a';'b') from this table.
 >
What version of PostgreSQL is it?

8.1.4

Current release is 8.1.6 - probably worth upgrading when you've got time. It should be a simple matter of replacing the binaries but do check the release notes.

How many distinct values are you getting back from your 5 million rows? If there are too many, an index isn't going to help.

No more than 10,000.

OK. Should be possible to do something then.

Can you share the EXPLAIN ANALYSE output? You might want to try increasing work_mem for this one query to speed any sorting.

Real table and colum names are obfuscated because of NDA, sorry.

Fair enough.

explain analyze select distinct a, b from tbl

EXPLAIN ANALYZE output is:

Unique (cost=500327.32..525646.88 rows=1848 width=6) (actual time=52719.868..56126.356 rows=5390 loops=1) -> Sort (cost=500327.32..508767.17 rows=3375941 width=6) (actual time=52719.865..54919.989 rows=3378864 loops=1)
         Sort Key: a, b
-> Seq Scan on tbl (cost=0.00..101216.41 rows=3375941 width=6) (actual time=16.643..20652.610 rows=3378864 loops=1)
 Total runtime: 57307.394 ms

Hmm - am I right in thinking (a,b) are the only two columns on this table? That means you'll have a lot of rows per page and an index scan could end up fetching lots of pages to check the rows are visible. Still - I'd expect it to be better than a seq scan.

The first thing to try is to put the index back and run "SET enable_seqscan=off" before the explain analyse. That should force it to use the index. Then we'll see what costs it's expecting.

How often is the table updated? Clustering might buy you some improvements (but not a huge amount I suspect).

It is updated once per 3-5 seconds.

OK - forget clustering then.

And one more thing. I don't know if it helps, but column 'a' can have value from a limited set: 0, 1 or 2. Column 'b' is also an integer (foreign key, actually).

Hmm - might be worth trying distinct on (b,a) with an index that way around - that would give you greater selectivity at the top-level of the btree. Can you repeat the EXPLAIN ANALYSE with that too please.

--
  Richard Huxton
  Archonet Ltd


[Postgresql General]     [Postgresql PHP]     [PHP Users]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Yosemite]

  Powered by Linux