Re: Performance question 83 GB Table 150 million rows, distinct select

"Tomas Vondra" <tv@xxxxxxxx> · Thu, 17 Nov 2011 00:59:38 +0100

Hi.

On 16 Listopad 2011, 23:53, Tory M Blue wrote:
>
> We now have about 180mill records in that table. The database size is
> about 580GB and the userstats table which is the biggest one and the
> one we query the most is 83GB.
>
> Just a basic query takes 4 minutes:
>
> For e.g. select count(distinct uid) from userstats where log_date
> >'11/7/2011'
>
> Since we are looking for distinct we can't obviously use an index. But
> I'm wondering what should be expected and what is caused be tuning or
> lack there of? Doing an iostat  I see maybe 10-15%, however the cpu
> that this query is attached to is obviously in the 99-100% busy arena.
> Or am I really IOBound for this single query (sure lots of data
> but?!).

What do you mean by "can't use an index"? The query may use an index to
evaluate the WHERE condition, no matter if there's a distinct or not.

The index-only scans that might be used to speed up this query are
committed in 9.2 - but even that might use index both for plain count and
count distinct.

But you're right - you're not bound by I/O (although I don't know what are
those 15% - iowait, util or what?). The COUNT(DISTINCT) has to actually
keep all the distinct values to determine which are actually distinct.

If there's enough memory (work_mem) to keep all the values, this may be
done using a hash table (hash aggregate). Otherwise it has to sort them.
You can see this in explain plan (which you haven't posted).

Anyway this is actually a rather CPU intensive - how exactly depends on
the data type. Comparing integers is much easier / cheaper than comparing
text values. What data type is the 'uid' column?

> It takes roughly 5.5 hours to do a concurrent re-index and this DB is
> vac'd nightly.
>
> Just not sure if this is what to expect, however there are many other
> DB's out there bigger than ours, so I'm curious what can I do?

Well, not much. Use an integer data type for the 'uid' column (unless
you're already using it). Then you can use more work_mem so that a hash
aggregate is used (maybe it's already used, we need to see the explain
plan to check).

Then you could precompute the distinct values somehow - for example if
there are only a few distinct values for each day, you could do something
like this every day

INSERT INTO userstats_distinct
SELECT DISTINCT date_trunc('day', log_date), uid FROM userstats
 WHERE log_date BETWEEN date_trunc('day', log_date) - interval '1 day'
                    AND date_trunc('day', log_date);

and then just

SELECT COUNT(DISTINCT uid) FROM userstats_distinct
 WHERE log_date > '11/7/2011';

The point is to preaggregate the data to the desired granularity (e.g.
day), and how it improves the performance depends on how much the amount
of data decreases.

Another option is to use estimates instead of exact results - I've
actually written an extension for that, maybe you'll find that useful.
It's available on github (https://github.com/tvondra/distinct_estimators)
and pgxn (http://pgxn.org/tag/estimate/). I've posted a brief description
here:

http://www.fuzzy.cz/en/articles/aggregate-functions-for-distinct-estimation/

and the current extensions actually performs much better. It's not that
difficult to reach 1% precision. Let me know if this is interesting for
you and if you need a help with the extensions.

Tomas

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance