Feature request for count_estimate(samplesize) aggregate or SAMPLE keyword

Dear Postgres Devs,

I use Postgres everywhere I can, and for me it is by far the best database system out there. Great job, thank you!

Now I would like to humbly propose a feature that provides an easy way to get a quick count estimate for any condition, index-backed or not, based on a random sample of rows, without requiring a custom function or a complex SQL statement.

The final syntax could be for example:

    SELECT count_estimate(2000) FROM my_table WHERE ...

or, since peerce in the chat pointed out that aggregate functions only run after the WHERE condition has already been evaluated:

    SAMPLE count(*) FROM my_table WHERE ... SAMPLESIZE 2000

The idea is to estimate the total number of matching rows by checking only a random sample of rows against the WHERE condition:

1. Get a quick but as accurate as possible estimate of the total number of rows (rowcount).
2. Select a random sample of those rows (samplesize).
3. Check how many of the sampled rows match the WHERE condition (hits).
4. Calculate and return the estimate as rowcount * hits / samplesize (see the short example below).
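
For example, with rowcount = 100,000,000, samplesize = 2000 and hits = 200, the estimate would be:

    estimate = rowcount * hits / samplesize
             = 100,000,000 * 200 / 2000
             = 10,000,000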

The higher the sample size, the more precise the count, so the developer can decide on the trade-off between accuracy and speed. Of course this has its limitations if the sample size is so small, or the actual matches so few, that only a handful of hits (if any at all) end up in the sample. So maybe a min_hits value should be considered as well: the sample size is increased until at least that many hits are found, an exact count has been produced, or an optional max_samplesize is reached.

Typical scenario: highly flexible searching/filtering of data with a paging mechanism and a "total results" preview.

I think this would find a lot of fans. When I hit the famous count(*) performance issue once more, I again read through a lot of comments, Stack Overflow answers and mailing list threads discussing it and its possible workarounds and solutions, each more complex than the last. I ended up writing my own SQL statement that achieves the above, and I think it meets the requirements of many who started those discussions, but I would really love to see this become a supported feature so that it is easier to use and most likely far more performant.


Greetings,
     Torge.




PS:

Here is my current SQL-level statement to achieve this. Due to limitations I couldn't work around, it works slightly differently: instead of the row count, it uses the max ID of the table, which of course has to exist for this to work at all. Using reltuples was suggested, but in my system it is often off by more than 10% already, and I don't know how to select random rows efficiently in that case, while I can easily generate random IDs in the space of possible IDs.

My statement returns an estimate in ~1 second on a 100M-row table with a WHERE condition that is not supported by an index and matches 10M entries; the estimate is roughly 5% off, while an exact count takes over 20 seconds.

WITH vars AS (
    SELECT (SELECT my_id FROM my_table ORDER BY my_id DESC LIMIT 1) AS max_id, --the maximum ID currently used in this table
           2000 AS sample_size --The number of entries to sample. The higher, the more precise the estimate
),
random_rows AS ( --Create a temporary table with sample_size number of random rows
    SELECT *
    FROM my_table
    WHERE my_id = ANY (ARRAY(
        SELECT (random() * (SELECT max_id FROM vars))::int --Select a random ID from 0 to max_id
        FROM GENERATE_SERIES(1, (SELECT sample_size FROM vars)) --generate sample_size number of rows
    ))
)
SELECT (SELECT max_id FROM vars) * count(*) / (SELECT sample_size FROM vars)::int AS estimated_count
FROM random_rows
WHERE ... --Any where condition possible
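
If this were wrapped into a reusable function, the min_hits idea from above could be added too. Below is a minimal PL/pgSQL sketch, not the proposed built-in: the function name, the text-based condition parameter and the default values are just assumptions for illustration, and it simply re-samples from scratch whenever it doubles the sample size.

CREATE OR REPLACE FUNCTION count_estimate_sketch(
    p_table          regclass,
    p_id_column      text,
    p_condition      text,              -- the WHERE condition, passed as SQL text
    p_samplesize     int DEFAULT 2000,
    p_min_hits       int DEFAULT 50,
    p_max_samplesize int DEFAULT 100000
) RETURNS bigint
LANGUAGE plpgsql AS $$
DECLARE
    v_max_id     bigint;
    v_samplesize int := p_samplesize;
    v_hits       bigint;
BEGIN
    -- Step 1: quick upper bound on the ID space (the max ID, as in the statement above)
    EXECUTE format('SELECT max(%I) FROM %s', p_id_column, p_table) INTO v_max_id;

    LOOP
        -- Steps 2 and 3: probe v_samplesize random IDs and count how many
        -- existing rows also match the condition
        EXECUTE format(
            'SELECT count(*) FROM %s
              WHERE %I = ANY (ARRAY(
                        SELECT (random() * $1)::bigint
                          FROM generate_series(1, $2)))
                AND (%s)',
            p_table, p_id_column, p_condition)
          INTO v_hits
          USING v_max_id, v_samplesize;

        -- Step 4: stop when enough hits were found (or the sample cap is reached)
        EXIT WHEN v_hits >= p_min_hits OR v_samplesize >= p_max_samplesize;
        v_samplesize := v_samplesize * 2;   -- otherwise double the sample and retry
    END LOOP;

    RETURN v_max_id * v_hits / v_samplesize;
END;
$$;

It could then be called with the condition as text, e.g. SELECT count_estimate_sketch('my_table', 'my_id', 'some_column = 42'); (the column and value here are made up).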

There was a concern mentioned in chat that the ID space might have holes, especially as the sequence is not reset on failed transactions or when rows are deleted. But this should not matter much: it reduces the accuracy, which can be countered by a bigger sample size. Also, a min_id could easily be added for cases where old rows get deleted.
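
For instance (just a sketch, with min_id added to the vars CTE the same way as max_id), the random ID generation above could become:

    SELECT (SELECT min_id FROM vars) + (random() * (SELECT max_id - min_id FROM vars))::int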

Example:

The table has 100M entries.
The condition would match 10M entries.
The highest ID is 300,000,000, so only one third of the possible IDs actually exist in the table.
Sample size: 3000

We try to fetch 3000 rows by probing random IDs, resulting in roughly 1000 actual rows, let's say 1012.
We compare those rows against our WHERE condition and 96 of them match.

The resulting estimate is maxId * hits / samplesize = 300,000,000 * 96 / 3000 = 9,600,000.

The holes, no matter where they are, reduce the precision but are not a general problem. This becomes clearer if we first calculate the estimated number of existing rows from the sample and then continue from there, which in the end returns the same result:

actualRowEstimate = maxId * fetchedRows / samplesize = 300,000,000 * 1012 / 3000 = 101,200,000
estimate = actualRowEstimate * hits / fetchedRows = 101,200,000 * 96 / 1012 = 9,600,000  => same number

Because:
estimate = actualRowEstimate * hits / fetchedRows
= (maxId * fetchedRows / samplesize) * hits / fetchedRows
= maxId * hits / samplesize





