Thanks for all the suggestions, Stephen.

> That explain analyze shows a whole ton of heap fetches. When was the
> last time a VACUUM was run on this table, to build the visibility map?
> Without the visibility map being current, an Index-Only Scan, as is
> happening here, can really suck.

I'm using the default 9.6 config; I thought that autovacuum was on by
default?

On 20 April 2017 at 00:48, Stephen Frost <sfrost@xxxxxxxxxxx> wrote:
> Greetings,
>
> * Samuel Williams (space.ship.traveller@xxxxxxxxx) wrote:
>> We want the following kinds of query to be fast:
>
> "kinds of query" isn't helpful, you should be reviewing exactly the
> queries you care about because statistics and your exact data set and
> what the exact query you're running is will all have an impact.
>
>> SELECT DISTINCT "user_event"."user_id" FROM "user_event" WHERE
>> "user_event"."what" IN ('poll', 'location_change',
>> 'midnight_location') AND ("user_event"."created_at" >= '2016-04-19
>> 01:23:55') AND (latitude > -37.03079375089291 AND latitude <
>> -36.67086424910709 AND longitude > 174.6307139779924 AND longitude <
>> 175.0805140220076);
>
> Ugh. You should really look at and consider PostGIS whenever you're
> working with geospatial data.
>
>> We have a btree index and it appears to be working. However, it's
>> still pretty slow.
>
> You're finding some 17M records and then reducing that with DISTINCT to
> only 114k. Hard to say if it's faster to just brute-force your way
> through that with a HashAgg (as your explain analyze shows), or if a
> loose index scan would work better (check out how to do one in PG here:
> https://wiki.postgresql.org/wiki/Loose_indexscan).
>
>> EXPLAIN ANALYZE gives the following:
>> https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121#gistcomment-2065314
>
> That explain analyze shows a whole ton of heap fetches. When was the
> last time a VACUUM was run on this table, to build the visibility map?
> Without the visibility map being current, an Index-Only Scan, as is
> happening here, can really suck.
>
>> CLUSTER user_event ON index_user_event_for_visits_3 followed by
>> analyze... Our data is mostly time series but sometimes we get some
>> dumps with historical records.
>>
>> Perhaps add a BRIN index on created_at
>
> BRIN would be helpful if you wanted a smaller index. That doesn't seem
> to be the case here.
>
>> I'm wondering if... we can use an index to cache, all user_ids seen on
>> a given day. If we change our index to be more discrete, e.g.
>> created_at::date, would this help? The set union of user_ids for 365
>> days should be pretty fast?
>
> Materialized views are often useful, particularly when the results are
> (relatively) slow moving.
>
>> I'm open to any ideas or suggestions, ideally we can keep
>> optimisations within the database, rather than adding a layer of
>> caching on top.
>
> If you need to go the materialized view route, I'd definitely recommend
> doing that in the database rather than outside it.
>
> Thanks!
>
> Stephen
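
[Editor's note] Autovacuum is indeed on by default in 9.6, but it may simply
not have kept the visibility map current for this table. A minimal sketch of
how to check, using only standard catalog views and the table name from the
thread:

    -- Confirm autovacuum and see when the table was last (auto)vacuumed/analyzed.
    SHOW autovacuum;

    SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
    FROM pg_stat_user_tables
    WHERE relname = 'user_event';

    -- How much of the table the visibility map marks all-visible; index-only
    -- scans do a heap fetch for every row on a page that is not all-visible.
    SELECT relname, relpages, relallvisible,
           round(100.0 * relallvisible / greatest(relpages, 1), 1) AS pct_all_visible
    FROM pg_class
    WHERE relname = 'user_event';

    -- A manual VACUUM rebuilds the visibility map without blocking normal reads/writes.
    VACUUM (VERBOSE) user_event;

If relallvisible is near zero, the large "Heap Fetches" count in the plan is
expected, and a VACUUM should make the index-only scan noticeably cheaper.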
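
The loose index scan from the wiki page Stephen linked would look roughly like
the sketch below. It assumes an index whose leading column is user_id (nothing
in the thread says such an index exists), and it only wins when distinct users
are few relative to matching rows, since it probes one user at a time:

    -- Each recursive step jumps to the next user_id that has at least one
    -- matching row, instead of scanning all ~17M matches and de-duplicating.
    WITH RECURSIVE distinct_users AS (
        (SELECT user_id
         FROM user_event
         WHERE what IN ('poll', 'location_change', 'midnight_location')
           AND created_at >= '2016-04-19 01:23:55'
           AND latitude > -37.03079375089291 AND latitude < -36.67086424910709
           AND longitude > 174.6307139779924 AND longitude < 175.0805140220076
         ORDER BY user_id
         LIMIT 1)
        UNION ALL
        SELECT (SELECT user_id
                FROM user_event
                WHERE user_id > d.user_id
                  AND what IN ('poll', 'location_change', 'midnight_location')
                  AND created_at >= '2016-04-19 01:23:55'
                  AND latitude > -37.03079375089291 AND latitude < -36.67086424910709
                  AND longitude > 174.6307139779924 AND longitude < 175.0805140220076
                ORDER BY user_id
                LIMIT 1)
        FROM distinct_users d
        WHERE d.user_id IS NOT NULL
    )
    SELECT user_id FROM distinct_users WHERE user_id IS NOT NULL;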
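
Stephen's PostGIS suggestion, very roughly: store the point once, index it
with GiST, and express the bounding box as a single indexable condition. The
column and index names below are illustrative, and the UPDATE rewrites the
whole table, so treat this as a sketch rather than a migration plan:

    CREATE EXTENSION IF NOT EXISTS postgis;

    ALTER TABLE user_event
        ADD COLUMN geom geometry(Point, 4326);

    UPDATE user_event
    SET geom = ST_SetSRID(ST_MakePoint(longitude, latitude), 4326);

    CREATE INDEX index_user_event_on_geom ON user_event USING gist (geom);

    -- Bounding-box search via the && overlap operator, which the GiST index supports.
    SELECT DISTINCT user_id
    FROM user_event
    WHERE what IN ('poll', 'location_change', 'midnight_location')
      AND created_at >= '2016-04-19 01:23:55'
      AND geom && ST_MakeEnvelope(174.6307139779924, -37.03079375089291,
                                  175.0805140220076, -36.67086424910709, 4326);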
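
Finally, a sketch of the materialized view route, folding in the "user_ids
seen on a given day" idea from the original mail. The view name is
illustrative, and the geographic filter is not captured here, so this only
serves the date-and-event-type part of the query:

    -- Distinct users per day for the event types of interest.
    CREATE MATERIALIZED VIEW daily_active_users AS
    SELECT DISTINCT created_at::date AS day, user_id
    FROM user_event
    WHERE what IN ('poll', 'location_change', 'midnight_location');

    -- A unique index is required for REFRESH ... CONCURRENTLY.
    CREATE UNIQUE INDEX ON daily_active_users (day, user_id);

    -- Re-run on whatever cadence the data can tolerate being stale.
    REFRESH MATERIALIZED VIEW CONCURRENTLY daily_active_users;

    -- Distinct users over the last year, now against a much smaller relation.
    SELECT DISTINCT user_id
    FROM daily_active_users
    WHERE day >= current_date - interval '1 year';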