Thank you all for your responses!

On Wed, Nov 29, 2017 at 7:31 AM, legrand legrand <legrand_legrand@xxxxxxxxxxx> wrote:
> Hi,
>
> Could you give us the partitions (ranges values) and indexes definition for
> result table ?

We partition by month; the partitions usually start on the 20th of each month (this was the date we first partitioned the table). For the tables in question the constraints look like this:

    "constraint_37" CHECK (created_at >= '2017-10-21 00:00:00'::timestamp without time zone AND created_at < '2017-11-20 00:00:00'::timestamp without time zone)
    "constraint_110" CHECK (created_at >= '2017-11-20 00:00:00'::timestamp without time zone AND created_at < '2017-12-20 00:00:00'::timestamp without time zone)

Indexes are:

    "results_account_id_created_at_idx" btree (account_id, created_at DESC)
    "results_id_idx" btree (id)
    "results_keyword_id_created_at_idx" btree (keyword_id, created_at DESC)

On Wed, Nov 29, 2017 at 11:57 AM, Marti Raudsepp <marti@xxxxxxxxx> wrote:
> -> Index Scan using keywords_pkey on keywords
> Buffers: shared hit=284808 read=4093
> vs
> -> Seq Scan on keywords
> Buffers: shared read=36075
>
> Looks like the index scan's advantage in this example is a much higher
> cache hit ratio (despite touching so many more pages) and PostgreSQL
> is underestimating it.

Interesting, we had completely missed this.

> Have you tuned the effective_cache_size setting? A good starting point
> is half the total RAM in your machine. It would be interesting to see
> how high you need to set it for the planner to switch to the index
> scan plan.

We did this, and although it has an effect, it is quickly overshadowed by the size of the result set. For example, setting it to 10GB is enough for one month's worth of data, but the planner falls back to a seq scan for two months' worth. We might be able to live with that for now, though.

On Wed, Nov 29, 2017 at 2:17 PM, Laurenz Albe <laurenz.albe@xxxxxxxxxxx> wrote:
> The optimizer is right here.

Never doubted it :)

> Even though your second execution without sequential scans ran faster,
> it is worse.
>
> That is because the execution with the sequential scan touched
> 26492 + 80492 = 106984 blocks, while the second execution touched
> 311301 + 48510 = 359811 blocks, more than three times as many.
>
> The second execution was just lucky because most of these blocks were
> already cached, and it had to read only half as many blocks from disk.
>
> If you repeat the execution a couple of times, you should see that
> the execution using the sequential scans becomes faster.

In our live environment, execution times for the sequential scan plan are consistently slower, although they do vary. The reason is that the partition holding last month's results is accessed frequently and is therefore kept in the cache, while the other two tables, keywords and keyword_data, are accessed sparsely, and mostly through their indexes. As a result, the indexes have a much better chance of staying cached in memory than the rest of those tables. In other words: the second execution is always lucky.

Is this interpretation correct? Is there any option to deal with this issue, besides adding more RAM? We could, theoretically, partition keyword_data, since it is also time-based, but it is not big enough to justify partitioning, and not small enough to be doing a sequential scan over all the time.

> You can boost performance even more by increasing work_mem
> so that the hash can be created in memory.

This is interesting, and has a positive effect on our queries.
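To give an idea of what we are trying, we set both parameters per session and re-run the plan, roughly like this (the work_mem value below is only an example, we have not settled on final numbers, and the query itself is the one from my original post):

    SET effective_cache_size = '10GB';   -- enough for ~one month of results in our tests
    SET work_mem = '256MB';              -- placeholder value, still being tuned
    EXPLAIN (ANALYZE, BUFFERS) SELECT ...;  -- our query, elided here

Using SET LOCAL inside a transaction would also work if we decide we only want the higher work_mem for this one query.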
We are currently testing a combination of work_mem and effective_cache_size settings, though I'm afraid refactoring the query will be inevitable. Thank you!