Re: Slow index scan on B-Tree index over timestamp field

Caio Casimiro <casimiro.listas@xxxxxxxxx> · Mon, 4 Nov 2013 19:10:15 -0200

Hi Neyman, thank you for your answer.

Unfortunately this query runs almost at the same time:

Sort  (cost=4877693.98..4877702.60 rows=3449 width=20) (actual time=25820.291..25821.845 rows=1640 loops=1)
  Sort Key: tt.tweet_id
  Sort Method: quicksort  Memory: 97kB
  Buffers: shared hit=1849 read=32788
  ->  Nested Loop  (cost=247.58..4877491.32 rows=3449 width=20) (actual time=486.839..25814.120 rows=1640 loops=1)
        Buffers: shared hit=1849 read=32788
        ->  Hash Semi Join  (cost=229.62..88553.23 rows=1681 width=8) (actual time=431.654..13209.159 rows=597 loops=1)
              Hash Cond: (t.user_id = relationship.followed_id)
              Buffers: shared hit=3 read=31870
              ->  Index Scan using tweet_creation_time_index on tweet t  (cost=0.57..83308.25 rows=1781234 width=16) (actual time=130.144..10037.764 rows=1759645 loops=1)
                    Index Cond: ((creation_time >= '2013-05-05 00:00:00-03'::timestamp with time zone) AND (creation_time <= '2013-05-06 00:00:00-03'::timestamp with time zone))
                    Buffers: shared hit=1 read=31867
              ->  Hash  (cost=227.12..227.12 rows=154 width=8) (actual time=94.365..94.365 rows=106 loops=1)
                    Buckets: 1024  Batches: 1  Memory Usage: 3kB
                    Buffers: shared hit=2 read=3
                    ->  Index Only Scan using relationship_id on relationship  (cost=0.42..227.12 rows=154 width=8) (actual time=74.540..94.101 rows=106 loops=1)
                          Index Cond: (follower_id = 335093362)
                          Heap Fetches: 0
                          Buffers: shared hit=2 read=3
        ->  Bitmap Heap Scan on tweet_topic tt  (cost=17.96..2841.63 rows=723 width=20) (actual time=21.014..21.085 rows=3 loops=597)
              Recheck Cond: (tweet_id = t.id)
              Buffers: shared hit=1846 read=918
              ->  Bitmap Index Scan on tweet_topic_pk  (cost=0.00..17.78 rows=723 width=0) (actual time=15.012..15.012 rows=3 loops=597)
                    Index Cond: (tweet_id = t.id)
                    Buffers: shared hit=1763 read=632
Total runtime: 25823.386 ms

I have noticed that in both queries the index scan on tweet_creation_time_index is very expensive. Is there anything I can do to make the planner choose a index only scan?

Thank you,
Caio

On Mon, Nov 4, 2013 at 6:52 PM, Igor Neyman <ineyman@xxxxxxxxxxxxxx> wrote:

From: pgsql-performance-owner@xxxxxxxxxxxxxx [mailto:pgsql-performance-owner@xxxxxxxxxxxxxx] On Behalf Of Caio Casimiro

Sent: Monday, November 04, 2013 3:44 PM

To: Jeff Janes

Cc: pgsql-performance@xxxxxxxxxxxxxx

Subject: Re:  Slow index scan on B-Tree index over timestamp field

Thank you very much for your answers guys!

On Mon, Nov 4, 2013 at 5:15 PM, Jeff Janes <jeff.janes@xxxxxxxxx> wrote:

On Sun, Nov 3, 2013 at 2:05 PM, Caio Casimiro <casimiro.listas@xxxxxxxxx> wrote:

Hello all,

I have one query running at ~ 7 seconds and I would like to know if it's possible to make it run faster, once this query runs lots of time in my experiment.

Do you mean you want it to be fast because it runs many times, or that you want it to become fast after it runs many times (i.e. once the data is fully cached)?  The plan you show takes 24 seconds, not 7 seconds.

I want it to be fast because it runs many times. I have an experiment that evaluates recommendation algorithms  for a set of twitter users. This query returns recommendation candidates so it is called a lot of times for different users and time intervals.

Basically the query return the topics of tweets published by users that the user N follows and that are published between D1 and D2.

Query:

SELECT tt.tweet_id, tt.topic, tt.topic_value

            FROM tweet_topic AS tt LEFT JOIN tweet AS t ON tt.tweet_id = t.id

            WHERE creation_time BETWEEN 'D1' AND 'D2' AND user_id in

            (SELECT followed_id FROM relationship WHERE follower_id = N) ORDER BY tt.tweet_id;

I don't know if this affects the plan at all, but it is silly to do a left join to "tweet" when the WHERE clause has conditions that can't be satisfied with a null row.  Also, you could try changing the IN-list to an EXISTS subquery.

I'm sorry the ignorance, but I don't understand the issue with the left join, could you explain more?

...........................................

Thank you very much again!

Caio

Just try the following:

SELECT tt.tweet_id, tt.topic, tt.topic_value

            FROM tweet_topic AS tt  JOIN tweet AS t ON (tt.tweet_id = t.id

                                                  AND t.creation_time BETWEEN 'D1' AND 'D2' AND t.user_id in

                                         (SELECT followed_id FROM relationship WHERE follower_id = N))

 ORDER BY tt.tweet_id;

And see if it helps with performance.

Regards,

Igor Neyman