Re: Slow index scan on B-Tree index over timestamp field

Caio Casimiro <casimiro.listas@xxxxxxxxx> · Wed, 6 Nov 2013 17:24:24 -0200

Thank you for your considerations Jeff. Actually I'm running an experiment proposed by other researchers to evaluate a recommendation model.
My database is composed only by old tweets. In this experiment the recommendation model is evaluated in a daily basis, and that's the reason the query collect tweets published in a specific date.

In fact, this is not the only experiment I will run. As you pointed it is rather strange to recommend only tweets published in the present day. I have a different experiment that will collect tweets published in a wider time range.

With respect to the query's performance issue I have made some progress following what you and Mat D said about creating an index with creation_time,user_id and id.

I created an index over tweet(user_id, creation_time, id) and it made the planner to IOS this index.

Now the most expensive part of the query is the outer nested loop (as you can see at the end of the email). 
I'm considering to embed the topic information of tweets in a text field in the table tweet, this way I would not need to joint the tweet_topic table.

And this query without the join with the table tweet_topic is running at ~ 50ms!
Could you recommend anything different from this approach?

Thank you very much,
Casimiro

SELECT tt.tweet_id, tt.topic, tt.topic_value
            FROM tweet_topic AS tt  JOIN tweet AS t ON (tt.tweet_id = t.id
                                                  AND t.creation_time BETWEEN 'D1' AND 'D2' AND t.user_id in

                                         (SELECT followed_id FROM relationship WHERE follower_id = N))

"Nested Loop  (cost=1.57..5580452.55 rows=3961 width=20) (actual time=33.737..5106.898 rows=3058 loops=1)"

"  Buffers: shared hit=3753 read=1278"
"  ->  Nested Loop  (cost=1.00..1005.35 rows=1930 width=8) (actual time=0.070..77.244 rows=978 loops=1)"
"        Buffers: shared hit=484 read=5"

"        ->  Index Only Scan using relationship_id on relationship  (cost=0.42..231.12 rows=154 width=8) (actual time=0.034..0.314 rows=106 loops=1)"
"              Index Cond: (follower_id = 335093362)"

"              Heap Fetches: 0"
"              Buffers: shared hit=5"
"        ->  Index Only Scan using tweet_ios_index on tweet t  (cost=0.57..4.90 rows=13 width=16) (actual time=0.025..0.695 rows=9 loops=106)"

"              Index Cond: ((user_id = relationship.followed_id) AND (creation_time >= '2013-06-21 00:00:00-03'::timestamp with time zone) AND (creation_time <= '2013-06-22 00:00:00-03'::timestamp with time zone))"

"              Heap Fetches: 0"
"              Buffers: shared hit=479 read=5"
"  ->  Index Scan using tweet_topic_tweet_id_index on tweet_topic tt  (cost=0.57..2883.60 rows=731 width=20) (actual time=5.119..5.128 rows=3 loops=978)"

"        Index Cond: (tweet_id = t.id)"
"        Buffers: shared hit=3269 read=1273"
"Total runtime: 5110.217 ms"

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

SELECT t.id FROM tweet AS t WHERE t.creation_time BETWEEN 'D1' AND 'D2' AND t.user_id in
                                         (SELECT followed_id FROM relationship WHERE follower_id = N)

"Nested Loop  (cost=1.00..1012.13 rows=2244 width=8) (actual time=0.074..51.855 rows=877 loops=1)"
"  Buffers: shared hit=432 read=4"
"  ->  Index Only Scan using relationship_id on relationship  (cost=0.42..227.12 rows=154 width=8) (actual time=0.034..0.218 rows=106 loops=1)"

"        Index Cond: (follower_id = 335093362)"
"        Heap Fetches: 0"
"        Buffers: shared hit=5"
"  ->  Index Only Scan using tweet_ios_index on tweet t  (cost=0.57..4.95 rows=15 width=16) (actual time=0.021..0.468 rows=8 loops=106)"

"        Index Cond: ((user_id = relationship.followed_id) AND (creation_time >= '2013-06-22 00:00:00-03'::timestamp with time zone) AND (creation_time <= '2013-06-23 00:00:00-03'::timestamp with time zone))"

"        Heap Fetches: 0"
"        Buffers: shared hit=427 read=4"
"Total runtime: 52.692 ms"

On Wed, Nov 6, 2013 at 5:00 PM, Jeff Janes <jeff.janes@xxxxxxxxx> wrote:

On Mon, Nov 4, 2013 at 12:44 PM, Caio Casimiro <casimiro.listas@xxxxxxxxx> wrote:

Thank you very much for your answers guys!

On Mon, Nov 4, 2013 at 5:15 PM, Jeff Janes <jeff.janes@xxxxxxxxx> wrote:

On Sun, Nov 3, 2013 at 2:05 PM, Caio Casimiro <casimiro.listas@xxxxxxxxx> wrote:

SELECT tt.tweet_id, tt.topic, tt.topic_value
            FROM tweet_topic AS tt LEFT JOIN tweet AS t ON tt.tweet_id = t.id
            WHERE creation_time BETWEEN 'D1' AND 'D2' AND user_id in

            (SELECT followed_id FROM relationship WHERE follower_id = N) ORDER BY tt.tweet_id;

I don't know if this affects the plan at all, but it is silly to do a left join to "tweet" when the WHERE clause has conditions that can't be satisfied with a null row.  Also, you could try changing the IN-list to an EXISTS subquery.

I'm sorry the ignorance, but I don't understand the issue with the left join, could you explain more?

A left join means you are telling it to make up an all-NULL tweet row for any tweet_topic that does not have a corresponding tweet.  But then once it did so, it would just filter out that row later, because the null creation_time and user_id cannot pass the WHERE criteria--so doing a left join can't change the answer, but it can fool the planner into making a worse choice.

Is there some patterns to D1 and D2 that could help the caching?  For example, are they both usually in the just-recent past?

The only pattern is that it is always a one day interval, e.g. D1 = '2013-05-01' and  D2 = '2013-05-02'.

If you only compare creation_time to dates, rather than ever using date+time, then it would probably be better to store them in the table as date, not timestamp.  This might make the index smaller, and can also lead to better estimates and index usage.

But why would you want to offer suggestions to someone based on tweets that were made on exactly one day, over 5 months ago?  I can see why would want a brief period in the immediate past, or a long period; but a brief period that is not the recent past just seems like a strange thing to want to do.  (And it is going to be hard to get good performance with that requirement.)

Cheers,

Jeff