Thank you for your considerations Jeff. Actually I'm running an experiment proposed by other researchers to evaluate a recommendation model.
My database is composed only by old tweets. In this experiment the recommendation model is evaluated in a daily basis, and that's the reason the query collect tweets published in a specific date.Could you recommend anything different from this approach?
Casimiro
SELECT tt.tweet_id, tt.topic, tt.topic_value
FROM tweet_topic AS tt JOIN tweet AS t ON (tt.tweet_id = t.id
AND t.creation_time BETWEEN 'D1' AND 'D2' AND t.user_id in
(SELECT followed_id FROM relationship WHERE follower_id = N))
"Nested Loop (cost=1.57..5580452.55 rows=3961 width=20) (actual time=33.737..5106.898 rows=3058 loops=1)"
" Buffers: shared hit=3753 read=1278"
" -> Nested Loop (cost=1.00..1005.35 rows=1930 width=8) (actual time=0.070..77.244 rows=978 loops=1)"
" Buffers: shared hit=484 read=5"
" -> Index Only Scan using relationship_id on relationship (cost=0.42..231.12 rows=154 width=8) (actual time=0.034..0.314 rows=106 loops=1)"
" Index Cond: (follower_id = 335093362)"
" Heap Fetches: 0"
" Buffers: shared hit=5"
" -> Index Only Scan using tweet_ios_index on tweet t (cost=0.57..4.90 rows=13 width=16) (actual time=0.025..0.695 rows=9 loops=106)"
" Index Cond: ((user_id = relationship.followed_id) AND (creation_time >= '2013-06-21 00:00:00-03'::timestamp with time zone) AND (creation_time <= '2013-06-22 00:00:00-03'::timestamp with time zone))"
" Heap Fetches: 0"
" Buffers: shared hit=479 read=5"
" -> Index Scan using tweet_topic_tweet_id_index on tweet_topic tt (cost=0.57..2883.60 rows=731 width=20) (actual time=5.119..5.128 rows=3 loops=978)"
" Index Cond: (tweet_id = t.id)"
" Buffers: shared hit=3269 read=1273"
"Total runtime: 5110.217 ms"
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SELECT t.id FROM tweet AS t WHERE t.creation_time BETWEEN 'D1' AND 'D2' AND t.user_id in
(SELECT followed_id FROM relationship WHERE follower_id = N)
"Nested Loop (cost=1.00..1012.13 rows=2244 width=8) (actual time=0.074..51.855 rows=877 loops=1)"
" Buffers: shared hit=432 read=4"
" -> Index Only Scan using relationship_id on relationship (cost=0.42..227.12 rows=154 width=8) (actual time=0.034..0.218 rows=106 loops=1)"
" Index Cond: (follower_id = 335093362)"
" Heap Fetches: 0"
" Buffers: shared hit=5"
" -> Index Only Scan using tweet_ios_index on tweet t (cost=0.57..4.95 rows=15 width=16) (actual time=0.021..0.468 rows=8 loops=106)"
" Index Cond: ((user_id = relationship.followed_id) AND (creation_time >= '2013-06-22 00:00:00-03'::timestamp with time zone) AND (creation_time <= '2013-06-23 00:00:00-03'::timestamp with time zone))"
" Heap Fetches: 0"
" Buffers: shared hit=427 read=4"
"Total runtime: 52.692 ms"
On Wed, Nov 6, 2013 at 5:00 PM, Jeff Janes <jeff.janes@xxxxxxxxx> wrote:
On Mon, Nov 4, 2013 at 12:44 PM, Caio Casimiro <casimiro.listas@xxxxxxxxx> wrote:
Thank you very much for your answers guys!On Mon, Nov 4, 2013 at 5:15 PM, Jeff Janes <jeff.janes@xxxxxxxxx> wrote:
On Sun, Nov 3, 2013 at 2:05 PM, Caio Casimiro <casimiro.listas@xxxxxxxxx> wrote:SELECT tt.tweet_id, tt.topic, tt.topic_valueFROM tweet_topic AS tt LEFT JOIN tweet AS t ON tt.tweet_id = t.idWHERE creation_time BETWEEN 'D1' AND 'D2' AND user_id in(SELECT followed_id FROM relationship WHERE follower_id = N) ORDER BY tt.tweet_id;I don't know if this affects the plan at all, but it is silly to do a left join to "tweet" when the WHERE clause has conditions that can't be satisfied with a null row. Also, you could try changing the IN-list to an EXISTS subquery.I'm sorry the ignorance, but I don't understand the issue with the left join, could you explain more?A left join means you are telling it to make up an all-NULL tweet row for any tweet_topic that does not have a corresponding tweet. But then once it did so, it would just filter out that row later, because the null creation_time and user_id cannot pass the WHERE criteria--so doing a left join can't change the answer, but it can fool the planner into making a worse choice.Is there some patterns to D1 and D2 that could help the caching? For example, are they both usually in the just-recent past?The only pattern is that it is always a one day interval, e.g. D1 = '2013-05-01' and D2 = '2013-05-02'.If you only compare creation_time to dates, rather than ever using date+time, then it would probably be better to store them in the table as date, not timestamp. This might make the index smaller, and can also lead to better estimates and index usage.But why would you want to offer suggestions to someone based on tweets that were made on exactly one day, over 5 months ago? I can see why would want a brief period in the immediate past, or a long period; but a brief period that is not the recent past just seems like a strange thing to want to do. (And it is going to be hard to get good performance with that requirement.)Cheers,Jeff