advice sought - general approaches to optimizing queries around "event streams"

Jonathan Vanasco <postgres@xxxxxxxx> · Fri, 26 Sep 2014 17:02:25 -0400

I have a growing database with millions of rows that track resources against an event stream.  

i have a few handfuls of queries that interact with this stream in a variety of ways, and I have managed to drop things down from 70s to 3.5s on full scans and offer .05s partial scans.  

no matter how i restructure queries, I can't seem to get around a few bottlenecks and I wanted to know if there were any tips/tricks from the community on how to approach them.  

a simple form of my database would be:

	--  1k of
	create table stream (
		id int not null primary key,
	)

	-- 1MM of
	create table resource (
		id int not null primary key,
		col_a bool,
		col_b bool,
		col_c text,
	);

	-- 10MM of
	create table streamevent (
		id int not null,
		event_timestamp timestamp not null,
		stream_id int not null references stream(id)
	);

	-- 10MM of
	create table resource_2_stream_event(
		resource_id int not null references resource(id),
		streamevent_id int not null references streamevent(id)
	)

Everything is running off of indexes; there are no seq scans.

I've managed to optimize my queries by avoiding joins against tables, and turning the stream interaction into a subquery or CTE.  
better performance has come from limiting the number of "stream events"  ( which are only the timestamp and resource_id off a joined table ) 

The bottlenecks I've encountered have primarily been:

1.	When interacting with a stream, the ordering of event_timestamp and deduplicating of resources becomes an issue.
	I've figured out a novel way to work with the most recent events, but distant events are troublesome

	using no limit, the query takes 3500 ms
	using a limit of 10000, the query takes 320ms
	using a limit of 1000, the query takes 20ms

	there is a dedicated index of on event_timestamp (desc) , and it is being used
	according to the planner... finding all the records is fine; merging-into and sorting the aggregate to handle the deduplication of records in a stream seems to be the issue (either with DISTINCT or max+group_by)

2. 	I can't figure out an effective way to search for a term against an entire stream (using a tsquery/gin based search)

	I thought about limiting the query by finding matching resources first, then locking it to an event stream, but:
		- scanning the entire table for a term takes about 10 seconds on an initial hit.  subsequent queries for the same terms end up using the cache, and complete within 20ms.

	I get better search performance by calculating the event stream, then searching it for matching documents, but I still have the performance issues related to limiting the window of events

i didn't include example queries, because I'm more concerned with the general approaches and ideas behind dealing with large data sets than i am with raw SQL right now.  

i'm hoping someone can enlighten me into looking at new ways to solve these problems.   i think i've learned more about postgres/sql in the past 48hour than I have in the past 15 years, and I'm pretty sure that the improvements I need will come from new ways of querying data , rather than optimizing the current queries.

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general