I'm wanting to optimize and improve a query to get the maximum number of users over a period of time. What I'm trying to accomplish is to get graphable data points of the maximum number of simultaneous users at a specified interval over a period of time, preferably with only a single pass through the data.
----------
streamlog table (I've only included relevant fields and indexes):
id bigint
ts timestamp
viewtime integer
client integer
-- primary key on id field
-- streamlog_tsrange_index btree (client, ts, startts(ts, viewtime))
\df+ startts
List of functions
Schema | Name | Result data type | Argument data types | Volatility | Owner | Language | Sou
rce code | Description
--------+---------+-----------------------------+--------------------------------------+------------+-------+----------+-----------------------
-----------------------------+-------------
public | startts | timestamp without time zone | timestamp without time zone, integer | immutable | root | sql | select $1 - ($2::varch
ar || ' seconds')::interval; |
The ts timestamp is the time which the data was logged.
The viewtime is the amount of time the user was online in seconds
the startts function determines when the session started by subtracting viewtime from ts and returning a timestamp
-----------------
My current procedure...
1) Create temporary table with each possible data point. This example uses recursive functions from pgsql 8.4 but was originally implemented by using large numbers of queries from php. My knowledge of the recursive functions is pretty week, but I was able to put this together without too much trouble.
create temp table timerange as with recursive f as (
select '2009-03-21 18:20:00'::timestamp as a
union all
select a+'30 seconds'::interval as a from f where a < '2009-03-21 20:20:00'::timestamp
) select a from f;
2) Update table with record counts
alter table timerange add column mycount integer;
explain analyze update timerange set mycount = (select count(*) from streamlogfoo where client = 3041 and a between startts(ts,viewtime) and ts);
-----------------
This seems to work reasonably well, with the following exceptions...
1) The number reported is the number at the set time period, not the highest value between each data point. With a 30 second interval, this isn't a big problem, but with larger intervals gives results that do not match what I'm looking for (maximum users).
2) This does not scale well for large numbers of points, as internally each data point is a complete scan through the data, even though most of the data points will be common for the entire range.
I'm thinking this would be a good use for the new window functions, but I'm not sure where to begin. Any ideas?
-----------------