On Thu, Mar 1, 2012 at 10:27 AM, Alessandro Gagliardi <alessandro@xxxxxxxx> wrote:
> Hi folks,
>
> I have a system that racks up about 40M log lines per day. I'm able to
> COPY the log files into a PostgreSQL table that looks like this:

Since you're using a COPY command and the table has a simple column with
exactly the value you want, why not filter it using grep(1) or something
similar and load the filtered result directly into the hourly table?

Craig

> CREATE TABLE activity_unlogged
> (
>   user_id character(24) NOT NULL,
>   client_ip inet,
>   hr_timestamp timestamp without time zone,
>   locale character varying,
>   log_id character(36),
>   method character varying(6),
>   server_ip inet,
>   uri character varying,
>   user_agent character varying
> )
>
> Now I want to reduce that data to the last activity performed by each
> user in any given hour. It should fit into a table like this:
>
> CREATE TABLE hourly_activity
> (
>   activity_hour timestamp without time zone NOT NULL,
>   user_id character(24) NOT NULL,
>   client_ip inet,
>   hr_timestamp timestamp without time zone,
>   locale character varying,
>   log_id character(36),
>   method character varying(6),
>   server_ip inet,
>   uri character varying,
>   user_agent character varying,
>   CONSTRAINT hourly_activity_pkey PRIMARY KEY (activity_hour, user_id)
> )
>
> where activity_hour is date_trunc('hour', hr_timestamp). (N.B. the
> primary key constraint.)
>
> I am attempting to do that with the following:
>
> INSERT INTO hourly_activity
> SELECT DISTINCT date_trunc('hour', hr_timestamp) AS activity_hour,
>        activity_unlogged.user_id,
>        client_ip, hr_timestamp, locale, log_id, method,
>        server_ip, uri, user_agent
> FROM activity_unlogged,
>      (SELECT user_id, MAX(hr_timestamp) AS last_timestamp
>       FROM activity_unlogged
>       GROUP BY user_id, date_trunc('hour', hr_timestamp)) AS last_activity
> WHERE activity_unlogged.user_id = last_activity.user_id
>   AND activity_unlogged.hr_timestamp = last_activity.last_timestamp;
>
> I have two problems:
>
> 1. It's incredibly slow (like: hours). I assume this is because I am
>    scanning through a huge unindexed table twice. I imagine there is a
>    more efficient way to do this, but I can't think of what it is. If I
>    were doing this in a procedural programming language, it might look
>    something like:
>
>        for row in activity_unlogged:
>            key = (date_trunc('hour', row.hr_timestamp), row.user_id)
>            if key in hourly_activity:
>                if row.hr_timestamp > hourly_activity[key].hr_timestamp:
>                    hourly_activity[key] = row  # UPDATE
>            else:
>                hourly_activity[key] = row      # INSERT
>
>    I suspect some implementation of this (hopefully my pseudocode is at
>    least somewhat comprehensible) would be very slow as well, but at
>    least it would only go through activity_unlogged once. (Then again,
>    it would have to rescan hourly_activity each time, so it really
>    wouldn't be any faster at all, would it?) I feel like there must be
>    a more efficient way to do this in SQL, though I can't put my finger
>    on it.
>
> 2. It turns out (hr_timestamp, user_id) is not unique. So selecting
>    WHERE activity_unlogged.user_id = last_activity.user_id AND
>    activity_unlogged.hr_timestamp = last_activity.last_timestamp can
>    match multiple records, which leads to a primary key collision. In
>    such cases I don't really care which of the rows is picked; I just
>    want to make sure that no more than one row is inserted per user per
>    hour.
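A single-pass formulation using PostgreSQL's DISTINCT ON may address
both problems at once: it reads activity_unlogged only once, and it
keeps exactly one row per (hour, user) group, so the duplicate
timestamps can no longer cause a primary key collision. A minimal
sketch, using only the table and column names quoted above (untested
against this data):

    INSERT INTO hourly_activity
    SELECT DISTINCT ON (date_trunc('hour', hr_timestamp), user_id)
           date_trunc('hour', hr_timestamp) AS activity_hour,
           user_id, client_ip, hr_timestamp, locale, log_id,
           method, server_ip, uri, user_agent
    FROM activity_unlogged
    -- DISTINCT ON keeps the first row of each group in this sort
    -- order, so ordering each group by hr_timestamp DESC retains
    -- the most recent activity per user per hour.
    ORDER BY date_trunc('hour', hr_timestamp), user_id,
             hr_timestamp DESC;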
> In fact, though I would prefer to get the last row for each hour, I
> could probably get much the same effect if I just limited it to one
> row per user per hour, whichever row that happens to be. Though I
> don't know if that really helps at all.
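If the self-join version is kept instead, the "huge unindexed table"
half of the problem can likely be softened with an expression index.
A sketch, with an assumed index name (whether the planner actually
uses it here is worth verifying with EXPLAIN on real data):

    -- Matches GROUP BY user_id, date_trunc('hour', hr_timestamp)
    -- and the lookup of each group's latest hr_timestamp.
    CREATE INDEX activity_unlogged_user_hour_ts_idx
        ON activity_unlogged
           (user_id, date_trunc('hour', hr_timestamp), hr_timestamp DESC);

Since activity_unlogged appears to be reloaded by COPY every day, it is
usually cheaper to build the index once after each bulk load rather
than maintain it during the load.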