Re: efficient data reduction (and deduping)

Alessandro Gagliardi <alessandro@xxxxxxxx> · Thu, 1 Mar 2012 11:30:06 -0800

Hah! Yeah, that might would work. Except that I suck at grep. :(Perhaps that's a weakness I should remedy.

On Thu, Mar 1, 2012 at 10:35 AM, Craig James <cjames@xxxxxxxxxxxxxx> wrote:

On Thu, Mar 1, 2012 at 10:27 AM, Alessandro Gagliardi

<alessandro@xxxxxxxx> wrote:

> Hi folks,

>

> I have a system that racks up about 40M log lines per day. I'm able to COPY

> the log files into a PostgreSQL table that looks like this:

Since you're using a COPY command and the table has a simple column

with exactly the value you want, why not filter it using grep(1) or

something similar and load the filtered result directly into the

hourly table?

Craig

>

> CREATE TABLE activity_unlogged

> (

>   user_id character(24) NOT NULL,

>   client_ip inet,

>   hr_timestamp timestamp without time zone,

>   locale character varying,

>   log_id character(36),

>   method character varying(6),

>   server_ip inet,

>   uri character varying,

>   user_agent character varying

> )

>

> Now, I want to reduce that data to get the last activity that was performed

> by each user in any given hour. It should fit into a table like this:

>

> CREATE TABLE hourly_activity

> (

>   activity_hour timestamp without time zone NOT NULL,

>   user_id character(24) NOT NULL,

>   client_ip inet,

>   hr_timestamp timestamp without time zone,

>   locale character varying,

>   log_id character(36),

>   method character varying(6),

>   server_ip inet,

>   uri character varying,

>   user_agent character varying,

>   CONSTRAINT hourly_activity_pkey PRIMARY KEY (activity_hour , user_id )

> )

>

> where activity_hour is date_trunc('hour', hr_timestamp); (N.B. the primary

> key constraint)

>

> I am attempting to do that with the following:

>

> INSERT INTO hourly_activity

>     SELECT DISTINCT date_trunc('hour', hr_timestamp) AS activity_hour,

> activity_unlogged.user_id,

>                     client_ip, hr_timestamp, locale, log_id, method,

> server_ip, uri, user_agent

>         FROM activity_unlogged,

>             (SELECT user_id, MAX(hr_timestamp) AS last_timestamp

>                 FROM activity_unlogged GROUP BY user_id, date_trunc('hour',

> hr_timestamp)) AS last_activity

>     WHERE activity_unlogged.user_id = last_activity.user_id AND

> activity_unlogged.hr_timestamp = last_activity.last_timestamp;

>

> I have two problems:

>

> It's incredibly slow (like: hours). I assume this is because I am scanning

> through a huge unindexed table twice. I imagine there is a more efficient

> way to do this, but I can't think of what it is. If I were doing this in a

> procedural programming language, it might look something like:

> for row in activity_unlogged:

>     if (date_trunc('hour', hr_timestamp), user_id) in

> hourly_activity[(activity_hour, user_id)]:

>         if hr_timestamp > hourly_activity[(date_trunc('hour',

> hr_timestamp), user_id)][hr_timestamp]:

>             hourly_activity <- row # UPDATE

>     else:

>         hourly_activity <- row # INSERT

> I suspect some implementation of this (hopefully my pseudocode is at least

> somewhat comprehensible) would be very slow as well, but at least it would

> only go through activity_unlogged once. (Then again, it would have to

> rescan hourly_activity each time, so it really wouldn't be any faster at

> all, would it?) I feel like there must be a more efficient way to do this in

> SQL though I can't put my finger on it.

> Turns out (hr_timestamp, user_id) is not unique. So selecting WHERE

> activity_unlogged.user_id = last_activity.user_id AND

> activity_unlogged.hr_timestamp = last_activity.last_timestamp leads to

> multiple records leading to a primary key collision. In such cases, I don't

> really care which of the two rows are picked, I just want to make sure that

> no more than one row is inserted per user per hour. In fact, though I would

> prefer to get the last row for each hour, I could probably get much the same

> effect if I just limited it to one per hour. Though I don't know if that

> really helps at all.