Re: Including unique users in huge data warehouse in Postgresql...

Mark Jensen <musicnyman1974@xxxxxxxxx> · Wed, 29 Nov 2006 11:40:58 -0800 (PST)

thanks Richard.  I've talking to Ron Mayer about this as well offline.  I think the main problem is dedupping users, and not being able to aggregate visits in the fact table.  that's where most of the query time takes place.  but the business guys just won't accept using visits, but not actual uniques dedupped.  if visits was in the fact table as an integer i could sum up, i'd be fine.  Ron also said he's put the unique user ids into arrays so it's faster to count them, but placing them into aggregate tables.  only problem is i'd still have to know what things to aggregate by to create these, which is impossible since we have so many dimensions and facts that are going to be ad-hoc.  i have about 20 summary tables i create per day, but most of the time, they have something new they want to query that's not in summary.  and will never come up again.

I tried installing Bizgres using their Bizgres loader and custom postgresql package with bitmap indexes, but doesn't seem to increase performance "that" much.  or as much as i would like compared to the normal postgresql install.  loads are pretty slow when using their bitmap indexes compared to just using btree indexes in the standard postgresql install.  Query time is pretty good, but i also have to make sure load times are acceptable as well.  and had some problems with the bizgres loader losing connection to the database for no reason at all, but when using the normal copy command in 8.2RC1, works fine.  love the new query inclusion in the copy command by the way, makes it so easy to aggregrate hourly fact tables into daily/weekly/monthly in one shot :)

and yes, work_mem is optimized as much as possible.  postgresql is using about 1.5 gigs of working memory when it runs these queries.  looking into getting 64 bit hardware with 16-32 gigs of RAM so i can throw most of this into memory to speed it up.  we're also using 3par storage which is pretty fast.  we're going to try and put postgresql on a local disk array using RAID 5 as well to see if it makes a difference.

and yes, right now, these are daily aggregate tables summed up from the hourly.  so about 17 million rows per day.  hourly fact tables are impossible to query right now, so i have to at least put these into daily fact tables.  so when you have 30 days in this database, then yes, table scans are going to be huge, thus why it's taking so long, plus dedupping on unique user id :)

and you're right, i should put this on the performance mailing list... see you there :)

thanks guys.

____________________________________
Mark Jensen

----- Original Message ----
From: Richard Huxton <dev@xxxxxxxxxxxx>
To: Mark Jensen <musicnyman1974@xxxxxxxxx>
Cc: pgsql-general@xxxxxxxxxxxxxx
Sent: Wednesday, November 29, 2006 2:29:35 PM
Subject: Re: [GENERAL] Including unique users in huge data warehouse in Postgresql...

Mark Jensen wrote:
> So i've been given the task of designing a data warehouse in
> either Postgresql or Mysql for our clickstream data for our sites.  I
> started with Mysql but the joins in Mysql are just way too slow
> compared to Postgresql when playing with star schemas.

Mark - it's not my usual area, but no-one else has picked up your 
posting, so I'll poke my nose in. The other thing you might want to do 
is post this on the performance list - that's probably the best place. 
Might be worth talking to those at www.bizgres.org too (although I think 
they all hang out on the performance list).

 > I can't say
> which sites i'm working on, but we get close to 3-5 million uniques
> users per day, so over time, that's a lot of unique users to keep
> around and de-dup your fact tables by.  Need to be able to query normal
> analytics like:
<snip>

> i've
> made a lot of optimizations in postgresql.conf by playing with work_mem
> and shared_buffers and such and i think the database is using as much
> as it can disk/memory/cpu wise. 

Big work_mem, I'm guessing. Limiting factor is presumably disk I/O.

<snip>
> here's a sample query that takes a while to run... just a simple report that shows gender by area of the site.
> 
> select A.gender as gender, B.area as area, sum(C.imps) as imps, sum(C.clicks) as clicks, count(distinct(C.uu_id)) as users
> from uus as A, areas as B, daily_area_fact as C
> where A.uu_id = C.uu_id
> and B.area_id = C.area_id
> group by gender,area;
> 
> so
> by just having one day of data, with 3,168,049 rows in the user
> dimension table (uus), 17,213,420 in the daily_area_fact table that
> joins all the dimension tables, takes about 15 minutes.  if i had 30-90
> days in this fact table, who knows how long this would take... i know
> doing a distinct on uu_id is very expensive, so that's the main problem
> here i guess and would want to know if anyone else is doing it this way
> or better.

In the end, I'd suspect the seq-scan over the fact table will be your 
biggest problem. Can you pre-aggregate your fact-table into daily summaries?

See you over on the performance list, where there are more experienced 
people than myself to help you.
-- 
   Richard Huxton
   Archonet Ltd

____________________________________________________________________________________
Do you Yahoo!?
Everyone is raving about the all-new Yahoo! Mail beta.
http://new.mail.yahoo.com