Mark, This fits the typical pattern of the "Big Honking Datamart" for clickstream analysis, a usage pattern that stresses the capability of all DBMS. Large companies spend $1M + on combinations of SW and HW to solve this problem, and only the large scale parallel DBMS can handle the load. Players in the market include Oracle, IBM, Teradata, Netezza and of course Greenplum. Unfortunately, techniques like bitmap indexes only improve things by factors O(10). Parallelism is the only proven answer to get O(10,000) improvements in response time. Furthermore, simply speeding the I/O underneath one CPU per query is insufficient, the query and loading engine need to scale CPU and storage access together. =========== start description of commercial Postgres solution =========== =========== If commercial solutions offend, skip this section =========== The parallel I/O and CPU of Greenplum DB (formerly Bizgres MPP) is designed for exactly this workload, where a combination of scalable I/O and CPU is required to speed these kinds of queries (sessionizing weblogs, creating aggregates, direct ad-hoc analysis). One of our customers doing clickstream analysis uses a combination of sessionizing ELT processing with Greenplum DB + Bizgres KETL and Microstrategy for the reporting frontend. The complete system is 1/100 the price of systems that are slower. We routinely see speedups of over 100 compared to large scale multi-million dollar commercial solutions and have reference customers who are regularly working with Terabytes of data. ============ end commercial solution description ======================== - Luke On 11/29/06 11:43 AM, "Mark Jensen" <musicnyman1974@xxxxxxxxx> wrote: > posting this here instead of the GENERAL list...richard is right, this is more > of a performance question than a general question. > > thanks, > > ____________________________________ > Mark Jensen > > ----- Forwarded Message ---- > From: Mark Jensen <musicnyman1974@xxxxxxxxx> > To: Richard Huxton <dev@xxxxxxxxxxxx> > Cc: pgsql-general@xxxxxxxxxxxxxx > Sent: Wednesday, November 29, 2006 2:40:58 PM > Subject: Re: [GENERAL] Including unique users in huge data warehouse in > Postgresql... > > thanks Richard. I've talking to Ron Mayer about this as well offline. I > think the main problem is dedupping users, and not being able to aggregate > visits in the fact table. that's where most of the query time takes place. > but the business guys just won't accept using visits, but not actual uniques > dedupped. if visits was in the fact table as an integer i could sum up, i'd > be fine. Ron also said he's put the unique user ids into arrays so it's > faster to count them, but placing them into aggregate tables. only problem is > i'd still have to know what things to aggregate by to create these, which is > impossible since we have so many dimensions and facts that are going to be > ad-hoc. i have about 20 summary tables i create per day, but most of the > time, they have something new they want to query that's not in summary. and > will never come up again. > > I tried installing Bizgres using their Bizgres loader and custom postgresql > package with bitmap indexes, but doesn't seem to increase performance "that" > much. or as much as i would like compared to the normal postgresql install. > loads are pretty slow when using their bitmap indexes compared to just using > btree indexes in the standard postgresql install. Query time is pretty good, > but i also have to make sure load times are acceptable as well. and had some > problems with the bizgres loader losing connection to the database for no > reason at all, but when using the normal copy command in 8.2RC1, works fine. > love the new query inclusion in the copy command by the way, makes it so easy > to aggregrate hourly fact tables into daily/weekly/monthly in one shot :) > > and yes, work_mem is optimized as much as possible. postgresql is using about > 1.5 gigs of working memory when it runs these queries. looking into getting > 64 bit hardware with 16-32 gigs of RAM so i can throw most of this into memory > to speed it up. we're also using 3par storage which is pretty fast. we're > going to try and put postgresql on a local disk array using RAID 5 as well to > see if it makes a difference. > > and yes, right now, these are daily aggregate tables summed up from the > hourly. so about 17 million rows per day. hourly fact tables are impossible > to query right now, so i have to at least put these into daily fact tables. > so when you have 30 days in this database, then yes, table scans are going to > be huge, thus why it's taking so long, plus dedupping on unique user id :) > > and you're right, i should put this on the performance mailing list... see you > there :) > > thanks guys. > > ____________________________________ > Mark Jensen > > ----- Original Message ---- > From: Richard Huxton <dev@xxxxxxxxxxxx> > To: Mark Jensen <musicnyman1974@xxxxxxxxx> > Cc: pgsql-general@xxxxxxxxxxxxxx > Sent: Wednesday, November 29, 2006 2:29:35 PM > Subject: Re: [GENERAL] Including unique users in huge data warehouse in > Postgresql... > > Mark Jensen wrote: >> So i've been given the task of designing a data warehouse in >> either Postgresql or Mysql for our clickstream data for our sites. I >> started with Mysql but the joins in Mysql are just way too slow >> compared to Postgresql when playing with star schemas. > > Mark - it's not my usual area, but no-one else has picked up your > posting, so I'll poke my nose in. The other thing you might want to do > is post this on the performance list - that's probably the best place. > Might be worth talking to those at www.bizgres.org too (although I think > they all hang out on the performance list). > >> I can't say >> which sites i'm working on, but we get close to 3-5 million uniques >> users per day, so over time, that's a lot of unique users to keep >> around and de-dup your fact tables by. Need to be able to query normal >> analytics like: > <snip> > >> i've >> made a lot of optimizations in postgresql.conf by playing with work_mem >> and shared_buffers and such and i think the database is using as much >> as it can disk/memory/cpu wise. > > Big work_mem, I'm guessing. Limiting factor is presumably disk I/O. > > <snip> >> here's a sample query that takes a while to run... just a simple report that >> shows gender by area of the site. >> >> select A.gender as gender, B.area as area, sum(C.imps) as imps, sum(C.clicks) >> as clicks, count(distinct(C.uu_id)) as users >> from uus as A, areas as B, daily_area_fact as C >> where A.uu_id = C.uu_id >> and B.area_id = C.area_id >> group by gender,area; >> >> so >> by just having one day of data, with 3,168,049 rows in the user >> dimension table (uus), 17,213,420 in the daily_area_fact table that >> joins all the dimension tables, takes about 15 minutes. if i had 30-90 >> days in this fact table, who knows how long this would take... i know >> doing a distinct on uu_id is very expensive, so that's the main problem >> here i guess and would want to know if anyone else is doing it this way >> or better. > > In the end, I'd suspect the seq-scan over the fact table will be your > biggest problem. Can you pre-aggregate your fact-table into daily summaries? > > See you over on the performance list, where there are more experienced > people than myself to help you.