Hi, On 28 September 2012 04:34, Ryan Kelly <rpkelly22@xxxxxxxxx> wrote: > On Thu, Sep 27, 2012 at 08:51:31AM +1000, Ondrej Ivanič wrote: >> - aggregation job ran every 15 minutes and completed under 2 minutes: >> 5mil rows -> aggregation -> 56 tables > 5mil overall, or matching your aggregation query? And is that the 2TB > mentioned above? We have more than 100 times that many rows, but less > data. Let me explain. ETL process imports several thousands row every 5 minutes or so. Aggregation job runs every 15 minutes and it grabs the everything new since last run which could be up to 5 mil rows. Next step is to compute aggregates -- 56 queries like insert into mat_table1 select attr1, attr2, count(*) from tmp_table; 2TB was the size of the live dataset - 6 months, 30-40mil rows per month. >> - all queries can be executed over date range up to several months >> (monthly partitioned tables, 6 months history) > Yeah we have to be able to query over various date ranges. Partitioning works nicely in this case. >> - AsterData: nice SQL-MR feature and analytics (decision trees, >> frequent items, clustering, ...); No libpq support and you have to use >> JDBC or selected ODBC manager > I don't think no libpq support is a deal-breaker, but other missing > features could be problematic. It was for us -- we ended up with one option which was ODBC and unixODBC segfaulted on simple queries like 'select 1::numeric'. Aster removed many PG features (i think arrays and composite types are not supported) but they added several cool things. >> - Greenplum (winer): performance comparable to FusionIO (10 to 50 >> times); we were able to remove aggregation job (because of columnar >> store model); easy to port from postgres but could be complicated if >> you are heavy pgpsql user > Not using any pl/pgpsql, but a number of other features: arrays and > hstore, which I doubt (hopefully wrongly) that Greenplum supports. Anything which you can compile against 8.2 might work... >> At this time I would try: >> - Postgres-XC > From what I understand, more of a write-scaleable-oriented solution. We > mostly will need read scalability. I also don't think it really handles > redundancy. read scalability is there as well: it can use multiple nodes for select quires and push-down (execute it on node) certain operations. Check this talk: http://www.pgcon.org/2012/schedule/events/424.en.html redundancy is up to you -- you can deploy as many coordinator nodes as you need. Data distribution is quite flexible, see DISTRIBUTE BY and TO GROUP / NODE clauses (http://postgres-xc.sourceforge.net/docs/1_0/sql-createtable.html). >> - Stado > Looks promising, sounded very promising, but it doesn't seem to be > particularly active or well-documented. It also doesn't support window > functions (which I could probably get by without) or CTEs (which will be > trickier, but doable. I'm also not sure of how easy it is to handle node > failure or adding more nodes, as it appears the number of nodes is > essentially fixed. yup, documentations is not the best. You might have a look at pgpool-II parallel query mode (docs is skimpy, not sure about window functions and CTEs support) http://pgpool.projects.postgresql.org/pgpool-II/doc/pgpool-en.html >> PS. For any commercial solution is better to buy their appliance (if >> you can afford it...). > Thanks for the advice. Is it just better supported, or more performant, > or...? Usually both. You get support, monitoring, performance. Some of appliances do dial-home calls hence you get support call back with concrete advice / solution. Hardware is fine-tuned and proven: good RAIDs controller, disks, 10GbE interconnects, redundant network / storage paths. You can build something like that by your self but you are not going to save in the long run. -- Ondrej Ivanic (ondrej.ivanic@xxxxxxxxx) -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general