Hello all! I would like to ask some advice about the following problem (related to the Dehems project: http://www.dehems.eu/ ): * there are some clients; (the clients are in fact house holds;) * each device has a number of sensors (about 10), and not all the clients have the same sensor; also sensors might appear and disappear dynamicaly; (the sensors are appliances;) * for each device and each sensor a reading is produced (at about 6 seconds); (the values could be power consumptions;) * I would like to store the following data: (client, sensor, timestamp, value); * the usual queries are: * for a given client (and sensor), and time interval, I need the min, max, and avg of the values; * for a given time interval (and sensor), I need min, max, and avg of the values; * other statistics; Currently I'm benchmarking the following storage solutions for this: * Hypertable (http://www.hypertable.org/) -- which has good insert rate (about 250k inserts / s), but slow read rate (about 150k reads / s); (the aggregates are manually computed, as Hypertable does not support other queries except scanning (in fact min, and max are easy beeing the first / last key in the ordered set, but avg must be done by sequential scan);) * BerkeleyDB -- quite Ok insert rate (about 50k inserts / s), but fabulos read rate (about 2M reads / s); (the same issue with aggregates;) * Postgres -- which behaves quite poorly (see below)... * MySQL -- next to be tested; So what can I do / how could I optimize the use of Postgres for this usage? (I'm aware that there could be optimizations for this problem (like computing the aggregates in memory and storing only these aggregates at 10 minutes, or other interval), but I want to have the full history (for data mining tasks for example);) I will also like to present how was the Postgres benchmark implemented: * test machine: Linux (Ubuntu 8.04 x64), IBM x3750, 2 x 500Gb, SCSI, RAID 0, 8Gb RAM, 2 x Intel Xeon 5160 x 2 core (3Ghz, 4Mb L2); * Postgres version: 8.3.3; * database schema: > create table sds_benchmark_data ( > client int4 not null, > sensor int4 not null, > timestamp int4 not null, > value int4 > ); > > alter table sds_benchmark_data add primary key (client, sensor, timestamp); > > create index sds_benchmark_data__client_sensor__index on sds_benchmark_data (client, sensor); * postgres.conf (the default values, I will list only what has been changed): > max_connections = 20 > shared_buffers = 24MB > work_mem = 64MB > maintenance_work_mem = 256MB > fsync = off > checkpoint_segments = 32 > effective_cache_size = 1024MB * inserts are done like this: * generated 100 million readings by using the following rule: * client is randomly chosen between 0 and 10 thousand; * sensor is randomly chosen between 0 and 10; * the timestamp is always increasing by one; * the insert is done in batches of 500 thousand inserts (I've also tried 5, 25, 50 and 100 thousand without big impact); * the banch inserts are done through COPY sds_benchmark_data FROM STDIN through libpq (by using UNIX (local) sockets); What have I observed / tried: * I've tested without the primary key and the index, and the results were the best for inserts (600k inserts / s), but the readings, worked extremly slow (due to the lack of indexing); * with only the index (or only the primary key) the insert rate is good at start (for the first 2 million readings), but then drops to about 200 inserts / s; So could someone point me where I'me wrong, or what can I do to optimize Postgres for this particular task? Thanks for your help, Ciprian Dorin Craciun. P.S.: I'll want to publish the benchmark results after they are done, and I want to squeeze as much power out of Postgres as possible. -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general