On Fri, Nov 21, 2008 at 3:18 PM, Grzegorz Jaśkiewicz <gryzman@xxxxxxxxx> wrote:
> you'll have to provide us with some sort of test-case to get some answers,
> please. (set of scripts, queries, etc).

    Below is the content of my original post. Inside I mention exactly the
way the benchmark was conducted. In short, the data is inserted by using
COPY sds_benchmark_data FROM STDIN, in batches of 500 thousand data points
(a sketch of one such batch follows below). I'll also paste the important
part here:

> * Postgres version: 8.3.3;
>
> * database schema:
>
>     create table sds_benchmark_data (
>         client int4 not null,
>         sensor int4 not null,
>         timestamp int4 not null,
>         value int4
>     );
>
>     alter table sds_benchmark_data add primary key (client, sensor, timestamp);
>
>     create index sds_benchmark_data__client_sensor__index
>         on sds_benchmark_data (client, sensor);
>
> * postgresql.conf (default values; I will list only what has been changed):
>     max_connections = 20
>     shared_buffers = 24MB
>     work_mem = 64MB
>     maintenance_work_mem = 256MB
>     fsync = off
>     checkpoint_segments = 32
>     effective_cache_size = 1024MB
>
> * inserts are done like this:
>     * I generated 100 million readings using the following rules:
>         * client is randomly chosen between 0 and 10 thousand;
>         * sensor is randomly chosen between 0 and 10;
>         * the timestamp is always increasing by one;
>     * the inserts are done in batches of 500 thousand (I've also tried
>       5, 25, 50 and 100 thousand without a big impact);
>     * the batch inserts are done through COPY sds_benchmark_data FROM STDIN
>       via libpq (using UNIX (local) sockets);

    Unfortunately I don't know what more information to give...

    Thanks,
    Ciprian Dorin Craciun.
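    To make the batch shape concrete, here is a minimal sketch of one such
COPY batch as it could be typed into psql; the benchmark itself streams the
same rows through libpq's COPY interface, and the three data rows below are
invented examples that merely follow the generation rules quoted above:

    -- one small COPY batch (columns are client, sensor, timestamp, value,
    -- tab-separated); the real batches contain 500 thousand such rows
    COPY sds_benchmark_data FROM STDIN;
    1042	3	1	87
    5731	7	2	112
    219	0	3	54
    \.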
On Fri, Nov 21, 2008 at 2:50 PM, Ciprian Dorin Craciun <ciprian.craciun@xxxxxxxxx> wrote:
> Hello all!
>
>     I would like to ask some advice about the following problem
> (related to the Dehems project: http://www.dehems.eu/ ):
>     * there are some clients; (the clients are in fact households;)
>     * each device has a number of sensors (about 10), and not all the
> clients have the same sensors; also sensors might appear and disappear
> dynamically; (the sensors are appliances;)
>     * for each device and each sensor a reading is produced (about
> every 6 seconds); (the values could be power consumptions;)
>     * I would like to store the following data: (client, sensor,
> timestamp, value);
>     * the usual queries are:
>         * for a given client (and sensor), and time interval, I need
> the min, max, and avg of the values;
>         * for a given time interval (and sensor), I need the min, max,
> and avg of the values;
>         * other statistics;
>
>     Currently I'm benchmarking the following storage solutions for this:
>     * Hypertable (http://www.hypertable.org/) -- which has a good
> insert rate (about 250k inserts / s), but a slow read rate (about 150k
> reads / s); (the aggregates are computed manually, as Hypertable does
> not support queries other than scanning; in fact min and max are easy,
> being the first / last key in the ordered set, but avg must be computed
> by a sequential scan;)
>     * BerkeleyDB -- quite an OK insert rate (about 50k inserts / s),
> but a fabulous read rate (about 2M reads / s); (the same issue with
> aggregates;)
>     * Postgres -- which behaves quite poorly (see below)...
>     * MySQL -- next to be tested;
>
>     So what can I do / how could I optimize the use of Postgres for
> this usage?
>
>     (I'm aware that there could be optimizations for this problem,
> like computing the aggregates in memory and storing only these
> aggregates every 10 minutes, or at some other interval, but I want to
> keep the full history, for data mining tasks for example;)
>
>     I would also like to present how the Postgres benchmark was implemented:
>
>     * test machine: Linux (Ubuntu 8.04 x64), IBM x3750, 2 x 500 GB
> SCSI, RAID 0, 8 GB RAM, 2 x Intel Xeon 5160 (2 cores, 3 GHz, 4 MB L2);
>     * Postgres version: 8.3.3;
>
>     * database schema:
>
>         create table sds_benchmark_data (
>             client int4 not null,
>             sensor int4 not null,
>             timestamp int4 not null,
>             value int4
>         );
>
>         alter table sds_benchmark_data add primary key (client, sensor, timestamp);
>
>         create index sds_benchmark_data__client_sensor__index
>             on sds_benchmark_data (client, sensor);
>
>     * postgresql.conf (default values; I will list only what has been changed):
>         max_connections = 20
>         shared_buffers = 24MB
>         work_mem = 64MB
>         maintenance_work_mem = 256MB
>         fsync = off
>         checkpoint_segments = 32
>         effective_cache_size = 1024MB
>
>     * inserts are done like this:
>         * I generated 100 million readings using the following rules:
>             * client is randomly chosen between 0 and 10 thousand;
>             * sensor is randomly chosen between 0 and 10;
>             * the timestamp is always increasing by one;
>         * the inserts are done in batches of 500 thousand (I've also
> tried 5, 25, 50 and 100 thousand without a big impact);
>         * the batch inserts are done through COPY sds_benchmark_data
> FROM STDIN via libpq (using UNIX (local) sockets);
>
>     What have I observed / tried:
>     * I've tested without the primary key and the index, and the
> results were the best for inserts (600k inserts / s), but the reads
> were extremely slow (due to the lack of indexing);
>     * with only the index (or only the primary key) the insert rate
> is good at the start (for the first 2 million readings), but then
> drops to about 200 inserts / s;
>
>     So could someone point out where I'm wrong, or tell me what I can
> do to optimize Postgres for this particular task?
>
>     Thanks for your help,
>     Ciprian Dorin Craciun.
>
>     P.S.: I want to publish the benchmark results after they are
> done, and I want to squeeze as much power out of Postgres as possible.
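    For reference, the "usual queries" described in the quoted post map onto
plain SQL aggregates; a minimal sketch, assuming the quoted schema and using
made-up example values for the client, sensor and interval bounds:

    -- min / max / avg for a given client and sensor over a time interval
    -- (the primary key on (client, sensor, timestamp) can serve this lookup)
    select min(value), max(value), avg(value)
      from sds_benchmark_data
     where client = 1042                    -- example client id
       and sensor = 3                       -- example sensor id
       and timestamp between 0 and 999999;  -- example interval

    -- the same aggregates for a given sensor and interval, over all clients
    select min(value), max(value), avg(value)
      from sds_benchmark_data
     where sensor = 3
       and timestamp between 0 and 999999;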