Re: Help with bulk read performance

On 11/1/2010 9:15 AM, Dan Schaffer wrote:
Hello

We have an application that needs to do bulk reads of ENTIRE Postgres
tables very quickly (i.e. select * from table). We have observed that
such sequential scans run about 20 times slower than observed raw disk
reads (5 MB/s versus 100 MB/s). Part of this is due to the storage
overhead we have observed in Postgres: in the example below, it takes
1 GB to store 350 MB of nominal data. But even allowing for that
overhead, we would expect bulk read rates of about 35 MB/s
(100 MB/s x 350 MB / 1 GB), not 5 MB/s.

Observations using iostat and top during these bulk reads suggest that
the queries are CPU bound, not I/O bound. In fact, repeating the queries
yields similar response times. Presumably if it were an I/O issue the
response times would be much shorter the second time through with the
benefit of caching.

We have tried these simple queries using psql, JDBC, pl/java stored
procedures, and libpq. In all cases the client code ran on the same box
as the server.
We have experimented with Postgres 8.1, 8.3 and 9.0.
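
For reference, a minimal libpq read loop looks roughly like the sketch
below (this is not the exact test code; the connection string and the
checksum loop are placeholders just to keep the example self-contained):

/* Minimal sketch of a libpq bulk read (placeholder conninfo and query).
 * Build roughly: cc bulkread.c -lpq -o bulkread
 */
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int main(void)
{
    PGconn   *conn;
    PGresult *res;
    long long sum = 0;
    int       rows, r, c;

    conn = PQconnectdb("dbname=test");          /* placeholder conninfo */
    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* Pulls the entire result set into client memory in one go; a real
     * test might instead page through with a DECLARE/FETCH cursor to
     * bound memory use. */
    res = PQexec(conn, "SELECT i1, i2, i3, i4 FROM bulk_performance.counts");
    if (PQresultStatus(res) != PGRES_TUPLES_OK)
    {
        fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
        PQclear(res);
        PQfinish(conn);
        return 1;
    }

    rows = PQntuples(res);
    for (r = 0; r < rows; r++)
        for (c = 0; c < 4; c++)
            sum += atoi(PQgetvalue(res, r, c)); /* values arrive as text */

    printf("rows = %d, checksum = %lld\n", rows, sum);

    PQclear(res);
    PQfinish(conn);
    return 0;
}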

We also tried adjusting some of the server tuning parameters, such as
shared_buffers, to no avail.

Here is uname -a for a machine we have tested on:

Linux nevs-bdb1.fsl.noaa.gov 2.6.18-194.17.1.el5 #1 SMP Mon Sep 20
07:12:06 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

A sample dataset that reproduces these results looks like the following
(there are no indexes):

Table "bulk_performance.counts"
Column | Type | Modifiers
--------+---------+-----------
i1 | integer |
i2 | integer |
i3 | integer |
i4 | integer |

There are 22 million rows in this case (22 million rows x 4 integer
columns x 4 bytes per value is roughly the 350 MB of nominal data
mentioned above).

We HAVE observed that summation queries run considerably faster. In this
case,

select sum(i1), sum(i2), sum(i3), sum(i4) from bulk_performance.counts

runs at 35 MB/s.


Our business logic does operations on the resulting data such that the
output is several orders of magnitude smaller than the input. So we had
hoped that by putting our business logic into stored procedures (and
thus drastically reducing the amount of data flowing to the client) our
throughput would go way up. This did not happen.

So our questions are as follows:

Is there any way using stored procedures (maybe C code that calls SPI
directly) or some other approach to get close to the expected 35 MB/s
doing these bulk reads? Or is this the price we have to pay for using
SQL instead of some NoSQL solution? (We actually tried Tokyo Cabinet and
found it to perform quite well. However, it does not measure up to
Postgres in terms of replication, data interrogation, community support,
acceptance, etc.)
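
To make the question concrete, the kind of C function calling SPI we
have in mind would look roughly like the sketch below (the function
name and the single-column sum are purely illustrative; a real version
would presumably fetch in batches via SPI cursors rather than
materialize all 22 million rows in SPI_tuptable at once):

/* Rough sketch of a server-side C function that reads the table via SPI
 * and returns a single aggregate, so almost no data flows to the client.
 * Function name, query, and error handling are illustrative only.
 */
#include "postgres.h"
#include "fmgr.h"
#include "executor/spi.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(counts_sum_i1);

Datum
counts_sum_i1(PG_FUNCTION_ARGS)
{
    int64   total = 0;
    uint64  row;

    SPI_connect();
    /* read-only execution, no row-count limit */
    SPI_execute("SELECT i1 FROM bulk_performance.counts", true, 0);

    for (row = 0; row < SPI_processed; row++)
    {
        bool  isnull;
        Datum d = SPI_getbinval(SPI_tuptable->vals[row],
                                SPI_tuptable->tupdesc, 1, &isnull);

        if (!isnull)
            total += DatumGetInt32(d);
    }

    SPI_finish();
    PG_RETURN_INT64(total);
}

Whether something along these lines can approach the 35 MB/s we see for
the aggregate query is exactly what we are asking.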

Thanks

Dan Schaffer
Paul Hamer
Nick Matheson





Whoa... Deja Vu

Is this the same thing Nick is working on?  How'd he get along?

http://archives.postgresql.org/message-id/4CD1853F.2010800@xxxxxxxx


-Andy


