Re: Simple (hopefully) throughput question?

mark-

Thanks for all the good questions/insights.
> People are probably going to want more detail on the list to give alternate
> ways of attacking the problem. That said....

I am going to try and fill in some of the gaps where I can...

> The copy suggestion is a good one if you are unloading to another
> application for actual data processing.
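
For reference, that kind of COPY unload from the shell might look something
like the following (the database name and output path here are just
placeholders):

# stream the whole table out in one shot; usually cheaper than pulling the
# same rows back through a regular SELECT
psql -d testdb -c "COPY bulk_performance.counts TO STDOUT" > /tmp/counts.tsv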

> Are you watching an average of all cores or just one core when you see this
> CPU-bound issue? I would expect a simple "select * from table" to get some
> IO wait.
Our application doesn't have many concurrent requests, so this observation is
of a single query maxing out a single CPU. We have played with making
multiple requests, each for a subset of the data, and that scales fairly
well, but each process still looks very CPU bound for what is conceptually a
simple data retrieval. Our problem is really answering a single large
question as quickly as possible rather than the more typical
requests-per-second throughput (more OLAP than OLTP).
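
For anyone checking the same thing, the per-core view is easy to confirm with
something like the following (mpstat is part of the sysstat package):

mpstat -P ALL 1    # per-core utilization, refreshed every second
top                # or press "1" inside top to toggle the per-core display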

> Does the table you're reading from have most of the data stored in TOAST? I
> ask because as a default there is some compression with TOAST, and maybe
> you're spending more time on decompression than expected. Not the first
> thing that I would think to change or where I suspect the problem comes
> from.
No, the example table isn't too far from the real thing (simple, I know):
just add a few int/short metadata columns and you have it. The example table
exhibits the same performance issue, so I think it is a reasonable test case.
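
A quick way to see how much of a table lives in TOAST is to compare the heap
size with the total size (the database name below is a placeholder):

psql -d testdb -c "SELECT pg_size_pretty(pg_relation_size('bulk_performance.counts')) AS heap_only,
       pg_size_pretty(pg_total_relation_size('bulk_performance.counts')) AS with_toast_and_indexes;"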

> A more detailed hardware list and a more detailed example case will probably
> help a lot with getting some of the really smart PG people on it (like Tom
> Lane or some of the people who work for a PostgreSQL paid-support company).
Hardware
------------------------------------
2 x Dual-Core 2.4 GHz Opteron
8 GB RAM
4-drive RAID 5

> For the record: 35 MB/s for seq reads isn't that fast, so a lot of people
> are going to wonder why that number is so low.
I completely agree. (With the large RAID arrays or even SSD arrays I have
seen mentioned here on the list, 3 GB/s isn't crazy any longer.)

I think our dilemma was that we weren't able to make use of the IO throughput
we already had. This was most evident when we ran some tests with smaller
datasets that fit entirely into the disk cache: iostat showed that the query
indeed ran almost entirely from cache, yet it yielded nearly the same 5 MB/s
throughput. That seemed to indicate our problem was somewhere other than the
storage infrastructure, and it led us to the single-CPU bottleneck discovery.
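
For anyone who wants to repeat that check, the pattern was roughly the
following (the database name is a placeholder; run iostat in a second
terminal while the timed query runs):

# first run warms the OS cache
psql -d testdb -c "SELECT * FROM bulk_performance.counts" > /dev/null
# near-zero disk reads here, with an unchanged elapsed time, points at CPU
# rather than IO
iostat -x 2
time psql -d testdb -c "SELECT * FROM bulk_performance.counts" > /dev/null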
> Anyways, since I suspect that improving IO some will actually speed up your
> select * case, I submit the following for you to consider.
>
> My suggestions to improve a "select * from table" case (and copy
> performance):
>
> First, if you haven't, bump your read-ahead value; this should improve
> things some - however probably not enough by itself.
>
> blockdev --setra 16384 /dev/<devicename>
Will check into this one.
> The default on most linux installs is IMO woefully small at 256. 16384 might
> be a little high for you but it has worked well for us and our hardware.
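
It is worth checking the current value first, and remembering that the
setting does not survive a reboot (the device name below is a placeholder):

blockdev --getra /dev/sdb          # current read-ahead, in 512-byte sectors
blockdev --setra 16384 /dev/sdb    # ~8 MB of read-ahead; re-apply at boot,
                                   # e.g. from /etc/rc.local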


> If your data directory is mounted as its own partition or logical disk you
> might consider mounting it with the noatime flag.
>
> Also, are you running ext3? If you can pull in XFS (and do noatime there as
> well) you should see about a 20% increase. It looks like this is a Red Hat
> or CentOS box. If this is RHEL-ES you will need to either do a custom import
> of xfsdump and xfsprogs yourself and risk support issues from RH, or if it
> is CentOS you can pull in the "extras". (If it is Red Hat and not Adv.
> Server, you can buy it from Red Hat for ES servers.) CentOS/RHEL 6 should
> have XFS support by default, but that might be too far off for you.

> (This 20% number came from our own in-house testing of sequential reads with
> dd, but there are plenty of other posts on the web showing how much of an
> improvement XFS is over ext3.)
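
A sequential-read test along those lines might look like this (the device
name is a placeholder; iflag=direct bypasses the page cache so you measure
the disks rather than RAM):

dd if=/dev/sdb of=/dev/null bs=1M count=4096 iflag=direct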

> If you haven't broken out the data directory to its own partition (and
> hopefully spindles), there are some problems with using noatime on a system;
> be aware of what they are.
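
If the data directory does live on its own partition, the mount change itself
is small (device, mount point, and filesystem below are placeholders):

# example /etc/fstab entry for a dedicated data partition:
#   /dev/sdb1  /var/lib/pgsql/data  xfs  defaults,noatime  0 0
# or apply it to the already-mounted filesystem without a reboot:
mount -o remount,noatime /var/lib/pgsql/data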

> You will probably still be annoyed with a single long-running query getting
> bottlenecked at a single CPU core, but without a more detailed example case
> people might have a hard time helping with solving that problem. Anyways,
> try these and see if that gets you anywhere.
We are on ext3 currently; our support/IT layer may balk at non-Red Hat RPM
stuff, so we'll have to see on that one.

I think if we can get past the seeming CPU bound, then all of the above would
be candidates for optimizing the IO. We had actually slated to buy a system
with an SSD array and separate spindles for OS, data, etc., but when I
couldn't make a convincing case for the end-user response-time improvements
it was put on hold - hence the source of our questions here. ;)

> You could always play with the Greenplum single-node edition if you want to
> see a way to still sort-of be on Postgres and use all cores... but that
> introduces a whole host of other issues to solve.
Interesting. I had heard of Greenplum, but thought it was more about scaling to clusters rather than single node improvements. We will have to look into that.

Thanks again for all the ideas, questions, and things to look into; I think
you have opened up a number of new possibilities.

Cheers,

Nick

-----Original Message-----
From: pgsql-performance-owner@xxxxxxxxxxxxxx
[mailto:pgsql-performance-owner@xxxxxxxxxxxxxx] On Behalf Of Nick Matheson
Sent: Wednesday, November 03, 2010 9:53 AM
To: pgsql-performance@xxxxxxxxxxxxxx
Subject:  Simple (hopefully) throughput question?

Hello

We have an application that needs to do bulk reads of ENTIRE
Postgres tables very quickly (i.e. select * from table).  We have
observed that such sequential scans run more than an order of magnitude
slower than observed raw disk reads (5 MB/s versus 100 MB/s).  Part of
this is due to the storage overhead we have observed in Postgres: in the
example below, it takes 1 GB to store 350 MB of nominal data.  Even
accounting for that overhead, though, we would expect bulk reads to
deliver roughly 35 MB/s of nominal data (100 MB/s x 350 MB / 1 GB),
not 5 MB/s.

Observations using iostat and top during these bulk reads suggest
that the queries are CPU bound, not I/O bound.  In fact, repeating the
queries yields similar response times.  Presumably if it were an I/O
issue the response times would be much shorter the second time through
with the benefit of caching.

We have tried these simple queries using psql, JDBC, pl/java stored
procedures, and libpq.  In all cases the client code ran on the same
box as the server. We have experimented with Postgres 8.1, 8.3 and 9.0.

We also tried playing around with some of the server tuning parameters such
as shared_buffers to no avail.

Here is uname -a for a machine we have tested on:

Linux nevs-bdb1.fsl.noaa.gov 2.6.18-194.17.1.el5 #1 SMP Mon Sep 20 07:12:06
EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

A sample dataset that reproduces these results looks like the following
(there are no indexes):

Table "bulk_performance.counts"
 Column |  Type   | Modifiers
--------+---------+-----------
 i1     | integer |
 i2     | integer |
 i3     | integer |
 i4     | integer |

There are 22 million rows in this case.
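
One way to recreate a dataset of that shape is shown below (the database name
is a placeholder and the generated values are arbitrary; 22 million rows of
four integers works out to roughly the 350 MB of nominal data and ~1 GB on
disk mentioned above):

psql -d testdb <<'SQL'
CREATE SCHEMA bulk_performance;
CREATE TABLE bulk_performance.counts (i1 int, i2 int, i3 int, i4 int);
INSERT INTO bulk_performance.counts
  SELECT n, n, n, n FROM generate_series(1, 22000000) AS g(n);
SQL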

We HAVE observed that summation queries run considerably faster.  In this
case,

select sum(i1), sum(i2), sum(i3), sum(i4) from bulk_performance.counts

runs at 35 MB/s.

Our business logic does operations on the resulting data such that
the output is several orders of magnitude smaller than the input.  So
we had hoped that by putting our business logic into stored procedures
(and thus drastically reducing the amount of data flowing to the
client) our throughput would go way up.  This did not happen.

So our questions are as follows:

Is there any way, using stored procedures (maybe C code that calls
SPI directly) or some other approach, to get close to the expected 35
MB/s doing these bulk reads?  Or is this the price we have to pay for
using SQL instead of some NoSQL solution?  (We actually tried Tokyo
Cabinet and found it to perform quite well; however, it does not measure
up to Postgres in terms of replication, data interrogation, community
support, acceptance, etc.)

Thanks

Dan Schaffer
Paul Hamer
Nick Matheson




--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

