Re: Hardware/OS recommendations for large databases

Bruce Momjian wrote:
Greg Stark wrote:
Alan Stange <stange@xxxxxxxxxx> writes:

The point you're making doesn't match my experience with *any* storage or program
I've ever used, including postgresql.  Your point suggests that the storage
system is idle and that postgresql is broken because it isn't able to use the
resources available...even when the cpu is very idle.  How can that make sense?
Well I think what he's saying is that Postgres is issuing a read, then waiting
for the data to return. Then it does some processing, and goes back to issue
another read. The CPU is idle half the time because Postgres isn't capable of
doing any work while waiting for i/o, and the i/o system is idle half the time
while the CPU intensive part happens.

(Consider as a pathological example a program that reads 8k then sleeps for
10ms, and loops doing that 1,000 times. Now consider the same program
optimized to read 8M asynchronously and sleep for 10s. By the time it's
finished sleeping it has probably read in all 8M. Whereas the program that
read 8k in little chunks interleaved with small sleeps would probably take
twice as long and appear to be entirely i/o-bound with 50% iowait and 50%
idle.)
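
In code, the two patterns look roughly like this (a minimal sketch; the test
file name and sizes are made up, and POSIX AIO stands in for "asynchronously"
just to make the overlap concrete):

/*
 * Pattern A: 1,000 iterations of "read 8k, sleep 10ms".  The process
 * alternates between waiting on the disk and sleeping, so both the CPU
 * and the device sit partly idle.
 *
 * Pattern B: start one asynchronous 8M read, then sleep for 10s.  By the
 * time the sleep ends the data has usually arrived, because the I/O
 * overlapped with the "work" (simulated here by the sleep).
 */
#include <aio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (8 * 1024)
#define TOTAL (8 * 1024 * 1024)

static void pattern_a(int fd, char *buf)
{
    for (int i = 0; i < TOTAL / CHUNK; i++) {
        if (read(fd, buf, CHUNK) <= 0)   /* synchronous 8k read */
            break;
        usleep(10 * 1000);               /* 10ms of "processing" */
    }
}

static void pattern_b(int fd, char *buf)
{
    struct aiocb cb;
    const struct aiocb *list[1] = { &cb };

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = TOTAL;
    cb.aio_offset = 0;

    aio_read(&cb);               /* start the 8M read ...            */
    sleep(10);                   /* ... and overlap it with the work */
    aio_suspend(list, 1, NULL);  /* wait if it somehow isn't done    */
    aio_return(&cb);
}

int main(void)
{
    char *buf = malloc(TOTAL);
    int   fd  = open("testfile", O_RDONLY);   /* hypothetical test file */

    if (fd < 0 || buf == NULL)
        return 1;

    pattern_a(fd, buf);
    lseek(fd, 0, SEEK_SET);
    pattern_b(fd, buf);

    close(fd);
    free(buf);
    return 0;
}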

It's a reasonable theory and it's not inconsistent with the results you sent.
But it's not exactly proven either. Nor is it clear how to improve matters.
Adding additional threads to handle the i/o adds an enormous amount of
complexity and creates lots of opportunity for other contention that could
easily eat all of the gains.

Perfect summary.  We have a background writer now.  Ideally we would
have a background reader that reads ahead blocks into the buffer cache.
The problem is that while there is a relatively long time between a
buffer being dirtied and the time it must be on disk (checkpoint time),
the read-ahead time is much shorter, requiring some kind of quick
"create a thread" approach that could easily bog us down as outlined
above.

Right now the file system will do read-ahead for a heap scan (but not an
index scan), but even then, there is time required to get that kernel
block into the PostgreSQL shared buffers, backing up Luke's observation
of heavy memcpy() usage.

So what are our options?  mmap()?  I have no idea.  It seems a larger page
size does help.
For sequential scans, you do have a background reader. It's the kernel. As long as you don't issue a seek() between read() calls, the kernel will get the hint about sequential IO and begin to perform read-ahead for you. This is where the above analysis isn't quite right: while postgresql is processing the returned data from the read() call, the kernel has also issued reads as part of the read-ahead, keeping the device busy while the cpu is busy. (I'm assuming these details for Linux; Solaris/UFS does work this way.) Issue one seek on the file and the read-ahead algorithm will back off for a while. This was my point about some descriptions of how the system works not being sensible.
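
As a sketch (the file name and block size are placeholders, and posix_fadvise()
is just one way to make the hint explicit instead of relying on the kernel to
detect the pattern):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[32 * 1024];
    int  fd = open("bigtable", O_RDONLY);   /* hypothetical heap file */

    if (fd < 0) {
        perror("open");
        return 1;
    }

#ifdef POSIX_FADV_SEQUENTIAL
    /* Optional: state the sequential-access intent up front. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
#endif

    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        /* Process buf here.  The kernel keeps prefetching the next
         * blocks while this runs, so the disk and the CPU stay busy
         * at the same time. */
    }

    /* A single lseek() to a random offset inside this loop would break
     * the sequential pattern and the read-ahead window would shrink. */
    close(fd);
    return 0;
}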

If your goal is sequential IO, then you must use larger block sizes. No one would use 8KB IO to achieve high sequential IO rates. Simply put, read() is about the slowest way to get 8KB of data. Switching to 32KB blocks reduces the system call overhead by a large margin. Larger blocks would be better still, up to the stripe size of your mirror. (Of course, you're using a mirror and not raid5 if you care about performance.)
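
A quick way to see the difference is to time the same scan at a few block sizes
(dd with bs=8k versus bs=32k already shows it); the loop below does the same
thing with the size passed on the command line. The file and program names are
placeholders:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

/* Usage: ./scan <file> <blocksize>, e.g. "./scan bigtable 8192"
 * and then "./scan bigtable 32768" on the same (uncached) file. */
int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s file blocksize\n", argv[0]);
        return 1;
    }

    size_t  bs  = (size_t) atol(argv[2]);
    char   *buf = malloc(bs);
    int     fd  = open(argv[1], O_RDONLY);

    if (fd < 0 || buf == NULL) {
        perror("setup");
        return 1;
    }

    struct timeval t0, t1;
    long long      total = 0;
    ssize_t        n;

    gettimeofday(&t0, NULL);
    while ((n = read(fd, buf, bs)) > 0)
        total += n;
    gettimeofday(&t1, NULL);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%lld bytes in %.2fs (%.1f MB/s) with %zu-byte reads\n",
           total, secs, total / secs / (1024 * 1024), bs);

    close(fd);
    free(buf);
    return 0;
}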

I don't think the memcpy of data from the kernel to userspace is that big of an issue right now. dd and all the high-end network interfaces manage OK doing it, so I'd expect postgresql to do all right with it too. Direct IO will avoid that memcpy, but then you also don't get any caching of the files in memory. I'd be more concerned about any memcpy calls or general data management within postgresql. Does postgresql use the platform-specific memcpy() in libc? Some care might be needed to ensure that the memory blocks within postgresql are all properly aligned, so that one isn't ping-ponging cache lines around (usually done by padding the buffer sizes by an extra 32 bytes or an L1 line size). Whatever you do, all the usual high-performance computing tricks should be used before considering any rewrite of major code sections.
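
The padding/alignment trick looks roughly like this; the 64-byte line size,
the 8KB block size, and the buffer count are assumptions for illustration, and
none of it is taken from the actual PostgreSQL buffer manager:

#include <stdlib.h>
#include <string.h>

#define CACHE_LINE  64                /* assumed L1 line size */
#define BLOCK_SIZE  (8 * 1024)
#define NBUFFERS    1024

/* Round each buffer up to a whole number of cache lines and then add one
 * extra line of padding, so adjacent buffers never share a line. */
#define PADDED_SIZE \
    (((BLOCK_SIZE + CACHE_LINE - 1) / CACHE_LINE) * CACHE_LINE + CACHE_LINE)

int main(void)
{
    void *pool;

    /* Start the whole pool on a cache-line boundary ... */
    if (posix_memalign(&pool, CACHE_LINE, (size_t) NBUFFERS * PADDED_SIZE) != 0)
        return 1;

    /* ... so every buffer's start address is line-aligned as well. */
    char *buf0 = (char *) pool;
    char *buf1 = (char *) pool + PADDED_SIZE;

    memset(buf0, 0, BLOCK_SIZE);
    memset(buf1, 0, BLOCK_SIZE);

    free(pool);
    return 0;
}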

Personally, I'd like to see some detailed profiling done using hardware counters for cpu cycles, cache misses, etc. Given the poor quality of the work discussed in this thread, I don't have much confidence in any additional results at this time. None of the analysis would be acceptable in any environment in which I've worked. Be sure to take a look at Sun's free Workshop tools, as they are excellent for this sort of profiling and one doesn't need to recompile to use them. If I get a little time in the next week or two, I might take a crack at this.

Cheers,

-- Alan


