Hello and thanks for the feedback!
On Wed, 12 Jul 2023, Thomas Munro wrote:

> On Wed, Jul 12, 2023 at 1:11 AM Dimitrios Apostolou <jimis@xxxxxxx> wrote:
> > Note that I suspect my setup is a factor (btrfs compression behaving
> > suboptimally), since the raw device can give me up to 1GB/s read rate.
> > It is however evident that reading in bigger chunks would mitigate
> > such setup inefficiencies. On a system where reads are already
> > optimal and the read rate remains the same, a bigger block size would
> > probably still reduce the sys time postgresql consumes because of the
> > fewer system calls.
> I don't know about btrfs but maybe it can be tuned to prefetch
> sequential reads better...

I tried hard to tweak the kernel's block-layer read-ahead and tried
different I/O schedulers, but it made no difference. I'm now convinced
that the problem manifests especially on compressed btrfs: the
filesystem does not do any read-ahead (pre-fetch), so no I/O requests
get merged at the block layer.
Iostat gives an interesting insight into the above measurements. For
both postgres doing a sequential scan and dd with bs=8k, the kernel
block layer does not appear to merge the I/O requests: `iostat -x`
shows a small average read request size (rareq-sz ~16KB), zero merged
requests, and a very high reads/s IOPS number.
The dd commands with bs=32k show fewer IOPS in `iostat -x` but higher
throughput(!), a larger average request size and a high number of
merged requests.
Example output for some random second out of dd bs=8k:
Device        r/s    rMB/s   rrqm/s  %rrqm  r_await  rareq-sz
sdc       1313.00    20.93     2.00   0.15     0.53     16.32
with dd bs=32k:
Device        r/s    rMB/s   rrqm/s  %rrqm  r_await  rareq-sz
sdc        290.00    76.44  4528.00  93.98     1.71    269.92
On the same filesystem, if I do dd bs=8k reads from a file that has
not been compressed by the filesystem, I get 1GB/s device read
throughput!
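
For what it's worth, the effect is easy to reproduce outside of dd
with a trivial read loop. A minimal sketch (my own test code, error
handling kept short):

    /* blkread.c -- sequential read with a configurable block size,
       equivalent to the dd runs above.
       build: cc -O2 -o blkread blkread.c
       usage: ./blkread FILE BLOCKSIZE */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s FILE BLOCKSIZE\n", argv[0]);
            return 1;
        }

        size_t bs = strtoul(argv[2], NULL, 10);
        char *buf = malloc(bs);
        int fd = open(argv[1], O_RDONLY);
        unsigned long long total = 0;
        ssize_t n;

        if (fd < 0 || buf == NULL) {
            perror("setup");
            return 1;
        }

        /* one read() per block: with bs=8192 this produces the same
           small-request pattern that iostat shows for the seqscan */
        while ((n = read(fd, buf, bs)) > 0)
            total += n;

        printf("read %llu bytes in %zu-byte blocks\n", total, bs);
        close(fd);
        free(buf);
        return 0;
    }

Timing it with block size 8192 vs 32768 should reproduce the dd
numbers above.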
I sent this report to the btrfs list, but have received no response yet:
https://www.spinics.net/lists/linux-btrfs/msg137200.html
> > So would it make sense for postgres to perform reads in bigger
> > blocks? Is it easy-ish to implement (where would one look for that)?
> > Or must the I/O unit be tied to postgres' page size?
> It is hard to implement. But people are working on it. One of the
> problems is that the 8KB blocks that we want to read data into aren't
> necessarily contiguous so you can't just do bigger pread() calls
> without solving a lot more problems first.

This kind of overhaul is good, but goes much deeper. Same with async
I/O, of course. But what I have in mind should be much simpler (take
it with a grain of salt, since I don't know the postgres internals :-)
+ A process wants to read a block from a file
+ Postgres' buffer cache layer (shared_buffers?) looks it up in the cache,
if not found it passes the request down to
+ postgres' block layer; it submits an I/O request for 32KB that
includes the 8K block requested; it returns the 32K block to
+ postgres' buffer cache layer; it stores all 4 blocks read from the disk
into the buffer cache, and returns only the 1 block requested.
The danger here is that with random non-contiguous 8K reads, the
buffer cache gets saturated by 4x the amount of data because of the
32K reads, and 75% of that data is useless but may still evict useful
data. The answer is that such data should be marked as unused (by
putting it at the front of the cache's LRU, for example) so that those
unused read-ahead pages are re-used for upcoming read-ahead, without
evicting too many useful pages.
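
In rough C, the idea might look something like this (all the
buffer_cache_* names are made up, since I don't know what the real
bufmgr API looks like):

    /* Hypothetical sketch of the read-around idea above -- not actual
       postgres code; the buffer_cache_* functions are invented. */
    #include <sys/types.h>
    #include <unistd.h>

    #define BLCKSZ     8192
    #define READAROUND    4             /* 4 x 8KB = 32KB per syscall */

    typedef unsigned int BlockNumber;

    /* assumed to exist, for the sake of the sketch */
    extern char *buffer_cache_store(int fd, BlockNumber blkno,
                                    const char *page);
    extern void  buffer_cache_mark_evictable(int fd, BlockNumber blkno);

    char *read_block_with_readaround(int fd, BlockNumber blkno)
    {
        BlockNumber first = blkno - (blkno % READAROUND); /* align to 32KB */
        char        chunk[READAROUND * BLCKSZ];
        char       *wanted = NULL;

        /* one 32KB read instead of four 8KB ones */
        if (pread(fd, chunk, sizeof(chunk), (off_t) first * BLCKSZ) < 0)
            return NULL;

        for (int i = 0; i < READAROUND; i++)
        {
            char *page = buffer_cache_store(fd, first + i,
                                            chunk + (size_t) i * BLCKSZ);

            if (first + i == blkno)
                wanted = page;
            else
                /* speculative pages go to the front of the LRU so they
                   are the first to be evicted if nobody requests them */
                buffer_cache_mark_evictable(fd, first + i);
        }
        return wanted;
    }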
> The project at https://wiki.postgresql.org/wiki/AIO aims to deal with
> the "clustering" you seek plus the "gathering" required for
> non-contiguous buffers by allowing multiple block-sized reads to be
> prepared and collected on a pending list up to some size that triggers
> merging and submission to the operating system at a sensible rate, so
> we can build something like a single large preadv() call. In the
> current prototype, if io_method=worker then that becomes a literal
> preadv() call running in a background "io worker" process, but it
> could also be OS-specific stuff (io_uring, ...) that starts an
> asynchronous IO depending on settings. If you take that branch and
> run your test you should see 128KB-sized preadv() calls.

Interesting, and kind of sad that the last update on the wiki page is
from 2021. What is the latest prototype? I'm not sure I'm up to the
task of putting my database to the test. ;-)
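
If I understand the merging correctly, the submitted I/O would end up
looking roughly like this (my own illustration, not code from the AIO
branch):

    /* Illustration of merging pending 8KB reads into a single preadv():
       one contiguous file range is scattered into non-contiguous memory
       buffers, which is the "gathering" part mentioned above. */
    #include <sys/types.h>
    #include <sys/uio.h>

    #define BLCKSZ    8192
    #define IO_MERGE    16              /* 16 x 8KB = 128KB */

    ssize_t submit_merged_read(int fd, off_t start_block,
                               char *bufs[IO_MERGE])
    {
        struct iovec iov[IO_MERGE];

        /* the file blocks are consecutive, but each destination buffer
           can live anywhere in the buffer pool */
        for (int i = 0; i < IO_MERGE; i++)
        {
            iov[i].iov_base = bufs[i];
            iov[i].iov_len  = BLCKSZ;
        }

        /* one 128KB request reaches the device instead of
           sixteen 8KB ones */
        return preadv(fd, iov, IO_MERGE, start_block * BLCKSZ);
    }

Something like that would presumably avoid the tiny unmerged reads I'm
seeing on compressed btrfs.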
Thanks and regards,
Dimitris