Re: [PATCHSET v5 0/12] Add support for async buffered reads

Andres Freund <andres@xxxxxxxxxxx> · Wed, 3 Jun 2020 18:30:45 -0700

Hi,

On 2020-06-03 19:04:17 -0600, Jens Axboe wrote:
> > The workload that triggers the bug within a few seconds is postgres
> > doing a parallel sequential scan of a large table (and aggregating the
> > data, but that shouldn't matter). In the triggering case that boils down
> > to 9 processes sequentially reading a number of 1GB files (we chunk
> > tables internally into smaller files). Each process will read a 512kB
> > chunk of the file on its own, and then claim the next 512kB from a
> > shared memory location. Most of the IO will be READV requests, reading
> > 16 * 8kB into postgres' buffer pool (which may or may not be neighboring
> > 8kB pages).
> 
> I'll try and reproduce this, any chance you have a test case that can
> be run so I don't have to write one from scratch? The more detailed
> instructions the better.

It shouldn't be too hard to write you a detailed script for reproducing
the issue. But it'd not be an all that minimal reproducer, unless it
also triggers on smaller scale (it's a 130GB database that triggers the
problem reliably, and small tables don't seem to do so reliably).

I'll try to write that up after I set up kvm / repro there.

One thing I forgot in the earlier email: I ran the benchmark using 'perf
stat -a -e ...'. I'm fairly, but not absolutely, certain that it also
triggered without that. I don't think it's related, but I thought I
better mention it.

> I have a known issue with request starvation, wonder if that could be it.
> I'm going to rebase the branch on top of the aops->readahead() changes
> shortly, and fix that issue. Hopefully that's what's plaguing your run
> here, but if not, I'll hunt that one down.

FWIW, I had iostat -xm /dev/nvme1n1 1 running during this. Shortly
before the crash I see:

Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
nvme1n1       6221.00    956.09  3428.00  35.53    0.24   157.38    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.48  99.00

Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
nvme1n1       6456.00    978.83  3439.00  34.75    0.21   155.25    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.38  98.70

It's maybe also worth noting that in this workload the results are
*worse* than when using 5.7-rc7 io_uring. So perhaps request starvation
isn't the worst guess...

Greetings,

Andres Freund