Hi,

On 2020-06-03 19:04:17 -0600, Jens Axboe wrote:
> > The workload that triggers the bug within a few seconds is postgres
> > doing a parallel sequential scan of a large table (and aggregating the
> > data, but that shouldn't matter). In the triggering case that boils down
> > to 9 processes sequentially reading a number of 1GB files (we chunk
> > tables internally into smaller files). Each process will read a 512kB
> > chunk of the file on its own, and then claim the next 512kB from a
> > shared memory location. Most of the IO will be READV requests, reading
> > 16 * 8kB into postgres' buffer pool (which may or may not be neighboring
> > 8kB pages).
>
> I'll try and reproduce this, any chance you have a test case that can
> be run so I don't have to write one from scratch? The more detailed
> instructions the better.

It shouldn't be too hard to write you a detailed script for reproducing
the issue. But it wouldn't be an all that minimal reproducer, unless it
also triggers at a smaller scale (it's a 130GB database that triggers
the problem reliably, and small tables don't seem to do so reliably).
I'll try to write that up after I set up kvm / a repro there.

One thing I forgot in the earlier email: I ran the benchmark using
'perf stat -a -e ...'. I'm fairly, but not absolutely, certain that it
also triggered without that. I don't think it's related, but I thought
I'd better mention it.

> I have a known issue with request starvation, wonder if that could be it.
> I'm going to rebase the branch on top of the aops->readahead() changes
> shortly, and fix that issue. Hopefully that's what's plaguing your run
> here, but if not, I'll hunt that one down.

FWIW, I had 'iostat -xm /dev/nvme1n1 1' running during this. Shortly
before the crash I see:

Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
nvme1n1       6221.00    956.09  3428.00  35.53    0.24   157.38    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.48  99.00

Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
nvme1n1       6456.00    978.83  3439.00  34.75    0.21   155.25    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.38  98.70

It's maybe also worth noting that in this workload the results are
*worse* than when using 5.7-rc7 io_uring. So perhaps request starvation
isn't the worst guess...

Greetings,

Andres Freund
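
[Editor's sketch, not part of the original mail: the access pattern quoted at
the top (workers claiming 512kB chunks from shared memory and reading each
chunk as READVs of 16 x 8kB into scattered buffers) could look roughly like
the following. This is only an illustration under assumptions, not the
promised repro script or postgres code; the worker count, file name, queue
depth, and the one-request-in-flight liburing usage are all made up here.]

/*
 * Sketch of the described pattern: NWORKERS processes share an atomic chunk
 * counter (as postgres' parallel scan state does), claim 512kB chunks of one
 * data file, and read each chunk as READVs of 16 x 8kB iovecs via io_uring.
 * Build with: gcc sketch.c -o sketch -luring
 */
#include <fcntl.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <liburing.h>

#define NWORKERS 9
#define CHUNK    (512 * 1024)   /* unit claimed from shared memory */
#define PAGE_SZ  (8 * 1024)     /* postgres block size */
#define NIOV     16             /* 16 x 8kB per READV, as described */

static void worker(const char *path, _Atomic long *next_chunk)
{
	struct io_uring ring;
	struct iovec iov[NIOV];
	int fd = open(path, O_RDONLY);

	if (fd < 0 || io_uring_queue_init(32, &ring, 0) < 0) {
		perror("setup");
		exit(1);
	}

	/* 16 separate 8kB buffers, standing in for non-neighboring buffer-pool pages */
	for (int i = 0; i < NIOV; i++) {
		iov[i].iov_base = malloc(PAGE_SZ);
		iov[i].iov_len = PAGE_SZ;
	}

	off_t fsize = lseek(fd, 0, SEEK_END);

	for (;;) {
		/* claim the next 512kB chunk from shared memory */
		off_t off = (off_t)atomic_fetch_add(next_chunk, 1) * CHUNK;
		if (off >= fsize)
			break;

		/* a 512kB chunk becomes 4 READVs of 16 x 8kB each */
		for (int sub = 0; sub < CHUNK / (NIOV * PAGE_SZ); sub++) {
			struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
			struct io_uring_cqe *cqe;

			io_uring_prep_readv(sqe, fd, iov, NIOV,
					    off + (off_t)sub * NIOV * PAGE_SZ);
			io_uring_submit(&ring);

			if (io_uring_wait_cqe(&ring, &cqe) < 0)
				exit(1);
			if (cqe->res < 0)
				fprintf(stderr, "readv: %s\n", strerror(-cqe->res));
			io_uring_cqe_seen(&ring, cqe);
		}
	}

	io_uring_queue_exit(&ring);
	exit(0);
}

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "datafile"; /* e.g. one 1GB segment */

	/* chunk counter shared between all workers */
	_Atomic long *next_chunk = mmap(NULL, sizeof(*next_chunk),
					PROT_READ | PROT_WRITE,
					MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	*next_chunk = 0;

	for (int i = 0; i < NWORKERS; i++)
		if (fork() == 0)
			worker(path, next_chunk);
	for (int i = 0; i < NWORKERS; i++)
		wait(NULL);
	return 0;
}

[Note that this keeps only one READV in flight per worker and reads a single
file, which is presumably far simpler than the actual benchmark; it is meant
only to show the chunk-claiming and 16 x 8kB READV shape described above.]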