On Wed, Mar 26, 2014 at 08:11:13PM +0100, Andres Freund wrote: > Hi, > > At LSF/MM there was a slot about postgres' problems with the kernel. Our > top#1 concern is frequent slow read()s that happen while another process > calls fsync(), even though we'd be perfectly fine if that fsync() took > ages. > The "conclusion" of that part was that it'd be very useful to have a > demonstration of the problem without needing a full blown postgres > setup. I've quickly hacked something together, that seems to show the > problem nicely. > > For a bit of context: lwn.net/SubscriberLink/591723/940134eb57fcc0b8/ > and the "IO Scheduling" bit in > http://archives.postgresql.org/message-id/20140310101537.GC10663%40suse.de > > The tools output looks like this: > gcc -std=c99 -Wall -ggdb ~/tmp/ioperf.c -o ioperf && ./ioperf > ... > wal[12155]: avg: 0.0 msec; max: 0.0 msec > commit[12155]: avg: 0.2 msec; max: 15.4 msec > wal[12155]: avg: 0.0 msec; max: 0.0 msec > read[12157]: avg: 0.2 msec; max: 9.4 msec > ... > read[12165]: avg: 0.2 msec; max: 9.4 msec > wal[12155]: avg: 0.0 msec; max: 0.0 msec > starting fsync() of files > finished fsync() of files > read[12162]: avg: 0.6 msec; max: 2765.5 msec > > So, the average read time is less than one ms (SSD, and about 50% cached > workload). But once another backend does the fsync(), read latency > skyrockets. > > A concurrent iostat shows the problem pretty clearly: > Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util > sda 1.00 0.00 6322.00 337.00 51.73 4.38 17.26 2.09 0.32 0.19 2.59 0.14 90.00 > sda 0.00 0.00 6016.00 303.00 47.18 3.95 16.57 2.30 0.36 0.23 3.12 0.15 94.40 > sda 0.00 0.00 6236.00 1059.00 49.52 12.88 17.52 5.91 0.64 0.20 3.23 0.12 88.40 > sda 0.00 0.00 105.00 26173.00 0.89 311.39 24.34 142.37 5.42 27.73 5.33 0.04 100.00 > sda 0.00 0.00 78.00 27199.00 0.87 324.06 24.40 142.30 5.25 11.08 5.23 0.04 100.00 > sda 0.00 0.00 10.00 33488.00 0.11 399.05 24.40 136.41 4.07 100.40 4.04 0.03 100.00 > sda 0.00 0.00 3819.00 10096.00 31.14 120.47 22.31 42.80 3.10 0.32 4.15 0.07 96.00 > sda 0.00 0.00 6482.00 346.00 52.98 4.53 17.25 1.93 0.28 0.20 1.80 0.14 93.20 > > While the fsync() is going on (or the kernel decides to start writing > out aggressively for some other reason) the amount of writes to the disk > is increased by two orders of magnitude. Unsurprisingly with disastrous > consequences for read() performance. We really want a way to pace the > writes issued to the disk more regularly. Hi Andreas, I've finally dug myself out from under the backlog from LSFMM far enough to start testing this on my local IO performance test rig. tl;dr: I can't reproduce this peaky behaviour on my test rig. I'm running in a 16p VM with 16GB RAM (in 4 nodes via fake-numa) and an unmodified benchmark on a current 3.15-linus tree. All storage (guest and host) is XFS based, guest VMs use virtio and direct IO to the backing storage. The host is using noop IO scheduling. The first IO setup I ran was a 100TB XFS filesystem in the guest. The backing file is a sparse file on an XFS filesystem on a pair of 240GB SSDs (Samsung 840 EVO) in RAID 0 via DM. The SSDs are exported as JBOD from a RAID controller which has 1GB of FBWC. The guest is capable of sustaining around 65,000 random read IOPS and 40,000 write IOPS through this filesystem depending on the tests being run. The iostat output looks like this: Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util vdc 0.00 0.00 1817.00 315.40 18.80 6.93 24.71 0.80 0.38 0.38 0.37 0.31 66.24 vdc 0.00 0.00 2094.20 323.20 21.82 7.10 24.50 0.81 0.33 0.34 0.27 0.28 68.48 vdc 0.00 0.00 1989.00 4500.20 20.50 56.64 24.34 24.82 3.82 0.43 5.32 0.12 80.16 vdc 0.00 0.00 2019.80 320.80 20.83 7.05 24.39 0.83 0.35 0.36 0.32 0.29 69.04 vdc 0.00 0.00 2206.60 323.20 22.57 7.10 24.02 0.87 0.34 0.34 0.33 0.28 71.92 vdc 0.00 0.00 2437.20 329.60 25.79 7.24 24.45 0.83 0.30 0.30 0.27 0.26 71.76 vdc 0.00 0.00 1224.40 11263.80 12.88 136.38 24.48 64.90 5.20 0.69 5.69 0.07 84.96 vdc 0.00 0.00 2074.60 319.40 21.03 7.01 23.99 0.84 0.35 0.36 0.30 0.29 68.96 vdc 0.00 0.00 1999.00 881.60 20.81 13.61 24.47 4.23 1.43 0.35 3.88 0.23 67.36 vdc 0.00 0.00 1193.40 2273.00 12.42 29.68 24.87 11.70 3.40 0.53 4.91 0.18 60.96 vdc 0.00 0.00 1724.00 314.40 17.80 6.91 24.83 0.74 0.36 0.36 0.35 0.30 61.52 vdc 0.00 0.00 2605.20 325.60 24.67 7.15 22.24 0.90 0.31 0.31 0.25 0.25 72.72 vdc 0.00 0.00 2340.60 324.40 23.85 7.12 23.80 0.83 0.31 0.31 0.29 0.26 68.96 vdc 0.00 0.00 2749.60 329.40 28.51 7.24 23.78 0.90 0.29 0.30 0.25 0.24 75.04 vdc 0.00 0.00 2619.60 324.60 27.72 7.13 24.24 0.88 0.30 0.30 0.28 0.24 71.76 vdc 0.00 0.00 1608.60 4532.40 17.17 56.40 24.54 25.78 4.20 0.50 5.51 0.12 75.36 vdc 0.00 0.00 2361.00 326.40 23.62 7.17 23.47 0.87 0.33 0.33 0.30 0.26 69.92 vdc 0.00 0.00 2460.00 326.00 25.89 7.16 24.30 0.88 0.32 0.32 0.26 0.26 72.72 vdc 0.00 0.00 2519.00 325.40 25.96 7.14 23.83 0.90 0.32 0.32 0.30 0.26 74.32 vdc 0.00 0.00 2709.80 326.20 28.91 7.17 24.34 0.94 0.31 0.30 0.36 0.25 75.52 vdc 0.00 0.00 2676.65 329.74 28.82 7.24 24.56 0.86 0.28 0.29 0.25 0.24 71.22 vdc 0.00 0.00 1788.40 4506.80 18.77 56.66 24.54 23.22 3.69 0.45 4.97 0.12 74.88 vdc 0.00 0.00 1850.40 319.60 19.76 7.02 25.28 0.80 0.37 0.37 0.34 0.30 64.80 Its obvious where the fsyncs are hitting, but they are making almost no impact on the read performance. The benchmark is simple not generating enough dirty data to overload the IO subsystem, and hence there's no latency spikes to speak of. Benchmark output across fsyncs also demonstrates that: .... read[12494]: avg: 0.4 msec; max: 9.8 msec read[12499]: avg: 0.4 msec; max: 9.8 msec read[12495]: avg: 0.4 msec; max: 9.8 msec commit[12491]: avg: 0.0 msec; max: 6.3 msec wal[12491]: avg: 0.0 msec; max: 0.0 msec wal[12491]: avg: 0.0 msec; max: 0.0 msec starting fsync() of files finished fsync() of files wal[12491]: avg: 0.0 msec; max: 0.7 msec commit[12491]: avg: 0.1 msec; max: 15.8 msec wal[12491]: avg: 0.0 msec; max: 0.1 msec wal[12491]: avg: 0.0 msec; max: 0.1 msec read[12492]: avg: 0.6 msec; max: 10.0 msec read[12496]: avg: 0.6 msec; max: 10.0 msec read[12507]: avg: 0.6 msec; max: 10.0 msec read[12505]: avg: 0.6 msec; max: 10.0 msec .... So, I though switching to spinning disks might show the problem. Same VM, this time using a 17TB linearly preallocated image file on an 18TB XFS filesystem on the host (virtio in the guest again) on a 12 disk RAID-0 (DM again) using a 12x2TB SAS drives exported as JBOD from a RAID controller with 512MB of BBWC. This is capable of about 2,000 random read IOPs, and write IOPS is dependent on the BBWC flushing behaviour (peaks at about 15,000, sustains 1500). Again, I don't see any bad behaviour: Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util vdd 0.00 0.00 128.00 154.20 1.46 6.02 54.24 1.02 3.62 7.22 0.62 3.17 89.44 vdd 0.00 0.00 133.80 154.40 1.43 6.03 53.04 1.08 3.76 7.42 0.59 3.05 87.84 vdd 0.00 0.00 132.40 751.60 1.36 12.75 32.70 7.26 8.21 7.90 8.27 1.06 93.76 vdd 0.00 0.00 138.60 288.60 1.47 7.62 43.57 2.69 6.30 7.24 5.85 2.15 91.68 vdd 0.00 0.00 141.20 157.00 1.51 6.12 52.36 1.12 3.75 7.26 0.59 3.11 92.64 vdd 0.00 0.00 136.80 154.00 1.52 6.02 53.09 1.04 3.57 6.92 0.60 3.10 90.08 vdd 0.00 0.00 135.00 154.20 1.46 6.02 52.99 1.04 3.60 7.00 0.62 3.03 87.68 vdd 0.00 0.00 147.20 155.80 1.50 6.09 51.26 1.19 3.92 7.45 0.59 3.06 92.72 vdd 0.00 0.00 139.60 154.00 1.50 6.02 52.43 1.03 3.53 6.79 0.58 2.97 87.28 vdd 0.00 0.00 126.80 913.80 1.27 14.81 31.63 7.87 7.56 9.02 7.36 0.89 92.16 vdd 0.00 0.00 148.20 156.80 1.51 6.11 51.15 1.17 3.83 7.25 0.59 3.03 92.32 vdd 0.00 0.00 142.60 155.80 1.49 6.09 52.03 1.12 3.76 7.23 0.60 3.05 90.88 vdd 0.00 0.00 146.00 156.60 1.49 6.12 51.48 1.16 3.84 7.32 0.60 3.03 91.76 vdd 0.00 0.00 148.80 157.20 1.56 6.14 51.54 1.10 3.58 6.78 0.54 2.92 89.36 vdd 0.00 0.00 149.80 156.80 1.52 6.12 51.05 1.14 3.73 7.10 0.52 2.96 90.88 vdd 0.00 0.00 127.20 910.60 1.29 14.82 31.80 8.17 7.87 8.62 7.77 0.88 91.44 vdd 0.00 0.00 150.80 157.80 1.58 6.15 51.27 1.09 3.53 6.67 0.52 2.91 89.92 vdd 0.00 0.00 153.00 156.60 1.53 6.12 50.56 1.17 3.80 7.07 0.60 2.95 91.20 vdd 0.00 0.00 161.20 157.00 1.54 6.13 49.36 1.27 4.00 7.36 0.55 2.97 94.48 vdd 0.00 0.00 152.10 157.09 1.55 6.14 50.91 1.15 3.73 7.06 0.50 2.98 92.22 vdd 0.00 0.00 154.80 157.80 1.50 6.16 50.18 1.19 3.82 7.14 0.57 2.88 90.00 vdd 0.00 0.00 122.20 922.80 1.24 14.93 31.70 8.30 7.94 8.81 7.83 0.90 93.68 vdd 0.00 0.00 139.80 153.60 1.46 5.98 51.98 1.09 3.74 7.14 0.65 3.02 88.48 vdd 0.00 0.00 134.60 153.60 1.44 6.00 52.87 1.09 3.78 7.38 0.63 3.15 90.88 vdd 0.00 0.00 141.00 153.40 1.55 5.99 52.49 1.05 3.57 6.79 0.60 3.04 89.44 vdd 0.00 0.00 130.20 154.20 1.44 6.02 53.76 1.04 3.65 7.28 0.59 3.14 89.44 vdd 0.00 0.00 136.60 154.20 1.30 6.02 51.52 1.19 4.07 8.02 0.58 3.03 88.24 vdd 0.00 0.00 119.00 872.20 1.29 14.29 32.20 7.98 8.05 8.59 7.98 0.93 92.64 vdd 0.00 0.00 139.00 154.00 1.58 6.00 53.01 1.06 3.62 6.95 0.61 3.12 91.52 vdd 0.00 0.00 146.00 155.40 1.52 6.07 51.60 1.12 3.71 7.05 0.57 3.01 90.72 vdd 0.00 0.00 145.20 156.60 1.55 6.12 52.01 1.13 3.73 7.10 0.60 3.07 92.80 vdd 0.00 0.00 147.80 156.80 1.51 6.12 51.36 1.13 3.72 7.07 0.56 2.97 90.56 vdd 0.00 0.00 142.80 157.00 1.48 6.12 51.98 1.11 3.70 7.20 0.52 3.02 90.56 vdd 0.00 0.00 125.20 922.00 1.36 14.92 31.85 8.34 7.96 8.47 7.90 0.88 92.56 Again, you can see exactly where the fsyncs are hitting, and again they are not massive spikes of write IO. And read latencies are showing: read[12711]: avg: 6.1 msec; max: 182.2 msec ... read[12711]: avg: 6.0 msec; max: 198.8 msec .... read[12711]: avg: 5.9 msec; max: 158.8 msec .... read[12711]: avg: 6.1 msec; max: 127.5 msec .... read[12711]: avg: 5.9 msec; max: 355.6 msec .... read[12711]: avg: 6.0 msec; max: 262.8 msec .... No significant outliers. The same for commits and wal: wal[12701]: avg: 0.0 msec; max: 0.2 msec wal[12701]: avg: 0.0 msec; max: 0.1 msec commit[12701]: avg: 0.0 msec; max: 1.0 msec wal[12701]: avg: 0.0 msec; max: 0.1 msec wal[12701]: avg: 0.0 msec; max: 0.1 msec starting fsync() of files finished fsync() of files wal[12701]: avg: 0.0 msec; max: 0.1 msec commit[12701]: avg: 0.1 msec; max: 129.2 msec wal[12701]: avg: 0.0 msec; max: 0.1 msec wal[12701]: avg: 0.0 msec; max: 0.1 msec commit[12701]: avg: 0.0 msec; max: 0.9 msec wal[12701]: avg: 0.0 msec; max: 0.2 msec wal[12701]: avg: 0.0 msec; max: 0.1 msec wal[12701]: avg: 0.0 msec; max: 0.1 msec commit[12701]: avg: 0.0 msec; max: 23.1 msec I'm not sure how you were generating the behaviour you reported, but the test program as it stands does not appear to be causing any problems at all on the sort of storage I'd expect large databases to be hosted on.... I've tried a few tweaks to the test program, but I haven't been able to make it misbehave. What do I need to tweak in the test program or my test VM to make the kernel misbehave like you reported? Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx` -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>