Mel Gorman <mgorman@xxxxxxx> writes: > (Adding Jeff Moyer to the cc as I'm told he is interested in the blktrace) Thanks. I've got a few comments and corrections for you below. > TLDR: Flusher writes pages very quickly after processes dirty a buffer. Reads > starve flusher writes. [snip] > 3. The blktrace indicates that reads can starve writes from flusher > > While there are people that can look at a blktrace and find problems > like they are rain man, I'm more like an ADHD squirrel when looking at > a trace. I wrote a script to look for what unrelated requests completed > while an request got stalled for over a second. It seemed like something > that a tool shoudl already exist for but I didn't find one unless btt > can give the information somehow. Care to share that script? [snip] > I recognise that the output will have a WTF reaction but the key > observations to me are > > a) a single write request from flusher took over a second to complete > b) at the time it was queued, it was mostly other writes that were in > the queue at the same time > c) The write request and the parallel writes were all asynchronous write > requests > D) at the time the request completed, there were a LARGE number of > other requested queued and completed at the same time. > > Of the requests queued and completed in the meantime the breakdown was > > 22 RM > 31 RA > 82 W > 445 R > > If I'm reading this correctly, it is saying that 22 reads were merged (RM), > 31 reads were remapped to another device (RA) which is probably reads from > the dm-crypt partition, 82 were writes (W) which is not far off the number > of writes that were in the queue and 445 were other reads. The delay was > dominated by reads that were queued after the write request and completed > before it. RM == Read Meta RA == Read Ahead (remapping, by the way, does not happen across devices, just into partitions) W and R you understood correctly. > That's saying that the 27128th request in the trace took over 7 seconds > to complete and was an asynchronous write from flusher. The contents of > the queue are displayed at that time and the breakdown of requests is > > 23 WS [JEM: write sync] > 86 RM [JEM: Read Meta] > 124 RA [JEM: Read Ahead] > 442 W > 1931 R > > 7 seconds later when it was completed the breakdown of completed > requests was > > 25 WS > 114 RM > 155 RA > 408 W > 2457 R > > In combination, that confirms for me that asynchronous writes from flush > are being starved by reads. When a process requires a buffer that is locked > by that asynchronous write from flusher, it stalls. > >> The thing is, we do want to make ext4 work well with cfq, and >> prioritizing non-readahead read requests ahead of data writeback does >> make sense. The issue is with is that metadata writes going through >> the block device could in some cases effectively cause a priority >> inversion when what had previously been an asynchronous writeback >> starts blocking a foreground, user-visible process. >> >> At least, that's the theory; > > I *think* the data more or less confirms the theory but it'd be nice if > someone else double checked in case I'm seeing what I want to see > instead of what is actually there. Looks sane. You can also see a lot of "preempt"s in the blkparse output, which indicates exactly what you're saying. Any sync request gets priority over the async requests. I'll also note that even though your I/O is going all over the place (D2C is pretty bad, 14ms), most of the time is spent waiting for a struct request allocation or between Queue and Merge: ==================== All Devices ==================== ALL MIN AVG MAX N --------------- ------------- ------------- ------------- ----------- Q2Q 0.000000001 0.000992259 8.898375882 2300861 Q2G 0.000000843 10.193261239 2064.079501935 1016463 <==== G2I 0.000000461 0.000044702 3.237065090 1015803 Q2M 0.000000101 8.203147238 2064.079367557 1311662 I2D 0.000002012 1.476824812 2064.089774419 1014890 M2D 0.000003283 6.994306138 283.573348664 1284872 D2C 0.000061889 0.014438316 0.857811758 2291996 Q2C 0.000072284 13.363007244 2064.092228625 2292191 ==================== Device Overhead ==================== DEV | Q2G G2I Q2M I2D D2C ---------- | --------- --------- --------- --------- --------- ( 8, 0) | 33.8259% 0.0001% 35.1275% 4.8932% 0.1080% ---------- | --------- --------- --------- --------- --------- Overall | 33.8259% 0.0001% 35.1275% 4.8932% 0.1080% I'm not sure I believe that max value. 2064 seconds seems a bit high. Also, Q2M should not be anywhere near that big, so more investigation is required there. A quick look over the data doesn't show any such delays (making me question the tools), but I'll write some code tomorrow to verify the btt output. Jan, if I were to come up with a way of promoting a particular async queue to the front of the line, where would I put such a call in the ext4/jbd2 code to be effective? Mel, can you reproduce this at will? Do you have a reproducer that I could run so I'm not constantly bugging you? Cheers, Jeff -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html