On 03/28/2018 10:13 PM, Paolo Valente wrote:
>
>
>> On 29 Mar 2018, at 05:22, Jens Axboe <axboe@xxxxxxxxx> wrote:
>>
>> On 3/28/18 9:13 PM, Zephaniah E. Loss-Cutler-Hull wrote:
>>> On 03/28/2018 06:02 PM, Jens Axboe wrote:
>>>> On 3/28/18 5:03 PM, Zephaniah E. Loss-Cutler-Hull wrote:
>>>>> I am not subscribed to any of the lists on the To list here, please CC
>>>>> me on any replies.
>>>>>
>>>>> I am encountering a fairly consistent crash anywhere from 15 minutes to
>>>>> 12 hours after boot with scsi_mod.use_blk_mq=1 dm_mod.use_blk_mq=1
>>>>> The crash looks like:
>>>>>
>>>
>>>>>
>>>>> Looking through the code, I'd guess that this is dying inside
>>>>> blkg_rwstat_add, which calls percpu_counter_add_batch, which is what RIP
>>>>> is pointing at.
>>>>
>>>> Leaving the whole thing here for Paolo - it's crashing off insertion of
>>>> a request coming out of SG_IO. Don't think we've seen this BFQ failure
>>>> case before.
>>>>
>>>> You can mitigate this by switching the scsi-mq devices to mq-deadline
>>>> instead.
>>>>
>>>
>>> I'm thinking that I should also be able to mitigate it by disabling
>>> CONFIG_DEBUG_BLK_CGROUP.
>>>
>>> That should remove that entire chunk of code.
>>>
>>> Of course, that won't help if this is actually a symptom of a bigger
>>> problem.
>>
>> Yes, it's not a given that it will fully mask the issue at hand. But
>> turning off BFQ has a much higher chance of working for you.
>>
>> This time actually CC'ing Paolo.
>>
>
> Hi Zephaniah,
> if you are actually interested in the benefits of BFQ (low latency,
> high responsiveness, fairness, ...) then it may be worth to try what
> you yourself suggest: disabling CONFIG_DEBUG_BLK_CGROUP. Also because
> this option activates the heavy computation of debug cgroup statistics,
> which probably you don't use.

I definitely am.

>
> In addition, the outcome of your attempt without
> CONFIG_DEBUG_BLK_CGROUP would give us useful bisection information:
> - if no failure occurs, then the issue is likely to be confined in
>   that debugging code (which, on the bright side, is likely to be of
>   occasional interest, for only a handful of developers)
> - if the issue still shows up, then we may have new hints on this odd
>   failure
>
> Finally, consider that this issue has been reported to disappear from
> 4.16 [1], and, as a plus, that the service quality of BFQ had a
> further boost exactly from 4.16.

I look forward to that either way then.

>
> Looking forward to your feedback, in case you try BFQ without
> CONFIG_DEBUG_BLK_CGROUP,

I'm running that now. Judging from the past, if it survives until
tomorrow evening then we're good, so I should hopefully know within the
next day.

Thank you,
Zephaniah E. Loss-Cutler-Hull.

> Paolo
>
> [1] https://www.spinics.net/lists/linux-block/msg21422.html
>
>>
>> --
>> Jens Axboe
>
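For anyone else hitting this and wanting the mq-deadline mitigation Jens mentions above: on blk-mq devices the scheduler is normally selected per device through /sys/block/<dev>/queue/scheduler, where the active one is shown in square brackets. The following is only a rough sketch of that step, not something taken from this thread. The helper names, the device name "sda", and the sysfs_root parameter are all made up for illustration, and writing to the real file needs root.

```python
from __future__ import annotations

from pathlib import Path


def available_schedulers(contents: str) -> list[str]:
    """List every scheduler named in a queue/scheduler file, active or not."""
    return [token.strip("[]") for token in contents.split()]


def parse_active_scheduler(contents: str) -> str:
    """Return the scheduler marked active, i.e. the one in [brackets]."""
    for token in contents.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active scheduler marked in {contents!r}")


def set_scheduler(device: str, name: str,
                  sysfs_root: Path = Path("/sys/block")) -> None:
    """Ask the kernel to use scheduler `name` for `device`.

    `sysfs_root` is a hypothetical parameter so the function can be
    pointed at a scratch directory instead of the real /sys/block.
    """
    path = sysfs_root / device / "queue" / "scheduler"
    offered = available_schedulers(path.read_text())
    if name not in offered:
        raise ValueError(f"{name!r} not offered for {device}: {offered}")
    # On real sysfs, writing the bare name switches the scheduler.
    path.write_text(name)
```

Assuming a device called sda, set_scheduler("sda", "mq-deadline") run as root would request the switch, and parse_active_scheduler() on the re-read file contents should then report mq-deadline. The change does not persist across reboots.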