> On 14 Feb 2018, at 16:44, Jens Axboe <axboe@xxxxxxxxx> wrote:
>
> On 2/14/18 8:39 AM, Paolo Valente wrote:
>>
>>
>>> On 14 Feb 2018, at 16:19, Jens Axboe <axboe@xxxxxxxxx> wrote:
>>>
>>> On 2/14/18 1:56 AM, Paolo Valente wrote:
>>>>
>>>>
>>>>> On 14 Feb 2018, at 08:15, Mike Galbraith <efault@xxxxxx> wrote:
>>>>>
>>>>> On Wed, 2018-02-14 at 08:04 +0100, Mike Galbraith wrote:
>>>>>>
>>>>>> And _of course_, roughly two minutes later, IO stalled.
>>>>>
>>>>> P.S.
>>>>>
>>>>> crash> bt 19117
>>>>> PID: 19117  TASK: ffff8803d2dcd280  CPU: 7  COMMAND: "kworker/7:2"
>>>>>  #0 [ffff8803f7207bb8] __schedule at ffffffff81595e18
>>>>>  #1 [ffff8803f7207c40] schedule at ffffffff81596422
>>>>>  #2 [ffff8803f7207c50] io_schedule at ffffffff8108a832
>>>>>  #3 [ffff8803f7207c60] blk_mq_get_tag at ffffffff8129cd1e
>>>>>  #4 [ffff8803f7207cc0] blk_mq_get_request at ffffffff812987cc
>>>>>  #5 [ffff8803f7207d00] blk_mq_alloc_request at ffffffff81298a9a
>>>>>  #6 [ffff8803f7207d38] blk_get_request_flags at ffffffff8128e674
>>>>>  #7 [ffff8803f7207d60] scsi_execute at ffffffffa0025b58 [scsi_mod]
>>>>>  #8 [ffff8803f7207d98] scsi_test_unit_ready at ffffffffa002611c [scsi_mod]
>>>>>  #9 [ffff8803f7207df8] sd_check_events at ffffffffa0212747 [sd_mod]
>>>>> #10 [ffff8803f7207e20] disk_check_events at ffffffff812a0f85
>>>>> #11 [ffff8803f7207e78] process_one_work at ffffffff81079867
>>>>> #12 [ffff8803f7207eb8] worker_thread at ffffffff8107a127
>>>>> #13 [ffff8803f7207f10] kthread at ffffffff8107ef48
>>>>> #14 [ffff8803f7207f50] ret_from_fork at ffffffff816001a5
>>>>> crash>
>>>>
>>>> This evidently has to do with tag pressure. I've looked for a way to
>>>> easily reduce the number of tags online, so as to put your system in
>>>> the bad spot deterministically. But to no avail. Does anyone know a
>>>> way to do it?
>>>
>>> The key here might be that it's not a regular file system request,
>>> which I'm sure bfq probably handles differently. So it's possible
>>> that you are slowly leaking those tags, and we end up in this
>>> miserable situation after a while.
>>>
>>
>> Could you elaborate more on this? My mental model of bfq hooks in
>> this respect is that they do only side operations, which AFAIK cannot
>> block the putting of a tag. IOW, tag getting and putting is done
>> outside bfq, regardless of what bfq does with I/O requests. Is there
>> a flaw in this?
>>
>> In any case, is there any flag or the like in requests passed to
>> bfq that I could make bfq check, to raise some warning?
>
> I'm completely guessing, and I don't know if this trace is always what
> Mike sees when things hang. It just seems suspect that we end up with a
> "special" request here, since I'm sure the regular file system requests
> outnumber them greatly. That raises my suspicion that the type is
> related.
>
> But no, there should be no special handling on the freeing side, my
> guess was that BFQ ends them a bit differently.
>

Hi Jens,
whatever the exact cause of the leak is, a leak does in turn sound
like a reasonable cause for these hangs. Also, if a leak is the cause,
it seems to me that reducing the number of tags to just 1 might help
trigger the problem quickly and reliably on Mike's machine. If you
agree, Jens, what would be the quickest/easiest way to reduce tags?

Thanks,
Paolo

> Mike, when you see a hang like that, would it be possible for you to
> dump the contents of /sys/kernel/debug/block/<dev in question>/* for us
> to inspect? That will tell us a lot about the internal state at that
> time.
>
> --
> Jens Axboe
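
For the debugfs dump requested above, a minimal Python sketch might look like the following. It is only an illustration, not something posted in this thread: it assumes debugfs is mounted at /sys/kernel/debug and that the script is run as root, and the helper names dump_tree and format_dump are made up for the example. Files that cannot be read are noted in the output rather than aborting the dump.

```python
import os
import sys


def dump_tree(root):
    """Return {relative path: text} for every regular file under root.

    Unreadable files are recorded with the error text instead of
    stopping the walk, so one bad entry does not lose the whole dump.
    """
    dump = {}
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            try:
                with open(path, "r", errors="replace") as f:
                    dump[rel] = f.read()
            except OSError as exc:
                dump[rel] = "<unreadable: %s>\n" % exc
    return dump


def format_dump(dump):
    """Render the dump as text, one '==> path <==' header per file."""
    parts = []
    for rel in sorted(dump):
        parts.append("==> %s <==\n%s" % (rel, dump[rel]))
    return "\n".join(parts)


# Hypothetical command-line use: the single argument is the block device
# name; /sys/kernel/debug is assumed to be the debugfs mount point.
if __name__ == "__main__" and len(sys.argv) == 2:
    root = os.path.join("/sys/kernel/debug/block", sys.argv[1])
    sys.stdout.write(format_dump(dump_tree(root)))
```

Run with the device name as the only argument and redirect the output to a file on a disk that is not the stalled one, so that saving the dump does not itself need a tag on the hung queue.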