On Wed, Aug 09, 2017 at 05:10:01PM +0000, Bart Van Assche wrote:
> On Wed, 2017-08-09 at 12:43 -0400, Laurence Oberman wrote:
> > Your latest patch on stock upstream without Ming's latest patches is
> > behaving for me.
> >
> > As already mentioned, the requeue -11 and clone failure messages are
> > gone and I am not actually seeing any soft lockups or hard lockups.
> >
> > When Ming gets back I will work with him on his patch set and the lockups.
> >
> > Running 10 parallel writes which easily trips into soft lockups on
> > Ming's kernel (even with your patch) has been stable here on 4.13-RC3
> > with your patch.
> >
> > I will leave it running for a while now but the patch is good.
> >
> > If it survives 4 hours I will add a Tested-by to your latest patch.
>
> Hello Laurence,
>
> I'm working on an additional patch that should reduce unnecessary requeuing
> even further. I will let you know when it's ready.
>
> Additionally, please trim e-mails when replying such that e-mails do not get
> too long.

Soft lockups can still be observed easily with patch d4acf3650c7c ("block:
Make blk_mq_delay_kick_requeue_list() rerun the queue at a quiet time"),
but no hard lockup.

With the 'blk-mq-sched: improve SCSI-MQ performance' patchset applied, a
hard lockup can be observed, preceded by failure logs like these:

[ 269.277653] device-mapper: multipath: blk_get_request() returned -11 - requeuing
[ 269.321244] device-mapper: multipath: blk_get_request() returned -11 - requeuing
...
[ 273.421688] scsi host2: SRP abort called
[ 273.444577] scsi host2: Sending SRP abort for tag 0x6007e
[ 273.673871] scsi host2: Null scmnd for RSP w/tag 0x0000000006007e received on ch 6 / QP 0x30
...
[ 274.372110] device-mapper: multipath: blk_get_request() returned -11 - requeuing
[ 278.658671] scsi host2: SRP abort called
[ 278.690630] scsi host2: SRP abort called
[ 278.717634] scsi host2: SRP abort called
[ 278.745629] scsi host2: SRP abort called
[ 279.083227] multipath_clone_and_map: 1092 callbacks suppressed
....
[ 296.210503] scsi host2: SRP reset_device called
....
[ 303.784287] NMI watchdog: Watchdog detected hard LOCKUP on cpu 10

The tricky thing is that the hard lockup and the soft lockup share the
same stack trace.

Another question: I don't understand why the request is allocated with
GFP_ATOMIC in multipath_clone_and_map(); it looks like that shouldn't be
necessary.

--
Ming