On 4/19/21 8:06 PM, Douglas Gilbert wrote:
> I have always suspected under extreme pressure the block layer (or scsi
> mid-level) does strange things, like an IO hang, attempts to prove that
> usually lead back to my own code :-). But I have one example recently
> where upwards of 10 commands had been submitted (blk_execute_rq_nowait())
> and the following one stalled (all on the same thread). Seconds later
> those 10 commands reported DID_TIME_OUT, the stalled thread awoke, and
> my dd variant went to its conclusion (reporting 10 errors). Following
> copies showed no ill effects.
>
> My weapons of choice are sg_dd, actually sgh_dd and sg_mrq_dd. Those last
> two monitor for stalls during the copy. Each submitted READ and WRITE
> command gets its pack_id from an incrementing atomic and a management
> thread in those copies checks every 300 milliseconds that that atomic
> value is greater than the previous check. If not, dump the state of the
> sg driver. The stalled request was in busy state with a timeout of 1
> nanosecond which indicated that blk_execute_rq_nowait() had not been
> called. So the chief suspect would be blk_get_request() followed by
> the bio setup calls IMO.
>
> So it certainly looked like an IO hang, not a locking, resource nor
> corruption issue IMO. That was with a branch off MKP's
> 5.13/scsi-staging branch taken a few weeks back. So basically
> lk 5.12.0-rc1 .

Hi Doug,

If it would be possible to develop a script that reproduces this hang and
if that script can be shared I will help with root-causing and fixing this
hang.

Thanks,

Bart.
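
For anyone following the thread, the stall check described in the quoted
message (each command takes its pack_id from an incrementing counter, and a
management thread verifies every 300 milliseconds that the counter has
advanced) could be sketched roughly as below. This is only an illustration in
Python, not the sgh_dd or sg_mrq_dd code: the StallMonitor class, its method
names and the on_stall callback are all hypothetical, and a lock stands in for
the atomic.

```python
import threading


class StallMonitor:
    """Hypothetical watchdog: reports a stall when no new pack_id was issued
    between two consecutive checks. Not taken from sgh_dd or sg_mrq_dd."""

    def __init__(self, on_stall):
        self._lock = threading.Lock()  # stands in for the atomic counter
        self._last_issued = 0          # most recent pack_id handed out
        self._last_seen = 0            # value observed at the previous check
        self._on_stall = on_stall      # e.g. a function that dumps driver state

    def next_pack_id(self):
        """Called by worker threads for each submitted READ or WRITE."""
        with self._lock:
            self._last_issued += 1
            return self._last_issued

    def check(self):
        """Return True (and invoke on_stall) if the counter did not advance
        since the previous check. A check made before any command has been
        issued also counts as a stall under this rule."""
        with self._lock:
            current = self._last_issued
        stalled = current <= self._last_seen
        if stalled:
            self._on_stall(current)
        self._last_seen = current
        return stalled

    def run(self, stop_event, interval=0.3):
        """Management loop: check every `interval` seconds until stopped."""
        while not stop_event.wait(interval):
            self.check()
```

In a copy program the management thread would be started with something like
threading.Thread(target=monitor.run, args=(stop_event,)), and on_stall would
be whatever collects the diagnostic state; both are assumptions here.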