Re: [bug report] shared tags causes IO hang and performance drop

Bart Van Assche <bvanassche@xxxxxxx> · Mon, 19 Apr 2021 20:22:52 -0700

On 4/19/21 8:06 PM, Douglas Gilbert wrote:
> I have always suspected that under extreme pressure the block layer (or
> SCSI mid-level) does strange things, like an IO hang; attempts to prove
> that usually lead back to my own code :-). But I have one example recently
> where upwards of 10 commands had been submitted (blk_execute_rq_nowait())
> and the following one stalled (all on the same thread). Seconds later
> those 10 commands reported DID_TIME_OUT, the stalled thread awoke, and
> my dd variant went to its conclusion (reporting 10 errors). Following
> copies showed no ill effects.
> 
> My weapons of choice are sg_dd, actually sgh_dd and sg_mrq_dd. Those last
> two monitor for stalls during the copy. Each submitted READ and WRITE
> command gets its pack_id from an incrementing atomic, and a management
> thread in those copies checks every 300 milliseconds that the atomic
> value is greater than it was at the previous check. If not, it dumps the
> state of the sg driver. The stalled request was in busy state with a timeout of 1
> nanosecond which indicated that blk_execute_rq_nowait() had not been
> called. So the chief suspect would be blk_get_request() followed by
> the bio setup calls IMO.
> 
> So it certainly looked like an IO hang, not a locking, resource, or
> corruption issue IMO. That was with a branch off MKP's
> 5.13/scsi-staging branch taken a few weeks back. So basically
> lk 5.12.0-rc1.

Hi Doug,

If it would be possible to develop a script that reproduces this hang, and
if that script can be shared, I will help with root-causing and fixing this
hang.
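
For anyone putting such a reproducer together, the stall check described in
the quoted text (an incrementing counter that a management thread samples
every 300 milliseconds) could be sketched roughly as below. This is only an
illustration in Python, not the sgh_dd or sg_mrq_dd code: the class name
StallMonitor, its methods, and the on_stall hook are all hypothetical, and
what the hook does (for example dumping the sg driver state) is left to the
caller.

```python
import threading


class StallMonitor:
    """Flags a stall when a submission counter stops advancing (hypothetical sketch)."""

    def __init__(self, on_stall, interval_s=0.3):
        self._lock = threading.Lock()
        self._next_id = 0            # last pack_id handed out
        self._last_seen = 0          # counter value seen at the previous check
        self._on_stall = on_stall    # hypothetical hook, e.g. dump driver state
        self._interval_s = interval_s
        self._stop = threading.Event()

    def next_pack_id(self):
        """Called by a worker for every READ or WRITE command it submits."""
        with self._lock:
            self._next_id += 1
            return self._next_id

    def check(self):
        """Return True if the counter advanced since the previous check."""
        with self._lock:
            current = self._next_id
        progressed = current > self._last_seen
        self._last_seen = current
        if not progressed:
            # No new command was submitted during the last interval.
            self._on_stall(current)
        return progressed

    def run(self):
        """Management-thread loop: check once per interval until stop() is called."""
        while not self._stop.wait(self._interval_s):
            self.check()

    def stop(self):
        self._stop.set()
```

A copy program would start run() in its own thread with
threading.Thread(target=monitor.run, daemon=True).start(), call
next_pack_id() from each worker before submitting a command, and call stop()
when the copy finishes. Note that this sketch only notices that submissions
stopped; it does not say where the submitting thread is blocked.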

Thanks,

Bart.