Re: [PATCH] scsi: fix race between simultaneous decrements of ->host_failed

Wei Fang <fangwei1@xxxxxxxxxx> · Mon, 30 May 2016 15:27:27 +0800

Hi James, Christoph,

On 2016/5/29 23:41, James Bottomley wrote:
> On Sat, 2016-05-28 at 23:54 -0700, Christoph Hellwig wrote:
>> On Sat, May 28, 2016 at 11:51:11AM +0800, Wei Fang wrote:
>>> async_sas_ata_eh(), which will call scsi_eh_finish_cmd() in some 
>>> case, would be performed simultaneously in 
>>> sas_ata_strategy_handler(). In this case, ->host_failed may be 
>>> decreased simultaneously in scsi_eh_finish_cmd() on different CPUs,
>>> and become abnormal.
>>>
>>> It will lead to permanently inequal between ->host_failed and
>>>  ->host_busy. Then SCSI error handler thread won't become running,
>>> SCSI errors after that won't be handled forever.
>>>
>>> Use atomic type for ->host_failed to fix this race.
>>
>> Looks fine,
> 
> Actually, it doesn't look fine at all.  The same mechanism that's
> supposed to protect the host_failed decrement is also supposed to
> protect the list_move_tail().  If there's a problem with the former
> then we're also in danger of corrupting the list.

Scmd is moved to local eh_done_q list here, and I checked that the
list won't be touched concurrently.

> Can we go back to the theory of what the problem is, since it's not
> spelled out very clearly in the change log.  Our usual reason for not
> requiring locking in eh routines is that the eh is single threaded on
> the eh thread per host, so any host manipulations can't have
> concurrency problems.  In this case, the sas_ata routines are trying to
> be clever and use asynchronous workqueues for the port error handler
> and you theorise that these can execute concurrently on two CPUs, thus
> causing the problem?

Yes, it's the case. The works of the port error handler are added to
system_unbound_wq, and will be performed concurrently on different CPUs.
We have already met that problem on our machine.

Thanks,
Wei

> James
> 
> 
> 
> .
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html