Re: [PATCHv2]sd: Don't treat succeeded SYNC as error

Jinpu Wang <jinpu.wang@xxxxxxxxxxxxxxxx> · Mon, 9 May 2016 18:41:31 +0200

On Wed, May 4, 2016 at 7:02 PM, Jinpu Wang <jinpu.wang@xxxxxxxxxxxxxxxx> wrote:
> On Mon, May 2, 2016 at 3:44 PM, James Bottomley <jejb@xxxxxxxxxxxxxxxxxx> wrote:
>> On Mon, 2016-05-02 at 12:05 +0200, Hannes Reinecke wrote:
>>> On 04/29/2016 02:49 PM, Jinpu Wang wrote:
>>> > Hi, all
>>> >
>>> > We hit IO error on fsync, it turns out was because sd treat
>>> > succeeded
>>> > SYNC as error. From what I checked in SBC spec there is no
>>> > indication
>>> > we should fail IO in this case, so we create this patch.
>>> >
>>> >
>>> > Best Regards,
>>> >
>>> > Jack Wang
>>> >
>>> > v2:
>>> > No change on patch itself, only resend in body as suggested by
>>> > Bart,
>>> > still keep the attachment in case mail client break the format.
>>> >
>>> > From 5d1f72d9643ce61cd9f3d312377378c43f171d0c Mon Sep 17 00:00:00
>>> > 2001
>>> > From: Jack Wang <jinpu.wang@xxxxxxxxxxxxxxxx>
>>> > Date: Mon, 25 Apr 2016 12:05:22 +0200
>>> > Subject: [PATCH] sd: Don't treat succeeded SYNC as error
>>> >
>>> > We hit IO error in our production on multipath devices during
>>> > resize
>>> > device on target side, the problem turns out sd driver passes up as
>>> > IO
>>> > error when sense data is UNIT_ATTENTION and ASC && ASCQ indicate
>>> > Capacity data has changed, even storage side sync the data
>>> > properly.
>>> >
>>> > In order to fix this check in sd_done, report success if condition
>>> > matches.
>>> >
>>> > Sebastian Parschauer report/analyze the bug here:
>>> > https://sourceforge.net/p/scst/mailman/message/34953416/
>>> >
>>> > Signed-off-by: Sebastian Parschauer <s.parschauer@xxxxxx>
>>> > Signed-off-by: Jack Wang <jinpu.wang@xxxxxxxxxxxxxxxx>
>>> > ---
>>> >  drivers/scsi/sd.c | 13 +++++++++++++
>>> >  1 file changed, 13 insertions(+)
>>> >
>>> Well.
>>> Is there anything which guarantees us that 'capacity data has
>>> changed' will be the only sense code which we'll be seeing as a
>>> response to SYNCHRONIZE CACHE?
>>> I sincerely doubt so.
>>> So why don't you fall back to the default action (ie retry the
>>> command) whenever you hit an UNIT ATTENTION?
>>> This way we would cove any resulting sense code, _and_ would get rid
>>> of the rather ugly special case here.
>>
>> Actually, why are we getting here at all?  should we be eating this
>> unit attention once we've reported it in scsi_check_sense()?
>>
>> I also don't quite understand why the normal retry mechanism in
>> scsi_io_completion() (called after drv->done()) isn't handling this.
>>  We set retries on a flush command and we give sd_sync_cache three
>> goes.  Any one of those should also cause the CC/UA to be ignored.
>>
>> James
>>
>>
>
> Sorry for delay, I agree safer to retry this command.
> I checked the code path, in scsi_io_completion, we call
> __scsi_error_from_host_byte for FLUSH request,
> and we set error to EIO by default, somehow the code report error
> directly to user space without retry.
> [  647.206270] sd 1:0:0:0: [sdb] tag#0 Send: scmd 0xffff8800b6558480
> [  647.206422] sd 1:0:0:0: [sdb] tag#0 CDB: Synchronize Cache(10) 35
> 00 00 00 00 00 00 00 00 00
> [  647.209748] sd 1:0:0:0: Capacity data has changed
> [  647.209896] sd 1:0:0:0: [sdb] tag#0 Done: SUCCESS Result:
> hostbyte=DID_OK driverbyte=DRIVER_OK
> [  647.210157] sd 1:0:0:0: [sdb] tag#0 CDB: Synchronize Cache(10) 35
> 00 00 00 00 00 00 00 00 00
> [  647.210419] sd 1:0:0:0: [sdb] tag#0 Sense Key : Unit Attention [current]
> [  647.210567] sd 1:0:0:0: [sdb] tag#0 Add. Sense: Capacity data has changed
> [  647.210741] sd 1:0:0:0: [sdb] tag#0 scsi host busy 1 failed 0
> [  647.210888] sd 1:0:0:0: Notifying upper driver of completion (result 8000002)
> [  647.211035] sd 1:0:0:0: [sdb] tag#0 sd_done: completed 0 of 0 bytes
> [  647.211182] sd 1:0:0:0: [sdb] tag#0 0 sectors total, 0 bytes done, error -5
> [  647.211330] blk_update_request: I/O error, dev sdb, sector 0
>
> Will figure out why retry doesn't work.
>
> Thanks James and Hannes for all your input.
>
> Regards,
> Jack

Hi James, Hannes and all,

I find out it's code below which report error directly back to user
space without any retry.
 913         /*
 914          * If we finished all bytes in the request we are done now.
 915          */
 916         if (!scsi_end_request(req, error, good_bytes, 0))
 917                 return;

But not sure, what's the best way to fix the behavior to let it retry,
 maybe add condition with sense key && asc && ascq direct go to
requeue before line 913?

Thanks
Jack
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html