On 05/10/2016 05:08 PM, James Bottomley wrote: > On Tue, 2016-05-10 at 16:48 +0200, Jinpu Wang wrote: >> On Mon, May 9, 2016 at 6:41 PM, Jinpu Wang < >> jinpu.wang@xxxxxxxxxxxxxxxx> wrote: >>> On Wed, May 4, 2016 at 7:02 PM, Jinpu Wang < >>> jinpu.wang@xxxxxxxxxxxxxxxx> wrote: >>>> On Mon, May 2, 2016 at 3:44 PM, James Bottomley < >>>> jejb@xxxxxxxxxxxxxxxxxx> wrote: >>>>> On Mon, 2016-05-02 at 12:05 +0200, Hannes Reinecke wrote: >>>>>> On 04/29/2016 02:49 PM, Jinpu Wang wrote: >>>>>>> Hi, all >>>>>>> >>>>>>> We hit IO error on fsync, it turns out was because sd treat >>>>>>> succeeded >>>>>>> SYNC as error. From what I checked in SBC spec there is no >>>>>>> indication >>>>>>> we should fail IO in this case, so we create this patch. >>>>>>> >>>>>>> >>>>>>> Best Regards, >>>>>>> >>>>>>> Jack Wang >>>>>>> >>>>>>> v2: >>>>>>> No change on patch itself, only resend in body as suggested >>>>>>> by >>>>>>> Bart, >>>>>>> still keep the attachment in case mail client break the >>>>>>> format. >>>>>>> >>>>>>> From 5d1f72d9643ce61cd9f3d312377378c43f171d0c Mon Sep 17 >>>>>>> 00:00:00 >>>>>>> 2001 >>>>>>> From: Jack Wang <jinpu.wang@xxxxxxxxxxxxxxxx> >>>>>>> Date: Mon, 25 Apr 2016 12:05:22 +0200 >>>>>>> Subject: [PATCH] sd: Don't treat succeeded SYNC as error >>>>>>> >>>>>>> We hit IO error in our production on multipath devices >>>>>>> during >>>>>>> resize >>>>>>> device on target side, the problem turns out sd driver >>>>>>> passes up as >>>>>>> IO >>>>>>> error when sense data is UNIT_ATTENTION and ASC && ASCQ >>>>>>> indicate >>>>>>> Capacity data has changed, even storage side sync the data >>>>>>> properly. >>>>>>> >>>>>>> In order to fix this check in sd_done, report success if >>>>>>> condition >>>>>>> matches. >>>>>>> >>>>>>> Sebastian Parschauer report/analyze the bug here: >>>>>>> https://sourceforge.net/p/scst/mailman/message/34953416/ >>>>>>> >>>>>>> Signed-off-by: Sebastian Parschauer <s.parschauer@xxxxxx> >>>>>>> Signed-off-by: Jack Wang <jinpu.wang@xxxxxxxxxxxxxxxx> >>>>>>> --- >>>>>>> drivers/scsi/sd.c | 13 +++++++++++++ >>>>>>> 1 file changed, 13 insertions(+) >>>>>>> >>>>>> Well. >>>>>> Is there anything which guarantees us that 'capacity data has >>>>>> changed' will be the only sense code which we'll be seeing as >>>>>> a >>>>>> response to SYNCHRONIZE CACHE? >>>>>> I sincerely doubt so. >>>>>> So why don't you fall back to the default action (ie retry >>>>>> the >>>>>> command) whenever you hit an UNIT ATTENTION? >>>>>> This way we would cove any resulting sense code, _and_ would >>>>>> get rid >>>>>> of the rather ugly special case here. >>>>> >>>>> Actually, why are we getting here at all? should we be eating >>>>> this >>>>> unit attention once we've reported it in scsi_check_sense()? >>>>> >>>>> I also don't quite understand why the normal retry mechanism in >>>>> scsi_io_completion() (called after drv->done()) isn't handling >>>>> this. >>>>> We set retries on a flush command and we give sd_sync_cache >>>>> three >>>>> goes. Any one of those should also cause the CC/UA to be >>>>> ignored. >>>>> >>>>> James >>>>> >>>>> >>>> >>>> Sorry for delay, I agree safer to retry this command. >>>> I checked the code path, in scsi_io_completion, we call >>>> __scsi_error_from_host_byte for FLUSH request, >>>> and we set error to EIO by default, somehow the code report error >>>> directly to user space without retry. >>>> [ 647.206270] sd 1:0:0:0: [sdb] tag#0 Send: scmd >>>> 0xffff8800b6558480 >>>> [ 647.206422] sd 1:0:0:0: [sdb] tag#0 CDB: Synchronize Cache(10) >>>> 35 >>>> 00 00 00 00 00 00 00 00 00 >>>> [ 647.209748] sd 1:0:0:0: Capacity data has changed >>>> [ 647.209896] sd 1:0:0:0: [sdb] tag#0 Done: SUCCESS Result: >>>> hostbyte=DID_OK driverbyte=DRIVER_OK >>>> [ 647.210157] sd 1:0:0:0: [sdb] tag#0 CDB: Synchronize Cache(10) >>>> 35 >>>> 00 00 00 00 00 00 00 00 00 >>>> [ 647.210419] sd 1:0:0:0: [sdb] tag#0 Sense Key : Unit Attention >>>> [current] >>>> [ 647.210567] sd 1:0:0:0: [sdb] tag#0 Add. Sense: Capacity data >>>> has changed >>>> [ 647.210741] sd 1:0:0:0: [sdb] tag#0 scsi host busy 1 failed 0 >>>> [ 647.210888] sd 1:0:0:0: Notifying upper driver of completion >>>> (result 8000002) >>>> [ 647.211035] sd 1:0:0:0: [sdb] tag#0 sd_done: completed 0 of 0 >>>> bytes >>>> [ 647.211182] sd 1:0:0:0: [sdb] tag#0 0 sectors total, 0 bytes >>>> done, error -5 >>>> [ 647.211330] blk_update_request: I/O error, dev sdb, sector 0 >>>> >>>> Will figure out why retry doesn't work. >>>> >>>> Thanks James and Hannes for all your input. >>>> >>>> Regards, >>>> Jack >>> >>> Hi James, Hannes and all, >>> >>> I find out it's code below which report error directly back to user >>> space without any retry. >>> 913 /* >>> 914 * If we finished all bytes in the request we are done >>> now. >>> 915 */ >>> 916 if (!scsi_end_request(req, error, good_bytes, 0)) >>> 917 return; >>> >>> But not sure, what's the best way to fix the behavior to let it >>> retry, >>> maybe add condition with sense key && asc && ascq direct go to >>> requeue before line 913? >>> >>> Thanks >>> Jack >> >> >> Hi James , Hannes and all, >> >> I created a patch below, I did basic test on my test machines. Please >> share your comments! >> >> Thanks, >> Jack >> From 72ab860811e14e37db81fb409abf0fa7e7fe32cb Mon Sep 17 00:00:00 >> 2001 >> From: Jack Wang <jinpu.wang@xxxxxxxxxxxxxxxx> >> Date: Tue, 10 May 2016 10:10:59 +0200 >> Subject: [PATCH] scsi: requeue command on capacity data has changed >> >> We hit IO error in our production on multipath devices during resize >> device on target side, the problem turns out scsi driver passes up as >> IO >> error when sense data is UNIT_ATTENTION and ASC && ASCQ indicate >> Capacity data has changed, even storage side sync the data properly. >> >> To fix this, when condition met, we simply requeue the command. >> >> Reported-by: Sebastian Parschauer <s.parschauer@xxxxxx> >> Signed-off-by: Jack Wang <jinpu.wang@xxxxxxxxxxxxxxxx> >> --- >> drivers/scsi/scsi_lib.c | 5 +++++ >> 1 file changed, 5 insertions(+) >> >> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c >> index 8106515..b00310f 100644 >> --- a/drivers/scsi/scsi_lib.c >> +++ b/drivers/scsi/scsi_lib.c >> @@ -910,6 +910,11 @@ void scsi_io_completion(struct scsi_cmnd *cmd, >> unsigned int good_bytes) >> error = 0; >> } >> >> + if (sense_valid && (sshdr.sense_key == UNIT_ATTENTION)) { >> + if ((sshdr.asc == 0x2a && sshdr.ascq == 0x09)) >> + goto requeue; >> + } >> + > > Actually, I think this is symptomatic of a much bigger problem. Now > that the FS can send zero length non BLOCK_PC request, we're not > treating failure correctly. blk_update_request() will always finish > them becuase they have no bytes outstanding (not having any in the > first case). So I think we need a special exception for all zero > length commands which complete with a failure to allow them to retry > (if required). > > James > > --- > > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c > index 8106515..5a97866 100644 > --- a/drivers/scsi/scsi_lib.c > +++ b/drivers/scsi/scsi_lib.c > @@ -911,9 +911,12 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes) > } > > /* > - * If we finished all bytes in the request we are done now. > + * special case: failed zero length commands always need to > + * drop down into the retry code. Otherwise, if we finished > + * all bytes in the request we are done now. > */ > - if (!scsi_end_request(req, error, good_bytes, 0)) > + if (!(good_bytes == 0 && blk_rq_bytes(req) == 0 && result != 0) && > + !scsi_end_request(req, error, good_bytes, 0)) > return; > > /* > My, this is ugly. Plus most of this _should_ have been handled by these lines just above: } else if (blk_rq_bytes(req) == 0 && result && !sense_deferred) { /* * Certain non BLOCK_PC requests are commands that don't * actually transfer anything (FLUSH), so cannot use * good_bytes != blk_rq_bytes(req) as the signal for an error. * This sets the error explicitly for the problem case. */ error = __scsi_error_from_host_byte(cmd, result); } Wouldn't this patch fix it as well? diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index 7cb66b0..68c0e74 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -693,7 +693,7 @@ static bool scsi_end_request(struct request *req, int error, struct scsi_device *sdev = cmd->device; struct request_queue *q = sdev->request_queue; - if (blk_update_request(req, error, bytes)) + if (bytes && blk_update_request(req, error, bytes)) return true; /* Bidi request must be completed as a whole */ Which actually looks like a valid patch anyway ... Cheers, Hannes -- Dr. Hannes Reinecke Teamlead Storage & Networking hare@xxxxxxx +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html