Re: If I have a single bad sector, how many failed reads should simple dd report?

Greg Freemyer <greg.freemyer@xxxxxxxxx> · Sun, 11 Jul 2010 08:58:19 -0400

On Sat, Jul 10, 2010 at 10:14 AM, James Bottomley
<James.Bottomley@xxxxxxx> wrote:
> On Fri, 2010-07-09 at 21:24 -0400, Mark Lord wrote:
>> On 09/07/10 09:19 PM, Mark Lord wrote:
>> > On 09/07/10 03:04 PM, Greg Freemyer wrote:
>> > ..
>> >>> When I re-ran it, /var/log/messages reported 10 bad logical blocks.
>> >>> And even worse, dd reported 20 bad blocks. I examined the data dd
>> >>> read and it had 80KB of zero'ed out data. So that's 160 sectors worth
>> >>> of data lost because of a single bad sector. At most I was expecting
>> >>> 4KB of zero'ed out data.
>> > ..
>> >
>> > That's just the standard, undesirable result of the current SCSI EH
>> > when used with libata for (mainly) desktop computers.
>> >
>> > I have patches (against older kernels) to fix it, but have yet to
>> > get both myself and James B. interested enough simultaneously to
>> > actually get the kernel fixed. :)
>> ..
>>
>> Here (attached and inline below) are my most recent patches for this.
>> Still outdated, though.  These are against the SLES11 2.6.27.19 kernel:
>>
>> -----------------------------snip----------------------------
>>
>> Stop the SCSI EH from performing tons of retries on unrecoverable medium errors,
>> so that error-handling fails more quickly and we (EMC) avoid unneeded node resets.
>>
>> The ugliness of this patch matches the ugliness of SCSI EH.
>> Does *anyone* actually understand this code completely?
>>
>> Signed-off-by: Mark Lord <mlord@xxxxxxxxx>
>>
>> --- old/drivers/scsi/scsi_error.c     2009-06-04 09:46:55.000000000 -0400
>> +++ linux/drivers/scsi/scsi_error.c   2009-06-04 12:08:48.000000000 -0400
>> @@ -423,6 +423,52 @@
>>       }
>>   }
>>
>> +/*
>> + * The problem with scsi_check_sense(), is that it is (designed to be)
>> + * called only after retries are exhausted.  But for MEDIUM_ERRORs (only)
>> + * we don't want any retries here at all.
>> + *
>> + * So this function below is a clone of the necessary parts from scsi_check_sense(),
>> + * to check for unrecoverable MEDIUM_ERRORs when deciding whether to retry or not.
>> + */
>> +static int scsi_unrecoverable_medium_error(struct scsi_cmnd *scmd)
>> +{
>> +     struct scsi_sense_hdr sshdr;
>> +
>> +     if (! scsi_command_normalize_sense(scmd, &sshdr))
>> +             return 0;       /* no valid sense data */
>> +
>> +     if (scsi_sense_is_deferred(&sshdr))
>> +             return 0;
>> +
>> +     if (sshdr.response_code == 0x70) {
>> +             /* fixed format */
>> +             if (scmd->sense_buffer[2] & 0xe0)
>> +                     return 0;
>> +     } else {
>> +             /*
>> +              * descriptor format: look for "stream commands sense data
>> +              * descriptor" (see SSC-3). Assume single sense data
>> +              * descriptor. Ignore ILI from SBC-2 READ LONG and WRITE LONG.
>> +              */
>> +             if ((sshdr.additional_length > 3) &&
>> +                 (scmd->sense_buffer[8] == 0x4) &&
>> +                 (scmd->sense_buffer[11] & 0xe0))
>> +                     return 0;
>> +     }
>> +
>> +     switch (sshdr.sense_key) {
>> +     case MEDIUM_ERROR:
>> +             if (sshdr.asc == 0x11 || /* UNRECOVERED READ ERR */
>> +                 sshdr.asc == 0x13 || /* AMNF DATA FIELD */
>> +                 sshdr.asc == 0x14) { /* RECORD NOT FOUND */
>> +                     //printk(KERN_WARNING "%s: MEDIUM_ERROR\n", __func__);
>> +                     return 1;
>> +             }
>> +     }
>> +     return 0;
>> +}
>
>
>> +
>>   /**
>>    * scsi_eh_completed_normally - Disposition a eh cmd on return from LLD.
>>    * @scmd:   SCSI cmd to examine.
>> @@ -1334,6 +1380,8 @@
>>
>>       switch (status_byte(scmd->result)) {
>>       case CHECK_CONDITION:
>> +             if (scsi_unrecoverable_medium_error(scmd))
>> +                     return 1;
>>               /*
>>                * assume caller has checked sense and determinted
>>                * the check condition was retryable.
>
> The check is redundant:  scsi_decide_disposition is where we check for
> retries or error handling.  If you look, it already picks out your three
> cases and returns success for them (meaning pass straight through).  in
> sd_done() MEDIUM_ERROR means complete immediately.
>
>
>>
>> On encountering a bad sector, report and skip over it,
>> then continue with the remainder of the request.
>> Otherwise we would fail perfectly good sectors,
>> making a bad situation even worse.
>>
>> Signed-off-by: Mark Lord <mlord@xxxxxxxxx>
>>
>> --- old/drivers/scsi/scsi_lib.c       2009-06-04 12:26:52.000000000 -0400
>> +++ linux/drivers/scsi/scsi_lib.c     2009-06-04 14:40:11.000000000 -0400
>> @@ -952,6 +952,12 @@
>>        */
>>       if (sense_valid && !sense_deferred) {
>>               switch (sshdr.sense_key) {
>> +             case MEDIUM_ERROR:
>> +             /* Bad sector.  Fail it, and then continue the rest of the request. */
>> +             if (this_count && scsi_end_request(cmd, -EIO, cmd->device->sector_size, 1) == NULL) {
>> +                     cmd->retries = 0;       // go around again..
>> +                     return;
>> +             }
>>               case UNIT_ATTENTION:
>>                       if (cmd->device->removable) {
>>                               /* Detected disc change.  Set a bit
>
> This one's in the wrong place.  Normal MEDIUM_ERRORS complete above
> this.
>
> I also don't think the skip is right.  We're supposed to communicate to
> block what we've done, and all we have to do that with is good bytes.
> If we skip over a bad sector and try to complete the rest, we've lost
> the error position.
>
> The failure return from Medium error is defined to be the number of
> bytes we actually wrote, which is everything up to the medium error.
>
> James

If you guys have patches you want me to test, I'm happy to do so, but
I urge the team to work on this simple test case.

It may have been this way since day one of libata, but it certainly
looks like a collection of bugs to me.

Again, assuming /dev/sdb is a test drive without valuable data, the
test case can be as simple as:

==============
#create 1000 sectors of random data starting at sector 1000
dd if=/dev/urandom of=/dev/sdb bs=512 seek=1000 count=1000

#make a clean copy of the data, starting at sector 0 and going well
#past the test data (about 4MB total)
dd if=/dev/sdb of=clean.dd conv=noerror,sync bs=4K count=1000

#corrupt a sector in the middle of the test data
hdparm --make-bad-sector 1500 --yes-i-know-what-i-am-doing /dev/sdb

#make another copy of the first 4MB
dd if=/dev/sdb of=corrupt.dd conv=noerror,sync bs=4K count=1000

#restore the drive sector so it is not permanently lost
hdparm --write-sector 1500 --yes-i-know-what-i-am-doing /dev/sdb

(I chose sector 1500 this time to make the testing process faster.)

cmp -bl clean.dd corrupt.dd > delta.log

Review delta.log for mismatched bytes, and /var/log/warn for logical
block failures as well as extraneous media errors.
==============
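
The byte-level output from cmp is tedious to map back to sectors. As a
rough aid, here is a minimal sketch that reports which sectors differ
between the two images. It assumes 512-byte logical sectors and the
clean.dd / corrupt.dd files produced above; the function names are made
up for illustration and are not part of any existing tool.

```python
SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def differing_sectors(clean_path, corrupt_path, sector_size=SECTOR_SIZE):
    """Return the sorted list of sector numbers whose contents differ."""
    bad = []
    with open(clean_path, "rb") as clean, open(corrupt_path, "rb") as corrupt:
        sector = 0
        while True:
            a = clean.read(sector_size)
            b = corrupt.read(sector_size)
            if not a and not b:
                break  # both images exhausted
            if a != b:
                bad.append(sector)
            sector += 1
    return bad


def to_ranges(sectors):
    """Collapse a sorted list of sector numbers into inclusive (first, last) runs."""
    ranges = []
    for s in sectors:
        if ranges and s == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], s)  # extend the current run
        else:
            ranges.append((s, s))  # start a new run
    return ranges


def report(clean_path, corrupt_path):
    """Print one line per run of differing sectors, then a total."""
    bad = differing_sectors(clean_path, corrupt_path)
    for first, last in to_ranges(bad):
        print(f"sectors {first}-{last} differ ({last - first + 1} sectors)")
    print(f"total: {len(bad)} differing sectors")


# Example call, using the file names from the test case:
# report("clean.dd", "corrupt.dd")
```

If only sector 1500 were affected, the report would show a single
one-sector run; anything wider is the extra loss described below.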

I found:

1) Corruption started several bytes prior to the corrupt sector
2) With dd using 512 byte blocks, 64 sectors worth of corrupt data in corrupt.dd
3) With dd using 4K blocks, 160 sectors worth of corrupt data in
corrupt.dd.   (I don't know why the counts differ; I was using
dcfldd, a customized version of dd, for this test.)
4) A 50% reduction in throughput based on one bad sector.  I'm not
sure that's a bug, but it is clearly an issue that can be looked at.
You need a much bigger dd run to see that impact.
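
For reference, the arithmetic behind the "at most 4KB" expectation:
with conv=noerror,sync, dd pads a failed input block with zeros, so a
single bad sector should cost no more than the one dd block that
contains it. The sketch below works that out for sector 1500, assuming
512-byte sectors; the helper name is illustrative only.

```python
SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def dd_block_for_sector(bad_sector, dd_block_size, sector_size=SECTOR_SIZE):
    """Return (block_index, first_sector, last_sector) of the dd block
    that contains bad_sector, for a copy starting at sector 0."""
    sectors_per_block = dd_block_size // sector_size
    block = bad_sector // sectors_per_block
    first = block * sectors_per_block
    return block, first, first + sectors_per_block - 1


# bs=4K: sector 1500 lives in dd block 187, which spans sectors 1496-1503,
# so the expected loss is 8 sectors (4KB).
print(dd_block_for_sector(1500, 4096))   # (187, 1496, 1503)

# bs=512: the expected loss is just the one sector.
print(dd_block_for_sector(1500, 512))    # (1500, 1500, 1500)

# Observed instead: 160 sectors with 4K blocks, 64 sectors with 512-byte blocks.
print(160 * SECTOR_SIZE // 1024, "KB")   # 80 KB, i.e. 20 blocks of 4K
print(64 * SECTOR_SIZE // 1024, "KB")    # 32 KB
```

The 160-sector figure is 20 times the expected loss for bs=4K, which
lines up with dd reporting 20 bad blocks in the earlier run.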

Thanks
Greg
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html