On Thu, 17 Jun 2010 08:48:07 -0400 Brett Russ <bruss@xxxxxxxxxxx> wrote:

> On 06/06/2010 08:07 PM, NeilBrown wrote:
> > The goal of these patches is to add a 'bad block list' to each device
> > and use it to allow us to fail single blocks rather than whole
> > devices.
>
> Hi Neil,
>
> This is a worthwhile addition, I think. However, one concern we have is
> there appears to be no distinction between media errors (i.e. bad
> blocks) and other SCSI errors. One situation we commonly see in the
> enterprise is non-media SCSI errors due to i.e. path failure. We've
> tested dm multipath as a solution for that but it has its own problems,
> primarily performance due to its apparent decomposition of large
> contiguous I/Os into smaller I/Os and we're investigating that. Until
> that is fixed, we have patched md to retry failed writes (md already has
> a mechanism for failed reads). Commonly these retries will succeed as
> many of the path failures we've seen have been transient (i.e. a SAS
> expander undergoes a reset). Today in the vanilla md code that would
> cause a drive failure. In this patch, it would identify a range of
> blocks as bad. Presumably later they might be revalidated and removed
> from the bad block list if the original error(s) were in fact transient,
> but in the meantime we lose that member from any reads.

Hi Brett,
 thanks for your thoughts.

No, md doesn't differentiate between different types of errors.

There are two reasons for this.

1/ I don't think it gets told what sort of error there was. The bi_end_io
   function is passed an error code, but I don't think that can be used to
   differentiate between e.g. media and transport errors. Maybe that has
   changed since I last looked....

2/ I don't think it would help.

md currently treats all errors as media errors. i.e. it assumes just that
block is bad. If it can deal with that (and bad-block-lists expand the
options of dealing with it) it does. If it cannot, it just rejects the device.
If the error were actually a transport error, it would very likely lead
quickly to an error that md could not deal with (e.g. when updating
metadata), and md would have to reject the whole device.
And that action is the only thing that it makes sense for md to do in the
face of a transport error. Such an error says that we cannot reliably talk
to the device, so md should stop trying.

It is simply not appropriate for md to re-try on failure just as it is not
appropriate for md to implement any timeouts. Both these actions imply some
knowledge of the characteristics of the underlying device, and md simply
does not have that knowledge.

If you have a device where temporary path failures are possible, then it is
up to the driver to deal with that possibility. For example, I believe the
'dasd' driver (which is for some sort of fibre connected drives on an IBM
mainframe) normally treats cable problems as a transient error and retries
indefinitely until they are repaired, or until the sysadmin says otherwise.
This seems a reasonable approach.

The only situation where it might make sense for md to retry is if it could
retry in a 'different' way (trying the same thing again and expecting a
different result is not entirely rational after all...). e.g. md/raid1
could issue reads with the FAILFAST flag which - for dasd at least - says
to not retry transport errors indefinitely. After an error from that read
it would be sensible not to reject the device but just direct the read to a
different device. If all devices failed with FAILFAST, then try again
without FAILFAST - then treat such a failure as hard.

That might be nice, but the last time I tried it different drivers treated
FAILFAST quite differently. e.g. my SATA devices would fairly often fail
FAILFAST requests even when they were otherwise working perfectly. I don't
think that FAILFAST is/was very well specified: 'fast' is a relative term
after all.
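
To make the two-pass idea above concrete, here is a rough user-space
sketch in Python. It is only a simulation of the policy, not md code: the
names Device, ReadError and read_mirrored are made up for illustration, and
"treat such a failure as hard" is assumed here to mean marking the device
faulty, which is just one possible reading.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


class ReadError(Exception):
    """Raised by a simulated device when a read fails."""


@dataclass
class Device:
    """Hypothetical stand-in for one mirror of a raid1 array."""
    name: str
    # read(sector, failfast) returns the data or raises ReadError.
    read: Callable[[int, bool], bytes]
    faulty: bool = False


def read_mirrored(devices: List[Device], sector: int) -> Optional[bytes]:
    """Read one sector, trying fail-fast requests on every mirror first."""
    candidates = [d for d in devices if not d.faulty]

    # Pass 1: fail-fast reads. An error here is not held against the
    # device; the read is simply directed to the next mirror.
    for dev in candidates:
        try:
            return dev.read(sector, True)
        except ReadError:
            continue

    # Pass 2: every mirror failed quickly, so retry without fail-fast.
    # A failure now is treated as hard (assumed: mark the device faulty).
    for dev in candidates:
        try:
            return dev.read(sector, False)
        except ReadError:
            dev.faulty = True

    return None  # no mirror could supply the data
```

A device that rejects every fail-fast request but otherwise works (like the
SATA devices mentioned above) is never marked faulty by this policy; it just
costs an extra pass before the data comes back.
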
I note that there are now 3 different FAILFAST flags (DEV, TRANSPORT, and
DRIVER). Maybe they have more useful implementations so maybe it is time to
revisit this issue again.

However it remains that if no FAILFAST flags are present, then it is up to
the driver to do any retries that might be appropriate - md cannot be
involved in retries at that level.

>
> As an aside, it would be handy to have mechanisms exposed to userspace
> (via mdadm) to display, test, and possibly override the memory of these
> bad blocks such that in these instances where md has (possibly
> incorrectly) forced a range of blocks unavailable on a member that we
> can recover data if the automated recovery doesn't succeed.

Yes, the bad block list is entirely exposed to user-space via sysfs.
Removing entries from the list directly is not currently supported (except
for debugging purposes). To remove a bad block you just need to arrange a
successful write to the device which can be done with the 'check' feature.
Adding and examining bad blocks is easy.

>
> Do you have thoughts or plans to behave differently based on the type of
> error? I believe today the SCSI layer only provides pass/fail, is that
> correct? If so, plumbing would need to be added to make the upper layer
> aware of the nature of the failure. It seems that the bad block
> management in md should only take effect for media errors and that there
> should be more intelligent handling of other types of errors. We would
> be happy to help in this area if it aligns with your/the community's
> longer term view of things.

I've probably answered this question above, but to summarise:

I think there could be some place for responding differently to different
types of errors, but it would only be to respond more harshly than we
currently do.
I think that any differentiation should come by md making different sorts
of requests (e.g.
with or without FAILFAST), and possibly retrying such requests in a more
forceful way, or after other successes have shown that it might be
appropriate.

Thanks,
NeilBrown