On Wed, Jun 29, 2016 at 08:01:56AM +0200, Hannes Reinecke wrote:
> On 06/28/2016 07:33 PM, Chris Murphy wrote:
> > On Tue, Jun 28, 2016 at 12:33 AM, Hannes Reinecke <hare@xxxxxxx> wrote:
> >> On 06/27/2016 06:42 PM, Chris Murphy wrote:
> >>> Hi,
> >>>
> >>> Drives with SCT ERC not supported or unset, result in potentially long
> >>> error recoveries for marginal or bad sectors: upwards of 180 second
> >>> recovers are suggested.
> >>>
> >>> The kernel's SCSI command timer default of 30 seconds, i.e.
> >>>
> >>> cat /sys/block/<dev>/device/timeout
> >>>
> >>> conspires to undermine the deep recovery of most drives now on the
> >>> market. This by default misconfiguration results in problems list
> >>> regulars are very well aware of. It affects all raid configurations,
> >>> and even affects the non-RAID single drive use case. And it does so in
> >>> a way that doesn't happen on either Windows or macOS. Basically it is
> >>> linux kernel induced data loss, the drive very possibly could present
> >>> the requested data upon deep recovery being permitted, but the
> >>> kernel's command timer is reached before recovery completes, and
> >>> obliterates any possibility of recovering that data. By default.
> >>>
> >>> This now seems to affect the majority of use cases. At one time 30
> >>> seconds might have been sane for a world with drives that had less
> >>> than 30 second recoveries for bad sectors. But that's no longer the
> >>> case.
> >>>
> >> 'Majority of use cases'.
> >> Hardly. I'm not aware of any issues here.
> >
> > This list is prolific with this now common misconfiguration. It
> > manifests on average about weekly, as a message from libata that it's
> > "hard resetting link". In every single case where the user is
> > instructed to either set SCT ERC lower than 30 seconds if possible, or
> > increase the kernel SCSI command timer well above 30 seconds (180 is
> > often recommended on this list), suddenly the user's problems start to
> > go away.
> >
> > Now the md driver gets an explicit read failure from the drive, after
> > 30 seconds, instead of a link reset. And this includes the LBA for the
> > bad sector, which is apparently what md wants to write the fixup back
> > to that drive.
> >
> > However the manifestation of the problem and the nature of this list
> > self-selects the user reports. Of course people with failed mdadm
> > based RAID come here. But this problem is also manifesting on Btrfs
> > for the same reasons. It also manifests, more rarely, with users who
> > have just a single drive if the drive does "deep recovery" reads on
> > marginally bad sectors, but the kernel flips out at 30 seconds
> > preventing that recovery. Of course not every drive model has such
> > deep recoveries, but by now it's extremely common. I have yet to see a
> > single consumer hard drive, ever, configured out of the box with SCT
> > ERC enabled.
> >
> So we should rather implement SCT ERC support in libata, and set ERC to
> the scsi command timeout, no?
> Then the user could tweak the scsi command timeout however he likes it
> to, and that timeout would be reflected into the ERC setting.
>
> And then we could add an initialisation bit which reads the current ERC
> values, increasing the SCSI command timeout as required.
>

But doesn't this still leave the "consumer" (non-NAS, non-RAID) drives broken
by default, until the user raises the SCSI command timeout for the disk to a
much bigger value (longer than the drive's internal timeout, whatever it is,
180 seconds or so)?

-- Pasi

> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke              Teamlead Storage & Networking
> hare@xxxxxxx                                 +49 911 74053 688
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
> HRB 21284 (AG Nürnberg)
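
For illustration only, here is a rough userspace sketch of the rule discussed in this thread: the kernel's command timer should outlast the drive's own error recovery. It is not the libata change Hannes proposes. The function names and the 5 second margin are hypothetical. The 180 second fallback is the value quoted above as often recommended on this list. The sketch assumes the SCT ERC read limit is already known (for example as reported by smartctl) in units of 100 ms, with None or 0 meaning ERC is unsupported or disabled. Only the timer is ever raised, never lowered.

```python
from pathlib import Path
from typing import Optional

FALLBACK_S = 180  # quoted in this thread as often recommended without ERC
MARGIN_S = 5      # hypothetical safety margin, not taken from the thread


def wanted_timeout(current_s: int, erc_ds: Optional[int],
                   fallback_s: int = FALLBACK_S,
                   margin_s: int = MARGIN_S) -> int:
    """Return a SCSI command timeout (seconds) that outlasts drive recovery.

    erc_ds is the drive's SCT ERC read limit in units of 100 ms,
    or None/0 when ERC is unsupported or disabled.
    """
    if not erc_ds:
        # Recovery time is unbounded as far as we know: assume deep
        # recovery and use the fallback, but never lower the timer.
        return max(current_s, fallback_s)
    erc_s = -(-erc_ds // 10)  # ceiling division: 100 ms units -> seconds
    return max(current_s, erc_s + margin_s)


def timeout_path(dev: str) -> Path:
    """sysfs file holding the per-device SCSI command timer."""
    return Path("/sys/block") / dev / "device" / "timeout"


def apply_timeout(dev: str, erc_ds: Optional[int]) -> int:
    """Read the current timer and raise it if needed (needs root to write)."""
    path = timeout_path(dev)
    current = int(path.read_text().strip())
    target = wanted_timeout(current, erc_ds)
    if target != current:
        path.write_text(f"{target}\n")
    return target
```

With these assumptions, a drive with no usable ERC and the default 30 second timer would get 180 seconds, while a drive with ERC set to 7.0 seconds (70 in 100 ms units) would keep the 30 second default. The setting written through sysfs does not persist across reboots, so something like this would have to run at boot or from a udev rule.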