On Fri, Apr 2, 2021 at 4:23 AM Patrick O'Callaghan <pocallaghan@xxxxxxxxx> wrote:
>
> On Thu, 2021-04-01 at 23:52 -0600, Chris Murphy wrote:
> > It's not an SMR concern, it's making sure the drive gives up on
> > errors faster than the kernel tries to reset due to what it thinks
> > is a hanging drive.
> >
> > smartctl -l scterc /dev/sdX
> >
> > That'll tell you the default setting. I'm pretty sure Blues come with
> > SCT ERC disabled. Some support it. Some don't. If it's supported
> > you'll want to set it for something like 70-100 deciseconds (the
> > units SATA drives use for this feature).
>
> One doesn't and one does:
>
> # smartctl -l scterc /dev/sdd
> smartctl 7.2 2021-01-17 r5171 [x86_64-linux-5.11.10-200.fc33.x86_64] (local build)
> Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
>
> SCT Error Recovery Control command not supported
>
> # smartctl -l scterc /dev/sde
> smartctl 7.2 2021-01-17 r5171 [x86_64-linux-5.11.10-200.fc33.x86_64] (local build)
> Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
>
> SCT Error Recovery Control:
>            Read:     85 (8.5 seconds)
>           Write:     85 (8.5 seconds)
>
> So I guess the /dev/sde drive is set correctly, right? Or would you
> recommend disabling SCT ERC for this drive?

Leave /dev/sde alone, 85 deciseconds is fine.

Not much can be done with /dev/sdd itself directly. But it is possible
to increase the kernel's command timer for this drive. The usual way
of doing this is via sysfs (/sys/block/sdX/device/timeout, in
seconds). I think it can be done with a udev rule as well, but I'm
blanking on how to do it. Udev needs to identify the device by serial
number or wwn, but changing the timeout via sysfs requires knowing
what the /dev node is, which of course can change each time you boot
or plug the device in. I don't know enough about udev. But there
should be examples on the internet, or you can just fudge it with the
linux-raid wiki guide.

The alternatives? Change the timeout for all /dev/ nodes.
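
A minimal sketch of changing the timeout for every sd* device, assuming
the standard sysfs attribute /sys/block/sdX/device/timeout (value in
seconds). The function name set_all_timeouts and the "apply" argument
are made up for illustration, and the setting does not survive a reboot
or a hotplug, so it would have to be rerun:

```shell
#!/bin/sh
# Sketch: raise the kernel's command timer for every sd* block device.
# set_all_timeouts and the "apply" argument are hypothetical names.

set_all_timeouts() {
    sysfs_block=${1:-/sys/block}   # directory to scan; overridable for a dry run
    timeout_secs=${2:-180}         # the kernel expects seconds, not deciseconds

    for f in "$sysfs_block"/sd*/device/timeout; do
        [ -w "$f" ] || continue    # skip an unmatched glob or unwritable file
        echo "$timeout_secs" > "$f"
        echo "$f -> $(cat "$f")"   # show the value the kernel now reports
    done
}

# Only touch the real sysfs tree when explicitly asked to.
if [ "${1:-}" = "apply" ]; then
    set_all_timeouts /sys/block 180
fi
```

Saved under a name of your choosing and run as root with the "apply"
argument, it prints each timeout file and its new value. Pointing the
first argument of set_all_timeouts at a scratch directory lets you try
it without touching real devices.
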
A long timeout for every drive is how things are by default on Windows
and macOS: they just wait a long time before resetting a drive, giving
it enough time to give up on its own. The negative side effect is that
you might get a long delay without errors, should the device develop
marginally bad sectors.

Another alternative is to just leave it alone, and periodically check
(manually or automate it somehow) for the telltale signs of bad
sectors masked by SATA link resets. It looks like this:

kernel: ata7.00: status: { DRDY }
kernel: ata7.00: failed command: READ FPDMA QUEUED
kernel: ata7.00: cmd 60/40:f0:98:d2:2b/05:00:45:00:00/40 tag 30 ncq dma 688128 in
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

With this interlaced occasionally:

kernel: ata7: hard resetting link

If it happens, *then* you can increase the timeout manually and
initiate a scrub. As long as the timeout is set high enough (most
sources suggest 180 seconds which, yes, is incredible), eventually the
drive will give up, spit out an error, and Btrfs will fix up that
sector by overwriting it with good data.

It could be months, years, or never, before it happens.

--
Chris Murphy