Hello all,

thanks for giving notice that the attachment didn't come through the mailing list.

The following is an improved version with its README, inserted inline.
It has changes to support configuring specific timeouts and switching between them.

Cheers,
Chris

--------
# smartctl-timeouts_defaults
# Defaults used by smartctl-timeouts scripts:

NONREDUNDANT_UNSURE_CONTROLLER_RESET_SECONDS="183"
    # Should always be set. (e.g. 183)
    # Used for disks without SCTERC support, to prevent too early
    # resets. This long controller timeout value should be above
    # the usual error recovery time of the hard drives without
    # SCTERC support. Unfortunately, these values don't seem
    # to be readable from the drives nor published.

NONREDUNDANT_UNSURE_RESET_ALL_DISKS=""
    # If "true", the ERC timeout gets disabled for non-redundant disks
    # and the value above is used as the controller timeout.
    # Can be set to "true" to try letting non-redundant disks fully
    # complete their error recovery attempt.

# The configuration options below can be left blank or commented
# out. This results in working with the hardware, kernel, or
# distribution defaults, and doing only necessary adaptations when
# initializing.
# But without configuring specific values, switching between
# the redundancy modes may not work well.

#NONREDUNDANT_DISK_CONTROLLER_TIMEOUT_SECONDS="63"
    # May be set to allow ample ERC time (e.g. 63).
    # If blank the current timeout will not be changed, if possible.
    # Note that the max. ERC timeout is 99 seconds, so a longer
    # controller timeout won't result in longer error correction
    # attempts. Consider NONREDUNDANT_UNSURE_RESET_ALL_DISKS if your
    # disk makes longer error correction attempts when the ERC
    # timeout is disabled.

#POSSIBLY_REDUNDANT_DISK_CONTROLLER_TIMEOUT_SECONDS="48"
    # May be set to allow some (-5s) ERC timeout, yet not blocking
    # redundant disks for too long.
    # If blank the current setting will not be changed, if possible.

#REDUNDANT_DISK_CONTROLLER_TIMEOUT_SECONDS="29"
    # May be set to quickly reset blocking disks.
    # If blank the current setting will not be changed, if possible.

TIMING_CMD="/usr/sbin/smartctl -l scterc"

set -o nounset -o errexit

-------
# do not edit this file, it will be overwritten on update

# Don't process any events if anaconda is running as anaconda brings up
# raid devices manually
ENV{ANACONDA}=="?*", GOTO="md_inc_end"
# assemble md arrays

SUBSYSTEM!="block", GOTO="md_inc_end"

# handle potential components of arrays (the ones supported by md)
ENV{ID_FS_TYPE}=="linux_raid_member", GOTO="md_inc"

# "noiswmd" on kernel command line stops mdadm from handling
#  "isw" (aka IMSM - Intel RAID).
# "nodmraid" on kernel command line stops mdadm from handling
#  "isw" or "ddf".
IMPORT{cmdline}="noiswmd"
IMPORT{cmdline}="nodmraid"

ENV{nodmraid}=="?*", GOTO="md_inc_end"
ENV{ID_FS_TYPE}=="ddf_raid_member", GOTO="md_inc"
ENV{noiswmd}=="?*", GOTO="md_inc_end"
ENV{ID_FS_TYPE}=="isw_raid_member", GOTO="md_inc"
GOTO="md_inc_end"

LABEL="md_inc"

# initialize redundancy possibility status
# (only the kernel module could set actual run-time state, and may in the future
# set a dynamic FASTFAIL kernel device property instead of calling smartctl-timeout scripts)
IMPORT{program}="BINDIR/mdadm --examine --export $tempnode"
ENV{MD_LEVEL}=="raid[1-9]*", ENV{REDUNDANT_DEV}="possibly"
ENV{MD_LEVEL}=="raid0", ENV{REDUNDANT_DEV}="false"

# remember you can limit what gets auto/incrementally assembled by
# mdadm.conf(5)'s 'AUTO' and selectively whitelist using 'ARRAY'
ACTION=="add|change", IMPORT{program}="BINDIR/mdadm --incremental --export $tempnode --offroot ${DEVLINKS}"
ACTION=="add|change", ENV{MD_STARTED}=="*unsafe*", ENV{MD_FOREIGN}=="no", ENV{SYSTEMD_WANTS}+="mdadm-last-resort@$env{MD_DEVICE}.timer"
ACTION=="remove", ENV{ID_PATH}=="?*", RUN+="BINDIR/mdadm -If $name --path $env{ID_PATH}"
ACTION=="remove", ENV{ID_PATH}!="?*", RUN+="BINDIR/mdadm -If $name"

LABEL="md_inc_end"

# initialize
# redundancy status for all surely non-redundant devices
# (The mdadm, btrfs, zfs, lvm, ... devices need to be adjusted by their own packages)
ENV{ID_FS_TYPE}!="linux_raid*|ddf_raid*|isw_raid*|lvm_*|LVM*|btrfs*|zfs*", ENV{REDUNDANT_DEV}="false"

# call initial HDD error correction timeouts adjustment
# (TEST is a match key, hence "=="; the scripts expect a bare device name
# such as "sda", so $kernel is passed for whole disks instead of $devnode)
ENV{DEVTYPE}=="partition", ENV{REDUNDANT_DEV}=="possibly", TEST=="/usr/sbin/smartctl", RUN+="BINDIR/smartctl-timeouts_possibly-redundant-partition.sh $parent"
ENV{DEVTYPE}=="partition", ENV{REDUNDANT_DEV}=="false", TEST=="/usr/sbin/smartctl", RUN+="BINDIR/smartctl-timeouts_non-redundant-partition.sh $parent"
ENV{DEVTYPE}=="disk", ENV{REDUNDANT_DEV}=="possibly", TEST=="/usr/sbin/smartctl", RUN+="BINDIR/smartctl-timeouts_possibly-redundant-disk.sh $kernel"
ENV{DEVTYPE}=="disk", ENV{REDUNDANT_DEV}=="false", TEST=="/usr/sbin/smartctl", RUN+="BINDIR/smartctl-timeouts_non-redundant-disk.sh $kernel"

------
#!/bin/sh
# smartctl-timeouts_possibly-redundant-disk.sh

SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
. "$SCRIPT_DIR/smartctl-timeouts_defaults"

HDD_DEV="$1"

echo "Adjusting $HDD_DEV timeouts:"

if ! ${TIMING_CMD} /dev/${HDD_DEV} | grep -q Disabled \
   && ! ${TIMING_CMD} /dev/${HDD_DEV} | grep -q seconds
then
    # ERC timeout is not supported (not disabled and not set):
    # * Set the controller timeout to be considerably loooooong.
    #   - To allow the drive to give up its ERC attempts by itself.
    #   - Let the drive return a proper read error, so that the redundancy
    #     provider (md, lvm, btrfs, ...) can re-write the bad block.
    #   - Disk read errors thus result in long i/o blocking periods without
    #     error messages, which the user may not notice or get reported,
    #   - but waiting this long should prevent unnecessary controller resets of the
    #     entire drive and the corresponding loss of redundancy/data.
    echo "Drive without ERC timeout support, setting NONREDUNDANT_UNSURE_CONTROLLER_RESET_SECONDS (${NONREDUNDANT_UNSURE_CONTROLLER_RESET_SECONDS}s)"
    echo ${NONREDUNDANT_UNSURE_CONTROLLER_RESET_SECONDS} >/sys/block/${HDD_DEV}/device/timeout
else
    SWITCH_FROM_OTHER_CONFIGURED_SMARTCTL_TIMEOUT="false"
    # reset controller timeout, if a configured value was previously set
    if [ `cat /sys/block/${HDD_DEV}/device/timeout` = ${NONREDUNDANT_UNSURE_CONTROLLER_RESET_SECONDS:--1} ] \
       || [ `cat /sys/block/${HDD_DEV}/device/timeout` = ${NONREDUNDANT_DISK_CONTROLLER_TIMEOUT_SECONDS:--1} ] \
       || [ `cat /sys/block/${HDD_DEV}/device/timeout` = ${REDUNDANT_DISK_CONTROLLER_TIMEOUT_SECONDS:--1} ]
    then
        SWITCH_FROM_OTHER_CONFIGURED_SMARTCTL_TIMEOUT="true"
        echo "resetting controller from another configured value (`cat /sys/block/${HDD_DEV}/device/timeout`s) to ${POSSIBLY_REDUNDANT_DISK_CONTROLLER_TIMEOUT_SECONDS:-30}s"
        echo ${POSSIBLY_REDUNDANT_DISK_CONTROLLER_TIMEOUT_SECONDS:-30} >/sys/block/${HDD_DEV}/device/timeout
    else
        # set possibly-redundant timeout anyway, if configured
        if [ ${POSSIBLY_REDUNDANT_DISK_CONTROLLER_TIMEOUT_SECONDS:-undefined} != "undefined" ] ; then
            echo "setting controller timeout to POSSIBLY_REDUNDANT_DISK_CONTROLLER_TIMEOUT_SECONDS (${POSSIBLY_REDUNDANT_DISK_CONTROLLER_TIMEOUT_SECONDS}s)"
            echo ${POSSIBLY_REDUNDANT_DISK_CONTROLLER_TIMEOUT_SECONDS} >/sys/block/${HDD_DEV}/device/timeout
        fi
    fi

    if ${TIMING_CMD} /dev/${HDD_DEV} | grep -q Disabled \
       || [ $SWITCH_FROM_OTHER_CONFIGURED_SMARTCTL_TIMEOUT = "true" ] \
       || [ ${POSSIBLY_REDUNDANT_DISK_CONTROLLER_TIMEOUT_SECONDS:-undefined} != "undefined" ]
    then
        # ERC timeout is disabled or configured:
        # * set it to controller timeout -5 seconds
        #   - Allows redundancy provider to read data from another disk and re-write the bad block
        #     before the controller resets the entire drive and the raid loses redundancy/data completely.
        #   - Longer than the usual 7s default of dedicated raid drives, to allow as much
        #     ERC time as possible (good if degraded and for non-redundant partitions on same drive).
        ERC_TENTHS=$(expr `cat /sys/block/${HDD_DEV}/device/timeout` \* 10 - 50)
        # prevent exceeding max. scterc value
        if [ $ERC_TENTHS -gt 999 ] ; then
            ERC_TENTHS="999"
        fi
        echo "maximizing ERC timeout to controller timeout -5 seconds (`expr $ERC_TENTHS / 10`s)"
        ${TIMING_CMD},$ERC_TENTHS,$ERC_TENTHS /dev/${HDD_DEV} > /dev/null
    fi
fi

----------
#! /bin/sh
# smartctl-timeouts_possibly-redundant-partition.sh

# This script sets the timeouts for "mixed drives" that contain redundant
# and non-redundant partitions.

# A single, possibly-redundant partition is enough to set the entire drive's
# timeouts to possibly-redundant settings (with a determined ERC timeout slightly
# below the default or configured controller timeout, if possible).
#
# This avoids risking unknown disk recovery times and needing very long
# controller timeouts. Where configuring such an ERC timeout is possible,
# this means the disk recovery may be terminated sooner than the drive
# would without the timeout set, but it ensures that there will be no resets
# leading to data loss and redundancy loss.

SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"

$SCRIPT_DIR/smartctl-timeouts_possibly-redundant-disk.sh $1

--------------
#!/bin/sh
# smartctl-timeouts_non-redundant-disk.sh

SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
. "$SCRIPT_DIR/smartctl-timeouts_defaults"

HDD_DEV="$1"

echo "Adjusting $HDD_DEV timeouts:"

if [ ${NONREDUNDANT_UNSURE_RESET_ALL_DISKS:-false} = "true" ] ; then
    # * disable any ERC timeout
    #   - Allows the drive to do ERC without imposing a timeout.
    ${TIMING_CMD},0,0 /dev/${HDD_DEV} > /dev/null
    # * Set the controller timeout to be considerably loooooong.
    #   - To allow the drive to give up its ERC attempts by itself.
    #   - Let the drive return a proper read error, so that the redundancy
    #     provider (md, lvm, btrfs, ...)
    #     can re-write the bad block.
    #   - Disk read errors thus result in long i/o blocking periods without
    #     error messages, which the user may not notice or get reported,
    #   - but waiting this long should prevent unnecessary controller resets of the
    #     entire drive and the corresponding loss of redundancy/data.
    echo "NONREDUNDANT_UNSURE_RESET_ALL_DISKS is true, setting NONREDUNDANT_UNSURE_CONTROLLER_RESET_SECONDS (${NONREDUNDANT_UNSURE_CONTROLLER_RESET_SECONDS}s)"
    echo ${NONREDUNDANT_UNSURE_CONTROLLER_RESET_SECONDS} >/sys/block/${HDD_DEV}/device/timeout
else
    if ! ${TIMING_CMD} /dev/${HDD_DEV} | grep -q Disabled \
       && ! ${TIMING_CMD} /dev/${HDD_DEV} | grep -q seconds
    then
        # ERC timeout is not supported (not disabled and not set)
        # * Set the controller timeout to be considerably loooooong.
        echo "Drive without ERC timeout support, setting NONREDUNDANT_UNSURE_CONTROLLER_RESET_SECONDS (${NONREDUNDANT_UNSURE_CONTROLLER_RESET_SECONDS}s)"
        echo ${NONREDUNDANT_UNSURE_CONTROLLER_RESET_SECONDS} >/sys/block/${HDD_DEV}/device/timeout
    else
        if ${TIMING_CMD} /dev/${HDD_DEV} | grep -q seconds \
           || [ ${NONREDUNDANT_DISK_CONTROLLER_TIMEOUT_SECONDS:-undefined} != "undefined" ]
        then
            # reset controller timeout, if a configured value was previously set
            if [ `cat /sys/block/${HDD_DEV}/device/timeout` = ${NONREDUNDANT_UNSURE_CONTROLLER_RESET_SECONDS:--1} ] \
               || [ `cat /sys/block/${HDD_DEV}/device/timeout` = ${POSSIBLY_REDUNDANT_DISK_CONTROLLER_TIMEOUT_SECONDS:--1} ] \
               || [ `cat /sys/block/${HDD_DEV}/device/timeout` = ${REDUNDANT_DISK_CONTROLLER_TIMEOUT_SECONDS:--1} ]
            then
                echo "resetting controller from another configured value (`cat /sys/block/${HDD_DEV}/device/timeout`s) to ${NONREDUNDANT_DISK_CONTROLLER_TIMEOUT_SECONDS:-60}s"
                echo ${NONREDUNDANT_DISK_CONTROLLER_TIMEOUT_SECONDS:-60} >/sys/block/${HDD_DEV}/device/timeout
            else
                # set non-redundant timeout anyway, if configured
                if [ ${NONREDUNDANT_DISK_CONTROLLER_TIMEOUT_SECONDS:-undefined} != "undefined" ] ; then
                    echo "setting configured NONREDUNDANT_DISK_CONTROLLER_TIMEOUT_SECONDS (${NONREDUNDANT_DISK_CONTROLLER_TIMEOUT_SECONDS}s)"
                    echo ${NONREDUNDANT_DISK_CONTROLLER_TIMEOUT_SECONDS} >/sys/block/${HDD_DEV}/device/timeout
                fi
            fi

            # An ERC timeout is set or configured:
            # * change ERC timeout to controller timeout -5 seconds
            #   - Longer than the usual 7s default of dedicated raid drives, to allow as much
            #     ERC time as possible.
            ERC_TENTHS=$(expr `cat /sys/block/${HDD_DEV}/device/timeout` \* 10 - 50)
            # prevent exceeding max. scterc value
            if [ $ERC_TENTHS -gt 999 ] ; then
                ERC_TENTHS="999"
            fi
            echo "maximizing ERC timeout to controller timeout -5 seconds (`expr $ERC_TENTHS / 10`s)"
            ${TIMING_CMD},$ERC_TENTHS,$ERC_TENTHS /dev/${HDD_DEV} > /dev/null
        else
            # ERC timeout disabled
            echo "found ERC timeout disabled, setting NONREDUNDANT_UNSURE_CONTROLLER_RESET_SECONDS (${NONREDUNDANT_UNSURE_CONTROLLER_RESET_SECONDS}s)"
            echo ${NONREDUNDANT_UNSURE_CONTROLLER_RESET_SECONDS} >/sys/block/${HDD_DEV}/device/timeout
        fi
    fi
fi

--------------
#!/bin/sh
# smartctl-timeouts_non-redundant-partition.sh

# Because there may also be redundant partitions on this disk we must not
# unconditionally alter the timeouts.

SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
. "$SCRIPT_DIR/smartctl-timeouts_defaults"

HDD_DEV="$1"

REDUNDANT_DISK="unchecked"

# TODO
#for dev in `cd /sys/block/${HDD_DEV} ; ls -d ${HDD_DEV}*` ; do
#    if equivalent to udev's ENV{REDUNDANT_DEV}=="yes|possibly"; then
#        REDUNDANT_DISK="possibly"
#    fi
#done
#if [ $REDUNDANT_DISK = "unchecked" ] ; then
#    REDUNDANT_DISK="false"
#fi

# TODO: && all partitions have been detected by udev already
if [ $REDUNDANT_DISK = "false" ]
then
    $SCRIPT_DIR/smartctl-timeouts_non-redundant-disk.sh $1
else
    $SCRIPT_DIR/smartctl-timeouts_possibly-redundant-disk.sh $1
fi

-------------
#!/bin/sh
# smartctl-timeouts_redundant-disk.sh

# Redundant timeouts are NEVER to be triggered by udev rules!
# Because only the redundancy providing kernel module knows the actual run-time
# redundancy status, can adjust it and call this script dynamically.
# Udev rules can only determine "possibly redundant" devices.

SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
. "$SCRIPT_DIR/smartctl-timeouts_defaults"

HDD_DEV="$1"

echo "Adjusting $HDD_DEV timeouts:"

if ! ${TIMING_CMD} /dev/${HDD_DEV} | grep -q Disabled \
   && ! ${TIMING_CMD} /dev/${HDD_DEV} | grep -q seconds
then
    # ERC timeout is not supported (not disabled and not set):
    # * Set the controller timeout to be considerably loooooong.
    #   - To allow the drive to give up its ERC attempts by itself.
    #   - Let the drive return a proper read error, so that the redundancy
    #     provider (md, lvm, btrfs, ...) can re-write the bad block.
    #   - Disk read errors thus result in long i/o blocking periods without
    #     error messages, which the user may not notice or get reported,
    #   - but waiting this long should prevent unnecessary controller resets of the
    #     entire drive and the corresponding loss of redundancy/data.
    echo "Drive without ERC timeout support, setting NONREDUNDANT_UNSURE_CONTROLLER_RESET_SECONDS (${NONREDUNDANT_UNSURE_CONTROLLER_RESET_SECONDS}s)"
    echo ${NONREDUNDANT_UNSURE_CONTROLLER_RESET_SECONDS} >/sys/block/${HDD_DEV}/device/timeout
else
    SWITCH_FROM_OTHER_CONFIGURED_SMARTCTL_TIMEOUT="false"
    # reset controller timeout, if a configured value was previously set
    if [ `cat /sys/block/${HDD_DEV}/device/timeout` = ${NONREDUNDANT_UNSURE_CONTROLLER_RESET_SECONDS:--1} ] \
       || [ `cat /sys/block/${HDD_DEV}/device/timeout` = ${NONREDUNDANT_DISK_CONTROLLER_TIMEOUT_SECONDS:--1} ] \
       || [ `cat /sys/block/${HDD_DEV}/device/timeout` = ${POSSIBLY_REDUNDANT_DISK_CONTROLLER_TIMEOUT_SECONDS:--1} ]
    then
        SWITCH_FROM_OTHER_CONFIGURED_SMARTCTL_TIMEOUT="true"
        echo "resetting controller from another configured value (`cat /sys/block/${HDD_DEV}/device/timeout`s) to ${REDUNDANT_DISK_CONTROLLER_TIMEOUT_SECONDS:-30}s"
        echo ${REDUNDANT_DISK_CONTROLLER_TIMEOUT_SECONDS:-30} >/sys/block/${HDD_DEV}/device/timeout
    else
        # set redundant timeout anyway, if configured
        if [ ${REDUNDANT_DISK_CONTROLLER_TIMEOUT_SECONDS:-undefined} != "undefined" ] ; then
            echo "setting controller timeout to REDUNDANT_DISK_CONTROLLER_TIMEOUT_SECONDS (${REDUNDANT_DISK_CONTROLLER_TIMEOUT_SECONDS}s)"
            echo ${REDUNDANT_DISK_CONTROLLER_TIMEOUT_SECONDS} >/sys/block/${HDD_DEV}/device/timeout
        else
            if [ `cat /sys/block/${HDD_DEV}/device/timeout` -gt 30 ] ; then
                echo "reducing controller timeout to 30 seconds"
                echo 30 >/sys/block/${HDD_DEV}/device/timeout
            fi
        fi
    fi

    # TODO: || [ $(expr `cat /sys/block/${HDD_DEV}/device/timeout` \* 10 - 50) = read of current ERC timeout value ]
    if ${TIMING_CMD} /dev/${HDD_DEV} | grep -q Disabled \
       || [ $SWITCH_FROM_OTHER_CONFIGURED_SMARTCTL_TIMEOUT = "true" ] \
       || [ ${REDUNDANT_DISK_CONTROLLER_TIMEOUT_SECONDS:-undefined} != "undefined" ]
    then
        # ERC timeout is disabled, is configured, or has been "maximized" to the controller timeout -5 seconds:
        # * set it to 7 seconds
        #   - The usual quick 7s
        #     default of dedicated raid drives.
        #   - Allows redundancy provider to quickly read data from another disk and re-write the bad block
        #     before the controller resets the entire drive and the raid loses redundancy/data completely.
        echo "setting ERC timeout to 7 seconds"
        ${TIMING_CMD},70,70 /dev/${HDD_DEV} > /dev/null
    fi
fi

------
#!/bin/sh
# smartctl-timeouts_redundant-partition.sh

# Redundant timeouts are NEVER to be triggered by udev rules!
# Because only the redundancy providing kernel module knows the actual run-time
# redundancy status, can adjust it and call this script dynamically.
# Udev rules can only determine "possibly redundant" devices.

SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
. "$SCRIPT_DIR/smartctl-timeouts_defaults"

HDD_DEV="$1"

REDUNDANT_DISK="unchecked"

# TODO
#for dev in `cd /sys/block/${HDD_DEV} ; ls -d ${HDD_DEV}*` ; do
#    if equivalent to udev's ENV{REDUNDANT_DEV}=="false"; then
#        REDUNDANT_DISK="false"
#    fi
#done
#if [ $REDUNDANT_DISK = "unchecked" ] ; then
#    REDUNDANT_DISK="true"
#fi

# TODO: && all partitions have been detected by udev already
if [ $REDUNDANT_DISK = "true" ]
then
    $SCRIPT_DIR/smartctl-timeouts_redundant-disk.sh $1
else
    $SCRIPT_DIR/smartctl-timeouts_possibly-redundant-disk.sh $1
fi

---------
smartctl-timeouts README


The smartctl-timeouts scripts adjust controller and disk timeouts according to
redundancy status, and fix commonly mismatching defaults with drives that have
no error recovery timeout configured, which has often led to data loss.

The scripts are to be called by udev rules during device initialization, and
by kernel modules according to run-time redundancy status changes. Every
redundancy-providing block device module may ship with proper udev rules that
initialize the timeouts for their possibly redundant devices.

An alternative to these scripts may be to investigate the FASTFAIL feature in
the kernel.

NOTE: Correct execution during boot requires that distro package managers
hook smartctl and the smartctl-timeouts scripts into the initramfs.


RATIONALE

The error recovery (ERC) timeout *must* be shorter than the controller timeout.
Otherwise read errors will cause controller resets, leading to direct data loss
or, if it is a redundant disk, loss of redundancy and a very high probability
of another read error and data loss when re-establishing the redundancy.

If a drive does not support adjusting its ERC timeout, the controller timeout
must be increased above the drive's maximal error recovery time.
If you don't want that kind of long device timeout, you should look for a
drive with SCT ERC timeout support. (smartctl -l scterc /dev/...)


IMPACT (without having specific timeouts configured)

For possibly redundant disks: If supported but simply disabled in the drive,
the ERC timeout is adjusted to the current controller timeout minus 5 seconds.

The controller timeout is only raised (to
NONREDUNDANT_UNSURE_CONTROLLER_RESET_SECONDS) for drives without SCTERC
support, as well as for entirely non-redundant disks, in an attempt to allow
these drives to finish their error recovery regularly before a reset is
triggered.

As controller timeouts are only increased selectively (only drives without
SCTERC support and surely non-redundant disks), the scripts only adapt
mismatching timeouts, by default. Existing manufacturer or custom ERC timeout
settings (as in professional, dedicated, redundant setups, e.g. storage
servers etc.) won't be changed, except with specific configuration options.


TODO

* non-redundant-partitions: conditional udev triggering, or a test in the
  script could determine if all partitions of the disk have been detected
  already and are all non-redundant, to call non-redundant-disk in this case.

* parser to read ERC timeout values?
  - redundant-disk: a previously set "controller timeout - 5 seconds"
    ERC timeout (possibly-redundant), could also be reset to 7 seconds,
    not just a "Disabled" value.

* If a redundancy controlling kernel module is to make dynamic adjustments,
  "redundant-partition" needs implementation.
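
Regarding the "parser to read ERC timeout values" TODO item: the following is only a minimal sketch, not part of the posted scripts, of how the current ERC value could be read and compared with "controller timeout - 5 seconds". The helper names erc_tenths_for_timeout and parse_scterc_read are made up for illustration. The parser assumes that "smartctl -l scterc" prints lines shaped like "Read: 70 (7.0 seconds)" or "Read: Disabled", so it should be checked against the smartctl version in use. The sample input is hard-coded so the sketch runs without touching a real disk.

```shell
#!/bin/sh
# Sketch only: helper names and the assumed smartctl output layout are
# illustrative assumptions, not part of the posted scripts.

# Target ERC value in tenths of a second: controller timeout (seconds)
# minus 5 seconds, capped at 999 as in the posted scripts.
erc_tenths_for_timeout() {
    tenths=$(( $1 * 10 - 50 ))
    if [ "$tenths" -gt 999 ]; then
        tenths=999
    fi
    echo "$tenths"
}

# Read "smartctl -l scterc" output on stdin and print the read ERC value
# in tenths of a second, or "disabled", or "unsupported".
# Assumes lines shaped like "Read: 70 (7.0 seconds)" or "Read: Disabled".
parse_scterc_read() {
    awk '
        $1 == "Read:" && $2 == "Disabled" { print "disabled"; found = 1; exit }
        $1 == "Read:" && $2 ~ /^[0-9]+$/  { print $2; found = 1; exit }
        END { if (!found) print "unsupported" }
    '
}

# Demo with canned input instead of a real drive.
sample='SCT Error Recovery Control:
           Read:    250 (25.0 seconds)
          Write:    250 (25.0 seconds)'

current=$(printf '%s\n' "$sample" | parse_scterc_read)
target=$(erc_tenths_for_timeout 30)

if [ "$current" = "$target" ]; then
    echo "ERC matches controller timeout - 5s ($current tenths)"
fi
```

In redundant-disk.sh the canned sample would be replaced by the output of ${TIMING_CMD} /dev/${HDD_DEV}, and the hard-coded 30 by the content of /sys/block/${HDD_DEV}/device/timeout. A match could then be treated like the "Disabled" case and reset to 7 seconds.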
<<attachment: smartctl-timeouts_email2.zip>>