-----BEGIN PGP SIGNED MESSAGE-----
Hash: RIPEMD160

Andreas Dilger wrote:
> On Jan 24, 2008 22:20 -0500, Bryan Kadzban wrote:
>> # Run this from cron each night.
>
> Probably once a week is enough, and "/etc/cron.weekly" (anacron) exists
> on most systems and will ensure that if the system was off for more than
> a week it will still be run on the next boot.

Yeah, it's probably true that once per week is enough. Do you think it
would still make sense to try and parse out the last-check time from the
LV if this gets run each week, or just unconditionally check everything
(if on AC)? Checking everything weekly might be too often (especially if
the extra disk usage ends up exposing bad bits on a disk), but maybe not.

> I would recommend also using "logger" to log something in /var/log/messages.

Yeah, that makes sense. logger is part of util-linux{,-ng}, so that's
not a huge extra dependency either.

>> echo "Don't know how to set the last-check time on $fstype..." >&2
>
> These error messages are incorrect, namely "set the last-check time" should
> be replaced with "force a check".

That's true. I was trying to get the errors to refer to what specific
information needed to be added to the script (in this case, it needs to
know how to set the last-check time), but "force a check" is probably
safer anyway. Setting the last-check time may not be the method that
every FS uses.

> Since there isn't any reason to special
> case reiserfs here, you may as well remove it.

That's what I get for deciding to handle reiser separately everywhere,
and then changing my mind later -- I forgot to go back and remove this
case. Oops... :-)

> I suspect that a nice email to the XFS and JFS folks would get them to add
> some mechanism to force a filesystem check on the next reboot.

Is the issue that those FSes don't have any such mechanism today, or is
it just that I don't know how to do this on them? (I'll have to go look
up the XFS/JFS lists, too, but that's not terribly difficult.)

>> nice logsave -as "${tmpfile}" fsck.${fstype} -p -C 0 "$dev" &&
>> nice logsave -as "${tmpfile}" fsck.${fstype} -fy -C 0 "$dev"
>
> Hmm, I'm not sure I understand what it is you want to do?

Well, neither do I, necessarily -- those arguments were copied from the
initial script that I hacked the extra stuff into (the one that Ted
posted at the start of this whole thing). :-)

I see that your script just uses -fn; that's probably simpler anyway.
What it doesn't determine is whether fsck would be able to automatically
repair the damage that it finds; I guess the question is whether this
condition should be treated as a fsck failure (requiring a reboot to
fix) or not. It probably depends on the severity of the fixes that fsck
makes...

OTOH, if you give e2fsck the -fy option, and it does make changes, its
exit status will not be zero, so it will already be treated as a failure
by this script. So the only difference is that -fn stops it from writing
to the snapshot just to have the writes thrown away; that's probably
actually good.

> and "-p" without "-f" will just check the superblock.

Yeah, I think the idea was to check the superblock first, and then check
the rest of the FS. But I think -fn is probably more explicit about what
we want fsck to do, too. (Plus, even if we do take a read-write snapshot
with LVM2, there's no point in taking up extra space by writing to the
snapshot itself, if it's just going to get thrown away.)

> For the log file it probably makes sense to keep this around with a
> timestamp if there is a failure.

And let e.g. logrotate get rid of older versions; yeah, that makes
sense.

> To find free space, use "vgs -o vg_size --noheadings ${vg}", and the
> LV size can be had from "lvs -o lv_size --noheadings ${vg}/${lv}".

Free space can also be retrieved with -o vg_free, but yeah.

> You can strip the size suffixes with "--units M --nosuffix" to get
> units of MB.

Ah, that was the bit I was missing yesterday (further down in the
script): --nosuffix.
Thanks! I also just got your message from yesterday about the guess
behind the <LV size/500> (based on the frequency of writes to the main
LV); that makes sense. And since I can get the size out of lvs, that
makes that much easier, too, so I'll just use 1/500th the LV size.

> Also good to create a more unique name than "${lv}-snap", since that
> might conflict with an existing snapshot, and if the script crashes
> the user might be wondering if that LV using 100% of the free space is
> safe to delete or not.

Yeah, that was left over from the original script as well. Changing it
makes sense.

> Please also add XFS support here,

Done, I think. I assume xfs_check doesn't need any args? (Should
fsck.xfs perhaps just exec xfs_check and pass it all the args? That's a
whole separate discussion, probably.)

> For JFS it can also use "fsck.jfs -fn $dev" to check the filesystem.

Done.

>> echo 'Background scrubbing succeeded!'
>> echo 'Background scrubbing failed! Reboot to fsck soon!'
>
> Printing the device name in these messages, and sending them to the syslog
> via logger would probably be more useful.

True; done. The severity may need a bit of tweaking, but hopefully not
much.

>> set -e
>
> Have you verified that the script doesn't exit if an fsck fails with an
> error?

No, the script exits if fsck fails with an error. That's obviously bad
-- I wasn't thinking that far ahead when I added that. It's gone now.

>> . /etc/lvcheck.conf
>
> You should check that this file exists before sourcing it, or the script will
> exit with an error

That was intended; I figured the config file would be required (back
when I first added it). But since we have decent default values for the
settings in it, it probably makes sense to make it optional now.

>> FSTYPE="`/lib/udev/vol_id -t "$DEV"`"
>
> Please use "blkid", since that is part of e2fsprogs already and avoids
> an extra dependency.

True.
Looking at the manpages, it appears that vol_id does some extra checks
to try to detect RAID members as RAID members, instead of partitions
containing a filesystem. But that would only affect this script if
someone had multiple LVs RAIDed together, and I doubt that's
well-supported elsewhere, so blkid is fine.

>> # if the date is unknown, run fsck every day. sigh.
>
> Better to write "run fsck each time the script is run".

Yeah, that makes more sense.

>> # ??? -- can lvs print vg_free in plain numbers, or do I have to
>> # figure out what a suffix of "m" means? skip the check for now.
>
> "vgs", and --nosuffix, per above.

Yep, done.

>> EMAIL='root'
>> INTERVAL=30
>> AC_UNKNOWN="ABORT"
>
> I would also make these all be defaults in the script (before this file is
> parsed), so it works as expected if /etc/lvscan.conf doesn't exist.

Since it's now optional, yes, that makes sense.

> I'd also recommend that the default for AC_UNKNOWN be CONTINUE (or possibly
> leave it unset by default and have the script not error out in this case,
> so that the script does something useful for the majority of users.

Well, it depends on whether the majority of users have laptops, or some
other hardware type (desktops, servers, etc.). I was thinking that
laptops would be more prevalent, but since this is Linux, it's probably
actually servers. OK -- CONTINUE it is, by default.

> If we are worried about the laptop case, we could add checks to see
> if the system has a PC card, since very few desktop systems have them.
> Both the commands "pccardctl info" and "cardctl info" produce no output
> on stdout if there is no PC card slot, and this could be used to decide
> between "CONTINUE" for desktops and "ABORT" for laptops.

Or stuff it into comments in the config file. Pushing the decision back
onto the user makes me a bit uncomfortable, but fuzzy decisions (ones
that aren't necessarily based on the right info) make me even less
comfortable. Hmm.
And depending how the power_supply sysfs class ends up working, maybe
this is all a moot point anyway: if it always has devices under it on
>=2.6.24, then the setting won't even matter.

For now, I'll just leave the default CONTINUE, but with comments in the
config file aimed at laptop users.

- ----

Create a script to transparently run fsck in the background on any active
LVM logical volumes, as long as the machine is on AC power, and that LV
has been last checked more than a configurable number of days ago. Also
create an optional configuration file to set various options in the
script.

Signed-Off-By: Bryan Kadzban <bryan@xxxxxxxxxxxxxxxxxxxxx>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHmpTOS5vET1Wea5wRA2XXAKCZzt9SEOSBVs4EkrI4gt3Ztl0v5wCg3gq5
1ChmnEccT+hFVo/2B/RpU8U=
=D4HV
-----END PGP SIGNATURE-----
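
For what it's worth, the snapshot sizing discussed in the message
(1/500th of the LV size, compared against the free space in the VG)
might be sketched roughly as follows. This is only an illustration:
snap_size_mb and snap_fits are made-up helper names, and the inputs are
assumed to look like lvs output with --nosuffix, i.e. a decimal number
with leading blanks.

```shell
#!/bin/sh
# Sketch only: snap_size_mb and snap_fits are hypothetical helper names.

# Print 1/500th of an LV size, in whole megabytes.
# $1 is assumed to look like lvs output, e.g. "  10240.00".
snap_size_mb() {
    # strip blanks and the fractional part so the shell can do integer math
    size=$(printf '%s' "$1" | tr -d ' ' | cut -d. -f1)
    snap=$((size / 500))
    # never ask for a zero-sized snapshot
    if [ "$snap" -lt 1 ] ; then
        snap=1
    fi
    echo "$snap"
}

# Succeed if a snapshot of $1 megabytes fits in $2 megabytes of free space.
# $2 is assumed to be in the same lvs-style format as above.
snap_fits() {
    free=$(printf '%s' "$2" | tr -d ' ' | cut -d. -f1)
    [ "$1" -le "$free" ]
}
```

With these, snap_size_mb "  10240.00" prints 20, and snap_fits 20
"  512.00" succeeds. Rounding tiny LVs up to 1 MB is my own choice here,
not something the thread settled on.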
#!/bin/sh
#
# lvcheck

# Released under the GNU General Public License, either version 2 or
# (at your option) any later version.

# Overview:
#
# Run this from cron periodically (e.g. once per week). If the
# machine is on AC power, it will run the checks; otherwise they will
# all be skipped. (If the script can't tell whether the machine is
# on AC power, it will use a setting in the configuration file
# (/etc/lvcheck.conf) to decide whether to continue with the checks,
# or abort.)
#
# The script will then decide which logical volumes are active, and
# can therefore be checked via an LVM snapshot. Each of these LVs
# will be queried to find its last-check day, and if that was more
# than $INTERVAL days ago (where INTERVAL is set in the configuration
# file as well), or if the last-check day can't be determined, then
# the script will take an LVM snapshot of that LV and run fsck on the
# snapshot. The snapshot will be set to use 1/500 the space of the
# source LV. After fsck finishes, the snapshot is destroyed.
# (Snapshots are checked serially.)
#
# Any LV that passes fsck should have its last-check time updated (in
# the real superblock, not the snapshot's superblock); any LV whose
# fsck fails will send an email notification to a configurable user
# ($EMAIL). This $EMAIL setting is optional, but its use is highly
# recommended, since if any LV fails, it will need to be checked
# manually, offline. Relevant messages are also sent to syslog.

# Set default values for configuration params. Changes to these values
# will be overwritten on an upgrade! To change these values, use
# /etc/lvcheck.conf.
EMAIL='root'
INTERVAL=30
AC_UNKNOWN="CONTINUE"

# send $2 to syslog, with severity $1
# severities are emerg/alert/crit/err/warning/notice/info/debug
log() {
    local sev="$1"
    local msg="$2"
    local arg=

    # log warning-or-higher messages to stderr as well
    # (a case statement, since "||" can't be used inside [ ])
    case "$sev" in
        emerg|alert|crit|err|warning)
            arg=-s
            ;;
    esac

    logger $arg -p user."$sev" -- "$msg"
}

# determine whether the machine is on AC power
on_ac_power() {
    local any_known=no

    # try sysfs power class first
    if [ -d /sys/class/power_supply ] ; then
        for psu in /sys/class/power_supply/* ; do
            if [ -r "${psu}/type" ] ; then
                type="`cat "${psu}/type"`"

                # ignore batteries
                [ "${type}" = "Battery" ] && continue

                online="`cat "${psu}/online"`"

                [ "${online}" = 1 ] && return 0
                [ "${online}" = 0 ] && any_known=yes
            fi
        done

        [ "${any_known}" = "yes" ] && return 1
    fi

    # else fall back to AC adapters in /proc
    if [ -d /proc/acpi/ac_adapter ] ; then
        for ac in /proc/acpi/ac_adapter/* ; do
            if [ -r "${ac}/state" ] ; then
                grep -q on-line "${ac}/state" && return 0
                grep -q off-line "${ac}/state" && any_known=yes
            elif [ -r "${ac}/status" ] ; then
                grep -q on-line "${ac}/status" && return 0
                grep -q off-line "${ac}/status" && any_known=yes
            fi
        done

        [ "${any_known}" = "yes" ] && return 1
    fi

    # use "=" rather than "==" so this also works when /bin/sh isn't bash
    if [ "$AC_UNKNOWN" = "CONTINUE" ] ; then
        return 0    # assume on AC power
    elif [ "$AC_UNKNOWN" = "ABORT" ] ; then
        return 1    # assume on battery
    else
        log "err" "Invalid value for AC_UNKNOWN in the config file"
        exit 1
    fi
}

# attempt to force a check of $1 on the next reboot
try_force_check() {
    local dev="$1"
    local fstype="$2"

    case "$fstype" in
        ext2|ext3)
            tune2fs -C 16000 -T "19000101" "$dev"
            ;;
        *)
            log "warning" "Don't know how to force a check on $fstype..."
            ;;
    esac
}

# attempt to set the last-check time on $1 to now, and the mount count to 0.
try_delay_checks() {
    local dev="$1"
    local fstype="$2"

    case "$fstype" in
        ext2|ext3)
            tune2fs -C 0 -T now "$dev"
            ;;
        reiserfs)
            # do nothing?
            ;;
        *)
            log "warning" "Don't know how to delay checks on $fstype..."
            ;;
    esac
}

# print the date that $1 was last checked, in a format that date(1) will
# accept, or "Unknown" if we don't know how to find that date.
try_get_check_date() {
    local dev="$1"
    local fstype="$2"

    case "$fstype" in
        ext2|ext3)
            dumpe2fs -h "$dev" 2>/dev/null | grep 'Last checked:' | \
                sed -e 's/Last checked:[[:space:]]*//'
            ;;
        *)
            # TODO: add support for various FSes here
            echo "Unknown"
            ;;
    esac
}

# check the FS on $1 passively, saving output to $3.
perform_check() {
    local dev="$1"
    local fstype="$2"
    local tmpfile="$3"

    case "$fstype" in
        ext2|ext3)
            nice logsave -as "${tmpfile}" e2fsck -fn "$dev"
            return $?
            ;;
        reiserfs)
            echo Yes | nice logsave -as "${tmpfile}" fsck.reiserfs --check "$dev"
            # apparently can't fail? let's hope not...
            return 0
            ;;
        xfs)
            nice logsave -as "${tmpfile}" xfs_check "$dev"
            return $?
            ;;
        jfs)
            nice logsave -as "${tmpfile}" fsck.jfs -fn "$dev"
            return $?
            ;;
        *)
            log "warning" "Don't know how to check $fstype filesystems passively: assuming OK."
            # make the "assuming OK" explicit instead of relying on logger's status
            return 0
            ;;
    esac
}

# do everything needed to check and reset dates and counters on /dev/$1/$2.
check_fs() {
    local vg="$1"
    local lv="$2"
    local fstype="$3"
    local snapsize="$4"

    local tmpfile="`mktemp -t e2fsck.log.XXXXXXXXXX`"
    local errlog="/var/log/lvcheck-${vg}@${lv}-`date +'%Y%m%d'`"
    local snaplvbase="${lv}-lvcheck-temp"
    local snaplv="${snaplvbase}-`date +'%Y%m%d'`"

    # clean up any left-over snapshot LVs
    for lvtemp in /dev/${vg}/${snaplvbase}* ; do
        if [ -e "$lvtemp" ] ; then
            # Assume the script won't run more than one instance at a time?

            # strip the leading "/dev/" so that lvremove gets "vg/lv"
            lvremove -f "${lvtemp#/dev/}"
            log "warning" "Found stale snapshot $lvtemp: deleting."
        fi
    done

    # and create this one. $snapsize is in megabytes, so use -L (a size)
    # rather than -l (a number of extents).
    if ! lvcreate -s -L "${snapsize}M" -n "${snaplv}" "${vg}/${lv}" ; then
        log "err" "Can't create a snapshot of /dev/${vg}/${lv}: skipping the check."
        rm -f "$tmpfile"
        return 1
    fi

    if perform_check "/dev/${vg}/${snaplv}" "${fstype}" "${tmpfile}" ; then
        log "info" "Background scrubbing of /dev/${vg}/${lv} succeeded."
        try_delay_checks "/dev/${vg}/${lv}" "$fstype"
    else
        log "err" "Background scrubbing of /dev/${vg}/${lv} failed: run fsck offline soon!"
        try_force_check "/dev/${vg}/${lv}" "$fstype"

        if test -n "$EMAIL"; then
            mail -s "Fsck of /dev/${vg}/${lv} failed!" $EMAIL < "$tmpfile"
        fi

        # save the log file in /var/log in case mail is disabled
        mv "$tmpfile" "$errlog"
    fi

    rm -f "$tmpfile"
    lvremove -f "${vg}/${snaplv}"
}

# strip blanks and any fractional part from lvs output read on stdin,
# so that expr and [ -gt ] get plain integers
to_int() {
    tr -d ' ' | cut -d. -f1
}

# pull in configuration -- overwrite the defaults above if the file exists
[ -r /etc/lvcheck.conf ] && . /etc/lvcheck.conf

# check whether the machine is on AC power: if not, skip fsck
on_ac_power || exit 0

# parse up lvscan output
lvscan 2>&1 | grep ACTIVE | awk '{print $2;}' | \
while read DEV ; do
    # remove the single quotes around the device name
    DEV="`echo "$DEV" | tr -d \'`"

    # get the FS type: blkid prints TYPE="blah"
    # (clear TYPE first, so a failed lookup doesn't reuse the previous LV's type)
    TYPE=
    eval `blkid -s TYPE "$DEV" | cut -d' ' -f2`

    # get the last-check time
    check_date=`try_get_check_date "$DEV" "$TYPE"`

    # if the date is unknown, run fsck every time the script runs. sigh.
    if [ "$check_date" != "Unknown" ] ; then
        # add $INTERVAL days, and throw away the time portion
        check_day=`date --date="$check_date $INTERVAL days" +'%Y%m%d'`

        # get today's date, and skip the check if it's not within the interval
        today=`date +'%Y%m%d'`
        [ "$check_day" -gt "$today" ] && continue
    fi

    # get the free space and LV size, in whole megs. lowercase "m" gives
    # multiples of 1024, which matches the "M" suffix passed to lvcreate -L.
    SPACE="`lvs --noheadings --units m --nosuffix -o vg_free "$DEV" | to_int`"
    SIZE="`lvs --noheadings --units m --nosuffix -o lv_size "$DEV" | to_int`"
    SNAPSIZE="`expr "$SIZE" / 500`"

    if [ "$SNAPSIZE" -gt "$SPACE" ] ; then
        log "err" "Can't take a snapshot of $DEV: not enough free space in the VG."
        continue
    fi

    # get the volume group and logical volume names (lvs pads them with blanks)
    VG="`lvs --noheadings -o vg_name "$DEV" | tr -d ' '`"
    LV="`lvs --noheadings -o lv_name "$DEV" | tr -d ' '`"

    # check it
    check_fs "$VG" "$LV" "$TYPE" "$SNAPSIZE"
done
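
One possible way to avoid running eval on the blkid output in the loop
above is to pull the value out with sed instead. This is only a sketch:
fstype_from_blkid is a hypothetical helper, not part of lvcheck, and it
assumes that "blkid -s TYPE" prints a line of the form
/dev/vg/lv: TYPE="ext3".

```shell
#!/bin/sh
# Sketch only: fstype_from_blkid is a hypothetical helper, not part of lvcheck.

# Print the filesystem type found in one line of "blkid -s TYPE" output,
# e.g. '/dev/vg0/home: TYPE="ext3"'. Prints nothing if there is no TYPE.
fstype_from_blkid() {
    # require ":" or a blank before TYPE, so SEC_TYPE="..." is not matched
    printf '%s\n' "$1" | sed -n 's/^.*[: ]TYPE="\([^"]*\)".*$/\1/p'
}

# In the main loop this might be used as (assumption, not tested on LVM):
#   TYPE=$(fstype_from_blkid "$(blkid -s TYPE "$DEV")")
```

The upside is that nothing from the device's metadata ever gets
executed by the shell; an empty result can then be treated the same way
as an unknown filesystem type.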
#!/bin/sh

# lvcheck configuration file

# This file follows the pattern of sshd_config: default
# values are shown here, commented-out.

# EMAIL
# Address to send failure notifications to. If empty,
# failure notifications will not be sent.

#EMAIL='root'

# INTERVAL
# Days to wait between checks. All LVs use the same
# INTERVAL, but the "days since last check" value can
# be different per LV, since that value is stored in
# the filesystem superblock.

#INTERVAL=30

# AC_UNKNOWN
# Whether to run the fsck checks if the script can't
# determine whether the machine is on AC power. Laptop
# users will want to set this to ABORT, while server and
# desktop users will probably want to set this to
# CONTINUE. Those are the only two valid values.

#AC_UNKNOWN="CONTINUE"
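
As an illustration, a laptop user who wants the checks skipped whenever
the power state can't be determined might end up with something like
this in /etc/lvcheck.conf. The address and the 14-day interval are
made-up example values, not recommendations.

```shell
# /etc/lvcheck.conf -- example values only (assumptions, not the defaults)
EMAIL='admin@example.com'
INTERVAL=14
AC_UNKNOWN="ABORT"
```

Since the script sources this file after setting its defaults, any
setting left out here keeps its default value.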