Re: forced fsck (again?)

Bryan Kadzban <bryan@xxxxxxxxxxxxxxxxxxxxx> · Mon, 28 Jan 2008 19:56:50 -0500

-----BEGIN PGP SIGNED MESSAGE-----
Hash: RIPEMD160

Andreas Dilger wrote:
> On Jan 25, 2008  21:02 -0500, Bryan Kadzban wrote:
>> logger $arg -p user."$sev" -- "$msg"
> 
> This should use "-t lvcheck" so that it reports what program is generating
> the message.

Yep, that'd be useful.

>> tune2fs -C 16000 -T "19000101" "$dev"
> 
> I'm a tiny bit reluctant to overwrite the "last checked" date, since this
> might be useful information for the administrator (i.e. it will tell the
> interval wherein the corruption was detected).  Setting the "mount count"
> is enough to force a check, and the mount count itself can be reverse
> engineered from "reboot" messages in the "last" log.

That works as long as the user hasn't set a maximum mount count higher
than 16000, which I think is highly unlikely.  I think the benefit of
being able to know (approximately) when corruption started is worth
that small risk, though.
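
To make the "higher than 16000" caveat concrete, here is a rough sketch
of the comparison I have in mind.  The helper name count_forces_check is
made up for this example, and it assumes that a maximum mount count of 0
or -1 means count-based checks are disabled, so treat it as an
illustration rather than a description of what e2fsck does internally:

```shell
# Hypothetical helper: succeed if a mount count of $1 is high enough to
# trigger a check on a filesystem whose maximum mount count is $2.
# Assumption: a maximum of 0 or -1 means count-based checks are disabled.
count_forces_check() {
	[ "$2" -gt 0 ] && [ "$1" -ge "$2" ]
}

# Example: would 16000 force a check with a maximum mount count of 30?
if count_forces_check 16000 30 ; then
	echo "16000 is enough"
fi
```

The maximum itself could probably be pulled out of the "Maximum mount
count:" line of "dumpe2fs -h", the same way the script already reads
"Last checked:".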

> It is a lot clearer if the "cases" (ext2|ext3|ext4) are aligned with the
> "case" statement,

I see what you mean.  The script just uses vim's default autoindent
levels, but I can change the cases.

>> reiserfs)
>> 	# do nothing?
> 
> I thought you were going to remove the empty reiserfs cases?

Er, I was; I think I was looking at the wrong case last time around.
This one's gone now as well.

>> local tmpfile=`mktemp -t e2fsck.log.XXXXXXXXXX`
> 
> Shouldn't be "e2fsck.log"?  Maybe "lvcheck.log.XXXXXXXXX"?

Yeah, that'd be better; that's more leftover code from the original script.

>> # Assume the script won't run more than one instance at a time?
>> lvremove -f "${lvtemp##/dev}"
> 
> Should check the error return and bail out of script if there is an error.

Will that catch the "more than one instance at a time" case (e.g. if
another script run is still running e2fsck on this snapshot)?  Assuming
lvremove can fail (and it probably can), it's probably a good idea to
check it in any case, but if running e2fsck makes lvremove fail (until
e2fsck finishes), that's a decent way to get rid of the comment too.

Also, I think it'd be better to skip just the current FS, rather than an
"exit 1" type bail-out, right?
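
For what it's worth, a minimal sketch of the skip-and-continue idea.
Both function names are made up for this example: check_one is only a
stand-in for check_fs (it fails for any name ending in "-bad"), and
check_all plays the part of the main loop:

```shell
# Hypothetical stand-in for check_fs: fails for any name ending in "-bad".
check_one() {
	case "$1" in
	*-bad) return 1 ;;
	*) return 0 ;;
	esac
}

# Check every name given.  A failure skips that one filesystem and moves
# on to the next; the return value is non-zero if anything was skipped.
check_all() (
	failed=0
	for fs in "$@" ; do
		if ! check_one "$fs" ; then
			echo "skipped: $fs"
			failed=1
			continue
		fi
		echo "checked: $fs"
	done
	return "$failed"
)
```

Something like "check_all root home-bad var" would then report the one
failure and still get to "var", instead of stopping at the first
problem.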

> MINFREE=0	# megabytes to leave free in each volume group
> MINSNAP=256	# megabytes for minimum snapshot size.

I've added something very similar to this logic, but I changed the
checks around a bit.  I think it makes more sense this way (doing the
overall space check first, and then the limits second), unless this
logic disallows some valid combinations?
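
Roughly, the ordering I mean looks like the sketch below.  The function
snapshot_size is a hypothetical helper written just for this mail (the
script itself does this inline), and it assumes every value is a whole
number of megabytes:

```shell
# Hypothetical helper: print the snapshot size (in MB) for an LV, or
# return 1 if the volume group can't spare at least MINSNAP megabytes.
#  $1 = LV size, $2 = VG free space, $3 = MINFREE, $4 = MINSNAP
snapshot_size() (
	size="$1" ; space="$2" ; minfree="$3" ; minsnap="$4"

	# overall space check first
	avail=$((space - minfree))
	[ "$avail" -le 0 ] && return 1
	[ "$minsnap" -gt "$avail" ] && return 1

	# then the limits: start at 1/500 of the LV, clamp to [minsnap, avail]
	snap=$((size / 500))
	[ "$snap" -lt "$minsnap" ] && snap="$minsnap"
	[ "$snap" -gt "$avail" ] && snap="$avail"

	echo "$snap"
)
```

Because the first two tests guarantee MINSNAP <= AVAIL, the two clamps
can never push the result outside that range, so no third check is
needed.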

(Still trying to decide how to handle logging *fsck output, and what to
do with the file, based on your other message...)

- -----

Create a script to transparently run fsck in the background on any
active LVM logical volumes, as long as the machine is on AC power and
the LV was last checked more than a configurable number of days ago.
Also create an optional configuration file to set various options in
the script.

Signed-Off-By: Bryan Kadzban <bryan@xxxxxxxxxxxxxxxxxxxxx>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHnnnRS5vET1Wea5wRAw0iAJ9wcLyfBSaH5FSIJNH0YakzDCUvjwCgnJEH
lPScP39vBYIIjOQPiftgDs8=
=XjFF
-----END PGP SIGNATURE-----
#!/bin/bash
#
# lvcheck

# Released under the GNU General Public License, either version 2 or
#  (at your option) any later version.

# Overview:
#
#  Run this from cron periodically (e.g. once per week).  If the
#  machine is on AC power, it will run the checks; otherwise they will
#  all be skipped.  (If the script can't tell whether the machine is
#  on AC power, it will use a setting in the configuration file
#  (/etc/lvcheck.conf) to decide whether to continue with the checks,
#  or abort.)
#
#  The script will then decide which logical volumes are active, and
#  can therefore be checked via an LVM snapshot.  Each of these LVs
#  will be queried to find its last-check day, and if that was more
#  than $INTERVAL days ago (where INTERVAL is set in the configuration
#  file as well), or if the last-check day can't be determined, then
#  the script will take an LVM snapshot of that LV and run fsck on the
#  snapshot.  The snapshot will be set to use 1/500 the space of the
#  source LV, but at least $MINSNAP megabytes, and never more than the
#  volume group's free space minus $MINFREE megabytes.  After fsck
#  finishes, the snapshot is destroyed.  (Snapshots are checked
#  serially.)
#
#  Any LV that passes fsck should have its last-check time updated (in
#  the real superblock, not the snapshot's superblock); any LV whose
#  fsck fails will send an email notification to a configurable user
#  ($EMAIL).  This $EMAIL setting is optional, but its use is highly
#  recommended, since if any LV fails, it will need to be checked
#  manually, offline.  Relevant messages are also sent to syslog.

# Set default values for configuration params.  Changes to these values
#  will be overwritten on an upgrade!  To change these values, use
#  /etc/lvcheck.conf.
EMAIL='root'
INTERVAL=30
AC_UNKNOWN="CONTINUE"
MINSNAP=256
MINFREE=0

# send $2 to syslog, with severity $1
# severities are emerg/alert/crit/err/warning/notice/info/debug
function log() {
	local sev="$1"
	local msg="$2"
	local arg=

	# log warning-or-higher messages to stderr as well
	case "$sev" in
	emerg|alert|crit|err|warning) arg=-s ;;
	esac

	logger -t lvcheck $arg -p user."$sev" -- "$msg"
}

# determine whether the machine is on AC power
function on_ac_power() {
	local any_known=no

	# try sysfs power class first
	if [ -d /sys/class/power_supply ] ; then
		for psu in /sys/class/power_supply/* ; do
			if [ -r "${psu}/type" ] ; then
				type="`cat "${psu}/type"`"

				# ignore batteries
				[ "${type}" = "Battery" ] && continue

				online="`cat "${psu}/online"`"

				[ "${online}" = 1 ] && return 0
				[ "${online}" = 0 ] && any_known=yes
			fi
		done

		[ "${any_known}" = "yes" ] && return 1
	fi

	# else fall back to AC adapters in /proc
	if [ -d /proc/acpi/ac_adapter ] ; then
		for ac in /proc/acpi/ac_adapter/* ; do
			if [ -r "${ac}/state" ] ; then
				grep -q on-line "${ac}/state" && return 0
				grep -q off-line "${ac}/state" && any_known=yes
			elif [ -r "${ac}/status" ] ; then
				grep -q on-line "${ac}/status" && return 0
				grep -q off-line "${ac}/status" && any_known=yes
			fi
		done

		[ "${any_known}" = "yes" ] && return 1
	fi

	if [ "$AC_UNKNOWN" == "CONTINUE" ] ; then
		return 0   # assume on AC power
	elif [ "$AC_UNKNOWN" == "ABORT" ] ; then
		return 1   # assume on battery
	else
		log "err" "Invalid value for AC_UNKNOWN in the config file"
		exit 1
	fi
}

# attempt to force a check of $1 on the next reboot
function try_force_check() {
	local dev="$1"
	local fstype="$2"

	case "$fstype" in
	ext2|ext3)
		tune2fs -C 16000 "$dev"
		;;
	*)
		log "warning" "Don't know how to force a check on $fstype..."
		;;
	esac
}

# attempt to set the last-check time on $1 to now, and the mount count to 0.
function try_delay_checks() {
	local dev="$1"
	local fstype="$2"

	case "$fstype" in
	ext2|ext3)
		tune2fs -C 0 -T now "$dev"
		;;
	*)
		log "warning" "Don't know how to delay checks on $fstype..."
		;;
	esac
}

# print the date that $1 was last checked, in a format that date(1) will
#  accept, or "Unknown" if we don't know how to find that date.
function try_get_check_date() {
	local dev="$1"
	local fstype="$2"

	case "$fstype" in
	ext2|ext3)
		dumpe2fs -h "$dev" 2>/dev/null | grep 'Last checked:' | \
				sed -e 's/Last checked:[[:space:]]*//'
		;;
	*)
		# TODO: add support for various FSes here
		echo "Unknown"
		;;
	esac
}

# check the FS on $1 passively, saving output to $3.
function perform_check() {
	local dev="$1"
	local fstype="$2"
	local tmpfile="$3"

	case "$fstype" in
	ext2|ext3)
		nice logsave -as "${tmpfile}" e2fsck -fn "$dev"
		return $?
		;;
	reiserfs)
		echo Yes | nice logsave -as "${tmpfile}" fsck.reiserfs --check "$dev"
		# apparently can't fail?  let's hope not...
		return 0
		;;
	xfs)
		nice logsave -as "${tmpfile}" xfs_check "$dev"
		return $?
		;;
	jfs)
		nice logsave -as "${tmpfile}" fsck.jfs -fn "$dev"
		return $?
		;;
	*)
		log "warning" "Don't know how to check $fstype filesystems passively: assuming OK."
		return 0
		;;
	esac
}

# do everything needed to check and reset dates and counters on /dev/$1/$2.
function check_fs() {
	local vg="$1"
	local lv="$2"
	local fstype="$3"
	local snapsize="$4"

	local tmpfile=`mktemp -t lvcheck.log.XXXXXXXXXX`
	local errlog="/var/log/lvcheck-${vg}@${lv}-`date +'%Y%m%d'`"
	local snaplvbase="${lv}-lvcheck-temp"
	local snaplv="${snaplvbase}-`date +'%Y%m%d'`"

	# clean up any left-over snapshot LVs
	for lvtemp in /dev/${vg}/${snaplvbase}* ; do
		if [ -e "$lvtemp" ] ; then
			# Assume the script won't run more than one instance at a time?

			log "warning" "Found stale snapshot $lvtemp: attempting to remove."

			if ! lvremove -f "${lvtemp#/dev/}" ; then
				log "err" "Could not delete stale snapshot $lvtemp"
				return 1
			fi
		fi
	done

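	# and create this one ($snapsize is in megabytes, so use -L, not -l)
	if ! lvcreate -s -L "${snapsize}m" -n "${snaplv}" "${vg}/${lv}" ; then
		log "err" "Could not create snapshot of /dev/${vg}/${lv}: skipping."
		rm -f "$tmpfile"
		return 1
	fi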

	if perform_check "/dev/${vg}/${snaplv}" "${fstype}" "${tmpfile}" ; then
		log "info" "Background scrubbing of /dev/${vg}/${lv} succeeded."
		try_delay_checks "/dev/${vg}/${lv}" "$fstype"
	else
		log "err" "Background scrubbing of /dev/${vg}/${lv} failed: run fsck offline soon!"
		try_force_check "/dev/${vg}/${lv}" "$fstype"

		if test -n "$EMAIL"; then
			mail -s "Fsck of /dev/${vg}/${lv} failed!" $EMAIL < "$tmpfile"
		fi

		# save the log file in /var/log in case mail is disabled
		mv "$tmpfile" "$errlog"
	fi

	rm -f "$tmpfile"
	lvremove -f "${vg}/${snaplv}"
}

# pull in configuration -- overwrite the defaults above if the file exists
[ -r /etc/lvcheck.conf ] && . /etc/lvcheck.conf

# check whether the machine is on AC power: if not, skip fsck
on_ac_power || exit 0

# parse up lvscan output
lvscan 2>&1 | grep ACTIVE | awk '{print $2;}' | \
while read DEV ; do
	# remove the single quotes around the device name
	DEV="`echo "$DEV" | tr -d \'`"

	# get the FS type (left empty if blkid can't identify it, rather
	#  than reusing the value from the previous LV)
	TYPE="`blkid -s TYPE -o value "$DEV"`"

	# get the last-check time
	check_date=`try_get_check_date "$DEV" "$TYPE"`

	# if the date is unknown, run fsck every time the script runs.  sigh.
	if [ "$check_date" != "Unknown" ] ; then
		# add $INTERVAL days, and throw away the time portion
		check_day=`date --date="$check_date $INTERVAL days" +'%Y%m%d'`

		# get today's date, and skip the check if it's not within the interval
		today=`date +'%Y%m%d'`
		[ "$check_day" -gt "$today" ] && continue
	fi

	# get the volume group and logical volume names (lvs pads its output
	#  with spaces, so strip them)
	VG="`lvs --noheadings -o vg_name "$DEV" | tr -d ' '`"
	LV="`lvs --noheadings -o lv_name "$DEV" | tr -d ' '`"

	# get the free space and LV size (in megs, dropping the fractional
	#  part so expr can handle them), guess at the snapshot size, and see
	#  how much the admin will let us use (keeping MINFREE available)
	SPACE="`lvs --noheadings --units m --nosuffix -o vg_free "$DEV" | tr -d ' ' | cut -d. -f1`"
	SIZE="`lvs --noheadings --units m --nosuffix -o lv_size "$DEV" | tr -d ' ' | cut -d. -f1`"
	SNAPSIZE="`expr "$SIZE" / 500`"
	AVAIL="`expr "$SPACE" - "$MINFREE"`"

	# if we don't even have MINSNAP space available, skip the LV
	if [ "$MINSNAP" -gt "$AVAIL" -o "$AVAIL" -le 0 ] ; then
		log "warning" "Not enough free space on volume group for ${DEV}; skipping"
		continue
	fi

	# make snapshot large enough to handle e.g. journal and other updates
	[ "$SNAPSIZE" -lt "$MINSNAP" ] && SNAPSIZE="$MINSNAP"

	# limit snapshot to available space (VG space minus min-free)
	[ "$SNAPSIZE" -gt "$AVAIL" ] && SNAPSIZE="$AVAIL"

	# don't need to check SNAPSIZE again: MINSNAP <= AVAIL, MINSNAP <= SNAPSIZE,
	#  and SNAPSIZE <= AVAIL, combined, means SNAPSIZE must be between MINSNAP
	#  and AVAIL, which is what we need -- assuming AVAIL > 0

	# check it
	check_fs "$VG" "$LV" "$TYPE" "$SNAPSIZE"
done

#!/bin/sh

# lvcheck configuration file

# This file follows the pattern of sshd_config: default
#  values are shown here, commented-out.

#  EMAIL
#   Address to send failure notifications to.  If empty,
#   failure notifications will not be sent.

#EMAIL='root'

#  INTERVAL
#   Days to wait between checks.  All LVs use the same
#   INTERVAL, but the "days since last check" value can
#   be different per LV, since that value is stored in
#   the filesystem superblock.

#INTERVAL=30

#  AC_UNKNOWN
#   Whether to run the fsck checks if the script can't
#   determine whether the machine is on AC power.  Laptop
#   users will want to set this to ABORT, while server and
#   desktop users will probably want to set this to
#   CONTINUE.  Those are the only two valid values.

#AC_UNKNOWN="CONTINUE"

#  MINSNAP
#   Minimum snapshot size to take, in megabytes.  The
#   default snapshot size is 1/500 the size of the logical
#   volume, but if that size is less than MINSNAP, the
#   script will use MINSNAP instead.  This should be large
#   enough to handle e.g. journal updates, and other disk
#   changes that require (semi-)constant space.

#MINSNAP=256

#  MINFREE
#   Minimum amount of space (in megabytes) to keep free in
#   each volume group when creating snapshots.

#MINFREE=0
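
As an illustration, a laptop user might end up with something like the
following in /etc/lvcheck.conf.  The specific values (the address, 14
days, 512 and 1024 megabytes) are made-up examples, not recommendations:

```shell
# /etc/lvcheck.conf: example overrides for a laptop (illustrative values)
EMAIL='admin@example.com'   # hypothetical address
INTERVAL=14                 # check every two weeks
AC_UNKNOWN="ABORT"          # skip the checks if the power state is unknown
MINSNAP=512                 # megabytes
MINFREE=1024                # megabytes to keep free in each volume group
```

Any setting left out of the file keeps the default from the top of the
script.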
