Re: Disk Monitoring

Wolfgang Denk <wd@xxxxxxx> · Wed, 28 Jun 2017 15:19:17 +0200

Dear Gandalf,

In message <CAJH6TXgvrVckHDmh1oiN9mupLrsS2NP3J44bG1_wE9Nnx4=yHQ@xxxxxxxxxxxxxx> you wrote:
> 
> 1) all raid controllers have proactive monitoring features, like
> patrol read, consistency check and (more or less) some SMART
> integration.
> Any counterpart in mdadm?

As Wol already pointed out, you should use  smartctl  to monitor
the state of the disk drives, ideally on a regular basis.  Changes
(increases) of numbers like "Reallocated Sectors", "Current Pending
Sectors" or "Offline Uncorrectable Sectors" are always suspicious.
If they increase just by one, and then stay constant for weeks, you
can probably ignore it.  But if you see I/O errors in the system
logs and/or "Reallocated Sectors" increasing every few days, then you
should not wait much longer to replace the respective drive.
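
To look at just these three counters by hand, something like the
following minimal sketch may help.  It assumes the usual ATA attribute
table printed by "smartctl -A", where the raw value is the 10th column;
the function name smart_counters is made up for illustration and is not
part of the attached scripts.

```shell
#!/bin/sh
# Hypothetical helper: read a smartctl attribute table on stdin and
# print "NAME RAW_VALUE" for the three counters discussed above.
# Assumes the standard ATA layout, with the raw value in column 10.
smart_counters() {
	awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ {
		print $2, $10
	}'
}

# Usage (needs root; /dev/sda is only an example device):
#   smartctl -A /dev/sda | smart_counters
```

Saving that output from time to time and comparing it with the previous
run shows whether a counter is still growing or has settled.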

Attached are two very simple scripts I use for this purpose;
"disk-test" simply runs smartctl on all /dev/sd? devices and parses
the output.  The result is something like this:

$ sudo disk-test
=== /dev/sda : ST1000NM0011 S/N Z1N2RA6E *** ERRORS ***
        Reallocated Sectors:     1
=== /dev/sdb : ST2000NM0033-9ZM175 S/N Z1X1J1K9 OK
=== /dev/sdc : ST2000NM0033-9ZM175 S/N Z1X1JEF6 OK
=== /dev/sdd : ST2000NM0033-9ZM175 S/N Z1X4XSN9 OK
=== /dev/sde : ST2000NM0033-9ZM175 S/N Z1X4X6G8 OK
=== /dev/sdf : ST2000NM0033-9ZM175 S/N Z1X54EA1 OK
=== /dev/sdg : ST2000NM0033-9ZM175 S/N Z1X5443W OK
=== /dev/sdh : ST2000NM0033-9ZM175 S/N Z1X4XAHQ OK
=== /dev/sdi : ST2000NM0033-9ZM175 S/N Z1X4X6NB OK
=== /dev/sdj : TOSHIBA MK1002TSKB S/N 32E3K0K2F OK
=== /dev/sdk : TOSHIBA MK1002TSKB S/N 32F3K0PRF OK
=== /dev/sdl : TOSHIBA MK1002TSKB S/N 32H3K10CF *** ERRORS ***
        Reallocated Sectors:     1
=== /dev/sdm : TOSHIBA MK1002TSKB S/N 32H3K0ZLF OK
=== /dev/sdn : TOSHIBA MK1002TSKB S/N 32H3K104F OK
=== /dev/sdo : TOSHIBA MK1002TSKB S/N 32H1K31DF OK
=== /dev/sdp : TOSHIBA MK1002TSKB S/N 32F3K0PUF OK
=== /dev/sdq : TOSHIBA MK1002TSKB S/N 32E3K0JZF OK

Here I have two drives with 1 reallocated sector each, which I
consider harmless as the count has stayed constant for several months.

The second script "disk-watch" is intended to be run as a cron job
on a regular basis (here usually twice per day).  It will send out
email whenever the state changes (don't forget to adjust the MAIL_TO
setting).  You may also want to clean up the entries in /var/log/diskwatch
every now and then (or better add it to your logrotate
configuration).
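
For the manual clean-up variant, a minimal sketch could look like the
one below.  The function name prune_diskwatch_logs, the 90 day
retention and the install path /usr/local/sbin/disk-watch in the
crontab comment are assumptions for illustration only; adjust them to
your setup.

```shell
#!/bin/sh
# Hypothetical helper: remove disk-watch result files older than a
# given number of days.  "-type f" matches regular files only, so the
# "latest" and "previous" symlinks are left alone.
prune_diskwatch_logs() {
	dir=$1
	keep_days=$2
	find "$dir" -type f -mtime +"$keep_days" -exec rm -f {} +
}

# Example call (the 90 day retention is an assumption):
#   prune_diskwatch_logs /var/log/diskwatch 90
#
# Example root crontab entry running disk-watch twice per day
# (times and install path are assumptions):
#   15 6,18 * * * /usr/local/sbin/disk-watch
```

Since every run writes a fresh timestamped file, the files that
"latest" and "previous" point to are always recent and survive the
pruning.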

HTH.

Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,      Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@xxxxxxx
Yes, it's a technical challenge, and  you  have  to  kind  of  admire
people  who go to the lengths of actually implementing it, but at the
same time you wonder about their IQ...
         --  Linus Torvalds in <5phda5$ml6$1@xxxxxxxxxxxxxxxxxxxxxxx>

#!/bin/sh
# disk-test: run smartctl on all /dev/sd? devices and report
# suspicious SMART counters.

DISKS="$(echo /dev/sd?)"

PATH=$PATH:/sbin:/usr/sbin

for i in ${DISKS}
do
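	# Keep the identity lines plus any error indicators; the final grep
	# drops attribute lines whose raw value is 0.
	SMARTDATA=$(smartctl -a "$i" | \
	grep -E 'Device Model:|Serial Number:|Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|failed|Unknown USB' | \
	grep -v ' -  *0$')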
	LINES=$(echo "${SMARTDATA}" | wc -l)
	HEAD=$(echo "${SMARTDATA}" | \
	       sed -n -e 's/Device Model: //p' \
		      -e 's!Serial Number:!S/N!p')	
	BODY=$(echo "${SMARTDATA}" | \
	       awk '$2 ~ /Reallocated_Sector_Ct/	{ printf "Reallocated Sectors:   %3d\n", $10 }
		    $2 ~ /Current_Pending_Sector/	{ printf "Current Pending Sect:  %3d\n", $10 }
		    $2 ~ /Offline_Uncorrectable/	{ printf "Offline Uncorrectable: %3d\n", $10 }
		    $0 ~ /failed:.*AMCC/		{ printf "Unsupported AMCC/3ware controller\n" }
		    $0 ~ /SMART command failed/		{ printf "Device does not support SMART\n" }
		    $0 ~ /Unknown USB bridge/		{ printf "Unknown USB bridge\n" }
		'
	     )
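	# Exactly two lines (model and serial number) means nothing
	# suspicious was matched.  ${HEAD} stays unquoted below so that
	# both lines collapse onto one.
	if [ $LINES -eq 2 ]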
	then
		echo === $i : ${HEAD} OK
	else
		echo === $i : ${HEAD} "*** ERRORS ***"
		echo "${BODY}" | sed -e 's/^/	/'
	fi
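done

#!/bin/sh
# disk-watch: run disk-test, keep the result in ${D_LOGDIR}, and send
# mail when the output differs from the previous run.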

D_TEST=/usr/local/sbin/disk-test
D_LOGDIR=/var/log/diskwatch
MAIL_TO="root"

[ -x ${D_TEST} ] || { echo "ERROR: cannot execute ${D_TEST}" >&2 ; exit 1 ; }

[ -d ${D_LOGDIR} ] || \
	mkdir -p ${D_LOGDIR} || \
		{ echo "ERROR: cannot create ${D_LOGDIR}" >&2 ; exit 1 ; }

cd ${D_LOGDIR} || { echo "ERROR: cannot cd ${D_LOGDIR}" >&2 ; exit 1 ; }

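# Rotate: the old "latest" symlink becomes "previous".
rm -f previous

[ -L latest ] && mv latest previous

# Store the new result under a timestamped name and point "latest" at it.
NOW=$(date "+%F-%T")

${D_TEST} >"${NOW}"

ln -s "${NOW}" latest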

DIFF=''

[ -r previous ] && DIFF=$(diff -u previous latest)

[ -z "${DIFF}" ] && exit 0

mailx -s "$(hostname): SMART DISK WARNING" ${MAIL_TO} <<+++
Disk status change:
${DIFF}

Recent results:
$(cat latest)
+++