Keith Lewis wrote:
Thursday 22 Feb 2007
An attempt was made to make sure all the computers in a certain group
had a common set of rpms installed.
To make this easier, a non-RedHat rpm was copied to a disk that was
mounted on most of the machines, and installed from there. This broke the RPM
database on those machines. After the install, most rpm commands got this:
error: rpmdbNextIterator: skipping h# 2325 Header V3 DSA signature: BAD, key ID 99b62126
This was fixed by rpm -e <the above rpm> ; rpm --rebuilddb
It was noticed that all the machines which had installed the shared
rpm had failed in this way, but none of the machines that had installed from a
copy on local disk.
Using `sum' it was noticed that all the machines except one saw the
file as corrupt. The one machine (called `T' from here on) was the one
which had done the original copy. It still saw the file as pristine, i.e.
not corrupt.
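
A minimal sketch of the comparison described above, to be run on each
node in turn, since the nodes disagreed about the same file. The
function name compare_sums and the paths in the usage note are
hypothetical; only `sum' itself comes from the account above.

```shell
#!/bin/sh
# compare_sums REF COPY...   (hypothetical helper, not part of GFS or rpm)
# Prints the `sum' output for REF and for each COPY, marking any copy
# whose checksum or block count differs from REF.
# Returns 1 if at least one copy differs, 0 otherwise.
compare_sums() {
    ref=$1
    shift
    # Reading from stdin keeps the file name out of the `sum' output,
    # so the two strings can be compared directly.
    ref_sum=$(sum < "$ref")
    echo "reference  $ref_sum  $ref"
    status=0
    for f in "$@"; do
        s=$(sum < "$f")
        if [ "$s" = "$ref_sum" ]; then
            echo "ok         $s  $f"
        else
            echo "DIFFERENT  $s  $f"
            status=1
        fi
    done
    return $status
}
```

Usage, assuming a known-good local copy in /tmp and the shared disk
mounted on /mnt (both paths are assumptions):
compare_sums /tmp/R-2.3.1-1.rh3AS.i386.rpm /mnt/R-2.3.1-1.rh3AS.i386.rpm
A node that reports "ok" here may only be reading back its own cached
data, so the result from one node alone is not conclusive.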
The shared filesystem is based on GFS, but due to a history of network
and SAN problems causing fence events which seriously degrade our
service level, GFS is restricted to as few machines as possible. Currently
only three machines (called C, S and W from here on) mount the GFS disk
directly. Machines C and W export it to the rest of the group via NFS.
`T' mounted the GFS disk via NFS through W. `T' was the only machine
to see the GFS copy as pristine. All other machines, including C, S and W,
irrespective of whether they mounted the disk by GFS directly or by NFS saw
the file as corrupt.
`T' then dismounted the disk via W and remounted it via C. It then saw
the file as corrupt, but it then made another copy of the file from its local
disk to the GFS disk, and this copy too was seen as corrupt by all other
machines, while `T' itself saw it as pristine.
Other machines had no problems copying the same file from their local
disk to the GFS disk.
An attempt was made to mount the GFS disk directly on T:
/etc/init.d/pool start
/etc/init.d/ccsd start
/etc/init.d/lock_gulmd start
/etc/init.d/gfs start
mount /dev/pool/pool_gfs01 -t gfs /mnt
(I've never mounted a GFS disk in this way before, so this may be a
problem - usually it's in fstab and `/etc/init.d/gfs start' mounts it)
The mount never completed. The log on the master lockserver showed
lock_gulm starting on `T' (New Client: idx 10 fd 15 from ...) and about a
minute later T missed a heartbeat... seven heartbeats later `T' was fenced,
and most embarrassingly, rebooted.
After the reboot `T' saw all the GFS copies (except those made by other
machines) as corrupt, but a further copy of the file by `T' to the GFS disk
showed as corrupt by all nodes except `T' which continued to see it as
pristine... i.e. the reboot had not cured the problem...
Summary - I have one file, R-2.3.1-1.rh3AS.i386.rpm, which one node,
`T', cannot successfully copy to the GFS disk, although it thinks it can, and
can even copy it back, producing a duplicate of the original...
# uname -r
2.4.21-47.0.1.ELsmp
# rpm -qa | grep -i gfs
GFS-devel-6.0.2.36-1
GFS-6.0.2.36-1
GFS-modules-smp-6.0.2.36-1
# cat /etc/redhat-release
Red Hat Enterprise Linux AS release 3 (Taroon Update 8)
sum pristine file:
01904 22905
sum corrupt file:
57604 22905
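
The block counts reported by `sum' are the same for both files (22905),
which suggests the corrupt copy has the same size to within a block and
has been altered rather than truncated. A rough sketch for finding
where the two copies differ, using `cmp'; the function name
diff_summary and the file names in the usage note are hypothetical.

```shell
#!/bin/sh
# diff_summary GOOD BAD   (hypothetical helper)
# Reports how many bytes differ between two files and the offsets of
# the first and last differing byte. Returns 0 if the files are
# identical, 1 otherwise.
diff_summary() {
    good=$1
    bad=$2
    # cmp -s is silent and only sets the exit status.
    if cmp -s "$good" "$bad"; then
        echo "identical"
        return 0
    fi
    # cmp -l prints one line per differing byte: the 1-based decimal
    # offset followed by the two byte values in octal.
    cmp -l "$good" "$bad" | awk '
        NR == 1 { first = $1 }
                { last = $1 }
        END     { printf "differing bytes: %d, first at offset %d, last at offset %d\n", NR, first, last }'
    return 1
}
```

Usage, assuming the pristine copy is on local disk and the corrupt one
was copied back from the GFS disk by a node that sees it as corrupt:
diff_summary /tmp/pristine.rpm /tmp/corrupt.rpm
Whether the damage is a single run of bytes or scattered through the
file may help narrow down where it is being introduced.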
The above account is an accurate description of the events; only the
confusion, disbelief and utter panic have been omitted.
Looking for suggestions, like what to do next, which list to take
it to and so on...
Thanks
Keith
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
Hi Keith,
Good question. In fact, I've answered similar ones before on this list.
I thought I had added it to the cluster faq, but apparently I was
remiss; sorry.
I just added it now:
http://sources.redhat.com/cluster/faq.html#gfs_corruption
The examples I gave assume that you're using lvm2, which you're not
because you're on RHEL3, but it should still give you the gist.
Please let me know if the new faq entry needs some work.
BTW, it was noticed that almost all of your sentences were written
in the passive voice. The question why presents itself. ;)
Regards,
Bob Peterson
Red Hat Cluster Suite