Thursday 22 Feb 2007

An attempt was made to make sure all the computers in a certain group had a common set of rpms installed. To make this easier, a non-RedHat rpm was copied to a disk that was mounted on most of the machines, and installed from there. This broke the RPM database on those machines. After the install, most rpm commands got this:

    error: rpmdbNextIterator: skipping h# 2325 Header V3 DSA signature: BAD, key ID 99b62126

This was fixed by:

    rpm -e <the above rpm> ; rpm --rebuilddb

It was noticed that all the machines which had installed the shared rpm had failed in this way, but none of the machines that had installed from a copy on local disk. Using `sum' it was noticed that all the machines except one saw the file as corrupt. The one machine (called `T' from here on) was the one which had done the original copy. It still saw the file as pristine, i.e. not corrupt.

The shared filesystem is based on GFS, but due to a history of network and SAN problems causing fence events which seriously degrade our service level, GFS is restricted to as few machines as possible. Currently only three machines (called C, S and W from here on) mount the GFS disk directly. Machines C and W export it to the rest of the group via NFS.

`T' mounted the GFS disk via NFS through W. `T' was the only machine to see the GFS copy as pristine. All other machines, including C, S and W, saw the file as corrupt, irrespective of whether they mounted the disk directly by GFS or by NFS.

`T' then unmounted the disk via W and remounted it via C. It then saw the file as corrupt. It then made another copy of the file from its local disk to the GFS disk, and this copy too was seen as corrupt by all other machines, while `T' itself saw it as pristine. Other machines had no problems copying the same file from their local disk to the GFS disk.
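The checksum comparison described above can be repeated on each node with a small helper. This is only a sketch: the function name `compare_copies` and both path arguments are hypothetical, and it assumes the standard `sum`, `cmp` and `head` utilities are available. Beyond confirming a mismatch, it lists the first few differing bytes, which may help show where in the file the damage starts.

```shell
#!/bin/sh
# Compare a local reference file with its copy on the shared mount.
# compare_copies and its two path arguments are illustrative names only.
compare_copies() {
    ref="$1"      # copy on local disk, believed pristine
    shared="$2"   # the same file as seen through the GFS or NFS mount

    # Reading from stdin keeps the file name out of the output, so the
    # two strings hold only the checksum and the block count.
    ref_sum=$(sum < "$ref")
    shared_sum=$(sum < "$shared")

    if [ "$ref_sum" = "$shared_sum" ]; then
        echo "match: $ref_sum"
        return 0
    fi

    echo "MISMATCH: local=$ref_sum shared=$shared_sum"
    # cmp -l prints the offset and octal values of every differing byte;
    # the first few lines are enough to locate the start of the damage.
    cmp -l "$ref" "$shared" | head -n 5
    return 1
}
```

Running something like `compare_copies /local/path/file.rpm /shared/path/file.rpm` on `T' and on one other node, with the real paths substituted, would show whether the two nodes disagree about the same bytes.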
An attempt was made to mount the GFS disk directly on `T':

    /etc/init.d/pool start
    /etc/init.d/ccsd start
    /etc/init.d/lock_gulmd start
    /etc/init.d/gfs start
    mount /dev/pool/pool_gfs01 -t gfs /mnt

(I've never mounted a GFS disk in this way before, so this may be a problem. Usually it's in fstab and `/etc/init.d/gfs start' mounts it.)

The mount never completed. The log on the master lockserver showed lock_gulm starting on `T' (New Client: idx 10 fd 15 from ...) and about a minute later `T' missed a heartbeat... seven heartbeats later `T' was fenced and, most embarrassingly, rebooted.

After the reboot `T' saw all the GFS copies (except those made by other machines) as corrupt, but a further copy of the file by `T' to the GFS disk was seen as corrupt by all nodes except `T', which continued to see it as pristine... i.e. the reboot had not cured the problem...

Summary - I have one file, R-2.3.1-1.rh3AS.i386.rpm, which one node, `T', cannot successfully copy to the GFS disk, although it thinks it can, and can even copy it back, producing a duplicate of the original...

    # uname -r
    2.4.21-47.0.1.ELsmp
    # rpm -qa | grep -i gfs
    GFS-devel-6.0.2.36-1
    GFS-6.0.2.36-1
    GFS-modules-smp-6.0.2.36-1
    # cat /etc/redhat-release
    Red Hat Enterprise Linux AS release 3 (Taroon Update 8)

    sum pristine file: 01904 22905
    sum corrupt file:  57604 22905

The above account is an accurate description of the events; only the confusion, disbelief and utter panic have been omitted.

Looking for suggestions, like what to do next, which list to take it to and so on...

Thanks
Keith

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster