Colin Simpson wrote:
Probably not a cluster issue, just a pure kernel question. It sounds like
the driver or the device has locked up or become confused, so the
processes attached to it will hang.
A common problem in a fabric environment is that there are 2+ paths to
the tapes (i.e., 2 HBAs on the server) and commands may take either path,
which confuses the drives. Sending an unlock/reset command via the other
path is usually sufficient to recover, but it's an extremely poorly
documented area.
The most common case of this is tapes which refuse to eject: lock
commands are tracked per source and ORed together, so the drive stays
locked until every source that issued a lock has also issued an unlock.
Unlock commands therefore have to come from the same HBA(s) which issued
the lock. I've added scripts to my Bacula tape handling routines to
ensure this happens on our setup.
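
To illustrate the "per source and ORed" behaviour, here is a toy model in
Python. It is only a sketch of the semantics described above, not my
actual Bacula scripts and not a real SCSI interface; the class name, the
method names and the "hba0"/"hba1" labels are all made up for the example.

```python
class TapeDrive:
    """Toy model: the drive remembers which initiators (HBAs) hold a lock."""

    def __init__(self):
        # One entry per source that has issued a lock and not yet unlocked.
        self.lock_holders = set()

    def lock(self, initiator):
        self.lock_holders.add(initiator)

    def unlock(self, initiator):
        # An unlock only clears the lock held by the same initiator.
        self.lock_holders.discard(initiator)

    def is_locked(self):
        # "ORed": locked if ANY initiator still holds a lock.
        return bool(self.lock_holders)

    def eject(self):
        # Eject is refused while any lock is outstanding.
        return not self.is_locked()


def release_all(drive, initiators):
    """Send an unlock from every path, then try to eject."""
    for hba in initiators:
        drive.unlock(hba)
    return drive.eject()
```

In this model, locking via both "hba0" and "hba1" and then unlocking only
via "hba0" leaves eject() returning False; calling release_all() with both
paths clears it.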
To be honest I've had similar problems on pretty much all Unixes for
many years, and I've never found a good way out of it. Maybe not an
option with your case and application, but I guess that's why most
people have their backup systems running on separate dedicated boxes, so
they can be rebooted without affecting production systems.
Strongly agree. There are a number of other good reasons for running
dedicated backup systems, not least of which is the double-barrel
difficulty of bootstrapping a restore of the backup system itself AND
the dead cluster box in a worst case scenario. (It's a lot easier with
separate boxes, as in most cases only one gets trashed, and you can
reduce risk further by physically separating backups from operational
servers.)
A second good reason is the amount of IO a good tape backup solution can
generate - LTO tapes easily outrun spinning media, so a spooling setup
is needed to avoid shoeshine issues.
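
As a back-of-envelope check for the point above, the sketch below compares
the rate a source can deliver with the slowest rate at which a drive can
keep streaming. The function names are hypothetical and any figures you
feed in are your own assumptions; take real numbers from your drive's
datasheet and your own measurements.

```python
def will_shoeshine(source_mb_s, drive_min_stream_mb_s):
    """True if the source cannot feed the drive at its minimum streaming rate.

    Below that rate the drive has to stop, reposition and restart
    (shoeshining) instead of streaming continuously.
    """
    return source_mb_s < drive_min_stream_mb_s


def despool_seconds(spool_gb, drive_mb_s):
    """Time to write a full spool area to tape at the given drive rate.

    Uses decimal units (1 GB = 1000 MB).
    """
    return spool_gb * 1000 / drive_mb_s
```

If will_shoeshine() comes back True for writing directly from the client
or disk array, that is the case where spooling to fast local storage
first and despooling in one streaming pass pays off.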
All this stuff is best discussed on a list dedicated to backups.
Discussions of this kind show up regularly and there are a number of
canned answers at hand.
AB
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster