On Wed, Sep 01, 2010 at 10:48:23AM -0400, Ben Turner wrote:
> Here is a kbase on fence_scsi; it should answer any questions you have:
>
> https://access.redhat.com/kb/docs/DOC-17809
>
> Usually I run fence_scsi_test to be sure my devices are capable. Note:
>
> "To assist with finding and detecting devices which are (or are not)
> suitable for use with fence_scsi, a tool has been provided. The
> fence_scsi_test script will find devices visible to the node and report
> whether or not they are compatible with SCSI persistent reservations."

I just have to comment that fence_scsi_test is rather limited. I'm
currently working on making it more robust, so that it more accurately
tests device(s) for SCSI-PR support. Basically there are two issues:

1. The current script does not verify that registrations exist on a
   device -- it relies on the error code returned from sg_persist. This
   usually works, but we have seen some arrays that will report false
   positives.

2. The script *only* puts a registration on the device(s) and then
   removes the registration from each device. This doesn't tell the
   whole story, since the array must also support the preempt-and-abort
   operation, which is what fence_scsi actually uses to fence a node.

A new fence_scsi_test script should be available in the very near
future. Here is the relevant BZ:

https://bugzilla.redhat.com/show_bug.cgi?id=603838
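In the meantime, you can exercise both operations by hand with
sg_persist from sg3_utils. A rough sketch only -- the device path and
keys below are made up, and the preempt step needs a second node
registered with its own key to be a meaningful test:

    # Register a key, then verify it is really there instead of
    # trusting the sg_persist exit code (issue 1 above):
    sg_persist --out --register --param-sark=0x1 /dev/sdX
    sg_persist --in --read-keys /dev/sdX          # expect 0x1 listed

    # From a second node registered as 0x2, try preempt-and-abort
    # (issue 2 above); type 5 is write-exclusive, registrants-only,
    # which is what fence_scsi uses:
    sg_persist --out --preempt-abort --param-rk=0x2 --param-sark=0x1 \
               --prout-type=5 /dev/sdX
    sg_persist --in --read-keys /dev/sdX          # 0x1 should be gone

    # Clean up by registering a zero key (i.e. unregister):
    sg_persist --out --register --param-rk=0x2 --param-sark=0 /dev/sdX

If the victim key still shows up after the preempt, the array does not
really support what fence_scsi needs, whatever the exit codes say.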
Ryan

> ----- "Chris Jankowski" <Chris.Jankowski@xxxxxx> wrote:
>
> > Ben,
> >
> > Thank you for pointing me at fence_scsi.
> > It looks like fence_scsi will fit the bill elegantly, and it should
> > be much more reliable than iLO fencing if the cluster uses a
> > properly configured, dual-fabric FC SAN for shared storage.
> >
> > I read the fence_scsi manual page and have one more question.
> >
> > What do I need to do for my cluster to start using SCSI
> > reservations? Is this done by default?
> >
> > Thanks and regards,
> >
> > Chris Jankowski
> >
> > -----Original Message-----
> > From: linux-cluster-bounces@xxxxxxxxxx
> > [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Ben Turner
> > Sent: Saturday, 28 August 2010 03:29
> > To: linux clustering
> > Subject: Re: Fencing through iLO and functioning of kdump
> >
> > You have a couple of options here:
> >
> > 1. Switch to fence_scsi (which uses SCSI reservations, as you
> > described) or another I/O fencing method that does not reboot the
> > system. This will let your core dump complete without power fencing
> > interrupting it.
> >
> > 2. Put in a post fail delay long enough for the crash dump to
> > complete before fencing kicks in. This is suboptimal, as your
> > cluster services/resources will be hung for the duration of the post
> > fail delay. I usually only do this when I know I have a node that is
> > crashing and no I/O fencing capabilities.
> >
> > 3. If you don't have access to an I/O fence agent and a post fail
> > delay won't work for some reason, the best practice I can think of
> > right now would be the following:
> >
> > 1. Disable the power fence device on the host you're seeing panics
> > on; I have changed the IP for it in cluster.conf in the past.
> > 2. When that node fails, the other nodes will attempt to fence the
> > host, and the fence will fail since the fence device was disabled.
> > (NOTE: between steps 2 and 3, cluster operation is suspended.)
> > 3. The administrator can now do things like:
> >    - disconnect the FC and network cables from the affected host,
> >      ensuring that it is 'manually I/O fenced'
> >    - run fence_ack_manual on another host to override the failed
> >      fencing operation so that cluster operation can continue on
> >      the other nodes
> > 4. Now the failed host is free to continue kdumping for as long as
> > need be.
> >
> > Hope this helps.
> >
> > -b
> >
> > ----- "Chris Jankowski" <Chris.Jankowski@xxxxxx> wrote:
> >
> > > Hi,
> > >
> > > How can I reconcile the need to have kdump configured and
> > > operational on cluster nodes with the need for fencing of a node,
> > > most commonly and conveniently implemented through iLO on HP
> > > servers?
> > >
> > > Customers require kdump configured and operational so that kernel
> > > crashes can be analysed by Red Hat support. The taking of a crash
> > > dump starts immediately after the crash, but it may take very
> > > considerable time on a machine with 512 GB of memory (more than an
> > > hour) if done at dump level 0 over a 1 GbE network. However, if I
> > > use iLO fencing, the crashed node will be powered off through iLO,
> > > which will irrecoverably kill the kernel dump in progress and
> > > erase the memory content containing the crashed kernel image.
> > >
> > > Ideally, I would love to have the functionality that is present in
> > > several UNIX clusters, where a crashed node completes its kernel
> > > crash dump in peace. In UNIX clusters the crashed node can be
> > > configured to reboot automatically after a kernel crash and rejoin
> > > the cluster. It typically does the kernel dump as a part of the
> > > boot.
> > >
> > > The UNIX clusters typically use SCSI reservations to protect the
> > > integrity of storage. This enables them to keep the failed node
> > > isolated whilst it is still able to take the kernel crash dump
> > > before rejoining the cluster. I believe this option is not
> > > available in Linux Cluster.
> > >
> > > So, how can I have a functioning Linux cluster with the ability to
> > > take a kernel crash dump of crashed nodes, without blocking access
> > > to the shared GFS2 filesystem for the hour or so that a crash dump
> > > may take on a very large system?
> > >
> > > Thanks and regards,
> > >
> > > Chris Jankowski
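To Chris's question above about what is needed to start using SCSI
reservations: it is not on by default. A rough, untested sketch of what
a RHEL 5 setup looks like -- all names below are examples, so check the
fence_scsi man page and the kbase article above before relying on it:

    # fence_scsi is declared in /etc/cluster/cluster.conf like any
    # other fence agent, e.g.:
    #
    #   <fencedevice agent="fence_scsi" name="scsi-fence"/>
    #   ...
    #   <clusternode name="node1" nodeid="1">
    #     <fence>
    #       <method name="1">
    #         <device name="scsi-fence" node="node1"/>
    #       </method>
    #     </fence>
    #   </clusternode>
    #
    # On RHEL 5 the registrations themselves are placed at boot by the
    # scsi_reserve init script, which has to be enabled on every node:
    chkconfig scsi_reserve on
    service scsi_reserve start

Keys only get placed on devices the script can discover -- on RHEL 5
that means LVM volumes in a clustered volume group -- so it is worth
re-running fence_scsi_test afterwards to confirm the registrations
actually landed.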