Re: Fedora 19 cluster stack and Cluster registry components




Jamison Maxwell
Sr. Systems Administrator

-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Digimer
Sent: Wednesday, April 24, 2013 2:22 PM
To: Michael Richmond
Cc: linux clustering
Subject: Re:  Fedora 19 cluster stack and Cluster registry components


The way I deal with avoiding a dual-fence is to put a delay on one of the nodes. For example, I can specify that if Node 1 is to be fenced, Node 2 will pause for X seconds (usually 15 in my setups). This way, if both nodes try to fence each other at the same time, Node 1 will have killed Node 2 long before Node 2's 15-second timer expires. However, if Node 1 really was dead, Node 2 would still fence Node 1 and then recover, albeit with a 15-second delay in recovery. Simple and effective. :)
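
To make the timing argument concrete, here is a minimal Python sketch of the race. It is not how any fence agent is implemented: `FenceAttempt`, `resolve_fence_race`, and the 5-second fence duration are made up for illustration, and the model simply assumes that a node which is fenced before its own fence call completes never gets to fence its peer.

```python
from dataclasses import dataclass


@dataclass
class FenceAttempt:
    attacker: str
    victim: str
    delay_s: float     # pause before the fence call is issued
    duration_s: float  # hypothetical time the fence action itself takes

    @property
    def completes_at(self) -> float:
        return self.delay_s + self.duration_s


def resolve_fence_race(attempts):
    """Return {node: time_fenced} for every node that ends up fenced.

    An attempt is dropped if its attacker was fenced strictly before
    the attempt would have completed.
    """
    fenced_at = {}
    for attempt in sorted(attempts, key=lambda a: a.completes_at):
        killed_at = fenced_at.get(attempt.attacker)
        if killed_at is not None and killed_at < attempt.completes_at:
            continue  # attacker was powered off before its call finished
        fenced_at.setdefault(attempt.victim, attempt.completes_at)
    return fenced_at


# Both nodes fire at once with no delay: each kills the other.
no_delay = [
    FenceAttempt("node1", "node2", delay_s=0, duration_s=5),
    FenceAttempt("node2", "node1", delay_s=0, duration_s=5),
]

# Node 2 waits 15 seconds before fencing Node 1: Node 1 wins the race.
with_delay = [
    FenceAttempt("node1", "node2", delay_s=0, duration_s=5),
    FenceAttempt("node2", "node1", delay_s=15, duration_s=5),
]

# Node 1 is really dead, so only Node 2 ever issues a fence call.
node1_dead = [FenceAttempt("node2", "node1", delay_s=15, duration_s=5)]

for name, scenario in [("no delay", no_delay),
                       ("with delay", with_delay),
                       ("node1 dead", node1_dead)]:
    print(name, resolve_fence_race(scenario))
```

In this model the no-delay case fences both nodes, the delayed case fences only Node 2, and the dead-Node-1 case still fences Node 1, just 15 seconds later.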

I'm not sure if there is a specific RHEL 6.4 + pacemaker tutorial up yet, but keep an eye on clusterlabs. I *think* Andrew is working on that. If not, I plan to go back to working on my tutorial when I return to the office in May. However, that will still be *many* months before it's done.


On 04/24/2013 01:54 PM, Michael Richmond wrote:
> Hi Digimer,
> Thanks for your detailed comments.
> What you have described with regard to fencing is common practice for
> two node clusters that I have implemented in a few proprietary cluster
> implementations that I have worked on. However, fencing does not
> completely solve the split-brain problem in two-node clusters. There
> is still the potential for both NodeA and NodeB to decide to fence at
> the same time. In this case, each node performs the fencing operation
> to fence the other node with the result that both nodes get fenced.
> To avoid this, most clustering systems can be optionally configured
> with a shared resource (usually a shared LUN) that is used to weight
> the decision about which node gets fenced. Additionally, the shared
> LUN can be used as a coarse communication mechanism to aid the
> election of a winning node. As I'm sure you are aware, a quorum disk
> is typically used to determine which partition has access to the
> larger/important portion of the cluster resources to determine the
> nodes that must be fenced because they are in a separate network partition.
> Since you mention that qdiskd has an uncertain future, it would appear
> that the pacemaker-based stack has a potential functionality gap with
> regard to two-node clusters. That is, unless some other approach is
> taken to resolve network partitions.
>  From what I understand, the CIB is at risk for unintended roll-back
> of a write in the case where a two-node cluster has nodes up at
> differing times. For example, consider the following timeline:
>          Time 0  Node A up                       Node B up       (CIB contains "CIB0")
>          Time 1  Node A up                       Node B down
>          Time 2  Node A writes update to CIB     Node B booting (not joined cluster)
>                  (CIB contains "CIB1")
>          Time 3  Node A down                     Node B up       (CIB contains "CIB0")
> After Time 3, Node B is operating with a CIB that contains "CIB0" and
> has no way of seeing the CIB contents "CIB1" written by Node A. In
> effect, the write by Node A was rolled-back when Node A went down.
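
The timeline above can be replayed with a small Python sketch. This is a toy model under stated assumptions, not Pacemaker code: the `Node` class, `update_cib`, and `run_timeline` are hypothetical names, each node keeps its CIB as a plain string, and updates are pushed only to nodes that are cluster members at the moment of the write. It models only the four time steps described and says nothing about what happens if Node A later rejoins.

```python
class Node:
    def __init__(self, name, cib):
        self.name = name
        self.cib = cib       # local copy of the CIB on this node's own disk
        self.joined = True   # whether the node is currently a cluster member


def update_cib(writer, nodes, new_cib):
    """Apply an update and push it to every node that is currently joined."""
    if not writer.joined:
        raise RuntimeError(f"{writer.name} is not a cluster member")
    for node in nodes:
        if node.joined:
            node.cib = new_cib


def run_timeline():
    a, b = Node("A", "CIB0"), Node("B", "CIB0")  # Time 0: both up
    b.joined = False                              # Time 1: Node B down
    update_cib(a, [a, b], "CIB1")                 # Time 2: A writes, B still booting
    a.joined = False                              # Time 3: Node A down ...
    b.joined = True                               # ... and Node B up on its own
    return a.cib, b.cib


cib_on_a, cib_on_b = run_timeline()
print("Node A disk:", cib_on_a)
print("Node B disk:", cib_on_b)
```

In this model Node B ends up running with "CIB0" while "CIB1" exists only on Node A's local disk.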
> Thanks again for your input.
> Is there any description available about how to configure the
> pacemaker/corosync stack on RHEL 6.4?
> Regards,
> Michael Richmond
> michael richmond | principal software engineer | flashsoft, sandisk |
> +1.408.425.6731
> On 23/4/13 6:07 PM, "Digimer" <lists@xxxxxxxxxx> wrote:
>> First up, before I begin, I am looking to pacemaker for the future as
>> well and do not yet use it. So please take whatever I say about
>> pacemaker with a grain of salt. Andrew, on the other hand, is the
>> author and anything he says can be taken as authoritative on the topic.
>> On the future;
>> I also have a 2-node project/product that I am working to update in
>> time for the release of RHEL 7. Speaking entirely for myself, I can
>> tell you that I am planning to use Pacemaker from RHEL 7.0. As a Red
>> Hat outsider, I can only speak as a member of the community, but I
>> have every reason to believe that the pacemaker resource manager will
>> be the one used from 7.0 and forward.
>> As for the CIB, yes, it's a local XML file stored on each node.
>> Synchronization occurs via updates pushed over corosync to nodes
>> active in the cluster. As I understand it, when a node that had been
>> offline connects to the cluster, it receives any updates to the CIB.
>> Dealing with 2-node clusters, setting aside qdisk which has an
>> uncertain future I believe, you cannot use quorum. For this reason,
>> it is possible for a node to boot up, fail to reach its peer and
>> think it's the only one running. It will start your HA services and
>> voila, two nodes offering the same services at the same time in an
>> uncoordinated manner. This is bad and it is called a "split-brain".
>> The way to avoid split-brains in 2-node clusters is to use fence
>> devices, aka stonith devices (exact same thing by two different names).
>> This is _always_ wise to use, but in 2-node clusters, it is critical.
>> So imagine back to your scenario;
>> If a node came up and tried to connect to its peer but failed to do
>> so, before proceeding, it would fence (usually forcibly power off)
>> the other node. Only after doing so would it start the HA services.
>> In this way, both nodes can never be offering the same HA service at the same time.
>> The risk here though is a "fence loop". If you set the cluster to
>> start on boot and if there is a break in the connection, you can have
>> an initial state where, upon the break in the network, both try to
>> fence the other. The faster node wins, forcing the other node off and
>> continuing to operate on its own. This is fine and exactly what you
>> want. However, now the fenced node powers back up, starts its
>> cluster stack, fails to reach its peer and fences it. It finishes
>> starting, offers the HA services and goes on its way ... until the
>> other node boots back up. :)
>> Personally, I avoid this by _not_ starting the cluster stack on boot.
>> My reasoning is that, if a node fails and gets rebooted, I want to
>> check it over myself before I let it back into the cluster (I get
>> alert emails when something like this happens). It's not a risk from
>> an HA perspective because its services would have recovered on the
>> surviving peer long before it reboots anyway. This also has the added
>> benefit of avoiding a fence loop, no matter what happens.
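
The difference between the two boot policies can be sketched in a few lines of Python. This is only an illustration: `simulate_partition` is a hypothetical helper, it assumes node1 wins the initial race, that the network break never heals, and that a rebooted node always fails to reach its peer.

```python
def simulate_partition(start_on_boot, max_events=6):
    """Return the ordered (attacker, victim) fence events during a lasting break."""
    survivor, victim = "node1", "node2"  # assumption: node1 wins the first race
    events = [(survivor, victim)]
    while start_on_boot and len(events) < max_events:
        # The fenced node reboots, starts its cluster stack, cannot reach
        # its peer, and fences it; the roles swap each time around.
        survivor, victim = victim, survivor
        events.append((survivor, victim))
    return events


print("stack starts on boot:  ", simulate_partition(start_on_boot=True, max_events=4))
print("stack started by admin:", simulate_partition(start_on_boot=False))
```

With the stack started on boot, the nodes keep fencing each other until the `max_events` cap stops the simulation; with a manual start, the first fence is also the last.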
>> Cheers
>> digimer
>> On 04/23/2013 02:07 PM, Michael Richmond wrote:
>>> Andrew and Digimer,
>>> Thank you for taking the time to respond; you have corroborated some
>>> of what I've been putting together as the likely direction.
>>> I am working on adapting some cluster-aware storage features for use
>>> in a Linux cluster environment. With this kind of project it is
>>> useful to try and predict where the Linux community is heading so
>>> that I can focus my development work on what will be the "current"
>>> cluster stack around my anticipated release dates. Any predictions
>>> are simply educated guesses that may prove to be wrong, but are
>>> useful with regard to developing plans. From my reading of various
>>> web pages and piecing things together I found that RHEL 7 is
>>> intended to be based on Fedora 18, so I assume that the new
>>> Pacemaker stack has a good chance of being rolled out in RHEL
>>> 7.1/7.2, or even possibly 7.0.
>>> Hearing that there is official word that the intention is for
>>> Pacemaker to be the official cluster stack helps me put my
>>> development plans together.
>>> The project I am working on is focused on two-node clusters. But I
>>> also need a persistent, cluster-wide data store to hold a small
>>> amount of state (less than 1KB). This data store is what I refer to
>>> as a cluster-registry.
>>> The state data records the last-known operational state for the
>>> storage feature. This last-known state helps drive recovery
>>> operations for the storage feature during node bring-up. This
>>> project is specifically aimed at integrating generic functionality into the Linux cluster stack.
>>> I have been thinking about using the cluster configuration file for
>>> this storage which I assume is the CIB referenced by Andrew. But I
>>> can imagine cases where the CIB file may lose updates if it does
>>> not utilize shared storage media. My understanding is that the CIB
>>> file is stored on each node using local disk storage.
>>> For example, consider a two-node cluster that is configured with a
>>> quorum disk on shared storage media. Suppose that at a given point in time
>>> NodeA is up and NodeB is down. NodeA can become quorate and start
>>> cluster services (including HA applications). Assume that NodeA
>>> updates the CIB to record some state update. If NodeB starts booting
>>> but before NodeB joins the cluster, NodeA crashes. At this point,
>>> the updated CIB only resides on NodeA and cannot be accessed by
>>> NodeB even if NodeB can access the quorum disk and become quorate.
>>> Effectively, NodeB cannot be aware of the update from NodeA which
>>> will result in an implicit roll-back of any updates performed by
>>> NodeA.
>>> With a two-node cluster, there are two options for resolving this:
>>> * prevent any update to the cluster registry/CIB unless all nodes
>>> are part of the cluster. (This is not practical since it undermines
>>> some of the reasons for building clusters.)
>>> * store the cluster registry on shared storage so that there is one
>>> source of truth.
>>> It is possible that the nature of the data stored in the CIB is
>>> resilient to the example scenario that I describe. In this case,
>>> maybe the CIB is not an appropriate data store for my cluster
>>> registry data. In this case I am either looking for an appropriate
>>> Linux component to use for my cluster registry, or I will build a
>>> custom data store that provides atomic update semantics on shared
>>> storage.
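
For a record this small (under 1KB), one common way to get all-or-nothing updates is to write a temporary file and rename it over the old one. The Python sketch below shows that pattern only. It assumes the shared storage is mounted as a POSIX filesystem on which rename is atomic, and it does nothing to stop two nodes from writing at once (that still relies on fencing or locking). The function names and the JSON layout are hypothetical.

```python
import json
import os
import tempfile


def write_state_atomically(path, state):
    """Replace the file at `path` with `state` so readers see old or new, never half."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory so the rename stays
    # within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".registry-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before the rename
        os.replace(tmp_path, path)  # atomic swap on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def read_state(path):
    """Load the last completely written state."""
    with open(path) as f:
        return json.load(f)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as shared_dir:  # stand-in for a shared mount
        registry = os.path.join(shared_dir, "registry.json")
        write_state_atomically(registry, {"last_known_state": "clean"})
        write_state_atomically(registry, {"last_known_state": "recovering"})
        print(read_state(registry))
```

A crash before the rename leaves the previous file untouched, which is the property the last-known-state record needs during node bring-up.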
>>> Any thoughts and/or pointers would be appreciated.
>>> Thanks,
>>> Michael Richmond
>>> --
>>> michael richmond | principal software engineer | flashsoft, sandisk
>>> |
>>> +1.408.425.6731
>>> On 22/4/13 4:37 PM, "Andrew Beekhof" <andrew@xxxxxxxxxxx> wrote:
>>>> On 23/04/2013, at 4:59 AM, Digimer <lists@xxxxxxxxxx> wrote:
>>>>> On 04/22/2013 02:36 PM, Michael Richmond wrote:
>>>>>> Hello,
>>>>>> I am researching the new cluster stack that is scheduled to be
>>>>>> delivered in Fedora 19. Does anyone on this list have a sense for
>>>>>> the timeframe for this new stack to be rolled into a RHEL
>>>>>> release? (I assume the earliest would be RHEL 7.)
>>>>>> On the Windows platform, Microsoft Cluster Services provides a
>>>>>> cluster-wide registry service that is basically a cluster-wide
>>>>>> key:value store with atomic updates and support to store the
>>>>>> registry on shared disk. The  storage on shared disk allows
>>>>>> access and use of the registry in cases where nodes are
>>>>>> frequently joining and leaving the cluster.
>>>>>> Are there any component(s) that can be used to provide a similar
>>>>>> registry in the Linux cluster stack? (The current RHEL 6 stack,
>>>>>> and/or the new Fedora 19 stack.)
>>>>>> Thanks in advance for your information, Michael Richmond
>>>>> Hi Michael,
>>>>>    First up, Red Hat's policy of what is coming is "we'll announce
>>>>> on release day". So anything else is a guess. As it is, Pacemaker
>>>>> is in tech-preview in RHEL 6, and the best guess is that it will
>>>>> be the official resource manager in RHEL 7, but it's just that, a guess.
>>>> I believe we're officially allowed to say that it is our
>>>> _intention_ that Pacemaker will be the one and only supported stack
>>>> in RHEL7.
>>>>>    As for the registry question; I am not entirely sure what it is
>>>>> you are asking here (sorry, not familiar with windows). I can say
>>>>> that pacemaker uses something called the CIB (cluster information
>>>>> base) which is an XML file containing the cluster's configuration
>>>>> and state. It can be updated from any node and the changes will
>>>>> push to the other nodes immediately.
>>>> How many of these attributes are you planning to have?
>>>> You can throw a few in there, but I'd not use it for 100's or
>>>> 1000's of them - it's mainly designed to store the resource/service configuration.
>>>>> Does this answer your question?
>>>>>    The current RHEL 6 cluster is corosync + cman + rgmanager. It
>>>>> also uses an XML config and it can be updated from any node and
>>>>> push out to the other nodes.
>>>>>    Perhaps a better way to help would be to ask what, exactly, you
>>>>> want to build your cluster for?
>>>>> Cheers
>>>>> --
>>>>> Digimer
>>>>> Papers and Projects: What if the cure for
>>>>> cancer is trapped in the mind of a person without access to
>>>>> education?
>>>>> --
>>>>> Linux-cluster mailing list
>>>>> Linux-cluster@xxxxxxxxxx
>>> ________________________________
>>> PLEASE NOTE: The information contained in this electronic mail
>>> message is intended only for the use of the designated recipient(s) named above.
>>> If the reader of this message is not the intended recipient, you are
>>> hereby notified that you have received this message in error and
>>> that any review, dissemination, distribution, or copying of this
>>> message is strictly prohibited. If you have received this
>>> communication in error, please notify the sender by telephone or
>>> e-mail (as shown above) immediately and destroy any and all copies
>>> of this message in your possession (whether hard copies or electronically stored copies).