See my thread earlier, as I am having similar issues. I am testing this
soon, but I "think" the issue in my case is setting up SCSI fencing
before GFS2. So essentially it has nothing to fence off of, sees it as a
fault, and never recovers. I "think" my fix will be to establish the
LVMs, GFS2, etc., and then put in the SCSI fence so that it can actually
create the private reservations. Then the fun begins in pulling the plug
randomly to see how it behaves.

________________________________________
Chip Burke

On 8/10/12 12:46 PM, "Digimer" <lists@xxxxxxxxxx> wrote:

>Not sure if it relates, but I can say that without fencing, things will
>break in strange ways. The reason is that if anything triggers a fault,
>the cluster blocks by design and stays blocked until a fence call
>succeeds (which is impossible without fencing configured in the first
>place).
>
>Can you please set up fencing, test to make sure it works (using
>'fence_node rhel2.local' from rhel1.local, then in reverse)? Once this
>is done, test again for your problem. If it still exists, please paste
>the updated cluster.conf then. Also please include syslog from both
>nodes around the time of your LVM tests.
>
>digimer
>
>On 08/10/2012 12:38 PM, Poós Krisztián wrote:
>> This is the cluster conf, which is a clone of the problematic system on
>> a test environment (without the Oracle and SAP instances, only focusing
>> on this LVM issue, with an LVM resource)
>>
>> [root@rhel2 ~]# cat /etc/cluster/cluster.conf
>> <?xml version="1.0"?>
>> <cluster config_version="7" name="teszt">
>>   <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
>>   <clusternodes>
>>     <clusternode name="rhel1.local" nodeid="1" votes="1">
>>       <fence/>
>>     </clusternode>
>>     <clusternode name="rhel2.local" nodeid="2" votes="1">
>>       <fence/>
>>     </clusternode>
>>   </clusternodes>
>>   <cman expected_votes="3"/>
>>   <fencedevices/>
>>   <rm>
>>     <failoverdomains>
>>       <failoverdomain name="all" nofailback="1" ordered="1" restricted="0">
>>         <failoverdomainnode name="rhel1.local" priority="1"/>
>>         <failoverdomainnode name="rhel2.local" priority="2"/>
>>       </failoverdomain>
>>     </failoverdomains>
>>     <resources>
>>       <lvm lv_name="teszt-lv" name="teszt-lv" vg_name="teszt"/>
>>       <fs device="/dev/teszt/teszt-lv" fsid="43679" fstype="ext4" mountpoint="/lvm" name="teszt-fs"/>
>>     </resources>
>>     <service autostart="1" domain="all" exclusive="0" name="teszt" recovery="disable">
>>       <lvm ref="teszt-lv"/>
>>       <fs ref="teszt-fs"/>
>>     </service>
>>   </rm>
>>   <quorumd label="qdisk"/>
>> </cluster>
>>
>> Here are the log parts:
>> Aug 10 17:21:21 rgmanager I am node #2
>> Aug 10 17:21:22 rgmanager Resource Group Manager Starting
>> Aug 10 17:21:22 rgmanager Loading Service Data
>> Aug 10 17:21:29 rgmanager Initializing Services
>> Aug 10 17:21:31 rgmanager /dev/dm-2 is not mounted
>> Aug 10 17:21:31 rgmanager Services Initialized
>> Aug 10 17:21:31 rgmanager State change: Local UP
>> Aug 10 17:21:31 rgmanager State change: rhel1.local UP
>> Aug 10 17:23:23 rgmanager Starting stopped service service:teszt
>> Aug 10 17:23:25 rgmanager Failed to activate logical volume, teszt/teszt-lv
>> Aug 10 17:23:25 rgmanager Attempting cleanup of teszt/teszt-lv
>> Aug 10 17:23:29 rgmanager Failed second attempt to activate teszt/teszt-lv
>> Aug 10 17:23:29 rgmanager start on lvm "teszt-lv" returned 1 (generic error)
>> Aug 10 17:23:29 rgmanager #68: Failed to start service:teszt; return value: 1
>> Aug 10 17:23:29 rgmanager Stopping service service:teszt
>> Aug 10 17:23:30 rgmanager stop: Could not match /dev/teszt/teszt-lv with a real device
>> Aug 10 17:23:30 rgmanager stop on fs "teszt-fs" returned 2 (invalid argument(s))
>> Aug 10 17:23:31 rgmanager #12: RG service:teszt failed to stop; intervention required
>> Aug 10 17:23:31 rgmanager Service service:teszt is failed
>> Aug 10 17:24:09 rgmanager #43: Service service:teszt has failed; can not start.
>> Aug 10 17:24:09 rgmanager #13: Service service:teszt failed to stop cleanly
>> Aug 10 17:25:12 rgmanager Starting stopped service service:teszt
>> Aug 10 17:25:14 rgmanager Failed to activate logical volume, teszt/teszt-lv
>> Aug 10 17:25:15 rgmanager Attempting cleanup of teszt/teszt-lv
>> Aug 10 17:25:17 rgmanager Failed second attempt to activate teszt/teszt-lv
>> Aug 10 17:25:18 rgmanager start on lvm "teszt-lv" returned 1 (generic error)
>> Aug 10 17:25:18 rgmanager #68: Failed to start service:teszt; return value: 1
>> Aug 10 17:25:18 rgmanager Stopping service service:teszt
>> Aug 10 17:25:19 rgmanager stop: Could not match /dev/teszt/teszt-lv with a real device
>> Aug 10 17:25:19 rgmanager stop on fs "teszt-fs" returned 2 (invalid argument(s))
>>
>>
>> After I manually started the LVM on node1 and tried to switch it to
>> node2, it is not able to start it.
>>
>> Regards,
>> Krisztian
>>
>>
>> On 08/10/2012 05:15 PM, Digimer wrote:
>>> On 08/10/2012 11:07 AM, Poós Krisztián wrote:
>>>> Dear all,
>>>>
>>>> I hope that someone has run into this problem in the past and can
>>>> help me resolve this issue.
>>>>
>>>> There is a 2 node rhel cluster with quorum also.
>>>> There are clustered LVMs, where the -c- flag is on.
>>>> If I start clvmd, all the clustered LVMs come online.
>>>>
>>>> After this, if I start rgmanager, it deactivates all the volumes and
>>>> is not able to activate them again, as there are no such devices
>>>> anymore during the startup of the service, so the service fails.
>>>> All LVs remain without the active flag.
>>>>
>>>> I can manually bring it up, but only if, after clvmd is started, I set
>>>> the LVMs offline manually with lvchange -an <lv>
>>>> After this, when I start rgmanager, it can take it online without
>>>> problems. However, I think this action should be done by rgmanager
>>>> itself. The logs are full of the following:
>>>> rgmanager Making resilient: lvchange -an ....
>>>> rgmanager lv_exec_resilient failed
>>>> rgmanager lv_activate_resilient stop failed on ....
>>>>
>>>> Also, sometimes the lvs/clvmd commands hang. I have to
>>>> restart clvmd to make it work again (sometimes by killing it).
>>>>
>>>> Does anyone have any idea what to check?
>>>>
>>>> Thanks and regards,
>>>> Krisztian
>>>
>>> Please paste your cluster.conf file with minimal edits.
>
>
>--
>Digimer
>Papers and Projects: https://alteeve.com
>
>--
>Linux-cluster mailing list
>Linux-cluster@xxxxxxxxxx
>https://www.redhat.com/mailman/listinfo/linux-cluster
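
Digimer's point about the posted cluster.conf is that each <fence/>
element is empty and <fencedevices/> has no entries, so no fence call
can ever succeed. A minimal Python sketch that flags those gaps might
look like the following. It is not from the thread: the function name
fencing_problems and the trimmed CLUSTER_CONF string are made up for
illustration, and it only inspects the structure of the file. It does
not prove that fencing works, which still needs the fence_node test
described above.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the cluster.conf quoted in the thread (the <rm> section
# and a few other elements are omitted; only the fencing parts matter here).
CLUSTER_CONF = """<?xml version="1.0"?>
<cluster config_version="7" name="teszt">
  <clusternodes>
    <clusternode name="rhel1.local" nodeid="1" votes="1">
      <fence/>
    </clusternode>
    <clusternode name="rhel2.local" nodeid="2" votes="1">
      <fence/>
    </clusternode>
  </clusternodes>
  <fencedevices/>
</cluster>
"""


def fencing_problems(conf_xml):
    """Return a list of fencing gaps found in a cluster.conf string.

    Hypothetical helper: checks only that fence devices are defined and
    that every node references at least one defined device.
    """
    root = ET.fromstring(conf_xml)
    problems = []

    # Names of all fence devices declared under <fencedevices>.
    devices = {d.get("name") for d in root.findall("./fencedevices/fencedevice")}
    if not devices:
        problems.append("no <fencedevice> entries defined")

    # Each node should list at least one <device> inside <fence><method>.
    for node in root.findall("./clusternodes/clusternode"):
        used = [d.get("name") for d in node.findall("./fence/method/device")]
        if not used:
            problems.append(f"{node.get('name')}: no fence method/device")
        for name in used:
            if name not in devices:
                problems.append(f"{node.get('name')}: device {name!r} is not defined")
    return problems


if __name__ == "__main__":
    for line in fencing_problems(CLUSTER_CONF):
        print(line)
```

Run against the configuration from the thread, this reports the missing
fence devices and the two nodes without a fence method. An empty result
only means the file is structurally complete, not that a fence call
would succeed.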