Hi Brem

On Tue, 2009-12-15 at 21:15 +0100, brem belguebli wrote:
> Hi Rafael,
>
> I can already predict what is going to happen during your test.
>
> If one of your nodes loses only 1 leg of your mirrored qdisk (either
> with mdadm or lvm), the qdisk will still be active from the point of
> view of that particular node, so nothing will happen.
>
> What you should consider is:
>
> 1) reducing the SCSI timeout of the LUN, which is by default around 60
> seconds (see udev rules)
> 2) if your qdisk LUN is configured through multipath, don't configure it
> with queue_if_no_path, or mdadm will never see that one of the legs has
> become unavailable.
>
> Brem

I ran some tests today. (Rough sketches of both setups, and of the
timeout/multipath tweaks discussed below, are in the P.S. at the end of
this mail.)

A) With mdadm-mirrored LUNs:

I built the MD device on top of the multipathd devices and used it as the
quorum disk. It seemed to work, but in one test, during the deliberate
failure of a LUN on a single machine, that node failed to access the quorum
device and was evicted by the rest of the nodes. I have to take a closer
look at this, because in other attempts it didn't happen; I think it is
related to the device timeouts, retries and queueing.

B) With non-clustered LVM-mirrored LUNs:

This seems to work too, but there is some strange behaviour. During the
deliberate failure of a LUN on a single machine, the node did not notice at
the LVM layer that one device had become unreachable, although the multipath
daemon was marking the device as failed. In other attempts it worked
correctly.

I also have to check, as you suggested, the values in the udev rules and in
the multipath.conf file:

device {
        vendor                  "HP"
        product                 "MSA VOLUME"
        path_grouping_policy    group_by_prio
        getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
        path_checker            tur
        path_selector           "round-robin 0"
        prio_callout            "/sbin/mpath_prio_alua /dev/%n"
        rr_weight               uniform
        failback                immediate
        hardware_handler        "0"
        no_path_retry           12
        rr_min_io               100
}

Note: this is my testing scenario; the production environment does not use
MSA storage arrays.

I'm thinking of reducing "no_path_retry" to a smaller value, or even to
"fail". With the current value (which, according to the RHEL docs, queues
I/O like "queue_if_no_path" but only for 12 retries before failing), mdadm
did see the failure of the device, so this is more or less working. I'm
also interested in the "flush_on_last_del" parameter; have you ever tried
it?

Thanks in advance.

Cheers, Rafael

-- 
Rafael Micó Miranda
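P.S. In case it is useful to anyone else on the list, here is a minimal
sketch of how test A can be put together. The device names
(/dev/mapper/mpath_a, /dev/mapper/mpath_b), the MD device and the qdisk
label are only examples for illustration, not my real configuration:

    # mirror the two multipath LUNs with mdadm (example device names)
    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
          /dev/mapper/mpath_a /dev/mapper/mpath_b

    # write a quorum-disk label onto the MD device (example label)
    mkqdisk -c /dev/md0 -l testqdisk

qdiskd is then pointed at that label from cluster.conf, e.g.
<quorumd interval="1" tko="10" label="testqdisk"/>.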
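Similarly, a rough sketch of the non-clustered LVM mirror from test B.
Again the PV/VG/LV names and sizes are examples only, and I'm assuming an
in-memory mirror log (--mirrorlog core) so that no third log device is
needed:

    pvcreate /dev/mapper/mpath_a /dev/mapper/mpath_b
    vgcreate vg_qdisk /dev/mapper/mpath_a /dev/mapper/mpath_b

    # -m 1 = one mirror copy; --mirrorlog core keeps the mirror log in RAM
    lvcreate -m 1 --mirrorlog core -L 64M -n lv_qdisk vg_qdisk

    # label the mirrored LV as the quorum disk (example label)
    mkqdisk -c /dev/vg_qdisk/lv_qdisk -l testqdisk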
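On the SCSI timeout side, the value Brem mentions lives in sysfs and can be
lowered at runtime per path; the 30-second figure below is just a number to
illustrate, not something I have validated:

    # default is usually 60s; sdc stands for one path of the qdisk LUN
    cat /sys/block/sdc/device/timeout
    echo 30 > /sys/block/sdc/device/timeout

The udev rule Brem refers to sets this same attribute when the device
appears, so that seems to be the place for a permanent change.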
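And for multipath itself, the change I have in mind in the device section
quoted above, plus the parameter I asked about; both values here are
untested guesses on my side:

    device {
            vendor              "HP"
            product             "MSA VOLUME"
            # ... other settings unchanged from the section quoted above ...
            no_path_retry       fail    # or a small number of retries
    }

    defaults {
            flush_on_last_del   yes     # untested here; the parameter I asked about
    }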