On Wed, 2007-01-03 at 23:12 -0500, danwest wrote:
> The SAN in this case is a high end EMC DMX, multipathed, etc...
> Currently our clusters are set to interval="1" and tko="15" which
> should allow for at least 15 seconds (a very long time for this type
> of storage)

"At max" 15 seconds.

> In looking at ~/cluster/cman/qdisk/main.c it seems like the following
> is taking place:
>
> In quorum_loop {}
>
> 1) read everybody else's status (not sure if this includes yourself)
> 2) check for node transitions (write an eviction notice if the number
>    of heartbeats missed > tko)
> 3) check the local heuristics (if we do not meet the requirement,
>    remove ourselves from the qdisk partition and possibly reboot)
> 4) find the master and/or determine a new master, etc...
> 5) write out our status to qdisk
> 6) write out our local status (heuristics)
> 7) cycle (sleep for the defined interval). sleep() is measured in
>    seconds, so a complete cycle = interval + time for steps (1)
>    through (6)
>
> Do you think that any delay in steps (1) through (4) could be the
> problem? From an architectural standpoint wouldn't it be better to
> have (6) and (7) as a separate thread or daemon? A kernel thread like
> cman_hbeat for example?

The heuristics are checked in the background in a separate thread; the
only thing quorum_loop looks at is their states.

Step 1 will take a while (the longest of any part of qdiskd). However,
steps 2-4 shouldn't. Making the reads and writes separate probably
would not change much - it's all direct I/O.

You basically said it yourself: on high-end storage, this just
shouldn't be a problem. We're doing a maddening 8k of reads and 0.5k
of writes during a normal cycle, every (in your case) 1 second.

So, I suspect it's a scheduling problem. That is, it would probably be
a whole lot more effective to just increase the priority of qdiskd so
that it gets scheduled even during load spikes (e.g. use a realtime
queue; SCHED_RR?). I don't think the I/O path is the bottleneck.

> Further, in the check_transitions procedure, case #2, it might be
> more helpful to clulog what actually caused this to trigger. The
> current logging is a bit generic.

You're totally right here; the logging isn't great at the moment.

-- Lon
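
[Editor's sketch] For reference, the loop described in the numbered list
above has roughly this shape in minimal C; the function names below are
illustrative only, not the actual identifiers in cluster/cman/qdisk/main.c:

    /* Simplified shape of quorum_loop as described in the thread:
     * the sleep comes after the work, so one full cycle takes
     * interval seconds plus however long steps (1)-(6) took. */
    while (running) {
            read_other_node_states();   /* step 1: bulk of the I/O (~8k reads)    */
            check_transitions();        /* step 2: evict nodes past tko misses    */
            check_heuristic_states();   /* step 3: states only; scored in another
                                           thread                                 */
            find_or_elect_master();     /* step 4 */
            write_own_status();         /* steps 5-6: ~0.5k of writes             */
            sleep(interval);            /* step 7: whole-second granularity       */
    }

With interval="1" and tko="15" that is roughly 15 cycles' worth of missed
updates before an eviction notice is written, and each cycle costs the
interval plus whatever steps (1) through (6) add on a loaded box.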
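[Editor's sketch] A minimal example of the priority change suggested above,
using the standard POSIX sched_setscheduler() call; the priority value of 1
is an arbitrary example, not a recommendation from this thread:

    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
            struct sched_param sp = { .sched_priority = 1 };  /* example value */

            /* Put the calling process (pid 0 = self) on the SCHED_RR
             * realtime queue so it keeps getting scheduled during
             * load spikes. */
            if (sched_setscheduler(0, SCHED_RR, &sp) < 0) {
                    perror("sched_setscheduler");
                    return 1;
            }

            /* ... the daemon's main loop would run here ... */
            return 0;
    }

The same change can be made from the shell against an already-running
qdiskd with chrt from util-linux, e.g. "chrt -r -p 1 <pid of qdiskd>".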