On Mon, Jan 17, 2005 at 05:31:33PM -0800, Daniel McNeil wrote:
> My 3 node cluster ran tests for 53 hours before hitting a problem.

Attached is a patch to set the CMAN process to run at realtime priority.
I'm not sure if that's the right thing to do or not, to be honest.

Neither am I sure whether your 48-53 hours is significant. It's possible
that memory may be an issue (only guessing, but GFS caches locks like
crazy). It may be worth cutting this down a bit by tweaking

/proc/cluster/lock_dlm/drop_count
and/or
/proc/cluster/lock_dlm/drop_period

Otherwise, the only way we're going to get to the bottom of this is to
enable "DEBUG_MEMB" in cman and see what it thinks is going on when the
node is kicked out of the cluster.

patrick
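For anyone who wants to script the drop_count/drop_period tweak rather than echo values by hand, a minimal sketch follows. The helper name write_tunable is hypothetical, and the mail above does not suggest any particular value, so the value is left to the caller. It assumes the /proc file accepts a plain decimal number.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not part of cman or GFS): write a decimal value
 * to a /proc-style tunable such as /proc/cluster/lock_dlm/drop_count.
 * Assumes the file accepts a plain decimal number.
 * Returns 0 on success, -1 if the file cannot be opened or written.
 */
int write_tunable(const char *path, long value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;

	ok = fprintf(f, "%ld\n", value) > 0;

	/* fclose() flushes the buffer, so a write error can show up here. */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

It would be called as write_tunable("/proc/cluster/lock_dlm/drop_count", n), run as root, with n chosen for the workload.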
Index: cnxman.c
===================================================================
RCS file: /cvs/cluster/cluster/cman-kernel/src/cnxman.c,v
retrieving revision 1.45
diff -u -p -r1.45 cnxman.c
--- cnxman.c	17 Jan 2005 14:42:36 -0000	1.45
+++ cnxman.c	18 Jan 2005 10:49:50 -0000
@@ -63,6 +63,7 @@ static int is_valid_temp_nodeid(int node
 extern int start_membership_services(pid_t);
 extern int kcl_leave_cluster(int remove);
 extern int send_kill(int nodeid, int needack);
+extern void cman_set_realtime(struct task_struct *tsk, int prio);
 
 static struct proto_ops cl_proto_ops;
 static struct sock *master_sock;
@@ -308,7 +309,7 @@ static int cluster_kthread(void *unused)
 	init_waitqueue_entry(&cnxman_waitq_head, current);
 	add_wait_queue(&cnxman_waitq, &cnxman_waitq_head);
 
-	set_user_nice(current, -6);
+	cman_set_realtime(current, 1);
 
 	/* Allow the sockets to start receiving */
 	list_for_each(socklist, &socket_list) {
Index: membership.c
===================================================================
RCS file: /cvs/cluster/cluster/cman-kernel/src/membership.c,v
retrieving revision 1.47
diff -u -p -r1.47 membership.c
--- membership.c	13 Jan 2005 14:12:59 -0000	1.47
+++ membership.c	18 Jan 2005 10:49:50 -0000
@@ -201,6 +202,13 @@ static uint8_t *node_opinion = NULL;
 #define OPINION_AGREE 1
 #define OPINION_DISAGREE 2
 
+
+void cman_set_realtime(struct task_struct *tsk, int prio)
+{
+	tsk->policy = SCHED_FIFO;
+	tsk->rt_priority = prio;
+}
+
 /* Set node id of a node, also add it to the members array and expand the array
  * if necessary */
 static inline void set_nodeid(struct cluster_node *node, int nodeid)
@@ -281,7 +289,7 @@ static int hello_kthread(void *unused)
 	hello_task = tsk;
 	up(&hello_task_lock);
 
-	set_user_nice(current, -20);
+	cman_set_realtime(current, 1);
 
 	while (node_state != REJECTED && node_state != LEFT_CLUSTER) {
 
@@ -317,7 +325,7 @@ static int membership_kthread(void *unus
 	sigprocmask(SIG_BLOCK, &tmpsig, NULL);
 	membership_task = tsk;
 
-	set_user_nice(current, -5);
+	cman_set_realtime(current, 1);
 
 	/* Open the socket */
 	if (init_membership_services())
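
To try the same idea from userspace (for example on a cluster daemon or a test program), the sketch below requests SCHED_FIFO through the standard sched_setscheduler() call. It is only an analogue of what the patch does to the kernel threads: the helper name set_realtime_self is hypothetical, and priority 1 is used in the usage note only because the patch passes 1.

```c
#include <errno.h>
#include <sched.h>
#include <string.h>

/*
 * Hypothetical userspace helper (not part of cman): ask the kernel to
 * schedule the calling process under SCHED_FIFO at priority `prio`.
 * Returns 0 on success, or -1 with errno set: EINVAL if prio is outside
 * the SCHED_FIFO range, EPERM if the caller lacks the privilege
 * (root or CAP_SYS_NICE on Linux).
 */
int set_realtime_self(int prio)
{
	struct sched_param sp;

	/* Reject priorities the SCHED_FIFO policy does not accept. */
	if (prio < sched_get_priority_min(SCHED_FIFO) ||
	    prio > sched_get_priority_max(SCHED_FIFO)) {
		errno = EINVAL;
		return -1;
	}

	memset(&sp, 0, sizeof(sp));
	sp.sched_priority = prio;

	/* pid 0 means "the calling process". */
	return sched_setscheduler(0, SCHED_FIFO, &sp);
}
```

Calling set_realtime_self(1) mirrors the cman_set_realtime(current, 1) calls in the patch. A SCHED_FIFO task is not timesliced against ordinary tasks, so a realtime task that never sleeps can starve everything else on that CPU.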