For a longest time I was plagued by (very infrequent) crashes like this: Program received signal SIGSEGV, Segmentation fault. sess_retry_output (timer=0x92070c0) at session.c:532 532 if (!next_retry || (op->next_retry < next_retry)) (gdb) info threads * 1 Thread 0xb72f96c0 (LWP 22417) sess_retry_output (timer=0x92070c0) at session.c:532 (gdb) where #0 sess_retry_output (timer=0x92070c0) at session.c:532 #1 session_retry (timer=0x92070c0) at session.c:565 #2 0x08049aee in cld_timers_run (tlist=0x8056630) at ../lib/libtimer.c:95 #3 0x0804e9cc in main_loop (argc=5, argv=0xbff70bd4) at server.c:983 #4 main (argc=5, argv=0xbff70bd4) at server.c:1138 The crash happens because op is NULL. As it turned out, this happens if a packet retransmit and a session expiration occur simultaneously (in the same pass of timers_run). The scenario is: - timers_run collects expired timers at exec list - timers_run expires session - two timer_del are called, but one of them is on exec list already, so it's ineffective - session is freed, this zeroes ->data in lists (later op) - timers_run continues along the exec list, invokes the retransmission callback, and that crashes with NULL op. The proposed solution is to rework the timers_run, again. But this time, we'll make it simpler by observing that timers are ordered by expiration time. Therefore, we can pull next timer off the list, expire it, and loop until expiration time is greater than the current time. No execution list is kept. The integrity of the main list is assured by never walking it and always referring to the head anew at each iteration. This patch appears to fix the problem and stands up to use that crashed the old code. Signed-off-by: Pete Zaitcev <zaitcev@xxxxxxxxxx> --- include/cld_common.h | 10 ++++++---- lib/libtimer.c | 41 ++++++++++++++++++++--------------------- 2 files changed, 26 insertions(+), 25 deletions(-) I am absurdly proud of this. But the long history of timer problem in CLD makes me cautious. I already thought that we fixed everything in the past. commit c1b4a0c4dda14d9fbc4d5896e9e0dc139c31798b Author: Master <zaitcev@xxxxxxxxxxxxxxxxxx> Date: Sat Apr 17 19:02:35 2010 -0600 Fix persistent CLD crashes in retransmits (op == NULL). diff --git a/include/cld_common.h b/include/cld_common.h index d269e6c..379bd9e 100644 --- a/include/cld_common.h +++ b/include/cld_common.h @@ -24,6 +24,7 @@ #include <stdbool.h> #include <string.h> #include <time.h> +#include <glib.h> #include <openssl/sha.h> #include <cld_msg_rpc.h> @@ -31,10 +32,6 @@ struct hail_log; -struct cld_timer_list { - void *list; -}; - struct cld_timer { bool fired; bool on_list; @@ -44,6 +41,11 @@ struct cld_timer { char name[32]; }; +struct cld_timer_list { + GList *list; /* of struct cld_timer */ + time_t runmark; +}; + extern void cld_timer_add(struct cld_timer_list *tlist, struct cld_timer *timer, time_t expires); extern void cld_timer_del(struct cld_timer_list *tlist, struct cld_timer *timer); diff --git a/lib/libtimer.c b/lib/libtimer.c index c6f5241..75514f0 100644 --- a/lib/libtimer.c +++ b/lib/libtimer.c @@ -44,6 +44,17 @@ void cld_timer_add(struct cld_timer_list *tlist, struct cld_timer *timer, if (timer->on_list) timer_list = g_list_remove(timer_list, timer); + /* + * This additional resiliency is required by the invocations from + * session_retry(). For some reason the computations in it result + * in attempts to add timers in the past sometimes, and then we loop + * when trying to run those. FIXME: maybe fix that one day. + * + * Even if we fix the callers, we probably should keep this. + */ + if (expires < tlist->runmark + 1) + expires = tlist->runmark + 1; + timer->on_list = true; timer->fired = false; timer->expires = expires; @@ -66,38 +77,26 @@ time_t cld_timers_run(struct cld_timer_list *tlist) struct cld_timer *timer; time_t now = time(NULL); time_t next_timeout = 0; - GList *tmp, *cur; - GList *timer_list = tlist->list; - GList *exec_list = NULL; - - tmp = timer_list; - while (tmp) { - timer = tmp->data; - cur = tmp; - tmp = tmp->next; + GList *cur; + tlist->runmark = now; + for (;;) { + cur = tlist->list; + if (!cur) + break; + timer = cur->data; if (timer->expires > now) break; - timer_list = g_list_remove_link(timer_list, cur); - exec_list = g_list_concat(exec_list, cur); - + tlist->list = g_list_delete_link(tlist->list, cur); timer->on_list = false; - } - tlist->list = timer_list; - - tmp = exec_list; - while (tmp) { - timer = tmp->data; - tmp = tmp->next; timer->fired = true; timer->cb(timer); } if (tlist->list) { - timer_list = tlist->list; - timer = timer_list->data; + timer = tlist->list->data; if (timer->expires > now) next_timeout = (timer->expires - now); else -- To unsubscribe from this list: send the line "unsubscribe hail-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html