[Patch 01/12] CLD: fix crash in retransmissions

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



For a longest time I was plagued by (very infrequent) crashes like this:

Program received signal SIGSEGV, Segmentation fault.
sess_retry_output (timer=0x92070c0) at session.c:532
532                     if (!next_retry || (op->next_retry < next_retry))
(gdb) info threads
* 1 Thread 0xb72f96c0 (LWP 22417)  sess_retry_output (timer=0x92070c0) at session.c:532
(gdb) where
#0  sess_retry_output (timer=0x92070c0) at session.c:532
#1  session_retry (timer=0x92070c0) at session.c:565
#2  0x08049aee in cld_timers_run (tlist=0x8056630) at ../lib/libtimer.c:95
#3  0x0804e9cc in main_loop (argc=5, argv=0xbff70bd4) at server.c:983
#4  main (argc=5, argv=0xbff70bd4) at server.c:1138

The crash happens because op is NULL. As it turned out, this happens
if a packet retransmit and a session expiration occur simultaneously
(in the same pass of timers_run). The scenario is:
 - timers_run collects expired timers at exec list
 - timers_run expires session
 - two timer_del are called, but one of them is on exec list already,
   so it's ineffective
 - session is freed, this zeroes ->data in lists (later op)
 - timers_run continues along the exec list, invokes the retransmission
   callback, and that crashes with NULL op.

The proposed solution is to rework the timers_run, again. But this
time, we'll make it simpler by observing that timers are ordered by
expiration time. Therefore, we can pull next timer off the list,
expire it, and loop until expiration time is greater than the current
time. No execution list is kept. The integrity of the main list
is assured by never walking it and always referring to the head
anew at each iteration.

This patch appears to fix the problem and stands up to use that
crashed the old code.

Signed-off-by: Pete Zaitcev <zaitcev@xxxxxxxxxx>

---
 include/cld_common.h |   10 ++++++----
 lib/libtimer.c       |   41 ++++++++++++++++++++---------------------
 2 files changed, 26 insertions(+), 25 deletions(-)

I am absurdly proud of this. But the long history of timer problem
in CLD makes me cautious. I already thought that we fixed everything
in the past.

commit c1b4a0c4dda14d9fbc4d5896e9e0dc139c31798b
Author: Master <zaitcev@xxxxxxxxxxxxxxxxxx>
Date:   Sat Apr 17 19:02:35 2010 -0600

    Fix persistent CLD crashes in retransmits (op == NULL).

diff --git a/include/cld_common.h b/include/cld_common.h
index d269e6c..379bd9e 100644
--- a/include/cld_common.h
+++ b/include/cld_common.h
@@ -24,6 +24,7 @@
 #include <stdbool.h>
 #include <string.h>
 #include <time.h>
+#include <glib.h>
 #include <openssl/sha.h>
 #include <cld_msg_rpc.h>
 
@@ -31,10 +32,6 @@
 
 struct hail_log;
 
-struct cld_timer_list {
-	void *list;
-};
-
 struct cld_timer {
 	bool			fired;
 	bool			on_list;
@@ -44,6 +41,11 @@ struct cld_timer {
 	char			name[32];
 };
 
+struct cld_timer_list {
+	GList *list;		/* of struct cld_timer */
+	time_t			runmark;
+};
+
 extern void cld_timer_add(struct cld_timer_list *tlist, struct cld_timer *timer,
 			  time_t expires);
 extern void cld_timer_del(struct cld_timer_list *tlist, struct cld_timer *timer);
diff --git a/lib/libtimer.c b/lib/libtimer.c
index c6f5241..75514f0 100644
--- a/lib/libtimer.c
+++ b/lib/libtimer.c
@@ -44,6 +44,17 @@ void cld_timer_add(struct cld_timer_list *tlist, struct cld_timer *timer,
 	if (timer->on_list)
 		timer_list = g_list_remove(timer_list, timer);
 
+	/*
+	 * This additional resiliency is required by the invocations from
+	 * session_retry(). For some reason the computations in it result
+	 * in attempts to add timers in the past sometimes, and then we loop
+	 * when trying to run those. FIXME: maybe fix that one day.
+	 *
+	 * Even if we fix the callers, we probably should keep this.
+	 */
+	if (expires < tlist->runmark + 1)
+		expires = tlist->runmark + 1;
+
 	timer->on_list = true;
 	timer->fired = false;
 	timer->expires = expires;
@@ -66,38 +77,26 @@ time_t cld_timers_run(struct cld_timer_list *tlist)
 	struct cld_timer *timer;
 	time_t now = time(NULL);
 	time_t next_timeout = 0;
-	GList *tmp, *cur;
-	GList *timer_list = tlist->list;
-	GList *exec_list = NULL;
-
-	tmp = timer_list;
-	while (tmp) {
-		timer = tmp->data;
-		cur = tmp;
-		tmp = tmp->next;
+	GList *cur;
 
+	tlist->runmark = now;
+	for (;;) {
+		cur = tlist->list;
+		if (!cur)
+			break;
+		timer = cur->data;
 		if (timer->expires > now)
 			break;
 
-		timer_list = g_list_remove_link(timer_list, cur);
-		exec_list = g_list_concat(exec_list, cur);
-
+		tlist->list = g_list_delete_link(tlist->list, cur);
 		timer->on_list = false;
-	}
-	tlist->list = timer_list;
-
-	tmp = exec_list;
-	while (tmp) {
-		timer = tmp->data;
-		tmp = tmp->next;
 
 		timer->fired = true;
 		timer->cb(timer);
 	}
 
 	if (tlist->list) {
-		timer_list = tlist->list;
-		timer = timer_list->data;
+		timer = tlist->list->data;
 		if (timer->expires > now)
 			next_timeout = (timer->expires - now);
 		else
--
To unsubscribe from this list: send the line "unsubscribe hail-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Index of Archives]     [Fedora Clound]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux