>> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: directio
>> checker refcount 6
>> Mar 02 11:40:35 localhost.localdomain multipathd[85474]: lxk free tur
>> checker //checker_put
>
>
> So we do not see "unloading tur checker". Like you said, that suggests
> that the crash occurs between libcheck_free() and the thread exiting.

"lxk free tur checker" is added in free_checker(), which is called by
checker_put(). I didn't change the log level of "unloading tur checker",
so we don't see it.

@@ -58,7 +58,7 @@ void free_checker (struct checker * c)
 		return;
 	c->refcount--;
 	if (c->refcount) {
-		condlog(3, "%s checker refcount %d",
+		condlog(2, "%s checker refcount %d",
 			c->name, c->refcount);
 		return;
 	}
@@ -77,6 +77,7 @@ void free_checker (struct checker * c)
 			pthread_join(ct->thread, NULL);
 		};
 	}
+	condlog(2, "lxk free %s checker", c->name);
 	FREE(c);
 }

> I suggest you put a message in tur.c:libcheck_free (), AFTER the call
> to cleanup_context(), printing the values of "running" and "holders"
> Anyway:
>
>     holders = uatomic_sub_return(&ct->holders, 1);
>     if (!holders)
>         cleanup_context(ct);
>
> Whatever mistakes we have made, only one actor can have seen
> holders == 0, and have called cleanup_context().
>

diff --git a/libmultipath/checkers/tur.c b/libmultipath/checkers/tur.c
index 4ea63af..900f960 100644
--- a/libmultipath/checkers/tur.c
+++ b/libmultipath/checkers/tur.c
@@ -105,8 +105,11 @@ void libcheck_free (struct checker * c)
 			pthread_cancel(ct->thread);
 		ct->thread = 0;
 		holders = uatomic_sub_return(&ct->holders, 1);
-		if (!holders)
+		if (!holders) {
+			running = uatomic_xchg(&ct->running, 0);
 			cleanup_context(ct);
+			condlog(2, "lxk tur running is %d", running);
+		}
 		c->context = NULL;
 	}
 	return;

Here I added a print of "running", but its value is zero.

> The stacks you have shown indicate that the instruction pointers were
> broken. That would suggest something similar as discussed in the ML
> thread leading to 38ffd89 ("libmultipath: prevent DSO unloading with
> astray checker threads").
> Your logs show "tur checker refcount 1", so
> the next call to checker_put would have unloaded the DSO.

I tested the 0.8.5 master code with commit 38ffd89. There was no crash
in five hours (without the patch, the crash happens after running the
test script for 30 to 40 minutes).

Regards,
Lixiaokeng