On Wed, Nov 04, 2020 at 11:56:07PM +0000, Martin Wilck wrote: > On Wed, 2020-11-04 at 17:27 -0600, Benjamin Marzinski wrote: > > On Wed, Nov 04, 2020 at 10:39:39PM +0000, Martin Wilck wrote: > > > On Mon, 2020-11-02 at 12:27 -0600, Benjamin Marzinski wrote: > > > > On Fri, Oct 30, 2020 at 09:15:39PM +0000, Martin Wilck wrote: > > > > > On Fri, 2020-10-23 at 16:15 -0500, Benjamin Marzinski wrote: > > > > > > The multipathd tur checker thread is designed to be able to > > > > > > finish at > > > > > > any time, even after the tur checker itself has been freed. > > > > > > The > > > > > > multipathd shutdown code makes sure all the checkers have > > > > > > been > > > > > > freed > > > > > > before freeing the checker_class and calling dlclose() to > > > > > > unload > > > > > > the > > > > > > DSO, but this doesn't guarantee that the checker threads have > > > > > > finished. > > > > > > If one hasn't, the DSO will get unloaded while the thread > > > > > > still > > > > > > running > > > > > > code from it, causing a segfault. Unfortunately, it's not > > > > > > possible to > > > > > > be > > > > > > sure that all tur checker threads have ended during shutdown, > > > > > > without > > > > > > making them joinable. > > > > > > > > > > > > However, since libmultipath will never be reinitialized after > > > > > > it > > > > > > has > > > > > > been uninitialzed, not dlclosing the tur checker DSO once a > > > > > > thread is > > > > > > started has minimal cost (keeping the DSO code around until > > > > > > the > > > > > > program > > > > > > exits, which usually happens right after freeing the > > > > > > checkers). > > > > > > > > > > I'm not against this, but have you considered using an > > > > > atomic refcount > > > > > for the DSO? With every tur thread starting, we could increase > > > > > it, > > > > > and > > > > > decrease it in the cleanup function of the thread when it > > > > > exits. > > > > > That > > > > > should be safe. If the refcount was positive when we exit, we > > > > > could > > > > > refrain from unloading the DSO. > > > > > > > > I tried exactly that, and I would still get crashes, even if it > > > > put > > > > the > > > > code that decrements the atomic variable in a function that's not > > > > part > > > > of the DSO, and put a pthread_exit() before the final > > > > pthread_cleanup_pop() that would decrement it in tur_thread, so > > > > that > > > > after the cleanup code is called the thread should never return > > > > to > > > > code > > > > that is in the DSO. I had to add sleeps in code to force various > > > > orderings, but I couldn't find any way that worked for all > > > > possible > > > > orderings. I would love it if this worked, and you're free to > > > > try, > > > > but > > > > I could not get this method to work. > > > > > > I've experimented a bit with a trivial test program, and I found > > > that > > > it worked stably if decrementing the refcount is really the last > > > thing > > > thread's cleanup function does. Could you provide some details > > > about > > > the sleeps that you'd put in that made this approach fail? > > > > I believe the situation that continued to crash was where the > > tur_thread() exitted naturally (so it set running to 0), although I'm > > not sure that this is necessary, or if it would still crash when > > running > > the cleanup function because of a cancel. I put the cleanup function > > to > > decrement the count in libmultipath, so that it wasn't part of the > > DSO, > > and then I put a sleep(5) as the last line of the cleanup function, > > and > > a sleep(10) as the last line of cleanup_checkers(). I also had to set > > running to 0 before starting the thread, and then take out the code > > to > > pause the thread if running was aleady 0, to make sure the thread > > acted > > like it was the one to set running to 0. Then the idea is to stop > > multipathd while there is a thread in its sleep period, so that > > multipathd sees that the counter is 0, and closes the dso, and then > > the thread finishes before multipathd shuts the rest of the way > > down. > > I guess the key is that the thread's entry point must also be in > libmultipath (i.e. outside the DSO). In pseudo-code: > > entrypoint() { > refcount++; > pthread_cleanup_push(refcount--); > tur_thread(ct); > pthread_cleanup_pop(1); > } > > This way the thread can't be in DSO code any more when refcount goes to > zero. Oh! I didn't think of solving it that way, but it makes sense. So, were you planning on posting a patch? -Ben > > Martin > > -- > Dr. Martin Wilck <mwilck@xxxxxxxx>, Tel. +49 (0)911 74053 2107 > SUSE Software Solutions Germany GmbH > HRB 36809, AG Nürnberg GF: Felix > Imendörffer > -- dm-devel mailing list dm-devel@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/dm-devel