On Tue, 2022-03-29 at 22:22 -0500, Benjamin Marzinski wrote:
> If a tur thread hangs, multipathd was simply creating a new thread,
> assuming that the old thread would get cleaned up eventually. I have
> seen a case recently where there were 26000 multipathd threads on a
> system, all stuck trying to send TUR commands to path devices. The
> root cause of the issue was a scsi kernel issue, but it shows that
> the way multipathd currently deals with stuck threads could use some
> refinement.

Alright, let's start the discussion about dead TUR threads again ;-)
For my own records, here are links to some previous ML threads:

https://listman.redhat.com/archives/dm-devel/2018-September/036102.html
https://listman.redhat.com/archives/dm-devel/2018-October/036156.html
https://listman.redhat.com/archives/dm-devel/2018-October/036159.html
https://listman.redhat.com/archives/dm-devel/2018-October/036240.html
https://listman.redhat.com/archives/dm-devel/2018-October/036362.html

Have you assessed why the SCSI issues were causing indefinite hangs?
Was it just the TUR command hanging forever in the kernel, or some
other issue? I have seen similar issues in the past; they were also
related to SCSI problems, IIRC.

I used to think that this was just a symptom: the "dead" threads are
detached and consume only a minimal amount of non-shared memory (they
are a pain when analyzing a crash dump, though). But I agree, just
continuing to start new threads forever isn't optimal.

>
> Now, when one tur thread hangs, multipathd will act as it did before.
> If a second one in a row hangs, multipathd will instead wait for it
> to complete before starting another thread. Once the thread
> completes, the count is reset.
>
> Signed-off-by: Benjamin Marzinski <bmarzins@xxxxxxxxxx>
> ---
>  libmultipath/checkers/tur.c | 23 +++++++++++++++++++++--
>  1 file changed, 21 insertions(+), 2 deletions(-)
>
> diff --git a/libmultipath/checkers/tur.c b/libmultipath/checkers/tur.c
> index c93e4625..1bcb7576 100644
> --- a/libmultipath/checkers/tur.c
> +++ b/libmultipath/checkers/tur.c
> @@ -26,6 +26,7 @@
>
>  #define TUR_CMD_LEN 6
>  #define HEAVY_CHECK_COUNT 10
> +#define MAX_NR_TIMEOUTS 1

Why did you choose 1? Perhaps we should make a few more attempts?

Other than that, this looks ok to me (assuming you've tested with the
code from bdf55d6, or something similar).

Martin

>
>  enum {
>  	MSG_TUR_RUNNING = CHECKER_FIRST_MSGID,
> @@ -54,6 +55,7 @@ struct tur_checker_context {
>  	int holders; /* uatomic access only */
>  	int msgid;
>  	struct checker_context ctx;
> +	unsigned int nr_timeouts;
>  };
>
>  int libcheck_init (struct checker * c)
> @@ -358,8 +360,23 @@ int libcheck_check(struct checker * c)
>  		}
>  	} else {
>  		if (uatomic_read(&ct->holders) > 1) {
> +			/* The thread has been cancelled but hasn't quit. */
> +			if (ct->nr_timeouts == MAX_NR_TIMEOUTS) {
> +				condlog(2, "%d:%d : waiting for stalled tur thread to finish",
> +					major(ct->devt), minor(ct->devt));
> +				ct->nr_timeouts++;
> +			}
>  			/*
> -			 * The thread has been cancelled but hasn't quit.
> +			 * Don't start new threads until the last one has
> +			 * finished.
> +			 */
> +			if (ct->nr_timeouts > MAX_NR_TIMEOUTS) {
> +				c->msgid = MSG_TUR_TIMEOUT;
> +				return PATH_TIMEOUT;
> +			}
> +			ct->nr_timeouts++;
> +			/*
> +			 * Start a new thread while the old one is stalled.
>  			 * We have to prevent it from interfering with the new
>  			 * thread. We create a new context and leave the old
>  			 * one with the stale thread, hoping it will clean up
> @@ -375,13 +392,15 @@ int libcheck_check(struct checker * c)
>  			 */
>  			if (libcheck_init(c) != 0)
>  				return PATH_UNCHECKED;
> +			((struct tur_checker_context *)c->context)->nr_timeouts = ct->nr_timeouts;
>
>  			if (!uatomic_sub_return(&ct->holders, 1))
>  				/* It did terminate, eventually */
>  				cleanup_context(ct);
>
>  			ct = c->context;
> -		}
> +		} else
> +			ct->nr_timeouts = 0;
>  		/* Start new TUR checker */
>  		pthread_mutex_lock(&ct->lock);
>  		tur_status = ct->state = PATH_PENDING;
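P.S. To double-check my reading of the new logic, I boiled it down to
the self-contained model below. This is only a sketch with invented
names (model_ctx, model_check, thread_stalled): "thread_stalled"
stands in for the real uatomic_read(&ct->holders) > 1 test, and I've
dropped the logging and the carry-over of nr_timeouts into the freshly
initialized context. With MAX_NR_TIMEOUTS == 1 it shows why exactly
one replacement thread gets started before we back off:

	/*
	 * Simplified model of the nr_timeouts logic, for illustration
	 * only. Not the actual multipathd code.
	 */
	#include <assert.h>

	#define MAX_NR_TIMEOUTS 1

	enum model_state { MODEL_PENDING, MODEL_TIMEOUT };

	struct model_ctx {
		int thread_stalled;       /* old TUR thread hasn't exited */
		unsigned int nr_timeouts; /* consecutive stalls seen */
	};

	static enum model_state model_check(struct model_ctx *ct)
	{
		if (ct->thread_stalled) {
			if (ct->nr_timeouts >= MAX_NR_TIMEOUTS)
				/* Limit reached: stop spawning replacements
				 * and report a timeout until the stuck
				 * thread finally exits. */
				return MODEL_TIMEOUT;
			/* Below the limit: count the stall, but still
			 * start a replacement thread, as before. */
			ct->nr_timeouts++;
		} else {
			/* The last thread terminated normally: reset. */
			ct->nr_timeouts = 0;
		}
		return MODEL_PENDING; /* a new TUR checker is started */
	}

	int main(void)
	{
		struct model_ctx ct = { .thread_stalled = 1 };

		assert(model_check(&ct) == MODEL_PENDING); /* 1st stall: spawn */
		assert(model_check(&ct) == MODEL_TIMEOUT); /* 2nd stall: wait  */
		ct.thread_stalled = 0;
		assert(model_check(&ct) == MODEL_PENDING); /* finished: reset  */
		assert(ct.nr_timeouts == 0);
		return 0;
	}

If that reading is right, raising MAX_NR_TIMEOUTS only changes how many
replacement threads we're willing to stack up before backing off, and
the total stays bounded either way; that's all my question above about
the value 1 amounts to.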