On Wed, 2022-03-30 at 09:34 -0500, Benjamin Marzinski wrote: > On Wed, Mar 30, 2022 at 09:44:39AM +0000, Martin Wilck wrote: > > > > > Why did you choose 1? Perhaps we should make a few more attempts? > > Oops. I forgot to include the Notes when formatting my patch (I need > to > make that the default). Here they are: > > Notes: > > I'd be interested in knowing what people think of this solution. > I'm > open to making multipathd start more threads before it gives up. > We > also could make multipathd save the contexts from the stalled > threads, so that when it stops creating new ones, instead of just > waiting for the last thread to complete, it could start up again I wouldn't recommend that. We drop all references to the old context for a good reason: to be sure there are no races when the hanging thread eventually does exit. Keeping such references would re- open a Pandora's box which we sealed and closed in 2018. > as > soon as any of the outstanding threads completed. We could also > consider not stopping creating new threads entirely, but instead > having a delay before we create a new checker thread, where we > wait > for the last thread to complete. I wouldn't do this, either. The hang check is not done immediately after cancelling the thread, but in the following libcheck_check() invocation. That means there has been some delay already when we do the check, at least a second. And this cancellation happened after the SCSI timeout expired, anyway. How long are we going to wait for the normally instanteneous cancellation to complete? I like the "just forget about this thread" attitude which has saved us a lot of trouble lately, IMO. > If the delay doubled after evey > consecutive timeout, the number of threads created would stay at > a > more reasonable level until someone got around to looking into > what > was going wrong. Thoughts? > > But to answer your question, there was no obvious number to choose, > and > you can make the case that if it fails once, that's a fluke. If it > fails > twice in a row, then it's likely to keep faiing. That makes sense. I'm fine with the patch. > But I'm fine with > picking a bigger number, or any of the other possibilities I listed. > I > just wanted to throw something out as a starting point. > > -Ben > > > > Other than that, this looks ok to me (assuming you've tested with > > the > > code from bdf55d6, or something similar). > > Yep. I tested it with the zombie tur checker tests, with different > sleep seconds and intervals. Reviewed-by: Martin Wilck <mwilck@xxxxxxxx> -- dm-devel mailing list dm-devel@xxxxxxxxxx https://listman.redhat.com/mailman/listinfo/dm-devel