On Tue, Feb 11, 2020 at 09:38:39AM +0100, Martin Wilck wrote:
> On Mon, 2020-02-10 at 17:15 -0600, Benjamin Marzinski wrote:
> > On Mon, Feb 10, 2020 at 06:08:14PM +0100, Martin Wilck wrote:
> > > On Wed, 2020-02-05 at 12:58 -0600, Benjamin Marzinski wrote:
> > > > On systems with a large number of cores (>500), io_destroy() can
> > > > take tens to hundreds of milliseconds to complete, due to RCU
> > > > synchronization. If there are a large number of paths using the
> > > > directio checker on such a system, this can lead to multipath
> > > > taking almost a minute to complete, with the vast majority of the
> > > > time taken up by io_destroy().
> > > >
> > > > To solve this, the directio checker now allocates one aio context
> > > > for every 1024 checkers. This reduces the io_destroy() delay to a
> > > > thousandth of its previous level. However, this means that
> > > > multiple checkers are sharing the same aio context, and must be
> > > > able to handle getting results for other checkers. Because only
> > > > one checker is ever running at a time, this doesn't require any
> > > > locking. However, locking could be added in the future if
> > > > necessary, to allow multiple checkers to run at the same time.
> > > >
> > > > When checkers are freed, they usually no longer destroy the io
> > > > context. Instead, they attempt to cancel any outstanding request.
> > > > If that fails, they put the request on an orphan list, so that it
> > > > can be freed by other checkers, once it has completed. IO
> > > > contexts are only destroyed at three times: during reconfigure
> > > > (to deal with the possibility that multipathd is holding more aio
> > > > events than it needs to be, since there is a single limit for the
> > > > whole system), when the checker class is unloaded, and in a
> > > > corner case when checkers are freed. If an aio_group (which
> > > > contains the aio context) is entirely full of orphaned requests,
> > > > then no checker can use it. Since no checker is using it, there
> > > > is no checker to clear out the orphaned requests. In this (likely
> > > > rare) case, the last checker from that group to be freed and
> > > > leave behind an orphaned request will call io_destroy() and
> > > > remove the group.
> > > >
> > > > Signed-off-by: Benjamin Marzinski <bmarzins@xxxxxxxxxx>
> > > > ---
> > > >  libmultipath/checkers/directio.c | 336 +++++++++++++++++++++++++------
> > > >  multipath/multipath.conf.5       |   7 +-
> > > >  2 files changed, 281 insertions(+), 62 deletions(-)
> > >
> > > Although I concur now that your design is sound, I still have some
> > > issues, see below.
> > > ...
> > > >         }
> > > >         ct->running++;
> > >
> > > This looks to me as if in the case (ct->running && ct->req->state ==
> > > PATH_PENDING), ct->running could become > 1, even though you don't
> > > start a new IO. Is that intended? I don't think it matters because
> > > you never decrement, but it looks weird.
> >
> > That's necessary for how the checker currently works. In async mode
> > the checker doesn't actually wait for a specific number of seconds
> > (and didn't before my patch). It assumes 1-second call times for
> > pending paths, and times out after ct->running > timeout_secs. That's
> > why the unit tests can get away with simply calling the checker
> > repeatedly, without waiting, to make it time out. But I suppose that
> > wait_for_pending_paths() will also make the checker time out more
> > quickly, so this should probably be changed. However, since
> > check_path() won't set a path's state to PATH_PENDING if it wasn't
> > already, it's not very likely to occur in practice.
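
(An aside for readers of the archive: the counting convention Ben
describes above, where each call that still finds the request pending
bumps ct->running, each such call is assumed to cost roughly one second,
and the checker reports a timeout once the count exceeds timeout_secs,
might look roughly like the minimal sketch below. The names checker_ctx
and check_once are invented for illustration; this is not the actual
directio.c code.)

    /* Hypothetical sketch of the "one second per pending call" convention
     * described above; not the actual directio.c implementation. */
    #include <stdio.h>

    enum path_state { PATH_UP, PATH_DOWN, PATH_PENDING, PATH_TIMEOUT };

    struct checker_ctx {
            int running;       /* calls seen while the request was pending */
            int timeout_secs;  /* configured checker timeout, in seconds */
    };

    /* Called once per checker invocation; io_pending says whether the aio
     * request has completed yet. */
    static enum path_state check_once(struct checker_ctx *ct, int io_pending)
    {
            if (!io_pending) {
                    ct->running = 0;      /* got a result: reset the counter */
                    return PATH_UP;       /* or PATH_DOWN, per the IO result */
            }
            ct->running++;                /* still pending: count this call as ~1s */
            if (ct->running > ct->timeout_secs)
                    return PATH_TIMEOUT;  /* assumed run time exceeded the timeout */
            return PATH_PENDING;
    }

    int main(void)
    {
            struct checker_ctx ct = { .running = 0, .timeout_secs = 3 };
            int i;

            /* Calling repeatedly without waiting, as the unit tests do, is
             * enough to drive the checker to a timeout. */
            for (i = 0; i < 5; i++)
                    printf("call %d -> state %d\n", i, check_once(&ct, 1));
            return 0;
    }
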
> Bah. I should have realized that, of course. Yet measuring the timeout
> in *seconds*, and polling for it the way we currently do, is really 80s
> design... I guess I was confused by the use of ns timeout calculation
> for the get_events() call (suggesting high-res timing), and the use of
> "ct->running" as both an indicator of running IO and an "abstract run
> time".
>
> I hope you admit that the logic in check_path() is convoluted and hard
> to understand :-/ . For example here:
>
> >         while(1) {
> >                 r = get_events(ct->aio_grp, timep);
> >
> >                 if (ct->req->state != PATH_PENDING) {
> >                         ct->running = 0;
> >                         return ct->req->state;
> >                 } else if (r == 0 || !timep)
> >                         break;
> >
> >                 get_monotonic_time(&currtime);
> >                 timespecsub(&endtime, &currtime, &timeout);
> >                 if (timeout.tv_sec < 0)
> >                         timep = NULL;
>
> We're past the timeout already, having seen some events, just not for
> the path we're currently looking at. Why do we go through another
> iteration?

Well, in this case we are past the timeout now, but weren't when
io_getevents() completed, so the code loops one more time to check
(without waiting) whether the current path's request has completed.
Since we call io_getevents() after setting the timeout to all of the
remaining time, this should only ever happen if io_getevents() exited
early. Granted, it likely didn't exit very early (unless the thread was
preempted afterwards), since it is now past the timeout.

-Ben

> >         }
> >         if (ct->running > timeout_secs || sync) {
> >
> > > See comment for get_events() above. Why don't you simply do this?
> > >
> > >         timeout.tv_sec = timeout.tv_nsec = 0;
> >
> > Sure.
> >
> > So, if I post a separate patch that resolves these issues, can this
> > one have an ack?
>
> Yes. I agree we should move forward.
>
> Regards
> Martin
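
(A closing aside on the check_path() excerpt quoted above: the deadline
handling Ben describes, recomputing the remaining time after every
wakeup and doing one last non-blocking poll if the deadline has already
passed, might look roughly like the sketch below. The helpers
wait_for_result() and wait_with_deadline(), and the direct use of
clock_gettime()/nanosleep(), are stand-ins invented for illustration;
the real code has its own get_events(), get_monotonic_time() and
timespecsub() helpers, and this is not the actual libmultipath
implementation.)

    /* Hedged sketch of the deadline handling discussed above; not the actual
     * check_path() code.  The end time is fixed up front, the remaining wait
     * is recomputed after every wakeup, and once the deadline has passed one
     * final zero-timeout poll is made before giving up. */
    #define _POSIX_C_SOURCE 200809L
    #include <stdbool.h>
    #include <stdio.h>
    #include <time.h>

    static void timespec_sub(const struct timespec *a, const struct timespec *b,
                             struct timespec *res)
    {
            res->tv_sec = a->tv_sec - b->tv_sec;
            res->tv_nsec = a->tv_nsec - b->tv_nsec;
            if (res->tv_nsec < 0) {
                    res->tv_sec--;
                    res->tv_nsec += 1000000000L;
            }
    }

    /* Stand-in for the io_getevents()-style wait.  It sleeps for at most the
     * requested time and pretends our request never completes, which is
     * enough to exercise the timeout path. */
    static bool wait_for_result(const struct timespec *timeout)
    {
            if (timeout->tv_sec > 0 || timeout->tv_nsec > 0)
                    nanosleep(timeout, NULL);
            return false;
    }

    static bool wait_with_deadline(int timeout_secs)
    {
            struct timespec endtime, now;
            struct timespec remaining = { .tv_sec = timeout_secs, .tv_nsec = 0 };
            struct timespec zero = { .tv_sec = 0, .tv_nsec = 0 };
            bool past_deadline = false;

            clock_gettime(CLOCK_MONOTONIC, &endtime);
            endtime.tv_sec += timeout_secs;

            while (1) {
                    /* Once past the deadline, poll without waiting. */
                    if (wait_for_result(past_deadline ? &zero : &remaining))
                            return true;     /* our request completed */
                    if (past_deadline)
                            return false;    /* final non-blocking check done */

                    clock_gettime(CLOCK_MONOTONIC, &now);
                    timespec_sub(&endtime, &now, &remaining);
                    if (remaining.tv_sec < 0)
                            /* The wait returned after the deadline (woken for
                             * another checker's event, or preempted): loop one
                             * more time without sleeping, then give up. */
                            past_deadline = true;
            }
    }

    int main(void)
    {
            printf("request completed: %s\n",
                   wait_with_deadline(2) ? "yes" : "no (timed out)");
            return 0;
    }

The final zero-timeout pass mirrors Ben's explanation: the wait can
return before the full remaining time has elapsed, for example when it
is woken by another checker's event, so one more check is still worth
doing even though the deadline has passed.)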