On Thu, Nov 19, 2020 at 09:40:29AM +0800, Huang, Ying wrote:
> Uladzislau Rezki <urezki@xxxxxxxxx> writes:
>
> > On Wed, Nov 18, 2020 at 10:44:13AM +0800, huang ying wrote:
> >> On Tue, Nov 17, 2020 at 9:04 PM Uladzislau Rezki <urezki@xxxxxxxxx> wrote:
> >> >
> >> > On Tue, Nov 17, 2020 at 10:37:34AM +0800, huang ying wrote:
> >> > > On Tue, Nov 17, 2020 at 6:00 AM Uladzislau Rezki (Sony)
> >> > > <urezki@xxxxxxxxx> wrote:
> >> > > >
> >> > > > The current "lazy drain" model suffers from at least two issues.
> >> > > >
> >> > > > The first one is related to the unsorted list of vmap areas: in
> >> > > > order to identify the [min:max] range of areas to be drained, a
> >> > > > full list scan is required, which is time-consuming if the list
> >> > > > is too long.
> >> > > >
> >> > > > The second one, as a next step, is about merging all fragments
> >> > > > with free space. That is also time-consuming, because it has to
> >> > > > iterate over the entire list which holds the outstanding lazy
> >> > > > areas.
> >> > > >
> >> > > > See below the "preemptirqsoff" tracer output that illustrates a
> >> > > > high latency. It is ~24676 us. Our workloads like audio and
> >> > > > video are affected by such long latency:
> >> > >
> >> > > This seems like a real problem. But I found there's a long-latency
> >> > > avoidance mechanism in the loop in __purge_vmap_area_lazy(), as
> >> > > follows:
> >> > >
> >> > >         if (atomic_long_read(&vmap_lazy_nr) < resched_threshold)
> >> > >                 cond_resched_lock(&free_vmap_area_lock);
> >> > >
> >> > I added that "resched threshold" because in my tests I could simply
> >> > hit out-of-memory, due to the fact that the drain work is not fast
> >> > enough to process such a long outstanding list of vmap areas.
> >>
> >> OK. Now I think I understand the problem. For free area purging,
> >> there are multiple "producers" but one "consumer", and it lacks a
> >> mechanism to slow down the "producers" if the "consumer" cannot
> >> catch up. And your patch tries to resolve the problem via
> >> accelerating the "consumer".
> >>
> > Seems correct. But just in case, one more time:
> >
> > the cond_resched_lock was added once upon a time to get rid of long
> > preemption-off times. Due to dropping the lock, "producers" can start
> > generating further vmap areas, so the "consumer" cannot catch up, it
> > seems.
>
> Yes. And in theory there can be vfree() storms, that is, thousands of
> vfree() calls in a short time. But I don't think that's a practical
> use case.
>
> > Later on, a resched threshold was added. It is just a simple
> > protection threshold; once it is passed, freeing is prioritized back
> > over allocation, so we guarantee that we do not hit out-of-memory.
>
> Yes. That can accelerate freeing if necessary.
>
> >>
> >> That isn't perfect, but I think we may have quite some opportunities
> >> to merge the free areas, so it should just work.
> >>
> > Yes, the merging opportunity should do the work. But of course there
> > are exceptions.
> >
> >> And I found the long-latency avoidance logic in
> >> __purge_vmap_area_lazy() appears problematic:
> >>
> >>         if (atomic_long_read(&vmap_lazy_nr) < resched_threshold)
> >>                 cond_resched_lock(&free_vmap_area_lock);
> >>
> >> Shouldn't it be something as follows?
> >>
> >>         if (i >= BATCH &&
> >>             atomic_long_read(&vmap_lazy_nr) < resched_threshold) {
> >>                 cond_resched_lock(&free_vmap_area_lock);
> >>                 i = 0;
> >>         } else
> >>                 i++;
> >>
> >> This will accelerate the purging via batching and slow down vmalloc()
> >> via holding free_vmap_area_lock. If it makes sense, can we try this?
> >>
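For concreteness, below is a minimal sketch of how the batching proposed
above could sit inside the purge loop. The loop skeleton, the BATCH
value and the exact placement are illustrative assumptions only; the
real __purge_vmap_area_lazy() differs in details:

<snip>
/* Hypothetical batch size; the right value needs measurement. */
#define BATCH	32

	unsigned long resched_threshold = lazy_max_pages() << 1;
	unsigned int i = 0;

	llist_for_each_entry_safe(va, n_va, valist, purge_list) {
		/* Merge the lazily freed area back into the free tree. */
		merge_or_add_vmap_area(va, &free_vmap_area_root,
				       &free_vmap_area_list);

		/* (decrement of vmap_lazy_nr etc. omitted) */

		/*
		 * Check for rescheduling only once per BATCH areas, so
		 * the lock is not dropped (letting "producers" back in)
		 * on every iteration, while preemption-off time still
		 * stays bounded.
		 */
		if (++i % BATCH == 0 &&
		    atomic_long_read(&vmap_lazy_nr) < resched_threshold)
			cond_resched_lock(&free_vmap_area_lock);
	}
<snip>

Whether checking only once per batch reduces the total purge latency in
practice would of course need measurement.
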
> > Probably we can switch to just using a "batch" methodology:
> >
> > <snip>
> >         if (!(i++ % batch_threshold))
> >                 cond_resched_lock(&free_vmap_area_lock);
> > <snip>
>
> That's the typical long-latency avoidance method.
>
> > The question is, which value we should use as a batch_threshold:
> > 100, 1000, etc.
>
> I think we can do some measurement to determine it?
>
Hmm.. looking at it one more time, I do not see what batching solves.
Anyway, we need to have some threshold (which we do have) that regulates
the priority between vmalloc()/vfree(). What more we can do with it:

- purging could just be performed asynchronously in workqueue context.
  Given the fact that we now also merge the outstanding areas, the data
  structure (an rb-tree) will not be so fragmented;

- lazy_max_pages() could be slightly decreased, if there are existing
  workloads which suffer from such a large value. It would be good to
  get real complaints and evidence.

> > Apart from that, and in regard to CONFIG_KASAN_VMALLOC, it seems that
> > we are not allowed to drop the free_vmap_area_lock at all, because
> > any simultaneous allocations are not allowed within a drain region,
> > so they should occur in disjoint regions. But I need to double-check
> > it.
> >
> >>
> >> And, can we reduce lazy_max_pages() to control the length of the
> >> purging list? It could be >8K if the vmalloc/vfree size is small.
> >>
> > We can adjust it, for sure. But it will influence the number of
> > global TLB flushes that must be performed.
>
> Em... For example, if we set it to 100, then the number of TLB flushes
> can already be reduced to 1% of the un-optimized implementation. Do
> you think so?
>
If we set lazy_max_pages() to a low value such as 100, the performance
will simply be destroyed.

Thanks!

--
Vlad Rezki
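P.S. For reference, since lazy_max_pages() came up above: a condensed
sketch of the threshold logic, with the free path shortened and the
comments mine, so treat it as an illustration rather than the exact
mm/vmalloc.c code:

<snip>
/*
 * Maximum number of lazily freed pages gathered before draining with
 * one global TLB flush: 32MB worth of pages, scaled by log2 of the
 * CPU count. On a 16-CPU system with 4K pages that is about 40960
 * pages, so a single flush amortizes tens of thousands of vfree()s.
 */
static unsigned long lazy_max_pages(void)
{
	unsigned int log;

	log = fls(num_online_cpus());

	return log * (32UL * 1024 * 1024 / PAGE_SIZE);
}

/* Condensed free path: how the threshold gates the drain work. */
static void free_vmap_area_noflush(struct vmap_area *va)
{
	unsigned long nr_lazy;

	nr_lazy = atomic_long_add_return((va->va_end - va->va_start) >>
					 PAGE_SHIFT, &vmap_lazy_nr);

	/* ... queue the area on the purge list ... */

	if (unlikely(nr_lazy > lazy_max_pages()))
		try_purge_vmap_area_lazy();
}
<snip>

With a threshold of ~100 pages instead, a global TLB flush would be
triggered roughly every 100 freed pages rather than every tens of
thousands, which is where the performance concern above comes from.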