On Sat 23-07-16 10:12:24, NeilBrown wrote:
> On Fri, Jul 22 2016, Michal Hocko wrote:
[...]
> > If we just back off and rely on kswapd which
> > might get stuck on the writeout then the IO throughput can be reduced
> 
> If I were king of MM, I would make a decree to be proclaimed throughout
> the land
> 
>    kswapd must never sleep except when it explicitly chooses to
> 
> Maybe that is impractical, but having firm rules like that would go a
> long way to make it possible to actually understand and reason about how
> MM works.  As it is, there seems to be a tendency to put bandaids over
> bandaids.

Ohh, I would definitely wish for this to be clearer, but as it turned
out over time there are quite a few interdependencies between the
MM/FS/IO layers which make the picture really blurry. If there is a
brave soul who could make it clearer without breaking any of it, that
would be really cool ;)

> > which I believe would make the whole memory pressure just worse. So I am
> > not sure this is a good idea in general. I completely agree with you
> > that the mempool request shouldn't be throttled unless there is a strong
> > reason for that. More on that below.
> >
> >> If I'm following the code properly, the stack trace below can only
> >> happen if the first pool->alloc() attempt, with direct-reclaim disabled,
> >> fails and the pool is empty, so mempool_alloc() calls prepare_to_wait()
> >> and io_schedule_timeout().
> >
> > mempool_alloc retries immediately without any sleep after the first
> > no-reclaim attempt.
> 
> I missed that ... I see it now...  I wonder if anyone has contemplated
> using some modern programming techniques like, maybe, a "while" loop in
> there..
> Something like the below...

Heh, why not; the code could definitely use some more love. Care to send
a proper patch so that we are not mixing two different things here?

> >> I suspect the timeout *doesn't* fire (5 seconds is a long time) so it
> >> gets woken up when there is something in the pool.
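
[For anyone following along: the flow under discussion can be modelled in
userspace roughly like below. The pool struct, the fake backing allocator,
and the simulated wakeup are all invented for illustration -- this is not
the kernel code, just the shape of the retry loop.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the real structures; names are made up. */
struct pool { int free_elements; };

static bool backing_alloc_ok;	/* models pool->alloc() success/failure */

static void *backing_alloc(void)
{
	/* models pool->alloc(gfp & ~__GFP_DIRECT_RECLAIM, ...) */
	static int obj;
	return backing_alloc_ok ? (void *)&obj : NULL;
}

static void *pool_take(struct pool *p)
{
	static int obj;
	if (p->free_elements > 0) {
		p->free_elements--;
		return &obj;
	}
	return NULL;
}

/*
 * The flow written as one explicit loop:
 * 1) try the backing allocator with direct reclaim masked off,
 * 2) fall back to the pool's reserved elements,
 * 3) only if both fail, wait for an element to come back
 *    (simulated here) and retry from the top.
 */
static void *model_mempool_alloc(struct pool *p)
{
	void *elem;

	for (;;) {
		elem = backing_alloc();
		if (elem)
			return elem;

		elem = pool_take(p);
		if (elem)
			return elem;

		/* the kernel does prepare_to_wait() +
		 * io_schedule_timeout() here; we pretend an
		 * element was just returned to the pool */
		p->free_elements = 1;
	}
}
```

[The point of the loop form is that after a wakeup the code takes the
freed element from the pool rather than re-entering pool->alloc(), which
can block.]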
> >> It then loops around
> >> and tries pool->alloc() again, even though there is something in the
> >> pool.  This might be justified if that ->alloc would never block, but
> >> obviously it does.
> >>
> >> I would very strongly recommend just changing mempool_alloc() to
> >> permanently mask out __GFP_DIRECT_RECLAIM.
> >>
> >> Quite separately I don't think PF_LESS_THROTTLE is at all appropriate.
> >> It is "LESS" throttle, not "NO" throttle, but you have made
> >> throttle_vm_writeout never throttle PF_LESS_THROTTLE threads.
> >
> > Yes, that is correct. But it still allows throttling on congestion in
> > shrink_inactive_list:
> >
> > 	/*
> > 	 * Stall direct reclaim for IO completions if underlying BDIs or zone
> > 	 * is congested. Allow kswapd to continue until it starts encountering
> > 	 * unqueued dirty pages or cycling through the LRU too quickly.
> > 	 */
> > 	if (!sc->hibernation_mode && !current_is_kswapd() &&
> > 	    current_may_throttle())
> > 		wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);
> >
> > My thinking was that throttle_vm_writeout is there to prevent dirtying
> > too many pages from the reclaim context. PF_LESS_THROTTLE is part of
> > the writeout, so throttling it on too many dirty pages is questionable
> > (well, we get some bias but that is not really reliable). It still
> > makes sense to throttle when the backing device is congested, because
> > the writeout path wouldn't make much progress anyway and we also do
> > not want to cycle through LRU lists too quickly in that case.
> 
> "dirtying ... from the reclaim context" ??? What does that mean?

Say you cause a swapout from the reclaim context. You effectively dirty
that anon page until it gets written down to the storage.

> According to
> 	Commit: 26eecbf3543b ("[PATCH] vm: pageout throttling")
> from the history tree, the purpose of throttle_vm_writeout() is to
> limit the amount of memory that is concurrently under I/O.
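
[As a side note, the throttling decision being debated reduces to a
predicate of roughly this shape. This is a userspace sketch with invented
names and thresholds, not the kernel code; the PF_LESS_THROTTLE bypass is
exactly the part being questioned above.]

```c
#include <assert.h>
#include <stdbool.h>

#define PF_LESS_THROTTLE 0x00100000	/* same bit as the kernel flag */

/*
 * Sketch of a throttle_vm_writeout()-style check: stall reclaim while
 * the number of dirty + writeback pages exceeds a threshold, except
 * for PF_LESS_THROTTLE tasks, which (with the change discussed) are
 * never stalled here.
 */
static bool should_throttle_writeout(unsigned long nr_dirty_writeback,
				     unsigned long threshold,
				     unsigned int task_flags)
{
	if (task_flags & PF_LESS_THROTTLE)
		return false;
	return nr_dirty_writeback > threshold;
}
```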
> That seems strange to me, because I thought it was the responsibility of
> each backing device to impose a limit - a maximum queue size of some
> sort.

We do throttle on congestion during reclaim, so in some sense this is
already implemented, but I am not really sure it is sufficient. Maybe
this is something to re-evaluate, because wait_iff_congested came in
much later than throttle_vm_writeout. Let me think about it some more.

> I remember when NFS didn't impose a limit and you could end up with lots
> of memory in NFS write-back, and very long latencies could result.
> 
> So I wonder what throttle_vm_writeout() really achieves these days.  Is
> it just a bandaid that no-one is brave enough to remove?

Maybe yes. It is sitting there quietly and you do not know about it
until it bites. Like in this particular case.

> I guess it could play a role in balancing the freeing of clean pages,
> which can be done instantly, against dirty pages, which require
> writeback.  Without some throttling, might all clean pages be freed
> too quickly, just trashing our read caches?

I do not see how that would happen. kswapd has its reclaim targets
depending on watermarks, and direct reclaim has SWAP_CLUSTER_MAX. So
neither of them should go too wild and reclaim way too many clean pages.

> > Or is this assumption wrong for nfsd_vfs_write? Can it cause unbounded
> > dirtying of memory?
> 
> In most cases, nfsd is just like any other application and needs to be
> throttled like any other application when it writes too much data.
> The only time nfsd *needs* PF_LESS_THROTTLE is when a loop-back mount
> is active - when the same page cache is the source and destination of
> writes.
> So nfsd needs to be able to dirty a few more pages when nothing else
> can due to a high dirty count.  Otherwise it deadlocks.
> The main use of PF_LESS_THROTTLE is in zone_dirty_limit() and
> domain_dirty_limits() where an extra 25% is allowed to overcome this
> deadlock.
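
[The 25% headroom described above amounts to arithmetic of this kind. An
illustrative sketch only -- the real computation in domain_dirty_limits()
is more involved, and the function name here is made up.]

```c
#include <assert.h>

#define PF_LESS_THROTTLE 0x00100000	/* same bit as the kernel flag */

/*
 * A PF_LESS_THROTTLE task gets roughly 25% more dirty headroom, which
 * is what lets nfsd keep cleaning pages over a loop-back mount when
 * every other writer is already blocked on the dirty limit.
 */
static unsigned long task_dirty_limit(unsigned long base_limit,
				      unsigned int task_flags)
{
	if (task_flags & PF_LESS_THROTTLE)
		return base_limit + base_limit / 4;
	return base_limit;
}
```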
> 
> The use of PF_LESS_THROTTLE in current_may_throttle() in vmscan.c is to
> avoid a live-lock.  A key premise is that nfsd only allocates unbounded
> memory when it is writing to the page cache.  So it only needs to be
> throttled when the backing device it is writing to is congested.  It is
> particularly important that it *doesn't* get throttled just because an
> NFS backing device is congested, because nfsd might be trying to clear
> that congestion.

Thanks for the clarification. IIUC, then, removing throttle_vm_writeout
for the nfsd writeout should be harmless as well, right?

> In general, callers of try_to_free_pages() might get throttled when any
> backing device is congested.  This is a reasonable default when we don't
> know what they are allocating memory for.  When we do know the purpose of
> the allocation, we can be more cautious about throttling.
> 
> If a thread is allocating just to dirty pages for a given backing
> device, we only need to throttle the allocation if the backing device is
> congested.  Any further throttling needed happens in
> balance_dirty_pages().
> 
> If a thread is only making transient allocations, ones which will be
> freed shortly afterwards (not, for example, put in a cache), then I
> don't think it needs to be throttled at all.  I think this universally
> applies to mempools.
> In the case of dm-crypt, if it is writing too fast it will eventually be
> throttled in generic_make_request() when the underlying device has a full
> queue and so blocks waiting for requests to be completed, and thus parts
> of them returned to the mempool.

Makes sense to me.

> >> The purpose of that flag is to allow a thread to dirty a page-cache page
> >> as part of cleaning another page-cache page.
> >> So it makes sense for loop and sometimes for nfsd.  It would make sense
> >> for dm-crypt if it was putting the encrypted version in the page cache.
> >> But if dm-crypt is just allocating a transient page (which I think it
> >> is), then a mempool should be sufficient (and we should make sure it is
> >> sufficient) and access to an extra 10% (or whatever) of the page cache
> >> isn't justified.
> >
> > If you think that the PF_LESS_THROTTLE (ab)use in mempool_alloc is not
> > appropriate, then would a PF_MEMPOOL be any better?
> 
> Why a PF rather than a GFP flag?

Well, the short answer is that gfp masks are almost depleted.

> NFSD uses a PF because there is no GFP interface for filesystem write.
> But mempool can pass down a GFP flag, so I think it should.
> The meaning of the flag is, in my opinion, that a 'transient' allocation
> is being requested, i.e. an allocation which will be used for a single
> purpose for a short amount of time and will then be freed.  In
> particular it will never be placed in a cache, and if it is ever
> placed on a queue, that is certain to be a queue with an upper bound on
> the size and with guaranteed forward progress in the face of memory
> pressure.
> Any allocation request for a use case with those properties should be
> allowed to set GFP_TRANSIENT (for example) with the effect that the
> allocation will not be throttled.
> A key point with the name is to identify the purpose of the flag, not a
> specific use case (mempool) which we want it for.

Agreed. But let's first explore throttle_vm_writeout and its potential
removal.

Thanks!
-- 
Michal Hocko
SUSE Labs

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel