On Wed 27-07-16 13:43:35, NeilBrown wrote:
> On Mon, Jul 25 2016, Michal Hocko wrote:
>
> > On Sat 23-07-16 10:12:24, NeilBrown wrote:
[...]
> >> > My thinking was that throttle_vm_writeout is there to prevent us
> >> > from dirtying too many pages from the reclaim context.
> >> > PF_LESS_THROTTLE is part of the writeout, so throttling it on too
> >> > many dirty pages is questionable (well, we get some bias, but that
> >> > is not really reliable). It still makes sense to throttle when the
> >> > backing device is congested, because the writeout path wouldn't
> >> > make much progress anyway and we also do not want to cycle through
> >> > the LRU lists too quickly in that case.
> >>
> >> "dirtying ... from the reclaim context" ??? What does that mean?
> >
> > Say you cause a swapout from the reclaim context. You effectively
> > dirty that anon page until it gets written down to the storage.
>
> I should probably figure out how swap really works. I have vague ideas
> which are probably missing important details...
> Isn't the first step that the page gets moved into the swap-cache - and
> marked dirty, I guess? Then it gets written out and the page is marked
> 'clean'.
> Then further memory pressure might push it out of the cache, or an
> early re-use would pull it back from the cache.
> If so, then "dirtying in reclaim context" could also be described as
> "moving into the swap cache" - yes?

Yes, that is basically correct.

> So should there be a limit on dirty pages in the swap cache, just like
> there is for dirty pages in any filesystem (the dirty_ratio thing)?
> Maybe there is?

There is no limit AFAIK. We rely on the reclaim being throttled when
necessary.

> >> The use of PF_LESS_THROTTLE in current_may_throttle() in vmscan.c is
> >> to avoid a live-lock. A key premise is that nfsd only allocates
> >> unbounded memory when it is writing to the page cache. So it only
> >> needs to be throttled when the backing device it is writing to is
> >> congested. It is particularly important that it *doesn't* get
> >> throttled just because an NFS backing device is congested, because
> >> nfsd might be trying to clear that congestion.
> >
> > Thanks for the clarification. IIUC then removing throttle_vm_writeout
> > for the nfsd writeout should be harmless as well, right?
>
> Certainly shouldn't hurt from the perspective of nfsd.
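For reference, the helper NeilBrown describes looks roughly like this in
v4.7-era kernels (quoted approximately; see mm/vmscan.c for the
authoritative version):

    /*
     * A PF_LESS_THROTTLE task (nfsd, the loop driver) is throttled only
     * when the backing device it is writing to is congested; any other
     * task may always be throttled.
     */
    static int current_may_throttle(void)
    {
            return !(current->flags & PF_LESS_THROTTLE) ||
                    current->backing_dev_info == NULL ||
                    bdi_write_congested(current->backing_dev_info);
    }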
> >> >> The purpose of that flag is to allow a thread to dirty a
> >> >> page-cache page as part of cleaning another page-cache page.
> >> >> So it makes sense for loop and sometimes for nfsd. It would make
> >> >> sense for dm-crypt if it was putting the encrypted version in the
> >> >> page cache. But if dm-crypt is just allocating a transient page
> >> >> (which I think it is), then a mempool should be sufficient (and we
> >> >> should make sure it is sufficient) and access to an extra 10% (or
> >> >> whatever) of the page cache isn't justified.
> >> >
> >> > If you think that PF_LESS_THROTTLE (ab)use in mempool_alloc is not
> >> > appropriate then would a PF_MEMPOOL be any better?
> >>
> >> Why a PF rather than a GFP flag?
> >
> > Well, the short answer is that gfp masks are almost depleted.
>
> Really? We have 26.
>
> pagemap has a cute hack to store both GFP flags and other flag bits in
> the one 32-bit number per address_space. 'struct address_space' could
> afford an extra 32-bit number, I think.
>
> radix_tree_root adds 3 'tag' flags to the gfp_mask.
> There are 16 bits of free space in radix_tree_node (between 'offset'
> and 'count'). That space on the root node could store a record of which
> tags are set anywhere. Or would that extra memory de-ref be a killer?

Yes, these are among the reasons why adding new gfp flags is more
complicated.

> I think we'd end up with cleaner code if we removed the cute hacks. And
> we'd be able to use 6 more GFP flags!! (Though I do wonder if we really
> need all those 26.)

Well, maybe we are able to remove those hacks; I definitely wouldn't be
opposed. But right now I am not even convinced that a mempool-specific
gfp flag is the right way to go.
-- 
Michal Hocko
SUSE Labs

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel
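For reference, the two "cute hacks" mentioned above look roughly like
this in v4.7-era kernels (paraphrased; see include/linux/pagemap.h and
lib/radix-tree.c for the authoritative versions). Both stash extra state
in the bits above __GFP_BITS_SHIFT, which is what stands between us and
those 6 spare GFP flags:

    /*
     * include/linux/pagemap.h: the per-mapping gfp mask shares the
     * single mapping->flags word with the AS_* flag bits, which live
     * above __GFP_BITS_SHIFT.
     */
    static inline gfp_t mapping_gfp_mask(struct address_space *mapping)
    {
            return (__force gfp_t)mapping->flags & __GFP_BITS_MASK;
    }

    /*
     * lib/radix-tree.c: the three per-tree tags are stored in the high
     * bits of root->gfp_mask, again above __GFP_BITS_SHIFT.
     */
    static inline void root_tag_set(struct radix_tree_root *root,
                                    unsigned int tag)
    {
            root->gfp_mask |= (__force gfp_t)(1 << (tag + __GFP_BITS_SHIFT));
    }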