On Mon, 2025-01-13 at 16:27 +0100, David Hildenbrand wrote:
> On 10.01.25 23:00, Shakeel Butt wrote:
> > On Fri, Jan 10, 2025 at 10:13:17PM +0100, David Hildenbrand wrote:
> > > On 10.01.25 21:28, Jeff Layton wrote:
> > > > On Thu, 2025-01-09 at 12:22 +0100, David Hildenbrand wrote:
> > > > > On 07.01.25 19:07, Shakeel Butt wrote:
> > > > > > On Tue, Jan 07, 2025 at 09:34:49AM +0100, David Hildenbrand wrote:
> > > > > > > On 06.01.25 19:17, Shakeel Butt wrote:
> > > > > > > > On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote:
> > > > > > > > > On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@xxxxxxxxxx> wrote:
> > > > > > > > > > In any case, having movable pages be turned unmovable due to persistent
> > > > > > > > > > writeback is something that must be fixed, not worked around. Likely a
> > > > > > > > > > good topic for LSF/MM.
> > > > > > > > > >
> > > > > > > > > Yes, this seems a good cross fs-mm topic.
> > > > > > > > >
> > > > > > > > > So the issue discussed here is that movable pages used for fuse
> > > > > > > > > page-cache cause problems when memory needs to be compacted. The
> > > > > > > > > problem is either that
> > > > > > > > >
> > > > > > > > > - the page is skipped, leaving the physical memory block unmovable
> > > > > > > > >
> > > > > > > > > - the compaction is blocked for an unbounded time
> > > > > > > > >
> > > > > > > > > While the new AS_WRITEBACK_INDETERMINATE could potentially make things
> > > > > > > > > worse, the same thing happens on readahead, since the new page can be
> > > > > > > > > locked for an indeterminate amount of time, which can also block
> > > > > > > > > compaction, right?
> > > > > > >
> > > > > > > Yes, as memory hotplug + virtio-mem maintainer my bigger concern is these
> > > > > > > pages residing in ZONE_MOVABLE / MIGRATE_CMA areas where there *must not be
> > > > > > > unmovable pages ever*. Not triggered by an untrusted source, not triggered
> > > > > > > by a trusted source.
> > > > > > >
> > > > > > > It's a violation of core-mm principles.
> > > > > >
> > > > > > The "must not be unmovable pages ever" is a very strong statement and we
> > > > > > are violating it today and will keep violating it in future. Any
> > > > > > page/folio under lock or writeback or have reference taken or have been
> > > > > > isolated from their LRU is unmovable (most of the time for small period
> > > > > > of time).
> > > > >
> > > > > ^ this: "small period of time" is what I meant.
> > > > >
> > > > > Most of these things are known to not be problematic: retrying a couple
> > > > > of times makes it work, that's why migration keeps retrying.
> > > > >
> > > > > Again, as an example, we allow short-term O_DIRECT but disallow
> > > > > long-term page pinning. I think there were concerns at some point if
> > > > > O_DIRECT might also be problematic (I/O might take a while), but so far
> > > > > it was not a problem in practice that would make CMA allocations easily
> > > > > fail.
> > > > >
> > > > > vmsplice() is a known problem, because it behaves like O_DIRECT but
> > > > > actually triggers long-term pinning; IIRC David Howells has this on his
> > > > > todo list to fix. [I recall that seccomp disallows vmsplice by default
> > > > > right now]
> > > > >
> > > > > > These operations are being done all over the place in kernel.
> > > > > > Miklos gave an example of readahead.
> > > > >
> > > > > I assume you mean "unmovable for a short time", correct, or can you
> > > > > point me at that specific example; I think I missed that.
> >
> > Please see https://lore.kernel.org/all/CAJfpegthP2enc9o1hV-izyAG9nHcD_tT8dKFxxzhdQws6pcyhQ@xxxxxxxxxxxxxx/
> >
> > > > >
> > > > > > The per-CPU LRU caches are another
> > > > > > case where folios can get stuck for long period of time.
> > > > >
> > > > > Which is why memory offlining disables the lru cache. See
> > > > > lru_cache_disable(). Other users that care about that drain the LRU on
> > > > > all cpus.
> > > > >
> > > > > > Reclaim and
> > > > > > compaction can isolate a lot of folios that they need to have
> > > > > > too_many_isolated() checks. So, "must not be unmovable pages ever" is
> > > > > > impractical.
> > > > >
> > > > > "must only be short-term unmovable", better?
> >
> > Yes and you have clarified further below of the actual amount.
> >
> > > > >
> > > >
> > > > Still a little ambiguous.
> > > >
> > > > How short is "short-term"? Are we talking milliseconds or minutes?
> > >
> > > Usually a couple of seconds, max. For memory offlining, slightly longer
> > > times are acceptable; other things (in particular compaction or CMA
> > > allocations) will give up much faster.
> > >
> > > >
> > > > Imposing a hard timeout on writeback requests to unprivileged FUSE
> > > > servers might give us a better guarantee of forward-progress, but it
> > > > would probably have to be on the order of at least a minute or so to be
> > > > workable.
> > >
> > > Yes, and that might already be a bit too much, especially if stuck on
> > > waiting for folio writeback ... so ideally we could find a way to migrate
> > > these folios that are under writeback and it's not your ordinary disk driver
> > > that responds rather quickly.
> > >
> > > Right now we do it via these temp pages, and I can see how that's
> > > undesirable.
> > >
> > > For NFS etc. we probably never ran into this, because it's all used in
> > > fairly well managed environments and, well, I assume NFS easily outdates CMA
> > > and ZONE_MOVABLE :)
> > >
> > > >
> > > > >
> > > > > > The point is that, yes we should aim to improve things but in iterations
> > > > > > and "must not be unmovable pages ever" is not something we can achieve
> > > > > > in one step.
> > > > >
> > > > > I agree with the "improve things in iterations", but as
> > > > > AS_WRITEBACK_INDETERMINATE has the FOLL_LONGTERM smell to it, I think we
> > > > > are making things worse.
> >
> > AS_WRITEBACK_INDETERMINATE is really a bad name we picked as it is still
> > causing confusion. It is a simple flag to avoid deadlock in the reclaim
> > code path and does not say anything about movability.
> >
> > > > >
> > > > > And as this discussion has been going on for too long, to summarize my
> > > > > point: there exist conditions where pages are short-term unmovable, and
> > > > > possibly some to be fixed that turn pages long-term unmovable (e.g.,
> > > > > vmsplice); that does not mean that we can freely add new conditions that
> > > > > turn movable pages unmovable long-term or even forever.
> > > > >
> > > > > Again, this might be a good LSF/MM topic. If I would have the capacity I
> > > > > would suggest a topic around which things are known to cause pages to be
> > > > > short-term or long-term unmovable/unsplittable, and which can be
> > > > > handled, which not. Maybe I'll find the time to propose that as a topic.
> > > > >
> > > >
> > > >
> > > > This does sound like great LSF/MM fodder! I predict that this session
> > > > will run long! ;)
> > >
> > > Heh, fully agreed! :)
> >
> > I would like a more targeted topic and for that I want us to at least
> > agree where we are disagreeing. Let me write down two statements and
> > please tell me where you disagree:
>
> I think we're mostly in agreement!
>
> >
> > 1. For a normal running FUSE server (without tmp pages), the lifetime of
> > writeback state of fuse folios falls under "short-term unmovable" bucket
> > as it does not differ in any way from any other filesystems handling
> > writeback folios.
>
> That's the expectation, yes. As long as the FUSE server is able to make
> progress, the expectation is that it's just like NFS etc. If it isn't
> able to make progress (i.e., crash), the expectation is that everything
> will get cleaned up either way.
>
> I wonder if there could be a valid scenario where the FUSE server is no
> longer able to make progress (ignoring network outages), or the progress
> might start being extremely slow such that it becomes a problem. In
> contrast to in-kernel FSs, one can do some fancy stuff with fuse where
> writing a page could possibly consume a lot of memory in user-space.
> Likely, in this case we might just blame it on the admin that agreed to
> running this (trusted) fuse server.
>
> >
> > 2. For a buggy or untrusted FUSE server (without tmp pages), the
> > lifetime of writeback state of fuse folios can be arbitrarily long and
> > we need some mechanism to limit it.
>
> Yes.
>
>
> Especially in 1), we really want to wait for writeback to finish, just
> like for any other filesystem. For 2), we want a way so writeback will
> not get stuck for a long time, but are able to make progress and migrate
> these pages.
>

What if we were to allow the kernel to kill off an unprivileged FUSE
server that was "misbehaving" [1], clean any dirty pagecache pages that
it has, and set writeback errors on the corresponding FUSE inodes [2]?
We'd still need a rather long timeout (on the order of at least a minute
or so, by default).

Would that be enough to assuage concerns about unprivileged servers
pinning pages indefinitely? Buggy servers are still a problem, but
there's not much we can do about that.

There are a lot of details we'd have to sort out, so I'm also interested
in whether anyone (Miklos? Bernd?) would find this basic approach
objectionable.

[1]: for some definition of misbehavior. Probably a writeback timeout of
some sort but maybe there would be other criteria too.

[2]: or maybe just make them eligible to be cleaned without talking to
the server, should the VM wish it.

--
Jeff Layton <jlayton@xxxxxxxxxx>