On Fri, Jan 10, 2025 at 10:13:17PM +0100, David Hildenbrand wrote:
> On 10.01.25 21:28, Jeff Layton wrote:
> > On Thu, 2025-01-09 at 12:22 +0100, David Hildenbrand wrote:
> > > On 07.01.25 19:07, Shakeel Butt wrote:
> > > > On Tue, Jan 07, 2025 at 09:34:49AM +0100, David Hildenbrand wrote:
> > > > > On 06.01.25 19:17, Shakeel Butt wrote:
> > > > > > On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote:
> > > > > > > On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@xxxxxxxxxx> wrote:
> > > > > > > > In any case, having movable pages be turned unmovable due to persistent
> > > > > > > > writeback is something that must be fixed, not worked around. Likely a
> > > > > > > > good topic for LSF/MM.
> > > > > > >
> > > > > > > Yes, this seems a good cross fs-mm topic.
> > > > > > >
> > > > > > > So the issue discussed here is that movable pages used for fuse
> > > > > > > page-cache cause problems when memory needs to be compacted. The
> > > > > > > problem is either that
> > > > > > >
> > > > > > > - the page is skipped, leaving the physical memory block unmovable
> > > > > > >
> > > > > > > - the compaction is blocked for an unbounded time
> > > > > > >
> > > > > > > While the new AS_WRITEBACK_INDETERMINATE could potentially make things
> > > > > > > worse, the same thing happens on readahead, since the new page can be
> > > > > > > locked for an indeterminate amount of time, which can also block
> > > > > > > compaction, right?
> > > > >
> > > > > Yes, as memory hotplug + virtio-mem maintainer my bigger concern is these
> > > > > pages residing in ZONE_MOVABLE / MIGRATE_CMA areas where there *must not be
> > > > > unmovable pages ever*. Not triggered by an untrusted source, not triggered
> > > > > by a trusted source.
> > > > >
> > > > > It's a violation of core-mm principles.
> > > >
> > > > The "must not be unmovable pages ever" is a very strong statement and we
> > > > are violating it today and will keep violating it in future.
> > > > Any page/folio under lock or writeback or have reference taken or have been
> > > > isolated from their LRU is unmovable (most of the time for small period
> > > > of time).
> > >
> > > ^ this: "small period of time" is what I meant.
> > >
> > > Most of these things are known to not be problematic: retrying a couple
> > > of times makes it work, that's why migration keeps retrying.
> > >
> > > Again, as an example, we allow short-term O_DIRECT but disallow
> > > long-term page pinning. I think there were concerns at some point if
> > > O_DIRECT might also be problematic (I/O might take a while), but so far
> > > it was not a problem in practice that would make CMA allocations easily
> > > fail.
> > >
> > > vmsplice() is a known problem, because it behaves like O_DIRECT but
> > > actually triggers long-term pinning; IIRC David Howells has this on his
> > > todo list to fix. [I recall that seccomp disallows vmsplice by default
> > > right now]
> > >
> > > > These operations are being done all over the place in kernel.
> > > > Miklos gave an example of readahead.
> > >
> > > I assume you mean "unmovable for a short time", correct, or can you
> > > point me at that specific example; I think I missed that.

Please see https://lore.kernel.org/all/CAJfpegthP2enc9o1hV-izyAG9nHcD_tT8dKFxxzhdQws6pcyhQ@xxxxxxxxxxxxxx/

> > >
> > > > The per-CPU LRU caches are another case where folios can get stuck
> > > > for long period of time.
> > >
> > > Which is why memory offlining disables the lru cache. See
> > > lru_cache_disable(). Other users that care about that drain the LRU on
> > > all cpus.
> > >
> > > > Reclaim and compaction can isolate a lot of folios that they need to
> > > > have too_many_isolated() checks. So, "must not be unmovable pages
> > > > ever" is impractical.
> > >
> > > "must only be short-term unmovable", better?

Yes, and you have clarified the actual amount further below.

> > >
> >
> > Still a little ambiguous.
> >
> > How short is "short-term"? Are we talking milliseconds or minutes?
>
> Usually a couple of seconds, max. For memory offlining, slightly longer
> times are acceptable; other things (in particular compaction or CMA
> allocations) will give up much faster.
>
> >
> > Imposing a hard timeout on writeback requests to unprivileged FUSE
> > servers might give us a better guarantee of forward-progress, but it
> > would probably have to be on the order of at least a minute or so to be
> > workable.
>
> Yes, and that might already be a bit too much, especially if stuck on
> waiting for folio writeback ... so ideally we could find a way to migrate
> these folios that are under writeback and it's not your ordinary disk driver
> that responds rather quickly.
>
> Right now we do it via these temp pages, and I can see how that's
> undesirable.
>
> For NFS etc. we probably never ran into this, because it's all used in
> fairly well managed environments and, well, I assume NFS easily outdates CMA
> and ZONE_MOVABLE :)
>
> > > >
> > > > The point is that, yes we should aim to improve things but in iterations
> > > > and "must not be unmovable pages ever" is not something we can achieve
> > > > in one step.
> > >
> > > I agree with the "improve things in iterations", but as
> > > AS_WRITEBACK_INDETERMINATE has the FOLL_LONGTERM smell to it, I think we
> > > are making things worse.

AS_WRITEBACK_INDETERMINATE is really a bad name that we picked, as it is
still causing confusion. It is a simple flag to avoid a deadlock in the
reclaim code path and does not say anything about movability.

> > >
> > > And as this discussion has been going on for too long, to summarize my
> > > point: there exist conditions where pages are short-term unmovable, and
> > > possibly some to be fixed that turn pages long-term unmovable (e.g.,
> > > vmsplice); that does not mean that we can freely add new conditions that
> > > turn movable pages unmovable long-term or even forever.
> > >
> > > Again, this might be a good LSF/MM topic. If I would have the capacity I
> > > would suggest a topic around which things are known to cause pages to be
> > > short-term or long-term unmovable/unsplittable, and which can be
> > > handled, which not. Maybe I'll find the time to propose that as a topic.
> > >
> >
> >
> > This does sound like great LSF/MM fodder! I predict that this session
> > will run long! ;)
>
> Heh, fully agreed! :)

I would like a more targeted topic, and for that I want us to at least
agree on where we are disagreeing. Let me write down two statements;
please tell me where you disagree:

1. For a normally running FUSE server (without tmp pages), the lifetime
   of the writeback state of fuse folios falls under the "short-term
   unmovable" bucket, as it does not differ in any way from any other
   filesystem handling writeback folios.

2. For a buggy or untrusted FUSE server (without tmp pages), the
   lifetime of the writeback state of fuse folios can be arbitrarily
   long, and we need some mechanism to limit it.