On Tue, Jun 04, 2024 at 11:59:51AM +0200, David Hildenbrand wrote:
> [...]
>
> > > > How can we use per-block tracking for reclaiming memory, and what changes would be needed? Or is per-block really a non-viable option?
> > >
> > > The interesting thing is: with mTHP toggles it is opt-in -- like for PMD-sized THP with shmem_enabled -- and we don't have to be that concerned about this problem right now.
> >
> > Without respecting the size when allocating large folios, mTHP toggles would over-allocate. My proposal added earlier to this thread is to combine the two to avoid that case. Otherwise, shouldn't we try to find a solution for the reclaiming path?
>
> I think at some point we'll really have to do a better job at reclaiming (either memory over-allocation, PUNCH_HOLE that couldn't split, but maybe also pages that are now all-zero again and could be reclaimed again).
>
> > > > Clearly, if per-block is a viable option, the shmem_fault() bug would need to be fixed first. Any ideas on how to make it reproducible?
> > > >
> > > > The alternatives discussed were sub-page refcounting and zeropage scanning.
> > >
> > > Yeah, I don't think sub-page refcounting is a feasible (and certainly not desired ;) ) option in the folio world.
> > >
> > > > The first one is not possible (IIUC) because of a refactor years back that simplified the code, and it would also require extra complexity. The second option would require additional overhead as we would involve scanning.
> > >
> > > We'll likely need something similar (scanning, tracking?) for anonymous memory as well. There was a proposal for a THP shrinker some time ago that would solve part of the problem.
> >
> > It's good to know we have the same problem in different places. I'm more inclined to keep the information rather than adding extra overhead, unless the complexity is really overwhelming. Considering the concerns here, I'm not sure how much we should try merging with iomap as per Ritesh's suggestion.
>
> As raised in the meeting, I do see value in maintaining the information; but I also see why Hugh and Kirill think this might be unwarranted complexity in shmem.c. Meaning: we might be able to achieve something similar without it, and we don't have to solve the problem right now to benefit from large folios.
>
> > The THP shrinker: could you please confirm whether it is the following thread?
> > https://lore.kernel.org/all/cover.1667454613.git.alexlzhu@xxxxxx/
>
> Yes, although there was no follow-up so far. Possibly because, in the current khugepaged approach, there will be a constant back-and-forth between khugepaged collapsing memory (and wasting memory in the default setting) and the THP shrinker reclaiming memory; doesn't sound quite effective :) That needs more thought IMHO.
>
> [...]
>
> > > > > To optimize FALLOC_FL_PUNCH_HOLE for the cases where splitting+freeing is not possible at fallocate() time, detecting zeropages later and retrying to split+free might be an option, without per-block tracking.
> > > > >
> > > > > (2) mTHP controls
> > > > >
> > > > > As a default, we should not be using large folios / mTHP for any shmem, just like we did with THP via shmem_enabled. This is what this series currently does, and is part of the whole mTHP user-space interface design.
> > > >
> > > > That was clear for me too. But what is the reason we want to boot in 'safe mode'? What are the implications of not respecting that?
> > >
> > > [...]
> > >
> > > > As I understood from the call, mTHP with sysfs knobs is preferred over optimistic falloc/write allocation? But it is still unclear to me why the former is preferred.
> > >
> > > I think Hugh's point was that this should be an opt-in feature, just like PMD-sized THP started out, and still is, an opt-in feature.
> >
> > I'd be keen to understand the use case for this, even for the current THP controls we have in tmpfs. I guess these are just scenarios with no swap involved? Are these use cases the same for both tmpfs and anonymous shmem?
>
> Systems without swap are one case, I think. The other reason for a toggle in the past was to be able to disable it to troubleshoot issues (Hugh mentioned in the meeting that new code paths in shmem.c should only be unlocked with a toggle).
>
> Staring at my Fedora system:
>
> $ cat /sys/kernel/mm/transparent_hugepage/shmem_enabled
> always within_size advise [never] deny force
>
> Maybe because it uses tmpfs to mount /tmp (interesting article on lwn.net about that) and people are concerned about the side effects (which can currently waste memory, or result in more reclaim work being required when exceeding file sizes).
>
> For VMs, I know that shmem+THP with memory ballooning is problematic and not really recommended.

Agree. We cannot PUNCH_HOLE when using THP unless the punch is PMD-sized.
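As a rough illustration of that limitation (assuming a 2M PMD, shmem THP enabled via shmem_enabled, and no error handling), a userspace program can punch both kinds of holes in a shmem-backed memfd; only the PMD-aligned, PMD-sized punch can release a whole huge folio, while the sub-PMD punch depends on the folio being split first:

/*
 * Sketch: punch holes in a shmem (memfd) file that may be backed by THP.
 * Assumes a 2M PMD and shmem THP enabled; error handling omitted.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const size_t pmd = 2UL << 20;           /* assumed PMD size: 2M */
	int fd = memfd_create("thp-punch", 0);  /* shmem-backed file */
	char *p;

	ftruncate(fd, 4 * pmd);
	p = mmap(NULL, 4 * pmd, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	memset(p, 0xaa, 4 * pmd);               /* fault in (possibly huge) folios */

	/* PMD-sized, PMD-aligned punch: the backing huge folio can be freed. */
	fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, pmd);

	/*
	 * Sub-PMD punch inside a huge folio: freeing memory requires splitting
	 * the folio first; otherwise the range is only zeroed.
	 */
	fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, pmd, 64 << 10);

	munmap(p, 4 * pmd);
	close(fd);
	return 0;
}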
> [...]
>
> > > > > Also, we should properly fall back within the configured sizes, and not jump "over" configured sizes. Unless there is a good reason.
> > > > >
> > > > > (3) khugepaged
> > > > >
> > > > > khugepaged needs to handle larger folios properly as well. Until that is fixed, using smaller THP sizes as a fallback might prohibit collapsing a PMD-sized THP later. But really, khugepaged needs to be fixed to handle that.
> > > > >
> > > > > (4) force/disable
> > > > >
> > > > > These settings are rather testing artifacts from the old ages. We should not add them to the per-size toggles. We might "inherit" them from the global one, though.
> > > > >
> > > > > "within_size" might have value, and especially for consistency, we should have it per size.
> > > > >
> > > > > So, this series only tackles anonymous shmem, which is a good starting point. Ideally, we'd get support for other shmem (especially during fault time) soon afterwards, because we won't be adding separate toggles for that from the interface POV, and having inconsistent behavior between kernel versions would be a bit unfortunate.
> > > > >
> > > > > @Baolin, this series likely does not consider (4) yet. And I suggest we have to take a lot of the "anonymous thp" terminology out of this series, especially when it comes to documentation.
> > > > >
> > > > > @Daniel, Pankaj, what are your plans regarding that? It would be great if we could get an understanding of the next steps on !anon shmem.
> > > >
> > > > I realize I've raised so many questions, but it's essential for us to grasp the mm concerns and usage scenarios. This understanding will provide clarity on the direction regarding folios for !anon shmem.
> > >
> > > If I understood correctly, Hugh had strong feelings against not respecting mTHP toggles for shmem. Without per-block tracking, I agree with that.
> >
> > My understanding was the same. But I have a follow-up question: should we respect mTHP toggles without considering mapping constraints (size and index)? Or perhaps we should use within_size where we can fit this intermediate approach, as long as mTHP granularity is respected?
>
> Likely we should do exactly the same as we would do with the existing PMD-sized THPs.
>
> I recall that in the meeting we discussed that always vs. within_size seems to have some value, and that we should respect that setting like we did for PMD-sized THP.
>
> Which other mapping constraints could we have?

Patch 12 in my RFC for LSF [1] adds this shmem_mapping_size_order() (updated to support get_order()) to get the max order 'based on the file size which the mapping currently allows at the given index'. The same check is done here [2] in filemap.c.

[1] https://lore.kernel.org/all/20240515055719.32577-13-da.gomez@xxxxxxxxxxx/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/filemap.c?h=v6.10-rc2#n1940

@@ -1726,6 +1726,37 @@ static struct folio *shmem_alloc_folio(gfp_t gfp, int order,
 	return folio;
 }
 
+/**
+ * shmem_mapping_size_order - Get maximum folio order for the given file size.
+ * @mapping: Target address_space.
+ * @index: The page index.
+ * @size: The suggested size of the folio to create.
+ *
+ * This returns a high order for folios (when supported) based on the file size
+ * which the mapping currently allows at the given index. The index is relevant
+ * due to alignment considerations the mapping might have. The returned order
+ * may be less than the size passed.
+ *
+ * Like __filemap_get_folio order calculation.
+ *
+ * Return: The order.
+ */
+static inline unsigned int
+shmem_mapping_size_order(struct address_space *mapping, pgoff_t index,
+			 size_t size)
+{
+	unsigned int order = get_order(max_t(size_t, size, PAGE_SIZE));
+
+	if (!mapping_large_folio_support(mapping))
+		return 0;
+
+	/* If we're not aligned, allocate a smaller folio */
+	if (index & ((1UL << order) - 1))
+		order = __ffs(index);
+
+	return min_t(size_t, order, MAX_PAGECACHE_ORDER);
+}
+
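To sketch how a size/index limit like this could be combined with the per-size mTHP toggles (falling back within the configured sizes rather than jumping over them), here is a small standalone program; it only mirrors the arithmetic above in plain C with made-up inputs (page size, max order, and the enabled-order bitmap are assumptions), not kernel code:

/*
 * Standalone sketch (not kernel code): mirror the order calculation of
 * shmem_mapping_size_order() above, then clamp the result to a hypothetical
 * set of mTHP-enabled orders and pick the highest remaining one.
 * Assumes 4K pages and 64-bit unsigned long.
 */
#include <stdio.h>

#define PAGE_SHIFT		12
#define PAGE_SIZE		(1UL << PAGE_SHIFT)
#define MAX_PAGECACHE_ORDER	9	/* PMD order on x86-64, for the example */

/* Smallest order whose folio covers @size bytes (like get_order()). */
static unsigned int order_for_size(unsigned long size)
{
	unsigned int order = 0;

	while ((PAGE_SIZE << order) < size)
		order++;
	return order;
}

/* Max order allowed by the file size and index alignment (as in the patch). */
static unsigned int size_index_order(unsigned long index, unsigned long size)
{
	unsigned int order = order_for_size(size < PAGE_SIZE ? PAGE_SIZE : size);

	if (index & ((1UL << order) - 1))
		order = __builtin_ctzl(index);	/* lowest set bit, like __ffs() */
	return order < MAX_PAGECACHE_ORDER ? order : MAX_PAGECACHE_ORDER;
}

/* Highest toggle-enabled order that does not exceed @max; 0 if none. */
static unsigned int pick_enabled_order(unsigned long enabled, unsigned int max)
{
	enabled &= (1UL << (max + 1)) - 1;
	return enabled ? 63 - __builtin_clzl(enabled) : 0;
}

int main(void)
{
	unsigned long enabled = (1UL << 4) | (1UL << 9);  /* 64K and 2M enabled */
	/* A 1 MiB write landing at page index 16 (64K-aligned, not 2M-aligned). */
	unsigned int max = size_index_order(16, 1UL << 20);

	printf("size/index allow order %u -> allocate order %u\n",
	       max, pick_enabled_order(enabled, max));
	return 0;
}

With these inputs, the file size alone would allow an order-8 folio, but the index alignment caps it at order 4, so the enabled 64K (order-4) toggle is used instead of jumping to the 2M one.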
> --
> Cheers,
>
> David / dhildenb
>