Hi Marc,

On Tue, Jan 31, 2023 at 2:28 AM Marc Zyngier <maz@xxxxxxxxxx> wrote:
>
> On Fri, 27 Jan 2023 15:45:15 +0000,
> Ricardo Koller <ricarkol@xxxxxxxxxx> wrote:
> >
> > > The one thing that would convince me to make it an option is the
> > > amount of memory this thing consumes. 512+ pages is a huge amount, and
> > > I'm not overly happy about that. Why can't this be a userspace visible
> > > option, selectable on a per VM (or memslot) basis?
> > >
> >
> > It should be possible. I am exploring a couple of ideas that could
> > help when the hugepages are not 1G (e.g., 2M). However, they add
> > complexity and I'm not sure they help much.
> >
> > (will be using PAGE_SIZE=4K to make things simpler)
> >
> > This feature pre-allocates 513 pages before splitting every 1G range.
> > For example, it converts 1G block PTEs into trees made of 513 pages.
> > When not using this feature, the same 513 pages would be allocated,
> > but lazily over a longer period of time.
>
> This is an important difference. It avoids the upfront allocation
> "thermal shock", giving time to the kernel to reclaim memory from
> somewhere else. Doing it upfront means you *must* have 2MB+ of
> immediately available memory for each GB of RAM your guest uses.
>
> >
> > Eager-splitting pre-allocates those pages in order to split huge-pages
> > into fully populated trees, which is needed in order to use FEAT_BBM
> > and skip the expensive TLBI broadcasts. 513 is just the number of
> > pages needed to break a 1G huge-page.
>
> I understand that. But it is also clear that 1GB huge pages are unlikely
> to be THPs, and I wonder if we should treat the two differently. Using
> HugeTLBFS pages is significant here.
>
> >
> > We could optimize for smaller huge-pages, like 2M, by splitting one
> > huge-page at a time: only preallocate one 4K page at a time. The
> > trick is how to know that we are splitting 2M huge-pages. We could
> > either get the vma pagesize or use hints from userspace. I'm not sure
> > that this is worth it though. The user will most likely want to split
> > big ranges of memory (>1G), so optimizing for smaller huge-pages only
> > converts the left into the right:
> >
> >   alloc 1 page        |         | alloc 512 pages
> >   split 2M huge-page  |         | split 2M huge-page
> >   alloc 1 page        |         | split 2M huge-page
> >   split 2M huge-page  |   =>    | split 2M huge-page
> >           ...
> >   alloc 1 page        |         | split 2M huge-page
> >   split 2M huge-page  |         | split 2M huge-page
> >
> > Still thinking of what else to do.
>
> I think the 1G case fits your own use case, but I doubt this covers
> the majority of the users. Most people rely on the kernel's ability to
> use THPs, which are capped at the first level of block mapping.
>
> 2MB (and 32MB for 16kB base pages) are the most likely mappings in my
> experience (512MB with 64kB pages are vanishingly rare).
>
> Having to pay an upfront cost for HugeTLBFS doesn't shock me, and it
> fits the model. For THPs, where everything is opportunistic and the
> user is not involved, this is a lot more debatable.
>
> This is why I'd like this behaviour to be a buy-in, either directly (a
> first class userspace API) or indirectly (the provenance of the
> memory).

This all makes sense, thanks for the explanation. I decided to
implement something for both cases: small caches (~1 page) where the
PUDs are split one PMD at a time, and bigger caches (>513 pages) where
the PUDs can be split with a single replacement. The user specifies
the size of the cache via a capability, and a size of 0 implies no
eager splitting (the feature is off).
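To make the 513 number above concrete, here is a tiny standalone sketch
of the arithmetic for a 4K granule (illustration only, not KVM code;
other granules just change the entries-per-table constant):

#include <stdio.h>

/* With a 4K granule, every table page holds 512 entries. */
#define PTRS_PER_TABLE	512UL

int main(void)
{
	/*
	 * Fully splitting one 1G block down to 4K PTEs needs one page
	 * for the new PMD-level table, plus one PTE-level table page
	 * for each of the 512 2M ranges that the PMD covers.
	 */
	unsigned long pages_per_1g_split = 1 + PTRS_PER_TABLE;	/* 513 */

	/* Splitting a single 2M block only needs one PTE-level table page. */
	unsigned long pages_per_2m_split = 1;

	printf("pages needed to split 1G: %lu\n", pages_per_1g_split);
	printf("pages needed to split 2M: %lu\n", pages_per_2m_split);
	return 0;
}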
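And roughly what I have in mind for the userspace side, assuming the
cache size is set with KVM_ENABLE_CAP on the VM fd. The capability name
and number below are placeholders I made up for this sketch; the real
ABI will be whatever the series ends up defining:

#include <linux/kvm.h>
#include <string.h>
#include <sys/ioctl.h>

/* Placeholder capability name/number, for illustration only. */
#ifndef KVM_CAP_ARM_EAGER_SPLIT_CACHE
#define KVM_CAP_ARM_EAGER_SPLIT_CACHE	500
#endif

/*
 * Set the eager-split cache size (in pages) for a VM.
 * nr_pages == 0 leaves eager splitting disabled.
 */
static int set_eager_split_cache(int vm_fd, unsigned long nr_pages)
{
	struct kvm_enable_cap cap;

	memset(&cap, 0, sizeof(cap));
	cap.cap = KVM_CAP_ARM_EAGER_SPLIT_CACHE;
	cap.args[0] = nr_pages;

	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}

So, for example, passing 513 would allow a whole 1G block to be
replaced in one shot, while passing 1 would split one PMD at a time.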
Thanks,
Ricardo

>
> Thanks,
>
>         M.
>
> --
> Without deviation from the norm, progress is not possible.