On 06/06/2024 19:38, Catalin Marinas wrote: > On Thu, Jun 06, 2024 at 11:17:36AM +0100, Marc Zyngier wrote: >> On Wed, 05 Jun 2024 16:08:49 +0100, >> Steven Price <steven.price@xxxxxxx> wrote: >>> 2. Use a special (global) memory allocator that does the >>> set_memory_decrypted() dance on the pages that it allocates but allows >>> packing the allocations. I'm not aware of an existing kernel API for >>> this, so it's potentially quite a bit of code. The benefit is that it >>> reduces memory consumption in a realm guest, although fragmentation >>> still means we're likely to see a (small) growth. >>> >>> Any thoughts on what you think would be best? >> >> I would expect that something similar to kmem_cache could be of help, >> only with the ability to deal with variable object sizes (in this >> case: minimum of 256 bytes, in increments defined by the >> implementation, and with a 256 byte alignment). > > Hmm, that's doable but not that easy to make generic. We'd need a new > class of kmalloc-* caches (e.g. kmalloc-decrypted-*) which use only > decrypted pages together with a GFP_DECRYPTED flag or something to get > the slab allocator to go for these pages and avoid merging with other > caches. It would actually be the page allocator parsing this gfp flag, > probably in post_alloc_hook() to set the page decrypted and re-encrypt > it in free_pages_prepare(). A slight problem here is that free_pages() > doesn't get the gfp flag, so we'd need to store some bit in the page > flags. Maybe the flag is not that bad, do we have something like for > page_to_phys() to give us the high IPA address for decrypted pages? > > Similarly if we go for a kmem_cache (or a few for multiple sizes). One > can specify a constructor which could set the memory decrypted but > there's no destructor (and also the constructor is per object, not per > page, so we'd need some refcounting). > > Another approach contained within the driver is to use mempool_create() > with our own _alloc_fn/_free_fn that sets the memory decrypted/encrypted > accordingly, though sub-page allocations need additional tracking. Also > that's fairly similar to kmem_cache, fixed size. > > Yet another option would be to wire it somehow in the DMA API but the > minimum allocation is already a page size, so we don't gain anything. > > What gets somewhat closer to what we need is gen_pool. It can track > different sizes, we just need to populate the chunks as needed. I don't > think this would work as a generic allocator but may be good enough > within the ITS code. > > If there's a need for such generic allocations in other parts of the > kernel, my preference would be something around kmalloc caches and a new > GFP flag (first option; subject to the selling it to the mm folk). But > that's more of a separate prototyping effort that may or may not > succeed. > > Anyway, we could do some hacking around gen_pool as a temporary solution > (maybe as a set of patches on top of this series to be easier to revert) > and start investigating a proper decrypted page allocator in parallel. > We just need to find a victim that has the page allocator fresh in mind > (Ryan or Alexandru ;)). Thanks for the suggestions Catalin. I had a go at implementing something with gen_pool - the below (very lightly tested) hack seems to work. This is on top of the current series. I *think* it should also be safe to drop the whole alignment part with this custom allocator, which could actually save memory. But I haven't quite got my head around that yet. Steve ----8<----- diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c index ca72f830f4cc..e78634d4d22c 100644 --- a/drivers/irqchip/irq-gic-v3-its.c +++ b/drivers/irqchip/irq-gic-v3-its.c @@ -12,6 +12,7 @@ #include <linux/crash_dump.h> #include <linux/delay.h> #include <linux/efi.h> +#include <linux/genalloc.h> #include <linux/interrupt.h> #include <linux/iommu.h> #include <linux/iopoll.h> @@ -165,7 +166,7 @@ struct its_device { struct its_node *its; struct event_lpi_map event_map; void *itt; - u32 itt_order; + u32 itt_sz; u32 nr_ites; u32 device_id; bool shared; @@ -225,6 +226,50 @@ static void its_free_pages(void *addr, unsigned int order) free_pages((unsigned long)addr, order); } +static struct gen_pool *itt_pool; + +static void *itt_alloc_pool(int node, int size) +{ + unsigned long addr; + struct page *page; + + if (size >= PAGE_SIZE) { + page = its_alloc_pages_node(node, + GFP_KERNEL | __GFP_ZERO, + get_order(size)); + + return page_address(page); + } + + do { + addr = gen_pool_alloc(itt_pool, size); + if (addr) + break; + + page = its_alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 1); + if (!page) + break; + + gen_pool_add(itt_pool, (unsigned long)page_address(page), + PAGE_SIZE, node); + } while (!addr); + + return (void*)addr; +} + +static void itt_free_pool(void *addr, int size) +{ + if (!addr) + return; + + if (size >= PAGE_SIZE) { + its_free_pages(addr, get_order(size)); + return; + } + + gen_pool_free(itt_pool, (unsigned long)addr, size); +} + /* * Skip ITSs that have no vLPIs mapped, unless we're on GICv4.1, as we * always have vSGIs mapped. @@ -3450,9 +3495,7 @@ static struct its_device *its_create_device(struct its_node *its, u32 dev_id, unsigned long *lpi_map = NULL; unsigned long flags; u16 *col_map = NULL; - struct page *page; void *itt; - int itt_order; int lpi_base; int nr_lpis; int nr_ites; @@ -3471,13 +3514,8 @@ static struct its_device *its_create_device(struct its_node *its, u32 dev_id, nr_ites = max(2, nvecs); sz = nr_ites * (FIELD_GET(GITS_TYPER_ITT_ENTRY_SIZE, its->typer) + 1); sz = max(sz, ITS_ITT_ALIGN) + ITS_ITT_ALIGN - 1; - itt_order = get_order(sz); - page = its_alloc_pages_node(its->numa_node, - GFP_KERNEL | __GFP_ZERO, - itt_order); - if (!page) - return NULL; - itt = page_address(page); + + itt = itt_alloc_pool(its->numa_node, sz); dev = kzalloc(sizeof(*dev), GFP_KERNEL); @@ -3492,9 +3530,9 @@ static struct its_device *its_create_device(struct its_node *its, u32 dev_id, lpi_base = 0; } - if (!dev || !col_map || (!lpi_map && alloc_lpis)) { + if (!dev || !itt || !col_map || (!lpi_map && alloc_lpis)) { kfree(dev); - its_free_pages(itt, itt_order); + itt_free_pool(itt, sz); bitmap_free(lpi_map); kfree(col_map); return NULL; @@ -3504,7 +3542,7 @@ static struct its_device *its_create_device(struct its_node *its, u32 dev_id, dev->its = its; dev->itt = itt; - dev->itt_order = itt_order; + dev->itt_sz = sz; dev->nr_ites = nr_ites; dev->event_map.lpi_map = lpi_map; dev->event_map.col_map = col_map; @@ -3532,7 +3570,7 @@ static void its_free_device(struct its_device *its_dev) list_del(&its_dev->entry); raw_spin_unlock_irqrestore(&its_dev->its->lock, flags); kfree(its_dev->event_map.col_map); - its_free_pages(its_dev->itt, its_dev->itt_order); + itt_free_pool(its_dev->itt, its_dev->itt_sz); kfree(its_dev); } @@ -5722,6 +5760,10 @@ int __init its_init(struct fwnode_handle *handle, struct rdists *rdists, bool has_v4_1 = false; int err; + itt_pool = gen_pool_create(get_order(ITS_ITT_ALIGN), -1); + if (!itt_pool) + return -ENOMEM; + gic_rdists = rdists; its_parent = parent_domain;