On 24.08.23 13:06, David Hildenbrand wrote:
On 24.08.23 12:44, Catalin Marinas wrote:
On Thu, Aug 24, 2023 at 09:50:32AM +0200, David Hildenbrand wrote:
After re-reading it twice, I still have no clue what your patch set is
actually trying to achieve. There is probably a way to describe how user
space intends to interact with this feature, so that we can see what
value it actually has for user space -- and whether we are using the
right APIs and allocators.
I'll try an alternative summary; hopefully it becomes clearer (I
think Alex is away until the end of the week and may not reply
immediately). If this still doesn't work, maybe we should try a
different implementation ;).
The way MTE is implemented currently is to have a static carve-out of
the DRAM to store the allocation tags (a.k.a. memory colour). This is
what we call the tag storage. Every 16 bytes of data have 4 bits of tag,
so the tag storage takes up 1/32 of the DRAM, roughly 3%. This is
done transparently by the hardware/interconnect (with firmware setup)
and normally hidden from the OS. So a checked memory access to location
X generates a tag fetch from location Y in the carve-out and this tag is
compared with the bits 59:56 in the pointer. The correspondence from X
to Y is linear (subject to a minimum block size to deal with some
address interleaving). The software doesn't need to know about this
correspondence as we have specific instructions like STG/LDG to location
X that lead to a tag store/load to Y.
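
To make the arithmetic above concrete, here is a minimal sketch in C. It
is only an illustration: tag_storage_bytes(), tag_location() and
pointer_tag() are hypothetical helper names, dram_base and tag_base are
assumed parameters, and the mapping is modelled as purely linear. Real
hardware also applies a minimum block size to cope with address
interleaving, which this sketch ignores.

```c
#include <assert.h>
#include <stdint.h>

/* MTE keeps one 4-bit allocation tag per 16-byte granule of data. */
#define MTE_GRANULE_SIZE	16u
#define MTE_TAG_BITS		4u

/* Data bytes covered by one byte of tag storage: 16 * 8 / 4 = 32. */
#define DATA_BYTES_PER_TAG_BYTE	(MTE_GRANULE_SIZE * 8u / MTE_TAG_BITS)

/* Bytes of tag storage needed to cover dram_bytes of DRAM (1/32). */
static inline uint64_t tag_storage_bytes(uint64_t dram_bytes)
{
	return dram_bytes / DATA_BYTES_PER_TAG_BYTE;
}

/*
 * Hypothetical, purely linear X -> Y correspondence: the byte in the
 * carve-out that holds the tag for data address x. dram_base and
 * tag_base are assumptions made for this illustration only.
 */
static inline uint64_t tag_location(uint64_t x, uint64_t dram_base,
				    uint64_t tag_base)
{
	return tag_base + (x - dram_base) / DATA_BYTES_PER_TAG_BYTE;
}

/* Tag carried in bits 59:56 of a pointer, compared on checked accesses. */
static inline unsigned int pointer_tag(uint64_t ptr)
{
	return (unsigned int)((ptr >> 56) & 0xf);
}
```

With these helpers, 4 GiB of DRAM needs 128 MiB of tag storage, which is
the "roughly 3%" mentioned above.
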
Now, not all memory used by applications is tagged (mmap(PROT_MTE)).
For example, some large allocations may not use PROT_MTE at all or only
for the first and last page since initialising the tags takes time. The
side effect is that only part of this 3% of DRAM, say 1%, is effectively
used. Some people want the unused tag storage to be released for normal
data usage (i.e. give it to the kernel page allocator).
So the first complication is that a PROT_MTE page allocation at address
X will need to reserve the tag storage at location Y (and migrate any
data in that page if it is in use).
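
The reservation step can be modelled with a small user-space sketch. This
is a toy under stated assumptions, not the patch set's implementation:
tag storage pages and data pages are assumed to be the same size (so one
tag page covers 32 data pages), indices are relative to the start of
their regions, and tag_page_for(), reserve_tag_storage() and
release_tag_storage() are hypothetical names.

```c
#include <assert.h>

/* One tag page covers 32 data pages: 16 bytes * 8 bits / 4 tag bits. */
#define DATA_PAGES_PER_TAG_PAGE	32u
#define NR_TAG_PAGES		4u	/* toy size, illustration only */

/* Number of PROT_MTE data pages currently relying on each tag page. */
static unsigned int tag_page_users[NR_TAG_PAGES];

/* Index of the tag storage page that backs data page data_idx. */
static inline unsigned int tag_page_for(unsigned int data_idx)
{
	return data_idx / DATA_PAGES_PER_TAG_PAGE;
}

/*
 * Called when data page data_idx becomes PROT_MTE. Returns 1 if the tag
 * page was not reserved before, meaning the caller would first have to
 * migrate any data living in it. Returns 0 if it was already reserved.
 * The caller must keep data_idx below NR_TAG_PAGES * 32.
 */
static inline int reserve_tag_storage(unsigned int data_idx)
{
	return tag_page_users[tag_page_for(data_idx)]++ == 0;
}

/*
 * Called when data page data_idx stops being PROT_MTE. Returns 1 if the
 * tag page has no users left and could be handed back for normal data.
 */
static inline int release_tag_storage(unsigned int data_idx)
{
	return --tag_page_users[tag_page_for(data_idx)] == 0;
}
```

The per-tag-page counter is just one way to express "reserve on first
use, release on last use"; the real bookkeeping may well look different.
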
To make things worse, pages in the tag storage/carve-out range cannot
use PROT_MTE themselves on current hardware, so this adds the second
complication - a heterogeneous memory layout. The kernel needs to know
where to allocate a PROT_MTE page from or migrate a current page if it
becomes PROT_MTE (mprotect()) and the range it is in does not support
tagging.
Some other complications are arm64-specific like cache coherency between
tags and data accesses. There is a draft architecture spec which will be
released soon, detailing how the hardware behaves.
To your question about user APIs/ABIs, that's entirely transparent. As
with the current kernel (without this dynamic tag storage), a user only
needs to ask for PROT_MTE mappings to get tagged pages.
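
For reference, this is roughly what the user-space side looks like. A
minimal sketch, assuming an arm64 Linux system with MTE: map_tagged() is
a hypothetical helper name, and the fallback PROT_MTE definition uses the
arm64 value for toolchains whose headers do not provide it. On other
architectures that bit does not request tagged memory, so the example is
only meaningful on arm64.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef PROT_MTE
#define PROT_MTE 0x20	/* arm64 value; assumed when headers lack it */
#endif

/*
 * Ask for an anonymous private mapping backed by tagged pages.
 * Returns NULL on failure instead of MAP_FAILED.
 */
static inline void *map_tagged(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_MTE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}
```

Nothing here depends on where the kernel keeps the tags, which is the
point made above. Enabling tag check faults is a separate per-thread
prctl() setting and is not shown.
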
Thanks, that clarifies things a lot.
So it sounds like you might want to provide that tag memory using CMA.
That way, only movable allocations can end up on that CMA memory area,
and you can allocate selected tag pages on demand (similar to the
alloc_contig_range() use case).
That also solves the issue that such tag memory must not be longterm-pinned.
Regarding one complication: "The kernel needs to know where to allocate
a PROT_MTE page from or migrate a current page if it becomes PROT_MTE
(mprotect()) and the range it is in does not support tagging.",
a simplified approach would be: if the page is in a MIGRATE_CMA pageblock,
it doesn't support tagging. You then have to migrate it to a !CMA page
(for example, by not specifying GFP_MOVABLE, as a quick way to achieve that).
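
The simplified rule can be written down as a toy user-space model. This
is not kernel code: the toy_* names are hypothetical stand-ins that only
mirror kernel naming, the pageblock size of 512 pages is an assumption,
and toy_layout is made-up example data.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for migrate types; only the names mirror the kernel. */
enum toy_migratetype {
	TOY_MIGRATE_UNMOVABLE,
	TOY_MIGRATE_MOVABLE,
	TOY_MIGRATE_CMA,
};

#define TOY_PAGEBLOCK_NR_PAGES	512u	/* assumed pageblock size */

/* Made-up layout: pageblock 1 is the CMA (tag storage) area. */
static const enum toy_migratetype toy_layout[] = {
	TOY_MIGRATE_MOVABLE, TOY_MIGRATE_CMA, TOY_MIGRATE_UNMOVABLE,
};

/* The simplified rule: a page in a CMA pageblock cannot be tagged. */
static inline bool toy_page_supports_tagging(const enum toy_migratetype *types,
					     unsigned long pfn)
{
	return types[pfn / TOY_PAGEBLOCK_NR_PAGES] != TOY_MIGRATE_CMA;
}

/* A page that becomes PROT_MTE must move if its pageblock is CMA. */
static inline bool toy_needs_migration(const enum toy_migratetype *types,
				       unsigned long pfn, bool wants_mte)
{
	return wants_mte && !toy_page_supports_tagging(types, pfn);
}
```

In this model the taggable/non-taggable distinction needs no new migrate
type; the existing CMA marking carries it.
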
Okay, I now realize that this patch set effectively duplicates some CMA
behavior using a new migrate-type. Yeah, that's probably not what we
want just to identify if memory is taggable or not.
Maybe there is a way to just keep reusing most of CMA instead.
Another, simpler idea to get started would be to just intercept the first
PROT_MTE request and, at that point, allocate all of the CMA memory for
tag storage. In that case, systems that never use PROT_MTE can have that
additional 3% of memory.
You probably know better how frequent it is that only a handful of
applications use PROT_MTE, such that there is still a significant
portion of tag memory to be reused (and if it's really worth optimizing
for that scenario).
--
Cheers,
David / dhildenb