Re: [PATCH 00/15] kasan: x86: arm64: risc-v: KASAN tag-based mode for x86

Jessica Clarke <jrtc27@xxxxxxxxxx> · Tue, 4 Feb 2025 23:36:23 +0000

On 4 Feb 2025, at 18:58, Christoph Lameter (Ampere) <cl@xxxxxxxxxx> wrote:
> ARM64 supports MTE which is hardware support for tagging 16 byte granules
> and verification of tags in pointers all in hardware and on some platforms
> with *no* performance penalty since the tag is stored in the ECC areas of
> DRAM and verified at the same time as the ECC.
> 
> Could we get support for that? This would allow us to enable tag checking
> in production systems without performance penalty and no memory overhead.

It’s not “no performance penalty”; there is a cost to tracking the MTE
tags for checking. In asynchronous (or asymmetric) mode that’s not too
bad, but in synchronous mode there is a significant overhead even with
ECC. Normally on a store, once you’ve translated it and have the data,
you can buffer it up and defer the actual write until some time later.
If you hit in the L1 cache then that will probably be quite soon, but
if you miss then you have to wait for the data to come back from lower
levels of the hierarchy, potentially all the way out to DRAM. Or if you
have a write-around cache then you just send it out to the next level
when it’s ready. But now, if you have synchronous MTE, you cannot
retire your store instruction until you know the tag for the location
you’re storing to; effectively the store has to wait for the full cache
lookup, and potentially a miss, before it can retire. This puts
pressure on the various microarchitectural structures
that track instructions as they get executed, as instructions are now
in flight for longer. Yes, it may well be that it is quicker for the
memory controller to get the tags from ECC bits than via some other
means, but you’re already paying many, many cycles at that point, with
the relevant store (and thus every instruction after it in the
instruction stream) stuck unable to retire that whole time, and no
write-allocate or write-around scheme can help you, because you
fundamentally have to wait for the tags to be read before you know if
the instruction is going to trap.
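
To make the retirement argument concrete, here is a toy model of
in-order retirement. It is only a sketch: the latencies, the ROB size
and the instruction mix are all invented for illustration, it is not
cycle-accurate, and it does not describe any real core. A store without
a synchronous check is treated as eligible to retire one cycle after
issue (the write drains later); with a synchronous check it is eligible
only once the tag lookup has completed.

```python
# Toy model of in-order retirement. All numbers below are assumptions
# chosen for illustration, not measurements of any real CPU.

ROB_SIZE = 32        # assumed number of in-flight instructions tracked
ALU = 1              # cycles until a non-store is eligible to retire
STORE_NO_CHECK = 1   # store may retire once translated; write drains later
TAG_L1_HIT = 4       # sync check: tag known after an assumed L1 hit
TAG_MISS = 200       # sync check: tag fetched from DRAM (assumed latency)


def retire_times(latencies, rob_size=ROB_SIZE):
    """Return the retire cycle of each instruction.

    latencies[i] is the number of cycles between instruction i issuing
    and becoming eligible to retire. At most one instruction issues per
    cycle, and none can issue while the ROB is full.
    """
    issue, retire = [], []
    for i, lat in enumerate(latencies):
        t = issue[-1] + 1 if issue else 0
        if i >= rob_size:
            # ROB full: wait until the instruction rob_size slots
            # earlier has retired and freed its entry.
            t = max(t, retire[i - rob_size])
        issue.append(t)
        done = t + lat
        # In-order retirement: never retire before the previous one.
        retire.append(max(done, retire[-1]) if retire else done)
    return retire


def total_cycles(store_latency, n=64, store_every=8):
    """Cycle at which the last of n instructions retires, with every
    store_every-th instruction being a store of the given latency."""
    lats = [store_latency if i % store_every == 0 else ALU
            for i in range(n)]
    return retire_times(lats)[-1]


if __name__ == "__main__":
    for name, lat in [("no synchronous check", STORE_NO_CHECK),
                      ("sync check, tag hits in L1", TAG_L1_HIT),
                      ("sync check, tag misses", TAG_MISS)]:
        print(f"{name:28s} {total_cycles(lat)} cycles")
```

In this model a tag lookup that hits in L1 is largely absorbed, while
one that misses holds up retirement of everything behind it and
eventually fills the ROB, regardless of where the tag bits physically
live.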

Now, you can choose to not use synchronous mode due to that overhead,
but that’s a nuance your reply here doesn’t consider, and it has some
consequences.

Jess