On Thu, Oct 22, 2020 at 5:16 PM Dmitry Vyukov <dvyukov@xxxxxxxxxx> wrote:
>
> On Thu, Oct 22, 2020 at 3:19 PM Andrey Konovalov <andreyknvl@xxxxxxxxxx> wrote:
> >
> > This patchset is not complete (hence sending as RFC), but I would like to
> > start the discussion now and hear people's opinions regarding the
> > questions mentioned below.
> >
> > === Overview
> >
> > This patchset adopts the existing hardware tag-based KASAN mode [1] for
> > use in production as a memory corruption mitigation. Hardware tag-based
> > KASAN relies on arm64 Memory Tagging Extension (MTE) [2] to perform memory
> > and pointer tagging. Please see [3] and [4] for detailed analysis of how
> > MTE helps to fight memory safety problems.
> >
> > The current plan is to reuse CONFIG_KASAN_HW_TAGS for production, but add
> > a boot time switch that allows choosing between a debugging mode, which
> > includes all KASAN features as they are, and a production mode, which only
> > includes the essentials like tag checking.
> >
> > It is essential that switching between these modes doesn't require
> > rebuilding the kernel with different configs, as this is required by the
> > Android GKI initiative [5].
> >
> > The patch titled "kasan: add and integrate kasan boot parameters" of this
> > series adds a few new boot parameters:
> >
> > kasan.mode allows choosing one of three main modes:
> >
> > - kasan.mode=off - no checks at all
> > - kasan.mode=prod - only essential production features
> > - kasan.mode=full - all features
> >
> > Those mode configs provide default values for three more internal configs
> > listed below. However it's also possible to override the default values
> > by providing:
> >
> > - kasan.stack=off/on - enable stacks collection
> >   (default: on for mode=full, otherwise off)
> > - kasan.trap=async/sync - use async or sync MTE mode
> >   (default: sync for mode=full, otherwise async)
> > - kasan.fault=report/panic - only report MTE fault or also panic
> >   (default: report)
> >
> > === Benchmarks
> >
> > For now I've only performed a few simple benchmarks such as measuring
> > kernel boot time and slab memory usage after boot. The benchmarks were
> > performed in QEMU and the results below exclude the slowdown caused by
> > QEMU memory tagging emulation (as it's different from the slowdown that
> > will be introduced by hardware and therefore irrelevant).
> >
> > KASAN_HW_TAGS=y + kasan.mode=off introduces no performance or memory
> > impact compared to KASAN_HW_TAGS=n.
> >
> > kasan.mode=prod (without executing the tagging instructions) introduces
> > 7% of both performance and memory impact compared to kasan.mode=off.
> > Note that 4% of performance and all 7% of memory impact are caused by the
> > fact that enabling KASAN essentially results in CONFIG_SLAB_MERGE_DEFAULT
> > being disabled.
> >
> > Recommended Android config has CONFIG_SLAB_MERGE_DEFAULT disabled (I assume
> > for security reasons), but Pixel 4 has it enabled. It's arguable whether
> > "disabling" CONFIG_SLAB_MERGE_DEFAULT introduces any security benefit on
> > top of MTE. Without MTE it makes exploiting some heap corruption harder.
> > With MTE it will only make it harder provided that the attacker is able to
> > predict allocation tags.
> >
> > kasan.mode=full has 40% performance and 30% memory impact over
> > kasan.mode=prod. Both come from alloc/free stack collection.

FTR, this only accounts for slab memory overhead that comes from
redzones that store stack ids. There's also page_alloc overhead from
the stacks themselves, which I didn't measure yet.
> >
> > === Questions
> >
> > Any concerns about the boot parameters?
>
> For boot parameters I think we are now "safe" in the sense that we
> provide maximum possible flexibility and can defer any actual
> decisions.

Perfect! I realized that I actually forgot to think about the default
values when no boot params are specified; I'll fix this in the next
version.

> > Should we try to deal with CONFIG_SLAB_MERGE_DEFAULT-like behavior mentioned
> > above?
>
> How hard is it to allow KASAN with CONFIG_SLAB_MERGE_DEFAULT? Are
> there any principal conflicts?

I'll explore this.

> The numbers you provided look quite substantial (on a par with what MTE
> itself may introduce). So I would assume if a vendor does not have
> CONFIG_SLAB_MERGE_DEFAULT disabled, it may not want to disable it
> because of MTE (effectively doubles overhead).

Sounds reasonable.

Thanks!
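
For reference, the way kasan.mode supplies defaults that kasan.stack,
kasan.trap and kasan.fault can then override can be modelled in a few
lines. This is only an illustrative Python sketch, not the kernel code:
resolve_kasan_params and default_mode are hypothetical names, and since
the behavior with no boot params is still open in this thread, the caller
has to pass the fallback mode explicitly. Ignoring unrecognized values and
letting the last occurrence of a parameter win are also assumptions.

```python
# Illustrative model only; not the kernel implementation.
VALID = {
    "mode": ("off", "prod", "full"),
    "stack": ("off", "on"),
    "trap": ("async", "sync"),
    "fault": ("report", "panic"),
}


def resolve_kasan_params(cmdline, default_mode):
    """Return the effective kasan.* settings for a kernel command line.

    default_mode is a hypothetical argument: the thread leaves open what
    happens when kasan.mode is absent, so the caller has to choose.
    """
    given = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if not sep or not key.startswith("kasan."):
            continue  # not a kasan.* parameter
        name = key[len("kasan."):]
        if name in VALID and value in VALID[name]:
            # Assumption: a later occurrence overrides an earlier one,
            # and unrecognized names or values are silently ignored.
            given[name] = value

    mode = given.get("mode", default_mode)
    full = mode == "full"
    # Mode-derived defaults, as listed in the cover letter.
    settings = {
        "mode": mode,
        "stack": "on" if full else "off",
        "trap": "sync" if full else "async",
        "fault": "report",
    }
    # Explicitly provided parameters override the mode-derived defaults.
    settings.update(given)
    return settings
```

With this model, resolve_kasan_params("kasan.mode=prod kasan.fault=panic",
"full") gives prod mode with stack collection off, async trapping, and a
panic on fault.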