On Tue, May 16, 2017 at 01:49:10PM -0700, Dmitry Vyukov wrote: > > Anyway, I have missed inline instrumentation completely. > > > > I will attach the fix in the bottom. It doesn't look beautiful > > since it breaks layer design (some check will be done at report > > function). However, I think that it's a good trade-off. > > > I can confirm that inline works with that patch. Thanks for confirming! > > I can also confirm that it reduces memory usage. I've booted qemu with > 2G ram and run some fixed workload. Before: > 31853 dvyukov 20 0 3043200 765464 21312 S 366.0 4.7 2:39.53 > qemu-system-x86 > 7528 dvyukov 20 0 3043200 732444 21676 S 333.3 4.5 2:23.19 > qemu-system-x86 > After: > 6192 dvyukov 20 0 3043200 394244 20636 S 17.9 2.4 2:32.95 > qemu-system-x86 > 6265 dvyukov 20 0 3043200 388860 21416 S 399.3 2.4 3:02.88 > qemu-system-x86 > 9005 dvyukov 20 0 3043200 383564 21220 S 397.1 2.3 2:35.33 > qemu-system-x86 > > However, I see some very significant slowdowns with inline > instrumentation. I did 3 tests: > 1. Boot speed, I measured time for a particular message to appear on > console. Before: > [ 2.504652] random: crng init done > [ 2.435861] random: crng init done > [ 2.537135] random: crng init done > After: > [ 7.263402] random: crng init done > [ 7.263402] random: crng init done > [ 7.174395] random: crng init done > > That's ~3x slowdown. > > 2. I've run bench_readv benchmark: > https://raw.githubusercontent.com/google/sanitizers/master/address-sanitizer/kernel_buildbot/slave/bench_readv.c > as: > while true; do time ./bench_readv bench_readv 300000 1; done > > Before: > sys 0m7.299s > sys 0m7.218s > sys 0m6.973s > sys 0m6.892s > sys 0m7.035s > sys 0m6.982s > sys 0m6.921s > sys 0m6.940s > sys 0m6.905s > sys 0m7.006s > > After: > sys 0m8.141s > sys 0m8.077s > sys 0m8.067s > sys 0m8.116s > sys 0m8.128s > sys 0m8.115s > sys 0m8.108s > sys 0m8.326s > sys 0m8.529s > sys 0m8.164s > sys 0m8.380s > > This is ~19% slowdown. > > 3. I've run bench_pipes benchmark: > https://raw.githubusercontent.com/google/sanitizers/master/address-sanitizer/kernel_buildbot/slave/bench_pipes.c > as: > while true; do time ./bench_pipes 10 10000 1; done > > Before: > sys 0m5.393s > sys 0m6.178s > sys 0m5.909s > sys 0m6.024s > sys 0m5.874s > sys 0m5.737s > sys 0m5.826s > sys 0m5.664s > sys 0m5.758s > sys 0m5.421s > sys 0m5.444s > sys 0m5.479s > sys 0m5.461s > sys 0m5.417s > > After: > sys 0m8.718s > sys 0m8.281s > sys 0m8.268s > sys 0m8.334s > sys 0m8.246s > sys 0m8.267s > sys 0m8.265s > sys 0m8.437s > sys 0m8.228s > sys 0m8.312s > sys 0m8.556s > sys 0m8.680s > > This is ~52% slowdown. > > > This does not look acceptable to me. I would ready to pay for this, > say, 10% of performance. But it seems that this can have up to 2-4x > slowdown for some workloads. I found the reasons of above regression. There are two reasons. 1. In my implementation, original shadow to the memory allocated from memblock is black shadow so it causes to call kasan_report(). It will pass the check since per page shadow would be zero shadow but it causes some overhead. 2. Memory used by stackdepot is in a similar situation with #1. It allocates page and divide it to many objects. Then, use it like as object. Although there is "KASAN_SANITIZE_stackdepot.o := n" which try to disable sanitizer, there is a function call (memcmp() in find_stack()) to other file and sanitizer work for it. #1 problem can be fixed but more investigation is needed. I will respin the series after fixing it. #2 problem also can be fixed. There are two options here. First, uses private memcmp() for stackdepot and disable sanitizer for it. I think that this is a right approach since it slowdown the performance in all KASAN build cases. And, we don't want to sanitize KASAN itself. Second, I can provide a function to map the actual shadow manually. It will reduce the case calling kasan_report(). See the attached patch. It implements later approach on #2 problem. It would reduce performance regression. I have tested your bench_pipes test with it and found that performance is restored. However, there is still remaining problem, #1, so I'm not sure that it completely restore your regression. Could you check that if possible? Anyway, I think that respin is needed to fix this performance problem completely. > > > Your use-case is embed devices where you care a lot about both code > size and memory consumption, right? Yes. > I see 2 possible ways forward: > 1. Enable this new mode only for outline, but keep current scheme for > inline. Then outline will be "small but slow" type of configuration. Performance problem is not that bad in OUTLINE build. Therefore, this is a reasonable option to have. > 2. Somehow fix slowness (at least in inline mode). I will try to fix slowness as much as possible. If slowness cannot be acceptable after such effort, we can choose the direction at that moment. > > > Mapping zero page to non-kernel memory could cause true-negative > > problem since we cannot flush the TLB in all cpus. We will read zero > > shadow value value in this case even if actual shadow value is not > > zero. This is one of the reason that black page is introduced in this > > patchset. > > What does make your current patch work then? > Say we map a new shadow page, update the page shadow to say that there > is mapped shadow. Then another CPU loads the page shadow and then > loads from the newly mapped shadow. If we don't flush TLB, what makes > the second CPU see the newly mapped shadow? There is a fix-up processing to see the newly mapped shadow in other cpus. check_memory_region_slow() exists for that purpose. In stale TLB case, we will see black shadow and fall in to this function. In this function, we flush stale TLB and re-check so we can see correct result. Thanks. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>