On Fri, 4 Dec 2020 at 20:04, Naresh Kamboju <naresh.kamboju@xxxxxxxxxx> wrote: > LKFT started testing KCSAN enabled kernel from the linux next tree. > Here we have found BUG: KCSAN: data-race in tick_nohz_next_event / > tick_nohz_stop_tick Thank you for looking into KCSAN. Would it be possible to collect these reports in a moderation queue for now? I'm currently trying to work out a strategy on how to best proceed with all the data races in the kernel. We do know there are plenty. On syzbot's internal moderation queue, we're currently looking at >300 of them (some here: https://syzkaller.appspot.com/upstream?manager=ci2-upstream-kcsan-gce). Part of this strategy involves prioritizing certain concurrency bug classes. Let's define the following buckets for concurrency bugs: A. Data race, where failure due to current compilers is unlikely (supposedly "benign"); merely marking the accesses appropriately is sufficient. Finding a crash for these will require a miscompilation, but otherwise look "benign" at the C-language level. B. Race-condition bugs where the bug manifests as a data race, too -- simply marking things doesn't fix the problem. These are the types of bugs where a data race would point out a more severe issue. C. Race-condition bugs where the bug never manifests as a data race. An example of these might be 2 threads that acquire the necessary locks, yet some interleaving of them still results in a bug (e.g. because the logic inside the critical sections is buggy). These are harder to detect with KCSAN as-is, and require using ASSERT_EXCLUSIVE_ACCESS() or ASSERT_EXCLUSIVE_WRITER() in the right place. See https://lwn.net/Articles/816854/. One problem currently is that the kernel has quite a lot type-(A) reports if we run KCSAN, which makes it harder to identify bugs of type (B) and (C). My wish for the future is that we can get to a place, where the kernel has almost no unintentional (A) issues, so that we primarily find (B) and (C) bugs. The report below looks to be of type (A). Generally, the best strategy for resolving these is to send a patch, and not a report. However, be aware that sometimes it is really quite difficult to say if we're looking at a type (A) or (B) issue, in which case it may still be fair to send a report and briefly describe what you think is happening (because that'll increase the likelihood of getting a response). I recommend also reading "Developer/Maintainer data-race strategies" in https://lwn.net/Articles/816854/ -- specifically note "[...] you should not respond to KCSAN reports by mindlessly adding READ_ONCE(), data_race(), and WRITE_ONCE(). Instead, a patch addressing a KCSAN report must clearly identify the fix's approach and why that approach is appropriate." I recommend reading https://lwn.net/Articles/816850/ for more details. > This report is from an x86_64 machine clang-11 linux next 20201201. > Since we are running for the first time we do not call this regression. > > [ 47.811425] BUG: KCSAN: data-race in tick_nohz_next_event / > tick_nohz_stop_tick > [ 47.818738] > [ 47.820239] write to 0xffffffffa4cbe920 of 4 bytes by task 0 on cpu 2: > [ 47.826766] tick_nohz_stop_tick+0x8b/0x310 > [ 47.830951] tick_nohz_idle_stop_tick+0xcb/0x170 > [ 47.835571] do_idle+0x193/0x250 > [ 47.838804] cpu_startup_entry+0x25/0x30 > [ 47.842728] start_secondary+0xa0/0xb0 > [ 47.846482] secondary_startup_64_no_verify+0xc2/0xcb > [ 47.851531] > [ 47.853034] read to 0xffffffffa4cbe920 of 4 bytes by task 0 on cpu 3: > [ 47.859473] tick_nohz_next_event+0x165/0x1e0 > [ 47.863831] tick_nohz_get_sleep_length+0x94/0xd0 > [ 47.868539] menu_select+0x250/0xac0 > [ 47.872116] cpuidle_select+0x47/0x50 > [ 47.875781] do_idle+0x17c/0x250 > [ 47.879015] cpu_startup_entry+0x25/0x30 > [ 47.882942] start_secondary+0xa0/0xb0 > [ 47.886694] secondary_startup_64_no_verify+0xc2/0xcb > [ 47.891743] > [ 47.893234] Reported by Kernel Concurrency Sanitizer on: > [ 47.898541] CPU: 3 PID: 0 Comm: swapper/3 Not tainted > 5.10.0-rc6-next-20201201 #2 > [ 47.906017] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS > 2.2 05/23/2018 This report should have line numbers, otherwise it's impossible to say which accesses are racing. [ For those curious, this is the same report on syzbot's moderation queue, with line numbers: https://syzkaller.appspot.com/bug?id=d835c53d1a5e27922fcd1fbefc926a74790156cb ] Thanks, -- Marco