The WARNING I reported with the kernel v5.13-rc5 is still observed with
v5.15-rc7. It took a long time, but I think I now understand the cause. I share
my findings inline below.

On Jul 09, 2021 / 03:57, Shinichiro Kawasaki wrote:
> On Jun 28, 2021 / 13:43, Shin'ichiro Kawasaki wrote:
> > On Jun 10, 2021 / 16:32, Hillf Danton wrote:
> > > On Wed, 9 Jun 2021 06:55:59 +0000 Damien Le Moal wrote:
> > > >+ Jens and linux-kernel
> > > >
> > > >On 2021/06/09 15:53, Shinichiro Kawasaki wrote:
> > > >> Hi there,
> > > >>
> > > >> Let me share a blktests failure. When I ran blktests on the kernel v5.13-rc5,
> > > >> block/008 failed. A WARNING below was the cause of the failure.
> > > >>
> > > >> WARNING: CPU: 1 PID: 135817 at kernel/sched/core.c:3175 ttwu_queue_wakelist+0x284/0x2f0
> > > >>
> > > >> I'm trying to recreate the failure repeating the test case, but so far, I am
> > > >> not able to. This failure looks rare, but actually, I observed it 3 times in
> > > >> the past one year.
> > > >>
> > > >> 1) Oct/2020, kernel: v5.9-rc7 test dev: dm-flakey on AHCI-SATA SMR HDD, log [1]
> > > >> 2) Mar/2021, kernel: v5.12-rc2 test dev: AHCI-SATA SMR HDD, log [2]
> > > >> 3) Jun/2021, kernel: v5.13-rc5 test dev: dm-linear on null_blk zoned, log [3]
> > > >>
> > > >> The test case block/008 does IO during CPU hotplug, and the WARNING in
> > > >> ttwu_queue_wakelist() checks "WARN_ON_ONCE(cpu == smp_processor_id())".
> > > >> So it is likely that the test case triggers the warning, but I don't have
> > > >> clear view why and how the warning was triggered. It was observed on various
> > > >> block devices, so I would like to ask linux-block experts if anyone can tell
> > > >> what is going on. Comments will be appreciated.
> > >
> > > [...]
> > >
> > > >> [40041.712804][T135817] ------------[ cut here ]------------
> > > >> [40041.718489][T135817] WARNING: CPU: 1 PID: 135817 at kernel/sched/core.c:3175 ttwu_queue_wakelist+0x284/0x2f0
> > > >> [40041.728311][T135817] Modules linked in: null_blk dm_flakey iscsi_target_mod tcm_loop target_core_pscsi target_core_file target_core_iblock nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw bridge iptable_security stp llc ip_set rfkill nf_tables target_core_user target_core_mod nfnetlink ip6table_filter ip6_tables iptable_filter sunrpc intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass iTCO_wdt intel_pmc_bxt iTCO_vendor_support rapl intel_cstate intel_uncore joydev lpc_ich i2c_i801 i2c_smbus ses enclosure mei_me mei ipmi_ssif ioatdma wmi acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad zram ip_tables ast drm_vram_helper drm_kms_helper syscopyarea sysfillrect crc32c_intel sysimgblt
> > > >> [40041.728481][T135817] fb_sys_fops cec drm_ttm_helper ttm ghash_clmulni_intel drm igb mpt3sas nvme dca i2c_algo_bit nvme_core raid_class scsi_transport_sas pkcs8_key_parser [last unloaded: null_blk]
> > > >> [40041.832215][T135817] CPU: 1 PID: 135817 Comm: fio Not tainted 5.13.0-rc5+ #1
> > > >> [40041.839262][T135817] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 3.2 11/22/2019
> > > >> [40041.847434][T135817] RIP: 0010:ttwu_queue_wakelist+0x284/0x2f0
> > > >> [40041.853266][T135817] Code: 34 24 e8 6f 71 64 00 4c 8b 44 24 10 48 8b 4c 24 08 8b 34 24 e9 a1 fe ff ff e8 a8 71 64 00 e9 66 ff ff ff e8 be 71 64 00 eb a0 <0f> 0b 45 31 ff e9 cb fe ff ff 48 89 04 24 e8 49 71 64 00 48 8b 04
> > > >> [40041.872793][T135817] RSP: 0018:ffff888106bff348 EFLAGS: 00010046
> > > >> [40041.878800][T135817] RAX: 0000000000000001 RBX: ffff888117ec3240 RCX: ffff888811440000
> > > >> [40041.886711][T135817] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffffb603d6e8
> > > >> [40041.894625][T135817] RBP: 0000000000000001 R08: ffffffffb603d6e8 R09: ffffffffb6ba6167
> > > >> [40041.902537][T135817] R10: fffffbfff6d74c2c R11: 0000000000000001 R12: 0000000000000000
> > > >> [40041.910451][T135817] R13: ffff88881145fd00 R14: 0000000000000001 R15: ffff888811440001
> > > >> [40041.918364][T135817] FS: 00007f8eabf14b80(0000) GS:ffff888811440000(0000) knlGS:0000000000000000
> > > >> [40041.927229][T135817] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > >> [40041.933756][T135817] CR2: 000055ce81e2cc78 CR3: 000000011be92005 CR4: 00000000001706e0
> > > >> [40041.941669][T135817] Call Trace:
> > > >> [40041.944895][T135817] ? lock_is_held_type+0x98/0x110
> > > >> [40041.949860][T135817] try_to_wake_up+0x6f9/0x15e0
> > >
> > > 2) __queue_work
> > >      raw_spin_lock(&pwq->pool->lock) with irq disabled
> > >      insert_work
> > >        wake_up_worker(pool);
> > >          wake_up_process first_idle_worker(pool);
> > >
> > > Even if waker is lucky enough running on worker's CPU, what is weird is an
> > > idle worker can trigger the warning, given
> > >
> > > 	if (smp_load_acquire(&p->on_cpu) &&
> > > 	    ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
> > > 		goto unlock;
> > >
> > > because p->on_cpu must have been false for quite a while.

The call trace indicates that try_to_wake_up() called ttwu_queue_wakelist(), so
I had assumed that the ttwu_queue_wakelist() call was the one in the hunk that
Hillf quoted above. I also thought that was weird, but then I noticed that
there is another path that calls ttwu_queue_wakelist().
try_to_wake_up() calls ttwu_queue() at another place, and ttwu_queue() calls
ttwu_queue_wakelist(). I confirmed that the warning is reported through this
call path. I think this path can be taken even when p->on_cpu is false.

> > >
> > > Is there any chance for CPU hotplug to make a difference?

I found that a change of sd_llc_id caused by CPU hotplug affects
ttwu_queue_wakelist(). When the warning is reported, ttwu_queue_cond() returns
true in the ttwu_queue_wakelist() hunk below.

	if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(cpu, wake_flags)) {
		if (WARN_ON_ONCE(cpu == smp_processor_id()))
			return false;

And in the ttwu_queue_cond() hunk below, cpus_share_cache() returns false even
though smp_processor_id() and cpu are the same CPU. This is weird: why would a
single CPU not share cache with itself?

	/*
	 * If the CPU does not share cache, then queue the task on the
	 * remote rqs wakelist to avoid accessing remote data.
	 */
	if (!cpus_share_cache(smp_processor_id(), cpu))
		return true;

In cpus_share_cache(), sd_llc_id is read twice in the line below.

	return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);

When this_cpu == that_cpu, the two sd_llc_id values are expected to be the
same. However, while CPU hotplug is ongoing, scheduler domains are destroyed
and rebuilt, which modifies the sd_llc_id value. The two sd_llc_id reads in the
code can then return different values.

I think cpus_share_cache() needs a fix that assumes sd_llc_id may change. My
idea is to add a check for this_cpu == that_cpu. Based on this idea, I created
a fix, and with it the warning is avoided. I will post the patch for further
discussion and review.

--
Best Regards,
Shin'ichiro Kawasaki

> > >
> > > Thoughts are welcome.
> > >
> > > Hillf
> >
> > Hillf, thank you very much for the comments, and sorry about this late reply.
> >
> > I walked through related functions to understand your comments and, but I have
> > to say that I still don't have enough background knowledge to provide valuable
> > comments back to you.
> > I understand that the waker and the wakee are on same CPU,
> > and it is weird that p->on_cpu is true. This looks indicating that the task
> > scheduler failing to control task status on task migration triggered by CPU
> > hotplugs, but as far as I read comments in kernel/sched/core.c, CPU hotplug and
> > task migration are folded into the design and implementation.
> >
> > The blktests test case block/008 runs I/Os to a test target block device 30
> > seconds. During this I/O, it repeats offlining CPUs and onlining CPUs: when
> > there are N CPUs are available, it offlines N-1 CPUs to have only one online
> > CPU, then onlines all CPUs again. It repeats this online and offline until I/O
> > workload completes. When the only one CPU is online, the waker and the wakee can
> > be on the same CPU. Or, one of the waker or the wakee might have been migrated
> > from other CPUs to the only one online CPU. But still it is not clear for me why
> > it results in the WARNING.
> >
> > Now I'm trying to recreate the failure. By repeating test cases in "block
> > group" on the kernel v5.13-rc5, I was able to recreate the failure. It took 3 to
> > 5 hours to recreate it. The test target block device used was a null_blk with
> > rather unique configuration (zoned device with zone capacity smaller than zone
> > size). I will try to confirm the failure recreation with latest kernel version
> > and other block devices.
>
> I tried some device setups, and found that dm-linear device on null_blk
> recreates the warning consistently. In case anyone wishes to recreate it, let
> me share the bash script below which I used. I tried it several times on the
> kernel v5.13, and all tries recreated the warning. With my system (nproc is 8),
> it took from 3 to 6 hours to recreate.
>
> --------------------------------------------------------------------------------
> #!/bin/bash
>
> # create a null_blk device
> declare sysfs=/sys/kernel/config/nullb/nullb0
> modprobe null_blk nr_devices=0
> mkdir "${sysfs}"
> echo 1024 > "${sysfs}"/size
> echo 1 > "${sysfs}"/memory_backed
> echo 1 > "${sysfs}"/power
> sleep 1
>
> # create a dm-linear device on the null_blk device
> echo "0 $((0x2000 * 4)) linear /dev/nullb0 0" | dmsetup create test
> sleep 1
>
> # run blktests, block/008 many times
> git clone https://github.com/osandov/blktests.git
> cd blktests
> echo "TEST_DEVS=( /dev/mapper/test )" > config
> for ((i=0;i<1000;i++)); do
> 	echo $i;
> 	if ! ./check block/008; then
> 		break;
> 	fi
> done
>
> --
> Best Regards,
> Shin'ichiro Kawasaki
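
For reference, the idea described earlier in this mail for cpus_share_cache()
(return early when this_cpu == that_cpu, so that a concurrent sd_llc_id update
cannot make a CPU look like it does not share cache with itself) can be
sketched in user space as below. This is only an illustrative model under
stated assumptions, not the patch itself: the sd_llc_id array, read_llc_id(),
simulate_rebuild_after_next_read() and the _orig/_fixed function names are
hypothetical stand-ins, and the single-threaded "rebuild between two reads"
merely imitates the race with the scheduler domain rebuild.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8

/* Hypothetical stand-in for the kernel's per-CPU sd_llc_id variable. */
static int sd_llc_id[NR_CPUS];

/* When set, the next read is followed by a simulated domain rebuild. */
static bool rebuild_pending;

/* Arm a simulated sched-domain rebuild that lands right after the next read. */
static void simulate_rebuild_after_next_read(void)
{
	rebuild_pending = true;
}

/* Model of per_cpu(sd_llc_id, cpu); the id may change right after the read. */
static int read_llc_id(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	int id = sd_llc_id[cpu];

	if (rebuild_pending) {
		sd_llc_id[cpu] = id + 1;	/* "hotplug" rewrote the id */
		rebuild_pending = false;
	}
	return id;
}

/* Current logic: two independent reads, even when both CPUs are the same. */
static bool cpus_share_cache_orig(int this_cpu, int that_cpu)
{
	return read_llc_id(this_cpu) == read_llc_id(that_cpu);
}

/* Sketch of the idea: a CPU always shares cache with itself. */
static bool cpus_share_cache_fixed(int this_cpu, int that_cpu)
{
	if (this_cpu == that_cpu)
		return true;
	return read_llc_id(this_cpu) == read_llc_id(that_cpu);
}
```

In this model, when a rebuild is armed before the call,
cpus_share_cache_orig(1, 1) returns false because the two reads see different
ids, while cpus_share_cache_fixed(1, 1) returns true without reading sd_llc_id
at all.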