On Jun 28, 2021 / 13:43, Shin'ichiro Kawasaki wrote:
> On Jun 10, 2021 / 16:32, Hillf Danton wrote:
> > On Wed, 9 Jun 2021 06:55:59 +0000 Damien Le Moal wrote:
> > >+ Jens and linux-kernel
> > >
> > >On 2021/06/09 15:53, Shinichiro Kawasaki wrote:
> > >> Hi there,
> > >>
> > >> Let me share a blktests failure. When I ran blktests on the kernel v5.13-rc5,
> > >> block/008 failed. A WARNING below was the cause of the failure.
> > >>
> > >> WARNING: CPU: 1 PID: 135817 at kernel/sched/core.c:3175 ttwu_queue_wakelist+0x284/0x2f0
> > >>
> > >> I'm trying to recreate the failure repeating the test case, but so far, I am
> > >> not able to. This failure looks rare, but actually, I observed it 3 times in
> > >> the past one year.
> > >>
> > >> 1) Oct/2020, kernel: v5.9-rc7 test dev: dm-flakey on AHCI-SATA SMR HDD, log [1]
> > >> 2) Mar/2021, kernel: v5.12-rc2 test dev: AHCI-SATA SMR HDD, log [2]
> > >> 3) Jun/2021, kernel: v5.13-rc5 test dev: dm-linear on null_blk zoned, log [3]
> > >>
> > >> The test case block/008 does IO during CPU hotplug, and the WARNING in
> > >> ttwu_queue_wakelist() checks "WARN_ON_ONCE(cpu == smp_processor_id())".
> > >> So it is likely that the test case triggers the warning, but I don't have
> > >> clear view why and how the warning was triggered. It was observed on various
> > >> block devices, so I would like to ask linux-block experts if anyone can tell
> > >> what is going on. Comments will be appreciated.
> >
> > [...]
> >
> > >> [40041.712804][T135817] ------------[ cut here ]------------
> > >> [40041.718489][T135817] WARNING: CPU: 1 PID: 135817 at kernel/sched/core.c:3175 ttwu_queue_wakelist+0x284/0x2f0
> > >> [40041.728311][T135817] Modules linked in: null_blk dm_flakey iscsi_target_mod tcm_loop target_core_pscsi target_core_file target_core_iblock nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw bridge iptable_security stp llc ip_set rfkill nf_tables target_core_user target_core_mod nfnetlink ip6table_filter ip6_tables iptable_filter sunrpc intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass iTCO_wdt intel_pmc_bxt iTCO_vendor_support rapl intel_cstate intel_uncore joydev lpc_ich i2c_i801 i2c_smbus ses enclosure mei_me mei ipmi_ssif ioatdma wmi acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad zram ip_tables ast drm_vram_helper drm_kms_helper syscopyarea sysfillrect crc32c_intel sysimgblt
> > >> [40041.728481][T135817]  fb_sys_fops cec drm_ttm_helper ttm ghash_clmulni_intel drm igb mpt3sas nvme dca i2c_algo_bit nvme_core raid_class scsi_transport_sas pkcs8_key_parser [last unloaded: null_blk]
> > >> [40041.832215][T135817] CPU: 1 PID: 135817 Comm: fio Not tainted 5.13.0-rc5+ #1
> > >> [40041.839262][T135817] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 3.2 11/22/2019
> > >> [40041.847434][T135817] RIP: 0010:ttwu_queue_wakelist+0x284/0x2f0
> > >> [40041.853266][T135817] Code: 34 24 e8 6f 71 64 00 4c 8b 44 24 10 48 8b 4c 24 08 8b 34 24 e9 a1 fe ff ff e8 a8 71 64 00 e9 66 ff ff ff e8 be 71 64 00 eb a0 <0f> 0b 45 31 ff e9 cb fe ff ff 48 89 04 24 e8 49 71 64 00 48 8b 04
> > >> [40041.872793][T135817] RSP: 0018:ffff888106bff348 EFLAGS: 00010046
> > >> [40041.878800][T135817] RAX: 0000000000000001 RBX: ffff888117ec3240 RCX: ffff888811440000
> > >> [40041.886711][T135817] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffffb603d6e8
> > >> [40041.894625][T135817] RBP: 0000000000000001 R08: ffffffffb603d6e8 R09: ffffffffb6ba6167
> > >> [40041.902537][T135817] R10: fffffbfff6d74c2c R11: 0000000000000001 R12: 0000000000000000
> > >> [40041.910451][T135817] R13: ffff88881145fd00 R14: 0000000000000001 R15: ffff888811440001
> > >> [40041.918364][T135817] FS: 00007f8eabf14b80(0000) GS:ffff888811440000(0000) knlGS:0000000000000000
> > >> [40041.927229][T135817] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >> [40041.933756][T135817] CR2: 000055ce81e2cc78 CR3: 000000011be92005 CR4: 00000000001706e0
> > >> [40041.941669][T135817] Call Trace:
> > >> [40041.944895][T135817]  ? lock_is_held_type+0x98/0x110
> > >> [40041.949860][T135817]  try_to_wake_up+0x6f9/0x15e0
> >
> > 2) __queue_work
> >      raw_spin_lock(&pwq->pool->lock) with irq disabled
> >      insert_work
> >        wake_up_worker(pool);
> >          wake_up_process first_idle_worker(pool);
> >
> > Even if waker is lucky enough running on worker's CPU, what is weird is an
> > idle worker can trigger the warning, given
> >
> > 	if (smp_load_acquire(&p->on_cpu) &&
> > 	    ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
> > 		goto unlock;
> >
> > because p->on_cpu must have been false for quite a while.
> >
> > Is there any chance for CPU hotplug to make a difference?
> >
> > Thoughts are welcome.
> >
> > Hillf
>
> Hillf, thank you very much for the comments, and sorry about this late reply.
>
> I walked through related functions to understand your comments and, but I have
> to say that I still don't have enough background knowledge to provide valuable
> comments back to you.
> I understand that the waker and the wakee are on same CPU,
> and it is weird that p->on_cpu is true. This looks indicating that the task
> scheduler failing to control task status on task migration triggered by CPU
> hotplugs, but as far as I read comments in kernel/sched/core.c, CPU hotplug and
> task migration are folded into the design and implementation.
>
> The blktests test case block/008 runs I/Os to a test target block device 30
> seconds. During this I/O, it repeats offlining CPUs and onlining CPUs: when
> there are N CPUs are available, it offlines N-1 CPUs to have only one online
> CPU, then onlines all CPUs again. It repeats this online and offline until I/O
> workload completes. When the only one CPU is online, the waker and the wakee can
> be on the same CPU. Or, one of the waker or the wakee might have been migrated
> from other CPUs to the only one online CPU. But still it is not clear for me why
> it results in the WARNING.
>
> Now I'm trying to recreate the failure. By repeating test cases in "block
> group" on the kernel v5.13-rc5, I was able to recreate the failure. It took 3 to
> 5 hours to recreate it. The test target block device used was a null_blk with
> rather unique configuration (zoned device with zone capacity smaller than zone
> size). I will try to confirm the failure recreation with latest kernel version
> and other block devices.

I tried several device setups and found that a dm-linear device on null_blk
recreates the warning consistently. In case anyone wishes to recreate it, let me
share the bash script below, which I used. I ran it several times on the kernel
v5.13, and every run recreated the warning. On my system (nproc is 8), it took
from 3 to 6 hours to recreate.
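For readers not familiar with block/008, the CPU offline/online cycle described
in the quoted text above can be sketched roughly as follows. This is not the
blktests implementation, only an illustration under assumptions: the function
names and the CPU_SYSFS override are made up for this sketch, and it relies on
the standard /sys/devices/system/cpu/cpuN/online files (writing to them needs
root). The reproduction script itself follows after the sketch.

```shell
#!/bin/bash
# Hypothetical sketch of the offline/online cycle; not taken from blktests.
# CPU_SYSFS can be overridden to try the functions on a scratch directory.
CPU_SYSFS="${CPU_SYSFS:-/sys/devices/system/cpu}"

# Count CPUs currently online. A CPU without an "online" file (often cpu0)
# cannot be offlined and is counted as online.
count_online() {
	local d n=0
	for d in "${CPU_SYSFS}"/cpu[0-9]*; do
		[[ -d ${d} ]] || continue
		if [[ ! -e ${d}/online ]] || [[ $(<"${d}/online") == 1 ]]; then
			n=$((n + 1))
		fi
	done
	echo "${n}"
}

# Offline hotpluggable CPUs one by one until only one CPU remains online.
offline_all_but_one() {
	local f
	for f in "${CPU_SYSFS}"/cpu[0-9]*/online; do
		[[ -e ${f} ]] || continue
		(( $(count_online) > 1 )) || break
		echo 0 > "${f}"
	done
}

# Bring every hotpluggable CPU back online.
online_all() {
	local f
	for f in "${CPU_SYSFS}"/cpu[0-9]*/online; do
		[[ -e ${f} ]] || continue
		echo 1 > "${f}"
	done
}
```

Calling offline_all_but_one and then online_all in a loop while I/O is running
gives the hotplug pattern described above. Pointing CPU_SYSFS at a scratch
directory allows trying the functions without touching real CPUs.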
--------------------------------------------------------------------------------
#!/bin/bash

# create a null_blk device
declare sysfs=/sys/kernel/config/nullb/nullb0
modprobe null_blk nr_devices=0
mkdir "${sysfs}"
echo 1024 > "${sysfs}"/size
echo 1 > "${sysfs}"/memory_backed
echo 1 > "${sysfs}"/power
sleep 1

# create a dm-linear device on the null_blk device
echo "0 $((0x2000 * 4)) linear /dev/nullb0 0" | dmsetup create test
sleep 1

# run blktests, block/008 many times
git clone https://github.com/osandov/blktests.git
cd blktests
echo "TEST_DEVS=( /dev/mapper/test )" > config
for ((i=0;i<1000;i++)); do
	echo $i;
	if ! ./check block/008; then
		break;
	fi
done

--
Best Regards,
Shin'ichiro Kawasaki