Re: WARNING at blktests block/008 in ttwu_queue_wakelist()

Shinichiro Kawasaki <shinichiro.kawasaki@xxxxxxx> · Fri, 9 Jul 2021 03:57:04 +0000

On Jun 28, 2021 / 13:43, Shin'ichiro Kawasaki wrote:
> On Jun 10, 2021 / 16:32, Hillf Danton wrote:
> > On Wed, 9 Jun 2021 06:55:59 +0000 Damien Le Moal wrote:
> > >+ Jens and linux-kernel
> > >
> > >On 2021/06/09 15:53, Shinichiro Kawasaki wrote:
> > >> Hi there,
> > >> 
> > >> Let me share a blktests failure. When I ran blktests on the kernel v5.13-rc5,
> > >> block/008 failed. A WARNING below was the cause of the failure.
> > >> 
> > >>     WARNING: CPU: 1 PID: 135817 at kernel/sched/core.c:3175 ttwu_queue_wakelist+0x284/0x2f0
> > >> 
> > >> I'm trying to recreate the failure by repeating the test case but, so far, I
> > >> have not been able to. This failure looks rare, but I have actually observed
> > >> it 3 times in the past year.
> > >> 
> > >> 1) Oct/2020, kernel: v5.9-rc7  test dev: dm-flakey on AHCI-SATA SMR HDD, log [1]
> > >> 2) Mar/2021, kernel: v5.12-rc2 test dev: AHCI-SATA SMR HDD, log [2]
> > >> 3) Jun/2021, kernel: v5.13-rc5 test dev: dm-linear on null_blk zoned, log [3]
> > >> 
> > >> The test case block/008 does IO during CPU hotplug, and the WARNING in
> > >> ttwu_queue_wakelist() checks "WARN_ON_ONCE(cpu == smp_processor_id())".
> > >> So it is likely that the test case triggers the warning, but I don't have a
> > >> clear view of why and how the warning was triggered. It was observed on
> > >> various block devices, so I would like to ask linux-block experts if anyone
> > >> can tell what is going on. Comments will be appreciated.
> > 
> > [...]
> > 
> > >> [40041.712804][T135817] ------------[ cut here ]------------
> > >> [40041.718489][T135817] WARNING: CPU: 1 PID: 135817 at kernel/sched/core.c:3175 ttwu_queue_wakelist+0x284/0x2f0
> > >> [40041.728311][T135817] Modules linked in: null_blk dm_flakey iscsi_target_mod tcm_loop target_core_pscsi target_core_file target_core_iblock nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw bridge iptable_security stp llc ip_set rfkill nf_tables target_core_user target_core_mod nfnetlink ip6table_filter ip6_tables iptable_filter sunrpc intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass iTCO_wdt intel_pmc_bxt iTCO_vendor_support rapl intel_cstate intel_uncore joydev lpc_ich i2c_i801 i2c_smbus ses enclosure mei_me mei ipmi_ssif ioatdma wmi acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad zram ip_tables ast drm_vram_helper drm_kms_helper syscopyarea sysfillrect crc32c_intel sysimgblt
> > >> [40041.728481][T135817]  fb_sys_fops cec drm_ttm_helper ttm ghash_clmulni_intel drm igb mpt3sas nvme dca i2c_algo_bit nvme_core raid_class scsi_transport_sas pkcs8_key_parser [last unloaded: null_blk]
> > >> [40041.832215][T135817] CPU: 1 PID: 135817 Comm: fio Not tainted 5.13.0-rc5+ #1
> > >> [40041.839262][T135817] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 3.2 11/22/2019
> > >> [40041.847434][T135817] RIP: 0010:ttwu_queue_wakelist+0x284/0x2f0
> > >> [40041.853266][T135817] Code: 34 24 e8 6f 71 64 00 4c 8b 44 24 10 48 8b 4c 24 08 8b 34 24 e9 a1 fe ff ff e8 a8 71 64 00 e9 66 ff ff ff e8 be 71 64 00 eb a0 <0f> 0b 45 31 ff e9 cb fe ff ff 48 89 04 24 e8 49 71 64 00 48 8b 04
> > >> [40041.872793][T135817] RSP: 0018:ffff888106bff348 EFLAGS: 00010046
> > >> [40041.878800][T135817] RAX: 0000000000000001 RBX: ffff888117ec3240 RCX: ffff888811440000
> > >> [40041.886711][T135817] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffffb603d6e8
> > >> [40041.894625][T135817] RBP: 0000000000000001 R08: ffffffffb603d6e8 R09: ffffffffb6ba6167
> > >> [40041.902537][T135817] R10: fffffbfff6d74c2c R11: 0000000000000001 R12: 0000000000000000
> > >> [40041.910451][T135817] R13: ffff88881145fd00 R14: 0000000000000001 R15: ffff888811440001
> > >> [40041.918364][T135817] FS:  00007f8eabf14b80(0000) GS:ffff888811440000(0000) knlGS:0000000000000000
> > >> [40041.927229][T135817] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >> [40041.933756][T135817] CR2: 000055ce81e2cc78 CR3: 000000011be92005 CR4: 00000000001706e0
> > >> [40041.941669][T135817] Call Trace:
> > >> [40041.944895][T135817]  ? lock_is_held_type+0x98/0x110
> > >> [40041.949860][T135817]  try_to_wake_up+0x6f9/0x15e0
> > 
> > 2) __queue_work
> >      raw_spin_lock(&pwq->pool->lock) with irq disabled
> >      insert_work
> >        wake_up_worker(pool);
> >          wake_up_process first_idle_worker(pool);
> > 
> > Even if waker is lucky enough running on worker's CPU, what is weird is an
> > idle worker can trigger the warning, given
> > 
> > 	if (smp_load_acquire(&p->on_cpu) &&
> > 	    ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
> > 		goto unlock;
> > 
> > because p->on_cpu must have been false for quite a while.
> > 
> > Is there any chance for CPU hotplug to make a difference?
> > 
> > Thoughts are welcome.
> > 
> > Hillf
> 
> Hillf, thank you very much for the comments, and sorry about this late reply.
> 
> I walked through the related functions to understand your comments, but I have
> to say that I still don't have enough background knowledge to provide valuable
> comments back to you. I understand that the waker and the wakee are on the same
> CPU, and that it is weird that p->on_cpu is true. This seems to indicate that
> the task scheduler fails to control task status on task migration triggered by
> CPU hotplug, but as far as I read the comments in kernel/sched/core.c, CPU
> hotplug and task migration are folded into the design and implementation.
> 
> The blktests test case block/008 runs I/O to a test target block device for 30
> seconds. During this I/O, it repeatedly offlines and onlines CPUs: when N CPUs
> are available, it offlines N-1 CPUs to leave only one CPU online, then onlines
> all CPUs again. It repeats this cycle until the I/O workload completes. While
> only one CPU is online, the waker and the wakee can be on the same CPU. Or,
> either the waker or the wakee might have been migrated from another CPU to the
> only online CPU. But it is still not clear to me why this results in the
> WARNING.
> 
> Now I'm trying to recreate the failure. By repeating the test cases in the
> "block" group on the kernel v5.13-rc5, I was able to recreate the failure. It
> took 3 to 5 hours to recreate it. The test target block device used was a
> null_blk with a rather unique configuration (a zoned device with zone capacity
> smaller than zone size). I will try to confirm the failure recreation with the
> latest kernel version and other block devices.
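
For reference, the offline/online cycle described in the quoted text above can
be sketched roughly as below. This is only an illustration, not the actual
block/008 code. The function names and the CPU_SYSFS override are hypothetical;
only the /sys/devices/system/cpu/cpuN/online control files are the real kernel
interface. It assumes cpu0 is the CPU that stays online (on many systems cpu0
has no "online" file and cannot be offlined).

```shell
#!/bin/bash
# Sketch only: NOT the block/008 implementation.
# CPU_SYSFS is a hypothetical override so the functions can be dry-run
# against a fake directory tree; the default is the real sysfs path.
CPU_SYSFS=${CPU_SYSFS:-/sys/devices/system/cpu}

# Print the numbers of all CPUs that have an "online" control file.
hotpluggable_cpus() {
    local f dir
    for f in "${CPU_SYSFS}"/cpu[0-9]*/online; do
        [[ -e ${f} ]] || continue
        dir=${f%/online}
        echo "${dir##*/cpu}"
    done
}

# Offline every hotpluggable CPU except the one given as $1 (default: 0).
offline_all_but() {
    local keep=${1:-0} cpu
    for cpu in $(hotpluggable_cpus); do
        [[ ${cpu} == "${keep}" ]] && continue
        echo 0 > "${CPU_SYSFS}/cpu${cpu}/online"
    done
}

# Bring every hotpluggable CPU back online.
online_all() {
    local cpu
    for cpu in $(hotpluggable_cpus); do
        echo 1 > "${CPU_SYSFS}/cpu${cpu}/online"
    done
}
```

As root, one cycle would then be "offline_all_but 0; sleep 1; online_all",
repeated while the I/O workload runs.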

I tried some device setups, and found that a dm-linear device on null_blk
recreates the warning consistently. In case anyone wishes to recreate it, let
me share the bash script I used below. I tried it several times on the
kernel v5.13, and every try recreated the warning. On my system (nproc is 8),
it took from 3 to 6 hours to recreate.

--------------------------------------------------------------------------------
#!/bin/bash

# create a null_blk device
declare sysfs=/sys/kernel/config/nullb/nullb0
modprobe null_blk nr_devices=0
mkdir "${sysfs}"
echo 1024 > "${sysfs}"/size
echo 1 > "${sysfs}"/memory_backed
echo 1 > "${sysfs}"/power
sleep 1

# create a dm-linear device on the null_blk device
echo "0 $((0x2000 * 4)) linear /dev/nullb0 0" | dmsetup create test
sleep 1

# run blktests block/008 repeatedly, stopping at the first failure
git clone https://github.com/osandov/blktests.git
cd blktests || exit 1
echo "TEST_DEVS=( /dev/mapper/test )" > config
for ((i=0;i<1000;i++)); do
        echo "$i"
        # ./check returns non-zero when the test case fails
        if ! ./check block/008; then
                break
        fi
done
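
The loop above stops at any block/008 failure, which is not necessarily the
WARNING in question. A minimal sketch of a check to confirm it, assuming the
kernel log line has the same format as in the report quoted above, could look
like this. The function name has_ttwu_warning is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the kernel log text given as file
# arguments (or on stdin when no file is given) contains a WARNING line
# raised from ttwu_queue_wakelist().
has_ttwu_warning() {
    cat -- "$@" | grep -Eq 'WARNING: .* ttwu_queue_wakelist\+0x'
}
```

For example, after the loop breaks:
"dmesg | has_ttwu_warning && echo 'recreated the WARNING'".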

-- 
Best Regards,
Shin'ichiro Kawasaki