Re: WARNING at blktests block/008 in ttwu_queue_wakelist()

Shinichiro Kawasaki <shinichiro.kawasaki@xxxxxxx> · Fri, 29 Oct 2021 00:54:43 +0000

The WARNING I reported with kernel v5.13-rc5 is still observed with
v5.15-rc7. It took a long time, but I think I now understand the cause. I share
my findings inline below.

On Jul 09, 2021 / 03:57, Shinichiro Kawasaki wrote:
> On Jun 28, 2021 / 13:43, Shin'ichiro Kawasaki wrote:
> > On Jun 10, 2021 / 16:32, Hillf Danton wrote:
> > > On Wed, 9 Jun 2021 06:55:59 +0000 Damien Le Moal wrote:
> > > >+ Jens and linux-kernel
> > > >
> > > >On 2021/06/09 15:53, Shinichiro Kawasaki wrote:
> > > >> Hi there,
> > > >> 
> > > >> Let me share a blktests failure. When I ran blktests on the kernel v5.13-rc5,
> > > >> block/008 failed. A WARNING below was the cause of the failure.
> > > >> 
> > > >>     WARNING: CPU: 1 PID: 135817 at kernel/sched/core.c:3175 ttwu_queue_wakelist+0x284/0x2f0
> > > >> 
> > > > >> I'm trying to recreate the failure by repeating the test case, but so far I
> > > > >> have not been able to. This failure looks rare, but I have actually observed
> > > > >> it 3 times in the past year.
> > > >> 
> > > >> 1) Oct/2020, kernel: v5.9-rc7  test dev: dm-flakey on AHCI-SATA SMR HDD, log [1]
> > > >> 2) Mar/2021, kernel: v5.12-rc2 test dev: AHCI-SATA SMR HDD, log [2]
> > > >> 3) Jun/2021, kernel: v5.13-rc5 test dev: dm-linear on null_blk zoned, log [3]
> > > >> 
> > > >> The test case block/008 does IO during CPU hotplug, and the WARNING in
> > > >> ttwu_queue_wakelist() checks "WARN_ON_ONCE(cpu == smp_processor_id())".
> > > > >> So it is likely that the test case triggers the warning, but I don't have a
> > > > >> clear view of why and how the warning was triggered. It was observed on various
> > > > >> block devices, so I would like to ask linux-block experts if anyone can tell
> > > > >> what is going on. Comments will be appreciated.
> > > 
> > > [...]
> > > 
> > > >> [40041.712804][T135817] ------------[ cut here ]------------
> > > >> [40041.718489][T135817] WARNING: CPU: 1 PID: 135817 at kernel/sched/core.=
> > > >c:3175 ttwu_queue_wakelist+0x284/0x2f0
> > > >> [40041.728311][T135817] Modules linked in: null_blk dm_flakey iscsi_targe=
> > > >t_mod tcm_loop target_core_pscsi target_core_file target_core_iblock nft_fi=
> > > >b_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_=
> > > >reject_ipv6 nft_reject nft_ct nft_chain_nat ip6table_nat ip6table_mangle ip=
> > > >6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6=
> > > > nf_defrag_ipv4 iptable_mangle iptable_raw bridge iptable_security stp llc =
> > > >ip_set rfkill nf_tables target_core_user target_core_mod nfnetlink ip6table=
> > > >_filter ip6_tables iptable_filter sunrpc intel_rapl_msr intel_rapl_common x=
> > > >86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass iTCO_=
> > > >wdt intel_pmc_bxt iTCO_vendor_support rapl intel_cstate intel_uncore joydev=
> > > > lpc_ich i2c_i801 i2c_smbus ses enclosure mei_me mei ipmi_ssif ioatdma wmi =
> > > >acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad zr=
> > > >am ip_tables ast drm_vram_helper drm_kms_helper syscopyarea sysfillrect crc=
> > > >32c_intel sysimgblt
> > > >> [40041.728481][T135817]  fb_sys_fops cec drm_ttm_helper ttm ghash_clmulni=
> > > >_intel drm igb mpt3sas nvme dca i2c_algo_bit nvme_core raid_class scsi_tran=
> > > >sport_sas pkcs8_key_parser [last unloaded: null_blk]
> > > >> [40041.832215][T135817] CPU: 1 PID: 135817 Comm: fio Not tainted 5.13.0-r=
> > > >c5+ #1
> > > >> [40041.839262][T135817] Hardware name: Supermicro Super Server/X10SRL-F, =
> > > >BIOS 3.2 11/22/2019
> > > >> [40041.847434][T135817] RIP: 0010:ttwu_queue_wakelist+0x284/0x2f0
> > > >> [40041.853266][T135817] Code: 34 24 e8 6f 71 64 00 4c 8b 44 24 10 48 8b 4=
> > > >c 24 08 8b 34 24 e9 a1 fe ff ff e8 a8 71 64 00 e9 66 ff ff ff e8 be 71 64 0=
> > > >0 eb a0 <0f> 0b 45 31 ff e9 cb fe ff ff 48 89 04 24 e8 49 71 64 00 48 8b 04=
> > > >
> > > >> [40041.872793][T135817] RSP: 0018:ffff888106bff348 EFLAGS: 00010046
> > > >> [40041.878800][T135817] RAX: 0000000000000001 RBX: ffff888117ec3240 RCX: =
> > > >ffff888811440000
> > > >> [40041.886711][T135817] RDX: 0000000000000000 RSI: 0000000000000001 RDI: =
> > > >ffffffffb603d6e8
> > > >> [40041.894625][T135817] RBP: 0000000000000001 R08: ffffffffb603d6e8 R09: =
> > > >ffffffffb6ba6167
> > > >> [40041.902537][T135817] R10: fffffbfff6d74c2c R11: 0000000000000001 R12: =
> > > >0000000000000000
> > > >> [40041.910451][T135817] R13: ffff88881145fd00 R14: 0000000000000001 R15: =
> > > >ffff888811440001
> > > >> [40041.918364][T135817] FS:  00007f8eabf14b80(0000) GS:ffff888811440000(0=
> > > >000) knlGS:0000000000000000
> > > >> [40041.927229][T135817] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033=
> > > >
> > > >> [40041.933756][T135817] CR2: 000055ce81e2cc78 CR3: 000000011be92005 CR4: =
> > > >00000000001706e0
> > > >> [40041.941669][T135817] Call Trace:
> > > >> [40041.944895][T135817]  ? lock_is_held_type+0x98/0x110
> > > >> [40041.949860][T135817]  try_to_wake_up+0x6f9/0x15e0
> > > 
> > > 2) __queue_work
> > >      raw_spin_lock(&pwq->pool->lock) with irq disabled
> > >      insert_work
> > >        wake_up_worker(pool);
> > >          wake_up_process first_idle_worker(pool);
> > > 
> > > Even if waker is lucky enough running on worker's CPU, what is weird is an
> > > idle worker can trigger the warning, given
> > > 
> > > 	if (smp_load_acquire(&p->on_cpu) &&
> > > 	    ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
> > > 		goto unlock;
> > > 
> > > because p->on_cpu must have been false for quite a while.

The call trace indicated that try_to_wake_up() called ttwu_queue_wakelist(),
so I had assumed that the call was the one in the hunk Hillf quoted above. I
also thought that was weird, but then I noticed that there is another path to
ttwu_queue_wakelist(): try_to_wake_up() calls ttwu_queue() at another place,
and ttwu_queue() calls ttwu_queue_wakelist(). I confirmed that the warning is
reported through this call path. I think this path can be taken when p->on_cpu
is false.

> > > 
> > > Is there any chance for CPU hotplug to make a difference?

I found that the sd_llc_id change caused by CPU hotplug affects
ttwu_queue_wakelist(). When the warning is reported, ttwu_queue_cond() returns
true in the ttwu_queue_wakelist() hunk below.

	if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(cpu, wake_flags)) {
		if (WARN_ON_ONCE(cpu == smp_processor_id()))
			return false;

And in the ttwu_queue_cond() hunk below, cpus_share_cache() returns false even
though smp_processor_id() and cpu are the same CPU. This is weird. Why would a
single CPU not share the cache with itself?

	/*
	 * If the CPU does not share cache, then queue the task on the
	 * remote rqs wakelist to avoid accessing remote data.
	 */
	if (!cpus_share_cache(smp_processor_id(), cpu))
		return true;

In cpus_share_cache(), sd_llc_id is read twice in the line below.

	return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);

When this_cpu == that_cpu, the two sd_llc_id values are expected to be the
same. However, while CPU hotplug is ongoing, scheduler domains are destroyed and
rebuilt, which modifies the sd_llc_id value. The two sd_llc_id reads in the
code can then return different values.
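
To illustrate the chain described above, here is a minimal userspace sketch in
C. It is only a model, not kernel code: sd_llc_id is a plain array instead of a
per-CPU variable, and rebuild_after_reads, nr_reads, read_sd_llc_id() and the
*_model() / would_warn() helpers are hypothetical names I made up for this
example. The "rebuild" is imitated by changing the stored ID to an arbitrary
different value right after a chosen read.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8

/* Userspace stand-in for the per-CPU sd_llc_id variable. */
static int sd_llc_id[NR_CPUS];

/*
 * Hypothetical knob: after this many reads, a simulated scheduler domain
 * rebuild changes the ID that was just read. 0 means no rebuild happens.
 */
static int rebuild_after_reads;
static int nr_reads;

/* Models per_cpu(sd_llc_id, cpu), optionally followed by a rebuild. */
static int read_sd_llc_id(int cpu)
{
	int val;

	assert(cpu >= 0 && cpu < NR_CPUS);
	val = sd_llc_id[cpu];
	if (++nr_reads == rebuild_after_reads)
		sd_llc_id[cpu] = val + 1;	/* arbitrary different value */
	return val;
}

/* Models the unguarded comparison: two separate reads of sd_llc_id. */
static bool cpus_share_cache_model(int this_cpu, int that_cpu)
{
	int a = read_sd_llc_id(this_cpu);
	int b = read_sd_llc_id(that_cpu);

	return a == b;
}

/* Models only the cache-sharing check quoted from ttwu_queue_cond(). */
static bool ttwu_queue_cond_model(int this_cpu, int cpu)
{
	return !cpus_share_cache_model(this_cpu, cpu);
}

/* True when WARN_ON_ONCE(cpu == smp_processor_id()) would fire. */
static bool would_warn(int this_cpu, int cpu)
{
	return ttwu_queue_cond_model(this_cpu, cpu) && cpu == this_cpu;
}
```

With rebuild_after_reads set to 0, would_warn(1, 1) is false. With it set to 1,
the second read sees the changed ID, the comparison fails for the same CPU, and
would_warn(1, 1) becomes true.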

I think cpus_share_cache() needs a fix that takes into account that sd_llc_id
may change. My idea is to add a check for this_cpu == that_cpu. Based on this
idea, I created a fix, and with it the warning is avoided. I will post the patch
for further discussion and review.
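
The idea can be sketched in userspace C as below. This is not the patch itself,
only a model under assumptions: sd_llc_id is a plain array rather than a
per-CPU variable, and between_reads, fake_rebuild() and
cpus_share_cache_model() are hypothetical names introduced for illustration.
The hook imitates a domain rebuild landing between the two reads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 8

/* Userspace stand-in for the per-CPU sd_llc_id variable. */
static int sd_llc_id[NR_CPUS];

/* Hypothetical hook run between the two reads to imitate a rebuild. */
static void (*between_reads)(void);

/* Imitates a rebuild that gives every CPU its own LLC ID. */
static void fake_rebuild(void)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		sd_llc_id[cpu] = cpu;
}

/* Comparison with the proposed same-CPU check added in front. */
static bool cpus_share_cache_model(int this_cpu, int that_cpu)
{
	int a, b;

	assert(this_cpu >= 0 && this_cpu < NR_CPUS);
	assert(that_cpu >= 0 && that_cpu < NR_CPUS);

	/* A CPU always shares cache with itself: skip both reads. */
	if (this_cpu == that_cpu)
		return true;

	a = sd_llc_id[this_cpu];
	if (between_reads)
		between_reads();
	b = sd_llc_id[that_cpu];

	return a == b;
}
```

With the early return, the same-CPU case no longer depends on sd_llc_id at all,
so a concurrent change cannot make it return false. Two different CPUs can still
observe IDs from before and after a rebuild in this model; the check only covers
the case that triggers the warning.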

-- 
Best Regards,
Shin'ichiro Kawasaki

> > > 
> > > Thoughts are welcome.
> > > 
> > > Hillf
> > 
> > Hillf, thank you very much for the comments, and sorry about this late reply.
> > 
> > I walked through the related functions to understand your comments, but I have
> > to say that I still don't have enough background knowledge to provide valuable
> > comments back to you. I understand that the waker and the wakee are on the same
> > CPU, and that it is weird that p->on_cpu is true. This seems to indicate that
> > the task scheduler fails to control task status on task migration triggered by
> > CPU hotplug, but as far as I read the comments in kernel/sched/core.c, CPU
> > hotplug and task migration are folded into the design and implementation.
> > 
> > The blktests test case block/008 runs I/O to a test target block device for 30
> > seconds. During this I/O, it repeatedly offlines and onlines CPUs: when N CPUs
> > are available, it offlines N-1 CPUs to leave only one online CPU, then onlines
> > all CPUs again. It repeats this offline and online cycle until the I/O workload
> > completes. When only one CPU is online, the waker and the wakee can be on the
> > same CPU. Or, either the waker or the wakee might have been migrated from
> > another CPU to the only online CPU. But it is still not clear to me why this
> > results in the WARNING.
> > 
> > Now I'm trying to recreate the failure. By repeating test cases in "block
> > group" on the kernel v5.13-rc5, I was able to recreate the failure. It took 3 to
> > 5 hours to recreate it. The test target block device used was a null_blk with
> > rather unique configuration (zoned device with zone capacity smaller than zone
> > size). I will try to confirm the failure recreation with latest kernel version
> > and other block devices.
> 
> I tried some device setups, and found that a dm-linear device on null_blk
> recreates the warning consistently. In case anyone wishes to recreate it, let
> me share the bash script below which I used. I tried it several times on
> kernel v5.13, and all tries recreated the warning. On my system (nproc is 8),
> it took from 3 to 6 hours to recreate.
> 
> --------------------------------------------------------------------------------
> #!/bin/bash
> 
> # create a null_blk device
> declare sysfs=/sys/kernel/config/nullb/nullb0
> modprobe null_blk nr_devices=0
> mkdir "${sysfs}"
> echo 1024 > "${sysfs}"/size
> echo 1 > "${sysfs}"/memory_backed
> echo 1 > "${sysfs}"/power
> sleep 1
> 
> # create a dm-linear device on the null_blk device
> echo "0 $((0x2000 * 4)) linear /dev/nullb0 0" | dmsetup create test
> sleep 1
> 
> # run blktests, block/008 many times
> git clone https://github.com/osandov/blktests.git
> cd blktests
> echo "TEST_DEVS=( /dev/mapper/test )" > config
> for ((i=0;i<1000;i++)); do
>         echo $i;
>         if ! ./check block/008; then
>                 break;
>         fi
> done
> 
> -- 
> Best Regards,
> Shin'ichiro Kawasaki