Re: A hang bug of dm on s390x

On Wed, Feb 15, 2023 at 07:23:40PM +0800, Pingfan Liu wrote:
> Hi guys,
> 
> I encountered a hang on an s390x system. The tested kernel is
> not preemptible and was booted with "nr_cpus=1".
> 
> The test steps:
>       umount /home
>       lvremove /dev/rhel_s390x-kvm-011/home
>       ## uncomment "snapshot_autoextend_threshold = 70" and
>       ## "snapshot_autoextend_percent = 20" in /etc/lvm/lvm.conf
> 
>       systemctl enable lvm2-monitor.service
>       systemctl start lvm2-monitor.service
> 
>       lvremove -y rhel_s390x-kvm-011/thinp
>       lvcreate -L 10M -T rhel_s390x-kvm-011/thinp
>       lvcreate -V 400M -T rhel_s390x-kvm-011/thinp -n src
>       mkfs.ext4 /dev/rhel_s390x-kvm-011/src
>       mount /dev/rhel_s390x-kvm-011/src /mnt
>       for((i=0;i<4;i++)); do dd if=/dev/zero of=/mnt/test$i.img bs=100M count=1; done
> 
> And the system hangs with the console log [1]
> 
> The related kernel config
> 
>     CONFIG_PREEMPT_NONE_BUILD=y
>     CONFIG_PREEMPT_NONE=y
>     CONFIG_PREEMPT_COUNT=y
>     CONFIG_SCHED_CORE=y
> 
> It turns out that when it hangs, the kernel is stuck in an endless
> loop in the function dm_wq_work():
>         while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
>                 spin_lock_irq(&md->deferred_lock);
>                 bio = bio_list_pop(&md->deferred);
>                 spin_unlock_irq(&md->deferred_lock);
> 
>                 if (!bio)
>                         break;
>                 thread_cpu = smp_processor_id();
>                 submit_bio_noacct(bio);
>         }
> where dm_wq_work()->__submit_bio_noacct()->...->dm_handle_requeue()
> keeps generating new bios, so the condition "if (!bio)" is never
> met.
> 
> 
> After applying the following patch, the issue is gone.
> 
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index e1ea3a7bd9d9..95c9cb07a42f 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -2567,6 +2567,7 @@ static void dm_wq_work(struct work_struct *work)
>                         break;
> 
>                 submit_bio_noacct(bio);
> +               cond_resched();
>         }
>  }
> 
> But I do not think it is a proper solution. Without this patch, if
> nr_cpus=1 is removed (the system has two cpus), the issue cannot be
> triggered. That suggests that with more than one cpu, the above loop
> can exit through the condition "if (!bio)".
> 
> Any ideas?

I think the patch is correct.

For a kernel built without CONFIG_PREEMPT running on a single CPU core,
if the dm target (such as dm-thin) needs another wq or kthread to
handle IO, the dm target side is blocked because dm_wq_work() holds
the only CPU. Sooner or later the dm target may run out of resources
to handle new IO from dm core and return REQUEUE.

dm_wq_work() then becomes a dead loop.
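
A rough user-space sketch of that starvation, not kernel code. All names
here (struct sim, target_worker_run(), submit(), drain_deferred()) are
hypothetical, and cond_resched() is modeled simply as "the target's worker
gets to run once", which is an assumption about what the scheduler picks:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state involved; not real dm structures. */
struct sim {
	int deferred;   /* bios waiting on the deferred list */
	int free_slots; /* target resources available for new IO */
	int in_flight;  /* IO accepted by the target, waiting for its worker */
};

/*
 * Stand-in for the target's own wq/kthread: completes in-flight IO and
 * gives the resources back. On one non-preemptible CPU it can only run
 * when the submit loop gives up the CPU.
 */
static void target_worker_run(struct sim *s)
{
	s->free_slots += s->in_flight;
	s->in_flight = 0;
}

/* Stand-in for submit_bio_noacct(): accept the bio or requeue it. */
static void submit(struct sim *s)
{
	if (s->free_slots > 0) {
		s->free_slots--;
		s->in_flight++;
	} else {
		s->deferred++; /* REQUEUE: back onto the deferred list */
	}
}

/*
 * Model of the dm_wq_work() loop. Returns the number of iterations
 * needed to drain the list, or -1 if max_iter is reached first.
 */
long drain_deferred(int bios, int slots, bool yield_cpu, long max_iter)
{
	struct sim s = { .deferred = bios, .free_slots = slots, .in_flight = 0 };
	long iter = 0;

	while (s.deferred > 0) {
		if (iter++ >= max_iter)
			return -1; /* would spin forever in the real loop */

		s.deferred--; /* bio_list_pop() */
		submit(&s);
		assert(s.deferred >= 0 && s.free_slots >= 0);

		if (yield_cpu)
			target_worker_run(&s); /* models cond_resched() */
	}
	return iter;
}
```

With 4 deferred bios and 2 target slots, drain_deferred(4, 2, false, 1000)
hits the iteration cap because every bio after the second is requeued,
while drain_deferred(4, 2, true, 1000) drains the list. With a second CPU
the worker would run concurrently, which this model does not cover.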


Thanks,
Ming
--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://listman.redhat.com/mailman/listinfo/dm-devel



