Re: [PATCH] kernfs: support kernfs notify in memory reclaim context

On 11/14/23 11:06 AM, Tejun Heo wrote:

Hello,

On Tue, Nov 14, 2023 at 10:59:47AM -0800, Junxiao Bi wrote:
kernfs notify is used in the write path of md (md_write_start) to wake up
a userspace daemon, like "mdmon", for updating the md superblock of an imsm
raid. The md write will wait for that update to finish before issuing the
write; if this
How is forward progress guaranteed for that userspace daemon? This sounds
like a really fragile setup.

For imsm raid, the userspace daemon "mdmon" is responsible for updating the raid metadata; the kernel uses kernfs_notify to wake up the daemon whenever a metadata update is required. If the daemon can't make forward progress, writes may hang, but that would be a bug in the daemon?
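For context on the daemon side, here is a rough sketch (hypothetical, not taken from the mdmon sources; the device path is only an example) of the standard sysfs poll loop that kernfs_notify()/sysfs_notify_dirent_safe() wakes up:

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[64];
    ssize_t n;
    /* Example path only; mdmon watches the sysfs attributes of the arrays it manages. */
    int fd = open("/sys/block/md0/md/array_state", O_RDONLY);

    if (fd < 0)
        return 1;

    /* Read once up front; kernfs_notify() then makes poll() return POLLPRI/POLLERR. */
    read(fd, buf, sizeof(buf));

    for (;;) {
        struct pollfd pfd = { .fd = fd, .events = POLLPRI | POLLERR };

        if (poll(&pfd, 1, -1) <= 0)
            continue;

        /* On every notification, seek back and re-read the attribute. */
        lseek(fd, 0, SEEK_SET);
        n = read(fd, buf, sizeof(buf) - 1);
        if (n > 0) {
            buf[n] = '\0';
            printf("array_state changed: %s", buf);
            /* A real daemon would now update the external metadata and
             * acknowledge it so md_write_start() can stop waiting. */
        }
    }
}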


write is used for memory reclaim, the system may hang because the kernfs
notify can't be executed. That's because kernfs notify is executed by
"system_wq", which doesn't have a rescuer thread, and a kworker thread may
not be created under memory pressure, so the userspace daemon can't be woken
up and the md write will hang.

According to Tejun, this can't be fixed by adding WQ_MEM_RECLAIM to
"system_wq", because that workqueue is shared and someone else might occupy
the rescuer thread. Fixing this from the md side would require replacing
kernfs notify with some other way to communicate with the userspace daemon,
which would break the userspace interface, so use a separate workqueue for
kernfs notify to allow it to be used in a memory reclaim context.
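
For reference, a minimal sketch of what such a change could look like (the names kernfs_notify_wq and kernfs_notify_wq_init are assumptions, not necessarily what the patch uses). Today kernfs_notify() queues kernfs_notify_work with schedule_work(), i.e. on system_wq; the idea is to queue it on a dedicated WQ_MEM_RECLAIM workqueue so its rescuer thread guarantees forward progress:

#include <linux/init.h>
#include <linux/kernfs.h>
#include <linux/workqueue.h>

static void kernfs_notify_workfn(struct work_struct *work)
{
    /* Walks the nodes queued by kernfs_notify() and wakes up pollers;
     * elided in this sketch. */
}
static DECLARE_WORK(kernfs_notify_work, kernfs_notify_workfn);

/* Assumed name for the dedicated, reclaim-safe notification workqueue. */
static struct workqueue_struct *kernfs_notify_wq;

static int __init kernfs_notify_wq_init(void)
{
    /* WQ_MEM_RECLAIM reserves a rescuer thread, so the work can still run
     * when new kworkers can't be forked under memory pressure. */
    kernfs_notify_wq = alloc_workqueue("kernfs_notify", WQ_MEM_RECLAIM, 0);
    return kernfs_notify_wq ? 0 : -ENOMEM;
}

void kernfs_notify(struct kernfs_node *kn)
{
    /* ... existing bookkeeping ... */

    /* was: schedule_work(&kernfs_notify_work);  (runs on system_wq) */
    queue_work(kernfs_notify_wq, &kernfs_notify_work);
}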
I'm not necessarily against the change but please go into a bit more detail
on how and why it's structured this way, and add a comment explaining who's
depending on kernfs notify for reclaim forward progress.

"kthreadd" was doing memory reclaim and got stuck on the md flush work; the md flush work was stuck in md_write_start, waiting for the "MD_SB_CHANGE_PENDING" flag to be cleared. Before waiting, it invoked kernfs_notify to wake up the userspace daemon, which should update the metadata and clear the flag.



PID: 2        TASK: ffff8df829539e40  CPU: 103  COMMAND: "kthreadd"
 #0 [ffffaf14800f3220] __schedule at ffffffff9488cbac
 #1 [ffffaf14800f32c0] schedule at ffffffff9488d1c6
 #2 [ffffaf14800f32d8] schedule_timeout at ffffffff948916e6
 #3 [ffffaf14800f3360] wait_for_completion at ffffffff9488ddeb
 #4 [ffffaf14800f33c8] flush_work at ffffffff940b5103
 #5 [ffffaf14800f3448] xlog_cil_force_lsn at ffffffffc0571791 [xfs]
 #6 [ffffaf14800f34e8] _xfs_log_force_lsn at ffffffffc056f79f [xfs]
 #7 [ffffaf14800f3570] xfs_log_force_lsn at ffffffffc056fa8c [xfs]
 #8 [ffffaf14800f35a8] __dta___xfs_iunpin_wait_3444 at ffffffffc05595c4 [xfs]
 #9 [ffffaf14800f3620] xfs_iunpin_wait at ffffffffc055c229 [xfs]
#10 [ffffaf14800f3630] __dta_xfs_reclaim_inode_3358 at ffffffffc054f8cc [xfs]
#11 [ffffaf14800f3680] xfs_reclaim_inodes_ag at ffffffffc054fd56 [xfs]
#12 [ffffaf14800f3818] xfs_reclaim_inodes_nr at ffffffffc0551013 [xfs]
#13 [ffffaf14800f3838] xfs_fs_free_cached_objects at ffffffffc0565469 [xfs]
#14 [ffffaf14800f3848] super_cache_scan at ffffffff942959a7
#15 [ffffaf14800f38a0] shrink_slab at ffffffff941fa935
#16 [ffffaf14800f3988] shrink_node at ffffffff942005d8
#17 [ffffaf14800f3a10] do_try_to_free_pages at ffffffff94200ae2
#18 [ffffaf14800f3a78] try_to_free_pages at ffffffff94200e89
#19 [ffffaf14800f3b00] __alloc_pages_slowpath at ffffffff941ed82c
#20 [ffffaf14800f3c20] __alloc_pages_nodemask at ffffffff941ee191
#21 [ffffaf14800f3c90] __vmalloc_node_range at ffffffff9423a8e7
#22 [ffffaf14800f3d00] copy_process at ffffffff94096670
#23 [ffffaf14800f3de8] _do_fork at ffffffff94097f30
#24 [ffffaf14800f3e68] kernel_thread at ffffffff94098219
#25 [ffffaf14800f3e78] kthreadd at ffffffff940bd4e5
#26 [ffffaf14800f3f50] ret_from_fork at ffffffff94a00354


PID: 852      TASK: ffff8e351fc51e40  CPU: 77   COMMAND: "md"
 #0 [ffffaf148e983c68] __schedule at ffffffff9488cbac
 #1 [ffffaf148e983d08] schedule at ffffffff9488d1c6
 #2 [ffffaf148e983d20] md_write_start at ffffffff9469fc75
 #3 [ffffaf148e983d80] raid1_make_request at ffffffffc038d8bd [raid1]
 #4 [ffffaf148e983da8] md_handle_request at ffffffff9469cc24
 #5 [ffffaf148e983e18] md_submit_flush_data at ffffffff9469cce1
 #6 [ffffaf148e983e38] process_one_work at ffffffff940b5bd9
 #7 [ffffaf148e983e80] rescuer_thread at ffffffff940b6334
 #8 [ffffaf148e983f08] kthread at ffffffff940bc245
 #9 [ffffaf148e983f50] ret_from_fork at ffffffff94a00354


bool md_write_start(struct mddev *mddev, struct bio *bi)
{
    ...

   >>> process 852 goes into the "if" and sets "MD_SB_CHANGE_PENDING"

    if (mddev->in_sync || mddev->sync_checkers) {
        spin_lock(&mddev->lock);
        if (mddev->in_sync) {
            mddev->in_sync = 0;
            set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
            set_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
            md_wakeup_thread(mddev->thread);
            did_change = 1;
        }
        spin_unlock(&mddev->lock);
    }
    rcu_read_unlock();

    >>> invokes kernfs_notify to wake up the userspace daemon

    if (did_change)
        sysfs_notify_dirent_safe(mddev->sysfs_state);
    if (!mddev->has_superblocks)
        return true;

   >>>> hangs here waiting for the userspace daemon to clear that flag

    wait_event(mddev->sb_wait,
           !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) ||
           is_md_suspended(mddev));
    if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
        percpu_ref_put(&mddev->writes_pending);
        return false;
    }
    return true;
}

Thanks,

Junxiao.


Thanks.



