RE: Fix a devlink AB-BA deadlock on net namespace deletion


 



Did you audit whether it is safe to traverse pernet_list without holding pernet_ops_rwsem?
When I reviewed this area for the same issue several months ago, it appeared that pernet_ops_rwsem must be held while traversing pernet_list.
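
To make the concern concrete, here is a minimal single-threaded userspace sketch. It assumes that changes to pernet_list are serialized by taking pernet_ops_rwsem for writing; every name in it (toy_rwsem, toy_ops, toy_unregister, and so on) is hypothetical and is not a kernel API. It only shows that a walker holding the read side keeps a writer from unlinking an entry, and that this exclusion is gone once the read side is dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy, single-threaded stand-in for a reader/writer semaphore. */
struct toy_rwsem {
    int readers;   /* number of active readers */
    bool writer;   /* true while a writer holds the lock */
};

/* Hypothetical list node standing in for an entry on pernet_list. */
struct toy_ops {
    struct toy_ops *next;
    int id;
};

bool toy_down_read_trylock(struct toy_rwsem *sem)
{
    if (sem->writer)
        return false;
    sem->readers++;
    return true;
}

void toy_up_read(struct toy_rwsem *sem)
{
    assert(sem->readers > 0);
    sem->readers--;
}

bool toy_down_write_trylock(struct toy_rwsem *sem)
{
    /* A writer needs exclusive access: no readers, no other writer. */
    if (sem->writer || sem->readers > 0)
        return false;
    sem->writer = true;
    return true;
}

void toy_up_write(struct toy_rwsem *sem)
{
    assert(sem->writer);
    sem->writer = false;
}

/*
 * Unlink @victim from the list at @head. Returns false, leaving the
 * list untouched, if write access cannot be obtained right now.
 */
bool toy_unregister(struct toy_rwsem *sem, struct toy_ops **head,
                    struct toy_ops *victim)
{
    if (!toy_down_write_trylock(sem))
        return false;
    for (struct toy_ops **pp = head; *pp; pp = &(*pp)->next) {
        if (*pp == victim) {
            *pp = victim->next;
            victim->next = NULL;
            break;
        }
    }
    toy_up_write(sem);
    return true;
}
```

While a reader holds the toy semaphore, toy_unregister() refuses to touch the list; after toy_up_read() it succeeds. A walker that drops the read side before iterating has no such protection against the entry it is standing on being unlinked.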

You also need to fix your mail client so that it sends patches as plain text only.


From: 张广辉 <zhang.guanghui@xxxxxxxx> 
Sent: Sunday, April 24, 2022 2:02 AM
To: 张广辉 <zhang.guanghui@xxxxxxxx>; Roi Dayan <roid@xxxxxxxxxx>; Saeed Mahameed <saeedm@xxxxxxxxxx>; Parav Pandit <parav@xxxxxxxxxx>; Jason Gunthorpe <jgg@xxxxxxxxxx>; gregkh <gregkh@xxxxxxxxxxxxxxxxxxx>
Cc: linux-kernel <linux-kernel@xxxxxxxxxxxxxxx>; stable <stable@xxxxxxxxxxxxxxx>
Subject: Fix a devlink AB-BA deadlock on net namespace deletion


Hi all,

Deleting a netns takes pernet_ops_rwsem and then takes devlink_mutex.
At the same time, changing the eswitch mode to switchdev holds devlink_mutex and, while unregistering a netdevice notifier, takes pernet_ops_rwsem.
So an AB-BA deadlock can occur. I have made a patch that fixes the deadlock, and it works well. Please help review it. Thanks.


Example sequence:
$ ip netns add foo
$ ip netns del foo &
$ devlink dev eswitch set pci/0000:af:00.1 mode switchdev

Process A:                                   Process B:
cleanup_net()                                genl_family_rcv_msg_doit()
  down_read(&pernet_ops_rwsem); <- A holds pernet_ops_rwsem
  ops_pre_exit_list()                          pre_doit()
                                                 devlink_nl_pre_doit()
                                                   mutex_lock(&devlink_mutex); <- B holds devlink_mutex
    pre_exit()
      devlink_pernet_pre_exit()
        mutex_lock(&devlink_mutex); <- A blocks waiting for devlink_mutex
                                               devlink_nl_cmd_eswitch_set_doit()
                                                 mlx5_devlink_eswitch_mode_set()
                                                   mlx5_lag_disable_change()
                                                     mlx5_disable_lag()
                                                       mlx5_rescan_drivers_locked()
                                                         device_del()
                                                           ...
                                                           unregister_netdevice_notifier()
                                                             down_write(&pernet_ops_rwsem); <- B blocks waiting for pernet_ops_rwsem
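
The inversion in the sequence above can also be expressed as a lock-ordering check, loosely in the spirit of what lockdep does. The following is a minimal userspace sketch, not kernel code: the enum values, the order_seen table and the task_acquire()/task_release() helpers are all hypothetical names chosen for illustration. It records which lock was taken while another was held and reports when a later acquisition uses the opposite order.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical identifiers standing in for the two kernel locks. */
enum lock_id { PERNET_OPS_RWSEM, DEVLINK_MUTEX, NR_LOCKS };

/* order_seen[a][b] is true once some task took b while holding a. */
static bool order_seen[NR_LOCKS][NR_LOCKS];

/* Per-task record of the locks currently held. */
struct task {
    bool held[NR_LOCKS];
};

void order_reset(void)
{
    memset(order_seen, 0, sizeof(order_seen));
}

/*
 * Record that @t acquires @want. Returns true if this acquisition
 * inverts an ordering recorded earlier (an AB-BA pattern).
 */
bool task_acquire(struct task *t, enum lock_id want)
{
    bool inversion = false;

    assert(want < NR_LOCKS);
    for (int held = 0; held < NR_LOCKS; held++) {
        if (!t->held[held] || held == (int)want)
            continue;
        if (order_seen[want][held])
            inversion = true;  /* the opposite order was seen before */
        order_seen[held][want] = true;
    }
    t->held[want] = true;
    return inversion;
}

void task_release(struct task *t, enum lock_id lock)
{
    assert(lock < NR_LOCKS);
    t->held[lock] = false;
}
```

Feeding it the two paths from the report, task A acquiring PERNET_OPS_RWSEM and then DEVLINK_MUTEX records one ordering without complaint; task B acquiring DEVLINK_MUTEX and then PERNET_OPS_RWSEM makes task_acquire() return true. The sketch only tracks ordering and does not model blocking or read/write modes.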
 

Deleting netns trace:
[  248.061947] INFO: task kworker/u160:3:1179 blocked for more than 122 seconds.
[  248.061953]       Not tainted 5.15.13-0.el9.x86_64 #1
[  248.061955] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  248.061956] task:kworker/u160:3  state:D stack:    0 pid: 1179 ppid:     2 flags:0x00004000
[  248.061962] Workqueue: netns cleanup_net
[  248.061970] Call Trace:
[  248.061972]  <TASK>
[  248.061975]  __schedule+0x200/0x540
[  248.061982]  schedule+0x44/0xa0
[  248.061984]  schedule_preempt_disabled+0xa/0x10
[  248.061986]  __mutex_lock.constprop.0+0x212/0x400
[  248.061989]  devlink_pernet_pre_exit+0x2a/0x140
[  248.061994]  cleanup_net+0x1d2/0x3a0
[  248.061997]  process_one_work+0x1e8/0x390
[  248.062003]  worker_thread+0x53/0x3c0
[  248.062005]  ? process_one_work+0x390/0x390
[  248.062007]  kthread+0x10c/0x130
[  248.062011]  ? set_kthread_struct+0x40/0x40
[  248.062014]  ret_from_fork+0x1f/0x30
[  248.062020]  </TASK>

Changing mode to switchdev trace:

[  248.062078] task:devlink         state:D stack:    0 pid: 8546 ppid:  8542 flags:0x00004000
[  248.062081] Call Trace:
[  248.062082]  <TASK>
[  248.062083]  __schedule+0x200/0x540
[  248.062087]  ? free_msg+0x3f/0xb0 [mlx5_core]
[  248.062156]  schedule+0x44/0xa0
[  248.062158]  rwsem_down_write_slowpath+0x19c/0x3c0
[  248.062165]  unregister_netdevice_notifier+0x1c/0xb0
[  248.062168]  mlx5_ib_roce_cleanup+0x8a/0x110 [mlx5_ib]
[  248.062184]  mlx5r_remove+0x36/0x60 [mlx5_ib]
[  248.062196]  auxiliary_bus_remove+0x18/0x30
[  248.062200]  __device_release_driver+0x177/0x240
[  248.062203]  device_release_driver+0x24/0x30
[  248.062205]  bus_remove_device+0xd8/0x140
[  248.062210]  device_del+0x18b/0x400
[  248.062213]  mlx5_rescan_drivers_locked.part.0+0x7e/0x150 [mlx5_core]
[  248.062267]  mlx5_disable_lag+0x149/0x160 [mlx5_core]
[  248.062318]  mlx5_lag_disable_change+0x60/0xa0 [mlx5_core]
[  248.062369]  mlx5_devlink_eswitch_mode_set+0x4b/0x1a0 [mlx5_core]
[  248.062436]  devlink_nl_cmd_eswitch_set_doit+0xc1/0x150
[  248.062440]  genl_family_rcv_msg_doit+0xe7/0x150
[  248.062445]  genl_rcv_msg+0xdc/0x1e0
[  248.062448]  ? __devlink_port_phys_port_name_get+0x1e0/0x1e0
[  248.062451]  ? genl_get_cmd+0xd0/0xd0
[  248.062454]  netlink_rcv_skb+0x4e/0xf0
[  248.062457]  genl_rcv+0x24/0x40
[  248.062460]  netlink_unicast+0x1fe/0x2d0
[  248.062463]  netlink_sendmsg+0x24f/0x4b0
[  248.062466]  sock_sendmsg+0x5b/0x60
[  248.062469]  __sys_sendto+0xf0/0x160
[  248.062473]  ? handle_mm_fault+0xbf/0x280
[  248.062478]  ? do_user_addr_fault+0x1d0/0x670
[  248.062482]  __x64_sys_sendto+0x20/0x30
[  248.062484]  do_syscall_64+0x38/0x90
[  248.062487]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  248.062492] RIP: 0033:0x7ff8cc469c3a
[  248.062494] RSP: 002b:00007ffe06025e08 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[  248.062497] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007ff8cc469c3a
[  248.062499] RDX: 0000000000000038 RSI: 000055c261bf7440 RDI: 0000000000000003
[  248.062501] RBP: 0000000000000000 R08: 00007ff8cc52d200 R09: 000000000000000c
[  248.062502] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  248.062503] R13: 000055c261bf72a0 R14: 000055c260a01d5c R15: 000055c261bf7440
[  248.062505]  </TASK>


The patch details:

diff --git a/linux/net/core/net_namespace.c b/linux/net/core/net_namespace.c
index 202fa5eac..5c872db1f 100644
--- a/linux/net/core/net_namespace.c
+++ b/linux/net/core/net_namespace.c
@@ -576,6 +576,7 @@ static void cleanup_net(struct work_struct *work)
                list_add_tail(&net->exit_list, &net_exit_list);
        }

+       up_read(&pernet_ops_rwsem);
        /* Run all of the network namespace pre_exit methods */
        list_for_each_entry_reverse(ops, &pernet_list, list)
                ops_pre_exit_list(ops, &net_exit_list);
@@ -596,7 +597,6 @@ static void cleanup_net(struct work_struct *work)
        list_for_each_entry_reverse(ops, &pernet_list, list)
                ops_free_list(ops, &net_exit_list);

-       up_read(&pernet_ops_rwsem);

        /* Ensure there are no outstanding rcu callbacks using this
         * network namespace.
 




