On 5/18/20 11:53 AM, Gabriel Krisman Bertazi wrote:
> Lee Duncan <lduncan@xxxxxxxx> writes:
>
>> On 5/7/20 10:59 PM, Gabriel Krisman Bertazi wrote:
>>> iscsi suffers from a deadlock in case a management command submitted via
>>> the netlink socket sleeps on an allocation while holding the
>>> rx_queue_mutex, if that allocation causes a memory reclaim that writes
>>> back to a failed iscsi device. Then, the recovery procedure can never
>>> make progress to recover the failed disk or abort outstanding IO
>>> operations to complete the reclaim (since rx_queue_mutex is locked),
>>> thus locking up the system.
>>>
>>> Nevertheless, just marking all allocations under rx_queue_mutex as
>>> GFP_NOIO (or locking the userspace process with something like
>>> PF_MEMALLOC_NOIO) is not enough, since the iscsi command code relies on
>>> other subsystems that try to grab locked mutexes, whose threads are
>>> GFP_IO, leading to the same deadlock. One instance where this situation
>>> can be observed is in the backtraces below, stitched from multiple bug
>>> reports, involving the kobj uevent sent when a session is created.
>>>
>>> The root of the problem is not the fact that iscsi does GFP_IO
>>> allocations; that is acceptable. The actual problem is that
>>> rx_queue_mutex has a very large granularity, covering every unrelated
>>> netlink command execution at the same time as the error recovery path.
>>>
>>> The proposed fix leverages the recently added mechanism to stop failed
>>> connections from the kernel, by enabling it to execute even though a
>>> management command from the netlink socket is being run (rx_queue_mutex
>>> is held), provided that the command is known to be safe. It splits the
>>> rx_queue_mutex into two mutexes, one protecting from concurrent command
>>> execution from the netlink socket, and one protecting stop_conn from
>>> racing with other connection management operations that might conflict
>>> with it.
>>>
>>> It is not very pretty, but it is the simplest way to resolve the
>>> deadlock. I considered making it a lock per connection, but some
>>> external mutex would still be needed to deal with iscsi_if_destroy_conn.
>>>
>>> The patch was tested by forcing a memory shrinker (unrelated, but used
>>> bufio/dm-verity) to reclaim ISCSI pages every time
>>> ISCSI_UEVENT_CREATE_SESSION happens, which is a reasonable way to
>>> simulate reclaims that might happen with GFP_KERNEL on that path. Then,
>>> a faulty hung target causes a connection to fail during intensive IO, at
>>> the same time a new session is added by iscsid.
>>>
>>> The following stacktraces are stitched from several bug reports, showing
>>> a case where the deadlock can happen.
>>>
>>> iSCSI-write
>>> holding: rx_queue_mutex
>>> waiting: uevent_sock_mutex
>>>
>>> kobject_uevent_env+0x1bd/0x419
>>> kobject_uevent+0xb/0xd
>>> device_add+0x48a/0x678
>>> scsi_add_host_with_dma+0xc5/0x22d
>>> iscsi_host_add+0x53/0x55
>>> iscsi_sw_tcp_session_create+0xa6/0x129
>>> iscsi_if_rx+0x100/0x1247
>>> netlink_unicast+0x213/0x4f0
>>> netlink_sendmsg+0x230/0x3c0
>>>
>>> iscsi_fail iscsi_conn_failure
>>> waiting: rx_queue_mutex
>>>
>>> schedule_preempt_disabled+0x325/0x734
>>> __mutex_lock_slowpath+0x18b/0x230
>>> mutex_lock+0x22/0x40
>>> iscsi_conn_failure+0x42/0x149
>>> worker_thread+0x24a/0xbc0
>>>
>>> EventManager_
>>> holding: uevent_sock_mutex
>>> waiting: dm_bufio_client->lock
>>>
>>> dm_bufio_lock+0xe/0x10
>>> shrink+0x34/0xf7
>>> shrink_slab+0x177/0x5d0
>>> do_try_to_free_pages+0x129/0x470
>>> try_to_free_mem_cgroup_pages+0x14f/0x210
>>> memcg_kmem_newpage_charge+0xa6d/0x13b0
>>> __alloc_pages_nodemask+0x4a3/0x1a70
>>> fallback_alloc+0x1b2/0x36c
>>> __kmalloc_node_track_caller+0xb9/0x10d0
>>> __alloc_skb+0x83/0x2f0
>>> kobject_uevent_env+0x26b/0x419
>>> dm_kobject_uevent+0x70/0x79
>>> dev_suspend+0x1a9/0x1e7
>>> ctl_ioctl+0x3e9/0x411
>>> dm_ctl_ioctl+0x13/0x17
>>> do_vfs_ioctl+0xb3/0x460
>>> SyS_ioctl+0x5e/0x90
>>>
>>> MemcgReclaimerD"
>>> holding: dm_bufio_client->lock
>>> waiting: stuck io to finish (needs iscsi_fail thread to progress)
>>>
>>> schedule at ffffffffbd603618
>>> io_schedule at ffffffffbd603ba4
>>> do_io_schedule at ffffffffbdaf0d94
>>> __wait_on_bit at ffffffffbd6008a6
>>> out_of_line_wait_on_bit at ffffffffbd600960
>>> wait_on_bit.constprop.10 at ffffffffbdaf0f17
>>> __make_buffer_clean at ffffffffbdaf18ba
>>> __cleanup_old_buffer at ffffffffbdaf192f
>>> shrink at ffffffffbdaf19fd
>>> do_shrink_slab at ffffffffbd6ec000
>>> shrink_slab at ffffffffbd6ec24a
>>> do_try_to_free_pages at ffffffffbd6eda09
>>> try_to_free_mem_cgroup_pages at ffffffffbd6ede7e
>>> mem_cgroup_resize_limit at ffffffffbd7024c0
>>> mem_cgroup_write at ffffffffbd703149
>>> cgroup_file_write at ffffffffbd6d9c6e
>>> sys_write at ffffffffbd6662ea
>>> system_call_fastpath at ffffffffbdbc34a2
>>>
>>> Reported-by: Khazhismel Kumykov <khazhy@xxxxxxxxxx>
>>> Signed-off-by: Gabriel Krisman Bertazi <krisman@xxxxxxxxxxxxx>
>>> ---
>>>  drivers/scsi/scsi_transport_iscsi.c | 67 +++++++++++++++++++++--------
>>>  1 file changed, 49 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/drivers/scsi/scsi_transport_iscsi.c b/drivers/scsi/scsi_transport_iscsi.c
>>> index 17a45716a0fe..d99c17306dff 100644
>>> --- a/drivers/scsi/scsi_transport_iscsi.c
>>> +++ b/drivers/scsi/scsi_transport_iscsi.c
>>> @@ -1616,6 +1616,12 @@ static DECLARE_TRANSPORT_CLASS(iscsi_connection_class,
>>>  static struct sock *nls;
>>>  static DEFINE_MUTEX(rx_queue_mutex);
>>>
>>> +/*
>>> + * conn_mutex protects the {start,bind,stop,destroy}_conn from racing
>>> + * against the kernel stop_connection recovery mechanism
>>> + */
>>> +static DEFINE_MUTEX(conn_mutex);
>>> +
>>>  static LIST_HEAD(sesslist);
>>>  static LIST_HEAD(sessdestroylist);
>>>  static DEFINE_SPINLOCK(sesslock);
>>> @@ -2442,6 +2448,32 @@ int iscsi_offload_mesg(struct Scsi_Host *shost,
>>>  }
>>>  EXPORT_SYMBOL_GPL(iscsi_offload_mesg);
>>>
>>> +/*
>>> + * This can be called without the rx_queue_mutex, if invoked by the kernel
>>> + * stop work. But, in that case, it is guaranteed not to race with
>>> + * iscsi_destroy by conn_mutex.
>>> + */
>>> +static void iscsi_if_stop_conn(struct iscsi_cls_conn *conn, int flag)
>>> +{
>>> +        /*
>>> +         * It is important that this path doesn't rely on
>>> +         * rx_queue_mutex, otherwise, a thread doing allocation on a
>>> +         * start_session/start_connection could sleep waiting on a
>>> +         * writeback to a failed iscsi device, that cannot be recovered
>>> +         * because the lock is held. If we don't hold it here, the
>>> +         * kernel stop_conn_work_fn has a chance to stop the broken
>>> +         * session and resolve the allocation.
>>> +         *
>>> +         * Still, the user invoked .stop_conn() needs to be serialized
>>> +         * with stop_conn_work_fn by a private mutex. Not pretty, but
>>> +         * it works.
>>> +         */
>>> +        mutex_lock(&conn_mutex);
>>> +        conn->transport->stop_conn(conn, flag);
>>> +        mutex_unlock(&conn_mutex);
>>> +
>>> +}
>>> +
>>>  static void stop_conn_work_fn(struct work_struct *work)
>>>  {
>>>          struct iscsi_cls_conn *conn, *tmp;
>>> @@ -2460,30 +2492,17 @@ static void stop_conn_work_fn(struct work_struct *work)
>>>                  uint32_t sid = iscsi_conn_get_sid(conn);
>>>                  struct iscsi_cls_session *session;
>>>
>>> -                mutex_lock(&rx_queue_mutex);
>>> -
>>>                  session = iscsi_session_lookup(sid);
>>>                  if (session) {
>>>                          if (system_state != SYSTEM_RUNNING) {
>>>                                  session->recovery_tmo = 0;
>>> -                                conn->transport->stop_conn(conn,
>>> -                                                           STOP_CONN_TERM);
>>> +                                iscsi_if_stop_conn(conn, STOP_CONN_TERM);
>>>                          } else {
>>> -                                conn->transport->stop_conn(conn,
>>> -                                                           STOP_CONN_RECOVER);
>>> +                                iscsi_if_stop_conn(conn, STOP_CONN_RECOVER);
>>>                          }
>>>                  }
>>>
>>>                  list_del_init(&conn->conn_list_err);
>>> -
>>> -                mutex_unlock(&rx_queue_mutex);
>>> -
>>> -                /* we don't want to hold rx_queue_mutex for too long,
>>> -                 * for instance if many conns failed at the same time,
>>> -                 * since this stall other iscsi maintenance operations.
>>> -                 * Give other users a chance to proceed.
>>> -                 */
>>> -                cond_resched();
>>>          }
>>>  }
>>
>> I'm curious about why you removed the cond_resched() here. Is it because
>> it is no longer needed, with shorter (mutex) waiting time?
>>
>
> Hi Lee,
>
> My thought was that the main reason for cond_resched here was to give a
> chance for other iscsi maintenance operations to run. But, since
> conn_mutex, differently from rx_queue_mutex, won't stop most of the
> operations, it felt unnecessary. If you disagree, I can submit a v2
> that doesn't change it.
>
> As far as I know, use of cond_resched() seems semi-random.

I suspected your answer would be what you said. I'm fine with that. If we
see iscsi transmit threads start to hog CPU time we can always revisit
this choice.
>
>>>
>>> @@ -2843,8 +2862,11 @@ iscsi_if_destroy_conn(struct iscsi_transport *transport, struct iscsi_uevent *ev
>>>          spin_unlock_irqrestore(&connlock, flags);
>>>
>>>          ISCSI_DBG_TRANS_CONN(conn, "Destroying transport conn\n");
>>> +
>>> +        mutex_lock(&conn_mutex);
>>>          if (transport->destroy_conn)
>>>                  transport->destroy_conn(conn);
>>> +        mutex_unlock(&conn_mutex);
>>>
>>>          return 0;
>>>  }
>>> @@ -3686,9 +3708,12 @@ iscsi_if_recv_msg(struct sk_buff *skb, struct nlmsghdr *nlh, uint32_t *group)
>>>                          break;
>>>                  }
>>>
>>> +                mutex_lock(&conn_mutex);
>>>                  ev->r.retcode = transport->bind_conn(session, conn,
>>>                                                  ev->u.b_conn.transport_eph,
>>>                                                  ev->u.b_conn.is_leading);
>>> +                mutex_unlock(&conn_mutex);
>>> +
>>>                  if (ev->r.retcode || !transport->ep_connect)
>>>                          break;
>>>
>>> @@ -3709,25 +3734,31 @@ iscsi_if_recv_msg(struct sk_buff *skb, struct nlmsghdr *nlh, uint32_t *group)
>>>                  break;
>>>          case ISCSI_UEVENT_START_CONN:
>>>                  conn = iscsi_conn_lookup(ev->u.start_conn.sid, ev->u.start_conn.cid);
>>> -                if (conn)
>>> +                if (conn) {
>>> +                        mutex_lock(&conn_mutex);
>>>                          ev->r.retcode = transport->start_conn(conn);
>>> +                        mutex_unlock(&conn_mutex);
>>> +                }
>>>                  else
>>>                          err = -EINVAL;
>>>                  break;
>>>          case ISCSI_UEVENT_STOP_CONN:
>>>                  conn = iscsi_conn_lookup(ev->u.stop_conn.sid, ev->u.stop_conn.cid);
>>>                  if (conn)
>>> -                        transport->stop_conn(conn, ev->u.stop_conn.flag);
>>> +                        iscsi_if_stop_conn(conn, ev->u.stop_conn.flag);
>>>                  else
>>>                          err = -EINVAL;
>>>                  break;
>>>          case ISCSI_UEVENT_SEND_PDU:
>>>                  conn = iscsi_conn_lookup(ev->u.send_pdu.sid, ev->u.send_pdu.cid);
>>> -                if (conn)
>>> +                if (conn) {
>>> +                        mutex_lock(&conn_mutex);
>>>                          ev->r.retcode = transport->send_pdu(conn,
>>>                                  (struct iscsi_hdr*)((char*)ev + sizeof(*ev)),
>>>                                  (char*)ev + sizeof(*ev) + ev->u.send_pdu.hdr_size,
>>>                                  ev->u.send_pdu.data_size);
>>> +                        mutex_unlock(&conn_mutex);
>>> +                }
>>>                  else
>>>                          err = -EINVAL;
>>>                  break;
>>>
>>
>> My question above is for my own information, so I'll still say:
>>
>> Reviewed-by: Lee Duncan <lduncan@xxxxxxxx>
>
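To close the loop for anyone skimming the archive later, the locking model the
patch ends up with boils down to the sketch below. This is only an
illustration in user-space pthreads; the names mirror scsi_transport_iscsi.c,
but none of it is the actual driver code and the constant value is a
placeholder. The point it shows: netlink dispatch still serializes on the big
rx_queue_mutex, connection start/bind/stop/destroy additionally take the
narrower conn_mutex, and the kernel-side recovery worker only ever needs
conn_mutex, so it can run even while a netlink command sleeps in an
allocation under rx_queue_mutex.

/*
 * Stand-alone user-space sketch (pthreads) of the locking split in this
 * patch.  Names mirror scsi_transport_iscsi.c, but this is NOT driver code
 * and the flag value is a placeholder.  Build with: cc -pthread sketch.c
 */
#include <pthread.h>
#include <stdio.h>

#define STOP_CONN_RECOVER 3     /* placeholder value, illustration only */

/* Big lock: serializes netlink management commands (iscsi_if_rx path). */
static pthread_mutex_t rx_queue_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Narrow lock: {start,bind,stop,destroy}_conn vs. kernel-side recovery. */
static pthread_mutex_t conn_mutex = PTHREAD_MUTEX_INITIALIZER;

static void stop_conn(int flag)
{
        printf("stop_conn(flag=%d)\n", flag);
}

/* Both the netlink STOP_CONN command and the recovery worker funnel here;
 * only conn_mutex is taken, never rx_queue_mutex. */
static void iscsi_if_stop_conn_sketch(int flag)
{
        pthread_mutex_lock(&conn_mutex);
        stop_conn(flag);
        pthread_mutex_unlock(&conn_mutex);
}

/* Kernel-side recovery: can make progress even while a netlink command
 * sleeps in an allocation under rx_queue_mutex. */
static void *stop_conn_work_fn_sketch(void *arg)
{
        (void)arg;
        iscsi_if_stop_conn_sketch(STOP_CONN_RECOVER);
        return NULL;
}

/* Netlink dispatch: still holds the big lock, but connection operations
 * additionally take conn_mutex so they cannot race the recovery path. */
static void iscsi_if_recv_msg_sketch(void)
{
        pthread_mutex_lock(&rx_queue_mutex);

        pthread_mutex_lock(&conn_mutex);
        printf("start_conn() (may allocate / sleep)\n");
        pthread_mutex_unlock(&conn_mutex);

        pthread_mutex_unlock(&rx_queue_mutex);
}

int main(void)
{
        pthread_t worker;

        pthread_create(&worker, NULL, stop_conn_work_fn_sketch, NULL);
        iscsi_if_recv_msg_sketch();
        pthread_join(worker, NULL);
        return 0;
}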