Re: [PATCH net-next 00/10] net/smc: optimize the parallelism of SMC-R connections

Tony Lu <tonylu@xxxxxxxxxxxxxxxxx> · Tue, 16 Aug 2022 20:40:50 +0800

On Tue, Aug 16, 2022 at 11:35:15AM +0200, Jan Karcher wrote:
> 
> 
> On 10.08.2022 19:47, D. Wythe wrote:
> > From: "D. Wythe" <alibuda@xxxxxxxxxxxxxxxxx>
> > 
> > This patch set attempts to optimize the parallelism of SMC-R connections,
> > mainly to reduce unnecessary blocking on locks, and to fix exceptions that
> > occur after thoses optimization.
> > 
> 
> Thank you again for your submission!
> Let me give you a quick update from our side:
> We tested your patches on top of the net-next kernel on our s390 systems.
> They did crash our systems. After verifying our environment we pulled
> console logs and now we can tell that there is indeed a problem with your
> patches regarding SMC-D. So please do not integrate this change as of right
> now. I'm going to do more in depth reviews of your patches but i need some
> time for them so here is a quick a description of the problem:
> 
> It is a SMC-D problem, that occurs while building up the connection. In
> smc_conn_create you set struct smc_lnk_cluster *lnkc = NULL. For the SMC-R
> path you do grab the pointer, for SMC-D that never happens. Still you are
> using this refernce for SMC-D => Crash. This problem can be reproduced using
> the SMC-D path. Here is an example console output:

Got it.

> 
> [  779.516382] Unable to handle kernel pointer dereference in virtual kernel
> address space
> [  779.516389] Failing address: 0000000000000000 TEID: 0000000000000483
> [  779.516391] Fault in home space mode while using kernel ASCE.
> [  779.516395] AS:0000000069628007 R3:00000000ffbf0007 S:00000000ffbef800
> P:000000000000003d
> [  779.516431] Oops: 0004 ilc:2 [#1] SMP
> [  779.516436] Modules linked in: tcp_diag inet_diag ism mlx5_ib ib_uverbs
> mlx5_core smc_diag smc ib_core nft_fib_inet nft_fib_ipv4
> nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6
> nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv
> 6 nf_defrag_ipv4 ip_set nf_tables n
> [  779.516470] CPU: 0 PID: 24 Comm: kworker/0:1 Not tainted
> 5.19.0-13940-g22a46254655a #3
> [  779.516476] Hardware name: IBM 8561 T01 701 (z/VM 7.2.0)
> 
> [  779.522738] Workqueue: smc_hs_wq smc_listen_work [smc]
> [  779.522755] Krnl PSW : 0704c00180000000 000003ff803da89c
> (smc_conn_create+0x174/0x968 [smc])
> [  779.522766]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0
> RI:0 EA:3
> [  779.522770] Krnl GPRS: 0000000000000002 0000000000000000 0000000000000001
> 0000000000000000
> [  779.522773]            000000008a4128a0 000003ff803f21aa 000000008e30d640
> 0000000086d72000
> [  779.522776]            0000000086d72000 000000008a412803 000000008a412800
> 000000008e30d650
> [  779.522779]            0000000080934200 0000000000000000 000003ff803cb954
> 00000380002dfa88
> [  779.522789] Krnl Code: 000003ff803da88e: e310f0e80024        stg
> %r1,232(%r15)
> [  779.522789]            000003ff803da894: a7180000            lhi %r1,0
> [  779.522789]           #000003ff803da898: 582003ac            l %r2,940
> [  779.522789]           >000003ff803da89c: ba123020            cs
> %r1,%r2,32(%r3)
> [  779.522789]            000003ff803da8a0: ec1603be007e        cij
> %r1,0,6,000003ff803db01c
> 
> [  779.522789]            000003ff803da8a6: 4110b002            la
> %r1,2(%r11)
> [  779.522789]            000003ff803da8aa: e310f0f00024        stg
> %r1,240(%r15)
> [  779.522789]            000003ff803da8b0: e310f0c00004        lg
> %r1,192(%r15)
> [  779.522870] Call Trace:
> [  779.522873]  [<000003ff803da89c>] smc_conn_create+0x174/0x968 [smc]
> [  779.522884]  [<000003ff803cb954>] smc_find_ism_v2_device_serv+0x1b4/0x300
> [smc]
> 01: HCPGSP2629I The virtual machine is placed in CP mode due to a SIGP stop
> from CPU 01.
> 01: HCPGSP2629I The virtual machine is placed in CP mode due to a SIGP stop
> from CPU 00.
> [  779.522894]  [<000003ff803cbace>] smc_listen_find_device+0x2e/0x370 [smc]
> 
> 
> I'm going to send the review for the first patch right away (which is the
> one causing the crash), so far I'm done with it. The others are going to
> follow. Maybe you can look over the problem and come up with a solution,
> otherwise we are going to decide if we want to look into it as soon as I'm
> done with the reviews. Thank you for your patience.

Thanks for pointing this issue. We will fix this soon in v2.

Tony Lu