Re: Known SMBDirect issue?

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



2023-09-19 10:57 GMT+09:00, Tom Talpey <tom@xxxxxxxxxx>:
> On 9/18/2023 7:19 PM, Namjae Jeon wrote:
>> 2023-09-19 7:38 GMT+09:00, Tom Talpey <tom@xxxxxxxxxx>:
>>> Namjae, after building a 6.6.0-rc2 kernel to test here at the IOLab,
>>> I was surprised to see the smbdirect connection break during the
>>> Connectathon "special" tests. The basic tests all work fine, but shortly
>>> after the special tests begin, I start seeing this on the server (this
>>> is with softRoCE, though I see similar failures over softiWarp):
>> I'll try to reproduce it tonight. I found no problems in testing with
>> the Windows client last week. I will have to check with cifs.ko &
>> softROCE.
>>
>
> On further testing, it appears to be triggered by listing directories
> on the share. All "ls -l /mnt/foo/..." attempts return an empty listing,
> although files exist. Steve and I get the same result with TCP and RDMA.
>
> But while trying to debug, we got a hang, with this dmesg. Any new ideas?
There is big change to support read compound. It looks a bug caused by that.
I'll take a look.

Thanks!
>
> [  673.085542] ksmbd: cli req too short, len 184 not 142. cmd:5 mid:109
> [  673.085580] BUG: kernel NULL pointer dereference, address:
> 0000000000000000
> [  673.085591] #PF: supervisor read access in kernel mode
> [  673.085600] #PF: error_code(0x0000) - not-present page
> [  673.085608] PGD 0 P4D 0
> [  673.085620] Oops: 0000 [#1] PREEMPT SMP NOPTI
> [  673.085631] CPU: 3 PID: 1039 Comm: kworker/3:0 Not tainted
> 6.6.0-rc2-tmt #16
> [  673.085643] Hardware name: AZW U59/U59, BIOS JTKT001 05/05/2022
> [  673.085651] Workqueue: ksmbd-io handle_ksmbd_work [ksmbd]
> [  673.085719] RIP: 0010:ksmbd_conn_write+0x68/0xc0 [ksmbd]
> [  673.085780] Code: 4c 89 ef e8 8a b8 0a e0 48 8b 73 38 49 8b 7c 24 50
> 44 0f b6 83 89 00 00 00 8b 53 44 48 8b 06 44 8b 8b 8c 00 00 00 41 c0 e8
> 03 <8b> 08 48 8b 47 08 41 83 e0 01 0f c9 81 e1 ff ff ff 00 48 8b 40 20
> [  673.085798] RSP: 0018:ffffc900005e3de8 EFLAGS: 00010246
> [  673.085808] RAX: 0000000000000000 RBX: ffff88811ade4f00 RCX:
> 0000000000000000
> [  673.085817] RDX: 0000000000000000 RSI: ffff88810c2a9780 RDI:
> ffff88810c2a9ac0
> [  673.085826] RBP: ffffc900005e3e00 R08: 0000000000000000 R09:
> 0000000000000000
> [  673.085834] R10: ffffffffa3168160 R11: 63203a64626d736b R12:
> ffff8881057c8800
> [  673.085842] R13: ffff8881057c8820 R14: ffff8882781b2380 R15:
> ffff8881057c8800
> [  673.085852] FS:  0000000000000000(0000) GS:ffff888278180000(0000)
> knlGS:0000000000000000
> [  673.085864] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  673.085872] CR2: 0000000000000000 CR3: 000000015b63c000 CR4:
> 0000000000350ee0
> [  673.085883] Call Trace:
> [  673.085890]  <TASK>
> [  673.085900]  ? show_regs+0x6a/0x80
> [  673.085916]  ? __die+0x25/0x70
> [  673.085926]  ? page_fault_oops+0x154/0x4b0
> [  673.085938]  ? tick_nohz_tick_stopped+0x18/0x50
> [  673.085954]  ? __irq_work_queue_local+0xba/0x140
> [  673.085967]  ? do_user_addr_fault+0x30f/0x6c0
> [  673.085979]  ? exc_page_fault+0x79/0x180
> [  673.085992]  ? asm_exc_page_fault+0x27/0x30
> [  673.086009]  ? ksmbd_conn_write+0x68/0xc0 [ksmbd]
> [  673.086067]  ? ksmbd_conn_write+0x46/0xc0 [ksmbd]
> [  673.086123]  handle_ksmbd_work+0x28d/0x4b0 [ksmbd]
> [  673.086177]  process_one_work+0x178/0x350
> [  673.086193]  ? __pfx_worker_thread+0x10/0x10
> [  673.086202]  worker_thread+0x2f3/0x420
> [  673.086210]  ? _raw_spin_unlock_irqrestore+0x27/0x50
> [  673.086222]  ? __pfx_worker_thread+0x10/0x10
> [  673.086230]  kthread+0x103/0x140
> [  673.086242]  ? __pfx_kthread+0x10/0x10
> [  673.086253]  ret_from_fork+0x39/0x60
> [  673.086263]  ? __pfx_kthread+0x10/0x10
> [  673.086274]  ret_from_fork_asm+0x1b/0x30
> [  673.086291]  </TASK>
> [  673.086296] Modules linked in: cmac nls_utf8 rpcrdma sunrpc rdma_ucm
> ib_iser libiscsi scsi_transport_iscsi rdma_rxe ip6_udp_tunnel udp_tunnel
> siw ib_uverbs ksmbd crc32_generic cifs_arc4 nls_ucs2_utils rdma_cm iw_cm
> ib_cm ib_core ccm binfmt_misc snd_sof_pci_intel_icl
> snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation
> snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci
> snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_hdac_hda
> snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi soundwire_bus
> snd_soc_core snd_compress ac97_bus x86_pkg_temp_thermal intel_powerclamp
> snd_hda_codec_hdmi nls_iso8859_1 snd_pcm_dmaengine xfs iwlmvm
> snd_usb_audio snd_hda_intel coretemp snd_intel_dspcfg intel_rapl_msr
> mei_hdcp snd_usbmidi_lib snd_intel_sdw_acpi kvm_intel mac80211
> snd_rawmidi snd_hda_codec libarc4 snd_seq_device btusb kvm btrtl
> snd_hda_core mc snd_hwdep btintel iwlwifi btbcm snd_pcm btmtk
> processor_thermal_device_pci_legacy intel_cstate
> processor_thermal_device bluetooth wmi_bmof cfg80211
> [  673.086451]  snd_timer processor_thermal_rfim processor_thermal_mbox
> processor_thermal_rapl 8250_dw snd ecdh_generic ecc mei_me soundcore
> intel_rapl_common int340x_thermal_zone mei intel_soc_dts_iosf acpi_pad
> acpi_tad mac_hid sch_fq_codel msr efi_pstore ip_tables x_tables autofs4
> btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy
> async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath
> linear hid_generic usbhid hid i915 drm_buddy i2c_algo_bit ttm
> drm_display_helper cec crct10dif_pclmul crc32_pclmul spi_pxa2xx_platform
> ghash_clmulni_intel rc_core dw_dmac sha512_ssse3 dw_dmac_core
> drm_kms_helper aesni_intel i2c_i801 crypto_simd i2c_smbus r8169 cryptd
> drm intel_lpss_pci realtek intel_lpss ahci libahci xhci_pci idma64
> xhci_pci_renesas video wmi pinctrl_jasperlake
> [  673.086699] CR2: 0000000000000000
> [  673.086708] ---[ end trace 0000000000000000 ]---
> [  673.182824] RIP: 0010:ksmbd_conn_write+0x68/0xc0 [ksmbd]
> [  673.182844] Code: 4c 89 ef e8 8a b8 0a e0 48 8b 73 38 49 8b 7c 24 50
> 44 0f b6 83 89 00 00 00 8b 53 44 48 8b 06 44 8b 8b 8c 00 00 00 41 c0 e8
> 03 <8b> 08 48 8b 47 08 41 83 e0 01 0f c9 81 e1 ff ff ff 00 48 8b 40 20
> [  673.182865] RSP: 0018:ffffc900005e3de8 EFLAGS: 00010246
> [  673.182868] RAX: 0000000000000000 RBX: ffff88811ade4f00 RCX:
> 0000000000000000
> [  673.182871] RDX: 0000000000000000 RSI: ffff88810c2a9780 RDI:
> ffff88810c2a9ac0
> [  673.182873] RBP: ffffc900005e3e00 R08: 0000000000000000 R09:
> 0000000000000000
> [  673.182875] R10: ffffffffa3168160 R11: 63203a64626d736b R12:
> ffff8881057c8800
> [  673.182878] R13: ffff8881057c8820 R14: ffff8882781b2380 R15:
> ffff8881057c8800
> [  673.182880] FS:  0000000000000000(0000) GS:ffff888278180000(0000)
> knlGS:0000000000000000
> [  673.182883] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  673.182886] CR2: 0000000000000000 CR3: 000000011295c000 CR4:
> 0000000000350ee0
> [  673.182888] note: kworker/3:0[1039] exited with irqs disabled
>
>
>> Thanks for your report!
>>>
>>> [ 1266.623187] rxe0: qp#17 do_complete: non-flush error status = 2
>>> [ 1266.623233] ksmbd: smb_direct: Recv error. status='local QP operation
>>> error (2)' opcode=0
>>> [ 1266.623605] ksmbd: smb_direct: disconnected
>>> [ 1266.623610] ksmbd: sock_read failed: -107
>>> [ 1266.628656] rxe0: qp#18 do_complete: non-flush error status = 2
>>> [ 1266.628684] ksmbd: smb_direct: Recv error. status='local QP operation
>>> error (2)' opcode=0
>>> [ 1266.628820] ksmbd: smb_direct: disconnected
>>> [ 1266.628824] ksmbd: sock_read failed: -107
>>> [ 1266.633354] rxe0: qp#19 do_complete: non-flush error status = 2
>>> [ 1266.633380] ksmbd: smb_direct: Recv error. status='local QP operation
>>> error (2)' opcode=0
>>> [ 1266.633583] ksmbd: smb_direct: disconnected
>>>
>>> The local QP error 2 is IB_WC_LOC_QP_OP_ERR, which is a buffer error
>>> of some sort, could be a receive buffer unavailable or maybe a length
>>> overrun. Both of these seem highly improbable, because the "basic" tests
>>> run fine. The client sees only a disconnection with IB_WC_REM_OP_ERR,
>>> which is expected in this case.
>>>
>>> OTOH it could be a client send issue, maybe a too-large datagram or an
>>> smbdirect credit overrun. But it's the server detecting the error, so
>>> I'm starting there for now.
>>>
>>> This worked as recently as 6.5, definitely it was all fine in 6.4. I am
>>> not yet able to drill down to the level of figuring out what SMB3
>>> payload was being received by ksmbd.
>>>
>>> Steve tells me you test over RDMA semi-often. Have you seen this?
>>> Any ideas are welcome.
>>>
>>> Tom.
>>>
>>
>



[Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux