Re: insight into a WARNING from softROCE

Moni/Yonatan?

On Fri, Dec 08, 2017 at 02:50:10PM -0500, Olga Kornievskaia wrote:
> Hi folks,
>
> Can somebody give me some insight into the following WARNING (at the
> end of the message) that I see logged in /var/log/messages while using
> softROCE (NFSoRDMA)? It is typically associated with a hiccup in
> communication (long delays) that I see happening over RDMA.
>
> It's coming from the WARN here in rxe_comp.c:
>
>                 case COMPST_ERROR:
>                         WARN_ON_ONCE(wqe->status == IB_WC_SUCCESS);
>                         do_complete(qp, wqe);
>                         rxe_qp_error(qp);
>
>                         if (pkt) {
>                                 rxe_drop_ref(pkt->qp);
>
> With a few printks I tracked it down: COMPST_ERROR is coming from the
> "retry counter exceeded" case (RXE_CNT_RETRY_EXCEEDED) in
> COMPST_ERROR_RETRY, and COMPST_ERROR_RETRY is coming from check_psn().
> I see that the packet psn is greater than the wqe psn. I have noticed
> that this can happen (but not always) after update_wqe_psn() has some
> number of packets larger than 1 left to send.
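
On the "packet psn is greater than the wqe psn" observation: PSNs are
24-bit values that wrap around, so "greater than" has to be evaluated
modulo 2^24 rather than as a plain integer comparison. Below is a minimal
user-space sketch of such a comparison. It assumes the same shift-and-sign
trick that the driver's psn_compare() helper is understood to use, and it
has not been checked against this 3.10-based tree. PSN_MASK,
psn_check_examples() and the sample values are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* PSNs occupy the low 24 bits (illustrative constant, not from the thread). */
#define PSN_MASK 0x00ffffffu

/*
 * Compare two 24-bit PSNs with wraparound.
 * Returns > 0 if psn_a is "after" psn_b, < 0 if it is "before",
 * and 0 if they are equal. Shifting the 24-bit difference into the
 * top of a 32-bit word lets the sign bit express the ordering.
 */
static inline int32_t psn_compare(uint32_t psn_a, uint32_t psn_b)
{
	uint32_t diff = ((psn_a & PSN_MASK) - (psn_b & PSN_MASK)) << 8;

	return (int32_t)diff;
}

/* Hypothetical self-check with made-up PSN values. */
static inline void psn_check_examples(void)
{
	assert(psn_compare(5, 3) > 0);               /* plain ordering */
	assert(psn_compare(3, 5) < 0);
	assert(psn_compare(7, 7) == 0);
	assert(psn_compare(0x000001, 0xffffff) > 0); /* 1 follows 0xffffff after wrap */
}
```

If the psn values printed from check_psn() sit close to the 0xffffff
boundary, comparing them by eye as plain integers can give the opposite
answer from a wraparound-aware comparison like the one above.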
>
> The goal is to figure out why the hiccups are happening, and I think
> this is a clue.
>
> Thank you for any info.
>
> Dec  5 16:42:16 localhost kernel: ------------[ cut here ]------------
> Dec  5 16:42:16 localhost kernel: WARNING: CPU: 0 PID: 0 at
> drivers/infiniband/sw/rxe/rxe_comp.c:741 rxe_completer+0xd84/0xe30
> [rdma_rxe]
> Dec  5 16:42:16 localhost kernel: Modules linked in: rpcrdma ib_ucm
> ib_umad rdma_rxe ip6_udp_tunnel udp_tunnel rdma_ucm rdma_cm iw_cm
> ib_cm ib_uverbs ib_core rfcomm fuse ip6t_rpfilter ipt_REJECT
> nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat
> ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6
> nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security
> ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4
> nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw
> ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bnep
> snd_seq_midi snd_seq_midi_event coretemp crc32_pclmul ext4
> ghash_clmulni_intel mbcache jbd2 aesni_intel snd_ens1371
> snd_ac97_codec glue_helper ppdev lrw ac97_bus snd_seq gf128mul
> uvcvideo ablk_helper cryptd vmw_balloon videobuf2_vmalloc
> videobuf2_memops
> Dec  5 16:42:16 localhost kernel: btusb snd_pcm videobuf2_core pcspkr
> btrtl videodev btbcm btintel snd_timer snd_rawmidi bluetooth
> snd_seq_device snd vmw_vmci rfkill shpchp i2c_piix4 soundcore
> parport_pc parport nfsd auth_rpcgss nfs_acl lockd grace sunrpc
> ip_tables xfs libcrc32c sr_mod cdrom vmwgfx sd_mod crc_t10dif
> crct10dif_generic drm_kms_helper ata_generic syscopyarea sysfillrect
> sysimgblt fb_sys_fops ttm drm pata_acpi crct10dif_pclmul ahci
> crct10dif_common mptspi crc32c_intel libahci scsi_transport_spi
> mptscsih serio_raw ata_piix libata mptbase e1000 i2c_core dm_mirror
> dm_region_hash dm_log dm_mod
> Dec  5 16:42:16 localhost kernel: CPU: 0 PID: 0 Comm: swapper/0 Not
> tainted 3.10.0 #2
> Dec  5 16:42:16 localhost kernel: Hardware name: VMware, Inc. VMware
> Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00
> 07/02/2015
> Dec  5 16:42:16 localhost kernel: Call Trace:
> Dec  5 16:42:16 localhost kernel: <IRQ>  [<ffffffff94cb9865>]
> dump_stack+0x19/0x1b
> Dec  5 16:42:16 localhost kernel: [<ffffffff94686968>] __warn+0xd8/0x100
> Dec  5 16:42:16 localhost kernel: [<ffffffff94686aad>]
> warn_slowpath_null+0x1d/0x20
> Dec  5 16:42:16 localhost kernel: [<ffffffffc0a73734>]
> rxe_completer+0xd84/0xe30 [rdma_rxe]
> Dec  5 16:42:16 localhost kernel: [<ffffffffc0a7f24f>]
> rxe_do_task+0x9f/0x110 [rdma_rxe]
> Dec  5 16:42:16 localhost kernel: [<ffffffffc0a7f3b8>]
> rxe_run_task+0x18/0x40 [rdma_rxe]
> Dec  5 16:42:16 localhost kernel: [<ffffffffc0a729a5>]
> rxe_comp_queue_pkt+0x45/0x50 [rdma_rxe]
> Dec  5 16:42:16 localhost kernel: [<ffffffffc0a77bf8>]
> rxe_rcv+0x2a8/0x920 [rdma_rxe]
> Dec  5 16:42:16 localhost kernel: [<ffffffffc03bcc5f>] ?
> ipt_do_table+0x31f/0x4f0 [ip_tables]
> Dec  5 16:42:16 localhost kernel: [<ffffffffc0a7fa10>] ?
> net_to_rxe+0x80/0x80 [rdma_rxe]
> Dec  5 16:42:16 localhost kernel: [<ffffffffc0a7fa73>]
> rxe_udp_encap_recv+0x63/0xa0 [rdma_rxe]
> Dec  5 16:42:16 localhost kernel: [<ffffffffc0a7fa73>] ?
> rxe_udp_encap_recv+0x63/0xa0 [rdma_rxe]
> Dec  5 16:42:16 localhost kernel: [<ffffffff94c148bb>]
> udp_queue_rcv_skb+0x1bb/0x4a0
> Dec  5 16:42:16 localhost kernel: [<ffffffff94c15108>]
> __udp4_lib_rcv+0x568/0xb90
> Dec  5 16:42:16 localhost kernel: [<ffffffffc09281de>] ?
> ipv4_confirm+0x4e/0x100 [nf_conntrack_ipv4]
> Dec  5 16:42:16 localhost kernel: [<ffffffff94c15b9a>] udp_rcv+0x1a/0x20
> Dec  5 16:42:16 localhost kernel: [<ffffffff94be30ce>]
> ip_local_deliver_finish+0x8e/0x1d0
> Dec  5 16:42:16 localhost kernel: [<ffffffff94be33b9>]
> ip_local_deliver+0x59/0xd0
> Dec  5 16:42:16 localhost kernel: [<ffffffff94be3040>] ?
> ip_rcv_finish+0x300/0x300
> Dec  5 16:42:16 localhost kernel: [<ffffffff94be2db8>] ip_rcv_finish+0x78/0x300
> Dec  5 16:42:16 localhost kernel: [<ffffffff94be36e6>] ip_rcv+0x2b6/0x410
> Dec  5 16:42:16 localhost kernel: [<ffffffff94be2d40>] ?
> inet_del_offload+0x40/0x40
> Dec  5 16:42:16 localhost kernel: [<ffffffff94b9f9d4>]
> __netif_receive_skb_core+0x2e4/0x820
> Dec  5 16:42:16 localhost kernel: [<ffffffff94b9ff28>]
> __netif_receive_skb+0x18/0x60
> Dec  5 16:42:16 localhost kernel: [<ffffffff94b9ffb0>]
> netif_receive_skb_internal+0x40/0xc0
> Dec  5 16:42:16 localhost kernel: [<ffffffff94ba0b58>]
> napi_gro_receive+0xd8/0x100
> Dec  5 16:42:16 localhost kernel: [<ffffffffc01f33e8>]
> e1000_clean_rx_irq+0x2b8/0x510 [e1000]
> Dec  5 16:42:16 localhost kernel: [<ffffffffc01f4078>]
> e1000_clean+0x278/0x8d0 [e1000]
> Dec  5 16:42:16 localhost kernel: [<ffffffff94ba0483>] net_rx_action+0x123/0x320
> Dec  5 16:42:16 localhost kernel: [<ffffffff9468fb4f>] __do_softirq+0xef/0x280
> Dec  5 16:42:16 localhost kernel: [<ffffffff94ccc51c>] call_softirq+0x1c/0x30
> Dec  5 16:42:16 localhost kernel: [<ffffffff9462c4c5>] do_softirq+0x65/0xa0
> Dec  5 16:42:16 localhost kernel: [<ffffffff9468fed5>] irq_exit+0x105/0x110
> Dec  5 16:42:16 localhost kernel: [<ffffffff94ccd036>] do_IRQ+0x56/0xe0
> Dec  5 16:42:16 localhost kernel: [<ffffffff94cc1b2d>]
> common_interrupt+0x6d/0x6d
> Dec  5 16:42:16 localhost kernel: <EOI>  [<ffffffff94cc0dd6>] ?
> native_safe_halt+0x6/0x10
> Dec  5 16:42:16 localhost kernel: [<ffffffff94cc0c6e>] ? default_idle+0x1e/0xc0
> Dec  5 16:42:16 localhost kernel: [<ffffffff94633f86>] ? arch_cpu_idle+0x26/0x30
> Dec  5 16:42:16 localhost kernel: [<ffffffff946e6efa>] ?
> cpu_startup_entry+0x14a/0x1c0
> Dec  5 16:42:16 localhost kernel: [<ffffffff94ca8e17>] ? rest_init+0x77/0x80
> Dec  5 16:42:16 localhost kernel: [<ffffffff9516c05a>] ?
> start_kernel+0x433/0x454
> Dec  5 16:42:16 localhost kernel: [<ffffffff9516ba30>] ?
> repair_env_string+0x5c/0x5c
> Dec  5 16:42:16 localhost kernel: [<ffffffff9516b120>] ?
> early_idt_handler_array+0x120/0x120
> Dec  5 16:42:16 localhost kernel: [<ffffffff9516b5ef>] ?
> x86_64_start_reservations+0x24/0x26
> Dec  5 16:42:16 localhost kernel: [<ffffffff9516b740>] ?
> x86_64_start_kernel+0x14f/0x172
> Dec  5 16:42:16 localhost kernel: [<ffffffff946001a5>] ? start_cpu+0x5/0x14
> Dec  5 16:42:16 localhost kernel: ---[ end trace c96ed928ed9503ca ]---
