Hi Roman,

Thanks a lot for your time.

I'm trying to use msg/async/rdma/iwarp with an X722 network interface card
on Ceph v14.2.0. Below is my previous configuration:

diff --git a/src/vstart.sh b/src/vstart.sh
index 22ca3c6318..bbae18c4ef 100755
--- a/src/vstart.sh
+++ b/src/vstart.sh
@@ -518,7 +518,21 @@ ms bind msgr1 = true
         enable experimental unrecoverable data corrupting features = *
         osd_crush_chooseleaf_type = 0
         debug asok assert abort = true
-$msgr_conf
+
+        ms bind msgr2 = false
+        ms bind msgr1 = true
+;tell the ceph use the AsyncMessenger + RDMA as your message type
+        ms_type = async+rdma
+        ms_async_rdma_device_name = i40iw1
+        ms_async_rdma_type = iwarp
+        ms_async_rdma_support_srq = false
+        ms_async_rdma_cm = true
+;;      ms_async_rdma_port_num = 1
+;       ms_async_rdma_send_buffers = 1024
+;       ms_async_rdma_receive_buffers = 16384
+;       ms_async_rdma_receive_queue_len = 1024
+;;      ms_async_rdma_buffer_size = 4096
+
 $extra_conf
 EOF
 if [ "$lockdep" -eq 1 ] ; then

I also followed your suggestion and added the configuration below at the
same time. However, I still hit the same problem.

    ms_type = async+rdma
    ms_cluster_type = async+rdma

Previously, msg/async/rdma/iwarp (X722) worked well with the iWARP patches
without PR 20172. However, it doesn't work on the master branch now.
You can find more information in https://tracker.ceph.com/issues/39238.

Thanks for your explanation of the credentials problem. I'll follow your
trace and check it with the X722 NIC.
Please let me know if you have any ideas for debugging it.

Thanks a lot for your help.

Regards,
Changcheng

On 12:57 Mon 15 Apr, Roman Penyaev wrote:
> On 2019-04-12 12:42, Liu, Changcheng wrote:
> > Hi all,
> >    I'm enabling Ceph/RDMA (iWARP) in Ceph v14.2.0.
> >    It always hits a segmentation fault when querying RDMA devices,
> >    after querying them successfully several times.
> >
> >    I traced the live kernel and found the problem in the function
> >    ib_uverbs_write:
> >    1. ib_safe_file_access(filp) is false, so ib_uverbs_write
> >       returns -EACCES.
> >    2. filp->f_cred == current_cred() is false, so
> >       ib_safe_file_access returns false.
> >
> >    Could anyone suggest how to check further why filp->f_cred is
> >    not equal to current_cred()?
> >
> >    Below is the kernel code and the traced log.
> >
> >    file: drivers/infiniband/core/uverbs_main.c
> >    static ssize_t ib_uverbs_write(struct file *filp, const char __user *buf,
> >                                   size_t count, loff_t *pos)
> >    {
> >    +----  9 lines: struct ib_uverbs_file *file = filp->private_data;-----
> >            if (!ib_safe_file_access(filp)) {
> >                    pr_err_once("uverbs_write: process %d (%s) changed security contexts after opening file descriptor, this is not allowed.\n",
> >                                task_tgid_vnr(current), current->comm);
> >                    return -EACCES;
> >            }
> >    +--- 74 lines: if (count < sizeof(hdr))-------------------------------
> >    }
> >
> >    file: kernel/include/rdma/ib.h
> >    /*
> >     * The IB interfaces that use write() as bi-directional ioctl() are
> >     * fundamentally unsafe, since there are lots of ways to trigger "write()"
> >     * calls from various contexts with elevated privileges. That includes the
> >     * traditional suid executable error message writes, but also various kernel
> >     * interfaces that can write to file descriptors.
> >     *
> >     * This function provides protection for the legacy API by restricting the
> >     * calling context.
> >     */
> >    static inline bool ib_safe_file_access(struct file *filp)
> >    {
> >            return filp->f_cred == current_cred() && !uaccess_kernel();
> >    }
> >
> >    Kernel trace log:
> >    root@nstcloudcc1:/sys/kernel/debug/tracing# cat /sys/kernel/debug/tracing/trace
> >    # tracer: nop
> >    #
> >    #                              _-----=> irqs-off
> >    #                             / _----=> need-resched
> >    #                            | / _---=> hardirq/softirq
> >    #                            || / _--=> preempt-depth
> >    #                            ||| /     delay
> >    #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
> >    #              | |       |   ||||       |         |
> >               <...>-87018 [003] .... 15409.847504: rdma_verb_fs: (ib_uverbs_write+0x3c/0x3d0 [ib_uverbs]) filp_f_cred=0xffff8906bd855b00 current_cred=0xffff8906ad773500 get_fs=0xffffffffffffffff
> >               <...>-87018 [003] d... 15409.847510: rdma_ib_verb: (__vfs_write+0x1b/0x40 <- ib_uverbs_write) ret=0xfffffffffffffff3 t_name="msgr-worker-0"
>
> Hi Liu,
>
> Let me guess: you are trying to start the whole cluster with the
> "ms_type = async+rdma" option set? If yes, then setting
> "ms_cluster_type = async+rdma" should help.
>
> Returning to your question about the changed credentials. The problem
> you hit is in the order of RDMA initialization, namely opening
> "/dev/infiniband/uverbs0" and then calling setuid(), which changes the
> current->cred pointer inside the kernel (see the commit_creds() call).
>
> Here is the strace output, where it is clear that uverbs0 is opened
> first, then setuid() is called, and then write() fails:
>
> 4050  openat(AT_FDCWD, "/dev/infiniband/uverbs0", O_RDWR|O_CLOEXEC) = 16
> ...
> 4050  setuid(167 <unfinished ...>
> 4050  <... setuid resumed> ) = 0
> ...
> 4050  write(16, "\30\0\0\0\32\0\20\0\300W*\307\221\177\0\0\300R\346bbU\0\0\0\0\0\0\4\0\0\0"..., 104 <unfinished ...>
> 4050  <... write resumed> ) = -1 EACCES (Permission denied)
>
> The backtraces are the following (in the order we hit them).
>
> Init RDMA connection:
>
> #0  0x00007fffef53bbb0 in AsyncConnection::AsyncConnection(CephContext*, AsyncMessenger*, DispatchQueue*, Worker*, bool, bool) () from /usr/lib64/ceph/libceph-common.so.0
> #1  0x00007fffef545c49 in AsyncMessenger::create_connect(entity_addrvec_t const&, int) () from /usr/lib64/ceph/libceph-common.so.0
> #2  0x00007fffef5466ae in AsyncMessenger::connect_to(int, entity_addrvec_t const&) () from /usr/lib64/ceph/libceph-common.so.0
> #3  0x00007fffef5e85af in MonClient::_add_conn(unsigned int, unsigned long) () from /usr/lib64/ceph/libceph-common.so.0
> #4  0x00007fffef5e8ce3 in MonClient::_add_conns(unsigned long) () from /usr/lib64/ceph/libceph-common.so.0
> #5  0x00007fffef5ee1bf in MonClient::_reopen_session(int) () from /usr/lib64/ceph/libceph-common.so.0
> #6  0x00007fffef5efe15 in MonClient::authenticate(double) () from /usr/lib64/ceph/libceph-common.so.0
> #7  0x00007fffef5f0736 in MonClient::get_monmap_and_config() () from /usr/lib64/ceph/libceph-common.so.0
> #8  0x00005555559a68f5 in global_pre_init()
> ...
> #10 0x0000555555663c11 in main ()
>
> and init RDMA:
>
> Thread 4 "msgr-worker-1" hit Breakpoint 7, 0x00007fffeeb727e0 in open64 () from /lib64/libpthread.so.0
> $67 = 0x555557240240 "/dev/infiniband/uverbs1"
> #0  0x00007fffeeb727e0 in open64 () from /lib64/libpthread.so.0
> #1  0x00007fffec80edaa in verbs_open_device () from /usr/lib64/libibverbs.so.1
> #2  0x00007fffef59d940 in Device::Device(CephContext*, ibv_device*, ibv_context*) () from /usr/lib64/ceph/libceph-common.so.0
> #3  0x00007fffef5a1517 in Infiniband::init() () from /usr/lib64/ceph/libceph-common.so.0
> #4  0x00007fffef5b595a in RDMAWorker::connect(entity_addr_t const&, SocketOptions const&, ConnectedSocket*) () from /usr/lib64/ceph/libceph-common.so.0
> #5  0x00007fffef53f677 in AsyncConnection::process() () from /usr/lib64/ceph/libceph-common.so.0
> #6  0x00007fffef591847 in EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*) () from /usr/lib64/ceph/libceph-common.so.0
> #7  0x00007fffef595c88 in ?? () from /usr/lib64/ceph/libceph-common.so.0
> #8  0x00007fffee69638f in ?? () from /usr/lib64/libstdc++.so.6
> #9  0x00007fffeeb68569 in start_thread () from /lib64/libpthread.so.0
> #10 0x00007fffeddc19af in clone () from /lib64/libc.so.6
>
> and only then setuid() is called:
>
> #0  0x00007fffedd90030 in setuid () from /lib64/libc.so.6
> #1  0x00005555559a7844 in global_init() ()
> #2  0x0000555555663c11 in main ()
>
> It seems the proper solution should be to start mon connections after
> setuid() is invoked.
>
> Also, according to the code (global_init.cc::global_pre_init()), a
> simple workaround can be to use the --no-mon-config option: then no
> monitor connection is created inside global_pre_init() under the
> "if (!conf->no_mon_config)" path. But I doubt this is a good way, it
> is just a workaround.
>
> --
> Roman
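
For illustration only, the ordering issue Roman describes can be reproduced in a small user-space toy model in C. This is a sketch under stated assumptions, not kernel or Ceph code: the names toy_cred, toy_file, toy_task, toy_open, toy_setuid and toy_safe_file_access are all hypothetical, and the model keeps only the pointer comparison from ib_safe_file_access() (the uaccess_kernel() part is left out). It assumes, as described above, that setuid() switches the task to a different credential object, so a file opened earlier still points at the old one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct cred; only its identity matters. */
struct toy_cred { unsigned int uid; };

/* Hypothetical stand-in for struct file: remembers the opener's credentials. */
struct toy_file { const struct toy_cred *f_cred; };

/* Hypothetical stand-in for a task: credentials are referenced by pointer. */
struct toy_task { const struct toy_cred *cred; };

/* Models open(): the file captures the credential pointer of the opener. */
struct toy_file toy_open(const struct toy_task *task)
{
    assert(task != NULL && task->cred != NULL);
    struct toy_file f = { .f_cred = task->cred };
    return f;
}

/* Models setuid(): the task is switched to a different credential object,
 * which is the effect attributed to commit_creds() in the mail above. */
void toy_setuid(struct toy_task *task, const struct toy_cred *new_cred)
{
    task->cred = new_cred;
}

/* Pointer comparison only, like the first half of ib_safe_file_access(). */
bool toy_safe_file_access(const struct toy_file *f, const struct toy_task *task)
{
    return f->f_cred == task->cred;
}

/* Device opened first, credentials changed afterwards (the failing order). */
bool open_before_setuid_passes(void)
{
    struct toy_cred root = { .uid = 0 }, ceph = { .uid = 167 };
    struct toy_task task = { .cred = &root };
    struct toy_file f = toy_open(&task);
    toy_setuid(&task, &ceph);
    return toy_safe_file_access(&f, &task);
}

/* Credentials changed first, device opened afterwards (the proposed order). */
bool open_after_setuid_passes(void)
{
    struct toy_cred root = { .uid = 0 }, ceph = { .uid = 167 };
    struct toy_task task = { .cred = &root };
    toy_setuid(&task, &ceph);
    struct toy_file f = toy_open(&task);
    return toy_safe_file_access(&f, &task);
}
```

In this model open_before_setuid_passes() returns false and open_after_setuid_passes() returns true, which mirrors the strace sequence above (openat, setuid(167), then write failing with EACCES) and the suggestion to start mon connections only after setuid() has been invoked.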