RE: crash in librdmacm

> we got a bug report from a customer that FhGFS server daemons frequently
> crash. Looking into it, we are almost sure it is related to
> librdmacm/kernel-ib. So the crash in librdmacm resolves to:
> 
> > (gdb) l *(rdma_get_cm_event+0x102)
> > 0x3d6dc04522 is in rdma_get_cm_event (src/cma.c:1975).
> > 1970                    break;
> > 1971            default:
> > 1972                    evt->id_priv = (void *) (uintptr_t) resp.uid;
> > 1973                    evt->event.id = &evt->id_priv->id;
> > 1974                    evt->event.status = resp.status;
> > 1975                    if (ucma_is_ud_qp(evt->id_priv->id.qp_type))
> > 1976                            ucma_copy_ud_event(evt, &resp.param.ud);
> > 1977                    else
> > 1978                            ucma_copy_conn_event(evt, &resp.param.conn);
> > 1979                    break;
> 
> 
> Now the complete shows that the issue comes up directly after initiating
> a connection and we are just querying the file descriptor for events. So
> from our point of view there is not much we can do about it.
> 
> Possibly this is already fixed by commit
> 418edaaba96e58112b15c82b4907084e2a9caf42 and this commit is also not yet
> in the latest RHEL kernel being used on the customer system.
> However, the commit message states it is for RDMA_CM_EVENT_ESTABLISHED
> only, while the resolved address above points to another event. Does the
> commit really only fix this event or would it also fix further events?

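The above commit fixes an issue where the kernel can return a uid of 0.  In rdma_get_cm_event() that uid is cast to evt->id_priv, so a uid of 0 yields a NULL id_priv, which is then dereferenced at cma.c:1975.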

A call to resolve an address should not have this issue.  The issue is limited to connections formed from a listen request.  The reason for this is that passive side connections create the kernel id first, which is then linked up with the user space id.  Active side processing creates the user space id, then the kernel id.
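
To illustrate why a uid of 0 is fatal in the default branch quoted above, here is a minimal sketch.  It is not the librdmacm code and not the fix from the commit: the model_* structures and functions are hypothetical stand-ins, and the NULL check only shows where the unguarded dereference would fault.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the librdmacm structures; not the real layout. */
struct model_id {
	int qp_type;
};

struct model_id_priv {
	struct model_id id;
};

/* Only the fields of the kernel response that matter for this sketch. */
struct model_resp {
	uint64_t uid;	/* user space cookie echoed back by the kernel */
	int status;
};

/*
 * Turn the uid cookie back into a pointer, the same cast as at cma.c:1972.
 * A uid of 0 becomes a NULL pointer.
 */
static struct model_id_priv *model_uid_to_priv(const struct model_resp *resp)
{
	return (struct model_id_priv *) (uintptr_t) resp->uid;
}

/*
 * Returns 0 and stores the qp_type when the event carries a user space id.
 * Returns -1 when uid is 0, i.e. the kernel id was not yet linked to a
 * user space id; the code at cma.c:1975 would dereference NULL here.
 */
static int model_event_qp_type(const struct model_resp *resp, int *qp_type)
{
	struct model_id_priv *priv;

	assert(resp != NULL);
	assert(qp_type != NULL);

	priv = model_uid_to_priv(resp);
	if (priv == NULL)
		return -1;

	*qp_type = priv->id.qp_type;
	return 0;
}
```

On the active side the user space id exists before the kernel id is created, so the cookie is never 0 there; the -1 path in this sketch corresponds only to the passive (listen) case described above.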

> While looking where file->mut is being set, I notice that in
> ucma_create_id() the call of ucma_alloc_ctx() is protected by a mutex,
> but the mutex is immediately given up after that call. In
> ucma_alloc_ctx() the ctx is already added to file->ctx_list, but the
> ctx->uid and ctx->cm_id are then modified outside a mutex in
> ucma_create_id(). Shouldn't the mutex_unlock() be done after assigning
> those values?

It shouldn't matter, as long as the uid and cm_id are set before any events can be generated.  Events are not generated on newly created ids.
 
- Sean


