Re: [ceph-users] Help needed porting Ceph to RSockets

Hi Matthew,

I found a workaround for my (our) problem: in the librdmacm
code, rsocket.c, there is a global constant polling_time, which
is set to 10 microseconds at the moment.

I raised this to 10000 - and all of a sudden things work nicely.
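For reference, the workaround amounts to something like the following in librdmacm's rsocket.c (the exact declaration may differ between librdmacm versions; the name and the values 10 and 10000 are taken from the observation above, and 10000 is an empirically chosen value, not a recommended default):

```c
/* rsocket.c (sketch) -- the busy-poll budget used by rpoll().
 * The actual definition in your librdmacm version may differ. */
static int polling_time = 10000;   /* was 10 (microseconds) */
```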

I think we are looking at two issues here:
1. the thread structure of ceph messenger
   For a given socket connection, there are 3 threads of interest
   here: the main messenger thread, the Pipe::reader and the
   Pipe::writer.

   For a ceph client like the ceph admin command, I see the following
   sequence
     - the connection to the ceph monitor is created by the
       main messenger thread, and the Pipe::reader and
       Pipe::writer are instantiated
     - the requested command is sent to the ceph monitor, and the
       answer is read and printed
     - at this point the Pipe::reader has already called
       tcp_read_wait(), polling for more data or connection termination
     - after the response has been printed, the main loop calls the
       shutdown routines, which in turn shutdown() the socket

    There is some time between the last two steps - and this gap is
    long enough to open a race:

2. rpoll, ibv and poll
   the rpoll implementation in rsockets is split into 2 phases:
   - a busy loop which checks the state of the underlying ibv queue pair
   - the call to real poll() system call (i.e. the uverbs(?)
     implementation of poll() inside the kernel)

   The busy loop has a maximum duration of polling_time (10 microseconds
   by default) - and is able to detect the local shutdown and return
   POLLHUP.

   The poll() system call (i.e. the uverbs implementation of poll() 
   in the kernel) does not detect the local shutdown - and only returns
   after the caller supplied timeout expires.

Increasing the rsockets polling_time from 10 to 10000 microseconds
results in rpoll detecting the local shutdown within the busy loop.

Decreasing the ceph "ms tcp read timeout" from the default of 900 to 5
seconds serves a similar purpose, but is much coarser.
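The Ceph-side knob mentioned above would be set in ceph.conf, e.g. (section placement under [global] is an assumption; the option name and values are from this report):

```ini
# ceph.conf -- shorten the messenger read timeout (default 900 s)
[global]
ms tcp read timeout = 5
```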

