Re: Fork and RDMA operations

On Sat, Aug 13, 2016 at 12:27 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Fri, 12 Aug 2016, Haomai Wang wrote:
>> Hi Vu,
>>
>> Actually, you and I have run into the same problem. In my async backend
>> PR (https://github.com/ceph/ceph/pull/10264), the
>> commit (https://github.com/ceph/ceph/pull/10264/commits/7055bafbcbf06425e71808cb2b089d1d04706728)
>> defines an interface for prefork/postfork hooks. It's much like
>> global_init_prefork_start/global_init_postfork_start, but it's a
>> generic interface.
>>
>> Refer to Kefu's comment on why we need this:
>> =============
>> note to myself, w.r.t. the before/after daemonize hook
>>
>> 1. it's a natural way to do bind/rebind in the event thread
>> 2. we do bind before daemon(2) now
>> 3. the child process after daemon(2) is a single-threaded process, and
>> all event threads are terminated, so no thread is taking care of
>> bind/rebind after daemon(2),
>>
>> that's why we need to re-spawn the threads after daemon(2).
>> =============
>>
>> So let's resolve similar problems this way.
>
> It seems like it would be simpler to push the fork before any important
> operations.  (And BTW with systemd and upstart we don't fork anyway; it's
> just there for sysvinit.)  The preforker thing is there to make it easy to
> fork early, but keep the parent waiting around so that you can do more
> initialization, print errors, and terminate with an error code if something
> (post-fork) goes wrong.  In theory, there's no reason why we couldn't make
> this almost the very first thing the daemon does so that *all* work is
> done in the child...

Yes, but I think it would be a big task right now. A lot of work still
needs to be done for that.
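
For reference, the fork-early pattern described above (fork first, keep
the parent waiting, pass the exit code back) can be sketched roughly as
below. This is only an illustration under assumptions: fork_then_init is
a hypothetical helper, not Ceph's Preforker class, and it uses a pipe for
brevity where the thread mentions a socket.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cassert>
#include <cerrno>
#include <functional>

// Hypothetical helper (not Ceph's Preforker): fork first, run `init` in the
// child, and hand its result back to the waiting parent over a pipe.
// Returns, in the parent, init's status (0-255), or -1 if the fork failed
// or the child died before reporting.
int fork_then_init(const std::function<int()>& init) {
  assert(init);  // an empty callback is a programming error

  int fds[2];
  if (pipe(fds) != 0)
    return -1;

  pid_t pid = fork();
  if (pid < 0) {
    close(fds[0]);
    close(fds[1]);
    return -1;
  }

  if (pid == 0) {
    // Child: anything fork-sensitive (threads, RDMA memory registration,
    // messengers) would be created here, after the fork.
    close(fds[0]);
    unsigned char status = static_cast<unsigned char>(init());
    ssize_t n = write(fds[1], &status, 1);
    close(fds[1]);
    // A real daemon would carry on into its main loop when status == 0;
    // this sketch simply exits.
    _exit(n == 1 ? status : 255);
  }

  // Parent: wait for the child's verdict so the caller sees a real exit
  // code instead of a daemon that silently disappears.
  close(fds[1]);
  unsigned char status = 0;
  ssize_t n;
  do {
    n = read(fds[0], &status, 1);
  } while (n < 0 && errno == EINTR);
  close(fds[0]);

  int wstatus = 0;
  waitpid(pid, &wstatus, 0);  // reap the child; the sketch always exits

  return n == 1 ? status : -1;  // n == 0 means the child died silently
}
```

A caller would pass a callback that creates the messengers; the parent
then gets that callback's status, or -1 if the child died before
reporting anything.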

>
> sage
>
>
>>
>>
>> On Fri, Aug 12, 2016 at 9:22 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>> > On Thu, 11 Aug 2016, Vu Pham wrote:
>> >> Hello all,
>> >>
>> >> The background:
>> >> We have tested scaling with xio messenger and faced multiple *unknown*
>> >> problems (hard to trace and reproduce). We recently found out that
>> >> daemonize/fork is not fully supported in ibverbs. It assumes that the
>> >> parent process will do the RDMA operations. Any child process that tries
>> >> to do RDMA operations will experience various unexpected problems.
>> >>
>> >> ceph-osd/ceph-mon/ceph-mds daemonize (fork) after creating messengers.
>> >> Xio messenger initializes the accelio library and registers RDMA memory
>> >> in the first call to the XioMessenger constructor.
>> >> This is very problematic because the child process then does the RDMA
>> >> operations, as described above.
>> >>
>> >> http://www.rdmamojo.com/2012/05/24/ibv_fork_init
>> >> http://www.spinics.net/lists/linux-rdma/msg03364.html
>> >> I created this PR, which forces the daemonize/fork to happen before
>> >> creating the messenger:
>> >>
>> >> https://github.com/ceph/ceph/pull/10600
>> >>
>> >> I have tested this patch by bringing up a cluster with 4 nodes, 8
>> >> osds/node, and two monitors, and running I/O (4K - 4M block sizes) from
>> >> 4 fio clients.
>> >>
>> >> Is there any known problem with daemonizing/forking before creating the
>> >> messenger? Could you help review and provide feedback?
>> >
>> > This more or less works.  The main issue is that we don't catch errors as
>> > early as we did and a daemon may appear to start and then immediately
>> > exit without printing an error.
>> >
>> > There is a Preforker class in common that is meant to address this (it's
>> > used by ceph-fuse and ceph-mon already).  It does the fork early, when
>> > prefork() is called, and keeps stdout/stderr open for an interim period
>> > until you call preforker.exit() or .daemonize().  Any exit code gets
>> > passed back to the parent over a socket.  I'm guessing that the mon is
>> > already working fine since you're just moving the prefork.daemonize() line
>> > around (the actual fork happened way back at
>> >
>> >         https://github.com/vuhuong/ceph-upstream/blob/5cd5546fb7ddb9ef69380476b0a80038ba74a405/src/ceph_mon.cc#L500
>> >
>> > ) and you just need to make ceph_osd.cc and ceph_mds.cc use Preforker in a
>> > similar way.
>> >
>> > sage
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> > the body of a message to majordomo@xxxxxxxxxxxxxxx
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html