On Thu, Aug 25, 2016 at 10:43 PM, Casey Bodley <cbodley@xxxxxxxxxx> wrote: > > > On 08/24/2016 05:32 PM, Sage Weil wrote: >> >> On Wed, 24 Aug 2016, Casey Bodley wrote: >>> >>> On 08/12/2016 12:27 PM, Sage Weil wrote: >>>> >>>> It seems like it would be simpler to push the fork before any important >>>> operations. (And BTW with systemd and upstart we don't fork anyway; >>>> it's >>>> just there for sysvinit.) The preforker thing is there to make it easy >>>> to >>>> fork early, but keep the parent waiting around so that you can do more >>>> intialization, print errors, and terminate with an error code if >>>> something >>>> (post-fork) goes wrong. In theory, there's no reason why we couldn't >>>> make >>>> this almost the very first thing the daemon does so that *all* work is >>>> done in the child... >>>> >>>> sage >>>> >>> I've recently run into issues related to fork as well (see my "memory >>> leaks >>> related to CephContext and global_init_daemonize()" email). Trying to >>> manage >>> resources across a fork is difficult and error-prone, so changing how we >>> daemonize could eliminate a whole class of these bugs. And as Haomai >>> points >>> out, we're going to great lengths to make things work in the current >>> model. >>> >>> I'm a big fan of Sage's theory that all work could be done in the child >>> process, and I'm willing to take on the project if we can reach a >>> consensus on >>> the design. >>> >>> Whether or not we're interested in long-term support for SysV, the >>> ability to >>> daemonize is useful our for development workflow (vstart.sh in >>> particular). To >>> fill this role, some basic requirements for the parent process are: >>> * don't exit until initialization is finished >>> * return an error code if initialization failed >>> >>> As Sage pointed out, this is exactly what Preforker is doing for >>> ceph-mon. So >>> we can start by changing the other daemons to use that instead of >>> global_init_daemonize(). >>> >>> The next step is to prevent global_init()/common_init()/CephContext from >>> doing >>> any work in the parent process (esp. spawning threads for Log, >>> CephContextServiceThread, and AdminSocket). Decoupling the config parsing >>> from >>> CephContext initialization seems like a natural way to accomplish that. >>> So the >>> parent would create and initialize the md_config_t object, then after >>> fork, >>> the child would pass that as an argument when creating the CephContext. >>> >>> How does that sound for a start? >> >> Sounds good to me! The last step (decoupling) sounds like it's not >> strictly necessary but is probably worthwhile. It will get deep into a >> bunch of crufty code that isn't much fun to deal with, so I suspect you'll >> either run away screaming or come up with something pretty satisfying that >> removes a bunch of ugly code. >> >> sage > > > Thanks Sage. I'll have to investigate some more to see how messy it will be, > but I'm more worried about trying to construct a 'partial' CephContext for > parsing, and exposing that to the application when most of its capabilities > (logging, especially) are missing. The parent process won't be able to call > any code with dout()s, so I think it's safest to avoid constructing a > CephContext in the first place. I'd like to add some thoughts here if I may? Recently Kefu and I were discussing uses of fork in the ceph codebase at length due to the issue seen in http://tracker.ceph.com/issues/16556 Over recent years supporting multiple products, including glibc, I have seen many problems arising from what I consider to be misuse of fork. There is a saying, "The only safe function call to make after fork() is exec()" and this is reflected in the fork man page. "After a fork(2) in a multithreaded program, the child can safely call only async-signal-safe functions (see signal(7)) until such time as it calls execve(2)." The list of async-signal-safe functions is not very extensive and does not include things like malloc and free and I have seen on multiple occasions deadlocks where the child is trying to acquire a malloc arena lock for example that was locked when fork was called and hanging forever. Basically, if a lock is locked when you fork the child will get a copy (COW) of the lock in the locked state with no possibility of the lock being unlocked since the child is single threaded. if the child then tries to acquire this lock it will hang forever. The same can happen with other resources. Note that even if your code is single threaded some libraries may use threads in the background and this can also catch you out. Sometimes these bugs can be hideously difficult to pin down and happen extremely rarely and only on certain machines, etc. due to timing. These types of problems with fork are well documented. https://bugs.chromium.org/p/chromium/issues/detail?id=35374#c51 http://www.evanjones.ca/fork-is-dangerous.html http://thorstenball.com/blog/2014/10/13/why-threads-cant-fork/ https://github.com/gperftools/gperftools/issues/178 (tcmalloc and its use of locks) Google shows a preference for clone in this document, posix_spawn is another alternative. https://github.com/google/googletest/blob/master/googletest/docs/AdvancedGuide.md#death-tests-and-threads There are partial solutions like pthread_atfork but it is not a comprehensive solution, nor is it without its own problems. Anyway, I just wanted to get this information out there for discussion so we can make an educated decision going forward and so that everyone is aware of the potential issues with uses of fork in ceph. -- Cheers, Brad -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html