On Thu, Apr 7, 2016 at 7:24 PM, Kaushal M <kshlmster@xxxxxxxxx> wrote:
> On Thu, Apr 7, 2016 at 6:23 PM, Kaushal M <kshlmster@xxxxxxxxx> wrote:
>> On Thu, Apr 7, 2016 at 6:00 PM, Atin Mukherjee <amukherj@xxxxxxxxxx> wrote:
>>>
>>>
>>> On 04/07/2016 05:37 PM, Kaushal M wrote:
>>>>
>>>> On 7 Apr 2016 5:36 p.m., "Niels de Vos" <ndevos@xxxxxxxxxx> wrote:
>>>>>
>>>>> On Thu, Apr 07, 2016 at 05:13:54PM +0530, Kaushal M wrote:
>>>>> > On Thu, Apr 7, 2016 at 5:11 PM, Kaushal M <kshlmster@xxxxxxxxx> wrote:
>>>>> > > We've hit another regression.
>>>>> > >
>>>>> > > With management encryption enabled, daemons like NFS and SHD don't
>>>>> > > start on the current heads of release-3.7 and master branches.
>>>>> > >
>>>>> > > I still have no clear root cause for it, and would appreciate some
>>>>> > > help.
>>>>> >
>>>>> > This was working with 3.7.9 from what I've heard.
>>>>>
>>>>> Do we have a simple test-case for this? If someone writes a script, we
>>>>> should be able to "git bisect" it pretty quickly.
>>>>
>>>> I am doing this right now.
>>> "b33f3c9 glusterd: Bug fixes for IPv6 support" has caused this
>>> regression. I am yet to find the RCA though.
>>
>> git-bisect agrees with this as well.
>>
>> I initially thought it was because GlusterD didn't listen on IPv6
>> (checked using `ss`).
>> This change makes it so that connections to localhost use ::1 instead
>> of 127.0.0.1, and so the connection failed.
>> This should have caused all connection attempts to fail, irrespective
>> of whether they were encrypted or not.
>> But the failure only happens when management encryption is enabled.
>> So this theory doesn't make sense.
>
> This is part of the problem!
>
> The initial IPv6 connection to ::1 fails for non-encrypted connections as well.
> But these connections correctly retry the connect with the next address
> once the first connect attempt fails.
> Since the next address is 127.0.0.1, the connection succeeds, the volfile
> is fetched and the daemon starts.
>
> Encrypted connections, on the other hand, give up after the first
> failure and don't attempt a reconnect.
> This is somewhat surprising to me, as I'd recently fixed an issue
> which caused crashes when encrypted connections attempted a reconnect
> after a failure to connect.
>
> I'll diagnose this a little bit more and try to find a solution.

Found the full problem. This is mainly a result of the fix I did, which
I mentioned above. (A slight correction: it actually wasn't crashes that
it fixed, but an encrypted reconnect issue in GlusterD.)

I'm posting the root cause as I described it in the commit message for
the fix.

"""
With commit d117466, socket_poller() wasn't launched from
socket_connect() (for encrypted connections) if connect() failed. This
was done to prevent the socket private data from being double unreffed
by the cleanups in both socket_poller() and socket_connect(). This
allowed future reconnects to happen successfully.

Whether a socket reconnects is effectively decided by the registered rpc
notify function. The above change worked with glusterd, as the glusterd
rpc notify function (glusterd_peer_rpc_notify()) continuously allowed
reconnects on failure.

mgmt_rpc_notify(), the rpc notify function in glusterfsd, behaves
differently.

For a DISCONNECT event, if more volfile servers are available or if more
addresses are available in the dns cache, it allows reconnects. If not,
it terminates the program.

For a CONNECT event, it attempts to do a volfile fetch rpc request. If
sending this rpc fails, it immediately terminates the program.

One side effect of commit d117466 was that the encrypted socket was
unintentionally registered with epoll on a connect failure. A weird
thing happens because of this. The epoll notifier notifies
mgmt_rpc_notify() of a CONNECT event, instead of a DISCONNECT as expected.
This causes mgmt_rpc_notify() to attempt an unsuccessful volfile fetch
rpc request, and terminate. (I still don't know why epoll raises the
CONNECT event.)

Commit 46bd29e fixed some issues with IPv6 in GlusterFS. This caused
address resolution in GlusterFS to also request IPv6 addresses
(AF_UNSPEC) instead of just IPv4. On most systems, this causes the IPv6
addresses to be returned first.

GlusterD listens on 0.0.0.0:24007 by default. While this attaches to all
interfaces, it only listens on IPv4 addresses.

GlusterFS daemons and bricks are given 'localhost' as the volfile
server. This resolves to '::1' as the first address.

When using management encryption, the above reasons cause the daemon
processes to fail to fetch volfiles and terminate.

The solution to this is simple. Instead of cleaning up the encrypted
socket in socket_connect(), launch socket_poller() and let it clean up
the socket instead. This prevents the unintentional registration with
epoll, and socket_poller() sends the correct events to the rpc notify
functions, which allows proper reconnects to happen.
"""

I'll post the commit to r.g.o later, after doing a little more testing
to verify it, but for now I'm off to have dinner.

~kaushal

>
>>
>> One other thing: on my laptop, even bricks failed to start when
>> glusterd was started with management encryption.
>> But on a VM, the bricks started, but other daemons failed.
>>
>>>>
>>>>>
>>>>> Niels
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Gluster-devel mailing list
>>>> Gluster-devel@xxxxxxxxxxx
>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
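
For anyone following along, the address-ordering and retry behaviour from
the root cause can be illustrated with a small sketch. This is not
GlusterFS code: resolve_all(), connect_with_fallback() and
ipv4_only_listener() are hypothetical names made up for illustration, and
the retry_on_failure flag only stands in for whether the rpc notify
function ends up allowing a reconnect. It assumes 'localhost' resolves to
::1 first and 127.0.0.1 second, as described in the commit message.

```python
import socket
from typing import Callable, List, Optional


def resolve_all(host: str, port: int) -> List[str]:
    """Resolve host with AF_UNSPEC, so IPv6 and IPv4 results are both returned.

    Not called below, so the sketch does not depend on the local resolver.
    """
    infos = socket.getaddrinfo(host, port, socket.AF_UNSPEC, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the textual address.
    return [info[4][0] for info in infos]


def connect_with_fallback(
    addresses: List[str],
    try_connect: Callable[[str], bool],
    retry_on_failure: bool,
) -> Optional[str]:
    """Try addresses in order and return the first that connects, else None.

    retry_on_failure=False models a caller that gives up after the first
    failed connect instead of moving on to the next resolved address.
    """
    for address in addresses:
        if try_connect(address):
            return address
        if not retry_on_failure:
            return None
    return None


def ipv4_only_listener(address: str) -> bool:
    """Stand-in for a server bound to 0.0.0.0: only IPv4 connects succeed."""
    return ":" not in address


# Assumed resolution order for 'localhost' (IPv6 first).
resolved = ["::1", "127.0.0.1"]

# Non-encrypted case: the failed ::1 attempt is followed by 127.0.0.1.
plain = connect_with_fallback(resolved, ipv4_only_listener, retry_on_failure=True)

# Encrypted case before the fix: no second attempt is made.
encrypted = connect_with_fallback(resolved, ipv4_only_listener, retry_on_failure=False)

print(plain, encrypted)  # prints: 127.0.0.1 None
```

The only difference between the two calls is whether the second address is
tried at all, which is the difference between the daemon starting and
terminating in the scenario above.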