On Thu, Apr 7, 2016 at 9:06 PM, Kaushal M <kshlmster@xxxxxxxxx> wrote:
> On Thu, Apr 7, 2016 at 7:24 PM, Kaushal M <kshlmster@xxxxxxxxx> wrote:
>> On Thu, Apr 7, 2016 at 6:23 PM, Kaushal M <kshlmster@xxxxxxxxx> wrote:
>>> On Thu, Apr 7, 2016 at 6:00 PM, Atin Mukherjee <amukherj@xxxxxxxxxx> wrote:
>>>> On 04/07/2016 05:37 PM, Kaushal M wrote:
>>>>> On 7 Apr 2016 5:36 p.m., "Niels de Vos" <ndevos@xxxxxxxxxx> wrote:
>>>>>> On Thu, Apr 07, 2016 at 05:13:54PM +0530, Kaushal M wrote:
>>>>>> > On Thu, Apr 7, 2016 at 5:11 PM, Kaushal M <kshlmster@xxxxxxxxx> wrote:
>>>>>> > > We've hit another regression.
>>>>>> > >
>>>>>> > > With management encryption enabled, daemons like NFS and SHD don't
>>>>>> > > start on the current heads of the release-3.7 and master branches.
>>>>>> > >
>>>>>> > > I still have no clear root cause for it, and would appreciate some
>>>>>> > > help.
>>>>>> >
>>>>>> > This was working with 3.7.9, from what I've heard.
>>>>>>
>>>>>> Do we have a simple test-case for this? If someone writes a script, we
>>>>>> should be able to "git bisect" it pretty quickly.
>>>>>
>>>>> I am doing this right now.
>>>>
>>>> "b33f3c9 glusterd: Bug fixes for IPv6 support" has caused this
>>>> regression. I am yet to find the RCA though.
>>>
>>> git-bisect agrees with this as well.
>>>
>>> I initially thought it was because GlusterD didn't listen on IPv6
>>> (checked using `ss`).
>>> This change makes connections to localhost use ::1 instead of
>>> 127.0.0.1, and so the connection failed.
>>> This should have caused all connection attempts to fail, irrespective
>>> of whether they were encrypted or not.
>>> But the failure only happens when management encryption is enabled,
>>> so this theory doesn't make sense.
>>
>> This is part of the problem!
>>
>> The initial IPv6 connection to ::1 fails for non-encrypted connections
>> as well. But these connections correctly retry connect with the next
>> address once the first connect attempt fails. Since the next address
>> is 127.0.0.1, the connection succeeds, the volfile is fetched and the
>> daemon starts.
>>
>> Encrypted connections, on the other hand, give up after the first
>> failure and don't attempt a reconnect. This is somewhat surprising to
>> me, as I'd recently fixed an issue which caused crashes when encrypted
>> connections attempted a reconnect after a failure to connect.
>>
>> I'll diagnose this a little bit more and try to find a solution.
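As an illustration of that retry behaviour: it is the classic
getaddrinfo(3) connect loop, which tries each resolved address in order
until one succeeds. Below is a minimal standalone sketch of the pattern
(plain POSIX sockets; this is not the actual GlusterFS rpc/socket
transport code):

/* Minimal sketch of the "try the next address" connect loop.
 * Illustrative only, not GlusterFS transport code. */
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netdb.h>

static int connect_to(const char *host, const char *port)
{
    struct addrinfo hints, *res, *ai;
    int fd = -1;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;      /* request IPv6 and IPv4 */
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1;

    for (ai = res; ai != NULL; ai = ai->ai_next) {
        fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0)
            continue;
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0)
            break;                    /* connected */
        close(fd);                    /* failed: try the next address */
        fd = -1;
    }

    freeaddrinfo(res);
    return fd;                        /* -1 if every address failed */
}

With AF_UNSPEC, "localhost" typically resolves to ::1 first; when the
connect to ::1 fails, the loop simply moves on to 127.0.0.1, which is
how the non-encrypted transport recovers.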
> Found the full problem. This is mainly a result of the fix I did that
> I mentioned above.
> (A slight correction: it wasn't crashes that it fixed, but an
> encrypted reconnect issue in GlusterD.)
>
> I'm posting the root cause as I described it in the commit message for
> the fix for this.
>
> """
> With commit d117466, socket_poller() wasn't launched from
> socket_connect() (for encrypted connections) if connect() failed. This
> was done to prevent the socket private data from being double-unreffed
> by the cleanups in both socket_poller() and socket_connect(). This
> allowed future reconnects to happen successfully.
>
> Whether a socket reconnects is decided by the registered rpc notify
> function. The above change worked with glusterd, as the glusterd rpc
> notify function (glusterd_peer_rpc_notify()) continuously allowed
> reconnects on failure.
>
> mgmt_rpc_notify(), the rpc notify function in glusterfsd, behaves
> differently.
>
> For a DISCONNECT event, if more volfile servers are available, or if
> more addresses are available in the dns cache, it allows reconnects.
> If not, it terminates the program.
>
> For a CONNECT event, it attempts a volfile fetch rpc request. If
> sending this rpc fails, it immediately terminates the program.
>
> One side effect of commit d117466 was that the encrypted socket was
> unintentionally registered with epoll on a connect failure. A weird
> thing happens because of this: the epoll notifier notifies
> mgmt_rpc_notify() of a CONNECT event, instead of the expected
> DISCONNECT. This causes mgmt_rpc_notify() to attempt an unsuccessful
> volfile fetch rpc request, and terminate.
> (I still don't know why epoll raises the CONNECT event.)
>
> Commit 46bd29e fixed some issues with IPv6 in GlusterFS. This caused
> address resolution in GlusterFS to also request IPv6 addresses
> (AF_UNSPEC) instead of just IPv4. On most systems, this causes the
> IPv6 addresses to be returned first.
>
> GlusterD listens on 0.0.0.0:24007 by default. While this attaches to
> all interfaces, it only listens on IPv4 addresses. GlusterFS daemons
> and bricks are given 'localhost' as the volfile server, which resolves
> to '::1' as the first address.
>
> When using management encryption, the above reasons cause the daemon
> processes to fail to fetch volfiles and terminate.
>
> The solution to this is simple. Instead of cleaning up the encrypted
> socket in socket_connect(), launch socket_poller() and let it clean up
> the socket instead. This prevents the unintentional registration with
> epoll, and socket_poller() sends the correct events to the rpc notify
> functions, which allows proper reconnects to happen.
> """
>
> I'll post the commit to r.g.o later, after doing a little more testing
> to verify it, but for now I'm off to have dinner.

Decided to post the change as an RFC review. I still need to open bug
reports for this.
https://review.gluster.org/13926

> ~kaushal
>
>>> One other thing: on my laptop, even the bricks failed to start when
>>> glusterd was started with management encryption.
>>> But on a VM, the bricks started, while the other daemons failed.
>>>
>>>>>> Niels
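As a footnote, the listener mismatch described in the commit message is
easy to reproduce outside GlusterFS: a socket bound to the IPv4
wildcard 0.0.0.0 accepts connections to 127.0.0.1 but refuses
connections to ::1. Below is a minimal standalone sketch; error
handling is elided, and port 24007 only mirrors glusterd's default (it
is otherwise arbitrary, and the bind will fail if glusterd is actually
running):

/* Sketch: an IPv4-wildcard listener rejects IPv6 loopback connects.
 * Illustrative only; not GlusterFS code. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

static void try_connect(int family, const char *addr, int port)
{
    struct sockaddr_storage ss;
    socklen_t len;
    int fd = socket(family, SOCK_STREAM, 0);

    memset(&ss, 0, sizeof(ss));
    if (family == AF_INET) {
        struct sockaddr_in *s4 = (struct sockaddr_in *)&ss;
        s4->sin_family = AF_INET;
        s4->sin_port = htons(port);
        inet_pton(AF_INET, addr, &s4->sin_addr);
        len = sizeof(*s4);
    } else {
        struct sockaddr_in6 *s6 = (struct sockaddr_in6 *)&ss;
        s6->sin6_family = AF_INET6;
        s6->sin6_port = htons(port);
        inet_pton(AF_INET6, addr, &s6->sin6_addr);
        len = sizeof(*s6);
    }

    int rc = connect(fd, (struct sockaddr *)&ss, len);
    printf("connect(%s:%d): %s\n", addr, port,
           rc == 0 ? "connected" : strerror(errno));
    close(fd);
}

int main(void)
{
    /* Listener bound the way glusterd binds by default: IPv4 wildcard. */
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in srv;

    memset(&srv, 0, sizeof(srv));
    srv.sin_family = AF_INET;
    srv.sin_addr.s_addr = htonl(INADDR_ANY);   /* 0.0.0.0 */
    srv.sin_port = htons(24007);
    bind(lfd, (struct sockaddr *)&srv, sizeof(srv));
    listen(lfd, 1);

    try_connect(AF_INET6, "::1", 24007);       /* refused: no IPv6 listener */
    try_connect(AF_INET, "127.0.0.1", 24007);  /* succeeds */

    close(lfd);
    return 0;
}

This is why daemons that resolve 'localhost' to '::1' first, and never
retry the next address, cannot reach glusterd.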