I'm currently trying to straighten out the encrypted transport (SSL/TLS socket) code, to make it more robust and, in particular, to make it work well with IPv6 [1]. While testing the changes, the mount.glusterfs script caused some trouble.

The mount script tries to check whether the mount is online by performing a stat on the mount point after the glusterfs command returns, and umounts if the stat fails. This check is racy and doesn't always do the right thing: it can run before the client translators have been able to connect to the bricks.

The following sequence of events takes place during a mount, which helps explain the race.

- mount script runs the glusterfs command
- mount process fetches the volfile
- mount process initializes the graph. The client xlator is also initialized now, but the connections aren't started.
- mount process sends a PARENT_UP event to the graph. The client now begins the connection process (portmap first, followed by connecting to the brick). There is no guarantee yet that the connection has been established.
- mount process returns
- mount script does a stat on the mount point to check health

In an environment like the one I'm testing in, the connection can't be completed by the time the health check is done.

In my environment, the client connection sequence is as follows,

- the portmap connection is started
- the first address returned for the hostname is an IPv6 address. With the IPv6 change that was merged recently, name lookups are done with AF_UNSPEC, which can return IPv6 addresses. My environment returns v6 addresses first for getaddrinfo calls (which I think is the default for a lot of environments)
- the connection fails as glusterd doesn't listen on IPv6 addresses (it listens on 0.0.0.0, which is v4 only)
- a reconnection is made with the next address. This takes a while because of the encrypted transports.
- the portmap query is done after the connection is established, and the port is obtained
- the client xlator now reconnects to the obtained port.
- (the same cycle of connection/reconnection as above happens)
- once the connection is established, handshakes are done
- the CHILD_UP event is sent

After this point the client xlator becomes usable. But this point is not reached before the mount script does the health check in my environment, so the mount ends up being terminated.

Now the simplest solution would be to sleep for some time before doing the check, to give the xlators time to get ready. But this is non-deterministic and isn't something I'm very fond of.

This is turning out to be problematic in my very simple environment, and I think it's going to be a bigger problem in larger, more complex environments. My environment is,

- single node
- single brick volume
- client is the same node
- IO transport encryption is on
- Management transport encryption is on
- IPv6 enabled in kernel, no actual IPv6 network is in place (disabling IPv6 in the kernel makes the problem go away, but I want to test with IPv6)

Does anyone else have ideas on how to fix this? (For now I've disabled this check in the script.)

~kaushal

[1]: https://review.gluster.org/#/q/status:open+project:glusterfs+branch:master+topic:bug-1333317

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel
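
For reference, one alternative to a fixed sleep is a bounded poll: retry the stat until it succeeds or a deadline passes. The sketch below is only an illustration in Python, not what mount.glusterfs does. The name wait_for_mount, the timeout and interval defaults, and the injectable probe argument are all hypothetical. It is still time-based, so it narrows the race rather than removing it, but it returns as soon as the mount answers instead of always waiting the full period.

```python
import os
import time


def wait_for_mount(path, timeout=30.0, interval=0.5, probe=os.stat):
    """Poll probe(path) until it succeeds or `timeout` seconds elapse.

    Hypothetical helper: returns True as soon as the probe succeeds,
    False if the deadline passes with the probe still failing.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            probe(path)
            return True
        except OSError:
            # Assumption: a mount that is not usable yet (CHILD_UP not
            # reached) makes the stat fail with an OSError.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A caller would umount only when wait_for_mount(mount_point) returns False. A check driven by the CHILD_UP event itself would be deterministic, but that needs support from the mount process and is not shown here.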