Re: mount.glusterfs health check troubles - help appreciated

Kaushal M <kshlmster@xxxxxxxxx> · Mon, 9 May 2016 09:51:19 +0530

On Fri, May 6, 2016 at 4:59 PM, Sachidananda URS <surs@xxxxxxxxxx> wrote:
>
>
> On Fri, May 6, 2016 at 4:04 PM, Kaushal M <kshlmster@xxxxxxxxx> wrote:
>>
>> I'm currently trying to straighten out the encrypted transport
>> (SSL/TLS socket) code, and make it more robust, and work well with
>> IPv6 in particular [1]. When testing the changes, the mount.glusterfs
>> script caused some trouble.
>>
>> The mount script tries to check if the mount is online by performing a
>> stat on the mount point after the glusterfs command returns, and
>> umounts if the stat fails. This check is racy and doesn't always
>> do the right thing.
>>
>> The check is racy because it could be run before the client
>> translators have been able to connect to the bricks. The following
>> sequence of events occurs during a mount, which helps explain
>> the race.
>>
>> - mount script runs the glusterfs command
>> - mount process fetches the volfile
>> - mount process initializes the graph. The client xlator is also
>> initialized now, but the connections aren't started.
>> - mount process sends a PARENT_UP event to the graph. The client now
>> begins the connection process (portmap first, followed by connecting
>> to the brick). There is no guarantee yet that the connection has
>> been established.
>> - mount process returns
>> - mount script does a stat on mount point to check health
>>
>> In some environments (like the one I'm testing in) the connection
>> can't be completed by the time the health check is done. In my
>> environment, the client connection sequence is as follows:
>> - the portmap connection is started
>>  - the first address returned for the hostname is an IPv6 address. With
>> the IPv6 change that was merged recently, name lookups are done with
>> AF_UNSPEC, which can return IPv6 addresses. My environment returns v6
>> addresses first for getaddrinfo calls (which I think is the default
>> for a lot of environments)
>>  - the connection fails as glusterd doesn't listen on IPv6 addresses
>> (it listens on 0.0.0.0, which is v4 only)
>>  - a reconnection is made with the next address. This takes a while
>> because of the encrypted transports.
>>  - portmap query is done after connection is established and the port
>> is obtained
>> - the client xlator now reconnects to the obtained port.
>>  - (the same cycle of connection/reconnection as above happens)
>> - once connection is established, handshakes are done
>> - CHILD_UP event is sent
>>
>> After this point the client xlator becomes usable.
>>
>> But this is not reached before the mount script does the health check
>> in my environment. So the mount ends up being terminated.
>>
>> Now the simplest solution would be to sleep for some time before doing
>> the check to give the xlators time to get ready. But this is
>> non-deterministic and isn't something I'm very fond of.
>>
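To illustrate the trade-off mentioned above, here is a minimal sketch of a bounded retry in place of a single fixed sleep. The helper name `wait_for_mount` and its arguments are hypothetical and are not part of mount.glusterfs. It still depends on a timeout, so it only narrows the race rather than removing it.

```shell
#!/bin/sh
# wait_for_mount DIR TRIES: hypothetical helper, not taken from mount.glusterfs.
# Retries stat on DIR up to TRIES times, one second apart.
# Returns 0 as soon as a stat succeeds, 1 if every attempt fails.
wait_for_mount() {
    dir=$1
    tries=$2
    while [ "$tries" -gt 0 ]; do
        if stat "$dir" >/dev/null 2>&1; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# Possible use in a mount script (the path and retry count are assumptions):
#   wait_for_mount /mnt/gluster 30 || umount /mnt/gluster
```

A fast mount passes on the first stat, so the delay is only paid in slow environments, but the retry count is still an arbitrary guess.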
>
>
> Have you tried the wait builtin?

I don't think it would help. `wait` is used to wait for background
processes of the current shell to complete. But the mount script
launches the mount process in the foreground, and that process forks
and exits once the child process gives it a return value. By then there
is nothing left for `wait` to wait on, so it isn't of much use here.
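A small sketch of that behaviour, assuming a stand-in function `fake_mount` (hypothetical, not the real glusterfs command) whose foreground process forks a background child and exits immediately:

```shell
#!/bin/sh
# fake_mount: hypothetical stand-in for a daemonizing mount command.
# The subshell plays the foreground process; the job started inside it
# plays the daemon, which keeps running after the subshell exits.
fake_mount() {
    ( sleep 3 >/dev/null 2>&1 & )
}

start=$(date +%s)
fake_mount            # returns as soon as the foreground subshell exits
wait                  # this shell has no background children, so no blocking
wait_status=$?
elapsed=$(( $(date +%s) - start ))

echo "wait returned $wait_status after ${elapsed}s; the daemon is still running"
```

The `sleep` here is a grandchild of the script, not a background job of it, so `wait` returns straight away without telling us anything about whether the "daemon" is ready.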

>
> -sac