Re: gluster forcing IPV6 on our IPV4 servers, glusterd fails (was gluster update question regarding new DNS resolution requirement)

Erik Jacobson <erik.jacobson@xxxxxxx> · Tue, 21 Sep 2021 09:59:24 -0500

> Don't forget to run the geo-replication fix script, if you missed doing it
> before the upgrade.

We don't use geo-replication YET but thank you for this thoughtful
reminder.

Just a note on things like this -- we really try to do everything in a
package update because that's how we'd have to deploy to customers in an
automated way. So having to run a script as part of the upgrade would be
very hard in a package-based workflow for a packaged solution.

I'm not complaining; I love gluster. This is just food for thought.

I can hardly say it with a straight face, because we suffer from
similar issues on the cluster management side: updating from one CM to
the next is harder than it should be, so I'm certainly not judging.
Updating is always painful.

I LOVE that slowly updating our gluster servers is "Just working".

This will allow a supercomputer site to slowly update its infrastructure
without taking any compute nodes (using nfs-hosted squashfs images or
root) down. It's really remarkable, since it's a big jump too: 7.9 to
9.3. I am impressed by this part. It's a huge relief that I didn't have
to do an intermediate jump to gluster8 in the middle, as that would have
been nearly impossible for us to get right.

Thank you all!!

PS: Frontier will have 21 leader nodes running gluster servers.
Distributed/replicate in groups of 3 hosting nfs-exported squashfs image
objects for compute node root filesystems. Many thousands of nodes.

> 
> Best Regards,
> Strahil Nikolov
> 
> 
>     On Tue, Sep 21, 2021 at 0:46, Erik Jacobson
>     <erik.jacobson@xxxxxxx> wrote:
>     I pretended I'm a low-level C programmer with network and filesystem
>     experience for a few hours.
> 
>     I'm not sure what the right solution is but what was happening was the
>     code was trying to treat our IPV4 hosts as AF_INET6 and the family was
>     incompatible with our IPV4 IP addresses. Yes, we need to move to IPV6
>     but we're hoping to do that on our own time (~50 years like everybody
>     else :)
> 
>     I found a chunk of the code that seemed to be force-setting us to
>     AF_INET6.
> 
>     While I'm sure it is not 100% the correct patch, the patch attached and
>     pasted below is working for me so I'll integrate it with our internal
>     build to continue testing.
> 
>     Please let me know if there is a configuration item I missed or a
>     different way to do this. I added -devel to this email.
> 
>     In the previous thread, you would have seen that we're testing a
>     hopeful change that will upgrade our deployed customers from gluster
>     7.9 to gluster 9.3.
> 
>     Thank you!! Advice on next steps would be appreciated!!
> 
> 
>     diff -Narup glusterfs-9.3-ORIG/rpc/rpc-transport/socket/src/name.c glusterfs-9.3-NEW/rpc/rpc-transport/socket/src/name.c
>     --- glusterfs-9.3-ORIG/rpc/rpc-transport/socket/src/name.c    2021-06-29 00:27:44.381408294 -0500
>     +++ glusterfs-9.3-NEW/rpc/rpc-transport/socket/src/name.c    2021-09-20 16:34:28.969425361 -0500
>     @@ -252,9 +252,16 @@ af_inet_client_get_remote_sockaddr(rpc_t
>         /* Need to update transport-address family if address-family is not provided
>             to command-line arguments
>         */
>     +    /* HPE This is forcing our IPV4 servers into an IPV6 address
>     +    * family that is not compatible with IPV4. For now we will just set it
>     +    * to AF_INET.
>     +    */
>     +    /*
>         if (inet_pton(AF_INET6, remote_host, &serveraddr)) {
>             sockaddr->sa_family = AF_INET6;
>         }
>     +    */
>     +    sockaddr->sa_family = AF_INET;
> 
>         /* TODO: gf_resolve is a blocking call. kick in some
>             non blocking dns techniques */
> 
>    
>     On Mon, Sep 20, 2021 at 11:35:35AM -0500, Erik Jacobson wrote:
>     > I missed the other important log snip:
>     >
>     > The message "E [MSGID: 101075] [common-utils.c:520:gf_resolve_ip6] 0-resolver: error in getaddrinfo [{family=10}, {ret=Address family for hostname not supported}]" repeated 620 times between [2021-09-20 15:49:23.720633 +0000] and [2021-09-20 15:50:41.731542 +0000]
>     >
>     > So I will dig in to the code some here.
>     >
>     >
>     > On Mon, Sep 20, 2021 at 10:59:30AM -0500, Erik Jacobson wrote:
>     > > Hello all! I hope you are well.
>     > >
>     > > We are starting a new software release cycle and I am trying to find a
>     > > way to upgrade customers from our build of gluster 7.9 to our build of
>     > > gluster 9.3
>     > >
>     > > When we deploy gluster, we forcibly remove all references to any host
>     > > names and use only IP addresses. This is because, if for any reason a
>     > > DNS server is unreachable, even if the peer files have IPs and DNS, it
>     > > causes glusterd to be unable to reach peers properly. We can't really
>     > > rely on /etc/hosts either because customers take artistic license with
>     > > their /etc/hosts files and don't realize the problems that can cause.
>     > >
>     > > So our deployed peer files look something like this:
>     > >
>     > > uuid=46a4b506-029d-4750-acfb-894501a88977
>     > > state=3
>     > > hostname1=172.23.0.16
>     > >
>     > > That is, with full intention, we avoid host names.
>     > >
>     > > When we upgrade to gluster 9.3, we fall over with these errors and
>     > > gluster is now partitioned and the updated gluster servers can't reach
>     > > anybody:
>     > >
>     > > [2021-09-20 15:50:41.731543 +0000] E [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host 172.23.0.16
>     > >
>     > >
>     > > As you can see, we have defined on purpose everything using IPs but in
>     > > 9.3 it appears this method fails. Are there any suggestions short of
>     > > putting real host names in peer files?
>     > >
>     > >
>     > >
>     > > FYI
>     > >
>     > > This supercomputer will be using gluster for part of its system
>     > > management. It is how we deploy the Image Objects (squashfs images)
>     > > hosted on NFS today and served by gluster leader nodes and also store
>     > > system logs, console logs, and other data.
>     > >
>     > > https://www.olcf.ornl.gov/frontier/
>     > >
>     > >
>     > > Erik
>     > > ________
>     > >
>     > >
>     > >
>     > > Community Meeting Calendar:
>     > >
>     > > Schedule -
>     > > Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
>     > > Bridge: https://meet.google.com/cpu-eiue-hvk
>     > > Gluster-users mailing list
>     > > Gluster-users@xxxxxxxxxxx
>     > > https://lists.gluster.org/mailman/listinfo/gluster-users
> 
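A minimal sketch of the address-family issue discussed in the quoted
thread above, assuming a POSIX system. Instead of hard-coding AF_INET as
the temporary patch does, one option is to derive the family from the
address literal itself. The helper names literal_family and
resolve_numeric are hypothetical and are not part of glusterfs; this is
an illustration of the idea, not a proposed fix for name.c.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netdb.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper (not glusterfs code): classify a host string as an
 * IPv4 literal, an IPv6 literal, or neither (AF_UNSPEC). */
static int
literal_family(const char *host)
{
    unsigned char buf[sizeof(struct in6_addr)]; /* large enough for both */

    assert(host != NULL);
    if (inet_pton(AF_INET, host, buf) == 1)
        return AF_INET;
    if (inet_pton(AF_INET6, host, buf) == 1)
        return AF_INET6;
    return AF_UNSPEC; /* likely a host name; let the resolver decide */
}

/* Hypothetical helper: try to resolve a numeric address under a given
 * family without any DNS lookup. Returns the getaddrinfo() status,
 * which is 0 on success and nonzero on failure. */
static int
resolve_numeric(const char *host, int family)
{
    struct addrinfo hints;
    struct addrinfo *res = NULL;
    int ret;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = family;
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_flags = AI_NUMERICHOST; /* numeric literals only, no DNS */

    ret = getaddrinfo(host, NULL, &hints, &res);
    if (ret == 0)
        freeaddrinfo(res);
    return ret;
}
```

With the peer address from the thread, literal_family("172.23.0.16")
returns AF_INET, while resolve_numeric("172.23.0.16", AF_INET6) fails.
On glibc that failure is reported as "Address family for hostname not
supported", which matches the text in the log (family=10 is AF_INET6 on
Linux).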