Re: gluster forcing IPV6 on our IPV4 servers, glusterd fails (was gluster update question regarding new DNS resolution requirement)


As far as I know, a fix was introduced recently, so forgetting to run the script is no longer critical - you can run it afterwards.
I would use Ansible to roll out such updates across a set of nodes - this prevents human error and gives you a place to handle small details such as running the geo-rep modification script.

P.S.: Out of curiosity, are you using distributed-replicated or distributed-dispersed volumes ?

Best Regards,
Strahil Nikolov


On Tue, Sep 21, 2021 at 17:59, Erik Jacobson
<erik.jacobson@xxxxxxx> wrote:
> Don't forget to run the geo-replication fix script if you missed doing it
> before the upgrade.

We don't use geo-replication YET but thank you for this thoughtful
reminder.

Just a note on things like this -- we really try to do everything in a
package update because that's how we'd have to deploy to customers in an
automated way. So having to run a script as part of the upgrade would be
very hard in a package-based workflow for a packaged solution.

I'm not complaining; I love gluster, but this is just food for thought.

I can hardly even say it with a straight face because we suffer from
similar issues on the cluster management side - updating one CM to the
next is harder than it should be so I'm certainly not judging. Updating
is always painful.

I LOVE that slowly updating our gluster servers is "Just working".

This will allow a supercomputer to slowly update their infrastructure
while taking no compute nodes (using nfs-hosted squashfs images or root)
down. It's really remarkable since it's a big jump too, 7.9 to 9.3. I am
impressed by this part. It's a huge relief that I didn't have to do an
intermediate jump to gluster8 in the middle as that would have been
nearly impossible for us to get right.

Thank you all!!

PS: Frontier will have 21 leader nodes running gluster servers.
Distributed/replicate in groups of 3 hosting nfs-exported squashfs image
objects for compute node root filesystems. Many thousands of nodes.


>
> Best Regards,
> Strahil Nikolov
>
>
>    On Tue, Sep 21, 2021 at 0:46, Erik Jacobson
>    <erik.jacobson@xxxxxxx> wrote:
>    I pretended I'm a low-level C programmer with network and filesystem
>    experience for a few hours.
>
>    I'm not sure what the right solution is but what was happening was the
>    code was trying to treat our IPV4 hosts as AF_INET6 and the family was
>    incompatible with our IPV4 IP addresses. Yes, we need to move to IPV6
>    but we're hoping to do that on our own time (~50 years like everybody
>    else :)
>
>    I found a chunk of the code that seemed to be force-setting us to
>    AF_INET6.
>
>    While I'm sure it is not 100% the correct patch, the patch attached and
>    pasted below is working for me so I'll integrate it with our internal
>    build to continue testing.
>
>    Please let me know if there is a configuration item I missed or a
>    different way to do this. I added -devel to this email.
>
>    In the previous thread, you would have seen that we're testing a
>    hopeful change that will upgrade our deployed customers from gluster
>    7.9 to gluster 9.3.
>
>    Thank you!! Advice on next steps would be appreciated !!
>
>
>    diff -Narup glusterfs-9.3-ORIG/rpc/rpc-transport/socket/src/name.c glusterfs-9.3-NEW/rpc/rpc-transport/socket/src/name.c
>    --- glusterfs-9.3-ORIG/rpc/rpc-transport/socket/src/name.c    2021-06-29 00:27:44.381408294 -0500
>    +++ glusterfs-9.3-NEW/rpc/rpc-transport/socket/src/name.c    2021-09-20 16:34:28.969425361 -0500
>    @@ -252,9 +252,16 @@ af_inet_client_get_remote_sockaddr(rpc_t
>        /* Need to update transport-address family if address-family is not provided
>            to command-line arguments
>        */
>    +    /* HPE This is forcing our IPV4 servers into an IPV6 address
>    +    * family that is not compatible with IPV4. For now we will just set it
>    +    * to AF_INET.
>    +    */
>    +    /*
>        if (inet_pton(AF_INET6, remote_host, &serveraddr)) {
>            sockaddr->sa_family = AF_INET6;
>        }
>    +    */
>    +    sockaddr->sa_family = AF_INET;
>
>        /* TODO: gf_resolve is a blocking call. kick in some
>            non blocking dns techniques */
>
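For anyone following the patch above, here is a minimal sketch of a less drastic alternative to hard-coding AF_INET: probe the host string as a numeric literal and pick the family from that. `literal_family` is a hypothetical helper name, not a glusterfs function, and this is not necessarily how the upstream code handles it. One detail worth knowing is that inet_pton() returns 1 on success, 0 when the string is not a valid address for that family, and -1 when the family itself is unsupported, so comparing against 1 is the careful form.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper (not part of glusterfs): classify a host string by
 * probing it as a numeric address literal. Returns AF_INET, AF_INET6, or
 * AF_UNSPEC when the string is not a literal at all (e.g. a host name). */
int
literal_family(const char *host)
{
    unsigned char buf[sizeof(struct in6_addr)]; /* large enough for both */

    /* inet_pton() returns exactly 1 only when the text parsed cleanly. */
    if (inet_pton(AF_INET, host, buf) == 1)
        return AF_INET;
    if (inet_pton(AF_INET6, host, buf) == 1)
        return AF_INET6;

    /* Not a literal: leave the family choice to the resolver. */
    return AF_UNSPEC;
}
```

With a helper like this, a peer entry such as 172.23.0.16 would be classified as AF_INET, an IPv6 literal as AF_INET6, and a real host name would be left as AF_UNSPEC.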
>   
>    On Mon, Sep 20, 2021 at 11:35:35AM -0500, Erik Jacobson wrote:
>    > I missed the other important log snip:
>    >
>    > The message "E [MSGID: 101075] [common-utils.c:520:gf_resolve_ip6] 0-resolver: error in getaddrinfo [{family=10}, {ret=Address family for hostname not supported}]" repeated 620 times between [2021-09-20 15:49:23.720633 +0000] and [2021-09-20 15:50:41.731542 +0000]
>    >
>    > So I will dig in to the code some here.
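For context on that log line: family=10 is AF_INET6 on Linux. The small sketch below, which is an illustration and not the gluster code path, asks getaddrinfo() to resolve a numeric host string under a fixed family hint. `resolve_numeric` is a hypothetical name. On glibc I would expect the IPv4 literal with an AF_INET6 hint to fail with the same "Address family for hostname not supported" text; other C libraries may report a different error code, so the sketch only relies on the call failing.

```c
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: resolve a numeric host string with a fixed
 * address-family hint. Returns getaddrinfo()'s status: 0 on success,
 * a non-zero EAI_* code on failure (printable with gai_strerror()). */
int
resolve_numeric(const char *host, int family)
{
    struct addrinfo hints;
    struct addrinfo *res = NULL;
    int ret;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = family;        /* AF_INET, AF_INET6 or AF_UNSPEC */
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_flags = AI_NUMERICHOST; /* parse literals only, never use DNS */

    ret = getaddrinfo(host, NULL, &hints, &res);
    if (ret == 0)
        freeaddrinfo(res);
    return ret;
}
```

Calling resolve_numeric("172.23.0.16", AF_INET6) should return non-zero, while the same address with AF_INET or AF_UNSPEC should return 0.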
>    >
>    >
>    > On Mon, Sep 20, 2021 at 10:59:30AM -0500, Erik Jacobson wrote:
>    > > Hello all! I hope you are well.
>    > >
>    > > We are starting a new software release cycle and I am trying to find a
>    > > way to upgrade customers from our build of gluster 7.9 to our build of
>    > > gluster 9.3
>    > >
>    > > When we deploy gluster, we forcibly remove all references to any host
>    > > names and use only IP addresses. This is because, if for any reason a
>    > > DNS server is unreachable, even if the peer files have IPs and DNS, it
>    > > causes glusterd to be unable to reach peers properly. We can't really
>    > > rely on /etc/hosts either because customers take artistic license with
>    > > their /etc/hosts files and don't realize the problems that can cause.
>    > >
>    > > So our deployed peer files look something like this:
>    > >
>    > > uuid=46a4b506-029d-4750-acfb-894501a88977
>    > > state=3
>    > > hostname1=172.23.0.16
>    > >
>    > > That is, with full intention, we avoid host names.
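A deployment that depends on IP-only peer files could verify that property before an upgrade. The following is a minimal sketch of such a check for a single line of a peer file in the key=value format shown above. `peer_line_is_ip` is a hypothetical helper, not a gluster tool, and it assumes that host entries always use keys beginning with "hostname".

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical check: returns 1 if `line` is a "hostnameN=<value>" entry
 * whose value is an IPv4 or IPv6 literal, 0 if the value is anything else
 * (such as a host name), and -1 if the line is not a hostname entry. */
int
peer_line_is_ip(const char *line)
{
    unsigned char buf[sizeof(struct in6_addr)];
    char value[INET6_ADDRSTRLEN];
    const char *eq;
    size_t len;

    if (strncmp(line, "hostname", 8) != 0)
        return -1;
    eq = strchr(line, '=');
    if (eq == NULL)
        return -1;

    /* Copy the value, dropping any trailing newline. Anything too long
     * to be an address literal cannot be one, so report 0. */
    len = strcspn(eq + 1, "\r\n");
    if (len == 0 || len >= sizeof(value))
        return 0;
    memcpy(value, eq + 1, len);
    value[len] = '\0';

    return inet_pton(AF_INET, value, buf) == 1 ||
           inet_pton(AF_INET6, value, buf) == 1;
}
```

Feeding each line of a peer file through this function and flagging any 0 result would catch a host name that slipped into an otherwise IP-only configuration.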
>    > >
>    > > When we upgrade to gluster 9.3, we fall over with these errors and
>    > > gluster is now partitioned and the updated gluster servers can't reach
>    > > anybody:
>    > >
>    > > [2021-09-20 15:50:41.731543 +0000] E [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host 172.23.0.16
>    > >
>    > >
>    > > As you can see, we have defined on purpose everything using IPs but in
>    > > 9.3 it appears this method fails. Are there any suggestions short of
>    > > putting real host names in peer files?
>    > >
>    > >
>    > >
>    > > FYI
>    > >
>    > > This supercomputer will be using gluster for part of its system
>    > > management. It is how we deploy the Image Objects (squashfs images)
>    > > hosted on NFS today and served by gluster leader nodes and also store
>    > > system logs, console logs, and other data.
>    > >
>    > > https://www.olcf.ornl.gov/frontier/
>    > >
>    > >
>    > > Erik
>    > > ________
>    > >
>    > >
>    > >
>    > > Community Meeting Calendar:
>    > >
>    > > Schedule -
>    > > Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
>    > > Bridge: https://meet.google.com/cpu-eiue-hvk
>    > > Gluster-users mailing list
>    > > Gluster-users@xxxxxxxxxxx
>    > > https://lists.gluster.org/mailman/listinfo/gluster-users
>
