I would use Ansible to roll out such updates across a set of nodes. That prevents human error and makes it easy to include small steps such as running the geo-rep fix script.
P.S.: Out of curiosity, are you using distributed-replicated or distributed-dispersed volumes?
Best Regards,
Strahil Nikolov
On Tue, Sep 21, 2021 at 17:59, Erik Jacobson <erik.jacobson@xxxxxxx> wrote:
> Don't forget to run the geo-replication fix script if you missed doing it
> before the upgrade.
We don't use geo-replication YET but thank you for this thoughtful
reminder.
Just a note on things like this -- we really try to do everything in a
package update because that's how we'd have to deploy to customers in an
automated way. So having to run a script as part of the upgrade would be
very hard in a package-based workflow for a packaged solution.
I'm not complaining, I love gluster, but this is just food for thought.
I can hardly even say it with a straight face because we suffer from
similar issues on the cluster management side - updating one CM to the
next is harder than it should be so I'm certainly not judging. Updating
is always painful.
I LOVE that slowly updating our gluster servers is "Just working".
This will allow a supercomputer to slowly update their infrastructure
while taking no compute nodes (using nfs-hosted squashfs images or root)
down. It's really remarkable since it's a big jump too, 7.9 to 9.3. I am
impressed by this part. It's a huge relief that I didn't have to do an
intermediate jump to gluster8 in the middle as that would have been
nearly impossible for us to get right.
Thank you all!!
PS: Frontier will have 21 leader nodes running gluster servers.
Distributed/replicate in groups of 3 hosting nfs-exported squashfs image
objects for compute node root filesystems. Many thousands of nodes.
>
> Best Regards,
> Strahil Nikolov
>
>
> On Tue, Sep 21, 2021 at 0:46, Erik Jacobson
> <erik.jacobson@xxxxxxx> wrote:
> I pretended I'm a low-level C programmer with network and filesystem
> experience for a few hours.
>
> I'm not sure what the right solution is but what was happening was the
> code was trying to treat our IPV4 hosts as AF_INET6 and the family was
> incompatible with our IPV4 IP addresses. Yes, we need to move to IPV6
> but we're hoping to do that on our own time (~50 years like everybody
> else :)
>
> I found a chunk of the code that seemed to be force-setting us to
> AF_INET6.
>
> While I'm sure it is not 100% the correct patch, the patch attached and
> pasted below is working for me so I'll integrate it with our internal
> build to continue testing.
>
> Please let me know if there is a configuration item I missed or a
> different way to do this. I added -devel to this email.
>
> In the previous thread, you would have seen that we're testing a
> hopeful change that will upgrade our deployed customers from gluster
> 7.9 to gluster 9.3.
>
> Thank you!! Advice on next steps would be appreciated !!
>
>
> diff -Narup glusterfs-9.3-ORIG/rpc/rpc-transport/socket/src/name.c glusterfs-9.3-NEW/rpc/rpc-transport/socket/src/name.c
> --- glusterfs-9.3-ORIG/rpc/rpc-transport/socket/src/name.c 2021-06-29 00:27:44.381408294 -0500
> +++ glusterfs-9.3-NEW/rpc/rpc-transport/socket/src/name.c 2021-09-20 16:34:28.969425361 -0500
> @@ -252,9 +252,16 @@ af_inet_client_get_remote_sockaddr(rpc_t
> /* Need to update transport-address family if address-family is not
> provided
> to command-line arguments
> */
> + /* HPE This is forcing our IPV4 servers into an IPV6 address
> + * family that is not compatible with IPV4. For now we will just set it
> + * to AF_INET.
> + */
> + /*
> if (inet_pton(AF_INET6, remote_host, &serveraddr)) {
> sockaddr->sa_family = AF_INET6;
> }
> + */
> + sockaddr->sa_family = AF_INET;
>
> /* TODO: gf_resolve is a blocking call. kick in some
> non blocking dns techniques */
>
>
> On Mon, Sep 20, 2021 at 11:35:35AM -0500, Erik Jacobson wrote:
> > I missed the other important log snip:
> >
> > The message "E [MSGID: 101075] [common-utils.c:520:gf_resolve_ip6] 0-resolver: error in getaddrinfo [{family=10}, {ret=Address family for hostname not supported}]" repeated 620 times between [2021-09-20 15:49:23.720633 +0000] and [2021-09-20 15:50:41.731542 +0000]
> >
> > So I will dig in to the code some here.
> >
> >
> > On Mon, Sep 20, 2021 at 10:59:30AM -0500, Erik Jacobson wrote:
> > > Hello all! I hope you are well.
> > >
> > > We are starting a new software release cycle and I am trying to find a
> > > way to upgrade customers from our build of gluster 7.9 to our build of
> > > gluster 9.3
> > >
> > > When we deploy gluster, we forcibly remove all references to any host
> > > names and use only IP addresses. This is because, if for any reason a
> > > DNS server is unreachable, even if the peer files have IPs and DNS, it
> > > causes glusterd to be unable to reach peers properly. We can't really
> > > rely on /etc/hosts either because customers take artistic license with
> > > their /etc/hosts files and don't realize the problems that can cause.
> > >
> > > So our deployed peer files look something like this:
> > >
> > > uuid=46a4b506-029d-4750-acfb-894501a88977
> > > state=3
> > > hostname1=172.23.0.16
> > >
> > > That is, with full intention, we avoid host names.
> > >
> > > When we upgrade to gluster 9.3, we fall over with these errors and
> > > gluster is now partitioned and the updated gluster servers can't reach
> > > anybody:
> > >
> > > [2021-09-20 15:50:41.731543 +0000] E [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host 172.23.0.16
> > >
> > >
> > > As you can see, we have defined on purpose everything using IPs but in
> > > 9.3 it appears this method fails. Are there any suggestions short of
> > > putting real host names in peer files?
> > >
> > >
> > >
> > > FYI
> > >
> > > This supercomputer will be using gluster for part of its system
> > > management. It is how we deploy the Image Objects (squashfs images)
> > > hosted on NFS today and served by gluster leader nodes and also store
> > > system logs, console logs, and other data.
> > >
> > > https://www.olcf.ornl.gov/frontier/
> > >
> > >
> > > Erik
> > > ________
> > >
> > >
> > >
> > > Community Meeting Calendar:
> > >
> > > Schedule -
> > > Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> > > Bridge: https://meet.google.com/cpu-eiue-hvk
> > > Gluster-users mailing list
> > > Gluster-users@xxxxxxxxxxx
> > > https://lists.gluster.org/mailman/listinfo/gluster-users
>