Re: [Gluster-devel] Upgrade testing to gluster 6

Darrell Budic <budic@xxxxxxxxxxxxxxxx> · Thu, 4 Apr 2019 10:56:33 -0500

I didn’t follow any specific documents, just a generic rolling upgrade one node at a time. Once the first node didn’t reconnect, I tried to follow the workaround in the bug during the upgrade. Basic procedure was:
- take 3 nodes that were initially installed with 3.12.x (forget which, but low number) and had been upgraded directly to 5.5 from 3.12.15
  - op-version was 50400
- on node A:
  - yum install centos-release-gluster6
  - yum upgrade (was some ovirt cockpit components, gluster, and a lib or two this time), hit yes
  - discover glusterd was dead
  - systemctl restart glusterd
  - no peer connections, try iptables -F; systemctl restart glusterd, no change
- following the workaround in the bug, try iptables -F & restart glusterd on other 2 nodes, no effect
  - nodes B & C were still connected to each other and all bricks were fine at this point
- try upgrading other 2 nodes and restarting gluster, no effect (iptables still empty)
  - lost quorum here, so all bricks went offline
- read logs, not finding much, but looked at glusterd.vol and compared to new versions
- updated glusterd.vol on A and restarted glusterd
  - A doesn’t show any connected peers, but both other nodes show A as connected
- update glusterd.vol on B & C, restart glusterd
  - all nodes show connected and volumes are active and healing
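
One way to spot the asymmetric view described above (A seeing no peers while B and C see A as connected) is to collect the peer list on each node and compare them. The following is only a rough sketch: it assumes text in the usual "Hostname / Uuid / State" block layout printed by `gluster peer status`, and the hostnames and UUIDs in SAMPLE are made up for illustration, not taken from this cluster.

```python
# Hypothetical sample text; hostnames and UUIDs are placeholders.
SAMPLE = """\
Number of Peers: 2

Hostname: node-b.example
Uuid: 11111111-1111-1111-1111-111111111111
State: Peer in Cluster (Connected)

Hostname: node-c.example
Uuid: 22222222-2222-2222-2222-222222222222
State: Peer in Cluster (Disconnected)
"""


def parse_peer_status(text):
    """Split peer-status style text into one dict per peer block."""
    peers, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current peer block, if any.
            if current:
                peers.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip().lower()
        if key in ("hostname", "uuid", "state"):
            current[key] = value.strip()
    if current:
        peers.append(current)
    return peers


def disconnected_peers(peers):
    """Return hostnames whose state is not reported as connected."""
    return [p.get("hostname") for p in peers
            if "(Connected)" not in p.get("state", "")]


print(disconnected_peers(parse_peer_status(SAMPLE)))
```

Running this against the output captured on each node in turn would show which nodes disagree about who is connected.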

The only odd thing in my process was that node A did not have any active bricks on it at the time of the upgrade. It doesn’t seem like this mattered since B & C showed the same symptoms between themselves while being upgraded, but I don’t know. The only log entry that referenced anything about peer connections is included below already.

Looks like it was related to my glusterd settings, since that’s what fixed it for me. Unfortunately, I don’t have the bandwidth or the systems to test different versions of that specifically, but maybe you guys can on some test resources? Otherwise, I’ve got another cluster (my production one!) that’s midway through the upgrade from 3.12.15 -> 5.5. I paused when I started getting multiple brick processes on the two nodes that had gone to 5.5 already. I think I’m going to jump the last node right to 6 to try and avoid that mess, and it has the same glusterd.vol settings. I’ll try and capture its logs during the upgrade and see if there’s any new info, or if it has the same issues as this group did.

  -Darrell

On Apr 4, 2019, at 2:54 AM, Sanju Rakonde <srakonde@xxxxxxxxxx> wrote:

We did not hit https://bugzilla.redhat.com/show_bug.cgi?id=1694010 while upgrading to glusterfs-6. We tested it in different setups and concluded that this issue is seen because of some problem in the setup.

Regarding the issue you have faced, can you please let us know which documentation you followed for the upgrade? During our testing, we didn't hit any such issue. We would like to understand what went wrong.

On Thu, Apr 4, 2019 at 2:08 AM Darrell Budic <budic@xxxxxxxxxxxxxxxx> wrote:
Hari-
I was upgrading my test cluster from 5.5 to 6 and I hit this bug (https://bugzilla.redhat.com/show_bug.cgi?id=1694010) or something similar. In my case, the workaround did not work, and I was left with a gluster that had gone into no-quorum mode and stopped all the bricks. There wasn’t much in the logs either, but I noticed my /etc/glusterfs/glusterd.vol files were not the same as the newer versions, so I updated them, restarted glusterd, and suddenly the updated node showed as peer-in-cluster again. Once I updated the other nodes the same way, things started working again. Maybe a place to look?

My old config (all nodes):
volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option transport-type socket
    option transport.socket.keepalive-time 10
    option transport.socket.keepalive-interval 2
    option transport.socket.read-fail-log off
    option ping-timeout 10
    option event-threads 1
    option rpc-auth-allow-insecure on
#   option transport.address-family inet6
#   option base-port 49152
end-volume

changed to:
volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option transport-type socket,rdma
    option transport.socket.keepalive-time 10
    option transport.socket.keepalive-interval 2
    option transport.socket.read-fail-log off
    option transport.socket.listen-port 24007
    option transport.rdma.listen-port 24008
    option ping-timeout 0
    option event-threads 1
    option rpc-auth-allow-insecure on
#   option lock-timer 180
#   option transport.address-family inet6
#   option base-port 49152
    option max-port  60999
end-volume
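
To make the differences between the two files easier to see, here is a small sketch that parses the active (uncommented) `option` lines of both configs quoted above and reports what was added, removed, or changed. The helper names are illustrative, not part of gluster, and the thread does not establish which of these differences actually resolved the peer problem.

```python
OLD_VOL = """\
volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option transport-type socket
    option transport.socket.keepalive-time 10
    option transport.socket.keepalive-interval 2
    option transport.socket.read-fail-log off
    option ping-timeout 10
    option event-threads 1
    option rpc-auth-allow-insecure on
#   option transport.address-family inet6
#   option base-port 49152
end-volume
"""

NEW_VOL = """\
volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option transport-type socket,rdma
    option transport.socket.keepalive-time 10
    option transport.socket.keepalive-interval 2
    option transport.socket.read-fail-log off
    option transport.socket.listen-port 24007
    option transport.rdma.listen-port 24008
    option ping-timeout 0
    option event-threads 1
    option rpc-auth-allow-insecure on
#   option lock-timer 180
#   option transport.address-family inet6
#   option base-port 49152
    option max-port  60999
end-volume
"""


def parse_volfile_options(text):
    """Return {name: value} for every uncommented 'option' line."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#") or not line.startswith("option "):
            continue
        parts = line.split(None, 2)  # ['option', name, value]
        if len(parts) == 3:
            options[parts[1]] = parts[2].strip()
    return options


def diff_options(old, new):
    """Return (added, removed, changed) between two option dicts."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    changed = {k: (old[k], new[k])
               for k in old.keys() & new.keys() if old[k] != new[k]}
    return added, removed, changed


added, removed, changed = diff_options(
    parse_volfile_options(OLD_VOL), parse_volfile_options(NEW_VOL))
print("added:  ", added)
print("removed:", removed)
print("changed:", changed)
```

For these two files the active differences come down to the two listen-port options and max-port being added, transport-type going from socket to socket,rdma, and ping-timeout going from 10 to 0.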

The only thing I found in the glusterd logs that looks relevant was the following (repeated for both of the other nodes in this cluster), so I have no clue why it happened:
[2019-04-03 20:19:16.802638] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <ossuary-san> (<0ecbf953-681b-448f-9746-d1c1fe7a0978>), in state <Peer in Cluster>, has disconnected from glusterd.
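
For anyone collecting these entries across several nodes, a minimal sketch for pulling the peer name, UUID and state out of such a line follows. The regular expression is written against the single log line quoted above and is an assumption about its shape, not a documented log format, so other messages or gluster versions may need adjustments.

```python
import re

# Pattern modelled on the one disconnect message quoted above (assumed shape).
PEER_DISCONNECT = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\] (?P<level>[A-Z]) "
    r"\[MSGID: (?P<msgid>\d+)\] .*?"
    r"Peer <(?P<peer>[^>]+)> \(<(?P<uuid>[0-9a-f-]+)>\), "
    r"in state <(?P<state>[^>]+)>, has disconnected from glusterd\."
)


def parse_peer_disconnect(line):
    """Return a dict of fields for a peer-disconnect line, else None."""
    match = PEER_DISCONNECT.search(line.strip())
    return match.groupdict() if match else None


LINE = (
    "[2019-04-03 20:19:16.802638] I [MSGID: 106004] "
    "[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: "
    "Peer <ossuary-san> (<0ecbf953-681b-448f-9746-d1c1fe7a0978>), "
    "in state <Peer in Cluster>, has disconnected from glusterd."
)

event = parse_peer_disconnect(LINE)
print(event["timestamp"], event["peer"], event["state"])
```

Feeding each line of a glusterd log through parse_peer_disconnect and keeping the non-None results gives a quick timeline of which peers dropped and when.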

On Apr 2, 2019, at 4:53 AM, Atin Mukherjee <atin.mukherjee83@xxxxxxxxx> wrote:

On Mon, 1 Apr 2019 at 10:28, Hari Gowtham <hgowtham@xxxxxxxxxx> wrote:
Comments inline.

On Mon, Apr 1, 2019 at 5:55 AM Sankarshan Mukhopadhyay
<sankarshan.mukhopadhyay@xxxxxxxxx> wrote:
>
> Quite a considerable amount of detail here. Thank you!
>
> On Fri, Mar 29, 2019 at 11:42 AM Hari Gowtham <hgowtham@xxxxxxxxxx> wrote:
> >
> > Hello Gluster users,
> >
> > As you are all aware, glusterfs-6 is out. We would like to inform you
> > that we have spent a significant amount of time testing
> > glusterfs-6 in upgrade scenarios. We have done upgrade testing to
> > glusterfs-6 from various releases like 3.12, 4.1 and 5.3.
> >
> > As glusterfs-6 brings in a lot of changes, we wanted to test those portions.
> > There were xlators (and respective options to enable/disable them)
> > added and deprecated in glusterfs-6 from various versions [1].
> >
> > We had to check the following upgrade scenarios for all such options
> > identified in [1]:
> > 1) option never enabled and upgraded
> > 2) option enabled and then upgraded
> > 3) option enabled and then disabled and then upgraded
> >
> > We weren't able to manually check all the combinations for all the options,
> > so the options involving enabling and disabling xlators were prioritized.
> > Below are the results for the ones tested.
> >
> > Never enabled and upgraded:
> > Checked from 3.12, 4.1 and 5.3 to 6; the upgrade works.
> >
> > Enabled and upgraded:
> > Tested for tier, which is deprecated. It is not a recommended upgrade.
> > As expected, the volume won't be consumable and will have a few more
> > issues as well.
> > Tested with 3.12, 4.1 and 5.3 to 6 upgrade.
> >
> > Enabled, then disabled before upgrade:
> > Tested for tier with 3.12 and the upgrade went fine.
> >
> > There is one common issue to note in every upgrade. The node being
> > upgraded goes into a disconnected state. You have to flush the iptables
> > and then restart glusterd on all nodes to fix this.
> >
>
> Is this something that is written in the upgrade notes? I do not seem
> to recall it. If not, I'll send a PR.

No, this wasn't mentioned in the release notes. PRs are welcome.

>
> > The testing for enabling new options is still pending. The new options
> > won't cause as many issues as the deprecated ones, so this was put at
> > the end of the priority list. It would be nice to get contributions
> > for this.
> >
>
> Did the range of tests lead to any new issues?

Yes. In the first round of testing we found an issue and had to postpone the
release of 6 until the fix was made available.
https://bugzilla.redhat.com/show_bug.cgi?id=1684029

We then tested it again after this patch was made available
and came across this:
https://bugzilla.redhat.com/show_bug.cgi?id=1694010

This isn’t a bug, as we found that the upgrade worked seamlessly in two different setups. So we have no issues in the upgrade path to the glusterfs-6 release.

I have mentioned in the second mail how to overcome this situation
for now, until the fix is available.

>
> > For the disable testing, tier was used as it covers most of the xlators
> > that were removed. All of these tests were done on a replica 3 volume.
> >
>
> I'm not sure if the Glusto team is reading this, but it would be
> pertinent to understand if the approach you have taken can be
> converted into a form of automated testing pre-release.

I don't have an answer for this; I have CCed Vijay.
He might have an idea.

>
> > Note: This is only for upgrade testing of the newly added and removed
> > xlators. It does not involve the normal tests for the xlators.
> >
> > If you have any questions, please feel free to reach out to us.
> >
> > [1] https://docs.google.com/spreadsheets/d/1nh7T5AXaV6kc5KgILOy2pEqjzC3t_R47f1XUXSVFetI/edit?usp=sharing
> >
> > Regards,
> > Hari and Sanju.
> _______________________________________________
> Gluster-users mailing list
> Gluster-users@xxxxxxxxxxx
> https://lists.gluster.org/mailman/listinfo/gluster-users

-- 
Regards,
Hari Gowtham.
-- 
--Atin


-- 
Thanks,
Sanju
