Re: gnfs split brain when 1 server in 3x1 down (high load) - help request

From your reply in the other thread, I'm assuming that the file/gfid in question is not in genuine split-brain and does not need a heal. For example, in the test case with 1 brick down and 2 bricks up, if you tried to read the file from a temporary fuse mount (which is also connected to only 2 bricks, since the 3rd one is down), it works fine and there is no EIO error...

...which means that what you have observed is true, i.e. afr_read_txn_refresh_done() is called with err=EIO. You can add logs to see at what point EIO is set. The call graph is like this: afr_inode_refresh_done()-->afr_txn_refresh_done()-->afr_read_txn_refresh_done().

Maybe https://github.com/gluster/glusterfs/blob/v7.4/xlators/cluster/afr/src/afr-common.c#L1188 in afr_txn_refresh_done() is causing it either due to ret being -EIO or event_generation being zero.

If you are comfortable with gdb, you can put a conditional breakpoint in afr_read_txn_refresh_done() at https://github.com/gluster/glusterfs/blob/v7.4/xlators/cluster/afr/src/afr-read-txn.c#L283 for when err=EIO, and then check the backtrace to see who is setting err to EIO.

Regards,
Ravi
On 31/03/20 12:20 pm, Erik Jacobson wrote:
I note that this part of afr_read_txn() gets triggered a lot.

     if (afr_is_inode_refresh_reqd(inode, this, local->event_generation,
                                   event_generation)) {

Maybe that's normal when one of the three servers is down (but why
isn't it using its local copy by default?)

The comment in that if block is:
         /* servers have disconnected / reconnected, and possibly
            rebooted, very likely changing the state of freshness
            of copies */

But we have one server consistently down, not a changing situation.

A lot of digging seemed to show this is related to cache
invalidation, because the paths seemed to suggest the inode needed
refreshing, and that seems to be handled by a case statement named
GF_UPCALL_CACHE_INVALIDATION.

However, that must have been a wrong turn since turning off
cache invalidation didn't help.

I'm struggling to wrap my head around the code base and without the
background in these concepts it's a tough hill to climb.

I am going to have to try this again some day with fresh eyes and go to
bed; the machine I have easy access to is going away in the morning.
Now I'll have to reserve time on a contended one but I will do that and
continue digging.

Any suggestions would be greatly appreciated as I think I'm starting to
tip over here on this one.


On Mon, Mar 30, 2020 at 04:04:39PM -0500, Erik Jacobson wrote:
Sadly I am not a developer, so I can't answer your questions.
I'm not an FS or network developer either. I think there is a joke about
playing one on TV but maybe it's Netflix now.

Enabling certain debug options produced too much information for me to
follow personally (but an expert could probably get through it).

So I started putting targeted 'print' (gf_msg) statements in the code to
see how it got its way to split-brain. Maybe this will ring a bell
for someone.

I can tell the only way we enter the split-brain path is through the
first if statement of afr_read_txn_refresh_done().

This means afr_read_txn_refresh_done() itself was passed "err" and
that it appears thin_arbiter_count was not set (which makes sense,
I'm using 1x3, not a thin arbiter).

So we jump to the readfn label, and read_subvol should still be -1.
If I read right, it must mean that this if didn't return true because
my print statement didn't appear:
if ((ret == 0) && spb_choice >= 0) {

So we're still with the original read_subvol == -1,
which gets us to the split-brain message.

So now I will try to learn why afr_read_txn_refresh_done() would have
'err' set in the first place. I will also learn about
afr_inode_split_brain_choice_get(). Those seem to be the two methods to
have avoided falling in to the split brain hole here.
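
For readers following along, the branch logic described above can be condensed into a small standalone model. This is only a sketch built from the lines visible in the diff further down: every toy_* name, the outcome enum and the result struct are hypothetical and do not exist in glusterfs, the code elided between the subvolume selection and the readfn label is not modelled, and "split-brain path" here just means reaching AFR_SET_ERROR_AND_CHECK_SPLIT_BRAIN() with read_subvol still at -1.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical outcome codes; they exist only in this toy model. */
enum toy_outcome {
    TOY_WIND_READ,          /* a readable subvolume was chosen */
    TOY_QUERY_THIN_ARBITER, /* the thin-arbiter query branch was taken */
    TOY_SPLIT_BRAIN_PATH    /* read_subvol stayed -1 at the final check */
};

struct toy_result {
    enum toy_outcome outcome;
    int read_subvol; /* -1 means "no readable subvolume" */
    int err;
};

/*
 * Toy model of the branches of afr_read_txn_refresh_done() that are
 * visible in the diff. The parameters stand in for real state:
 *   err                - error passed in by the caller
 *   thin_arbiter_count - priv->thin_arbiter_count
 *   policy_subvol      - result of afr_read_subvol_select_by_policy()
 *   spb_ret/spb_choice - result of afr_inode_split_brain_choice_get()
 */
struct toy_result
toy_read_txn_refresh_done(int err, int thin_arbiter_count, int policy_subvol,
                          int spb_ret, int spb_choice)
{
    struct toy_result r = {TOY_WIND_READ, -1, err};

    if (err) {
        /* Only a thin-arbiter volume with err == EINVAL avoids readfn. */
        if (thin_arbiter_count && err == EINVAL) {
            r.outcome = TOY_QUERY_THIN_ARBITER;
            return r;
        }
        /* Otherwise: goto readfn with read_subvol still -1. */
    } else {
        r.read_subvol = policy_subvol;
        if (r.read_subvol == -1)
            r.err = EIO;
    }

    /* readfn: an explicit split-brain choice can still rescue the read. */
    if (r.read_subvol == -1 && spb_ret == 0 && spb_choice >= 0)
        r.read_subvol = spb_choice;

    if (r.read_subvol == -1)
        r.outcome = TOY_SPLIT_BRAIN_PATH;

    return r;
}

/* Small helper for printing an outcome while experimenting. */
const char *
toy_outcome_name(enum toy_outcome o)
{
    switch (o) {
    case TOY_WIND_READ:          return "wind read";
    case TOY_QUERY_THIN_ARBITER: return "query thin arbiter";
    default:                     return "split-brain path";
    }
}
```

Calling toy_read_txn_refresh_done(EIO, 0, 1, -1, -1) mirrors the situation described here: err is already set on entry, there is no thin arbiter, and no split-brain choice is recorded, so the policy selection is never consulted and the model ends on the split-brain path. In this model the only two ways out are err being 0 on entry or a valid split-brain choice being returned.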


I put debug statements in these locations. I will mark with !!!!!! what
I see:



diff -Narup glusterfs-7.2-orig/xlators/cluster/afr/src/afr-read-txn.c glusterfs-7.2-new/xlators/cluster/afr/src/afr-read-txn.c
--- glusterfs-7.2-orig/xlators/cluster/afr/src/afr-read-txn.c	2020-01-15 11:43:53.887894293 -0600
+++ glusterfs-7.2-new/xlators/cluster/afr/src/afr-read-txn.c	2020-03-30 15:45:02.917104321 -0500
@@ -279,10 +279,14 @@ afr_read_txn_refresh_done(call_frame_t *
      priv = this->private;

      if (err) {
-        if (!priv->thin_arbiter_count)
+        if (!priv->thin_arbiter_count) {
+            gf_msg(this->name, GF_LOG_ERROR,0,0,"erikj dbg crapola 1st if in afr_read_txn_refresh_done() !priv->thin_arbiter_count -- goto to readfn");
!!!!!!!!!!!!!!!!!!!!!!
We hit this error condition and jump to readfn below
!!!!!!!!!!!!!!!!!!!!!!!
              goto readfn;
-        if (err != EINVAL)
+        }
+        if (err != EINVAL) {
+            gf_msg(this->name, GF_LOG_ERROR,0,0,"erikj 2nd if in afr_read_txn_refresh_done() err != EINVAL, goto readfn");
              goto readfn;
+        }
          /* We need to query the good bricks and/or thin-arbiter.*/
          afr_ta_read_txn_synctask(frame, this);
          return 0;
@@ -291,6 +295,8 @@ afr_read_txn_refresh_done(call_frame_t *
      read_subvol = afr_read_subvol_select_by_policy(inode, this, local->readable,
                                                     NULL);
      if (read_subvol == -1) {
+        gf_msg(this->name, GF_LOG_ERROR,0,0,"erikj dbg whoops read_subvol returned -1, going to readfn");
+
          err = EIO;
          goto readfn;
      }
@@ -304,11 +310,15 @@ afr_read_txn_refresh_done(call_frame_t *
  readfn:
      if (read_subvol == -1) {
          ret = afr_inode_split_brain_choice_get(inode, this, &spb_choice);
-        if ((ret == 0) && spb_choice >= 0)
+        if ((ret == 0) && spb_choice >= 0) {
!!!!!!!!!!!!!!!!!!!!!!
We never get here, afr_inode_split_brain_choice_get() must not have
returned what was needed to enter.
!!!!!!!!!!!!!!!!!!!!!!
+            gf_msg(this->name, GF_LOG_ERROR,0,0,"erikj dbg read_subvol was -1 to begin with split brain choice found: %d", spb_choice);
              read_subvol = spb_choice;
+        }
      }

      if (read_subvol == -1) {
+       gf_msg(this->name, GF_LOG_ERROR,0,0,"erikj dbg verify this shows up above split-brain error");
!!!!!!!!!!!!!!!!!!!!!!
We hit here. Game over player.
!!!!!!!!!!!!!!!!!!!!!!
+
          AFR_SET_ERROR_AND_CHECK_SPLIT_BRAIN(-1, err);
      }
      afr_read_txn_wind(frame, this, read_subvol);

________



Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968

Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users

