Re: Replica 3 volume with forced quorum 1 fault tolerance and recovery

Strahil Nikolov <hunter86_bg@xxxxxxxxx> · Tue, 1 Dec 2020 13:15:04 +0000 (UTC)

Replica 3 with quorum 1?
This is not good, and I doubt anyone will help you with it. The idea of a replica 3 volume is to tolerate the loss of 1 node; with quorum 1, once a second node is dead, the single remaining brick keeps accepting writes on its own.

Imagine the situation where 2 bricks are down and data is written to brick 3. What happens when bricks 1 and 2 are up and running again -> how is gluster going to decide where to heal from?
2 is more than 1, so the third node should delete the file instead of the opposite.
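
As a rough illustration of the concern above, here is a minimal Python sketch. It is a simplified model, not Gluster's actual replication logic: it assumes only that a fixed quorum allows writes whenever at least quorum-count bricks are reachable. The function names are hypothetical. It checks whether a replica set can be split into two groups with no brick in common that would each still accept writes, which is the case where the copies can diverge with no obvious heal source.

```python
def writes_allowed(bricks_up: int, quorum_count: int) -> bool:
    """Simplified fixed-quorum rule: writes need at least quorum_count live bricks."""
    return bricks_up >= quorum_count


def disjoint_write_sets_possible(replica_count: int, quorum_count: int) -> bool:
    """Return True if two non-overlapping groups of bricks can both meet quorum.

    If this is possible, one group can accept writes while the other is down,
    and later the other group can accept different writes on its own.
    """
    for size_a in range(1, replica_count):
        size_b = replica_count - size_a
        if writes_allowed(size_a, quorum_count) and writes_allowed(size_b, quorum_count):
            return True
    return False


if __name__ == "__main__":
    for quorum in (1, 2):
        print(quorum, disjoint_write_sets_possible(3, quorum))
```

Under this model, replica 3 with quorum 1 allows such a split (brick 3 alone, then bricks 1 and 2 alone), while quorum 2 does not, because any two groups of 2 out of 3 bricks must share a brick.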

What are you trying to achieve with quorum 1?


Best Regards,
Strahil Nikolov






On Tuesday, 1 December 2020, 14:09:32 GMT+2, Dmitry Antipov <dmantipov@xxxxxxxxx> wrote:





It seems that the consistency of a replica 3 volume with quorum forced to 1
breaks after a few forced volume restarts initiated after 2 brick failures.
At the very least it breaks GFAPI clients, and even a volume restart doesn't help.

Volume setup is:

Volume Name: test0
Type: Replicate
Volume ID: 919352fb-15d8-49cb-b94c-c106ac68f072
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 192.168.1.112:/glusterfs/test0-000
Brick2: 192.168.1.112:/glusterfs/test0-001
Brick3: 192.168.1.112:/glusterfs/test0-002
Options Reconfigured:
cluster.quorum-count: 1
cluster.quorum-type: fixed
cluster.granular-entry-heal: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

Client is fio with the following options:

[global]
name=write
filename=testfile
ioengine=gfapi_async
volume=test0
brick=localhost
create_on_open=1
rw=randwrite
direct=1
numjobs=1
time_based=1
runtime=600

[test-4-kbytes]
bs=4k
size=1G
iodepth=128

How to reproduce:

0) start the volume;
1) run fio;
2) run 'gluster volume status', select 2 arbitrary brick processes
    and kill them;
3) make sure fio is OK;
4) wait a few seconds, then issue 'gluster volume start [VOL] force'
    to restart bricks, and finally issue 'gluster volume status' again
    to check whether all bricks are running;
5) restart from 2).
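
Steps 2) and 4) above can be automated roughly as follows. This is a sketch under stated assumptions, not a tested tool: it assumes that 'gluster volume status' prints each brick on a single line starting with "Brick " and ending with the Online (Y/N) and Pid columns, that SIGKILL is an acceptable way to kill a brick process, and that the script runs with enough privileges to do so. The helper names are hypothetical. Step 3) (checking that fio is still OK) is left to the operator.

```python
import os
import random
import signal
import subprocess
import time


def brick_pids(status_text):
    """Extract PIDs of online bricks from 'gluster volume status' text.

    Assumption: one line per brick, first field 'Brick', last two fields
    are Online (Y/N) and Pid. Offline bricks and other daemons are skipped.
    """
    pids = []
    for line in status_text.splitlines():
        fields = line.split()
        if (len(fields) >= 3 and fields[0] == "Brick"
                and fields[-2] == "Y" and fields[-1].isdigit()):
            pids.append(int(fields[-1]))
    return pids


def volume_status(volume):
    """Run 'gluster volume status VOLUME' and return its output as text."""
    result = subprocess.run(["gluster", "volume", "status", volume],
                            check=True, capture_output=True, text=True)
    return result.stdout


def kill_two_and_restart(volume, wait_seconds=5):
    """One iteration: kill 2 random bricks, wait, force-start, count bricks."""
    victims = random.sample(brick_pids(volume_status(volume)), 2)
    for pid in victims:
        os.kill(pid, signal.SIGKILL)  # assumption: SIGKILL; the steps only say "kill"
    time.sleep(wait_seconds)
    subprocess.run(["gluster", "volume", "start", volume, "force"], check=True)
    return len(brick_pids(volume_status(volume)))
```

Calling kill_two_and_restart("test0") in a loop while fio runs, and stopping when it returns fewer than 3 or fio reports an error, corresponds to repeating steps 2) to 5).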

This is likely to work a few times but, sooner or later, it breaks
at 3) and fio detects an I/O error, most probably EIO or ENOTCONN. From
this point on, killing and restarting fio yields an error in glfs_creat(),
and even a manual volume restart doesn't help.

NOTE: as of 7914c6147adaf3ef32804519ced850168fff1711, fio's gfapi_async
engine is still incomplete and _silently ignores I/O errors_. Currently
I'm using the following tweak to detect and report them (YMMV, consider
experimental):

diff --git a/engines/glusterfs_async.c b/engines/glusterfs_async.c
index 0392ad6e..27ebb6f1 100644
--- a/engines/glusterfs_async.c
+++ b/engines/glusterfs_async.c
@@ -7,6 +7,7 @@
  #include "gfapi.h"
  #define NOT_YET 1
  struct fio_gf_iou {
+    struct thread_data *td;
      struct io_u *io_u;
      int io_complete;
  };
@@ -80,6 +81,7 @@ static int fio_gf_io_u_init(struct thread_data *td, struct io_u *io_u)
      }
      io->io_complete = 0;
      io->io_u = io_u;
+    io->td = td;
      io_u->engine_data = io;
      return 0;
  }
@@ -95,7 +97,20 @@ static void gf_async_cb(glfs_fd_t * fd, ssize_t ret, void *data)
      struct fio_gf_iou *iou = io_u->engine_data;

      dprint(FD_IO, "%s ret %zd\n", __FUNCTION__, ret);
-    iou->io_complete = 1;
+    if (ret != io_u->xfer_buflen) {
+        if (ret >= 0) {
+            io_u->resid = io_u->xfer_buflen - ret;
+            io_u->error = 0;
+            iou->io_complete = 1;
+        } else
+            io_u->error = errno;
+    }
+
+    if (io_u->error) {
+        log_err("IO failed (%s).\n", strerror(io_u->error));
+        td_verror(iou->td, io_u->error, "xfer");
+    } else
+        iou->io_complete = 1;
  }

  static enum fio_q_status fio_gf_async_queue(struct thread_data fio_unused * td,

--

Dmitry
________



Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users