Sequence of events that ended with two bricks down and a heal failure. What should I do about the heal failure, and should I do it before or after replacing the bad disk?

First, the volume info (gluster 10.2):

Volume Name: glust-distr-rep
Type: Distributed-Replicate
Volume ID: fe0ea6f6-2d1b-4b5c-8af5-0c11ea546270
Status: Started
Snapshot Count: 0
Number of Bricks: 9 x 2 = 18
Transport-type: tcp
Bricks:
Brick1: md1cfsd01:/bricks/b0/br
Brick2: md1cfsd02:/bricks/b0/br
Brick3: md1cfsd03:/bricks/b0/br
Brick4: md1cfsd01:/bricks/b3/br
Brick5: md1cfsd02:/bricks/b3/br
Brick6: md1cfsd03:/bricks/b3/br
Brick7: md1cfsd01:/bricks/b1/br
Brick8: md1cfsd02:/bricks/b1/br
Brick9: md1cfsd03:/bricks/b1/br
Brick10: md1cfsd01:/bricks/b4/br
Brick11: md1cfsd02:/bricks/b4/br
Brick12: md1cfsd03:/bricks/b4/br
Brick13: md1cfsd01:/bricks/b2/br
Brick14: md1cfsd02:/bricks/b2/br
Brick15: md1cfsd03:/bricks/b2/br
Brick16: md1cfsd01:/bricks/b5/br
Brick17: md1cfsd02:/bricks/b5/br
Brick18: md1cfsd03:/bricks/b5/br
Options Reconfigured:
performance.md-cache-statfs: on
cluster.server-quorum-type: server
cluster.min-free-disk: 15
storage.batch-fsync-delay-usec: 0
user.smb: enable
features.cache-invalidation: on
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet

The fun started with a brick (d02:b5) crashing:

[2022-08-02 18:59:29.417147 +0000] W [rpcsvc.c:1323:rpcsvc_callback_submit] 0-rpcsvc: transmission of rpc-request failed
pending frames:
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
patchset: git://git.gluster.org/glusterfs.git
signal received: 7
time of crash:
2022-08-02 18:59:29 +0000
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 10.2
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x28a54)[0x7fefb20f7a54]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_print_trace+0x700)[0x7fefb20fffc0]
/lib/x86_64-linux-gnu/libc.so.6(+0x3bd60)[0x7fefb1ecdd60]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(__gf_free+0x5a)[0x7fefb211c7aa]
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_unref+0x9a)[0x7fefb209e4fa]
/usr/lib/x86_64-linux-gnu/glusterfs/10.2/xlator/protocol/server.so(+0xaf4b)[0x7fefac1fff4b]
/usr/lib/x86_64-linux-gnu/glusterfs/10.2/xlator/protocol/server.so(+0xb964)[0x7fefac200964]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(xlator_notify+0x34)[0x7fefb20eb244]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(default_notify+0x1ab)[0x7fefb217cf2b]
...
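(Aside: as far as I can tell from the docs, a single crashed brick process can be brought back without bouncing glusterd on the node, since "start force" only spawns bricks that are offline. Something like:

    # restart only the brick processes that are down; running bricks are left alone
    gluster volume start glust-distr-rep force

I'm noting it here because it might have avoided the full glusterd restart described below.)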
Then, a few hours later, a read error on a different brick (b2) on the same host:

[2022-08-02 22:04:17.808970 +0000] E [MSGID: 113040] [posix-inode-fd-ops.c:1758:posix_readv] 0-glust-distr-rep-posix: read failed on gfid=16b51498-966e-4546-b561-24b0062f4324, fd=0x7ff9f00d6b08, offset=663314432 size=16384, buf=0x7ff9fc0f7000 [Input/output error]
[2022-08-02 22:04:17.809057 +0000] E [MSGID: 115068] [server-rpc-fops_v2.c:1369:server4_readv_cbk] 0-glust-distr-rep-server: READ info [{frame=1334746}, {READV_fd_no=4}, {uuid_utoa=16b51498-966e-4546-b561-24b0062f4324}, {client=CTX_ID:6d7535af-769c-4223-aad0-79acffa836ed-GRAPH_ID:0-PID:1414-HOST:r4-16-PC_NAME:glust-distr-rep-client-13-RECON_NO:-1}, {error-xlator=glust-distr-rep-posix}, {errno=5}, {error=Input/output error}]

This looks like a real hardware error:

[Tue Aug 2 18:03:48 2022] megaraid_sas 0000:03:00.0: 6293 (712778647s/0x0002/FATAL) - Unrecoverable medium error during recovery on PD 04(e0x20/s4) at 1d267163
[Tue Aug 2 18:03:49 2022] sd 0:2:3:0: [sdd] tag#435 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=3s
[Tue Aug 2 18:03:49 2022] sd 0:2:3:0: [sdd] tag#435 CDB: Read(10) 28 00 1d 26 70 78 00 01 00 00
[Tue Aug 2 18:03:49 2022] blk_update_request: I/O error, dev sdd, sector 489058424 op 0x0:(READ) flags 0x80700 phys_seg 9 prio class 0

This morning, noticing that both b2 and b5 were offline, I stopped and started glusterd via systemctl to restart the bricks. All bricks are now up:

Status of volume: glust-distr-rep
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick md1cfsd01:/bricks/b0/br               55386     0          Y       2047
Brick md1cfsd02:/bricks/b0/br               59983     0          Y       3036416
Brick md1cfsd03:/bricks/b0/br               58028     0          Y       2014
Brick md1cfsd01:/bricks/b3/br               59454     0          Y       2041
Brick md1cfsd02:/bricks/b3/br               52352     0          Y       3036421
Brick md1cfsd03:/bricks/b3/br               56786     0          Y       2017
Brick md1cfsd01:/bricks/b1/br               59885     0          Y       2040
Brick md1cfsd02:/bricks/b1/br               55148     0          Y       3036434
Brick md1cfsd03:/bricks/b1/br               52422     0          Y       2068
Brick md1cfsd01:/bricks/b4/br               56378     0          Y       2099
Brick md1cfsd02:/bricks/b4/br               60152     0          Y       3036470
Brick md1cfsd03:/bricks/b4/br               50448     0          Y       2490448
Brick md1cfsd01:/bricks/b2/br               49455     0          Y       2097
Brick md1cfsd02:/bricks/b2/br               53717     0          Y       3036498
Brick md1cfsd03:/bricks/b2/br               51838     0          Y       2124
Brick md1cfsd01:/bricks/b5/br               51002     0          Y       2104
Brick md1cfsd02:/bricks/b5/br               57204     0          Y       3036523
Brick md1cfsd03:/bricks/b5/br               56817     0          Y       2123
Self-heal Daemon on localhost               N/A       N/A        Y       3036660
Self-heal Daemon on md1cfsd03               N/A       N/A        Y       2627
Self-heal Daemon on md1cfsd01               N/A       N/A        Y       2623

I then manually triggered a heal, which healed thousands of files but is now stuck on the last 47 according to heal info summary. glfsheal-glust-distr-rep.log has a bunch of entries like this:

[2022-08-03 13:08:41.169387 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:2618:client4_0_lookup_cbk] 0-glust-distr-rep-client-16: remote operation failed. [{path=<gfid:24977f2f-5fbe-44f2-91bd-605eda824aff>}, {gfid=24977f2f-5fbe-44f2-91bd-605eda824aff}, {errno=2}, {error=No such file or directory}]
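In case it helps diagnose those 47: my understanding is that a <gfid:...> entry can be mapped back to a real path on a brick via the .glusterfs tree, where the first two path components are the first two bytes of the gfid. A rough sketch, using the gfid from the log above and a placeholder brick path:

    # run on the server hosting the suspect brick
    BRICK=/bricks/b5/br          # placeholder: whichever brick the failing client-N translator maps to
    GFID=24977f2f-5fbe-44f2-91bd-605eda824aff
    ls -l $BRICK/.glusterfs/24/97/$GFID
    # for regular files the .glusterfs entry is a hardlink (directories get a
    # symlink instead), so the real path can be found with:
    find $BRICK -samefile $BRICK/.glusterfs/24/97/$GFID -not -path "*/.glusterfs/*"

If the .glusterfs entry exists on one replica but its counterpart is gone, that would be consistent with the "No such file or directory" lookups above.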
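And for the disk-replacement half of the question, the in-place flow I'd expect from my reading of the reset-brick docs (a sketch only, assuming the dying sdd backs d02's b2 brick) is:

    # take the brick on the failing disk offline
    gluster volume reset-brick glust-distr-rep md1cfsd02:/bricks/b2/br start
    # ...physically replace sdd, rebuild the filesystem, recreate the brick dir...
    # re-add the now-empty brick at the same path and let self-heal repopulate it
    gluster volume reset-brick glust-distr-rep md1cfsd02:/bricks/b2/br \
        md1cfsd02:/bricks/b2/br commit force

Which still leaves the ordering question above: should the stuck heal be resolved before or after doing this?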