Have you checked the status of the GFIDs that are still pending heal?
I usually use method 2 from https://docs.gluster.org/en/main/Troubleshooting/gfid-to-path/ to identify the file on the brick.
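If it helps, this is roughly what that brick-side lookup boils down to (just a sketch; the brick path below is an example and the GFID is the one from your glfsheal log):

  GFID=24977f2f-5fbe-44f2-91bd-605eda824aff
  BRICK=/bricks/b2/br   # example - use the brick that reports the entry
  # for regular files the .glusterfs entry is a hard link to the real file
  ls -li "$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID"
  # resolve the real path via the shared inode, skipping .glusterfs itself
  find "$BRICK" -samefile "$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID" -not -path "*/.glusterfs/*"
  # for directories the .glusterfs entry is a symlink, so readlink shows the parent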
Then you can use getfattr to identify the status of the files on the bricks.
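For example (the file path is a placeholder - run it on each brick of the replica pair that holds the file and compare the output):

  getfattr -d -m . -e hex /bricks/b2/br/path/to/the/file
  # non-zero trusted.afr.glust-distr-rep-client-* counters mean pending
  # changes against the other brick; if both copies blame each other,
  # that file is in split-brain and needs manual resolution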
As you have 3 hosts, you can always add an arbiter to each replica pair and mitigate the risk of split-brain, as sketched below.
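The conversion itself is a single add-brick in arbiter mode, something like this (syntax sketch only; the arbiter brick paths are placeholders - you need nine of them, one per replica pair, listed in the same order as the existing pairs, each on the host that doesn't already hold that pair's data bricks):

  gluster volume add-brick glust-distr-rep replica 3 arbiter 1 \
      HOST:/path/to/arbiter-brick-1 HOST:/path/to/arbiter-brick-2 ... HOST:/path/to/arbiter-brick-9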
Best Regards,
Strahil Nikolov
On Wed, Aug 3, 2022 at 16:33, Eli V <eliventer@xxxxxxxxx> wrote:

Sequence of events which ended up with 2 bricks down and a heal
failure. What should I do about the heal failure, and before or after
replacing the bad disk? First, gluster 10.2 info

Volume Name: glust-distr-rep
Type: Distributed-Replicate
Volume ID: fe0ea6f6-2d1b-4b5c-8af5-0c11ea546270
Status: Started
Snapshot Count: 0
Number of Bricks: 9 x 2 = 18
Transport-type: tcp
Bricks:
Brick1: md1cfsd01:/bricks/b0/br
Brick2: md1cfsd02:/bricks/b0/br
Brick3: md1cfsd03:/bricks/b0/br
Brick4: md1cfsd01:/bricks/b3/br
Brick5: md1cfsd02:/bricks/b3/br
Brick6: md1cfsd03:/bricks/b3/br
Brick7: md1cfsd01:/bricks/b1/br
Brick8: md1cfsd02:/bricks/b1/br
Brick9: md1cfsd03:/bricks/b1/br
Brick10: md1cfsd01:/bricks/b4/br
Brick11: md1cfsd02:/bricks/b4/br
Brick12: md1cfsd03:/bricks/b4/br
Brick13: md1cfsd01:/bricks/b2/br
Brick14: md1cfsd02:/bricks/b2/br
Brick15: md1cfsd03:/bricks/b2/br
Brick16: md1cfsd01:/bricks/b5/br
Brick17: md1cfsd02:/bricks/b5/br
Brick18: md1cfsd03:/bricks/b5/br
Options Reconfigured:
performance.md-cache-statfs: on
cluster.server-quorum-type: server
cluster.min-free-disk: 15
storage.batch-fsync-delay-usec: 0
user.smb: enable
features.cache-invalidation: on
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet

Fun started with a brick (d02:b5) crashing:

[2022-08-02 18:59:29.417147 +0000] W [rpcsvc.c:1323:rpcsvc_callback_submit] 0-rpcsvc: transmission of rpc-request failed
pending frames:
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
patchset: git://git.gluster.org/glusterfs.git
signal received: 7
time of crash:
2022-08-02 18:59:29 +0000
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 10.2
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x28a54)[0x7fefb20f7a54]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_print_trace+0x700)[0x7fefb20fffc0]
/lib/x86_64-linux-gnu/libc.so.6(+0x3bd60)[0x7fefb1ecdd60]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(__gf_free+0x5a)[0x7fefb211c7aa]
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_unref+0x9a)[0x7fefb209e4fa]
/usr/lib/x86_64-linux-gnu/glusterfs/10.2/xlator/protocol/server.so(+0xaf4b)[0x7fefac1fff4b]
/usr/lib/x86_64-linux-gnu/glusterfs/10.2/xlator/protocol/server.so(+0xb964)[0x7fefac200964]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(xlator_notify+0x34)[0x7fefb20eb244]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(default_notify+0x1ab)[0x7fefb217cf2b]
...

Then a few hours later a read error on a different brick (b2) on the same host:

[2022-08-02 22:04:17.808970 +0000] E [MSGID: 113040] [posix-inode-fd-ops.c:1758:posix_readv] 0-glust-distr-rep-posix: read failed on gfid=16b51498-966e-4546-b561-24b0062f4324, fd=0x7ff9f00d6b08, offset=663314432 size=16384, buf=0x7ff9fc0f7000 [Input/output error]
[2022-08-02 22:04:17.809057 +0000] E [MSGID: 115068] [server-rpc-fops_v2.c:1369:server4_readv_cbk] 0-glust-distr-rep-server: READ info [{frame=1334746}, {READV_fd_no=4}, {uuid_utoa=16b51498-966e-4546-b561-24b0062f4324}, {client=CTX_ID:6d7535af-769c-4223-aad0-79acffa836ed-GRAPH_ID:0-PID:1414-HOST:r4-16-PC_NAME:glust-distr-rep-client-13-RECON_NO:-1}, {error-xlator=glust-distr-rep-posix}, {errno=5}, {error=Input/output error}]

This looks like a real hardware error:

[Tue Aug 2 18:03:48 2022] megaraid_sas 0000:03:00.0: 6293 (712778647s/0x0002/FATAL) - Unrecoverable medium error during recovery on PD 04(e0x20/s4) at 1d267163
[Tue Aug 2 18:03:49 2022] sd 0:2:3:0: [sdd] tag#435 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=3s
[Tue Aug 2 18:03:49 2022] sd 0:2:3:0: [sdd] tag#435 CDB: Read(10) 28 00 1d 26 70 78 00 01 00 00
[Tue Aug 2 18:03:49 2022] blk_update_request: I/O error, dev sdd, sector 489058424 op 0x0:(READ) flags 0x80700 phys_seg 9 prio class 0

This morning noticing both b2 & b5 were offline, systemctl stopped and
started glusterd to restart the bricks. All bricks are now up:

Status of volume: glust-distr-rep
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick md1cfsd01:/bricks/b0/br               55386     0          Y       2047
Brick md1cfsd02:/bricks/b0/br               59983     0          Y       3036416
Brick md1cfsd03:/bricks/b0/br               58028     0          Y       2014
Brick md1cfsd01:/bricks/b3/br               59454     0          Y       2041
Brick md1cfsd02:/bricks/b3/br               52352     0          Y       3036421
Brick md1cfsd03:/bricks/b3/br               56786     0          Y       2017
Brick md1cfsd01:/bricks/b1/br               59885     0          Y       2040
Brick md1cfsd02:/bricks/b1/br               55148     0          Y       3036434
Brick md1cfsd03:/bricks/b1/br               52422     0          Y       2068
Brick md1cfsd01:/bricks/b4/br               56378     0          Y       2099
Brick md1cfsd02:/bricks/b4/br               60152     0          Y       3036470
Brick md1cfsd03:/bricks/b4/br               50448     0          Y       2490448
Brick md1cfsd01:/bricks/b2/br               49455     0          Y       2097
Brick md1cfsd02:/bricks/b2/br               53717     0          Y       3036498
Brick md1cfsd03:/bricks/b2/br               51838     0          Y       2124
Brick md1cfsd01:/bricks/b5/br               51002     0          Y       2104
Brick md1cfsd02:/bricks/b5/br               57204     0          Y       3036523
Brick md1cfsd03:/bricks/b5/br               56817     0          Y       2123
Self-heal Daemon on localhost               N/A       N/A        Y       3036660
Self-heal Daemon on md1cfsd03               N/A       N/A        Y       2627
Self-heal Daemon on md1cfsd01               N/A       N/A        Y       2623

Then manually triggered a heal, which healed thousands of files but
now is stuck on the last 47 according to heal info summary.
glfsheal-glust-distr-rep.log has a bunch of entries like so:

[2022-08-03 13:08:41.169387 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:2618:client4_0_lookup_cbk] 0-glust-distr-rep-client-16: remote operation failed. [{path=<gfid:24977f2f-5fbe-44f2-91bd-605eda824aff>}, {gfid=24977f2f-5fbe-44f2-91bd-605eda824aff}, {errno=2}, {error=No such file or directory}]
________

Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk

Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users