The number of heal-pending entries on citadel, the server that was upgraded to 7.5, has now grown to tens of thousands and continues to climb.
On Thu, Apr 30, 2020 at 2:57 PM Artem Russakovskii <archon810@xxxxxxxxx> wrote:
Hi all,

Today, I decided to upgrade one of our four servers (citadel) from 5.13 to 7.5. There are 2 volumes, 1x4 replicate, and fuse mounts (I sent the full details earlier in another message). If everything had looked OK, I would have proceeded with the rolling upgrade of the rest once the full heal completed.

However, as soon as I upgraded and restarted, the logs filled with messages like these:

[2020-04-30 21:39:21.316149] E [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
[2020-04-30 21:39:21.382891] E [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
[2020-04-30 21:39:21.442440] E [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
[2020-04-30 21:39:21.445587] E [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
[2020-04-30 21:39:21.571398] E [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
[2020-04-30 21:39:21.668192] E [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully

The message "I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-androidpolice_data3-replicate-0: selecting local read_child androidpolice_data3-client-3" repeated 10 times between [2020-04-30 21:46:41.854675] and [2020-04-30 21:48:20.206323]
The message "W [MSGID: 114031] [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk] 0-androidpolice_data3-client-1: remote operation failed [Transport endpoint is not connected]" repeated 264 times between [2020-04-30 21:46:32.129567] and [2020-04-30 21:48:29.905008]
The message "W [MSGID: 114031] [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk] 0-androidpolice_data3-client-0: remote operation failed [Transport endpoint is not connected]" repeated 264 times between [2020-04-30 21:46:32.129602] and [2020-04-30 21:48:29.905040]
The message "W [MSGID: 114031] [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk] 0-androidpolice_data3-client-2: remote operation failed [Transport endpoint is not connected]" repeated 264 times between [2020-04-30 21:46:32.129512] and [2020-04-30 21:48:29.905047]

Once in a while, I'm seeing this:

==> bricks/mnt-hive_block4-androidpolice_data3.log <==
[2020-04-30 21:45:54.251637] I [MSGID: 115072] [server-rpc-fops_v2.c:1681:server4_setattr_cbk] 0-androidpolice_data3-server: 5725811: SETATTR /androidpolice.com/public/wp-content/uploads/2019/03/cielo-breez-plus-hero.png (d4556eb4-f15b-412c-a42a-32b4438af557), client: CTX_ID:32e2d636-038a-472d-8199-007555d1805f-GRAPH_ID:0-PID:14265-HOST:nexus2-PC_NAME:androidpolice_data3-client-2-RECON_NO:-1, error-xlator: androidpolice_data3-access-control [Operation not permitted]
[2020-04-30 21:49:10.439701] I [MSGID: 115072] [server-rpc-fops_v2.c:1680:server4_setattr_cbk] 0-androidpolice_data3-server: 201833: SETATTR /androidpolice.com/public/wp-content/uploads (2692eeba-1ebe-49b6-927f-1dfbcd227591), client: CTX_ID:af341e80-70ff-4d23-99ef-3d846a546fc9-GRAPH_ID:0-PID:2358-HOST:forge-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2, error-xlator: androidpolice_data3-access-control [Operation not permitted]
[2020-04-30 21:49:10.453724] I [MSGID: 115072] [server-rpc-fops_v2.c:1680:server4_setattr_cbk] 0-androidpolice_data3-server: 201842: SETATTR /androidpolice.com/public/wp-content/uploads (2692eeba-1ebe-49b6-927f-1dfbcd227591), client: CTX_ID:af341e80-70ff-4d23-99ef-3d846a546fc9-GRAPH_ID:0-PID:2358-HOST:forge-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2, error-xlator: androidpolice_data3-access-control [Operation not permitted]
[2020-04-30 21:49:16.224662] I [MSGID: 115072] [server-rpc-fops_v2.c:1680:server4_setattr_cbk] 0-androidpolice_data3-server: 202865: SETATTR /androidpolice.com/public/wp-content/uploads (2692eeba-1ebe-49b6-927f-1dfbcd227591), client: CTX_ID:32e2d636-038a-472d-8199-007555d1805f-GRAPH_ID:0-PID:14265-HOST:nexus2-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2, error-xlator: androidpolice_data3-access-control [Operation not permitted]

There's also lots of self-healing happening that I didn't expect at all, since the upgrade only took ~10-15s.

[2020-04-30 21:47:38.714448] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-apkmirror_data1-replicate-0: performing metadata selfheal on 4a6ba2d7-7ad8-4113-862b-02e4934a3461
[2020-04-30 21:47:38.765033] I [MSGID: 108026] [afr-self-heal-common.c:1723:afr_log_selfheal] 0-apkmirror_data1-replicate-0: Completed metadata selfheal on 4a6ba2d7-7ad8-4113-862b-02e4934a3461. sources=[3] sinks=0 1 2
[2020-04-30 21:47:38.765289] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-apkmirror_data1-replicate-0: performing metadata selfheal on f3c62a41-1864-4e75-9883-4357a7091296
[2020-04-30 21:47:38.800987] I [MSGID: 108026] [afr-self-heal-common.c:1723:afr_log_selfheal] 0-apkmirror_data1-replicate-0: Completed metadata selfheal on f3c62a41-1864-4e75-9883-4357a7091296. sources=[3] sinks=0 1 2

I'm also seeing "remote operation failed" and "writing to fuse device failed: No such file or directory" messages:

[2020-04-30 21:46:34.891957] I [MSGID: 108026] [afr-self-heal-common.c:1723:afr_log_selfheal] 0-androidpolice_data3-replicate-0: Completed metadata selfheal on 2692eeba-1ebe-49b6-927f-1dfbcd227591. sources=0 1 [2] sinks=3
[2020-04-30 21:45:36.127412] W [MSGID: 114031] [client-rpc-fops_v2.c:1985:client4_0_setattr_cbk] 0-androidpolice_data3-client-0: remote operation failed [Operation not permitted]
[2020-04-30 21:45:36.345924] W [MSGID: 114031] [client-rpc-fops_v2.c:1985:client4_0_setattr_cbk] 0-androidpolice_data3-client-1: remote operation failed [Operation not permitted]
[2020-04-30 21:46:35.291853] I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 0-androidpolice_data3-replicate-0: selecting local read_child androidpolice_data3-client-2
[2020-04-30 21:46:35.977342] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-androidpolice_data3-replicate-0: performing metadata selfheal on 2692eeba-1ebe-49b6-927f-1dfbcd227591
[2020-04-30 21:46:36.006607] I [MSGID: 108026] [afr-self-heal-common.c:1723:afr_log_selfheal] 0-androidpolice_data3-replicate-0: Completed metadata selfheal on 2692eeba-1ebe-49b6-927f-1dfbcd227591. sources=0 1 [2] sinks=3
[2020-04-30 21:46:37.245599] E [fuse-bridge.c:219:check_and_dump_fuse_W] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fd13d50624d] (--> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fd1398e949a] (--> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fd1398e97bb] (--> /lib64/libpthread.so.0(+0x84f9)[0x7fd13ca564f9] (--> /lib64/libc.so.6(clone+0x3f)[0x7fd13c78ef2f] ))))) 0-glusterfs-fuse: writing to fuse device failed: No such file or directory
[2020-04-30 21:46:50.864797] E [fuse-bridge.c:219:check_and_dump_fuse_W] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fd13d50624d] (--> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fd1398e949a] (--> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fd1398e97bb] (--> /lib64/libpthread.so.0(+0x84f9)[0x7fd13ca564f9] (--> /lib64/libc.so.6(clone+0x3f)[0x7fd13c78ef2f] ))))) 0-glusterfs-fuse: writing to fuse device failed: No such file or directory

The number of items being healed is going up and down wildly, from 0 to 8000+, and the heal info command sometimes takes a really long time to return a value. I'm really worried, as this is a production system and I didn't observe this in our test system.

gluster v heal apkmirror_data1 info summary
Brick nexus2:/mnt/nexus2_block1/apkmirror_data1
Status: Connected
Total Number of entries: 27
Number of entries in heal pending: 27
Number of entries in split-brain: 0
Number of entries possibly healing: 0
Brick forge:/mnt/forge_block1/apkmirror_data1
Status: Connected
Total Number of entries: 27
Number of entries in heal pending: 27
Number of entries in split-brain: 0
Number of entries possibly healing: 0
Brick hive:/mnt/hive_block1/apkmirror_data1
Status: Connected
Total Number of entries: 27
Number of entries in heal pending: 27
Number of entries in split-brain: 0
Number of entries possibly healing: 0
Brick citadel:/mnt/citadel_block1/apkmirror_data1
Status: Connected
Total Number of entries: 8540
Number of entries in heal pending: 8540
Number of entries in split-brain: 0
Number of entries possibly healing: 0

gluster v heal androidpolice_data3 info summary
Brick nexus2:/mnt/nexus2_block4/androidpolice_data3
Status: Connected
Total Number of entries: 1
Number of entries in heal pending: 1
Number of entries in split-brain: 0
Number of entries possibly healing: 0
Brick forge:/mnt/forge_block4/androidpolice_data3
Status: Connected
Total Number of entries: 1
Number of entries in heal pending: 1
Number of entries in split-brain: 0
Number of entries possibly healing: 0
Brick hive:/mnt/hive_block4/androidpolice_data3
Status: Connected
Total Number of entries: 1
Number of entries in heal pending: 1
Number of entries in split-brain: 0
Number of entries possibly healing: 0
Brick citadel:/mnt/citadel_block4/androidpolice_data3
Status: Connected
Total Number of entries: 1149
Number of entries in heal pending: 1149
Number of entries in split-brain: 0
Number of entries possibly healing: 0

What should I do at this point? The files I tested seem to be replicating correctly, but I don't know if that's the case for all of them, and the heal counts going up and down and all these log messages are making me very nervous.

Thank you.
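
To compare the per-brick heal counts between runs without reading each summary by eye, the command output can be parsed into numbers. The following is only a minimal Python sketch and not part of the original message: the function names parse_heal_summary and pending_by_brick are hypothetical, and it assumes the output keeps the "Brick ..." header followed by "key: value" lines exactly as pasted above. The sample text is an excerpt of the apkmirror_data1 summary from this message.

```python
from typing import Dict

# Excerpt of the `gluster v heal apkmirror_data1 info summary` output quoted above.
SAMPLE = """\
Brick nexus2:/mnt/nexus2_block1/apkmirror_data1
Status: Connected
Total Number of entries: 27
Number of entries in heal pending: 27
Number of entries in split-brain: 0
Number of entries possibly healing: 0
Brick citadel:/mnt/citadel_block1/apkmirror_data1
Status: Connected
Total Number of entries: 8540
Number of entries in heal pending: 8540
Number of entries in split-brain: 0
Number of entries possibly healing: 0
"""


def parse_heal_summary(text: str) -> Dict[str, Dict[str, int]]:
    """Map each brick to its numeric counters (hypothetical helper).

    Assumes every brick section starts with a line beginning with "Brick "
    and that counters are "label: integer" lines, as in the pasted output.
    """
    bricks: Dict[str, Dict[str, int]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Brick "):
            current = line[len("Brick "):]
            bricks[current] = {}
        elif current is not None and ":" in line:
            # Split on the last colon so labels stay intact.
            label, _, value = line.rpartition(":")
            value = value.strip()
            if value.isdigit():  # skips non-numeric lines such as "Status: Connected"
                bricks[current][label.strip()] = int(value)
    return bricks


def pending_by_brick(text: str) -> Dict[str, int]:
    """Return the heal-pending count per brick, defaulting to 0 if absent."""
    return {
        brick: counters.get("Number of entries in heal pending", 0)
        for brick, counters in parse_heal_summary(text).items()
    }


if __name__ == "__main__":
    for brick, pending in pending_by_brick(SAMPLE).items():
        print(f"{brick}: {pending} pending")
```

Running this periodically against fresh output and logging the results would show whether the count on citadel is trending down or still climbing; how the output is captured (cron, a wrapper script, and so on) is left open here.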
________

Community Meeting Calendar:
Schedule - Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968

Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users