On May 1, 2020 8:03:50 PM GMT+03:00, Artem Russakovskii <archon810@xxxxxxxxx> wrote:
>The good news is the downgrade seems to have worked and was painless.
>
>zypper install --oldpackage glusterfs-5.13, restart gluster, and almost
>immediately there are no heal pending entries anymore.
>
>The only thing still showing up in the logs, besides some healing, is
>0-glusterfs-fuse: writing to fuse device failed: No such file or directory:
>
>==> mnt-androidpolice_data3.log <==
>[2020-05-01 16:54:21.085643] E [fuse-bridge.c:219:check_and_dump_fuse_W]
>(--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fd13d50624d]
>(--> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fd1398e949a]
>(--> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fd1398e97bb]
>(--> /lib64/libpthread.so.0(+0x84f9)[0x7fd13ca564f9]
>(--> /lib64/libc.so.6(clone+0x3f)[0x7fd13c78ef2f] )))))
>0-glusterfs-fuse: writing to fuse device failed: No such file or directory
>
>==> mnt-apkmirror_data1.log <==
>[2020-05-01 16:54:21.268842] E [fuse-bridge.c:219:check_and_dump_fuse_W]
>(--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fdf2b0a624d]
>(--> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fdf2748949a]
>(--> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fdf274897bb]
>(--> /lib64/libpthread.so.0(+0x84f9)[0x7fdf2a5f64f9]
>(--> /lib64/libc.so.6(clone+0x3f)[0x7fdf2a32ef2f] )))))
>0-glusterfs-fuse: writing to fuse device failed: No such file or directory
>
>It'd be very helpful if it had more info about what failed to write and why.
>
>I'd still really love to see the analysis of this failed upgrade from core
>gluster maintainers to see what needs fixing and how we can upgrade in the
>future.
>
>Thanks.
>
>Sincerely,
>Artem
>
>--
>Founder, Android Police <http://www.androidpolice.com>, APK Mirror
><http://www.apkmirror.com/>, Illogical Robot LLC
>beerpla.net | @ArtemR <http://twitter.com/ArtemR>
>
>
>On Fri, May 1, 2020 at 7:25 AM Artem Russakovskii <archon810@xxxxxxxxx> wrote:
>
>> I do not have snapshots, no. I have a general file-based backup, but also
>> the other 3 nodes are up.
>>
>> OpenSUSE 15.1.
>>
>> If I try to downgrade and it doesn't work, what's the brick replacement
>> scenario - is this still accurate?
>> https://docs.gluster.org/en/latest/Administrator%20Guide/Managing%20Volumes/#replace-brick
>>
>> Any feedback about the issues themselves yet, please? Specifically, is
>> there a chance this is happening because of the mismatched gluster
>> versions? Though, what's the solution then?
>>
>> On Fri, May 1, 2020, 1:07 AM Strahil Nikolov <hunter86_bg@xxxxxxxxx> wrote:
>>
>>> On May 1, 2020 1:25:17 AM GMT+03:00, Artem Russakovskii <archon810@xxxxxxxxx> wrote:
>>> >If more time is needed to analyze this, is this an option? Shut down 7.5,
>>> >downgrade it back to 5.13 and restart, or would this screw something up
>>> >badly? I didn't up the op-version yet.
>>> >
>>> >Thanks.
>>> >
>>> >Sincerely,
>>> >Artem
>>> >
>>> >--
>>> >Founder, Android Police <http://www.androidpolice.com>, APK Mirror
>>> ><http://www.apkmirror.com/>, Illogical Robot LLC
>>> >beerpla.net | @ArtemR <http://twitter.com/ArtemR>
>>> >
>>> >
>>> >On Thu, Apr 30, 2020 at 3:13 PM Artem Russakovskii <archon810@xxxxxxxxx> wrote:
>>> >
>>> >> The number of heal pending on citadel, the one that was upgraded to 7.5,
>>> >> has now gone to 10s of thousands and continues to go up.
>>> >>
>>> >> Sincerely,
>>> >> Artem
>>> >>
>>> >> --
>>> >> Founder, Android Police <http://www.androidpolice.com>, APK Mirror
>>> >> <http://www.apkmirror.com/>, Illogical Robot LLC
>>> >> beerpla.net | @ArtemR <http://twitter.com/ArtemR>
>>> >>
>>> >>
>>> >> On Thu, Apr 30, 2020 at 2:57 PM Artem Russakovskii <archon810@xxxxxxxxx> wrote:
>>> >>
>>> >>> Hi all,
>>> >>>
>>> >>> Today, I decided to upgrade one of the four servers (citadel) we have to
>>> >>> 7.5 from 5.13. There are 2 volumes, 1x4 replicate, and fuse mounts (I sent
>>> >>> the full details earlier in another message). If everything looked OK, I
>>> >>> would have proceeded with the rolling upgrade for all of them, following
>>> >>> the full heal.
>>> >>>
>>> >>> However, as soon as I upgraded and restarted, the logs filled with
>>> >>> messages like these:
>>> >>>
>>> >>> [2020-04-30 21:39:21.316149] E [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
>>> >>> [2020-04-30 21:39:21.382891] E [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
>>> >>> [2020-04-30 21:39:21.442440] E [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
>>> >>> [2020-04-30 21:39:21.445587] E [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
>>> >>> [2020-04-30 21:39:21.571398] E [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
>>> >>> [2020-04-30 21:39:21.668192] E [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
>>> >>>
>>> >>> The message "I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-androidpolice_data3-replicate-0: selecting local read_child androidpolice_data3-client-3" repeated 10 times between [2020-04-30 21:46:41.854675] and [2020-04-30 21:48:20.206323]
>>> >>> The message "W [MSGID: 114031] [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk] 0-androidpolice_data3-client-1: remote operation failed [Transport endpoint is not connected]" repeated 264 times between [2020-04-30 21:46:32.129567] and [2020-04-30 21:48:29.905008]
>>> >>> The message "W [MSGID: 114031] [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk] 0-androidpolice_data3-client-0: remote operation failed [Transport endpoint is not connected]" repeated 264 times between [2020-04-30 21:46:32.129602] and [2020-04-30 21:48:29.905040]
>>> >>> The message "W [MSGID: 114031] [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk] 0-androidpolice_data3-client-2: remote operation failed [Transport endpoint is not connected]" repeated 264 times between [2020-04-30 21:46:32.129512] and [2020-04-30 21:48:29.905047]
>>> >>>
>>> >>> Once in a while, I'm seeing this:
>>> >>>
>>> >>> ==> bricks/mnt-hive_block4-androidpolice_data3.log <==
>>> >>> [2020-04-30 21:45:54.251637] I [MSGID: 115072] [server-rpc-fops_v2.c:1681:server4_setattr_cbk] 0-androidpolice_data3-server: 5725811: SETATTR /androidpolice.com/public/wp-content/uploads/2019/03/cielo-breez-plus-hero.png
>>> >>> (d4556eb4-f15b-412c-a42a-32b4438af557), client: CTX_ID:32e2d636-038a-472d-8199-007555d1805f-GRAPH_ID:0-PID:14265-HOST:nexus2-PC_NAME:androidpolice_data3-client-2-RECON_NO:-1, error-xlator: androidpolice_data3-access-control [Operation not permitted]
>>> >>> [2020-04-30 21:49:10.439701] I [MSGID: 115072] [server-rpc-fops_v2.c:1680:server4_setattr_cbk] 0-androidpolice_data3-server: 201833: SETATTR /androidpolice.com/public/wp-content/uploads (2692eeba-1ebe-49b6-927f-1dfbcd227591), client: CTX_ID:af341e80-70ff-4d23-99ef-3d846a546fc9-GRAPH_ID:0-PID:2358-HOST:forge-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2, error-xlator: androidpolice_data3-access-control [Operation not permitted]
>>> >>> [2020-04-30 21:49:10.453724] I [MSGID: 115072] [server-rpc-fops_v2.c:1680:server4_setattr_cbk] 0-androidpolice_data3-server: 201842: SETATTR /androidpolice.com/public/wp-content/uploads (2692eeba-1ebe-49b6-927f-1dfbcd227591), client: CTX_ID:af341e80-70ff-4d23-99ef-3d846a546fc9-GRAPH_ID:0-PID:2358-HOST:forge-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2, error-xlator: androidpolice_data3-access-control [Operation not permitted]
>>> >>> [2020-04-30 21:49:16.224662] I [MSGID: 115072] [server-rpc-fops_v2.c:1680:server4_setattr_cbk] 0-androidpolice_data3-server: 202865: SETATTR /androidpolice.com/public/wp-content/uploads (2692eeba-1ebe-49b6-927f-1dfbcd227591), client: CTX_ID:32e2d636-038a-472d-8199-007555d1805f-GRAPH_ID:0-PID:14265-HOST:nexus2-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2, error-xlator: androidpolice_data3-access-control [Operation not permitted]
>>> >>>
>>> >>> There's also lots of self-healing happening that I didn't expect at all,
>>> >>> since the upgrade only took ~10-15s.
>>> >>>
>>> >>> [2020-04-30 21:47:38.714448] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-apkmirror_data1-replicate-0: performing metadata selfheal on 4a6ba2d7-7ad8-4113-862b-02e4934a3461
>>> >>> [2020-04-30 21:47:38.765033] I [MSGID: 108026] [afr-self-heal-common.c:1723:afr_log_selfheal] 0-apkmirror_data1-replicate-0: Completed metadata selfheal on 4a6ba2d7-7ad8-4113-862b-02e4934a3461. sources=[3] sinks=0 1 2
>>> >>> [2020-04-30 21:47:38.765289] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-apkmirror_data1-replicate-0: performing metadata selfheal on f3c62a41-1864-4e75-9883-4357a7091296
>>> >>> [2020-04-30 21:47:38.800987] I [MSGID: 108026] [afr-self-heal-common.c:1723:afr_log_selfheal] 0-apkmirror_data1-replicate-0: Completed metadata selfheal on f3c62a41-1864-4e75-9883-4357a7091296. sources=[3] sinks=0 1 2
>>> >>>
>>> >>> I'm also seeing "remote operation failed" and "writing to fuse device
>>> >>> failed: No such file or directory" messages:
>>> >>>
>>> >>> [2020-04-30 21:46:34.891957] I [MSGID: 108026] [afr-self-heal-common.c:1723:afr_log_selfheal] 0-androidpolice_data3-replicate-0: Completed metadata selfheal on 2692eeba-1ebe-49b6-927f-1dfbcd227591.
>>> >>> sources=0 1 [2] sinks=3
>>> >>> [2020-04-30 21:45:36.127412] W [MSGID: 114031] [client-rpc-fops_v2.c:1985:client4_0_setattr_cbk] 0-androidpolice_data3-client-0: remote operation failed [Operation not permitted]
>>> >>> [2020-04-30 21:45:36.345924] W [MSGID: 114031] [client-rpc-fops_v2.c:1985:client4_0_setattr_cbk] 0-androidpolice_data3-client-1: remote operation failed [Operation not permitted]
>>> >>> [2020-04-30 21:46:35.291853] I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 0-androidpolice_data3-replicate-0: selecting local read_child androidpolice_data3-client-2
>>> >>> [2020-04-30 21:46:35.977342] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-androidpolice_data3-replicate-0: performing metadata selfheal on 2692eeba-1ebe-49b6-927f-1dfbcd227591
>>> >>> [2020-04-30 21:46:36.006607] I [MSGID: 108026] [afr-self-heal-common.c:1723:afr_log_selfheal] 0-androidpolice_data3-replicate-0: Completed metadata selfheal on 2692eeba-1ebe-49b6-927f-1dfbcd227591. sources=0 1 [2] sinks=3
>>> >>> [2020-04-30 21:46:37.245599] E [fuse-bridge.c:219:check_and_dump_fuse_W]
>>> >>> (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fd13d50624d]
>>> >>> (--> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fd1398e949a]
>>> >>> (--> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fd1398e97bb]
>>> >>> (--> /lib64/libpthread.so.0(+0x84f9)[0x7fd13ca564f9]
>>> >>> (--> /lib64/libc.so.6(clone+0x3f)[0x7fd13c78ef2f] ))))) 0-glusterfs-fuse: writing to fuse device failed: No such file or directory
>>> >>> [2020-04-30 21:46:50.864797] E [fuse-bridge.c:219:check_and_dump_fuse_W]
>>> >>> (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fd13d50624d]
>>> >>> (--> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fd1398e949a]
>>> >>> (--> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fd1398e97bb]
>>> >>> (--> /lib64/libpthread.so.0(+0x84f9)[0x7fd13ca564f9]
>>> >>> (--> /lib64/libc.so.6(clone+0x3f)[0x7fd13c78ef2f] ))))) 0-glusterfs-fuse: writing to fuse device failed: No such file or directory
>>> >>>
>>> >>> The number of items being healed is going up and down wildly, from 0 to
>>> >>> 8000+ and sometimes taking a really long time to return a value. I'm really
>>> >>> worried as this is a production system, and I didn't observe this in our
>>> >>> test system.
>>> >>>
>>> >>> gluster v heal apkmirror_data1 info summary
>>> >>> Brick nexus2:/mnt/nexus2_block1/apkmirror_data1
>>> >>> Status: Connected
>>> >>> Total Number of entries: 27
>>> >>> Number of entries in heal pending: 27
>>> >>> Number of entries in split-brain: 0
>>> >>> Number of entries possibly healing: 0
>>> >>>
>>> >>> Brick forge:/mnt/forge_block1/apkmirror_data1
>>> >>> Status: Connected
>>> >>> Total Number of entries: 27
>>> >>> Number of entries in heal pending: 27
>>> >>> Number of entries in split-brain: 0
>>> >>> Number of entries possibly healing: 0
>>> >>>
>>> >>> Brick hive:/mnt/hive_block1/apkmirror_data1
>>> >>> Status: Connected
>>> >>> Total Number of entries: 27
>>> >>> Number of entries in heal pending: 27
>>> >>> Number of entries in split-brain: 0
>>> >>> Number of entries possibly healing: 0
>>> >>>
>>> >>> Brick citadel:/mnt/citadel_block1/apkmirror_data1
>>> >>> Status: Connected
>>> >>> Total Number of entries: 8540
>>> >>> Number of entries in heal pending: 8540
>>> >>> Number of entries in split-brain: 0
>>> >>> Number of entries possibly healing: 0
>>> >>>
>>> >>>
>>> >>> gluster v heal androidpolice_data3 info summary
>>> >>> Brick nexus2:/mnt/nexus2_block4/androidpolice_data3
>>> >>> Status: Connected
>>> >>> Total Number of entries: 1
>>> >>> Number of entries in heal pending: 1
>>> >>> Number of entries in split-brain: 0
>>> >>> Number of entries possibly healing: 0
>>> >>>
>>> >>> Brick forge:/mnt/forge_block4/androidpolice_data3
>>> >>> Status: Connected
>>> >>> Total Number of entries: 1
>>> >>> Number of entries in heal pending: 1
>>> >>> Number of entries in split-brain: 0
>>> >>> Number of entries possibly healing: 0
>>> >>>
>>> >>> Brick hive:/mnt/hive_block4/androidpolice_data3
>>> >>> Status: Connected
>>> >>> Total Number of entries: 1
>>> >>> Number of entries in heal pending: 1
>>> >>> Number of entries in split-brain: 0
>>> >>> Number of entries possibly healing: 0
>>> >>>
>>> >>> Brick citadel:/mnt/citadel_block4/androidpolice_data3
>>> >>> Status: Connected
>>> >>> Total Number of entries: 1149
>>> >>> Number of entries in heal pending: 1149
>>> >>> Number of entries in split-brain: 0
>>> >>> Number of entries possibly healing: 0
>>> >>>
>>> >>> What should I do at this point? The files I tested seem to be replicating
>>> >>> correctly, but I don't know if it's the case for all of them, and the heal
>>> >>> counts going up and down and all these log messages are making me very
>>> >>> nervous.
>>> >>>
>>> >>> Thank you.
>>> >>>
>>> >>> Sincerely,
>>> >>> Artem
>>> >>>
>>> >>> --
>>> >>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror
>>> >>> <http://www.apkmirror.com/>, Illogical Robot LLC
>>> >>> beerpla.net | @ArtemR <http://twitter.com/ArtemR>
>>> >>>
>>> >>
>>> It's not supported, but usually it works.
>>>
>>> In the worst case scenario, you can remove the node, wipe gluster on the
>>> node, reinstall the packages and add it - it will require a full heal of
>>> the brick and, as you have previously reported, could lead to performance
>>> degradation.
>>>
>>> I think you are on SLES, but I could be wrong. Do you have btrfs or LVM
>>> snapshots to revert from?
>>>
>>> Best Regards,
>>> Strahil Nikolov
>>

Hi Artem,

You can increase the brick log level by following
https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3/html/administration_guide/configuring_the_log_level
but keep in mind that logs grow quite fast - so don't keep them above the
current level for too long.

Do you have geo-replication running?

About the migration issue - I have no clue why this happened. The last time I
skipped a major release (3.12 to 5.5) I ran into huge trouble (all file
ownership was switched to root), and I have the feeling that it won't happen
again if you go through v6.

Best Regards,
Strahil Nikolov
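
For reference, the log-level change suggested above maps onto the volume's
diagnostics options. A minimal sketch, assuming a volume named apkmirror_data1
and the stock log locations (both are assumptions; substitute your own volume
name):

# Raise brick log verbosity while reproducing the problem
# (DEBUG is noisier than the default INFO, so don't leave it on for long)
gluster volume set apkmirror_data1 diagnostics.brick-log-level DEBUG

# Brick logs are typically under /var/log/glusterfs/bricks/ on each server

# Drop back to the default level once you've captured what you need
gluster volume set apkmirror_data1 diagnostics.brick-log-level INFO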
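
On the geo-replication question, a quick way to check whether any sessions are
configured at all is the status command, which lists every geo-replication
session known to the cluster (or reports that none exist):

# Lists all geo-replication sessions on this cluster
gluster volume geo-replication status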
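
The downgrade Artem reports at the top of the thread comes down to a short
sequence. A rough sketch for openSUSE, assuming the glusterd systemd unit name
and the volume names used in this thread (both are assumptions; adjust to your
setup):

# Roll the node's packages back to the previous release (as reported above)
zypper install --oldpackage glusterfs-5.13

# Restart gluster on that node (service name assumed to be glusterd)
systemctl restart glusterd

# A downgrade like this only has a chance if cluster.op-version was never
# raised for the new release; confirm what the cluster is actually running
gluster volume get all cluster.op-version
gluster volume get all cluster.max-op-version

# Watch the pending-heal counters settle back down
gluster volume heal apkmirror_data1 info summary
gluster volume heal androidpolice_data3 info summary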
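
As for the replace-brick question raised earlier in the thread, the procedure
in the linked Administrator Guide boils down to pointing the volume at a fresh,
empty brick and letting self-heal repopulate it. A hedged sketch with
hypothetical brick paths (the new path must differ from the old one):

# Swap a wiped or failed brick for an empty one on the reinstalled node;
# both paths here are made-up examples
gluster volume replace-brick apkmirror_data1 \
  citadel:/mnt/citadel_block1/apkmirror_data1 \
  citadel:/mnt/citadel_block1_new/apkmirror_data1 \
  commit force

# Self-heal then copies the data onto the new brick; track progress with
gluster volume heal apkmirror_data1 info summary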