On May 1, 2020 8:03:50 PM GMT+03:00, Artem Russakovskii <archon810@xxxxxxxxx> wrote:
>The good news is the downgrade seems to have worked and was painless.
>
>zypper install --oldpackage glusterfs-5.13, restart gluster, and almost
>immediately there are no heal pending entries anymore.
>
>The only thing still showing up in the logs, besides some healing, is
>0-glusterfs-fuse: writing to fuse device failed: No such file or directory:
>
>==> mnt-androidpolice_data3.log <==
>[2020-05-01 16:54:21.085643] E [fuse-bridge.c:219:check_and_dump_fuse_W]
>(--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fd13d50624d]
>(--> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fd1398e949a]
>(--> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fd1398e97bb]
>(--> /lib64/libpthread.so.0(+0x84f9)[0x7fd13ca564f9]
>(--> /lib64/libc.so.6(clone+0x3f)[0x7fd13c78ef2f] )))))
>0-glusterfs-fuse: writing to fuse device failed: No such file or directory
>
>==> mnt-apkmirror_data1.log <==
>[2020-05-01 16:54:21.268842] E [fuse-bridge.c:219:check_and_dump_fuse_W]
>(--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fdf2b0a624d]
>(--> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fdf2748949a]
>(--> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fdf274897bb]
>(--> /lib64/libpthread.so.0(+0x84f9)[0x7fdf2a5f64f9]
>(--> /lib64/libc.so.6(clone+0x3f)[0x7fdf2a32ef2f] )))))
>0-glusterfs-fuse: writing to fuse device failed: No such file or directory
>
>It'd be very helpful if it had more info about what failed to write and why.
>
>I'd still really love to see the analysis of this failed upgrade from core
>gluster maintainers to see what needs fixing and how we can upgrade in the
>future.
>
>Thanks.
>
>Sincerely,
>Artem
>
>--
>Founder, Android Police <http://www.androidpolice.com>, APK Mirror
><http://www.apkmirror.com/>, Illogical Robot LLC
>beerpla.net | @ArtemR <http://twitter.com/ArtemR>
>
>
>On Fri, May 1, 2020 at 7:25 AM Artem Russakovskii <archon810@xxxxxxxxx> wrote:
>
>> I do not have snapshots, no. I have a general file-based backup, but also
>> the other 3 nodes are up.
>>
>> OpenSUSE 15.1.
>>
>> If I try to downgrade and it doesn't work, what's the brick replacement
>> scenario - is this still accurate?
>> https://docs.gluster.org/en/latest/Administrator%20Guide/Managing%20Volumes/#replace-brick
>>
>> Any feedback about the issues themselves yet, please? Specifically, is
>> there a chance this is happening because of the mismatched gluster
>> versions? Though, what's the solution then?
>>
>> On Fri, May 1, 2020, 1:07 AM Strahil Nikolov <hunter86_bg@xxxxxxxxx> wrote:
>>
>>> On May 1, 2020 1:25:17 AM GMT+03:00, Artem Russakovskii <archon810@xxxxxxxxx> wrote:
>>> >If more time is needed to analyze this, is this an option? Shut down 7.5,
>>> >downgrade it back to 5.13 and restart, or would this screw something up
>>> >badly? I didn't up the op-version yet.
>>> >
>>> >Thanks.
>>> >
>>> >Sincerely,
>>> >Artem
>>> >
>>> >--
>>> >Founder, Android Police <http://www.androidpolice.com>, APK Mirror
>>> ><http://www.apkmirror.com/>, Illogical Robot LLC
>>> >beerpla.net | @ArtemR <http://twitter.com/ArtemR>
>>> >
>>> >
>>> >On Thu, Apr 30, 2020 at 3:13 PM Artem Russakovskii <archon810@xxxxxxxxx> wrote:
>>> >
>>> >> The number of heal pending on citadel, the one that was upgraded to 7.5,
>>> >> has now gone to 10s of thousands and continues to go up.
>>> >>
>>> >> Sincerely,
>>> >> Artem
>>> >>
>>> >> --
>>> >> Founder, Android Police <http://www.androidpolice.com>, APK Mirror
>>> >> <http://www.apkmirror.com/>, Illogical Robot LLC
>>> >> beerpla.net | @ArtemR <http://twitter.com/ArtemR>
>>> >>
>>> >>
>>> >> On Thu, Apr 30, 2020 at 2:57 PM Artem Russakovskii <archon810@xxxxxxxxx> wrote:
>>> >>
>>> >>> Hi all,
>>> >>>
>>> >>> Today, I decided to upgrade one of the four servers (citadel) we have to
>>> >>> 7.5 from 5.13. There are 2 volumes, 1x4 replicate, and fuse mounts (I sent
>>> >>> the full details earlier in another message). If everything looked OK, I
>>> >>> would have proceeded with the rolling upgrade for all of them, following
>>> >>> the full heal.
>>> >>>
>>> >>> However, as soon as I upgraded and restarted, the logs filled with
>>> >>> messages like these:
>>> >>>
>>> >>> [2020-04-30 21:39:21.316149] E [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
>>> >>> [2020-04-30 21:39:21.382891] E [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
>>> >>> [2020-04-30 21:39:21.442440] E [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
>>> >>> [2020-04-30 21:39:21.445587] E [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
>>> >>> [2020-04-30 21:39:21.571398] E [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
>>> >>> [2020-04-30 21:39:21.668192] E [rpcsvc.c:567:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor (1298437:400:17) failed to complete successfully
>>> >>>
>>> >>> The message "I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-androidpolice_data3-replicate-0: selecting local read_child androidpolice_data3-client-3" repeated 10 times between [2020-04-30 21:46:41.854675] and [2020-04-30 21:48:20.206323]
>>> >>> The message "W [MSGID: 114031] [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk] 0-androidpolice_data3-client-1: remote operation failed [Transport endpoint is not connected]" repeated 264 times between [2020-04-30 21:46:32.129567] and [2020-04-30 21:48:29.905008]
>>> >>> The message "W [MSGID: 114031] [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk] 0-androidpolice_data3-client-0: remote operation failed [Transport endpoint is not connected]" repeated 264 times between [2020-04-30 21:46:32.129602] and [2020-04-30 21:48:29.905040]
>>> >>> The message "W [MSGID: 114031] [client-rpc-fops_v2.c:850:client4_0_setxattr_cbk] 0-androidpolice_data3-client-2: remote operation failed [Transport endpoint is not connected]" repeated 264 times between [2020-04-30 21:46:32.129512] and [2020-04-30 21:48:29.905047]
>>> >>>
>>> >>> Once in a while, I'm seeing this:
>>> >>>
>>> >>> ==> bricks/mnt-hive_block4-androidpolice_data3.log <==
>>> >>> [2020-04-30 21:45:54.251637] I [MSGID: 115072] [server-rpc-fops_v2.c:1681:server4_setattr_cbk] 0-androidpolice_data3-server: 5725811: SETATTR /androidpolice.com/public/wp-content/uploads/2019/03/cielo-breez-plus-hero.png
>>> >>> (d4556eb4-f15b-412c-a42a-32b4438af557), client: CTX_ID:32e2d636-038a-472d-8199-007555d1805f-GRAPH_ID:0-PID:14265-HOST:nexus2-PC_NAME:androidpolice_data3-client-2-RECON_NO:-1, error-xlator: androidpolice_data3-access-control [Operation not permitted]
>>> >>> [2020-04-30 21:49:10.439701] I [MSGID: 115072] [server-rpc-fops_v2.c:1680:server4_setattr_cbk] 0-androidpolice_data3-server: 201833: SETATTR /androidpolice.com/public/wp-content/uploads (2692eeba-1ebe-49b6-927f-1dfbcd227591), client: CTX_ID:af341e80-70ff-4d23-99ef-3d846a546fc9-GRAPH_ID:0-PID:2358-HOST:forge-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2, error-xlator: androidpolice_data3-access-control [Operation not permitted]
>>> >>> [2020-04-30 21:49:10.453724] I [MSGID: 115072] [server-rpc-fops_v2.c:1680:server4_setattr_cbk] 0-androidpolice_data3-server: 201842: SETATTR /androidpolice.com/public/wp-content/uploads (2692eeba-1ebe-49b6-927f-1dfbcd227591), client: CTX_ID:af341e80-70ff-4d23-99ef-3d846a546fc9-GRAPH_ID:0-PID:2358-HOST:forge-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2, error-xlator: androidpolice_data3-access-control [Operation not permitted]
>>> >>> [2020-04-30 21:49:16.224662] I [MSGID: 115072] [server-rpc-fops_v2.c:1680:server4_setattr_cbk] 0-androidpolice_data3-server: 202865: SETATTR /androidpolice.com/public/wp-content/uploads (2692eeba-1ebe-49b6-927f-1dfbcd227591), client: CTX_ID:32e2d636-038a-472d-8199-007555d1805f-GRAPH_ID:0-PID:14265-HOST:nexus2-PC_NAME:androidpolice_data3-client-3-RECON_NO:-2, error-xlator: androidpolice_data3-access-control [Operation not permitted]
>>> >>>
>>> >>> There's also lots of self-healing happening that I didn't expect at all,
>>> >>> since the upgrade only took ~10-15s.
>>> >>>
>>> >>> [2020-04-30 21:47:38.714448] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-apkmirror_data1-replicate-0: performing metadata selfheal on 4a6ba2d7-7ad8-4113-862b-02e4934a3461
>>> >>> [2020-04-30 21:47:38.765033] I [MSGID: 108026] [afr-self-heal-common.c:1723:afr_log_selfheal] 0-apkmirror_data1-replicate-0: Completed metadata selfheal on 4a6ba2d7-7ad8-4113-862b-02e4934a3461. sources=[3] sinks=0 1 2
>>> >>> [2020-04-30 21:47:38.765289] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-apkmirror_data1-replicate-0: performing metadata selfheal on f3c62a41-1864-4e75-9883-4357a7091296
>>> >>> [2020-04-30 21:47:38.800987] I [MSGID: 108026] [afr-self-heal-common.c:1723:afr_log_selfheal] 0-apkmirror_data1-replicate-0: Completed metadata selfheal on f3c62a41-1864-4e75-9883-4357a7091296. sources=[3] sinks=0 1 2
>>> >>>
>>> >>> I'm also seeing "remote operation failed" and "writing to fuse device
>>> >>> failed: No such file or directory" messages:
>>> >>>
>>> >>> [2020-04-30 21:46:34.891957] I [MSGID: 108026] [afr-self-heal-common.c:1723:afr_log_selfheal] 0-androidpolice_data3-replicate-0: Completed metadata selfheal on 2692eeba-1ebe-49b6-927f-1dfbcd227591.
>>> >>> sources=0 1 [2] sinks=3
>>> >>> [2020-04-30 21:45:36.127412] W [MSGID: 114031] [client-rpc-fops_v2.c:1985:client4_0_setattr_cbk] 0-androidpolice_data3-client-0: remote operation failed [Operation not permitted]
>>> >>> [2020-04-30 21:45:36.345924] W [MSGID: 114031] [client-rpc-fops_v2.c:1985:client4_0_setattr_cbk] 0-androidpolice_data3-client-1: remote operation failed [Operation not permitted]
>>> >>> [2020-04-30 21:46:35.291853] I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 0-androidpolice_data3-replicate-0: selecting local read_child androidpolice_data3-client-2
>>> >>> [2020-04-30 21:46:35.977342] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-androidpolice_data3-replicate-0: performing metadata selfheal on 2692eeba-1ebe-49b6-927f-1dfbcd227591
>>> >>> [2020-04-30 21:46:36.006607] I [MSGID: 108026] [afr-self-heal-common.c:1723:afr_log_selfheal] 0-androidpolice_data3-replicate-0: Completed metadata selfheal on 2692eeba-1ebe-49b6-927f-1dfbcd227591. sources=0 1 [2] sinks=3
>>> >>> [2020-04-30 21:46:37.245599] E [fuse-bridge.c:219:check_and_dump_fuse_W]
>>> >>> (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fd13d50624d]
>>> >>> (--> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fd1398e949a]
>>> >>> (--> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fd1398e97bb]
>>> >>> (--> /lib64/libpthread.so.0(+0x84f9)[0x7fd13ca564f9]
>>> >>> (--> /lib64/libc.so.6(clone+0x3f)[0x7fd13c78ef2f] ))))) 0-glusterfs-fuse: writing to fuse device failed: No such file or directory
>>> >>> [2020-04-30 21:46:50.864797] E [fuse-bridge.c:219:check_and_dump_fuse_W]
>>> >>> (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x17d)[0x7fd13d50624d]
>>> >>> (--> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x849a)[0x7fd1398e949a]
>>> >>> (--> /usr/lib64/glusterfs/5.13/xlator/mount/fuse.so(+0x87bb)[0x7fd1398e97bb]
>>> >>> (--> /lib64/libpthread.so.0(+0x84f9)[0x7fd13ca564f9]
>>> >>> (--> /lib64/libc.so.6(clone+0x3f)[0x7fd13c78ef2f] ))))) 0-glusterfs-fuse: writing to fuse device failed: No such file or directory
>>> >>>
>>> >>> The number of items being healed is going up and down wildly, from 0 to
>>> >>> 8000+ and sometimes taking a really long time to return a value. I'm really
>>> >>> worried as this is a production system, and I didn't observe this in our
>>> >>> test system.
>>> >>>
>>> >>> gluster v heal apkmirror_data1 info summary
>>> >>> Brick nexus2:/mnt/nexus2_block1/apkmirror_data1
>>> >>> Status: Connected
>>> >>> Total Number of entries: 27
>>> >>> Number of entries in heal pending: 27
>>> >>> Number of entries in split-brain: 0
>>> >>> Number of entries possibly healing: 0
>>> >>>
>>> >>> Brick forge:/mnt/forge_block1/apkmirror_data1
>>> >>> Status: Connected
>>> >>> Total Number of entries: 27
>>> >>> Number of entries in heal pending: 27
>>> >>> Number of entries in split-brain: 0
>>> >>> Number of entries possibly healing: 0
>>> >>>
>>> >>> Brick hive:/mnt/hive_block1/apkmirror_data1
>>> >>> Status: Connected
>>> >>> Total Number of entries: 27
>>> >>> Number of entries in heal pending: 27
>>> >>> Number of entries in split-brain: 0
>>> >>> Number of entries possibly healing: 0
>>> >>>
>>> >>> Brick citadel:/mnt/citadel_block1/apkmirror_data1
>>> >>> Status: Connected
>>> >>> Total Number of entries: 8540
>>> >>> Number of entries in heal pending: 8540
>>> >>> Number of entries in split-brain: 0
>>> >>> Number of entries possibly healing: 0
>>> >>>
>>> >>>
>>> >>> gluster v heal androidpolice_data3 info summary
>>> >>> Brick nexus2:/mnt/nexus2_block4/androidpolice_data3
>>> >>> Status: Connected
>>> >>> Total Number of entries: 1
>>> >>> Number of entries in heal pending: 1
>>> >>> Number of entries in split-brain: 0
>>> >>> Number of entries possibly healing: 0
>>> >>>
>>> >>> Brick forge:/mnt/forge_block4/androidpolice_data3
>>> >>> Status: Connected
>>> >>> Total Number of entries: 1
>>> >>> Number of entries in heal pending: 1
>>> >>> Number of entries in split-brain: 0
>>> >>> Number of entries possibly healing: 0
>>> >>>
>>> >>> Brick hive:/mnt/hive_block4/androidpolice_data3
>>> >>> Status: Connected
>>> >>> Total Number of entries: 1
>>> >>> Number of entries in heal pending: 1
>>> >>> Number of entries in split-brain: 0
>>> >>> Number of entries possibly healing: 0
>>> >>>
>>> >>> Brick citadel:/mnt/citadel_block4/androidpolice_data3
>>> >>> Status: Connected
>>> >>> Total Number of entries: 1149
>>> >>> Number of entries in heal pending: 1149
>>> >>> Number of entries in split-brain: 0
>>> >>> Number of entries possibly healing: 0
>>> >>>
>>> >>> What should I do at this point? The files I tested seem to be replicating
>>> >>> correctly, but I don't know if it's the case for all of them, and the heal
>>> >>> counts going up and down and all these log messages are making me very
>>> >>> nervous.
>>> >>>
>>> >>> Thank you.
>>> >>>
>>> >>> Sincerely,
>>> >>> Artem
>>> >>>
>>> >>> --
>>> >>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror
>>> >>> <http://www.apkmirror.com/>, Illogical Robot LLC
>>> >>> beerpla.net | @ArtemR <http://twitter.com/ArtemR>
>>> >>>
>>> >>
>>> It's not supported, but usually it works.
>>>
>>> In the worst case scenario, you can remove the node, wipe gluster on the
>>> node, reinstall the packages and add it - it will require a full heal of
>>> the brick and, as you have previously reported, could lead to performance
>>> degradation.
>>>
>>> I think you are on SLES, but I could be wrong. Do you have btrfs or LVM
>>> snapshots to revert from?
>>>
>>> Best Regards,
>>> Strahil Nikolov
>>

Hi Artem,

You can increase the brick log level by following
https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3/html/administration_guide/configuring_the_log_level
but keep in mind that logs grow quite fast - so don't keep them above the
current level for too long.

Do you have geo-replication running?

About the migration issue - I have no clue why this happened. The last time I
skipped a major release (3.12 to 5.5) I ran into huge trouble (all file
ownership was switched to root), and I have the feeling that it won't happen
again if you go through v6.

Best Regards,
Strahil Nikolov
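
For reference, the log-level change suggested above maps onto the volume's
diagnostics options. A minimal sketch, assuming a volume named apkmirror_data1
and the stock log locations (both are assumptions; substitute your own volume
name):

# Raise brick log verbosity while reproducing the problem
# (DEBUG is noisier than the default INFO, so don't leave it on for long)
gluster volume set apkmirror_data1 diagnostics.brick-log-level DEBUG

# Brick logs are typically under /var/log/glusterfs/bricks/ on each server

# Drop back to the default level once you've captured what you need
gluster volume set apkmirror_data1 diagnostics.brick-log-level INFO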
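
On the geo-replication question, a quick way to check whether any sessions are
configured at all is the status command, which lists every geo-replication
session known to the cluster (or reports that none exist):

# Lists all geo-replication sessions on this cluster
gluster volume geo-replication status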
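
The downgrade Artem reports at the top of the thread comes down to a short
sequence. A rough sketch for openSUSE, assuming the glusterd systemd unit name
and the volume names used in this thread (both are assumptions; adjust to your
setup):

# Roll the node's packages back to the previous release (as reported above)
zypper install --oldpackage glusterfs-5.13

# Restart gluster on that node (service name assumed to be glusterd)
systemctl restart glusterd

# A downgrade like this only has a chance if cluster.op-version was never
# raised for the new release; confirm what the cluster is actually running
gluster volume get all cluster.op-version
gluster volume get all cluster.max-op-version

# Watch the pending-heal counters settle back down
gluster volume heal apkmirror_data1 info summary
gluster volume heal androidpolice_data3 info summary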
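
As for the replace-brick question raised earlier in the thread, the procedure
in the linked Administrator Guide boils down to pointing the volume at a fresh,
empty brick and letting self-heal repopulate it. A hedged sketch with
hypothetical brick paths (the new path must differ from the old one):

# Swap a wiped or failed brick for an empty one on the reinstalled node;
# both paths here are made-up examples
gluster volume replace-brick apkmirror_data1 \
  citadel:/mnt/citadel_block1/apkmirror_data1 \
  citadel:/mnt/citadel_block1_new/apkmirror_data1 \
  commit force

# Self-heal then copies the data onto the new brick; track progress with
gluster volume heal apkmirror_data1 info summary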