See my comments inline.

On 8 July 2020 at 0:46:21 GMT+03:00, Marco Fais <evilmf@xxxxxxxxx> wrote:
>Hi Strahil
>
>first of all thanks a million for your help -- really appreciate it.
>Thanks also for the pointers on the debug. I have tried it, and while I can't interpret the results I think I might have found something.
>
>There is a lot of information so hopefully this is relevant. During the snapshot creation and deletion, I can see the following errors in the client log:
>
>[2020-07-07 21:23:06.837381] W [MSGID: 122019] [ec-helpers.c:401:ec_loc_gfid_check] 0-SSD_Storage-disperse-0: Mismatching GFID's in loc
>[2020-07-07 21:23:06.837387] D [MSGID: 0] [defaults.c:1328:default_mknod_cbk] 0-stack-trace: stack-address: 0x7f0dc0001a78, SSD_Storage-disperse-0 returned -1 error: Input/output error [Input/output error]

You have to check the brick logs for the first brick in the volume list (see the sketch a bit further down).

>[2020-07-07 21:23:06.837392] W [MSGID: 109002] [dht-rename.c:1019:dht_rename_links_create_cbk] 0-SSD_Storage-dht: link/file
>/8d49207e-f6b9-41d1-8d35-f6e0fb121980/images/4802e66e-a7e3-42df-a570-7155135566ad/b51133ee-54e0-4001-ab4b-9f0dc1e5c6fc.meta

Check the meta file. There was a problem with Gluster where it healed the file before the other replica had come up (in your case it is a little bit different). Usually only the timestamp inside the file is changed, so you can force Gluster to update it by changing the timestamp inside.

>on SSD_Storage-disperse-0 failed [Input/output error]

Already mentioned this above.

>[2020-07-07 21:23:06.837850] D [MSGID: 0] [stack.h:502:copy_frame] 0-stack: groups is null (ngrps: 0) [Invalid argument]
>[2020-07-07 21:23:06.839252] D [dict.c:1168:data_to_uint32] (-->/lib64/libglusterfs.so.0(dict_foreach_match+0x77) [0x7f0ddb1855e7] -->/usr/lib64/glusterfs/7.5/xlator/cluster/disperse.so(+0x384cf) [0x7f0dd23c54cf] -->/lib64/libglusterfs.so.0(data_to_uint32+0x8e) [0x7f0ddb184f2e] ) 0-dict: key null, unsigned integer type asked, has integer type [Invalid argument]
>[2020-07-07 21:23:06.839272] D [MSGID: 0] [dht-common.c:6674:dht_readdirp_cbk] 0-SSD_Storage-dht: Processing entries from SSD_Storage-disperse-0
>[2020-07-07 21:23:06.839281] D [MSGID: 0] [dht-common.c:6681:dht_readdirp_cbk] 0-SSD_Storage-dht: SSD_Storage-disperse-0: entry = ., type = 4
>[2020-07-07 21:23:06.839291] D [MSGID: 0] [dht-common.c:6813:dht_readdirp_cbk] 0-SSD_Storage-dht: SSD_Storage-disperse-0: Adding entry = .
>[2020-07-07 21:23:06.839297] D [MSGID: 0] [dht-common.c:6681:dht_readdirp_cbk] 0-SSD_Storage-dht: SSD_Storage-disperse-0: entry = .., type = 4
>[2020-07-07 21:23:06.839324] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc0034598, SSD_Storage-client-6 returned -1 error: Stale file handle [Stale file handle]

I see multiple of these, but as the message level is not 'W' or 'E', I assume they can happen and are normal.
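Roughly, what I mean by those two checks is something like the following (the brick root and FUSE mount point below are only placeholders - take the real brick paths from 'gluster volume info' and the mount point of the storage domain):

# Find the first brick of the volume:
gluster volume info SSD_Storage | grep '^Brick1'

# On that node the brick log lives under /var/log/glusterfs/bricks/,
# named after the brick path - check it around the time of the snapshot.

# Compare the GFID of the .meta file on each brick of that disperse subvolume;
# a mismatch would fit the "Mismatching GFID's in loc" warning above:
getfattr -n trusted.gfid -e hex \
  <brick-root>/8d49207e-f6b9-41d1-8d35-f6e0fb121980/images/4802e66e-a7e3-42df-a570-7155135566ad/b51133ee-54e0-4001-ab4b-9f0dc1e5c6fc.meta

# And look at the timestamp line inside the .meta through the FUSE mount:
grep -i time <fuse-mountpoint>/8d49207e-f6b9-41d1-8d35-f6e0fb121980/images/4802e66e-a7e3-42df-a570-7155135566ad/b51133ee-54e0-4001-ab4b-9f0dc1e5c6fc.meta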
>[2020-07-07 21:23:06.839327] D [dict.c:1800:dict_get_int32] (-->/usr/lib64/glusterfs/7.5/xlator/cluster/disperse.so(+0x227d6) [0x7f0dd23af7d6] -->/usr/lib64/glusterfs/7.5/xlator/cluster/disperse.so(+0x17661) [0x7f0dd23a4661] -->/lib64/libglusterfs.so.0(dict_get_int32+0x107) [0x7f0ddb186437] ) 0-dict: key glusterfs.inodelk-count, integer type asked, has unsigned integer type [Invalid argument]
>[2020-07-07 21:23:06.839361] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc0034598, SSD_Storage-client-11 returned -1 error: Stale file handle [Stale file handle]
>[2020-07-07 21:23:06.839395] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc00395a8, SSD_Storage-client-15 returned -1 error: Stale file handle [Stale file handle]
>[2020-07-07 21:23:06.839419] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc0034598, SSD_Storage-client-9 returned -1 error: Stale file handle [Stale file handle]
>[2020-07-07 21:23:06.839473] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc009c108, SSD_Storage-client-18 returned -1 error: Stale file handle [Stale file handle]
>[2020-07-07 21:23:06.839471] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc0034598, SSD_Storage-client-10 returned -1 error: Stale file handle [Stale file handle]
>[2020-07-07 21:23:06.839491] D [dict.c:1800:dict_get_int32] (-->/usr/lib64/glusterfs/7.5/xlator/cluster/disperse.so(+0x256ad) [0x7f0dd23b26ad] -->/usr/lib64/glusterfs/7.5/xlator/cluster/disperse.so(+0x17661) [0x7f0dd23a4661] -->/lib64/libglusterfs.so.0(dict_get_int32+0x107) [0x7f0ddb186437] ) 0-dict: key glusterfs.inodelk-count, integer type asked, has unsigned integer type [Invalid argument]
>[2020-07-07 21:23:06.839512] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc0034598, SSD_Storage-client-7 returned -1 error: Stale file handle [Stale file handle]
>[2020-07-07 21:23:06.839526] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc009c108, SSD_Storage-client-23 returned -1 error: Stale file handle [Stale file handle]
>[2020-07-07 21:23:06.839543] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc009c108, SSD_Storage-client-22 returned -1 error: Stale file handle [Stale file handle]
>[2020-07-07 21:23:06.839543] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc00395a8, SSD_Storage-client-16 returned -1 error: Stale file handle [Stale file handle]
>[2020-07-07 21:23:06.839556] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc009c108, SSD_Storage-client-21 returned -1 error: Stale file handle [Stale file handle]
>[2020-07-07 21:23:06.839596] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc00395a8, SSD_Storage-client-12 returned -1 error: Stale file handle [Stale file handle]
>[2020-07-07 21:23:06.839617] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc00395a8, SSD_Storage-client-14 returned -1 error: Stale file handle [Stale file handle]
>[2020-07-07 21:23:06.839631] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc00395a8, SSD_Storage-client-13 returned -1 error: Stale file handle [Stale file handle]
>[2020-07-07 21:23:06.839636] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc00395a8, SSD_Storage-client-17 returned -1 error: Stale file handle [Stale file handle]
>[2020-07-07 21:23:06.839643] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc0034598, SSD_Storage-client-8 returned -1 error: Stale file handle [Stale file handle]
>[2020-07-07 21:23:06.839656] D [MSGID: 0] [defaults.c:1548:default_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc007c428, SSD_Storage-disperse-2 returned -1 error: Stale file handle [Stale file handle]
>[2020-07-07 21:23:06.839665] D [MSGID: 0] [dht-common.c:998:dht_discover_cbk] 0-SSD_Storage-dht: lookup of (null) on SSD_Storage-disperse-2 returned error [Stale file handle]
>[2020-07-07 21:23:06.839666] D [MSGID: 0] [defaults.c:1548:default_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc007c428, SSD_Storage-disperse-1 returned -1 error: Stale file handle [Stale file handle]
>[2020-07-07 21:23:06.839683] D [MSGID: 0] [dht-common.c:998:dht_discover_cbk] 0-SSD_Storage-dht: lookup of (null) on SSD_Storage-disperse-1 returned error [Stale file handle]
>[2020-07-07 21:23:06.839686] D [dict.c:1168:data_to_uint32] (-->/lib64/libglusterfs.so.0(dict_foreach_match+0x77) [0x7f0ddb1855e7] -->/usr/lib64/glusterfs/7.5/xlator/cluster/disperse.so(+0x384cf) [0x7f0dd23c54cf] -->/lib64/libglusterfs.so.0(data_to_uint32+0x8e) [0x7f0ddb184f2e] ) 0-dict: key null, unsigned integer type asked, has integer type [Invalid argument]
>[2020-07-07 21:23:06.839698] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc009c108, SSD_Storage-client-19 returned -1 error: Stale file handle [Stale file handle]
>[2020-07-07 21:23:06.839703] D [MSGID: 0] [dht-common.c:6674:dht_readdirp_cbk] 0-SSD_Storage-dht: Processing entries from SSD_Storage-disperse-0
>[2020-07-07 21:23:06.839714] D [MSGID: 0] [dht-common.c:6681:dht_readdirp_cbk] 0-SSD_Storage-dht: SSD_Storage-disperse-0: entry = .., type = 4
>[2020-07-07 21:23:06.839716] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc0024b48, SSD_Storage-client-30 returned -1 error: Stale file handle [Stale file handle]
>[2020-07-07 21:23:06.839724] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc0024b48, SSD_Storage-client-34 returned -1 error: Stale file handle [Stale file handle]
>[2020-07-07 21:23:06.839720] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc0024b48, SSD_Storage-client-35 returned -1 error: Stale file handle [Stale file handle]
>[2020-07-07 21:23:06.839755] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc0024b48, SSD_Storage-client-31 returned -1 error: Stale file handle [Stale file handle]
>[2020-07-07 21:23:06.839759] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc009c108, SSD_Storage-client-20 returned -1 error: Stale file handle [Stale file handle]
>[2020-07-07 21:23:06.839774] D [MSGID: 0] [defaults.c:1548:default_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc007c428, SSD_Storage-disperse-3 returned -1 error: Stale file handle [Stale file handle]
>[2020-07-07 21:23:06.839775] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc0024b48, SSD_Storage-client-32 returned -1 error: Stale file handle [Stale file handle]
>[2020-07-07 21:23:06.839783] D [MSGID: 0] [dht-common.c:998:dht_discover_cbk] 0-SSD_Storage-dht: lookup of (null) on SSD_Storage-disperse-3 returned error [Stale file handle]
>[2020-07-07 21:23:06.839798] D [MSGID: 0] [dht-common.c:601:dht_discover_complete] 0-SSD_Storage-dht: key = trusted.glusterfs.quota.read-only not present in dict
>[2020-07-07 21:23:06.839807] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc0024b48, SSD_Storage-client-33 returned -1 error: Stale file handle [Stale file handle]
>[2020-07-07 21:23:06.839807] D [MSGID: 0] [dht-layout.c:789:dht_layout_preset] 0-SSD_Storage-dht: file = 00000000-0000-0000-0000-000000000000, subvol = SSD_Storage-disperse-4
>[2020-07-07 21:23:06.839825] D [MSGID: 0] [defaults.c:1548:default_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc007c428, SSD_Storage-disperse-5 returned -1 error: Stale file handle [Stale file handle]
>[2020-07-07 21:23:06.839835] D [MSGID: 0] [dht-common.c:998:dht_discover_cbk] 0-SSD_Storage-dht: lookup of (null) on SSD_Storage-disperse-5 returned error [Stale file handle]
>
>The above is logged just shortly before the qemu-kvm process crashes with the usual error:
>
>Unexpected error in raw_check_lock_bytes() at block/file-posix.c:811:
>2020-07-07T21:23:06.847336Z qemu-kvm: Failed to get shared "write" lock

That's strange. Can you check the sanlock logs for anything reported there? (A rough example of what to look at is sketched below the quoted thread.)

>I have looked also on the brick logs, but there is too much information there and I will need to know what to look for.
>
>Not sure if there is any benefit in looking into this any further?
>
>Thanks,
>Marco
>
>On Thu, 2 Jul 2020 at 15:45, Strahil Nikolov <hunter86_bg@xxxxxxxxx> wrote:
>
>>
>> On 2 July 2020 at 16:33:51 GMT+03:00, Marco Fais <evilmf@xxxxxxxxx> wrote:
>> >Hi Strahil,
>> >
>> >WARNING: As you enabled sharding - NEVER DISABLE SHARDING, EVER!
>> >
>> >Thanks -- good to be reminded :)
>> >
>> >
>> >> >When you say they will not be optimal are you referring mainly to performance considerations? We did plenty of testing, and in terms of performance didn't have issues even with I/O intensive workloads (using SSDs, I had issues with spinning disks).
>> >>
>> >> Yes, the client side has to connect to 6 bricks (4+2) at a time and calculate the data in order to obtain the necessary information. Same is valid for writing.
>> >> If you need to conserve space, you can test VDO without compression (or even with it).
>> >>
>> >
>> >Understood -- will explore VDO. Storage usage efficiency is less important than fault tolerance or performance for us -- disperse volumes seemed to tick all the boxes so we looked at them primarily.
>> >But clearly I had missed that they are not used as mainstream VM storage for oVirt (I did know they weren't supported, but as explained thought it was more on the management side).
>> >
>> >
>> >>
>> >> Also with replica volumes, you can use 'choose-local' /in case you have storage faster than the network (like NVMe)/ and increase the read speed. Of course this feature is useful for a hyperconverged setup (gluster + ovirt on the same node).
>> >>
>> >
>> >Will explore this option as well, thanks for the suggestion.
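Rough examples of the checks and settings mentioned above (the volume name, device and time range are only placeholders - adjust them to your environment):

# sanlock on the host that was running the VM:
grep -iE 'error|fail' /var/log/sanlock.log
journalctl -u sanlock --since "2020-07-07 21:00"

# 'choose-local' applies to replica (AFR) volumes, not to your disperse one -
# it makes the client prefer its local brick for reads in a hyperconverged setup:
gluster volume set <replica_volume> cluster.choose-local on
gluster volume get <replica_volume> cluster.choose-local

# A VDO layer without compression would be created along these lines (illustrative only):
vdo create --name=vdo_ssd --device=/dev/sdX --compression=disabled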
>> >
>> >>
>> >> If you were using oVirt 4.3, I would recommend you to focus on Gluster. Yet, you use oVirt 4.4, which is quite newer and needs some polishing.
>> >
>> >oVirt 4.3.9 (using the older CentOS 7 qemu/libvirt) unfortunately had similar issues with the disperse volumes. Not sure if exactly the same, as I never looked deeper into it, but the results were similar.
>> >oVirt 4.4.0 has some issues with snapshot deletion that are independent from Gluster (I have raised the issue here, https://bugzilla.redhat.com/show_bug.cgi?id=1840414, should be sorted with 4.4.2 I guess), so at the moment it only works with the "testing" AV repo.
>>
>> In such case I can recommend you to:
>> 1. Ensure you have enough space on all bricks for the logs (/var/log/glusterfs). Several gigs should be OK.
>> 2. Set all logs to 'TRACE'. Red Hat's documentation on the topic is quite good:
>> https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3/html/administration_guide/configuring_the_log_level
>> 3. Reproduce the issue on a fresh VM (one that has never had a snapshot deleted).
>> 4. Disable (switch back to INFO) all logs as per the link in point 2.
>>
>> The logs will be spread among all nodes. If you have remote logging available, you can also use it for analysis of the logs.
>>
>> Most probably the brick logs can provide useful information.
>>
>> >
>> >> Check the oVirt engine logs (on the HostedEngine VM or your standalone engine), the vdsm logs on the host that was running the VM and next - check the brick logs.
>> >
>> >Will do.
>> >
>> >Thanks,
>> >Marco
>>
>> About VDO - it might require some tuning and even afterwards it won't be very performant, so it depends on your needs.
>>
>> Best Regards,
>> Strahil Nikolov
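P.S. Regarding step 2 above, the per-volume way to change the log level would be roughly the following (volume name taken from this thread; expect very large logs while TRACE is active):

gluster volume set SSD_Storage diagnostics.client-log-level TRACE
gluster volume set SSD_Storage diagnostics.brick-log-level TRACE

# reproduce the snapshot creation/deletion on a test VM, then switch back:
gluster volume set SSD_Storage diagnostics.client-log-level INFO
gluster volume set SSD_Storage diagnostics.brick-log-level INFO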