Re: Problems with qemu and disperse volumes (live merge)

Hi Strahil

First of all, thanks a million for your help -- I really appreciate it.
Thanks also for the pointers on debug logging. I have tried it, and while I can't interpret the results, I think I might have found something.

There is a lot of information so hopefully this is relevant. During the snapshot creation and deletion, I can see the following errors in the client log:

[2020-07-07 21:23:06.837381] W [MSGID: 122019] [ec-helpers.c:401:ec_loc_gfid_check] 0-SSD_Storage-disperse-0: Mismatching GFID's in loc
[2020-07-07 21:23:06.837387] D [MSGID: 0] [defaults.c:1328:default_mknod_cbk] 0-stack-trace: stack-address: 0x7f0dc0001a78, SSD_Storage-disperse-0 returned -1 error: Input/output error [Input/output error]
[2020-07-07 21:23:06.837392] W [MSGID: 109002] [dht-rename.c:1019:dht_rename_links_create_cbk] 0-SSD_Storage-dht: link/file /8d49207e-f6b9-41d1-8d35-f6e0fb121980/images/4802e66e-a7e3-42df-a570-7155135566ad/b51133ee-54e0-4001-ab4b-9f0dc1e5c6fc.meta on SSD_Storage-disperse-0 failed [Input/output error]
[2020-07-07 21:23:06.837850] D [MSGID: 0] [stack.h:502:copy_frame] 0-stack: groups is null (ngrps: 0) [Invalid argument]
[2020-07-07 21:23:06.839252] D [dict.c:1168:data_to_uint32] (-->/lib64/libglusterfs.so.0(dict_foreach_match+0x77) [0x7f0ddb1855e7] -->/usr/lib64/glusterfs/7.5/xlator/cluster/disperse.so(+0x384cf) [0x7f0dd23c54cf] -->/lib64/libglusterfs.so.0(data_to_uint32+0x8e) [0x7f0ddb184f2e] ) 0-dict: key null, unsigned integer type asked, has integer type [Invalid argument]
[2020-07-07 21:23:06.839272] D [MSGID: 0] [dht-common.c:6674:dht_readdirp_cbk] 0-SSD_Storage-dht: Processing entries from SSD_Storage-disperse-0
[2020-07-07 21:23:06.839281] D [MSGID: 0] [dht-common.c:6681:dht_readdirp_cbk] 0-SSD_Storage-dht: SSD_Storage-disperse-0: entry = ., type = 4
[2020-07-07 21:23:06.839291] D [MSGID: 0] [dht-common.c:6813:dht_readdirp_cbk] 0-SSD_Storage-dht: SSD_Storage-disperse-0: Adding entry = .
[2020-07-07 21:23:06.839297] D [MSGID: 0] [dht-common.c:6681:dht_readdirp_cbk] 0-SSD_Storage-dht: SSD_Storage-disperse-0: entry = .., type = 4
[2020-07-07 21:23:06.839324] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc0034598, SSD_Storage-client-6 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839327] D [dict.c:1800:dict_get_int32] (-->/usr/lib64/glusterfs/7.5/xlator/cluster/disperse.so(+0x227d6) [0x7f0dd23af7d6] -->/usr/lib64/glusterfs/7.5/xlator/cluster/disperse.so(+0x17661) [0x7f0dd23a4661] -->/lib64/libglusterfs.so.0(dict_get_int32+0x107) [0x7f0ddb186437] ) 0-dict: key glusterfs.inodelk-count, integer type asked, has unsigned integer type [Invalid argument]
[2020-07-07 21:23:06.839361] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc0034598, SSD_Storage-client-11 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839395] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc00395a8, SSD_Storage-client-15 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839419] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc0034598, SSD_Storage-client-9 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839473] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc009c108, SSD_Storage-client-18 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839471] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc0034598, SSD_Storage-client-10 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839491] D [dict.c:1800:dict_get_int32] (-->/usr/lib64/glusterfs/7.5/xlator/cluster/disperse.so(+0x256ad) [0x7f0dd23b26ad] -->/usr/lib64/glusterfs/7.5/xlator/cluster/disperse.so(+0x17661) [0x7f0dd23a4661] -->/lib64/libglusterfs.so.0(dict_get_int32+0x107) [0x7f0ddb186437] ) 0-dict: key glusterfs.inodelk-count, integer type asked, has unsigned integer type [Invalid argument]
[2020-07-07 21:23:06.839512] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc0034598, SSD_Storage-client-7 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839526] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc009c108, SSD_Storage-client-23 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839543] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc009c108, SSD_Storage-client-22 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839543] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc00395a8, SSD_Storage-client-16 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839556] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc009c108, SSD_Storage-client-21 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839596] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc00395a8, SSD_Storage-client-12 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839617] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc00395a8, SSD_Storage-client-14 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839631] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc00395a8, SSD_Storage-client-13 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839636] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc00395a8, SSD_Storage-client-17 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839643] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc0034598, SSD_Storage-client-8 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839656] D [MSGID: 0] [defaults.c:1548:default_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc007c428, SSD_Storage-disperse-2 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839665] D [MSGID: 0] [dht-common.c:998:dht_discover_cbk] 0-SSD_Storage-dht: lookup of (null) on SSD_Storage-disperse-2 returned error [Stale file handle]
[2020-07-07 21:23:06.839666] D [MSGID: 0] [defaults.c:1548:default_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc007c428, SSD_Storage-disperse-1 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839683] D [MSGID: 0] [dht-common.c:998:dht_discover_cbk] 0-SSD_Storage-dht: lookup of (null) on SSD_Storage-disperse-1 returned error [Stale file handle]
[2020-07-07 21:23:06.839686] D [dict.c:1168:data_to_uint32] (-->/lib64/libglusterfs.so.0(dict_foreach_match+0x77) [0x7f0ddb1855e7] -->/usr/lib64/glusterfs/7.5/xlator/cluster/disperse.so(+0x384cf) [0x7f0dd23c54cf] -->/lib64/libglusterfs.so.0(data_to_uint32+0x8e) [0x7f0ddb184f2e] ) 0-dict: key null, unsigned integer type asked, has integer type [Invalid argument]
[2020-07-07 21:23:06.839698] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc009c108, SSD_Storage-client-19 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839703] D [MSGID: 0] [dht-common.c:6674:dht_readdirp_cbk] 0-SSD_Storage-dht: Processing entries from SSD_Storage-disperse-0
[2020-07-07 21:23:06.839714] D [MSGID: 0] [dht-common.c:6681:dht_readdirp_cbk] 0-SSD_Storage-dht: SSD_Storage-disperse-0: entry = .., type = 4
[2020-07-07 21:23:06.839716] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc0024b48, SSD_Storage-client-30 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839724] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc0024b48, SSD_Storage-client-34 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839720] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc0024b48, SSD_Storage-client-35 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839755] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc0024b48, SSD_Storage-client-31 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839759] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc009c108, SSD_Storage-client-20 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839774] D [MSGID: 0] [defaults.c:1548:default_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc007c428, SSD_Storage-disperse-3 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839775] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc0024b48, SSD_Storage-client-32 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839783] D [MSGID: 0] [dht-common.c:998:dht_discover_cbk] 0-SSD_Storage-dht: lookup of (null) on SSD_Storage-disperse-3 returned error [Stale file handle]
[2020-07-07 21:23:06.839798] D [MSGID: 0] [dht-common.c:601:dht_discover_complete] 0-SSD_Storage-dht: key = trusted.glusterfs.quota.read-only not present in dict
[2020-07-07 21:23:06.839807] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc0024b48, SSD_Storage-client-33 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839807] D [MSGID: 0] [dht-layout.c:789:dht_layout_preset] 0-SSD_Storage-dht: file = 00000000-0000-0000-0000-000000000000, subvol = SSD_Storage-disperse-4
[2020-07-07 21:23:06.839825] D [MSGID: 0] [defaults.c:1548:default_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc007c428, SSD_Storage-disperse-5 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839835] D [MSGID: 0] [dht-common.c:998:dht_discover_cbk] 0-SSD_Storage-dht: lookup of (null) on SSD_Storage-disperse-5 returned error [Stale file handle]


The above is logged just shortly before the qemu-kvm process crashes with the usual error:

Unexpected error in raw_check_lock_bytes() at block/file-posix.c:811:
2020-07-07T21:23:06.847336Z qemu-kvm: Failed to get shared "write" lock
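
To get an overview of an excerpt like the one above, it may help to count which subvolume returned which error. The following is only a rough Python sketch, not a Gluster tool: the regular expression is an assumption based on the "returned -1 error:" lines pasted in this message, and `tally_failures` is a made-up helper name.

```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpt in this message:
# [timestamp] LEVEL ... <subvolume> returned -1 error: <text> [<text>]
RETURNED = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<level>[A-Z]) .*?"
    r"(?P<subvol>\S+) returned -1 error: (?P<error>[^\[]+?) \["
)


def tally_failures(lines):
    """Count (subvolume, error text) pairs over an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = RETURNED.match(line)
        if match:  # lines of any other shape are simply skipped
            counts[(match.group("subvol"), match.group("error"))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python tally.py < client.log
    for (subvol, error), n in sorted(tally_failures(sys.stdin).items()):
        print(f"{n:6d}  {subvol}  {error}")
```

Run against the excerpt above, this would separate the single "Input/output error" on SSD_Storage-disperse-0 from the many "Stale file handle" lookups on the other subvolumes.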


I have also looked at the brick logs, but there is too much information there, and I will need to know what to look for.
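
One possible way to cut the brick logs down is to keep only warning-or-worse lines within a couple of seconds of the qemu-kvm crash time. The Python sketch below is an illustration under assumptions: it assumes the brick logs use the same line header as the client log shown earlier, that the brick log and qemu timestamps are in the same time zone, and `near_crash` is a made-up helper name.

```python
import re
from datetime import datetime, timedelta

# Assumed header shape: [YYYY-MM-DD HH:MM:SS.ffffff] LEVEL ...
HEADER = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\] ([A-Z]) ")


def near_crash(lines, crash_time, window_s=2.0, levels=("W", "E", "C")):
    """Yield lines at the given levels within window_s seconds of crash_time."""
    window = timedelta(seconds=window_s)
    for line in lines:
        match = HEADER.match(line)
        if not match:
            continue  # continuation lines or other formats are skipped
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        if match.group(2) in levels and abs(stamp - crash_time) <= window:
            yield line


if __name__ == "__main__":
    import sys

    # Crash time taken from the qemu-kvm message quoted above.
    crash = datetime(2020, 7, 7, 21, 23, 6, 847336)
    for hit in near_crash(sys.stdin, crash):
        sys.stdout.write(hit)
```

The window and the set of levels are arbitrary starting points; widening either is a one-argument change.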

Not sure if there is any benefit in looking into this any further?

Thanks,
Marco

On Thu, 2 Jul 2020 at 15:45, Strahil Nikolov <hunter86_bg@xxxxxxxxx> wrote:


On 2 July 2020 16:33:51 GMT+03:00, Marco Fais <evilmf@xxxxxxxxx> wrote:
>Hi Strahil,
>
>WARNING: As you enabled sharding - NEVER DISABLE SHARDING, EVER !
>>
>
>Thanks -- good to be reminded :)
>
>
>> >When you say they will not be optimal are you referring mainly to
>> >performance considerations? We did plenty of testing, and in terms of
>> >performance didn't have issues even with I/O intensive workloads (using
>> >SSDs, I had issues with spinning disks).
>>
>> Yes, the client side has to connect to 6 bricks (4+2) at a time and
>> calculate the data in order to obtain the necessary information. The same
>> is valid for writing.
>> If you need to conserve space, you can test VDO without compression
>> (or even with it).
>>
>
>Understood -- will explore VDO. Storage usage efficiency is less important
>than fault tolerance or performance for us -- disperse volumes seemed to
>tick all the boxes so we looked at them primarily.
>But clearly I had missed that they are not used as mainstream VM storage
>for oVirt (I did know they weren't supported, but as explained thought it was
>more on the management side).
>
>
>>
>> Also, with replica volumes you can use 'choose-local' (in case you
>> have storage that is faster than the network, like NVMe) and increase the
>> read speed. Of course, this feature is useful for a hyperconverged setup
>> (gluster + ovirt on the same node).
>>
>
>Will explore this option as well, thanks for the suggestion.
>
>
>> If you were using oVirt 4.3, I would recommend you to focus on
>> gluster. Yet you use oVirt 4.4, which is much newer and needs some
>> polishing.
>>
>
>oVirt 4.3.9 (using the older CentOS 7 qemu/libvirt) unfortunately had
>similar issues with the disperse volumes. Not sure if exactly the same, as I
>never looked deeper into it, but the results were similar.
>oVirt 4.4.0 has some issues with snapshot deletion that are independent
>from Gluster (I have raised the issue here,
>https://bugzilla.redhat.com/show_bug.cgi?id=1840414, should be sorted with
>4.4.2 I guess), so at the moment it only works with the "testing" AV repo.



In that case I recommend that you:
1. Ensure you have enough space on all bricks for the logs (/var/log/glusterfs). Several gigabytes should be OK.
2. Set all log levels to 'TRACE'. Red Hat's documentation on the topic is quite good:
https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3/html/administration_guide/configuring_the_log_level
3. Reproduce the issue on a fresh VM (one that has never had a snapshot deleted).
4. Switch all log levels back to 'INFO' as per the link in point 2.
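
As an illustration of points 2 and 4, the Python sketch below only builds and prints the `gluster volume set` commands for the brick and client log levels; it does not run them. The volume name SSD_Storage is an assumption taken from the log prefixes earlier in this thread, and `log_level_commands` is a made-up helper name. Check the option names against the linked documentation for your Gluster version before running anything.

```python
import shlex

# Volume options that control brick-side and client-side log verbosity.
LOG_OPTIONS = ("diagnostics.brick-log-level", "diagnostics.client-log-level")


def log_level_commands(volume, level):
    """Return one 'gluster volume set' argv list per log-level option."""
    return [["gluster", "volume", "set", volume, option, level]
            for option in LOG_OPTIONS]


if __name__ == "__main__":
    volume = "SSD_Storage"  # assumed volume name, adjust as needed
    print("# before reproducing the issue (point 2):")
    for argv in log_level_commands(volume, "TRACE"):
        print(shlex.join(argv))
    print("# after reproducing the issue (point 4):")
    for argv in log_level_commands(volume, "INFO"):
        print(shlex.join(argv))
```

Printing rather than executing keeps the sketch safe to run anywhere; the printed lines can be pasted on a node once they have been reviewed.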

The logs will be spread among all nodes. If you have remote logging available, you can also use it for analysis of the logs.

Most probably the brick logs can provide useful information.


>
>> Check the oVirt engine logs (on the HostedEngine VM or your standalone
>> engine), the vdsm logs on the host that was running the VM, and next
>> check the brick logs.
>>
>
>Will do.
>
>Thanks,
>Marco


About VDO - it might require some tuning and even afterwards it won't be very performant, so it depends on your needs.

Best Regards,
Strahil Nikolov