Re: [Errno 107] Transport endpoint is not connected

Hi Strahil,

Thank you for your reply. I found the issue: the "not connected" errors seem to originate from the ACL layer. It somehow received a permission denied, and this was translated into a "not connected" error.
While the file permissions were listed as owner=vdsm and group=kvm, the ACL layer somehow saw this differently. I ran "chown -R vdsm.kvm /rhev/data-center/mnt/glusterSD/10.201.0.11\:_ovirt-mon-2/" on the mount, and suddenly things started working again.
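
The translation can be seen in the gluster mount log quoted further down: the client-N LOOKUP callbacks report "Permission denied", the replicate translator then reports "no read subvols", and FUSE finally returns "Transport endpoint is not connected". Below is a minimal sketch that groups these three message types so the pattern is easier to spot. The regular expressions are assumptions derived only from the log lines in this thread (joined back into one entry per line), and the name summarize is made up for illustration; other gluster versions may log differently.

```python
import re
from collections import Counter

# Patterns for the three message types that show up together in the mount log.
# They are derived from the entries quoted in this thread, nothing more.
CLIENT_DENIED = re.compile(
    r"\] (\S+-client-\d+): remote operation failed\..*\[Permission denied\]"
)
NO_READ_SUBVOLS = re.compile(r"(\S+-replicate-\d+): no read subvols for (\S+)")
FUSE_ENOTCONN = re.compile(
    r"LOOKUP\(\) (\S+) => -1 \(Transport endpoint is not connected\)"
)

# Image path taken from the quoted log.
IMAGE = (
    "/5f17d41f-d617-48b8-8881-a53460b02829/images/"
    "2f96fd46-1851-49c8-9f48-78bb50dbdffd/"
    "f16492a6-2d0e-4657-88e3-a9f4d8e48e74"
)

# Four entries from the quoted log, one logical entry per line.
SAMPLE_LOG = "\n".join([
    "[2020-01-27 19:42:22.678793] W [MSGID: 114031] "
    "[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-data-client-14: "
    "remote operation failed. Path: " + IMAGE +
    " (a19abb2f-8e7e-42f0-a3c1-dad1eeb3a851) [Permission denied]",
    "[2020-01-27 19:42:22.678828] W [MSGID: 114031] "
    "[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-data-client-13: "
    "remote operation failed. Path: " + IMAGE +
    " (a19abb2f-8e7e-42f0-a3c1-dad1eeb3a851) [Permission denied]",
    "[2020-01-27 19:42:22.679981] W [MSGID: 108027] "
    "[afr-common.c:2274:afr_attempt_readsubvol_set] 0-ovirt-data-replicate-3: "
    "no read subvols for " + IMAGE,
    "[2020-01-27 19:42:22.683506] W [fuse-bridge.c:942:fuse_entry_cbk] "
    "0-glusterfs-fuse: 176728: LOOKUP() " + IMAGE +
    " => -1 (Transport endpoint is not connected)",
])


def summarize(log_text):
    """Count denials per client and collect paths that FUSE failed with ENOTCONN."""
    denied = Counter()          # client translator name -> "Permission denied" count
    no_read = Counter()         # replicate set name -> "no read subvols" count
    enotconn_paths = []         # paths for which the FUSE LOOKUP returned ENOTCONN
    for line in log_text.splitlines():
        match = CLIENT_DENIED.search(line)
        if match:
            denied[match.group(1)] += 1
            continue
        match = NO_READ_SUBVOLS.search(line)
        if match:
            no_read[match.group(1)] += 1
            continue
        match = FUSE_ENOTCONN.search(line)
        if match:
            enotconn_paths.append(match.group(1))
    return {
        "denied": denied,
        "no_read_subvols": no_read,
        "enotconn_paths": enotconn_paths,
    }


if __name__ == "__main__":
    result = summarize(SAMPLE_LOG)
    print("denied per client:", dict(result["denied"]))
    print("no read subvols:", dict(result["no_read_subvols"]))
    print("ENOTCONN paths:", result["enotconn_paths"])
```

If every path that ends in "Transport endpoint is not connected" is preceded by "Permission denied" from the clients of the same replica set, that points at permissions rather than at the network, which matches what I saw here.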

I do indeed have (or rather had, since the restore procedure required an empty domain) one other VM on the HostedEngine domain. This VM runs other critical services, such as VPN. Since I consider the HostedEngine domain one of the most reliable domains, I used it for critical services.
All other VMs have their own domains.

I'm a bit surprised by your comment about brick multiplexing. I understood that it should actually improve performance by sharing resources. Would you have some more information about this?

To answer your questions:

We currently have 15 physical hosts.

1) There are no pending heals.
2) Yes, I'm able to connect to the ports.
3) All peers report as connected.
4) Actually, I had a setup like this before: multiple smaller qcow disks in a RAID 0 with LVM. But this did not appear to be reliable, so I switched to a single large disk. Would you know if there is some documentation about this?
5) I'm running about the latest and greatest stable, 4.3.7.2-1.el7. I only had trouble with the restore, because the cluster was still in compatibility mode 4.2 and there were 2 older VMs which had snapshots from prior versions, while the leaf was at compatibility level 4.2. Note: the backup was taken on the engine running 4.3.

Thanks Olaf



On Tue, 28 Jan 2020 at 17:31, Strahil Nikolov <hunter86_bg@xxxxxxxxx> wrote:
On January 27, 2020 11:49:08 PM GMT+02:00, Olaf Buitelaar <olaf.buitelaar@xxxxxxxxx> wrote:
>Dear Gluster users,
>
>I'm a bit at a loss here, and any help would be appreciated.
>
>I've lost a couple of virtual machines, since the disks suffered from
>severe XFS errors, and some won't boot because they can't resolve the
>size of the image, as reported by vdsm:
>"VM kube-large-01 is down with error. Exit message: Unable to get
>volume
>size for domain 5f17d41f-d617-48b8-8881-a53460b02829 volume
>f16492a6-2d0e-4657-88e3-a9f4d8e48e74."
>
>which is also reported by the vdsm-client;  vdsm-client Volume getSize
>storagepoolID=59cd53a9-0003-02d7-00eb-0000000001e3
>storagedomainID=5f17d41f-d617-48b8-8881-a53460b02829
>imageID=2f96fd46-1851-49c8-9f48-78bb50dbdffd
>volumeID=f16492a6-2d0e-4657-88e3-a9f4d8e48e74
>vdsm-client: Command Volume.getSize with args {'storagepoolID':
>'59cd53a9-0003-02d7-00eb-0000000001e3', 'storagedomainID':
>'5f17d41f-d617-48b8-8881-a53460b02829', 'volumeID':
>'f16492a6-2d0e-4657-88e3-a9f4d8e48e74', 'imageID':
>'2f96fd46-1851-49c8-9f48-78bb50dbdffd'} failed:
>(code=100, message=[Errno 107] Transport endpoint is not connected)
>
>with corresponding gluster mount log;
>[2020-01-27 19:42:22.678793] W [MSGID: 114031]
>[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
>0-ovirt-data-client-14:
>remote operation failed. Path:
>/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
>(a19abb2f-8e7e-42f0-a3c1-dad1eeb3a851) [Permission denied]
>[2020-01-27 19:42:22.678828] W [MSGID: 114031]
>[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
>0-ovirt-data-client-13:
>remote operation failed. Path:
>/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
>(a19abb2f-8e7e-42f0-a3c1-dad1eeb3a851) [Permission denied]
>[2020-01-27 19:42:22.679806] W [MSGID: 114031]
>[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
>0-ovirt-data-client-14:
>remote operation failed. Path: (null)
>(00000000-0000-0000-0000-000000000000) [Permission denied]
>[2020-01-27 19:42:22.679862] W [MSGID: 114031]
>[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
>0-ovirt-data-client-13:
>remote operation failed. Path: (null)
>(00000000-0000-0000-0000-000000000000) [Permission denied]
>[2020-01-27 19:42:22.679981] W [MSGID: 108027]
>[afr-common.c:2274:afr_attempt_readsubvol_set]
>0-ovirt-data-replicate-3: no
>read subvols for
>/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
>[2020-01-27 19:42:22.680606] W [MSGID: 114031]
>[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
>0-ovirt-data-client-14:
>remote operation failed. Path:
>/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
>(00000000-0000-0000-0000-000000000000) [Permission denied]
>[2020-01-27 19:42:22.680622] W [MSGID: 114031]
>[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
>0-ovirt-data-client-13:
>remote operation failed. Path:
>/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
>(00000000-0000-0000-0000-000000000000) [Permission denied]
>[2020-01-27 19:42:22.681742] W [MSGID: 114031]
>[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
>0-ovirt-data-client-13:
>remote operation failed. Path: (null)
>(00000000-0000-0000-0000-000000000000) [Permission denied]
>[2020-01-27 19:42:22.681871] W [MSGID: 108027]
>[afr-common.c:2274:afr_attempt_readsubvol_set]
>0-ovirt-data-replicate-3: no
>read subvols for
>/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
>[2020-01-27 19:42:22.682344] W [MSGID: 114031]
>[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
>0-ovirt-data-client-14:
>remote operation failed. Path:
>/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
>(00000000-0000-0000-0000-000000000000) [Permission denied]
>The message "W [MSGID: 114031]
>[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
>0-ovirt-data-client-14:
>remote operation failed. Path: (null)
>(00000000-0000-0000-0000-000000000000) [Permission denied]" repeated 2
>times between [2020-01-27 19:42:22.679806] and [2020-01-27
>19:42:22.683308]
>[2020-01-27 19:42:22.683327] W [MSGID: 114031]
>[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
>0-ovirt-data-client-13:
>remote operation failed. Path: (null)
>(00000000-0000-0000-0000-000000000000) [Permission denied]
>[2020-01-27 19:42:22.683438] W [MSGID: 108027]
>[afr-common.c:2274:afr_attempt_readsubvol_set]
>0-ovirt-data-replicate-3: no
>read subvols for
>/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
>[2020-01-27 19:42:22.683495] I [dict.c:560:dict_get]
>(-->/usr/lib64/glusterfs/6.7/xlator/cluster/replicate.so(+0x6e92b)
>[0x7faaaadeb92b]
>-->/usr/lib64/glusterfs/6.7/xlator/cluster/distribute.so(+0x45c78)
>[0x7faaaab08c78] -->/lib64/libglusterfs.so.0(dict_get+0x94)
>[0x7faab36ac254] ) 0-dict: !this || key=trusted.glusterfs.dht.mds
>[Invalid
>argument]
>[2020-01-27 19:42:22.683506] W [fuse-bridge.c:942:fuse_entry_cbk]
>0-glusterfs-fuse: 176728: LOOKUP()
>/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
>=> -1 (Transport endpoint is not connected)
>
>In addition to this, vdsm also reported it couldn't find the image of
>the
>HostedEngine, and refused to boot;
>2020-01-25 10:03:45,345+0000 ERROR (vm/20d69acd)
>[storage.TaskManager.Task]
>(Task='ffdc4242-17ae-4ea1-9535-0e6fcb81944d') Unexpected error
>(task:875)
>Traceback (most recent call last):
>File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882,
>in _run
>    return fn(*args, **kargs)
>  File "<string>", line 2, in prepareImage
>File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in
>method
>    ret = func(*args, **kwargs)
>File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 3203,
>in prepareImage
>    raise se.VolumeDoesNotExist(leafUUID)
>VolumeDoesNotExist: Volume does not exist:
>('38e4fba7-f140-4630-afab-0f744ebe3b57',)
>
>2020-01-25 10:03:45,345+0000 ERROR (vm/20d69acd) [virt.vm]
>(vmId='20d69acd-edfd-4aeb-a2ae-49e9c121b7e9') The vm start process
>failed
>(vm:933)
>Traceback (most recent call last):
>  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 867, in
>_startUnderlyingVm
>    self._run()
> File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2795, in
>_run
>    self._devices = self._make_devices()
> File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2635, in
>_make_devices
>    disk_objs = self._perform_host_local_adjustment()
> File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2708, in
>_perform_host_local_adjustment
>    self._preparePathsForDrives(disk_params)
> File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1036, in
>_preparePathsForDrives
>    drive, self.id, path=path
> File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 426, in
>prepareVolumePath
>    raise vm.VolumeError(drive)
>VolumeError: Bad volume specification {'protocol': 'gluster',
>'address':
>{'function': '0x0', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci',
>'slot': '0x06'}, 'serial': '9191ca25-536f-42cd-8373-c04ff9cc1a64',
>'index':
>0, 'iface': 'virtio', 'apparentsize': '62277025792', 'specParams': {},
>'cache': 'none', 'imageID': '9191ca25-536f-42cd-8373-c04ff9cc1a64',
>'shared': 'exclusive', 'truesize': '50591027712', 'type': 'disk',
>'domainID': '313f5d25-76af-4ecd-9a20-82a2fe815a3c', 'reqsize': '0',
>'format': 'raw', 'poolID': '00000000-0000-0000-0000-000000000000',
>'device': 'disk', 'path':
>'ovirt-engine/313f5d25-76af-4ecd-9a20-82a2fe815a3c/images/9191ca25-536f-42cd-8373-c04ff9cc1a64/38e4fba7-f140-4630-afab-0f744ebe3b57',
>'propagateErrors': 'off', 'name': 'vda', 'volumeID':
>'38e4fba7-f140-4630-afab-0f744ebe3b57', 'diskType': 'network', 'alias':
>'ua-9191ca25-536f-42cd-8373-c04ff9cc1a64', 'hosts': [{'name':
>'10.201.0.9',
>'port': '0'}], 'discard': False}
>
>And last, there is a storage domain which refuses to activate (from the
>vdsm.log):
>2020-01-25 10:01:11,750+0000 ERROR (check/loop) [storage.Monitor] Error
>checking path
>/rhev/data-center/mnt/glusterSD/10.201.0.11:_ovirt-mon-2/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md/metadata
>(monitor:499)
>Traceback (most recent call last):
>  File "/usr/lib/python2.7/site-packages/vdsm/storage/monitor.py", line
>497, in _pathChecked
>    delay = result.delay()
>File "/usr/lib/python2.7/site-packages/vdsm/storage/check.py", line
>391,
>in delay
>    raise exception.MiscFileReadException(self.path, self.rc, self.err)
>MiscFileReadException: Internal file read failure:
>(u'/rhev/data-center/mnt/glusterSD/10.201.0.11:_ovirt-mon-2/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md/metadata',
>1, bytearray(b"/usr/bin/dd: failed to open
>\'/rhev/data-center/mnt/glusterSD/10.201.0.11:_ovirt-mon-2/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md/metadata\':
>Transport endpoint is not connected\n"))
>
>corresponding gluster mount log;
>The message "W [MSGID: 114031]
>[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
>0-ovirt-mon-2-client-0:
>remote operation failed. Path:
>/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md
>(00000000-0000-0000-0000-000000000000) [Permission denied]" repeated 4
>times between [2020-01-27 19:58:33.063826] and [2020-01-27
>19:59:21.690134]
>The message "W [MSGID: 114031]
>[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
>0-ovirt-mon-2-client-1:
>remote operation failed. Path:
>/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md
>(00000000-0000-0000-0000-000000000000) [Permission denied]" repeated 4
>times between [2020-01-27 19:58:33.063734] and [2020-01-27
>19:59:21.690150]
>The message "W [MSGID: 114031]
>[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
>0-ovirt-mon-2-client-0:
>remote operation failed. Path: (null)
>(00000000-0000-0000-0000-000000000000) [Permission denied]" repeated 4
>times between [2020-01-27 19:58:33.065027] and [2020-01-27
>19:59:21.691313]
>The message "W [MSGID: 114031]
>[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
>0-ovirt-mon-2-client-1:
>remote operation failed. Path: (null)
>(00000000-0000-0000-0000-000000000000) [Permission denied]" repeated 4
>times between [2020-01-27 19:58:33.065106] and [2020-01-27
>19:59:21.691328]
>The message "W [MSGID: 108027]
>[afr-common.c:2274:afr_attempt_readsubvol_set]
>0-ovirt-mon-2-replicate-0:
>no read subvols for /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md"
>repeated
>4 times between [2020-01-27 19:58:33.065163] and [2020-01-27
>19:59:21.691369]
>[2020-01-27 19:59:50.539315] W [MSGID: 114031]
>[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
>0-ovirt-mon-2-client-0:
>remote operation failed. Path:
>/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md
>(00000000-0000-0000-0000-000000000000) [Permission denied]
>[2020-01-27 19:59:50.539321] W [MSGID: 114031]
>[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
>0-ovirt-mon-2-client-1:
>remote operation failed. Path:
>/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md
>(00000000-0000-0000-0000-000000000000) [Permission denied]
>[2020-01-27 19:59:50.540412] W [MSGID: 114031]
>[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
>0-ovirt-mon-2-client-1:
>remote operation failed. Path: (null)
>(00000000-0000-0000-0000-000000000000) [Permission denied]
>[2020-01-27 19:59:50.540477] W [MSGID: 114031]
>[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk]
>0-ovirt-mon-2-client-0:
>remote operation failed. Path: (null)
>(00000000-0000-0000-0000-000000000000) [Permission denied]
>[2020-01-27 19:59:50.540533] W [MSGID: 108027]
>[afr-common.c:2274:afr_attempt_readsubvol_set]
>0-ovirt-mon-2-replicate-0:
>no read subvols for /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md
>[2020-01-27 19:59:50.540604] W [fuse-bridge.c:942:fuse_entry_cbk]
>0-glusterfs-fuse: 99: LOOKUP()
>/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md
>=> -1 (Transport endpoint is not connected)
>[2020-01-27 19:59:51.488775] W [fuse-bridge.c:942:fuse_entry_cbk]
>0-glusterfs-fuse: 105: LOOKUP()
>/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint
>is
>not connected)
>[2020-01-27 19:59:58.713818] W [fuse-bridge.c:942:fuse_entry_cbk]
>0-glusterfs-fuse: 112: LOOKUP()
>/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint
>is
>not connected)
>[2020-01-27 19:59:59.007467] W [fuse-bridge.c:942:fuse_entry_cbk]
>0-glusterfs-fuse: 118: LOOKUP()
>/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint
>is
>not connected)
>[2020-01-27 20:00:00.136599] W [fuse-bridge.c:942:fuse_entry_cbk]
>0-glusterfs-fuse: 125: LOOKUP()
>/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint
>is
>not connected)
>[2020-01-27 20:00:00.781763] W [fuse-bridge.c:942:fuse_entry_cbk]
>0-glusterfs-fuse: 131: LOOKUP()
>/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint
>is
>not connected)
>[2020-01-27 20:00:00.878852] W [fuse-bridge.c:942:fuse_entry_cbk]
>0-glusterfs-fuse: 137: LOOKUP()
>/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint
>is
>not connected)
>[2020-01-27 20:00:01.580272] W [fuse-bridge.c:942:fuse_entry_cbk]
>0-glusterfs-fuse: 144: LOOKUP()
>/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint
>is
>not connected)
>[2020-01-27 20:00:01.686464] W [fuse-bridge.c:942:fuse_entry_cbk]
>0-glusterfs-fuse: 150: LOOKUP()
>/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint
>is
>not connected)
>[2020-01-27 20:00:01.757087] W [fuse-bridge.c:942:fuse_entry_cbk]
>0-glusterfs-fuse: 156: LOOKUP()
>/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint
>is
>not connected)
>[2020-01-27 20:00:03.061635] W [fuse-bridge.c:942:fuse_entry_cbk]
>0-glusterfs-fuse: 163: LOOKUP()
>/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint
>is
>not connected)
>[2020-01-27 20:00:03.161894] W [fuse-bridge.c:942:fuse_entry_cbk]
>0-glusterfs-fuse: 169: LOOKUP()
>/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint
>is
>not connected)
>[2020-01-27 20:00:04.801107] W [fuse-bridge.c:942:fuse_entry_cbk]
>0-glusterfs-fuse: 176: LOOKUP()
>/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint
>is
>not connected)
>[2020-01-27 20:00:07.251125] W [fuse-bridge.c:942:fuse_entry_cbk]
>0-glusterfs-fuse: 183: LOOKUP()
>/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint
>is
>not connected)
>
>And some apps directly connecting to gluster mounts report these
>errors:
>2020-01-27  1:10:48 0 [ERROR] mysqld: File '/binlog/binlog.~rec~' not
>found
>(Errcode: 107 "Transport endpoint is not connected")
>2020-01-27  3:28:01 0 [ERROR] mysqld: File '/binlog/binlog.000113' not
>found (Errcode: 107 "Transport endpoint is not connected")
>
>So the errors seem to hint at either a connection issue or a quorum
>loss of some sort. However, gluster is running on its own private and
>separate network, with no firewall rules or anything else which could
>obstruct the connection.
>In addition, gluster volume status reports all bricks and nodes are up,
>and gluster volume heal reports no pending heals.
>What makes this issue even more interesting is that when I manually
>check the files, all seems fine:
>
>For the first issue, where the machine won't start because vdsm cannot
>determine the size, qemu is able to report the size:
>qemu-img info /rhev/data-center/mnt/glusterSD/10.201.0.7:_ovirt-data/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
>image: /rhev/data-center/mnt/glusterSD/10.201.0.7:_ovirt-data/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
>file format: raw
>virtual size: 34T (37580963840000 bytes)
>disk size: 7.1T
>In addition, I'm able to mount the volume using a loop device:
>losetup /dev/loop0 /rhev/data-center/mnt/glusterSD/10.201.0.7:_ovirt-data/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
>kpartx -av /dev/loop0
>vgscan
>vgchange -ay
>mount /dev/mapper/cl--data5-data5 /data5/
>After this I'm able to see all contents of the disk, and in fact write
>to it. So the earlier reported connection error doesn't seem to apply
>here?
>This is actually how I'm currently running the VM: I detached the
>disk and mounted it in the VM via the loop device. The disk is a data
>disk for a heavily loaded mysql instance, and mysql is reporting no
>errors and has been running for about a day now.
>Of course this is not the way it should run, but it is at least working;
>only performance seems a bit off. So I would like to solve the issue and
>be able to attach the image as a disk again.
>
>For the second issue, where the image of the HostedEngine couldn't be
>found, all seems correct as well.
>The file is there and has the correct permissions:
> ls -la /rhev/data-center/mnt/glusterSD/10.201.0.9\:_ovirt-engine/313f5d25-76af-4ecd-9a20-82a2fe815a3c/images/9191ca25-536f-42cd-8373-c04ff9cc1a64/
>total 49406333
>drwxr-xr-x.  2 vdsm kvm        4096 Jan 25 12:03 .
>drwxr-xr-x. 13 vdsm kvm        4096 Jan 25 14:16 ..
>-rw-rw----.  1 vdsm kvm 62277025792 Jan 23 03:04 38e4fba7-f140-4630-afab-0f744ebe3b57
>-rw-rw----.  1 vdsm kvm     1048576 Jan 25 21:48 38e4fba7-f140-4630-afab-0f744ebe3b57.lease
>-rw-r--r--.  1 vdsm kvm         285 Jan 27  2018 38e4fba7-f140-4630-afab-0f744ebe3b57.meta
>And I'm able to mount the image using a loop device and access its
>contents.
>Unfortunately the VM wouldn't boot due to XFS errors. After tinkering
>with this for about a day to make it boot, I gave up and restored from a
>recent backup. But I took the postgres data dir from the mounted old
>image to the new VM, and postgres was perfectly fine with it, which also
>indicates the image wasn't completely toast.
>
>And the last issue, where the storage domain wouldn't activate: the file
>it claims it cannot read in the log is perfectly readable and writable:
>cat /rhev/data-center/mnt/glusterSD/10.201.0.11:_ovirt-mon-2/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md/metadata
>CLASS=Data
>DESCRIPTION=ovirt-mon-2
>IOOPTIMEOUTSEC=10
>LEASERETRIES=3
>LEASETIMESEC=60
>LOCKPOLICY=
>LOCKRENEWALINTERVALSEC=5
>POOL_UUID=59cd53a9-0003-02d7-00eb-0000000001e3
>REMOTE_PATH=10.201.0.11:/ovirt-mon-2
>ROLE=Regular
>SDUUID=47edf8ee-83c4-4bd2-b275-20ccd9de4458
>TYPE=GLUSTERFS
>VERSION=4
>_SHA_CKSUM=d49b4a74e70a22a1b816519e8ed4167994672807
>
>So I have no clue where these "Transport endpoint is not connected"
>errors are coming from, or how to resolve them.
>
>I think there are 4 possible causes for this issue:
>1) I was trying to optimize the throughput of gluster on some volumes,
>since we recently gained some additional write load, which we had
>difficulty keeping up with. So I tried to incrementally
>raise server.event-threads via:
>gluster v set ovirt-data server.event-threads X
>Since this didn't seem to improve the performance, I changed it back to
>its original value. But when I did that, the VMs running on these
>volumes all locked up and required a reboot, which was by then still
>possible. Please note that for the volumes ovirt-engine and ovirt-mon-2
>this setting wasn't changed.
>
>2) I had a mix of gluster 6.6 and 6.7 running, since I was in the
>middle of upgrading everything to 6.7.
>
>3) On one of the physical brick nodes, XFS errors were reported after a
>reboot and resolved by xfs_repair, which removed some inodes in the
>process. I wasn't too worried about this, since I expected the gluster
>self-heal daemon to restore them. That seemed true for all volumes
>except one, where 1 gfid was pending for about 2 days. It was exactly
>the image for which vdsm reports it cannot resolve the size.
>But there are other VM images with the same issue, which I left out for
>brevity.
>However, the pending heal of the single gfid resolved once I mounted the
>image via the loop device and started writing to it. This is probably
>due to the way gluster determines what needs healing, even though a
>gluster heal X full had been issued before.
>I could also confirm that the pending gfid was in fact missing from the
>underlying brick directory on the brick node while the heal was still
>pending.
>
>4) I did some brick replacements (only on the ovirt-data volume), but
>only of arbiter bricks of the volume affected in the first issue.
>
>The volume info for the affected volumes looks like this:
>
>Volume Name: ovirt-data
>Type: Distributed-Replicate
>Volume ID: 2775dc10-c197-446e-a73f-275853d38666
>Status: Started
>Snapshot Count: 0
>Number of Bricks: 4 x (2 + 1) = 12
>Transport-type: tcp
>Bricks:
>Brick1: 10.201.0.5:/data5/gfs/bricks/brick1/ovirt-data
>Brick2: 10.201.0.1:/data5/gfs/bricks/brick1/ovirt-data
>Brick3: 10.201.0.9:/data0/gfs/bricks/bricka/ovirt-data (arbiter)
>Brick4: 10.201.0.7:/data5/gfs/bricks/brick1/ovirt-data
>Brick5: 10.201.0.9:/data5/gfs/bricks/brick1/ovirt-data
>Brick6: 10.201.0.11:/data0/gfs/bricks/bricka/ovirt-data (arbiter)
>Brick7: 10.201.0.6:/data5/gfs/bricks/brick1/ovirt-data
>Brick8: 10.201.0.8:/data5/gfs/bricks/brick1/ovirt-data
>Brick9: 10.201.0.12:/data0/gfs/bricks/bricka/ovirt-data (arbiter)
>Brick10: 10.201.0.12:/data5/gfs/bricks/brick1/ovirt-data
>Brick11: 10.201.0.11:/data5/gfs/bricks/brick1/ovirt-data
>Brick12: 10.201.0.10:/data0/gfs/bricks/bricka/ovirt-data (arbiter)
>Options Reconfigured:
>cluster.choose-local: off
>server.outstanding-rpc-limit: 1024
>storage.owner-gid: 36
>storage.owner-uid: 36
>transport.address-family: inet
>performance.readdir-ahead: on
>nfs.disable: on
>performance.quick-read: off
>performance.read-ahead: off
>performance.io-cache: off
>performance.stat-prefetch: off
>performance.low-prio-threads: 32
>network.remote-dio: off
>cluster.eager-lock: enable
>cluster.quorum-type: auto
>cluster.server-quorum-type: server
>cluster.data-self-heal-algorithm: full
>cluster.locking-scheme: granular
>cluster.shd-max-threads: 8
>cluster.shd-wait-qlength: 10000
>features.shard: on
>user.cifs: off
>performance.write-behind-window-size: 512MB
>performance.cache-size: 384MB
>server.event-threads: 5
>performance.strict-o-direct: on
>cluster.brick-multiplex: on
>
>Volume Name: ovirt-engine
>Type: Distributed-Replicate
>Volume ID: 9cc4dade-ef2e-4112-bcbf-e0fbc5df4ebc
>Status: Started
>Snapshot Count: 0
>Number of Bricks: 3 x 3 = 9
>Transport-type: tcp
>Bricks:
>Brick1: 10.201.0.5:/data5/gfs/bricks/brick1/ovirt-engine
>Brick2: 10.201.0.1:/data5/gfs/bricks/brick1/ovirt-engine
>Brick3: 10.201.0.2:/data5/gfs/bricks/brick1/ovirt-engine
>Brick4: 10.201.0.8:/data5/gfs/bricks/brick1/ovirt-engine
>Brick5: 10.201.0.9:/data5/gfs/bricks/brick1/ovirt-engine
>Brick6: 10.201.0.3:/data5/gfs/bricks/brick1/ovirt-engine
>Brick7: 10.201.0.12:/data5/gfs/bricks/brick1/ovirt-engine
>Brick8: 10.201.0.11:/data5/gfs/bricks/brick1/ovirt-engine
>Brick9: 10.201.0.7:/data5/gfs/bricks/brick1/ovirt-engine
>Options Reconfigured:
>performance.strict-o-direct: on
>performance.write-behind-window-size: 512MB
>features.shard-block-size: 64MB
>performance.cache-size: 128MB
>nfs.disable: on
>transport.address-family: inet
>performance.quick-read: off
>performance.read-ahead: off
>performance.io-cache: off
>performance.low-prio-threads: 32
>network.remote-dio: enable
>cluster.eager-lock: enable
>cluster.quorum-type: auto
>cluster.server-quorum-type: server
>cluster.data-self-heal-algorithm: full
>cluster.locking-scheme: granular
>cluster.shd-max-threads: 8
>cluster.shd-wait-qlength: 10000
>features.shard: on
>user.cifs: off
>storage.owner-uid: 36
>storage.owner-gid: 36
>cluster.brick-multiplex: on
>
>Volume Name: ovirt-mon-2
>Type: Replicate
>Volume ID: 111ff79a-565a-4d31-9f31-4c839749bafd
>Status: Started
>Snapshot Count: 0
>Number of Bricks: 1 x (2 + 1) = 3
>Transport-type: tcp
>Bricks:
>Brick1: 10.201.0.10:/data0/gfs/bricks/brick1/ovirt-mon-2
>Brick2: 10.201.0.11:/data0/gfs/bricks/brick1/ovirt-mon-2
>Brick3: 10.201.0.12:/data0/gfs/bricks/bricka/ovirt-mon-2 (arbiter)
>Options Reconfigured:
>performance.client-io-threads: on
>nfs.disable: on
>transport.address-family: inet
>performance.quick-read: off
>performance.read-ahead: off
>performance.io-cache: off
>performance.low-prio-threads: 32
>network.remote-dio: off
>cluster.eager-lock: enable
>cluster.quorum-type: auto
>cluster.server-quorum-type: server
>cluster.data-self-heal-algorithm: full
>cluster.locking-scheme: granular
>cluster.shd-max-threads: 8
>cluster.shd-wait-qlength: 10000
>features.shard: on
>user.cifs: off
>cluster.choose-local: off
>client.event-threads: 4
>server.event-threads: 4
>storage.owner-uid: 36
>storage.owner-gid: 36
>performance.strict-o-direct: on
>performance.cache-size: 64MB
>performance.write-behind-window-size: 128MB
>features.shard-block-size: 64MB
>cluster.brick-multiplex: on
>
>Thanks Olaf

Hi Olaf,

Thanks for the detailed output.
At first glance I noticed that you use the HostedEngine domain both for oVirt's engine VM and for other VMs, is that right?
If yes, that's against best practices and not recommended.
Second, you use brick multiplexing, but according to the RH documentation that feature is not supported for your workload. So in your case it draws attention, but it should not be a problem.

Can you specify how many physical hosts you have?

I will try to look at the output in more depth, but I think you need to check the following:
1. Check the gluster heal status - any pending heals should be resolved.
2. Use telnet/nc/ncat/netcat to verify that each host can reach the peers' brick ports.
3. gluster volume heal <volume> info should report that all bricks are connected, and
gluster volume status must report that all bricks have a pid.
4. OPTIONAL - Try to create smaller disks via oVirt (it's not a good idea to have large qcow2 disks) and assign them to your mysql VM. Then try to pvmove the LVs from the old disk (mounted with loop) to the new disks - that way you can get rid of the old qcow disk.
5. What is your oVirt version? Could it be an old 3.x?
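
If telnet or nc is not at hand, the port check from point 2 can also be scripted. This is only a minimal sketch using the Python standard library; the function names port_is_open and check_ports are made up for illustration, and the brick ports to test have to be taken from the gluster volume status output of your own cluster.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host and attempts a TCP handshake;
        # the socket is closed again as soon as the with-block ends.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_ports(targets, timeout=3.0):
    """Check a list of (host, port) pairs and return {(host, port): bool}."""
    return {
        (host, port): port_is_open(host, port, timeout)
        for host, port in targets
    }
```

For example, check_ports([("10.201.0.5", 24007)]) would test the default glusterd management port on one of your hosts. A successful connect only shows that the port is reachable from that host; it says nothing about permissions or quorum.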

Don't forget to back up :)

Best Regards,
Strahil Nikolov
________

Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users
