[Errno 107] Transport endpoint is not connected

Dear Gluster users,

I'm a bit at a loss here, and any help would be appreciated.

I've lost a couple of virtual machines, since their disks suffered from severe XFS errors, and some won't boot because vdsm can't resolve the size of the image:
"VM kube-large-01 is down with error. Exit message: Unable to get volume size for domain 5f17d41f-d617-48b8-8881-a53460b02829 volume f16492a6-2d0e-4657-88e3-a9f4d8e48e74."

which is also reported by vdsm-client:
vdsm-client Volume getSize storagepoolID=59cd53a9-0003-02d7-00eb-0000000001e3 storagedomainID=5f17d41f-d617-48b8-8881-a53460b02829 imageID=2f96fd46-1851-49c8-9f48-78bb50dbdffd volumeID=f16492a6-2d0e-4657-88e3-a9f4d8e48e74
vdsm-client: Command Volume.getSize with args {'storagepoolID': '59cd53a9-0003-02d7-00eb-0000000001e3', 'storagedomainID': '5f17d41f-d617-48b8-8881-a53460b02829', 'volumeID': 'f16492a6-2d0e-4657-88e3-a9f4d8e48e74', 'imageID': '2f96fd46-1851-49c8-9f48-78bb50dbdffd'} failed:
(code=100, message=[Errno 107] Transport endpoint is not connected)

with the corresponding gluster mount log:
[2020-01-27 19:42:22.678793] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-data-client-14: remote operation failed. Path: /5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74 (a19abb2f-8e7e-42f0-a3c1-dad1eeb3a851) [Permission denied]
[2020-01-27 19:42:22.678828] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-data-client-13: remote operation failed. Path: /5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74 (a19abb2f-8e7e-42f0-a3c1-dad1eeb3a851) [Permission denied]
[2020-01-27 19:42:22.679806] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-data-client-14: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Permission denied]
[2020-01-27 19:42:22.679862] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-data-client-13: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Permission denied]
[2020-01-27 19:42:22.679981] W [MSGID: 108027] [afr-common.c:2274:afr_attempt_readsubvol_set] 0-ovirt-data-replicate-3: no read subvols for /5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
[2020-01-27 19:42:22.680606] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-data-client-14: remote operation failed. Path: /5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74 (00000000-0000-0000-0000-000000000000) [Permission denied]
[2020-01-27 19:42:22.680622] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-data-client-13: remote operation failed. Path: /5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74 (00000000-0000-0000-0000-000000000000) [Permission denied]
[2020-01-27 19:42:22.681742] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-data-client-13: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Permission denied]
[2020-01-27 19:42:22.681871] W [MSGID: 108027] [afr-common.c:2274:afr_attempt_readsubvol_set] 0-ovirt-data-replicate-3: no read subvols for /5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
[2020-01-27 19:42:22.682344] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-data-client-14: remote operation failed. Path: /5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74 (00000000-0000-0000-0000-000000000000) [Permission denied]
The message "W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-data-client-14: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Permission denied]" repeated 2 times between [2020-01-27 19:42:22.679806] and [2020-01-27 19:42:22.683308]
[2020-01-27 19:42:22.683327] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-data-client-13: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Permission denied]
[2020-01-27 19:42:22.683438] W [MSGID: 108027] [afr-common.c:2274:afr_attempt_readsubvol_set] 0-ovirt-data-replicate-3: no read subvols for /5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
[2020-01-27 19:42:22.683495] I [dict.c:560:dict_get] (-->/usr/lib64/glusterfs/6.7/xlator/cluster/replicate.so(+0x6e92b) [0x7faaaadeb92b] -->/usr/lib64/glusterfs/6.7/xlator/cluster/distribute.so(+0x45c78) [0x7faaaab08c78] -->/lib64/libglusterfs.so.0(dict_get+0x94) [0x7faab36ac254] ) 0-dict: !this || key=trusted.glusterfs.dht.mds [Invalid argument]
[2020-01-27 19:42:22.683506] W [fuse-bridge.c:942:fuse_entry_cbk] 0-glusterfs-fuse: 176728: LOOKUP() /5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74 => -1 (Transport endpoint is not connected)
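When a mount log is this noisy, it helps to know which client translators (i.e. which bricks) are actually returning EACCES. A throwaway parse like the sketch below can summarize that; this is only illustrative (the regex targets the gluster 6.x client log format shown above, and `eacces_per_client` is a made-up helper name):

```python
import re
from collections import Counter

# Matches client-translator warnings of the form seen above, e.g.:
#   ... 0-ovirt-data-client-14: remote operation failed. ... [Permission denied]
# and captures the translator name, which maps to a specific brick.
EACCES_RE = re.compile(
    r'\b0-(\S+-client-\d+): remote operation failed.*\[Permission denied\]')

def eacces_per_client(log_lines):
    """Count Permission-denied failures per client translator."""
    counts = Counter()
    for line in log_lines:
        match = EACCES_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

If the counts are concentrated on a few translators, `gluster volume info` maps those client numbers (in brick order) back to the suspect bricks.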

In addition to this, vdsm also reported it couldn't find the image of the HostedEngine, and the VM refused to boot:
2020-01-25 10:03:45,345+0000 ERROR (vm/20d69acd) [storage.TaskManager.Task] (Task='ffdc4242-17ae-4ea1-9535-0e6fcb81944d') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in prepareImage
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 3203, in prepareImage
    raise se.VolumeDoesNotExist(leafUUID)
VolumeDoesNotExist: Volume does not exist: ('38e4fba7-f140-4630-afab-0f744ebe3b57',)

2020-01-25 10:03:45,345+0000 ERROR (vm/20d69acd) [virt.vm] (vmId='20d69acd-edfd-4aeb-a2ae-49e9c121b7e9') The vm start process failed (vm:933)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 867, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2795, in _run
    self._devices = self._make_devices()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2635, in _make_devices
    disk_objs = self._perform_host_local_adjustment()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2708, in _perform_host_local_adjustment
    self._preparePathsForDrives(disk_params)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1036, in _preparePathsForDrives
    drive, self.id, path=path
  File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 426, in prepareVolumePath
    raise vm.VolumeError(drive)
VolumeError: Bad volume specification {'protocol': 'gluster', 'address': {'function': '0x0', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'slot': '0x06'}, 'serial': '9191ca25-536f-42cd-8373-c04ff9cc1a64', 'index': 0, 'iface': 'virtio', 'apparentsize': '62277025792', 'specParams': {}, 'cache': 'none', 'imageID': '9191ca25-536f-42cd-8373-c04ff9cc1a64', 'shared': 'exclusive', 'truesize': '50591027712', 'type': 'disk', 'domainID': '313f5d25-76af-4ecd-9a20-82a2fe815a3c', 'reqsize': '0', 'format': 'raw', 'poolID': '00000000-0000-0000-0000-000000000000', 'device': 'disk', 'path': 'ovirt-engine/313f5d25-76af-4ecd-9a20-82a2fe815a3c/images/9191ca25-536f-42cd-8373-c04ff9cc1a64/38e4fba7-f140-4630-afab-0f744ebe3b57', 'propagateErrors': 'off', 'name': 'vda', 'volumeID': '38e4fba7-f140-4630-afab-0f744ebe3b57', 'diskType': 'network', 'alias': 'ua-9191ca25-536f-42cd-8373-c04ff9cc1a64', 'hosts': [{'name': '10.201.0.9', 'port': '0'}], 'discard': False}

And last, there is a storage domain which refuses to activate (from the vdsm.log):
2020-01-25 10:01:11,750+0000 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/glusterSD/10.201.0.11:_ovirt-mon-2/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md/metadata (monitor:499)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/monitor.py", line 497, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/check.py", line 391, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
MiscFileReadException: Internal file read failure: (u'/rhev/data-center/mnt/glusterSD/10.201.0.11:_ovirt-mon-2/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md/metadata', 1, bytearray(b"/usr/bin/dd: failed to open \'/rhev/data-center/mnt/glusterSD/10.201.0.11:_ovirt-mon-2/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md/metadata\': Transport endpoint is not connected\n"))

The corresponding gluster mount log:
The message "W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-mon-2-client-0: remote operation failed. Path: /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md (00000000-0000-0000-0000-000000000000) [Permission denied]" repeated 4 times between [2020-01-27 19:58:33.063826] and [2020-01-27 19:59:21.690134]
The message "W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-mon-2-client-1: remote operation failed. Path: /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md (00000000-0000-0000-0000-000000000000) [Permission denied]" repeated 4 times between [2020-01-27 19:58:33.063734] and [2020-01-27 19:59:21.690150]
The message "W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-mon-2-client-0: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Permission denied]" repeated 4 times between [2020-01-27 19:58:33.065027] and [2020-01-27 19:59:21.691313]
The message "W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-mon-2-client-1: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Permission denied]" repeated 4 times between [2020-01-27 19:58:33.065106] and [2020-01-27 19:59:21.691328]
The message "W [MSGID: 108027] [afr-common.c:2274:afr_attempt_readsubvol_set] 0-ovirt-mon-2-replicate-0: no read subvols for /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md" repeated 4 times between [2020-01-27 19:58:33.065163] and [2020-01-27 19:59:21.691369]
[2020-01-27 19:59:50.539315] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-mon-2-client-0: remote operation failed. Path: /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md (00000000-0000-0000-0000-000000000000) [Permission denied]
[2020-01-27 19:59:50.539321] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-mon-2-client-1: remote operation failed. Path: /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md (00000000-0000-0000-0000-000000000000) [Permission denied]
[2020-01-27 19:59:50.540412] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-mon-2-client-1: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Permission denied]
[2020-01-27 19:59:50.540477] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-mon-2-client-0: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Permission denied]
[2020-01-27 19:59:50.540533] W [MSGID: 108027] [afr-common.c:2274:afr_attempt_readsubvol_set] 0-ovirt-mon-2-replicate-0: no read subvols for /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md
[2020-01-27 19:59:50.540604] W [fuse-bridge.c:942:fuse_entry_cbk] 0-glusterfs-fuse: 99: LOOKUP() /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is not connected)
[2020-01-27 19:59:51.488775] W [fuse-bridge.c:942:fuse_entry_cbk] 0-glusterfs-fuse: 105: LOOKUP() /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is not connected)
[2020-01-27 19:59:58.713818] W [fuse-bridge.c:942:fuse_entry_cbk] 0-glusterfs-fuse: 112: LOOKUP() /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is not connected)
[2020-01-27 19:59:59.007467] W [fuse-bridge.c:942:fuse_entry_cbk] 0-glusterfs-fuse: 118: LOOKUP() /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is not connected)
[2020-01-27 20:00:00.136599] W [fuse-bridge.c:942:fuse_entry_cbk] 0-glusterfs-fuse: 125: LOOKUP() /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is not connected)
[2020-01-27 20:00:00.781763] W [fuse-bridge.c:942:fuse_entry_cbk] 0-glusterfs-fuse: 131: LOOKUP() /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is not connected)
[2020-01-27 20:00:00.878852] W [fuse-bridge.c:942:fuse_entry_cbk] 0-glusterfs-fuse: 137: LOOKUP() /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is not connected)
[2020-01-27 20:00:01.580272] W [fuse-bridge.c:942:fuse_entry_cbk] 0-glusterfs-fuse: 144: LOOKUP() /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is not connected)
[2020-01-27 20:00:01.686464] W [fuse-bridge.c:942:fuse_entry_cbk] 0-glusterfs-fuse: 150: LOOKUP() /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is not connected)
[2020-01-27 20:00:01.757087] W [fuse-bridge.c:942:fuse_entry_cbk] 0-glusterfs-fuse: 156: LOOKUP() /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is not connected)
[2020-01-27 20:00:03.061635] W [fuse-bridge.c:942:fuse_entry_cbk] 0-glusterfs-fuse: 163: LOOKUP() /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is not connected)
[2020-01-27 20:00:03.161894] W [fuse-bridge.c:942:fuse_entry_cbk] 0-glusterfs-fuse: 169: LOOKUP() /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is not connected)
[2020-01-27 20:00:04.801107] W [fuse-bridge.c:942:fuse_entry_cbk] 0-glusterfs-fuse: 176: LOOKUP() /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is not connected)
[2020-01-27 20:00:07.251125] W [fuse-bridge.c:942:fuse_entry_cbk] 0-glusterfs-fuse: 183: LOOKUP() /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is not connected)

And some apps directly connected to gluster mounts report these errors:
2020-01-27  1:10:48 0 [ERROR] mysqld: File '/binlog/binlog.~rec~' not found (Errcode: 107 "Transport endpoint is not connected")
2020-01-27  3:28:01 0 [ERROR] mysqld: File '/binlog/binlog.000113' not found (Errcode: 107 "Transport endpoint is not connected")

So the errors seem to hint at either a connection issue or a quorum loss of some sort. However, gluster is running on its own private and separate network, with no firewall rules or anything else that could obstruct the connection.
In addition, gluster volume status reports all bricks and nodes are up, and gluster volume heal reports no pending heals.
What makes this issue even more interesting is that when I manually check the files, all seems fine:

For the first issue, where the machine won't start because vdsm cannot determine the size, qemu-img is able to report it:
qemu-img info /rhev/data-center/mnt/glusterSD/10.201.0.7:_ovirt-data/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
image: /rhev/data-center/mnt/glusterSD/10.201.0.7:_ovirt-data/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
file format: raw
virtual size: 34T (37580963840000 bytes)
disk size: 7.1T
In addition, I'm able to mount the image using a loop device:
# expose the raw image as a block device
losetup /dev/loop0 /rhev/data-center/mnt/glusterSD/10.201.0.7:_ovirt-data/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
# map the partitions inside the image
kpartx -av /dev/loop0
# detect and activate the LVM volume group on those partitions
vgscan
vgchange -ay
# mount the logical volume
mount /dev/mapper/cl--data5-data5 /data5/
After this I'm able to see all contents of the disk, and in fact write to it. So the earlier reported connection error doesn't seem to apply here?
This is actually how I'm currently running the VM: I detached the disk and mounted it in the VM via the loop device. The disk is a data disk for a heavily loaded MySQL instance; MySQL reports no errors and has been running for about a day now.
Of course this is not the way it should run, but it is at least working, only performance seems a bit off. So I would like to solve the issue and be able to attach the image as a disk again.

For the second issue, where the image of the HostedEngine couldn't be found, everything also seems correct; the file is there with the correct permissions:
 ls -la /rhev/data-center/mnt/glusterSD/10.201.0.9\:_ovirt-engine/313f5d25-76af-4ecd-9a20-82a2fe815a3c/images/9191ca25-536f-42cd-8373-c04ff9cc1a64/
total 49406333
drwxr-xr-x.  2 vdsm kvm        4096 Jan 25 12:03 .
drwxr-xr-x. 13 vdsm kvm        4096 Jan 25 14:16 ..
-rw-rw----.  1 vdsm kvm 62277025792 Jan 23 03:04 38e4fba7-f140-4630-afab-0f744ebe3b57
-rw-rw----.  1 vdsm kvm     1048576 Jan 25 21:48 38e4fba7-f140-4630-afab-0f744ebe3b57.lease
-rw-r--r--.  1 vdsm kvm         285 Jan 27  2018 38e4fba7-f140-4630-afab-0f744ebe3b57.meta
And I'm able to mount the image using a loop device and access its contents.
Unfortunately the VM wouldn't boot due to XFS errors. After tinkering with it for about a day trying to make it boot, I gave up and restored from a recent backup. But I copied the PostgreSQL data dir from the mounted old image to the new VM, and PostgreSQL was perfectly fine with it, also indicating the image wasn't completely toast.
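Since vdsm is picky about ownership, a small check script can rule that out across many images faster than eyeballing `ls -la`. A minimal sketch, assuming the oVirt convention of vdsm:kvm = 36:36 (`check_image_perms` is a hypothetical helper name, not a vdsm API):

```python
import os
import stat

# oVirt convention: image files are owned by vdsm:kvm = 36:36
VDSM_UID = 36
KVM_GID = 36

def check_image_perms(path, uid=VDSM_UID, gid=KVM_GID):
    """Return a list of human-readable problems with a file's ownership
    and mode; an empty list means it looks usable by vdsm."""
    st = os.stat(path)
    problems = []
    if st.st_uid != uid:
        problems.append("uid %d != %d" % (st.st_uid, uid))
    if st.st_gid != gid:
        problems.append("gid %d != %d" % (st.st_gid, gid))
    if not (st.st_mode & stat.S_IRUSR and st.st_mode & stat.S_IWUSR):
        problems.append("owner lacks rw access")
    return problems
```

Running this over the files in an images/ directory would quickly confirm whether the "Permission denied" in the mount log corresponds to anything visible in POSIX ownership, which in my case it doesn't.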

And for the last issue, where the storage domain wouldn't activate: the file the log claims cannot be read is perfectly readable and writable;
cat /rhev/data-center/mnt/glusterSD/10.201.0.11:_ovirt-mon-2/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md/metadata
CLASS=Data
DESCRIPTION=ovirt-mon-2
IOOPTIMEOUTSEC=10
LEASERETRIES=3
LEASETIMESEC=60
LOCKPOLICY=
LOCKRENEWALINTERVALSEC=5
POOL_UUID=59cd53a9-0003-02d7-00eb-0000000001e3
REMOTE_PATH=10.201.0.11:/ovirt-mon-2
ROLE=Regular
SDUUID=47edf8ee-83c4-4bd2-b275-20ccd9de4458
TYPE=GLUSTERFS
VERSION=4
_SHA_CKSUM=d49b4a74e70a22a1b816519e8ed4167994672807
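The metadata format is simple enough to sanity-check by script, e.g. to verify that SDUUID matches the domain directory name across all domains. A hedged sketch (`parse_domain_metadata` is an illustrative name, not a vdsm function):

```python
def parse_domain_metadata(text):
    """Parse an oVirt storage-domain dom_md/metadata file into a dict.

    The file is plain KEY=VALUE lines; values may be empty
    (e.g. LOCKPOLICY= above).
    """
    meta = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            meta[key] = value
    return meta
```

In my case the parsed SDUUID (47edf8ee-83c4-4bd2-b275-20ccd9de4458) matches the directory, so the content itself looks intact.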

So I've no clue where these "Transport endpoint is not connected" errors are coming from, or how to resolve them.

I think there are 4 possible causes for this issue:
1) I was trying to optimize the throughput of gluster on some volumes, since we recently gained some additional write load which we had difficulty keeping up with. So I tried to incrementally raise server.event-threads, via:
gluster v set ovirt-data server.event-threads X
Since this didn't seem to improve the performance, I changed it back to its original value. But when I did that, the VMs running on these volumes all locked up and required a reboot, which was by then still possible. Please note: for the volumes ovirt-engine and ovirt-mon-2 this setting was never changed.

2) I had a mix of gluster 6.6 and 6.7 running, since I was in the middle of upgrading everything to 6.7.

3) On one of the physical brick nodes, XFS errors were reported after a reboot and resolved by xfs_repair, which removed some inodes in the process. I wasn't too worried about this, since I expected the gluster self-heal daemon to resolve them. That seemed true for all volumes except one, where a single gfid was pending for about 2 days: in this case exactly the image whose size vdsm reports it cannot resolve. (There are other VM images with the same issue, which I left out for brevity.) However, the pending heal of that single gfid resolved once I mounted the image via the loop device and started writing to it, which is probably due to the way gluster determines what needs healing, despite a "gluster volume heal X full" having been issued before.
I could also confirm that, while the heal was still pending, the pending gfid was in fact missing in the underlying brick directory on the brick node.
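Checking for a gfid on a brick is mechanical, since each brick keeps a hard link for every file under .glusterfs/<first two hex chars of the gfid>/<next two>/<full gfid>. A small helper to build that path (illustrative name; brick root taken from the volume info below):

```python
import os

def gfid_brick_path(brick_root, gfid):
    """Return the path where a gluster brick stores the hard link for a
    gfid: <brick>/.glusterfs/<aa>/<bb>/<gfid>, where aa and bb are the
    first four hex characters of the gfid."""
    gfid = gfid.lower()
    return os.path.join(brick_root, ".glusterfs", gfid[:2], gfid[2:4], gfid)
```

Running os.path.exists() (or plain `ls`) on that path on each replica shows directly which bricks are missing the entry while heal info still lists the gfid.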

4) I did some brick replaces (only on the ovirt-data volume), but only of arbiter bricks of the volume affected in the first issue.

The volume info of the affected volumes looks like this:

Volume Name: ovirt-data
Type: Distributed-Replicate
Volume ID: 2775dc10-c197-446e-a73f-275853d38666
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x (2 + 1) = 12
Transport-type: tcp
Bricks:
Brick1: 10.201.0.5:/data5/gfs/bricks/brick1/ovirt-data
Brick2: 10.201.0.1:/data5/gfs/bricks/brick1/ovirt-data
Brick3: 10.201.0.9:/data0/gfs/bricks/bricka/ovirt-data (arbiter)
Brick4: 10.201.0.7:/data5/gfs/bricks/brick1/ovirt-data
Brick5: 10.201.0.9:/data5/gfs/bricks/brick1/ovirt-data
Brick6: 10.201.0.11:/data0/gfs/bricks/bricka/ovirt-data (arbiter)
Brick7: 10.201.0.6:/data5/gfs/bricks/brick1/ovirt-data
Brick8: 10.201.0.8:/data5/gfs/bricks/brick1/ovirt-data
Brick9: 10.201.0.12:/data0/gfs/bricks/bricka/ovirt-data (arbiter)
Brick10: 10.201.0.12:/data5/gfs/bricks/brick1/ovirt-data
Brick11: 10.201.0.11:/data5/gfs/bricks/brick1/ovirt-data
Brick12: 10.201.0.10:/data0/gfs/bricks/bricka/ovirt-data (arbiter)
Options Reconfigured:
cluster.choose-local: off
server.outstanding-rpc-limit: 1024
storage.owner-gid: 36
storage.owner-uid: 36
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
user.cifs: off
performance.write-behind-window-size: 512MB
performance.cache-size: 384MB
server.event-threads: 5
performance.strict-o-direct: on
cluster.brick-multiplex: on

Volume Name: ovirt-engine
Type: Distributed-Replicate
Volume ID: 9cc4dade-ef2e-4112-bcbf-e0fbc5df4ebc
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x 3 = 9
Transport-type: tcp
Bricks:
Brick1: 10.201.0.5:/data5/gfs/bricks/brick1/ovirt-engine
Brick2: 10.201.0.1:/data5/gfs/bricks/brick1/ovirt-engine
Brick3: 10.201.0.2:/data5/gfs/bricks/brick1/ovirt-engine
Brick4: 10.201.0.8:/data5/gfs/bricks/brick1/ovirt-engine
Brick5: 10.201.0.9:/data5/gfs/bricks/brick1/ovirt-engine
Brick6: 10.201.0.3:/data5/gfs/bricks/brick1/ovirt-engine
Brick7: 10.201.0.12:/data5/gfs/bricks/brick1/ovirt-engine
Brick8: 10.201.0.11:/data5/gfs/bricks/brick1/ovirt-engine
Brick9: 10.201.0.7:/data5/gfs/bricks/brick1/ovirt-engine
Options Reconfigured:
performance.strict-o-direct: on
performance.write-behind-window-size: 512MB
features.shard-block-size: 64MB
performance.cache-size: 128MB
nfs.disable: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: enable
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
user.cifs: off
storage.owner-uid: 36
storage.owner-gid: 36
cluster.brick-multiplex: on

Volume Name: ovirt-mon-2
Type: Replicate
Volume ID: 111ff79a-565a-4d31-9f31-4c839749bafd
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 10.201.0.10:/data0/gfs/bricks/brick1/ovirt-mon-2
Brick2: 10.201.0.11:/data0/gfs/bricks/brick1/ovirt-mon-2
Brick3: 10.201.0.12:/data0/gfs/bricks/bricka/ovirt-mon-2 (arbiter)
Options Reconfigured:
performance.client-io-threads: on
nfs.disable: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
user.cifs: off
cluster.choose-local: off
client.event-threads: 4
server.event-threads: 4
storage.owner-uid: 36
storage.owner-gid: 36
performance.strict-o-direct: on
performance.cache-size: 64MB
performance.write-behind-window-size: 128MB
features.shard-block-size: 64MB
cluster.brick-multiplex: on

Thanks,
Olaf

Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users
