Hi,

On Sat, Sep 9, 2017 at 2:35 AM, WK <wkmail@xxxxxxxxx> wrote:
> Pavel.
>
> Is there a difference between native client (fuse) and libgfapi in regards
> to the crashing/read-only behaviour?

I switched to FUSE now, and the VM crashed (read-only remount) immediately after one node started rebooting.

I also tried mounting the same volume with mount.glusterfs on a different server (not a VM), running Ubuntu Xenial and gluster client 3.10.5:

mount -t glusterfs -o backupvolfile-server=10.0.1.202 10.0.1.201:/gv_openstack_1 /mnt/gv_openstack_1/

I ran the fio job I described earlier. As soon as I ran killall glusterfsd, fio reported:

fio: io_u error on file /mnt/gv_openstack_1/fio.data: Transport endpoint is not connected: read offset=7022575616, buflen=262144
fio: pid=7205, err=107/file:io_u.c:1582, func=io_u error, error=Transport endpoint is not connected

And crashed.

I still cannot believe I am the only one experiencing these problems, which tells me there must be some problem in my setup. However, I have not experienced any crashes while all nodes were up. Ever.

I suspected the disks and the network as the culprit, but we run SMART tests frequently (short and long), the bricks are on RAID10 (6x SSDs), and the switches are Juniper EX4550s (shallow packet buffer, but no drops in the statistics), pretty much dedicated to Gluster. We have run many VMs there and stored and heavily used other data as well. Neither the gluster logs nor the system logs give any hint of a HW/network failure.

> We use Rep2 + Arb and can shutdown a node cleanly, without issue on our VMs.
> We do it all the time for upgrades and maintenance.
>
> However we are still on native client as we haven't had time to work on
> libgfapi yet. Maybe that is more tolerant.
>
> We have linux VMs mostly with XFS filesystems.

We use whatever the official cloud (OpenStack) images provide; all the tests I describe are on Ubuntu Xenial VMs with ext4.

> During the downtime, the VMs continue to run with normal speed.
>
> In this case we migrated to the VM so date node 2 (c2g.gluster) and shutdown
> c1g.gluster to do some upgrades.
>
> # gluster peer status
> Number of Peers: 2
>
> Hostname: c1g.gluster
> Uuid: 91be2005-30e6-462b-a66e-773913cacab6
> State: Peer in Cluster (Disconnected)
> Hostname: arb-c2.gluster
> Uuid: 20862755-e54e-4b79-96a8-59e78c6a6a2e
> State: Peer in Cluster (Connected)
>
> # gluster volume status
> Status of volume: brick1
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick c2g.gluster:/GLUSTER/brick1           49152     0          Y       5194
> Brick arb-c2.gluster:/GLUSTER/brick1        49152     0          Y       3647
> Self-heal Daemon on localhost               N/A       N/A        Y       5214
> Self-heal Daemon on arb-c2.gluster          N/A       N/A        Y       3667
>
> Task Status of Volume brick1
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
> When we return the c1g node, we do see a "pause" in the VMs as the shards
> heal. By pause meaning a terminal session gets spongy, but that passes
> pretty quickly.

Hmm, do you see any errors in the VM's dmesg? Or any other reason for the "sponginess"?

> Also are your VMs mounted in libvirt with caching? We always use
> cache='none' so we can migrate around easily.

No cache, virtio:

<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='none'/>
  <source protocol='gluster' name='gv_openstack_1/volume-3a7eaf5a-8348-4f01-b59f-f28cd8cea771'>
    <host name='10.0.1.201' port='24007'/>
  </source>
  <backingStore/>
  <target dev='vda' bus='virtio'/>
  <serial>3a7eaf5a-8348-4f01-b59f-f28cd8cea771</serial>
  <alias name='virtio-disk0'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
</disk>

> Finally, you seem to be using oVirt/RHEV. Is it possible that your platform
> is triggering a protective response on the VMs (by suspending).

No, this is an OpenStack environment, and I am not aware of any protective mechanisms.

-ps
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users