I updated to 7.9, rebooted everything, and it started working. I will have QE try to break it again and report back. I couldn't break it, but they're better at breaking things than I am (which is hard to imagine :)

On Fri, Jan 29, 2021 at 01:11:50PM -0600, Erik Jacobson wrote:
> Thank you.
>
> We reproduced the problem after force-killing one of the 3 physical nodes 6 times in a row.
>
> At that point, grub2 loaded off the qemu virtual hard drive but could not find the partitions. Since there is random luck involved, we don't actually know if it was the force-killing that caused it to stop working.
>
> When I start the VM with the image in this state, there is nothing interesting in the fuse log for the volume in /var/log/glusterfs on the node hosting the image.
>
> No pending heals (all servers report 0 entries to heal).
>
> The same VM behavior happens on all the physical nodes when I try to start it with the same VM image.
>
> Something from the gluster fuse mount log from earlier shows:
>
> [2021-01-28 21:24:40.814227] I [MSGID: 114018] [client.c:2347:client_rpc_notify] 0-adminvm-client-0: disconnected from adminvm-client-0. Client process will keep trying to connect to glusterd until brick's port is available
> [2021-01-28 21:24:43.815120] I [rpc-clnt.c:1963:rpc_clnt_reconfig] 0-adminvm-client-0: changing port to 49152 (from 0)
> [2021-01-28 21:24:43.815833] I [MSGID: 114057] [client-handshake.c:1376:select_server_supported_programs] 0-adminvm-client-0: Using Program GlusterFS 4.x v1, Num (1298437), Version (400)
> [2021-01-28 21:24:43.817682] I [MSGID: 114046] [client-handshake.c:1106:client_setvolume_cbk] 0-adminvm-client-0: Connected to adminvm-client-0, attached to remote volume '/data/brick_adminvm'.
> [2021-01-28 21:24:43.817709] I [MSGID: 114042] [client-handshake.c:930:client_post_handshake] 0-adminvm-client-0: 1 fds open - Delaying child_up until they are re-opened
> [2021-01-28 21:24:43.895163] I [MSGID: 114041] [client-handshake.c:318:client_child_up_reopen_done] 0-adminvm-client-0: last fd open'd/lock-self-heal'd - notifying CHILD-UP
> The message "W [MSGID: 114061] [client-common.c:2893:client_pre_lk_v2] 0-adminvm-client-0: (94695bdb-06b4-4105-9bc8-b8207270c941) remote_fd is -1. EBADFD [File descriptor in bad state]" repeated 6 times between [2021-01-28 21:23:54.395811] and [2021-01-28 21:23:54.811640]
>
> But that was a long time ago.
>
> Brick logs have an entry from when I first started the VM today (the problem was reproduced yesterday); all brick logs have something similar.
> Nothing appeared on the several other startup attempts of the VM:
>
> [2021-01-28 21:24:45.460147] I [MSGID: 115029] [server-handshake.c:549:server_setvolume] 0-adminvm-server: accepted client from CTX_ID:613f0d91-34e6-4495-859f-bca1c9f7af01-GRAPH_ID:0-PID:6287-HOST:nano-1-PC_NAME:adminvm-client-2-RECON_NO:-0 (version: 7.2) with subvol /data/brick_adminvm
> [2021-01-29 18:54:45.455558] I [addr.c:54:compare_addr_and_update] 0-/data/brick_adminvm: allowed = "*", received addr = "172.23.255.153"
> [2021-01-29 18:54:45.455802] I [login.c:110:gf_auth] 0-auth/login: allowed user names: 3b66cfab-00d5-4b13-a103-93b4cf95e144
> [2021-01-29 18:54:45.455815] I [MSGID: 115029] [server-handshake.c:549:server_setvolume] 0-adminvm-server: accepted client from CTX_ID:3774af6b-07b9-437b-a34e-9f71f3b57d03-GRAPH_ID:0-PID:45640-HOST:nano-3-PC_NAME:adminvm-client-2-RECON_NO:-0 (version: 7.2) with subvol /data/brick_adminvm
> [2021-01-29 18:54:45.494950] W [socket.c:774:__socket_rwv] 0-tcp.adminvm-server: readv on 172.23.255.153:48551 failed (No data available)
> [2021-01-29 18:54:45.494994] I [MSGID: 115036] [server.c:501:server_rpc_notify] 0-adminvm-server: disconnecting connection from CTX_ID:3774af6b-07b9-437b-a34e-9f71f3b57d03-GRAPH_ID:0-PID:45640-HOST:nano-3-PC_NAME:adminvm-client-2-RECON_NO:-0
> [2021-01-29 18:54:45.495091] I [MSGID: 101055] [client_t.c:436:gf_client_unref] 0-adminvm-server: Shutting down connection CTX_ID:3774af6b-07b9-437b-a34e-9f71f3b57d03-GRAPH_ID:0-PID:45640-HOST:nano-3-PC_NAME:adminvm-client-2-RECON_NO:-0
>
> Like before, if I halt the VM, kpartx the image, mount the giant root within the image, then unmount, un-kpartx, and start the VM, it works:
>
> nano-2:/var/log/glusterfs # kpartx -a /adminvm/images/adminvm.img
> nano-2:/var/log/glusterfs # mount /dev/mapper/loop0p31 /mnt
> nano-2:/var/log/glusterfs # dmesg|tail -3
> [85528.602570] loop: module loaded
> [85535.975623] EXT4-fs (dm-3): recovery complete
> [85535.979663] EXT4-fs (dm-3): mounted filesystem with ordered data mode. Opts: (null)
> nano-2:/var/log/glusterfs # umount /mnt
> nano-2:/var/log/glusterfs # kpartx -d /adminvm/images/adminvm.img
> loop deleted : /dev/loop0
>
> VM WORKS for ONE boot cycle on one physical node!
>
> nano-2:/var/log/glusterfs # virsh start adminvm
>
> However, this will work for one boot, but later it will stop working again (INCLUDING on the physical node that booted once OK; the next boot fails again, as does launching it on the other two).
>
> Based on feedback, I will not change the shard size at this time and will leave that for later. Some people suggest larger sizes, but it isn't a universal suggestion. I'll also not attempt to make a logical volume out of a group of smaller images, as I think it should work like this. Those are things I will try later if I run out of runway. Since we want a solution to deploy to sites, this would increase the maintenance of the otherwise simple solution.
>
> I am leaving the state like this and will now proceed to update to the latest gluster 7.
>
> I will report back after I get everything updated and services restarted with the newer version.
>
> THANKS FOR ALL THE HELP SO FAR!!
>
> Erik
>
> On Wed, Jan 27, 2021 at 10:55:50PM +0300, Mahdi Adnan wrote:
> > I would leave it on 64M in volumes with spindle disks, but with SSD volumes I would increase it to 128M or even 256M; it varies from one workload to another.
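(Noting this down for later, in case we do end up experimenting with the shard size: as far as I understand it is just a volume option, roughly the sketch below, and a changed features.shard-block-size only applies to files created after the change, so the existing 5T image would have to be recreated rather than just flipping the option.)

# check the current shard size (default is 64MB), then raise it if we decide to
gluster volume get adminvm features.shard-block-size
gluster volume set adminvm features.shard-block-size 256MB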
> > On Wed, Jan 27, 2021 at 10:02 PM Erik Jacobson <erik.jacobson@xxxxxxx> wrote:
> >
> > > Also, I would like to point out that I have VMs with large disks, 1TB and 2TB, and have no issues. I definitely would upgrade the Gluster version, let's say to at least 7.9.
> >
> > Great! Thank you! We can update, but it's very sensitive due to the workload. I can't officially update our gluster until we have a cluster with a couple thousand nodes to test with. However, for this problem, this is on my list on the test machine. I'm hoping I can reproduce it. So far no luck making it happen again. Once I hit it, I will try to collect more data and at the end update gluster.
> >
> > What do you think about the suggestion to increase the shard size? Are you using the default size on your 1TB and 2TB images?
> >
> > > Amar also asked a question regarding enabling Sharding in the volume after creating the VM disks, which would certainly mess up the volume if that's what happened.
> >
> > Oh, I missed this question. I basically scripted it quick since I was doing it so often. I have a similar script that takes it away to start over.
> >
> > set -x
> > pdsh -g gluster mkdir /data/brick_adminvm/
> > gluster volume create adminvm replica 3 transport tcp 172.23.255.151:/data/brick_adminvm 172.23.255.152:/data/brick_adminvm 172.23.255.153:/data/brick_adminvm
> > gluster volume set adminvm group virt
> > gluster volume set adminvm granular-entry-heal enable
> > gluster volume set adminvm storage.owner-uid 439
> > gluster volume set adminvm storage.owner-gid 443
> > gluster volume start adminvm
> >
> > pdsh -g gluster mount /adminvm
> >
> > echo -n "press enter to continue for restore tarball"
> >
> > pushd /adminvm
> > tar xvf /root/backup.tar
> > popd
> >
> > echo -n "press enter to continue for qemu-img"
> >
> > pushd /adminvm
> > qemu-img create -f raw -o preallocation=falloc /adminvm/images/adminvm.img 5T
> > popd
> >
> > Thanks again for the kind responses,
> >
> > Erik
> >
> > On Wed, Jan 27, 2021 at 5:28 PM Erik Jacobson <erik.jacobson@xxxxxxx> wrote:
> >
> > > > > Shortly after the sharded volume is made, there are some fuse mount messages. I'm not 100% sure if this was just before or during the big qemu-img command to make the 5T image (qemu-img create -f raw -o preallocation=falloc /adminvm/images/adminvm.img 5T)
> > > >
> > > > Any reason to have a single disk with this size? Usually in any virtualization I have used, it is always recommended to keep it lower. Have you thought about multiple disks with smaller size?
> > >
> > > Yes, because the actual virtual machine is an admin node/head node cluster manager for a supercomputer that hosts big OS images and drives multi-thousand-node clusters (boot, monitoring, image creation, distribution, sometimes NFS roots, etc.). So this VM is a biggie.
> > >
> > > We could make multiple smaller images, but it would be very painful since it differs from the normal non-VM setup.
> > >
> > > So unlike many solutions where you have lots of small VMs with their own small images, this solution is one giant VM with one giant image. We're essentially using gluster in this use case (as opposed to others I have posted about in the past) for head node failover (combined with pacemaker).
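(Side note, since the build-up script is quoted just above: the teardown script I mentioned is roughly the reverse of it, something like this untested sketch; --mode=script just suppresses the gluster confirmation prompts.)

set -x
# rough teardown sketch: unmount the clients, stop and delete the volume, wipe the bricks
pdsh -g gluster umount /adminvm
gluster --mode=script volume stop adminvm
gluster --mode=script volume delete adminvm
pdsh -g gluster rm -rf /data/brick_adminvm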
> > >
> > > > Also worth noting is that RHII is supported only when the shard size is 512MB, so it's worth trying a bigger shard size.
> > >
> > > I have put a larger shard size and a newer gluster version on the list to try. Thank you! Hoping to get it failing again to try these things!
> > >
> > > --
> > > Respectfully
> > > Mahdi
> >
> > --
> > Respectfully
> > Mahdi
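P.S. For anyone repeating the 7.9 update, the post-upgrade sanity checks are roughly the sketch below (standard gluster CLI; the op-version bump is optional and should only be done once every node is on the new version, using the value reported by max-op-version rather than a guess):

gluster --version
gluster volume get all cluster.op-version
gluster volume get all cluster.max-op-version
# once all three nodes are upgraded, raise the op-version to the reported maximum:
gluster volume set all cluster.op-version <value from cluster.max-op-version>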