We think this fixed it. While there is random chance involved, we can't
repeat it in 7.9. So I'll close this thread out for now. We'll ask for
help again if needed.

Thanks for all the kind responses,

Erik

On Fri, Jan 29, 2021 at 02:20:56PM -0600, Erik Jacobson wrote:
> I updated to 7.9, rebooted everything, and it started working.
>
> I will have QE try to break it again and report back. I couldn't break
> it but they're better at breaking things (which is hard to imagine :)
>
> On Fri, Jan 29, 2021 at 01:11:50PM -0600, Erik Jacobson wrote:
> > Thank you.
> >
> > We reproduced the problem after force-killing one of the 3 physical
> > nodes 6 times in a row.
> >
> > At that point, grub2 loaded off the qemu virtual hard drive but
> > could not find partitions. Since there is random luck involved, we
> > don't actually know whether it was the force-killing that caused it
> > to stop working.
> >
> > When I start the VM with the image in this state, there is nothing
> > interesting in the fuse log for the volume in /var/log/glusterfs on
> > the node hosting the image.
> >
> > No pending heals (all servers report 0 entries to heal).
> >
> > The same VM behavior happens on all the physical nodes when I try to
> > start with the same VM image.
> >
> > Something from the gluster fuse mount log from earlier shows:
> >
> > [2021-01-28 21:24:40.814227] I [MSGID: 114018] [client.c:2347:client_rpc_notify] 0-adminvm-client-0: disconnected from adminvm-client-0. Client process will keep trying to connect to glusterd until brick's port is available
> > [2021-01-28 21:24:43.815120] I [rpc-clnt.c:1963:rpc_clnt_reconfig] 0-adminvm-client-0: changing port to 49152 (from 0)
> > [2021-01-28 21:24:43.815833] I [MSGID: 114057] [client-handshake.c:1376:select_server_supported_programs] 0-adminvm-client-0: Using Program GlusterFS 4.x v1, Num (1298437), Version (400)
> > [2021-01-28 21:24:43.817682] I [MSGID: 114046] [client-handshake.c:1106:client_setvolume_cbk] 0-adminvm-client-0: Connected to adminvm-client-0, attached to remote volume '/data/brick_adminvm'.
> > [2021-01-28 21:24:43.817709] I [MSGID: 114042] [client-handshake.c:930:client_post_handshake] 0-adminvm-client-0: 1 fds open - Delaying child_up until they are re-opened
> > [2021-01-28 21:24:43.895163] I [MSGID: 114041] [client-handshake.c:318:client_child_up_reopen_done] 0-adminvm-client-0: last fd open'd/lock-self-heal'd - notifying CHILD-UP
> > The message "W [MSGID: 114061] [client-common.c:2893:client_pre_lk_v2] 0-adminvm-client-0: (94695bdb-06b4-4105-9bc8-b8207270c941) remote_fd is -1. EBADFD [File descriptor in bad state]" repeated 6 times between [2021-01-28 21:23:54.395811] and [2021-01-28 21:23:54.811640]
> >
> > But that was a long time ago.
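> >
> > (The heal check above was nothing fancy; a minimal sketch, using our
> > volume name "adminvm":
> >
> > gluster volume heal adminvm info
> > gluster volume heal adminvm info summary
> >
> > All three bricks report 0 entries to heal.)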
> >
> > Brick logs have an entry from when I first started the VM today (the
> > problem was reproduced yesterday); all brick logs have something
> > similar. Nothing appeared on the several other startup attempts of
> > the VM:
> >
> > [2021-01-28 21:24:45.460147] I [MSGID: 115029] [server-handshake.c:549:server_setvolume] 0-adminvm-server: accepted client from CTX_ID:613f0d91-34e6-4495-859f-bca1c9f7af01-GRAPH_ID:0-PID:6287-HOST:nano-1-PC_NAME:adminvm-client-2-RECON_NO:-0 (version: 7.2) with subvol /data/brick_adminvm
> > [2021-01-29 18:54:45.455558] I [addr.c:54:compare_addr_and_update] 0-/data/brick_adminvm: allowed = "*", received addr = "172.23.255.153"
> > [2021-01-29 18:54:45.455802] I [login.c:110:gf_auth] 0-auth/login: allowed user names: 3b66cfab-00d5-4b13-a103-93b4cf95e144
> > [2021-01-29 18:54:45.455815] I [MSGID: 115029] [server-handshake.c:549:server_setvolume] 0-adminvm-server: accepted client from CTX_ID:3774af6b-07b9-437b-a34e-9f71f3b57d03-GRAPH_ID:0-PID:45640-HOST:nano-3-PC_NAME:adminvm-client-2-RECON_NO:-0 (version: 7.2) with subvol /data/brick_adminvm
> > [2021-01-29 18:54:45.494950] W [socket.c:774:__socket_rwv] 0-tcp.adminvm-server: readv on 172.23.255.153:48551 failed (No data available)
> > [2021-01-29 18:54:45.494994] I [MSGID: 115036] [server.c:501:server_rpc_notify] 0-adminvm-server: disconnecting connection from CTX_ID:3774af6b-07b9-437b-a34e-9f71f3b57d03-GRAPH_ID:0-PID:45640-HOST:nano-3-PC_NAME:adminvm-client-2-RECON_NO:-0
> > [2021-01-29 18:54:45.495091] I [MSGID: 101055] [client_t.c:436:gf_client_unref] 0-adminvm-server: Shutting down connection CTX_ID:3774af6b-07b9-437b-a34e-9f71f3b57d03-GRAPH_ID:0-PID:45640-HOST:nano-3-PC_NAME:adminvm-client-2-RECON_NO:-0
> >
> > Like before, if I halt the VM, kpartx the image, mount the giant
> > root within the image, then unmount, un-kpartx, and start the VM, it
> > works:
> >
> > nano-2:/var/log/glusterfs # kpartx -a /adminvm/images/adminvm.img
> > nano-2:/var/log/glusterfs # mount /dev/mapper/loop0p31 /mnt
> > nano-2:/var/log/glusterfs # dmesg|tail -3
> > [85528.602570] loop: module loaded
> > [85535.975623] EXT4-fs (dm-3): recovery complete
> > [85535.979663] EXT4-fs (dm-3): mounted filesystem with ordered data mode. Opts: (null)
> > nano-2:/var/log/glusterfs # umount /mnt
> > nano-2:/var/log/glusterfs # kpartx -d /adminvm/images/adminvm.img
> > loop deleted : /dev/loop0
> >
> > VM WORKS for ONE boot cycle on one physical node!
> >
> > nano-2:/var/log/glusterfs # virsh start adminvm
> >
> > However, this only works for one boot; later boots fail again
> > (INCLUDING on the physical node that booted once OK; the next boot
> > fails there too, as does launching it on the other two).
> >
> > Based on feedback, I will not change the shard size at this time and
> > will leave that for later. Some people suggest larger sizes, but it
> > isn't a universal suggestion. I'll also not attempt to make a
> > logical volume out of a group of smaller images, as I think it
> > should work like this. Those are things I will try later if I run
> > out of runway. Since we want a solution to deploy to sites, this
> > would increase the maintenance of the otherwise simple solution.
> >
> > I am leaving the state like this and will now proceed to update to
> > the latest gluster 7.
> >
> > I will report back after I get everything updated and services
> > restarted with the newer version.
> >
> > THANKS FOR ALL THE HELP SO FAR!!
> >
> > Erik
> >
> > On Wed, Jan 27, 2021 at 10:55:50PM +0300, Mahdi Adnan wrote:
> > > I would leave it at 64M on volumes with spindle disks, but with
> > > SSD volumes I would increase it to 128M or even 256M; it varies
> > > from one workload to another.
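> > >
> > > For reference, the shard size is the features.shard-block-size
> > > volume option; a rough sketch of checking and raising it, assuming
> > > the "adminvm" volume from this thread:
> > >
> > > gluster volume get adminvm features.shard-block-size
> > > gluster volume set adminvm features.shard-block-size 128MB
> > >
> > > Keep in mind the new size only applies to files created after the
> > > change; existing images keep the shard size they were written
> > > with.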
> > >
> > > On Wed, Jan 27, 2021 at 10:02 PM Erik Jacobson <erik.jacobson@xxxxxxx> wrote:
> > > > > Also, I would like to point out that I have VMs with large
> > > > > disks, 1TB and 2TB, and have no issues. I would definitely
> > > > > upgrade the Gluster version to, let's say, at least 7.9.
> > > >
> > > > Great! Thank you! We can update, but it's very sensitive due to
> > > > the workload. I can't officially update our gluster until we
> > > > have a cluster with a couple thousand nodes to test with.
> > > > However, for this problem, this is on my list on the test
> > > > machine. I'm hoping I can reproduce it. So far no luck making it
> > > > happen again. Once I hit it, I will try to collect more data and
> > > > at the end update gluster.
> > > >
> > > > What do you think about the suggestion to increase the shard
> > > > size? Are you using the default size on your 1TB and 2TB images?
> > > >
> > > > > Amar also asked a question regarding enabling Sharding in the
> > > > > volume after creating the VM disks, which would certainly mess
> > > > > up the volume if that is what happened.
> > > >
> > > > Oh, I missed this question. I basically scripted it since I was
> > > > doing it so often. I have a similar script that takes it all
> > > > away to start over.
> > > >
> > > > set -x
> > > > pdsh -g gluster mkdir /data/brick_adminvm/
> > > > gluster volume create adminvm replica 3 transport tcp 172.23.255.151:/data/brick_adminvm 172.23.255.152:/data/brick_adminvm 172.23.255.153:/data/brick_adminvm
> > > > gluster volume set adminvm group virt
> > > > gluster volume set adminvm granular-entry-heal enable
> > > > gluster volume set adminvm storage.owner-uid 439
> > > > gluster volume set adminvm storage.owner-gid 443
> > > > gluster volume start adminvm
> > > >
> > > > pdsh -g gluster mount /adminvm
> > > >
> > > > echo -n "press enter to continue for restore tarball"
> > > >
> > > > pushd /adminvm
> > > > tar xvf /root/backup.tar
> > > > popd
> > > >
> > > > echo -n "press enter to continue for qemu-img"
> > > >
> > > > pushd /adminvm
> > > > qemu-img create -f raw -o preallocation=falloc /adminvm/images/adminvm.img 5T
> > > > popd
> > > >
> > > > Thanks again for the kind responses,
> > > >
> > > > Erik
> > > >
> > > > > On Wed, Jan 27, 2021 at 5:28 PM Erik Jacobson <erik.jacobson@xxxxxxx> wrote:
> > > > > > Shortly after the sharded volume is made, there are some
> > > > > > fuse mount messages. I'm not 100% sure if this was just
> > > > > > before or during the big qemu-img command to make the 5T
> > > > > > image (qemu-img create -f raw -o preallocation=falloc
> > > > > > /adminvm/images/adminvm.img 5T)
> > > > >
> > > > > Any reason to have a single disk with this size? Usually in
> > > > > any virtualization I have used, it is always recommended to
> > > > > keep it lower. Have you thought about multiple disks with
> > > > > smaller sizes?
> > > >
> > > > Yes, because the actual virtual machine is an admin node/head
> > > > node cluster manager for a supercomputer that hosts big OS
> > > > images and drives multi-thousand-node clusters (boot,
> > > > monitoring, image creation, distribution, sometimes NFS roots,
> > > > etc). So this VM is a biggie.
> > > >
> > > > We could make multiple smaller images, but it would be very
> > > > painful since it differs from the normal non-VM setup.
> > > >
> > > > So unlike many solutions where you have lots of small VMs with
> > > > their small images, this solution is one giant VM with one
> > > > giant image.
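> > > >
> > > > (To be clear, sharding was enabled before the giant image was
> > > > created; "group virt" in the script above is what turns it on.
> > > > A quick sanity check I use, assuming the names from the script,
> > > > is roughly:
> > > >
> > > > gluster volume get adminvm features.shard
> > > > ls /data/brick_adminvm/.shard | wc -l
> > > >
> > > > The option reports "on", and the brick's .shard directory fills
> > > > with 64MB pieces while qemu-img preallocates the 5T file.)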
> > > >
> > > > We're essentially using gluster in this use case (as opposed to
> > > > others I have posted about in the past) for head node failover
> > > > (combined with pacemaker).
> > > >
> > > > > Also worth noting is that RHHI is supported only when the
> > > > > shard size is 512MB, so it's worth trying a bigger shard
> > > > > size.
> > > >
> > > > I have put a larger shard size and a newer gluster version on
> > > > the list to try. Thank you! Hoping to get it failing again to
> > > > try these things!
> > > >
> > > > > --
> > > > > Respectfully
> > > > > Mahdi
> > >
> > > --
> > > Respectfully
> > > Mahdi

________

Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk

Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users