Re: RBD boot from volume weirdness in OpenStack

On 10/25/2012 09:27 AM, Travis Rhoden wrote:
Josh,

Do you mind if I ask you a few follow-up questions?  I can ask on the
OpenStack ML if needed, but I think you are the most knowledgeable
person for these...

I don't mind. ceph-devel is fine for these ceph-related questions.

1. To get "efficient volumes from images" (i.e. volumes that are a COW
copy of the image), do the images and volumes need to live in the same
pool?  I have glance configured to use a pool called "glanceimages",
and nova-volume/Cinder uses a second pool called "nova-volume".  Is
this always going to prevent the COW process from working?  If I check
out my volume, I see this:

# rbd -p nova-volume info volume-8c30ee47-5ec3-4600-b332-1bdc2a650837
rbd image 'volume-8c30ee47-5ec3-4600-b332-1bdc2a650837':
	size 220 MB in 55 objects
	order 22 (4096 KB objects)
	block_name_prefix: rb.0.1f04.4ba87ea2
	parent:  (pool -1)

If the COW process is actually working, I think I'll see a parent
other than (pool -1), correct?

They can be in different pools. With a COW clone you would see a parent
there. Did you set show_image_direct_url=True for Glance (i.e. http://ceph.com/docs/master/rbd/rbd-openstack/#configuring-glance)?
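
For reference, a rough sketch of the relevant Glance settings and of what a cloned volume looks like afterwards (the pool name is the one from this thread; the exact snapshot naming is an assumption, so check your own setup):

glance-api.conf:
    default_store = rbd
    rbd_store_pool = glanceimages
    show_image_direct_url = True

With cloning in effect, "rbd info" on a newly created volume should show something like "parent: glanceimages/<image-id>@<snapshot>" instead of "(pool -1)".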

I had split glance/cinder into different RADOS pools because I figured
it would give me more flexibility (I could set different replication
sizes and CRUSH rules) and potentially more security (different cephx
clients/keys; the Glance keys aren't on the nova-compute nodes, only
the glance node).  But this isn't a strict requirement.

Yeah, that's how it's designed to work. The Glance pool can
be read-only from nova-compute, and Glance doesn't need access
to the pool used for volumes.
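
For anyone setting up the same split, the cephx caps look roughly like this (a sketch modeled on the rbd-openstack docs; client.novavolume and the pool names are the ones from this thread, client.glance is just a placeholder):

# ceph auth get-or-create client.glance mon 'allow r' osd 'allow rwx pool=glanceimages'
# ceph auth get-or-create client.novavolume mon 'allow r' osd 'allow rwx pool=nova-volume, allow rx pool=glanceimages'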

2. Do you know if "raw" is the only disk format accepted for
boot-from-volume?  I did the whole "create volume from image" step,
and my source image was a qcow2.  But when I do the boot-from-volume,
the -disk line contains format=raw.  Not sure how to control that
anymore -- there is no metadata attached to the volume that indicates
whether it is qcow2 vs raw.  I'll have to dig into the code and see if
it looks for anything.  Thought you might know...

Raw is the only thing that works by default. Although it's possible
to layer other formats on top of rbd, it's not well tested or
recommended. Now that rbd supports cloning natively, there's not much
benefit to e.g. qcow2 on top of it. The interfaces for QEMU and
libvirt generally don't handle such layered formats well in any case.
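
A quick way to check what a source image actually is before creating a volume from it is qemu-img (the filename here is just an example):

# qemu-img info precise-server-cloudimg-amd64-disk1.img

If that reports "file format: qcow2", convert it to raw before uploading it to Glance.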

3.  I edited my libvirt XML to say raw instead of qcow2, and the VM
started to boot!  Hooray!  boot-from-volume over RBD.  But then
console.log shows stuff like:

Begin: Mounting root file system ... Begin: Running /scripts/local-top ... done.
Begin: Running /scripts/local-premount ... done.
[    1.044112] EXT4-fs (vda1): mounted filesystem with ordered data
mode. Opts: (null)
Begin: Running /scripts/local-bottom ... [    1.052379] FDC 0 is a S82078B
done.
done.
Begin: Running /scripts/init-bottom ... done.
[    1.156951] Refined TSC clocksource calibration: 2266.803 MHz.
[    1.796114] end_request: I/O error, dev vda, sector 16065
[    1.800018] Buffer I/O error on device vda1, logical block 0
[    1.800018] lost page write due to I/O error on vda1
[    1.805294] EXT4-fs (vda1): re-mounted. Opts: (null)
cloud-init start-local running: Thu, 25 Oct 2012 16:06:34 +0000. up
2.86 seconds^M
no instance data found in start-local^M
[    3.802465] end_request: I/O error, dev vda, sector 1257161
[    3.803629] Buffer I/O error on device vda1, logical block 155137
[    3.804020] Buffer I/O error on device vda1, logical block 155138
....


And that just continues on, and obviously the VM is unusable.  Any
thoughts on why that might happen?  Have you ever run into this during
your testing?

I haven't seen such errors. It may be due to using qcow2 on top of rbd.
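
One way to check that theory (assuming qemu-img on the compute node is built with rbd support; the :id= option is standard qemu rbd syntax, and the volume name is a placeholder):

# qemu-img info rbd:nova-volume/volume-<uuid>:id=novavolume

If it reports qcow2, the volume contents are a qcow2 container rather than a raw disk image.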

I'm thinking that I probably need to not use UEC images for this -- it
tries to go in and resize the file system and things like that.  I
should probably just make a bunch of fixed-size images (10G, 20G, etc.)
and make volumes from those.  Right now, I'm not even positive that the
RBD has been formatted with a filesystem.

UEC images work, but you have to convert them to raw first, as shown here:

http://ceph.com/docs/master/rbd/rbd-openstack/#booting-from-a-block-device
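
The conversion itself is just a qemu-img call before uploading, e.g. (a sketch; the filenames and the Folsom-era glance flags are from memory, so double-check them against the doc above):

# qemu-img convert -f qcow2 -O raw precise-server-cloudimg-amd64-disk1.img precise-raw.img
# glance image-create --name precise-raw --disk-format raw --container-format bare --file precise-raw.img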

Regards,

  - Travis

On Thu, Oct 25, 2012 at 11:51 AM, Travis Rhoden <trhoden@xxxxxxxxx> wrote:
Awesome, thanks Josh.  I misspoke -- my client was 0.48.1.  Glad
upgrading to 0.48.2 will do the trick!  Thanks again.

On Thu, Oct 25, 2012 at 11:42 AM, Josh Durgin <josh.durgin@xxxxxxxxxxx> wrote:
On 2012-10-25 08:22, Travis Rhoden wrote:

I've been trying to take advantage of the code additions made by Josh
Durgin to OpenStack Folsom for combining boot-from-volume and Ceph
RBD.  First off, nice work Josh!  I'm hoping you folks can help me out
with something strange I am seeing.  The question may be more
OpenStack-related than Ceph-related, but hear me out first.

I created a new volume (to use for boot-from-volume) from an existing
image like so:

#cinder create --display-name uec-test-vol --image-id
699137a2-a864-4a87-98fa-1684d7677044 5

This completes just fine.

Later, when I try to boot from it, that fails.  Cutting to the chase,
here is why:

kvm: -drive file=rbd:nova-volume/volume-9f4e4b70-7fbb-4d81-b912-b1c6fcf86c8b,if=none,id=drive-virtio-disk0,format=raw,cache=none:
error reading header from volume-9f4e4b70-7fbb-4d81-b912-b1c6fcf86c8b
kvm: -drive file=rbd:nova-volume/volume-9f4e4b70-7fbb-4d81-b912-b1c6fcf86c8b,if=none,id=drive-virtio-disk0,format=raw,cache=none:
could not open disk image rbd:nova-volume/volume-9f4e4b70-7fbb-4d81-b912-b1c6fcf86c8b: No such file or directory

It's weird that creating the volume was successful but KVM can't
read it.  Poking around a bit more, it became clear why:

# rbd -n client.novavolume --pool nova-volume ls
<returns nothing>

# rbd -n client.novavolume ls
volume-9f4e4b70-7fbb-4d81-b912-b1c6fcf86c8b

Okay, the volume is in the "rbd" pool!  That's really weird, though.
Here are my nova.conf entries:
volume_driver=nova.volume.driver.RBDDriver
rbd_pool=nova-volume
rbd_user=novavolume


AND, here are the log entries from nova-volume.log (cleaned up a little):

rbd create --pool nova-volume --size 5120
volume-9f4e4b70-7fbb-4d81-b912-b1c6fcf86c8b
rbd rm --pool nova-volume volume-9f4e4b70-7fbb-4d81-b912-b1c6fcf86c8b
rbd import --pool nova-volume /tmp/tmplQUwzt
volume-9f4e4b70-7fbb-4d81-b912-b1c6fcf86c8b

I'm not sure why it goes create/delete/import, but regardless all of
that worked.  More importantly, all these commands used --pool
nova-volume.  So how the heck did that RBD end up in the "rbd" pool
instead of the "nova-volume" pool?  Any ideas?

Before I hit "send", I figured I should at least test this myself.  Watch
this:

#rbd create -n client.novavolume --pool nova-volume --size 1024 test
# rbd ls -n client.novavolume --pool nova-volume
test
# rbd export -n client.novavolume --pool nova-volume test /tmp/test
Exporting image: 100% complete...done.
# rbd rm -n client.novavolume --pool nova-volume test
Removing image: 100% complete...done.
# rbd import -n client.novavolume --pool nova-volume /tmp/test test
Importing image: 100% complete...done.
# rbd ls -n client.novavolume --pool nova-volume

# rbd ls -n client.novavolume --pool rbd
test


So it seems that "rbd import" doesn't honor the --pool argument?


This was true in 0.48, but it should have been fixed in 0.48.2 (and 0.52).
I'll add a note about this to the docs.
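
If anyone is stuck on an older client in the meantime, specifying the destination as pool/image might work around it, e.g.:

# rbd import -n client.novavolume /tmp/test nova-volume/test

though I haven't verified that on the affected 0.48.x releases.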


I am using 0.53 on the backend, but my client is 0.48.2.  I'll upgrade
that and see if that makes a difference.


The ceph-common package in particular should be 0.48.2 or >=0.52.
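
A quick way to double-check which client you actually have (the dpkg line assumes a Debian/Ubuntu package install):

# ceph --version
# dpkg -s ceph-common | grep Version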

  - Travis


