Re: ceph hammer : rbd info/Status : operation not supported (95) (EC+RBD tier pools)

We are likely facing the same kind of issue in our infernalis cluster with EC pools.

From time to time, some of our volumes mounted via the RBD kernel module will start to "freeze". I can still browse the volume, but the (backup) application using it hangs. I guess it's because it tries to access an object from the EC pool (tracker.ceph.com seems down at the moment, so I can't check the details).

I can't map/unmap the affected volumes (it rarely concerns all the volumes at the same time). Running 'rbd -p ec-pool info volume-1' gives me the same error as Frederic ((95) Operation not supported). The sloppy workaround I found is running 'rbd -p ec-pool ls -l' a couple of times: it "magically" gets the volumes back in order and they become usable again.
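
Concretely (pool and volume names as above), the sequence looks roughly like this:

    # reading the image header fails while the volume is stuck
    rbd -p ec-pool info volume-1          # -> (95) Operation not supported

    # sloppy workaround: list the pool with -l a few times until the
    # image headers become readable again
    rbd -p ec-pool ls -l
    rbd -p ec-pool ls -l

After a couple of runs, 'rbd info' works again and the volumes become usable.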

Adrien

On Sat, Feb 27, 2016 at 12:14 PM, SCHAER Frederic <frederic.schaer@xxxxxx> wrote:
Hi,

Many thanks.
Just tested: I could see the rbd_id object in the EC pool, and after promoting it I could see it in the SSD cache pool and could successfully list the image information, indeed.

Cheers

-----Original Message-----
From: Jason Dillaman [mailto:dillaman@xxxxxxxxxx]
Sent: Wednesday, February 24, 2016 7:16 PM
To: SCHAER Frederic <frederic.schaer@xxxxxx>
Cc: ceph-users@xxxxxxxx; HONORE Pierre-Francois <pierre-francois.honore@xxxxxx>
Subject: Re: ceph hammer : rbd info/Status : operation not supported (95) (EC+RBD tier pools)

If you run "rados -p <cache pool> ls | grep rbd_id.<yyy-disk1>" and don't see that object, you are experiencing that issue [1].

You can attempt to work around this issue by running "rados -p irfu-virt setomapval rbd_id.<yyy-disk1> dummy value" to force-promote the object to the cache pool.  I haven't tested / verified that will alleviate the issue, though.

[1] http://tracker.ceph.com/issues/14762
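
Assuming 'irfu-virt' is the EC base pool and 'ssd-hot-irfu-virt' (the pool named in the HEALTH_WARN below) is the cache pool, the check-and-promote sequence would look something like this (again, untested):

    # 1. check whether the image's id object is present in the cache pool
    rados -p ssd-hot-irfu-virt ls | grep rbd_id.yyy-disk1

    # 2. if it is missing, write a dummy omap key through the base pool name
    #    to force-promote the object into the cache tier
    rados -p irfu-virt setomapval rbd_id.yyy-disk1 dummy value

    # 3. the image header should then be readable again
    rbd -p irfu-virt info yyy-disk1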

--

Jason Dillaman

----- Original Message -----

> From: "SCHAER Frederic" <frederic.schaer@xxxxxx>
> To: ceph-users@xxxxxxxx
> Cc: "HONORE Pierre-Francois" <pierre-francois.honore@xxxxxx>
> Sent: Wednesday, February 24, 2016 12:56:48 PM
> Subject: ceph hammer : rbd info/Status : operation not supported
> (95) (EC+RBD tier pools)

> Hi,

> I just started testing VMs inside ceph this week, ceph-hammer 0.94-5 here.

> I built several pools, using pool tiering (set up roughly as sketched below):
> - A small replicated SSD pool (5 SSDs only, but I thought it'd be better for
> IOPS; I intend to test the difference with disks only)
> - Overlaying a larger EC pool
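
> The tiering setup was roughly the standard cache-tiering sequence (pool names
> as they appear further down; exact parameters omitted, so take this as a
> sketch):
>
>     ceph osd tier add irfu-virt ssd-hot-irfu-virt
>     ceph osd tier cache-mode ssd-hot-irfu-virt writeback
>     ceph osd tier set-overlay irfu-virt ssd-hot-irfu-virt
>     ceph osd pool set ssd-hot-irfu-virt hit_set_type bloom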

> I just have 2 VMs in Ceph… and one of them is breaking something.
> The VM that is not breaking was migrated by using qemu-img to create the ceph
> volume and then migrating the data. Its rbd format is 1:
> rbd image 'xxx-disk1':
> size 20480 MB in 5120 objects
> order 22 (4096 kB objects)
> block_name_prefix: rb.0.83a49.3d1b58ba
> format: 1

> The VM that's failing uses rbd format 2.
> This is what I had before things started breaking:
> rbd image 'yyy-disk1':
> size 10240 MB in 2560 objects
> order 22 (4096 kB objects)
> block_name_prefix: rbd_data.8ae1f47398c89
> format: 2
> features: layering, striping
> flags:
> stripe unit: 4096 kB
> stripe count: 1

> The VM started behaving weirdly, with a huge IOwait % during its install
> (that's to say, it did not take long to go wrong ;) ).
> Now, this is the only thing that I can get:

> [root@ceph0 ~]# rbd -p irfu-virt info yyy-disk1
> 2016-02-24 18:30:33.213590 7f00e6f6d7c0 -1 librbd::ImageCtx: error reading
> image id: (95) Operation not supported
> rbd: error opening image yyy-disk1: (95) Operation not supported

> One thing to note: the VM *IS STILL* working: I can still do disk
> operations, apparently.
> During the VM installation, I realized I had wrongly set the target SSD cache
> size to 100 MB instead of 100 GB, and ceph complained it was almost
> full:
> health HEALTH_WARN
> 'ssd-hot-irfu-virt' at/near target max
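
> If I recall correctly, that was target_max_bytes on the cache pool, so the fix
> would be something like:
>
>     ceph osd pool set ssd-hot-irfu-virt target_max_bytes 107374182400   # 100 GB rather than ~100 MB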

> My question is… am I facing the bug reported in this list thread with the
> title "Possible Cache Tier Bug - Can someone confirm"?
> Or did I do something wrong?

> The libvirt and kvm versions that are writing into ceph are the following:
> libvirt-1.2.17-13.el7_2.3.x86_64
> qemu-kvm-1.5.3-105.el7_2.3.x86_64

> Any idea how I could recover the VM file, if possible?
> Please note I have no problem with deleting the VM and rebuilding it; I just
> spawned it to test.
> As a matter of fact, I just "virsh destroyed" the VM to see if I could start
> it again… and I can't:

> # virsh start yyy
> error: Failed to start domain yyy
> error: internal error: process exited while connecting to monitor:
> 2016-02-24T17:49:59.262170Z qemu-kvm: -drive
> file=rbd:irfu-virt/yyy-disk1:id=irfu-virt:key=***==:auth_supported=cephx\;none:mon_host=_____\:6789,if=none,id=drive-virtio-disk0,format=raw:
> error reading header from yyy-disk1
> 2016-02-24T17:49:59.263743Z qemu-kvm: -drive
> file=rbd:irfu-virt/yyy-disk1:id=irfu-virt:key=A***==:auth_supported=cephx\;none:mon_host=___\:6789,if=none,id=drive-virtio-disk0,format=raw:
> could not open disk image
> rbd:irfu-virt/___-disk1:id=irfu-***==:auth_supported=cephx\;none:mon_host=___\:6789:
> Could not open 'rbd:irfu-virt/yyy-disk1:id=irfu-virt:key=***

> Ideas?
> Thanks
> Frederic

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
