Hi *,
while trying to remove a cache tier from a pool used for RBD / OpenStack,
we followed the procedure from
http://docs.ceph.com/docs/master/rados/operations/cache-tiering/#removing-a-writeback-cache
and ran into problems.
The cluster is currently running Ceph 12.2.2, the caching tier was
created with an earlier release of Ceph.
First of all, setting the cache mode to "forward" is reported as
unsafe and requires "--yes-i-really-mean-it", which is not mentioned
in the documentation - if forward mode is really meant to be used in
this case, the need for that flag should be documented.
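For reference, the step in question looked roughly like this (a sketch; "hot-storage" is the name of our cache pool, wrapped in a function only so the pool name stays a parameter):

```shell
# Sketch of the mode change that triggered the warning.
# "hot-storage" is our cache pool name (an assumption for this example).
set_cache_forward() {
    # Without --yes-i-really-mean-it, Luminous refuses the change
    # with an "unsafe" warning that the docs do not mention.
    ceph osd tier cache-mode "$1" forward --yes-i-really-mean-it
}
# Example: set_cache_forward hot-storage
```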
Unfortunately, using "rados -p hot-storage cache-flush-evict-all" not
only reported errors ("file not found") for many objects, but left us
with quite a number of objects in the pool and new ones being created,
despite the "forward" mode. Even after stopping all OpenStack
instances ("VMs"), we could also see that the remaining objects in the
pool were still locked. Manually unlocking these via rados commands
worked, but "cache-flush-evict-all" then still reported those "file
not found" errors and 1070 objects remained in the pool, like before.
We checked the remaining objects via "rados stat" both in the
hot-storage and the cold-storage pool and could see that every
hot-storage object had a counter-part in cold-storage with identical
stat info. We also compared the contents of some objects (with size >
0) and found the hot-storage and cold-storage copies to be identical.
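That comparison was essentially the following (a sketch; pool names are from our setup, and the field handling assumes "rados stat" prefixes its output with the "pool/object" name, which we strip before comparing):

```shell
# Sketch of the stat comparison between the two pools.
# Assumes "rados stat" output begins with "pool/object", so the first
# space-delimited field is dropped; pool names are from our cluster.
compare_stat() {
    obj="$1"
    hot=$(rados -p hot-storage stat "$obj" | cut -d' ' -f2-)
    cold=$(rados -p cold-storage stat "$obj" | cut -d' ' -f2-)
    if [ "$hot" = "$cold" ]; then
        echo "match: $obj"
    else
        echo "DIFFER: $obj"
    fi
}
# Example: rados -p hot-storage ls | while read -r o; do compare_stat "$o"; done
```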
We aborted that attempt, reverted the mode to "writeback" and
restarted the OpenStack cluster - everything was working fine again,
of course still using the cache tier.
During a recent maintenance window, the OpenStack cluster was shut
down again and we retried the procedure. As there were no active
users of the images pool, we skipped the step of forcing the cache
mode to "forward" and immediately issued the "cache-flush-evict-all"
command. Again 1070 objects remained in the hot-storage pool (and gave
"file not found" errors), but unlike last time, none were locked.
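For completeness, that retry boiled down to this (a sketch; pool name from our setup):

```shell
# Sketch of the retried step: run the bulk flush/evict, then count
# what is left behind. "hot-storage" is our cache pool name.
evict_all_and_count() {
    pool="$1"
    rados -p "$pool" cache-flush-evict-all
    # In our case this still showed 1070 remaining objects.
    rados -p "$pool" ls | wc -l
}
# Example: evict_all_and_count hot-storage
```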
Out of curiosity we then issued loops of "rados -p hot-storage
cache-flush <obj-name>" and "rados -p hot-storage cache-evict
<obj-name>" for all objects in the hot-storage pool - surprisingly,
we not only received no error messages at all, but were left with an
empty hot-storage pool! We then proceeded with the remaining steps from
the docs and were able to successfully remove the cache tier.
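The per-object loop that finally emptied the pool looked roughly like this (a sketch; pool name from our setup):

```shell
# Sketch of the per-object flush/evict loop that, unlike
# cache-flush-evict-all, emptied the pool without any errors.
# "hot-storage" is our cache pool name.
flush_evict_each() {
    pool="$1"
    rados -p "$pool" ls | while read -r obj; do
        rados -p "$pool" cache-flush "$obj"
        rados -p "$pool" cache-evict "$obj"
    done
}
# Example: flush_evict_each hot-storage
```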
This leaves us with two questions:
1. Does setting the cache mode to "forward" lead to the above
situation of remaining locks on hot-storage pool objects? Maybe the clients' unlock
requests are forwarded to the cold-storage pool, leaving the
hot-storage objects locked? If so, this should be documented and it'd
seem impossible to cleanly remove a cache tier during live operations.
2. What is the significant difference between "rados
cache-flush-evict-all" and separate "cache-flush" and "cache-evict"
cycles? Or is it some implementation error that leads to those "file
not found" errors with "cache-flush-evict-all", while the manual
cycles work successfully?
Thank you for any insight you might be able to share.
Regards,
Jens
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com