Hi *,
while trying to remove a cache tier from a pool used for RBD / OpenStack,
we followed the procedure from
http://docs.ceph.com/docs/master/rados/operations/cache-tiering/#removing-a-writeback-cache
and ran into problems.
The cluster is currently running Ceph 12.2.2, the caching tier was
created with an earlier release of Ceph.
First of all, setting the cache mode to "forward" is reported as
unsafe and requires "--yes-i-really-mean-it", which is not mentioned
in the documentation - if forward mode is really meant to be used in
this case, the need for that flag should be documented.
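For reference, the step in question looked roughly like this (a sketch; "hot-storage" is the name of our cache pool, wrapped in a function only so the pool name stays a parameter):

```shell
# Sketch of the mode change that triggered the warning.
# "hot-storage" is our cache pool name (an assumption for this example).
set_cache_forward() {
    # Without --yes-i-really-mean-it, Luminous refuses the change
    # with an "unsafe" warning that the docs do not mention.
    ceph osd tier cache-mode "$1" forward --yes-i-really-mean-it
}
# Example: set_cache_forward hot-storage
```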
Unfortunately, using "rados -p hot-storage cache-flush-evict-all" not
only reported errors ("file not found") for many objects, but left us
with quite a number of objects in the pool and new ones being created,
despite the "forward" mode. Even after stopping all OpenStack
instances ("VMs"), we could also see that the remaining objects in the
pool were still locked. Manually unlocking these via rados commands
worked, but "cache-flush-evict-all" then still reported those "file
not found" errors and 1070 objects remained in the pool, like before.
We checked the remaining objects via "rados stat" both in the
hot-storage and the cold-storage pool and could see that every
hot-storage object had a counter-part in cold-storage with identical
stat info. We also compared the contents of some objects (with size >
0) and found the hot-storage and cold-storage copies to be identical.
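That comparison was essentially the following (a sketch; pool names are from our setup, and the field handling assumes "rados stat" prefixes its output with the "pool/object" name, which we strip before comparing):

```shell
# Sketch of the stat comparison between the two pools.
# Assumes "rados stat" output begins with "pool/object", so the first
# space-delimited field is dropped; pool names are from our cluster.
compare_stat() {
    obj="$1"
    hot=$(rados -p hot-storage stat "$obj" | cut -d' ' -f2-)
    cold=$(rados -p cold-storage stat "$obj" | cut -d' ' -f2-)
    if [ "$hot" = "$cold" ]; then
        echo "match: $obj"
    else
        echo "DIFFER: $obj"
    fi
}
# Example: rados -p hot-storage ls | while read -r o; do compare_stat "$o"; done
```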
We aborted that attempt, reverted the mode to "writeback" and
restarted the OpenStack cluster - everything was working fine again,
of course still using the cache tier.
During a recent maintenance window, the OpenStack cluster was shut
down again and we retried the procedure. As there were no active
users of the images pool, we skipped the step of forcing the cache
mode to "forward" and immediately issued the "cache-flush-evict-all"
command. Again 1070 objects remained in the hot-storage pool (and gave
"file not found" errors), but unlike last time, none were locked.
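For completeness, that retry boiled down to this (a sketch; pool name from our setup):

```shell
# Sketch of the retried step: run the bulk flush/evict, then count
# what is left behind. "hot-storage" is our cache pool name.
evict_all_and_count() {
    pool="$1"
    rados -p "$pool" cache-flush-evict-all
    # In our case this still showed 1070 remaining objects.
    rados -p "$pool" ls | wc -l
}
# Example: evict_all_and_count hot-storage
```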
Out of curiosity we then issued loops of "rados -p hot-storage
cache-flush <obj-name>" and "rados -p hot-storage cache-evict
<obj-name>" for all objects in the hot-storage pool - surprisingly,
we not only received no error messages at all, but were left with an
empty hot-storage pool! We then proceeded with the remaining steps from
the docs and were able to successfully remove the cache tier.
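The per-object loop that finally emptied the pool looked roughly like this (a sketch; pool name from our setup):

```shell
# Sketch of the per-object flush/evict loop that, unlike
# cache-flush-evict-all, emptied the pool without any errors.
# "hot-storage" is our cache pool name.
flush_evict_each() {
    pool="$1"
    rados -p "$pool" ls | while read -r obj; do
        rados -p "$pool" cache-flush "$obj"
        rados -p "$pool" cache-evict "$obj"
    done
}
# Example: flush_evict_each hot-storage
```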
This leaves us with two questions:
1. Does setting the cache mode to "forward" lead to the above
situation of remaining locks on hot-storage pool objects? Maybe the clients' unlock
requests are forwarded to the cold-storage pool, leaving the
hot-storage objects locked? If so, this should be documented and it'd
seem impossible to cleanly remove a cache tier during live operations.
2. What is the significant difference between "rados
cache-flush-evict-all" and separate "cache-flush" and "cache-evict"
cycles? Or is it some implementation error that leads to those "file
not found" errors with "cache-flush-evict-all", while the manual
cycles work successfully?
Thank you for any insight you might be able to share.
Regards,
Jens
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com