Hello Mike,
Quoting Mike Lovell <mike.lovell@xxxxxxxxxxxxx>:
On Mon, Jan 8, 2018 at 6:08 AM, Jens-U. Mozdzen <jmozdzen@xxxxxx> wrote:
Hi *,
[...]
1. Does setting the cache mode to "forward" lead to the above situation
of remaining locks on hot-storage pool objects? Maybe the clients'
unlock requests are forwarded to the cold-storage pool, leaving the
hot-storage objects locked? If so, this should be documented, and it
would seem impossible to cleanly remove a cache tier during live
operations.
2. What is the significant difference between "rados
cache-flush-evict-all" and separate "cache-flush" and "cache-evict" cycles?
Or is it some implementation error that leads to those "file not found"
errors with "cache-flush-evict-all", while the manual cycles work
successfully?
Thank you for any insight you might be able to share.
Regards,
Jens
I've removed a cache tier in environments a few times. The only locked
objects I ran into were the rbd_directory and rbd_header objects for
each volume. The rbd_header of each RBD volume is locked as long as the
VM is running. Every time I've tried to remove a cache tier, I shut down
all of the VMs before starting the procedure, and there wasn't any
problem getting things flushed and evicted. So I can't really give any
further insight into what might have happened, other than that it worked
for me. I set the cache mode to forward every time before flushing and
evicting objects.
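For reference, that sequence corresponds roughly to the following
(a minimal sketch; "hot-storage" and "cold-storage" stand in for the
actual cache and base pool names):

    # stop admitting new writes into the cache tier
    # (on 12.2, forward mode requires the extra confirmation flag)
    ceph osd tier cache-mode hot-storage forward --yes-i-really-mean-it

    # flush and evict everything still held in the cache pool
    rados -p hot-storage cache-flush-evict-all

    # once the cache pool is empty, detach it from the base pool
    ceph osd tier remove-overlay cold-storage
    ceph osd tier remove cold-storage hot-storage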
While your report doesn't confirm the suspicion expressed in my
question 1, it is at least another example where removing the cache tier
worked *after stopping all instances*, rather than "live". If, on the
other hand, this limitation is confirmed, it should be added to the
docs.
Out of curiosity: do you have any other users of the pool? After
stopping all VMs (and the image-related services on our OpenStack
control nodes), nothing was accessing the pool anymore, so I saw no need
to put the cache tier into "forward" mode.
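One way to double-check that nothing still has the pool's images open is
to look at watchers and locks on the header objects, e.g. (a sketch; the
image ID is a placeholder):

    # list any clients still watching a given image header
    rados -p hot-storage listwatchers rbd_header.<image-id>

    # list any locks still held on it
    rados -p hot-storage lock list rbd_header.<image-id>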
I don't think there really is a significant technical difference between
the cache-flush-evict-all command and doing separate cache-flush and
cache-evict operations on individual objects. My understanding is that
cache-flush-evict-all is just a shortcut to getting everything in the
cache flushed and evicted. Did cache-flush-evict-all error on some
objects where the separate operations succeeded? Your description
doesn't say whether that was the case, but you mention using both styles
during your second attempt.
It was actually that every run of "cache-flush-evict-all" reported
errors on all remaining objects, while running the loop manually
(issuing a flush for every object, then an evict for every remaining
object) worked flawlessly; see the sketch below. That's why my
question 2 came up.
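In shell terms, the manual cycle was roughly this (a sketch, with
"hot-storage" again standing in for the cache pool name):

    # pass 1: flush every object still in the cache pool
    rados -p hot-storage ls | while read obj; do
        rados -p hot-storage cache-flush "$obj"
    done

    # pass 2: evict whatever remains
    rados -p hot-storage ls | while read obj; do
        rados -p hot-storage cache-evict "$obj"
    done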
The objects I saw seemed related to the images stored in the pool, not
to any "management data" (like the suggested hitset persistence).
On a different note, you say that your cluster is on 12.2 but the cache
tiers were created on an earlier version. Which version was the cache
tier created on? How well did the upgrade process work? I am curious
since the production clusters I have using a cache tier are still on
10.2, and I'm about to begin testing the upgrade to 12.2. Any info on
that experience you can share would be helpful.
I *believe* the cache was created on 10.2, but I cannot recall for sure.
I remember having had similar problems in those earlier days with a
previous instance of that caching tier, but many of the root causes were
"on my side of the keyboard". The cache tier I was trying to remove
recently was created from scratch after those problems, and upgrading to
the latest release via the recommended intermediate version steps was
problem-free. At least as far as cache tiers are concerned ;)
Regards,
Jens