Hello,

On Tue, Jan 10, 2017 at 11:11 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Daznis
>> Sent: 09 January 2017 12:54
>> To: ceph-users <ceph-users@xxxxxxxxxxxxxx>
>> Subject: Ceph cache tier removal.
>>
>> Hello,
>>
>> I'm running preliminary tests of cache tier removal on a live test
>> cluster before I try to do it on a production one. I'm trying to avoid
>> downtime, but from what I've noticed it's either impossible or I'm doing
>> something wrong. My cluster is running CentOS 7.2 and ceph 0.94.9.
>>
>> Example 1:
>> I set the cache layer to forward:
>> 1. ceph osd tier cache-mode test-cache forward
>> Then I flush the cache:
>> 1. rados -p test-cache cache-flush-evict-all
>> Then I get stuck with some objects that can't be removed:
>>
>> rbd_header.29c3cdb2ae8944a
>> failed to evict /rbd_header.29c3cdb2ae8944a: (16) Device or resource busy
>> rbd_header.28c96316763845e
>> failed to evict /rbd_header.28c96316763845e: (16) Device or resource busy
>> error from cache-flush-evict-all: (1) Operation not permitted
>>
>
> These are probably the objects which have watchers attached. The current
> evict logic seems to be unable to evict these, hence the error. I'm not
> sure if anything can be done to work around this other than what you have
> tried, i.e. stopping the VM, which will remove the watcher.

You can move them out of the cache pool once you remove the tier overlay.
But I wasn't sure about data consistency, so I ran a few tests to confirm.
I spawned a few VMs that were just idling, a few that were writing small
files with consistent CRCs, and a few that were writing larger files to
disk with the sync option. I ran the test multiple times; I don't remember
the exact number, as I was really waiting for a CRC mismatch or a general
VM crash, but it was 20+ times.

The procedure: flush the cache a few times, until no new objects appear in
it. Then do a final flush followed by the overlay removal. After about a
minute the rbd_header objects unlock and you are able to flush them down
to cold storage. Once that was done I ran a CRC check on everything I was
verifying, and everything matched. So I'm pretty confident that I will not
lose any data doing this on a live/production cluster. I will run a few
more tests and decide what to do then. If I do this on production I will
report the progress; maybe it will help others struggling with similar
problems.

>
>> I found a workaround for this. You can bypass these errors by either:
>> 1. running "ceph osd tier remove-overlay test-pool", or
>> 2. turning off the VMs that are using the objects.
>>
>> For the second option: I can boot the VMs normally after recreating a
>> new overlay/cache tier. At this point everything works fine, but I'm
>> trying to avoid downtime, as it takes almost 8h to start everything and
>> check that it's in optimal condition.
>>
>> Now for the first option. I can remove the overlay and flush the cache
>> layer, and the VMs run fine with it removed. Issues start after I have
>> re-added the cache layer to the cold pool and try to write/read from the
>> disk. For no apparent reason the VMs just freeze, and you need to force
>> stop/start all of them to get things working again.
>
> Which pool are the VMs being pointed at, base or cache? I'm wondering if
> it's something to do with the pool id changing?

They were pointing at the base pool. After reading about it online I found
that I can re-add the tier with the machines live. I just need to run these
commands:

1. "ceph osd tier add cold-pool cache-pool --force-nonempty"
2. "ceph osd tier cache-mode cache-pool forward" <--- no other mode seems
to work, only forward. Also, you need to wait a while for all the
rbd_header objects to reappear in this pool before switching the cache
mode, or the VMs will crash.
3. "ceph osd tier set-overlay cold-pool cache-pool" <--- after you run
this, the header objects should start appearing in the cache pool
("rados -p cache-pool ls").
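For anyone following along, here is the whole removal sequence in one
place. This is a rough sketch from memory, not a polished script; pool
names are the cold-pool/cache-pool ones from above, and the final
"ceph osd tier remove" is the usual last step, though I only described the
flush and overlay parts earlier:

    # stop new objects being promoted into the cache
    ceph osd tier cache-mode cache-pool forward

    # flush/evict repeatedly until the object count stops shrinking;
    # rbd_header objects with live watchers keep failing with EBUSY
    rados -p cache-pool cache-flush-evict-all
    rados -p cache-pool ls | wc -l

    # drop the overlay; the VMs keep running, I/O now goes to the cold pool
    ceph osd tier remove-overlay cold-pool

    # after about a minute the watchers let go and the headers evict
    sleep 90
    rados -p cache-pool cache-flush-evict-all

    # once the cache pool is empty, the tier itself can be removed
    ceph osd tier remove cold-pool cache-pool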
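And the re-add on a live system, with a crude wait for the headers to come
back before leaving forward mode. The until-loop is my own sanity check,
not from any documentation, and it only confirms that at least one header
is back; I eyeballed the full object list by hand:

    ceph osd tier add cold-pool cache-pool --force-nonempty
    ceph osd tier cache-mode cache-pool forward
    ceph osd tier set-overlay cold-pool cache-pool

    # wait until rbd_header objects start reappearing in the cache pool
    until rados -p cache-pool ls | grep -q rbd_header; do
        sleep 10
    done

    # only now is it safe to switch to the mode you actually want to run
    # (writeback here is my assumption about the usual setup)
    ceph osd tier cache-mode cache-pool writeback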
"ceph osd tier cache-mode cache-pool forward" <--- no other mode seems to work only forward. Plus you need to wait a while for all rbd_header to reappear in this pool before switching cache-mode or the VM's will crash. 3. "ceph osd tier set-overlay cold-pool cache-pool" <--- after you run this header pools should start appearing in it. rados -p cache-pool ls > >> >> From what I have read about it all objects should leave cache tier and you don't have to "force" removing the tier with objects. >> >> Now onto the questions: >> >> 1. Is it normal for VPS to freeze while adding a cache layer/tier? >> 2. Do VMS' need to be offline to remove caching layer? >> 3. I have read somewhere that snapshots might interfere with cache >> tier clean up. Is it true? 4. Are there some other ways to >> remove the caching tier on a live system? >> >> >> Regards, >> >> >> Darius >> _______________________________________________ >> ceph-users mailing list >> ceph-users@xxxxxxxxxxxxxx >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > Regards, Darius _______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com