Re: How dead is my ec pool?

Brady Deetz <bdeetz@xxxxxxxxx> · Sat, 14 Oct 2017 00:15:02 -0500

At this point, before I go any further, I'm copying my pools to new pools so that I can attempt manual rados operations.
My current thinking is that I could compare every object in the cache tier against the ec pool. If an object doesn't exist in the ec pool, copy it over. If it exists in both and the two copies differ, replace the ec pool's object with the cache tier's object.
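Roughly, the comparison could look like the sketch below. This is only a dry run under assumptions: the pool names data_cache_copy and data_ec_copy are placeholders for the copies I'm making (not real pools yet), the helper names are made up, and it compares object data only (no xattrs, omap, or snapshots), so it may well miss things.

```shell
#!/bin/sh
# Hypothetical names for the copied pools; never point this at the live pools.
CACHE_POOL="${CACHE_POOL:-data_cache_copy}"
EC_POOL="${EC_POOL:-data_ec_copy}"

# Decide what to do with one object, given a checksum of each copy.
# An empty checksum means the object could not be read from that pool.
plan_action() {
    cache_sum="$1"
    ec_sum="$2"
    if [ -z "$cache_sum" ]; then
        echo skip        # nothing usable in the cache copy
    elif [ -z "$ec_sum" ]; then
        echo copy        # missing from the ec pool: copy it over
    elif [ "$cache_sum" != "$ec_sum" ]; then
        echo replace     # both exist but differ: cache copy wins
    else
        echo keep        # identical data: leave the ec object alone
    fi
}

# Print a checksum of an object's data, or nothing if it cannot be read.
object_sum() {
    pool="$1"
    obj="$2"
    tmp=$(mktemp) || return 1
    if rados -p "$pool" get "$obj" "$tmp" 2>/dev/null; then
        sha256sum "$tmp" | cut -d' ' -f1
    fi
    rm -f "$tmp"
}

# Dry run: list every object in the cache copy and print the planned action.
plan_all() {
    rados -p "$CACHE_POOL" ls | while IFS= read -r obj; do
        action=$(plan_action "$(object_sum "$CACHE_POOL" "$obj")" \
                             "$(object_sum "$EC_POOL" "$obj")")
        printf '%s\t%s\n' "$action" "$obj"
    done
}
```

Running plan_all would only print one "action<TAB>object" line per object; the actual copy or replace (rados get from the cache copy, then rados put into the ec copy) would be a separate step once the plan looks sane.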

Thoughts?

On Fri, Oct 13, 2017 at 10:13 PM, Brady Deetz <bdeetz@xxxxxxxxx> wrote:
TL;DR: In Jewel, I briefly had 2 cache tiers assigned to an ec pool, and I think that broke my ec pool. I then made a series of decisions attempting to repair that mistake, and I now think I've caused further issues.

Background:

After some serious I/O issues with my ec pool's cache tier, I decided to move to a cache tier hosted on a different set of disks than my current tier.
My first potentially poor decision was not removing the original cache tier before adding the new one.

Basically, the workflow was as follows:

pools:
data_ec
data_cache
data_new_cache

ceph osd tier add data_ec data_new_cache
ceph osd tier cache-mode data_new_cache writeback

ceph osd tier set-overlay data_ec data_new_cache
ceph osd pool set data_new_cache hit_set_type bloom
ceph osd pool set data_new_cache hit_set_count 1
ceph osd pool set data_new_cache hit_set_period 3600
ceph osd pool set data_new_cache target_max_bytes 1000000000000
ceph osd pool set data_new_cache min_read_recency_for_promote 1
ceph osd pool set data_new_cache min_write_recency_for_promote 1

#so now I decided to attempt to remove the old cache
ceph osd tier cache-mode data_cache forward

#here is where things got bad
rados -p data_cache cache-flush-evict-all

#every object rados attempted to flush from the cache produced one of the following errors
#
rados -p data_cache cache-flush-evict-all
        rbd_data.af81e6238e1f29.000000000001732e
error listing snap shots /rbd_data.af81e6238e1f29.000000000001732e: (2) No such file or directory
        rbd_data.af81e6238e1f29.00000000000143bb
error listing snap shots /rbd_data.af81e6238e1f29.00000000000143bb: (2) No such file or directory
        rbd_data.af81e6238e1f29.00000000000cf89d
failed to flush /rbd_data.af81e6238e1f29.00000000000cf89d: (2) No such file or directory
        rbd_data.af81e6238e1f29.00000000000cf82c

#Following these errors, I thought maybe the world would become happy again if I just removed the newly added cache tier.

ceph osd tier cache-mode data_new_cache forward
rados -p data_new_cache cache-flush-evict-all

#when running the evict against the new tier, I received no errors
#and so begins potential mistake number 3

ceph osd tier remove-overlay data_ec
ceph osd tier remove data_ec data_new_cache

#I received the same errors while trying to evict

#knowing my data had been untouched for over an hour, I made a terrible decision
ceph osd tier remove data_ec data_cache

#I then discovered that I couldn't add the new or the old cache back to the ec pool, even with --force-nonempty

ceph osd tier add data_ec data_cache --force-nonempty
Error ENOTEMPTY: tier pool 'data_cache' has snapshot state; it cannot be added as a tier without breaking the pool

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com