At this point, before I go any further, I'm copying my pools to new pools so that I can attempt manual rados operations.
My current thinking is that I could compare every object in the cache tier against the ec pool. If an object doesn't exist in the ec pool, copy it over. If it exists in both pools and the contents differ, replace the ec pool's object with the cache tier's object.
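A minimal sketch of that comparison logic, under loud assumptions: each pool is represented here as an in-memory dict of object name to contents, which is purely illustrative. On a real cluster the names would more likely come from `rados -p <pool> ls`, the contents from `rados -p <pool> get <obj> <file>`, and the changes would be applied with `rados -p <pool> put <obj> <file>`. The function name `plan_sync` and the sample object names are hypothetical, and the sketch ignores snapshots and clones entirely, which may matter given the snapshot errors quoted below.

```python
from typing import Dict, List, Tuple


def plan_sync(cache_objs: Dict[str, bytes],
              ec_objs: Dict[str, bytes]) -> List[Tuple[str, str]]:
    """Decide what to do with each object found in the cache tier.

    Returns (action, object_name) pairs, sorted by object name:
      "copy"    - object is missing from the ec pool
      "replace" - object exists in both pools but the contents differ
    Objects that are identical in both pools, and objects that exist
    only in the ec pool, are left alone.
    """
    plan = []
    for name in sorted(cache_objs):
        if name not in ec_objs:
            plan.append(("copy", name))
        elif cache_objs[name] != ec_objs[name]:
            plan.append(("replace", name))
    return plan


if __name__ == "__main__":
    # Hypothetical stand-ins for the contents of the two copied pools.
    cache = {"obj_a": b"new", "obj_b": b"same", "obj_c": b"cache version"}
    ec = {"obj_b": b"same", "obj_c": b"ec version", "obj_d": b"ec only"}
    for action, name in plan_sync(cache, ec):
        print(action, name)
```

Building the plan first and applying it as a separate step means the list of copies and replacements can be reviewed before anything is written to the ec pool.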
Thoughts?
On Fri, Oct 13, 2017 at 10:13 PM, Brady Deetz <bdeetz@xxxxxxxxx> wrote:
TL;DR: In Jewel, I briefly had 2 cache tiers assigned to an ec pool and I think that broke my ec pool. I then made a series of decisions attempting to repair that mistake. I now think I've caused further issues.

Background:
After some serious I/O issues with my ec pool's cache tier, I decided I wanted to use a cache tier hosted on a different set of disks than my current tier.

My first potentially poor decision was not removing the original cache tier before adding the new one.

Basically, the workflow was as follows:

pools:
data_ec
data_cache
data_new_cache

ceph osd tier add data_ec data_new_cache
ceph osd tier cache-mode data_new_cache writeback
ceph osd tier set-overlay data_ec data_new_cache
ceph osd pool set data_new_cache hit_set_type bloom
ceph osd pool set data_new_cache hit_set_count 1
ceph osd pool set data_new_cache hit_set_period 3600
ceph osd pool set data_new_cache target_max_bytes 1000000000000
ceph osd pool set data_new_cache min_read_recency_for_promote 1
ceph osd pool set data_new_cache min_write_recency_for_promote 1

#so now I decided to attempt to remove the old cache
ceph osd tier cache-mode data_cache forward

#here is where things got bad
rados -p data_cache cache-flush-evict-all

#every object rados attempted to flush from the cache left errors of the following varieties
#rados -p data_cache cache-flush-evict-all
rbd_data.af81e6238e1f29.000000000001732e
error listing snap shots /rbd_data.af81e6238e1f29.000000000001732e: (2) No such file or directory
rbd_data.af81e6238e1f29.00000000000143bb
error listing snap shots /rbd_data.af81e6238e1f29.00000000000143bb: (2) No such file or directory
rbd_data.af81e6238e1f29.00000000000cf89d
failed to flush /rbd_data.af81e6238e1f29.00000000000cf89d: (2) No such file or directory
rbd_data.af81e6238e1f29.00000000000cf82c

#Following these errors, I thought maybe the world would become happy again if I just removed the newly added ecpool.
ceph osd tier cache-mode data_new_cache forward
rados -p data_new_cache cache-flush-evict-all
#when running the evict against the new tier, I received no errors

#and so begins potential mistake number 3
ceph osd tier remove-overlay ec_data
ceph osd tier remove data_ec data_new_cache

#I received the same errors while trying to evict
#knowing my data had been untouched for over an hour, I made a terrible decision
ceph osd tier remove data_ec data_cache

#I then discovered that I couldn't add the new or the old cache back to the ec pool, even with --force-nonempty
ceph osd tier add data_ec data_cache --force-nonempty
Error ENOTEMPTY: tier pool 'data_cache' has snapshot state; it cannot be added as a tier without breaking the pool
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com