Hi,

For those facing (lots of) active+clean+inconsistent PGs after the Luminous 12.2.6 metadata corruption and the 12.2.7 upgrade, I'd like to explain how I finally got rid of them.

Disclaimer: my cluster doesn't contain highly valuable data, and I can sort of recreate what it actually contains: VMs. The following is risky…

One reason I needed to fix these issues is that I was hitting IO errors with pool overlays/tiering which were apparently related to the inconsistencies, and the only way I could get my VMs running again was to completely disable the SSD overlay, which is far from ideal.
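For the record, what I mean by "disabling the overlay" is roughly the following - only a sketch of the kind of commands involved, with "rbd" and "ssd-cache" standing in for your base and cache pool names, and in a writeback setup you'd want to flush/evict the cache first:

# rados -p ssd-cache cache-flush-evict-all
# ceph osd tier cache-mode ssd-cache readproxy
# ceph osd tier remove-overlay rbd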
For those not feeling the need to fix this "harmless" issue, please stop reading. For the others, please understand the risks of what follows… or wait for an official "pg repair" solution.

So, 1st step: since I was getting an ever-growing list of damaged PGs, I decided to deep-scrub… all PGs. Yes. If you have 1+ PB of data… stop reading (or not?).
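By the way, to check which PGs are currently flagged inconsistent at any point, the cluster can tell you directly - something like this (with "rbd" standing in for your pool name):

# ceph health detail | grep inconsistent
# rados list-inconsistent-pg rbd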
How to do that (deep-scrub everything):

# for j in <pools to scrub> ; do for i in `ceph pg ls-by-pool $j | cut -d " " -f 1 | tail -n +2`; do ceph pg deep-scrub $i ; done ; done

I think I already had a full list of damaged PGs until I upgraded to Mimic and restarted the MONs and the OSDs: I believe the daemon restarts caused Ceph to forget about the known inconsistencies. If you believe the number of damaged PGs is more or less stable for you, then skip step 1…

2nd step is sort of easy: it is to apply the method described here:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/021054.html
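If I summarize that method (my own rough reading of it - the PG id and the object name below are only placeholders): locate the inconsistent object in the PG, read it with rados get, write it back with rados put so that every copy gets rewritten, then deep-scrub the PG again so it can go back to active+clean:

# rados list-inconsistent-obj 2.5 --format=json-pretty
# rados -p rbd get rbd_data.abcdef0123456789.0000000000000042 /tmp/obj
# rados -p rbd put rbd_data.abcdef0123456789.0000000000000042 /tmp/obj
# ceph pg deep-scrub 2.5

The weak point is obviously the window between the get and the put, which is what the locking attempt below was about.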
I tried to add some rados locking before overwriting the objects (4M rbd objects in my case), but I was still able to overwrite a locked object even with "rados -p rbd lock get --lock-type exclusive"… maybe I haven't tried hard enough. It would have been great if it were possible to make sure the object was not overwritten between the get and the put :/ - that would make this procedure much safer…

In my case I had 2000+ damaged PGs, so I wrote a small script that goes through those PGs and tries to apply the procedure:
https://gist.github.com/fschaer/cb851eae4f46287eaf30715e18f14524

My Ceph cluster has been healthy since Friday evening, and I haven't seen any data corruption nor any hung VM…

Cheers
Frederic