TL;DR: We could not fix this problem in the end and ended up with a CephFS
in read-only mode (so we could only back up, delete and restore) and
one broken OSD (we deleted that one and restored the data to a "new disk").
I can now wrap up my whole experience with this problem.
After OSD usage grew to almost 2 TB x 3 OSDs (for data that 'du'
counted at about 120 GB), Ceph stopped filling up, and in the week or
two that followed most of the used space showed up as free again.
But there was one OSD that did not free up a meaningful amount of space.
To my surprise it was an OSD backed by SSDs, not the one on HDDs.
It seems the biggest contributing factor was that I created the pools
for the CephFS with PG autoscaling set to on (the default in the cephadm
dashboard).
This pool never grew to more than 1 PG although it held a little over
100 GB.
From what I read on this list, this alone is prone to lock contention
and other problematic behavior.
Lessons learned:
* If you use pool autoscaling, check whether it actually does its job.
   -> I opted for setting the PG number for the replacement pools to
32 manually (see the example commands after this list).
* CephFS has a read-only mode that at least lets you back up data in
some bad states.
   -> That is good to know. It at least allows administrators to copy
the data to other storage devices (see the mount example after this list).
* If you use CephFS for persistent volumes in Kubernetes, be aware that
you will probably lose all volumes at the same time when CephFS switches
to read-only.
   The CephFS CSI driver does not work on a read-only CephFS; it always
writes xattrs on mount (the data pool that should be used and other
internal data) and gives up if that fails.
   -> Use a reasonable number of CephFS filesystems for Kubernetes
persistent storage so you don't lose all PVs at once (a sketch follows
after this list).
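
For the autoscaler point, a minimal sketch of what I did for the
replacement pools (the pool name here is just a placeholder for our setup):

   # check what the autoscaler currently reports per pool
   ceph osd pool autoscale-status
   # disable autoscaling and set the PG count by hand
   ceph osd pool set cephfs.k8s.data pg_autoscale_mode off
   ceph osd pool set cephfs.k8s.data pg_num 32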
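
For the backup in read-only mode, a plain kernel mount with the ro option
plus rsync is one way to copy the data off; roughly like this (monitor
address, secret file and paths are placeholders):

   # mount the (already read-only) CephFS explicitly read-only
   mount -t ceph 192.0.2.10:6789:/ /mnt/cephfs-backup \
         -o name=admin,secretfile=/etc/ceph/admin.secret,ro
   # copy everything off to other storage
   rsync -aHAX /mnt/cephfs-backup/ /backup/cephfs/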
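
For splitting Kubernetes PVs over several filesystems, the fs volume
interface makes that easy; each filesystem then gets its own StorageClass
on the CSI side (the names here are just examples):

   # create separate filesystems for different groups of PVs
   ceph fs volume create k8s-apps-a
   ceph fs volume create k8s-apps-b
   ceph fs ls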
Other problems I found with my configuration:
* We suffer from VMware ESXi misreporting the type of the physically
attached disks. It reports some SAS SSDs as HDDs. We also have real SAS
HDDs attached to some nodes.
   I suspected that to be a problem and we will exchange the HDDs for
SSDs soon, but the big problem was that the device type was in the CRUSH map.
   -> I edited the CRUSH map to ignore the type of storage, as it is not
very meaningful in our setup anyway (a rough outline follows after this list).
* I had PGs stuck in the undersized state for a long time and could not
understand why Ceph did not fix them.
   Then I checked the OSD weights (reweights) again and they were set to
different values (1 and 0.85).
   After setting the reweight to 1 on all OSDs, Ceph actually started to
bring all PGs into the active+clean state.
   -> If all the OSDs actually are the same size, I will either not
reweight in the future or set the same value (probably 1) on all OSDs
(see the commands after this list).
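
The CRUSH map edit mentioned above was roughly the usual
decompile/edit/recompile round trip (file names are arbitrary):

   # dump and decompile the current CRUSH map
   ceph osd getcrushmap -o crushmap.bin
   crushtool -d crushmap.bin -o crushmap.txt
   # edit crushmap.txt, e.g. change "step take default class ssd"
   # to "step take default" so placement ignores the device class
   # recompile and inject the edited map
   crushtool -c crushmap.txt -o crushmap-new.bin
   ceph osd setcrushmap -i crushmap-new.bin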
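
Checking and fixing the reweights is quick (the OSD id is an example):

   # the REWEIGHT column shows the override weight per OSD
   ceph osd df tree
   # set the override reweight back to 1 on the affected OSD
   ceph osd reweight 5 1.0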
So now, after a noticeable Kubernetes downtime and having to recreate
most persistent volumes on the cluster, Ceph health is HEALTH_OK again.
I could upgrade to Ceph 16.2.13.
I hope I can now upgrade to 17.2.6 without issues.
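
Since the cluster is managed by cephadm, that upgrade should again just be
a matter of:

   ceph orch upgrade start --ceph-version 17.2.6
   # and then watching
   ceph orch upgrade status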
Best regards
--
Mag. Ing. Omar Siam
Austrian Center for Digital Humanities and Cultural Heritage
Österreichische Akademie der Wissenschaften | Austrian Academy of Sciences
Stellvertretende Behindertenvertrauensperson | Deputy representative for disabled persons
Bäckerstraße 13, 1010 Wien, Österreich | Vienna, Austria
T: +43 1 51581-7295
omar.siam@xxxxxxxxxx | www.oeaw.ac.at/acdh
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx