TL;DR: We could not fix this problem in the end and ended up with a CephFS
in read-only mode (so we could only back up, delete and restore) and
one broken OSD (we deleted that one and restored the data to a "new disk").
I can now wrap up my whole experience with this problem.
After OSD usage grew to almost 2 TB x 3 OSDs (for data that 'du'
counted at about 120 GB), Ceph stopped filling up, and in the week or
two that followed most of the used space showed up as free again.
But there was one OSD that did not free up a meaningful amount of space.
To my surprise it was an OSD backed by SSDs, not the one on HDDs.
It seems the biggest contributing factor was that I created the pools
for the CephFS with PG autoscaling set to on (the default in the cephadm
dashboard).
This pool never grew to more than 1 PG although it held a little over
100 GB.
From what I read on this list, this alone is prone to lock contention
and other problematic behavior.
Lessons learned:
* If you use pool autoscaling, check whether it actually does its job.
   -> I opted for setting the PG number for the replacement pools to
32 manually (see the example commands after this list).
* CephFS has a read-only mode that at least lets you back up data in
some bad states.
   -> That is good to know. It at least allows administrators to copy
the data to other storage devices (see the mount example after this list).
* If you use CephFS for persistent volumes in Kubernetes, be aware that
you will probably lose all volumes at the same time when CephFS switches
to read-only.
   The CephFS CSI driver does not work on a read-only CephFS; it always
writes xattrs on mount (the data pool that should be used and other
internal data) and gives up if that fails.
   -> Use a reasonable number of CephFS filesystems for Kubernetes
persistent storage so you don't lose all PVs at once (a sketch follows
after this list).
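
For the autoscaler point, a minimal sketch of what I did for the
replacement pools (the pool name here is just a placeholder for our setup):

   # check what the autoscaler currently reports per pool
   ceph osd pool autoscale-status
   # disable autoscaling and set the PG count by hand
   ceph osd pool set cephfs.k8s.data pg_autoscale_mode off
   ceph osd pool set cephfs.k8s.data pg_num 32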
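
For the backup in read-only mode, a plain kernel mount with the ro option
plus rsync is one way to copy the data off; roughly like this (monitor
address, secret file and paths are placeholders):

   # mount the (already read-only) CephFS explicitly read-only
   mount -t ceph 192.0.2.10:6789:/ /mnt/cephfs-backup \
         -o name=admin,secretfile=/etc/ceph/admin.secret,ro
   # copy everything off to other storage
   rsync -aHAX /mnt/cephfs-backup/ /backup/cephfs/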
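
For splitting Kubernetes PVs over several filesystems, the fs volume
interface makes that easy; each filesystem then gets its own StorageClass
on the CSI side (the names here are just examples):

   # create separate filesystems for different groups of PVs
   ceph fs volume create k8s-apps-a
   ceph fs volume create k8s-apps-b
   ceph fs ls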
Other problems I found with my configuration:
* We suffer from VMware ESXi misreporting the type of the physically
attached disks. It reports some SAS SSDs as HDDs. We also have real SAS
HDDs attached to some nodes.
   I suspected that to be a problem and we will exchange the HDDs for
SSDs soon, but the big problem was that the device type was in the CRUSH map.
   -> I edited the CRUSH map to ignore the type of storage, as it is not
very meaningful in our setup anyway (a rough outline follows after this list).
* I had PGs stuck in the undersized state for a long time and could not
understand why Ceph did not fix them.
   Then I checked the OSD weights (reweights) again and they were set to
different values (1 and 0.85).
   After setting the reweight to 1 on all OSDs, Ceph actually started to
bring all PGs into the active+clean state.
   -> If all the OSDs actually are the same size, I will either not
reweight in the future or set the same value (probably 1) on all OSDs
(see the commands after this list).
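
The CRUSH map edit mentioned above was roughly the usual
decompile/edit/recompile round trip (file names are arbitrary):

   # dump and decompile the current CRUSH map
   ceph osd getcrushmap -o crushmap.bin
   crushtool -d crushmap.bin -o crushmap.txt
   # edit crushmap.txt, e.g. change "step take default class ssd"
   # to "step take default" so placement ignores the device class
   # recompile and inject the edited map
   crushtool -c crushmap.txt -o crushmap-new.bin
   ceph osd setcrushmap -i crushmap-new.bin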
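
Checking and fixing the reweights is quick (the OSD id is an example):

   # the REWEIGHT column shows the override weight per OSD
   ceph osd df tree
   # set the override reweight back to 1 on the affected OSD
   ceph osd reweight 5 1.0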
So now, after a noticeable Kubernetes downtime and having to recreate
most persistent volumes on the cluster, Ceph health is HEALTH_OK again.
I could upgrade to Ceph 16.2.13.
I hope I can now upgrade to 17.2.6 without issues.
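
Since the cluster is managed by cephadm, that upgrade should again just be
a matter of:

   ceph orch upgrade start --ceph-version 17.2.6
   # and then watching
   ceph orch upgrade status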
Best regards
--
Mag. Ing. Omar Siam
Austrian Center for Digital Humanities and Cultural Heritage
Österreichische Akademie der Wissenschaften | Austrian Academy of Sciences
Stellvertretende Behindertenvertrauensperson | Deputy representative for disabled persons
Bäckerstraße 13, 1010 Wien, Österreich | Vienna, Austria
T: +43 1 51581-7295
omar.siam@xxxxxxxxxx | www.oeaw.ac.at/acdh
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx