On 4/20/22 12:34, Nikola Ciprich wrote:
Hi fellow ceph users and developers,
we've gotten into quite a strange situation that I'm not sure
isn't a ceph bug..
we have a 4-node Ceph cluster with multiple pools. one of them
is a SATA EC 2+2 pool containing 4x3 10TB drives (one of them
is actually 12TB)
one day, after a planned downtime of the fourth node, we got into a strange
state where there seemed to be a large amount of degraded PGs
to recover (even though we had noout set for the duration of the downtime).
the weird thing was that the OSDs of that node seemed to be almost full (i.e.
80%) while there were almost no PGs on them according to osd df tree,
leading to backfilltoofull..
after some experimenting, I dropped those OSDs and recreated them, but during
the recovery we got into the same state:
-31 120.00000 - 112 TiB 81 TiB 80 TiB 36 GiB 456 GiB 31 TiB 72.58 1.06 - root sata-archive
-32 30.00000 - 29 TiB 20 TiB 20 TiB 10 GiB 133 GiB 9.5 TiB 67.48 0.99 - host v1a-sata-archive
5 hdd 10.00000 1.00000 9.2 TiB 6.2 TiB 6.1 TiB 3.5 GiB 47 GiB 3.0 TiB 67.78 0.99 171 up osd.5
10 hdd 10.00000 1.00000 9.2 TiB 6.2 TiB 6.2 TiB 3.6 GiB 48 GiB 2.9 TiB 68.06 1.00 171 up osd.10
13 hdd 10.00000 1.00000 11 TiB 7.3 TiB 7.3 TiB 3.2 GiB 38 GiB 3.6 TiB 66.73 0.98 170 up osd.13
-33 30.00000 - 27 TiB 19 TiB 18 TiB 11 GiB 139 GiB 9.0 TiB 67.39 0.99 - host v1b-sata-archive
19 hdd 10.00000 1.00000 9.2 TiB 6.1 TiB 6.1 TiB 3.5 GiB 46 GiB 3.0 TiB 67.11 0.98 171 up osd.19
28 hdd 10.00000 1.00000 9.2 TiB 6.1 TiB 6.0 TiB 3.5 GiB 46 GiB 3.1 TiB 66.44 0.97 170 up osd.28
29 hdd 10.00000 1.00000 9.2 TiB 6.3 TiB 6.2 TiB 3.6 GiB 48 GiB 2.9 TiB 68.61 1.00 171 up osd.29
-34 30.00000 - 27 TiB 19 TiB 19 TiB 11 GiB 143 GiB 8.6 TiB 68.65 1.00 - host v1c-sata-archive
30 hdd 10.00000 1.00000 9.2 TiB 6.3 TiB 6.2 TiB 3.8 GiB 48 GiB 2.8 TiB 68.91 1.01 171 up osd.30
31 hdd 10.00000 1.00000 9.1 TiB 6.3 TiB 6.3 TiB 3.6 GiB 48 GiB 2.8 TiB 69.20 1.01 171 up osd.31
52 hdd 10.00000 1.00000 9.1 TiB 6.2 TiB 6.1 TiB 3.4 GiB 46 GiB 2.9 TiB 67.84 0.99 170 up osd.52
-35 30.00000 - 27 TiB 24 TiB 24 TiB 4.0 GiB 41 GiB 3.5 TiB 87.13 1.27 - host v1d-sata-archive
53 hdd 10.00000 1.00000 9.2 TiB 8.1 TiB 8.0 TiB 1.3 GiB 14 GiB 1.0 TiB 88.54 1.29 81 up osd.53
54 hdd 10.00000 1.00000 9.2 TiB 8.3 TiB 8.2 TiB 1.4 GiB 14 GiB 897 GiB 90.44 1.32 79 up osd.54
55 hdd 10.00000 1.00000 9.1 TiB 7.5 TiB 7.5 TiB 1.3 GiB 13 GiB 1.6 TiB 82.39 1.21 62 up osd.55
the count of PGs on osd.53..55 is less than half that of the other OSDs, but
they are almost full. according to the weights, this should not happen..
What Ceph version are you running? ceph versions
What do you have set as the nearfull ratio? ceph osd dump | grep nearfull
Do you have the ceph balancer enabled? ceph balancer status
What kind of maintenance was going on?
Are the PGs on those OSDs *way* bigger than on those of the other nodes?
Run ceph pg ls-by-osd $osd-id and check the bytes (and OMAP bytes). This
information is only accurate when the PGs have been recently deep-scrubbed.
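To gather the above in one pass, something like this should work (a sketch;
the jq field names assume the Nautilus-or-later JSON layout of pg ls-by-osd,
and osd.53..55 are the suspect OSDs from the df tree above):

  ceph versions
  ceph osd dump | grep ratio      # full, backfillfull and nearfull ratios
  ceph balancer status
  # sum the bytes held by the PGs currently mapped to each suspect OSD
  for osd in 53 54 55; do
    echo -n "osd.$osd: "
    ceph pg ls-by-osd $osd -f json | jq '[.pg_stats[].stat_sum.num_bytes] | add'
  done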
In this case the PG backfilltoofull warning(s) might have been correct.
Yesterday, though, I had no OSDs close to the nearfull ratio and was still
getting the same PG backfilltoofull message ... previously seen due to this
bug [1]. I could fix that by setting upmaps for the affected PGs to another OSD.
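For the record, a manual upmap looks roughly like this (a sketch with a
hypothetical PG ID; pg-upmap-items requires require-min-compat-client to be
luminous or newer):

  # remap PG 7.1a so that osd.53 is replaced by osd.5 in its mapping
  ceph osd pg-upmap-items 7.1a 53 5
  # undo it again later:
  ceph osd rm-pg-upmap-items 7.1a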
any idea why this could be happening, or what to check?
It helps to know what kind of maintenance was going on. Sometimes Ceph PG
mappings are not what you want. There are ways to do maintenance in a
more controlled fashion.
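For example, one common pattern for a planned node downtime is (a sketch;
which flags are appropriate depends on the situation):

  ceph osd set noout        # prevent down OSDs from being marked out
  ceph osd set norebalance  # suppress data movement during the window
  # ... do the maintenance, bring the node and its OSDs back up ...
  ceph osd unset norebalance
  ceph osd unset noout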
thanks a lot in advance for any hints..
Gr. Stefan
[1]: https://tracker.ceph.com/issues/39555