Re: Unbalanced Cluster

Hi Erdem,

The balancer was driving all the weights to 1.00000, so I turned it off.  With it on, the OSDs kept creeping up toward the 90% full threshold.  I've been playing whack-a-mole with the OSDs for a week, trying to keep the cluster from locking all writes when a single OSD goes over 90%.
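
To be concrete, the whack-a-mole is roughly along these lines (the OSD id and weight below are placeholders, not the exact values I've been using):

# ceph balancer off            # stop the balancer from pushing the weights back toward 1.0
# ceph osd df tree             # find the fullest OSDs by %USE
# ceph osd reweight 123 0.95   # nudge an over-full OSD down a bit; id and weight are illustrative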

I had a look at deleting that pool.  I believe it's still required to keep the filesystem happy, so I'm a bit anxious about removing it.  It's been a long time since the new fsdatak7m2 pool was created and my memory of how it was done is getting foggy; I think the new pool was added as a tier and the data was then migrated onto it.  I don't think it's safe to delete the old pool, because it still appears to be in use:

# ceph osd pool stats fsdata
pool fsdata id 4
  227323/3383377685 objects degraded (0.007%)
  94063915/3383377685 objects misplaced (2.780%)
  recovery io 0 B/s, 71 objects/s

The filesystem holds about 1.4 billion files.
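
For what it's worth, the obvious double-checks on whether the pool is still wired into the filesystem would be something like:

# ceph fs ls    # lists the metadata pool and all data pools attached to each filesystem
# ceph df       # per-pool object counts; a genuinely unused pool should show 0 objects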

You mentioned that all of the solutions would cause data migration and would need to be planned carefully.  I've seen that language in the docs and in other messages, but what I can't find is what "planned carefully" actually means.  Any of these changes will trigger heavy data migration, and there's no way to avoid that other than throttling it (max backfills and so on); the system should still be accessible during that time, just with reduced bandwidth and higher latency.  Is the warning simply that the cluster could be degraded for a long period, or is it suggesting that users should take an outage while the rebuild happens?
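
For reference, by "change the number of max backfills etc." I mean throttling knobs along these lines; the values are only illustrative, and they slow the migration down rather than avoid it:

# ceph config set osd osd_max_backfills 1          # fewer concurrent backfills per OSD
# ceph config set osd osd_recovery_max_active 1    # fewer concurrent recovery ops per OSD
# ceph config set osd osd_recovery_sleep_hdd 0.1   # small pause between recovery ops on HDD OSDs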

Thanks for your guidance.

-Dave


On 2022-05-05 2:33 a.m., Erdem Agaoglu wrote:


 Hi David,

I think you're right with your option 2: 512 PGs is just too few. You're also
right about the "inflation", but you should include the erasure-code chunks in the
calculation, so 9x512=4608. With 144 OSDs that averages 32 PGs per OSD, and the
old rule of thumb for that number was around 100.

But your current PG count per OSD is around 180-190 according to the df output. That
is probably because of your empty pool 4, fsdata: with 4096 PGs at size 5, it adds
5x4096=20480 PG copies, i.e. 20480/144=142 more PGs per OSD.
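
Putting all three pools together gives roughly the same number (copies x pg_num summed
over the pools, divided by the 144 OSDs; a back-of-the-envelope figure that ignores
CRUSH weight differences):

# echo $(( (3*256 + 5*4096 + 9*512) / 144 ))
179

That lands right in the ballpark of the 180-190 in your df output.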

I'm not really sure how empty/unused PGs affect the OSDs themselves, but I think they
do affect the balancer, which tries to even out the number of PGs per OSD; that might
explain why things got worse. Your df output also shows several changes to
weights/reweights, but I can't tell whether those were manual or made by the balancer.
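
A few places that might help tell those apart (output will obviously depend on your cluster):

# ceph balancer status           # which mode is configured and whether it's active
# ceph osd dump | grep pg_upmap  # upmap entries would suggest the upmap balancer has been at work
# ceph osd crush weight-set ls   # a compat weight-set here points at crush-compat balancing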

I would first delete that empty pool to get a clearer picture of the PGs on the
OSDs. Then I would increase pg_num for pool 6 to 2048. After everything settles,
if the distribution is still too unbalanced, I'd go for the upmap balancer. Needless
to say, all of these steps cause major data migration, so they should be planned
carefully.
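
In rough command form that would be something like the following (a sketch only; double-check
the pool and filesystem names, and run the delete only once you're certain fsdata is detached
and unused; <fsname> is a placeholder):

# ceph fs rm_data_pool <fsname> fsdata     # only if fsdata is still attached; this refuses if it's the fs's default data pool
# ceph config set mon mon_allow_pool_delete true
# ceph osd pool delete fsdata fsdata --yes-i-really-really-mean-it
# ceph osd pool set fsdatak7m2 pg_num 2048
# ceph osd pool set fsdatak7m2 pgp_num 2048
# ceph osd set-require-min-compat-client luminous   # needed before upmap can be used
# ceph balancer mode upmap
# ceph balancer on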

Best,



On Thu, May 5, 2022 at 12:02 AM David Schulz <dschulz@xxxxxxxxxxx> wrote:
Hi Josh,

We do have an old pool that is empty, so there are 4611 empty PGs, but the
rest seem fairly close:

# ceph pg ls|awk '{print $7/1024/1024/10}'|cut -d "." -f 1|sed -e
's/$/0/'|sort -n|uniq -c
    4611 00
       1 1170
       8 1180
      10 1190
      28 1200
      51 1210
      54 1220
      52 1230
      32 1240
      13 1250
       7 1260

Hmm, that's interesting: adding up the first column, excluding the 4611 empty
ones, gives 256, but there are 512 PGs in the main data pool.
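
A per-pool count straight off the pgid column (which avoids guessing at the byte
column positions) would be something like:

# ceph pg ls | awk -F. '/^[0-9]+\./ {count[$1]++} END {for (p in count) print "pool " p ": " count[p] " pgs"}'

That should show whether all 512 PGs of pool 6 actually appear in the listing.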

Here are our pool settings:

pool 3 'fsmeta' replicated size 3 min_size 1 crush_rule 0 object_hash
rjenkins pg_num 256 pgp_num 256 autoscale_mode warn last_change 35490
flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16
recovery_priority 5 application cephfs
pool 4 'fsdata' erasure size 5 min_size 4 crush_rule 1 object_hash
rjenkins pg_num 4096 pgp_num 4096 autoscale_mode warn last_change 35490
lfor 0/0/4742 flags hashpspool,ec_overwrites stripe_width 12288
application cephfs
pool 6 'fsdatak7m2' erasure size 9 min_size 8 crush_rule 3 object_hash
rjenkins pg_num 512 pgp_num 512 autoscale_mode warn last_change 35490
flags hashpspool,ec_overwrites stripe_width 28672 application cephfs

The fsdata pool was originally created with very conservative erasure coding that
wasted too much space, so fsdatak7m2 was created and everything was migrated
to it.  That's why there are at least 4096 PGs with 0 bytes.

-Dave

On 2022-05-04 2:08 p.m., Josh Baergen wrote:
>
> Hi Dave,
>
>> This cluster was upgraded from 13.x to 14.2.9 some time ago.  The entire
>> cluster was installed at the 13.x time and was upgraded together so all
>> OSDs should have the same formatting etc.
> OK, thanks, that should rule out a difference in bluestore
> min_alloc_size, for example.
>
>> Below is pasted the ceph osd df tree output.
> It looks like there is some pretty significant skew in terms of the
> amount of bytes per active PG. If you issue "ceph pg ls", are you able
> to find any PGs with a significantly higher byte count?
>
> Josh


--
erdem agaoglu
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



