Re: How to force PG merging in one step?


 



Hi Frank,

> thanks, that was a great hint! I have a strong déjà vu feeling; we discussed this before when increasing pg_num, didn't we?

Although I don't have a déjà vu feeling myself, I believe it's a recurring issue, so chances are you're right. ;-)

> I just set it to 1 and it did exactly what I wanted. It's the same number of PGs backfilling, but pgp_num=1024, so while the rebalancing load is the same, I got rid of any redundant data movement and I can actually see the progress of the merge just with ceph status.

It's helpful to know that setting target_max_misplaced_ratio to 1 doesn't cause unwanted side effects. I agree with your point of view that unnecessary data movement should be reduced as much as possible, and this seems to do the trick (in this case). I'll keep that in mind for future recovery scenarios, thanks for testing it in the real world. ;-)
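For the archives, roughly the sequence I would use (the pool name and PG count are just placeholders, and the last step restores the default throttling):

  # let the mgr apply the full pgp_num change in one step
  ceph config set mgr target_max_misplaced_ratio 1
  # request the new PG count for the pool
  ceph osd pool set <pool> pg_num 512
  # watch the merge progress
  ceph status
  # restore the default of 5% afterwards
  ceph config set mgr target_max_misplaced_ratio 0.05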

> Related to that, I have set mon_max_pg_per_osd=300 and do have OSDs with more than 400 PGs. Still, I don't see the promised health warning in ceph status. Is this a known issue?

During recovery there's another factor involved (osd_max_pg_per_osd_hard_ratio); the default is 3. I had to deal with that a few months back when I got inactive PGs due to too many chunks per OSD and "only" a factor of 3. In that specific cluster I increased it to 5 and didn't encounter inactive PGs anymore.
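Roughly what I use to check and adjust that (the value of 5 is just what worked for me, not a general recommendation):

  # current limits
  ceph config get mon mon_max_pg_per_osd
  ceph config get osd osd_max_pg_per_osd_hard_ratio
  # PGs per OSD are shown in the PGS column
  ceph osd df
  # raise the hard ratio if PGs go inactive during recovery
  ceph config set osd osd_max_pg_per_osd_hard_ratio 5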

Regards,
Eugen

Quoting Frank Schilder <frans@xxxxxx>:

Hi Eugen,

thanks, that was a great hint! I have a strong déjà vu feeling; we discussed this before when increasing pg_num, didn't we? I just set it to 1 and it did exactly what I wanted. It's the same number of PGs backfilling, but pgp_num=1024, so while the rebalancing load is the same, I got rid of any redundant data movement and I can actually see the progress of the merge just with ceph status.

Related to that, I have set mon_max_pg_per_osd=300 and do have OSDs with more than 400 PGs. Still, I don't see the promised health warning in ceph status. Is this a known issue?

Opinion part.

Returning to the above setting, I have to say that the assignment of which parameter influences what seems a bit unintuitive, if not inconsistent. The parameter target_max_misplaced_ratio belongs to the balancer module, but merging PGs clearly is a task of the pg_autoscaler module. I'm not balancing, I'm scaling PG numbers. Such cross-dependencies make it really hard to find relevant information in the section of the documentation where one would be looking for it. The information ends up distributed all over the place.

If it's not possible to have such things separated and specific tasks consistently explained in a single section, there could at least be a hint covering the case of PG merging/splitting in the description of target_max_misplaced_ratio, so that a search for these terms brings up that page. There should also be a cross-reference from "ceph osd pool set pg[p]_num" to target_max_misplaced_ratio. Well, it's now here in this message for Google to reveal.

I have to add that, while I understand the motivation behind adding these babysitting modules, I would actually appreciate it if one could disable them. I personally find them really annoying, especially in emergency situations, but also in normal operations. I would consider them a nice-to-have and not enforce them on people who want to be in charge.

For example, in my current situation, I'm halving the PG count of a pool. Doing the merge in one go or letting target_max_misplaced_ratio "help" me leads to exactly the same number of PGs backfilling at any time. This means that both cases, target_max_misplaced_ratio=0.05 and =1, lead to exactly the same interference of rebalancing IO with user IO. The difference is that with target_max_misplaced_ratio=0.05 this phase of reduced performance will take longer, because every time the module decides to change pgp_num it will inevitably rebalance objects again that have already been moved before. I find it difficult to consider this an improvement. I prefer to avoid any redundant writes at all cost for the benefit of disk lifetime. If I really need to reduce the impact of recovery IO I can set recovery_sleep.
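Just to spell that out, I mean the osd_recovery_sleep_* options, e.g. (the values here are arbitrary examples, not recommendations):

  # throttle recovery/backfill per device class instead of splitting the pgp_num change
  ceph config set osd osd_recovery_sleep_hdd 0.2
  ceph config set osd osd_recovery_sleep_ssd 0.05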

So much for my personal opinion to the user group.

Thanks for your help and have a nice evening!

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Eugen Block <eblock@xxxxxx>
Sent: 11 October 2022 14:13:45
To: ceph-users@xxxxxxx
Subject:  Re: How to force PG merging in one step?

Hi Frank,

I don't think it's the autoscaler interfering here but the default 5% target_max_misplaced_ratio. I haven't tested the impact of increasing that to a much higher value, so be careful.
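If you want to check the current value first and raise it in smaller steps, something like this should work (the 0.10 is just an example):

  ceph config get mgr target_max_misplaced_ratio
  ceph config set mgr target_max_misplaced_ratio 0.10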

Regards,
Eugen


Quoting Frank Schilder <frans@xxxxxx>:

Hi all,

I need to reduce the number of PGs in a pool from 2048 to 512 and
would really like to do that in a single step. I executed the set
pg_num 512 command, but the PGs are not all merged. Instead I get
this intermediate state:

pool 13 'con-fs2-meta2' replicated size 4 min_size 2 crush_rule 3
object_hash rjenkins pg_num 2048 pgp_num 1946 pg_num_target 512
pgp_num_target 512 autoscale_mode off last_change 916710 lfor
0/0/618995 flags hashpspool,nodelete,selfmanaged_snaps max_bytes
107374182400 stripe_width 0 compression_mode none application cephfs
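(For reference, the above is the output of something like

  ceph osd pool set con-fs2-meta2 pg_num 512
  ceph osd pool ls detail | grep con-fs2-meta2

a while after issuing the pg_num change.)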

This is really annoying, because it will not only lead to repeated redundant data movement, but I also need to rebalance this pool onto fewer OSDs, which cannot hold the 1946 PGs it will intermittently be merged down to. How can I override the autoscaler interfering with admin operations in such tight corners?

As you can see, we disabled the autoscaler on all pools and also globally. Still, it interferes with admin commands in an unsolicited way. I would like the PG merge to happen on the fly as the data moves to the new OSDs.
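For completeness, this is roughly how the autoscaler was disabled here (the pool name is just an example, repeat per pool):

  # per pool
  ceph osd pool set con-fs2-meta2 pg_autoscale_mode off
  # and as the global default for new pools
  ceph config set global osd_pool_default_pg_autoscale_mode off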

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14






_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



