Re: How to force PG merging in one step?


 



Hi Frank,

> thanks, that was a great hint! I have a strong déjà vu feeling; we discussed this before when increasing pg_num, didn't we?

Although I don't have a déjà vu feeling myself, I believe it's a recurring issue, so chances are you're right. ;-)

> I just set it to 1 and it did exactly what I wanted. It's the same number of PGs backfilling, but pgp_num=1024, so while the rebalancing load is the same, I got rid of any redundant data movement and I can actually see the progress of the merge just with ceph status.

It's helpful to know that setting target_max_misplaced_ratio to 1 doesn't cause unwanted side effects. I agree with your point of view that unnecessary data movement should be reduced as much as possible, and this seems to do the trick (in this case). I'll keep that in mind for future recovery scenarios, thanks for testing it in the real world. ;-)
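For the archives, roughly the sequence I would use (the pool name and PG count are just placeholders, and the last step restores the default throttling):

  # let the mgr apply the full pgp_num change in one step
  ceph config set mgr target_max_misplaced_ratio 1
  # request the new PG count for the pool
  ceph osd pool set <pool> pg_num 512
  # watch the merge progress
  ceph status
  # restore the default of 5% afterwards
  ceph config set mgr target_max_misplaced_ratio 0.05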

> Related to that, I have set mon_max_pg_per_osd=300 and do have OSDs with more than 400 PGs. Still, I don't see the promised health warning in ceph status. Is this a known issue?

During recovery there's another factor involved (osd_max_pg_per_osd_hard_ratio); the default is 3. I had to deal with that a few months back when I got inactive PGs due to too many chunks per OSD and "only" a factor of 3. In that specific cluster I increased it to 5 and didn't encounter inactive PGs anymore.
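Roughly what I use to check and adjust that (the value of 5 is just what worked for me, not a general recommendation):

  # current limits
  ceph config get mon mon_max_pg_per_osd
  ceph config get osd osd_max_pg_per_osd_hard_ratio
  # PGs per OSD are shown in the PGS column
  ceph osd df
  # raise the hard ratio if PGs go inactive during recovery
  ceph config set osd osd_max_pg_per_osd_hard_ratio 5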

Regards,
Eugen

Quoting Frank Schilder <frans@xxxxxx>:

Hi Eugen,

thanks, that was a great hint! I have a strong déjà vu feeling; we discussed this before when increasing pg_num, didn't we? I just set it to 1 and it did exactly what I wanted. It's the same number of PGs backfilling, but pgp_num=1024, so while the rebalancing load is the same, I got rid of any redundant data movement and I can actually see the progress of the merge just with ceph status.

Related to that, I have set mon_max_pg_per_osd=300 and do have OSDs with more than 400 PGs. Still, I don't see the promised health warning in ceph status. Is this a known issue?

Opinion part.

Returning to the above setting, I have to say that the assignment of which parameter influences what seems a bit unintuitive, if not inconsistent. The parameter target_max_misplaced_ratio belongs to the balancer module, but merging PGs clearly is a task of the pg_autoscaler module. I'm not balancing, I'm scaling PG numbers. Such cross-dependencies make it really hard to find relevant information in the section of the documentation where one would be looking for it. The information ends up distributed all over the place.

If it's not possible to have such things separated and specific tasks consistently explained in a single section, there could at least be a hint covering the case of PG merging/splitting in the description of target_max_misplaced_ratio, so that a search for these terms brings up that page. There should also be a cross-reference from "ceph osd pool set pg[p]_num" to target_max_misplaced_ratio. Well, it's now here in this message for Google to reveal.

I have to add that, while I understand the motivation behind adding these babysitting modules, I would actually appreciate it if one could disable them. I personally find them really annoying, especially in emergency situations, but also in normal operations. I would consider them a nice-to-have and not enforce them on people who want to be in charge.

For example, in my current situation, I'm halving the PG count of a pool. Doing the merge in one go or letting target_max_misplaced_ratio "help" me leads to exactly the same number of PGs backfilling at any time. This means that both cases, target_max_misplaced_ratio=0.05 and =1, lead to exactly the same interference of rebalancing IO with user IO. The difference is that with target_max_misplaced_ratio=0.05 this phase of reduced performance will take longer, because every time the module decides to change pgp_num it will inevitably rebalance objects again that have already been moved before. I find it difficult to consider this an improvement. I prefer to avoid any redundant writes at all cost for the benefit of disk lifetime. If I really need to reduce the impact of recovery IO I can set recovery_sleep.
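Just to spell that out, I mean the osd_recovery_sleep_* options, e.g. (the values here are arbitrary examples, not recommendations):

  # throttle recovery/backfill per device class instead of splitting the pgp_num change
  ceph config set osd osd_recovery_sleep_hdd 0.2
  ceph config set osd osd_recovery_sleep_ssd 0.05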

So much for my personal opinion to the user group.

Thanks for your help and have a nice evening!

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Eugen Block <eblock@xxxxxx>
Sent: 11 October 2022 14:13:45
To: ceph-users@xxxxxxx
Subject:  Re: How to force PG merging in one step?

Hi Frank,

I don't think it's the autoscaler interfering here but the default 5% target_max_misplaced_ratio. I haven't tested the impact of increasing that to a much higher value, so be careful.
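If you want to check the current value first and raise it in smaller steps, something like this should work (the 0.10 is just an example):

  ceph config get mgr target_max_misplaced_ratio
  ceph config set mgr target_max_misplaced_ratio 0.10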

Regards,
Eugen


Quoting Frank Schilder <frans@xxxxxx>:

Hi all,

I need to reduce the number of PGs in a pool from 2048 to 512 and
would really like to do that in a single step. I executed the set
pg_num 512 command, but the PGs are not all merged. Instead I get
this intermediate state:

pool 13 'con-fs2-meta2' replicated size 4 min_size 2 crush_rule 3
object_hash rjenkins pg_num 2048 pgp_num 1946 pg_num_target 512
pgp_num_target 512 autoscale_mode off last_change 916710 lfor
0/0/618995 flags hashpspool,nodelete,selfmanaged_snaps max_bytes
107374182400 stripe_width 0 compression_mode none application cephfs
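(For reference, the above is the output of something like

  ceph osd pool set con-fs2-meta2 pg_num 512
  ceph osd pool ls detail | grep con-fs2-meta2

a while after issuing the pg_num change.)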

This is really annoying, because it will not only lead to repeated redundant data movement, but I also need to rebalance this pool onto fewer OSDs, which cannot hold the 1946 PGs it will intermittently be merged down to. How can I override the autoscaler interfering with admin operations in such tight corners?

As you can see, we disabled the autoscaler on all pools and also globally. Still, it interferes with admin commands in an unsolicited way. I would like the PG merge to happen on the fly as the data moves to the new OSDs.
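For completeness, this is roughly how the autoscaler was disabled here (the pool name is just an example, repeat per pool):

  # per pool
  ceph osd pool set con-fs2-meta2 pg_autoscale_mode off
  # and as the global default for new pools
  ceph config set global osd_pool_default_pg_autoscale_mode off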

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14






_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



