Hi Frank,
thanks, that was a great hint! I have a strong déjà vu feeling, we
discussed this before with increasing pg_num, didn't we?
Although I don't have a feeling of déjà vu, I believe it's a recurring
issue, so chances are you're right. ;-)
I just set it to 1 and it did exactly what I wanted. It's the same
number of PGs backfilling, but pgp_num=1024, so while the
rebalancing load is the same, I got rid of any redundant data
movements and I can actually see the progress of the merge just with
ceph status.
It's helpful to know that setting the target_max_misplaced_ratio to 1
doesn't cause unwanted side effects. I agree with your point of view
to reduce unnecessary data movement as much as possible and this seems
to do the trick (in this case). I'll keep that in mind for future
recovery scenarios, thanks for testing it in the real world. ;-)
Related to that, I have set mon_max_pg_per_osd=300 and do have OSDs
with more than 400 PGs. Still, I don't see the promised health
warning in ceph status. Is this a known issue?
During recovery there's another factor involved
(osd_max_pg_per_osd_hard_ratio), the default is 3. I had to deal with
that a few months back when I got inactive PGs due to many chunks and
"only" a factor of 3. In that specific cluster I increased it to 5 and
didn't encounter inactive PGs anymore.
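As a rough illustration of why the factor matters: as far as I understand the documentation, an OSD refuses to create new PGs once it serves more than mon_max_pg_per_osd * osd_max_pg_per_osd_hard_ratio of them. The sketch below only does that multiplication. The value 300 is taken from Frank's setting, not from the cluster I mentioned, and the function name is made up for this example.

```python
def pg_hard_limit(mon_max_pg_per_osd: int, hard_ratio: float) -> int:
    """Per-OSD PG count above which new PGs are refused (assumed formula)."""
    return int(mon_max_pg_per_osd * hard_ratio)


# Illustrative values only: 300 is Frank's mon_max_pg_per_osd,
# 3 is the default hard ratio and 5 the value I raised it to.
for ratio in (3, 5):
    print(f"hard_ratio={ratio}: limit {pg_hard_limit(300, ratio)} PGs per OSD")
```

With these numbers an OSD holding 400 PGs is still far below the hard limit, but during recovery the temporary PG count per OSD can grow well beyond the steady-state value.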
Regards,
Eugen
Quoting Frank Schilder <frans@xxxxxx>:
Hi Eugen,
thanks, that was a great hint! I have a strong déjà vu feeling, we
discussed this before with increasing pg_num, didn't we? I just set
it to 1 and it did exactly what I wanted. It's the same number of PGs
backfilling, but pgp_num=1024, so while the rebalancing load is the
same, I got rid of any redundant data movements and I can actually
see the progress of the merge just with ceph status.
Related to that, I have set mon_max_pg_per_osd=300 and do have OSDs
with more than 400 PGs. Still, I don't see the promised health
warning in ceph status. Is this a known issue?
Opinion part.
Returning to the above setting, I have to say that the assignment of
which parameter influences what seems a bit unintuitive if not
inconsistent. The parameter target_max_misplaced_ratio belongs to
the balancer module, but merging PGs clearly is a task of the
pg_autoscaler module. I'm not balancing, I'm scaling PG numbers.
Such cross dependencies make it really hard to find relevant
information in the section of the documentation where one would be
looking for it. It starts being distributed all over the place.
If it's not possible to have such things separated and specific tasks
consistently explained in a single section, there could at least be
a hint including also the case of PG merging/splitting in the
description of target_max_misplaced_ratio so that a search for these
terms brings up this page. There should also be a cross reference
from "ceph osd pool set pg[p]_num" to target_max_misplaced_ratio.
Well, it's now here in this message for Google to reveal.
I have to add that, while I understand the motivation behind adding
these babysitting modules, I would actually appreciate it if one could
disable them. I personally find them really annoying, especially
in emergency situations, but also in normal operations. I would
consider them a nice-to-have and would not force them on people who
want to be in charge.
For example, in my current situation, I'm halving the PG count of a
pool. Doing the merge in one go or letting the
target_max_misplaced_ratio "help" me leads to exactly the same
number of PGs backfilling at any time. This means that both cases,
target_max_misplaced_ratio=0.05 and 1, lead to exactly the same
interference of rebalancing IO with user IO. The difference is that
with target_max_misplaced_ratio=0.05 this phase of reduced
performance will take longer, because every time the module decides
to change pgp_num it will inevitably also rebalance objects again
that have been moved before. I find it difficult to consider this an
improvement. I prefer to avoid any redundant writes at all costs for
the benefit of disk lifetime. If I really need to reduce the impact
of recovery IO I can set recovery_sleep.
My personal opinion to the user group.
Thanks for your help and have a nice evening!
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Eugen Block <eblock@xxxxxx>
Sent: 11 October 2022 14:13:45
To: ceph-users@xxxxxxx
Subject: Re: How to force PG merging in one step?
Hi Frank,
I don't think it's the autoscaler interfering here but the default 5%
target_max_misplaced_ratio. I haven't tested the impacts of increasing
that to a much higher value, so be careful.
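A simplified model of what I suspect is happening, assuming the manager moves pgp_num toward its target in steps of at most target_max_misplaced_ratio * pg_num placement groups. The real logic also takes the currently misplaced objects into account, so treat this as a sketch and not as the actual implementation. The function name is hypothetical. It does reproduce the pgp_num 1946 from the pool line in your question.

```python
def next_pgp_num(pg_num: int, pgp_num: int, pgp_num_target: int,
                 max_misplaced_ratio: float) -> int:
    """One assumed adjustment step of pgp_num toward pgp_num_target."""
    # Assumption: a single step may touch at most this many PGs.
    step = max(int(pg_num * max_misplaced_ratio), 1)
    if pgp_num_target < pgp_num:
        return max(pgp_num_target, pgp_num - step)
    return min(pgp_num_target, pgp_num + step)


# Default ratio of 5%: 2048 * 0.05 gives a step of 102 PGs.
print(next_pgp_num(2048, 2048, 512, 0.05))  # 1946
# Ratio of 1: the step covers the whole range, so the target is reached at once.
print(next_pgp_num(2048, 2048, 512, 1.0))   # 512
```

Under this model, a ratio of 0.05 needs many consecutive steps to get from 2048 to 512, while a ratio of 1 allows a single step.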
Regards,
Eugen
Quoting Frank Schilder <frans@xxxxxx>:
Hi all,
I need to reduce the number of PGs in a pool from 2048 to 512 and
would really like to do that in a single step. I executed the set
pg_num 512 command, but the PGs are not all merged. Instead I get
this intermediate state:
pool 13 'con-fs2-meta2' replicated size 4 min_size 2 crush_rule 3
object_hash rjenkins pg_num 2048 pgp_num 1946 pg_num_target 512
pgp_num_target 512 autoscale_mode off last_change 916710 lfor
0/0/618995 flags hashpspool,nodelete,selfmanaged_snaps max_bytes
107374182400 stripe_width 0 compression_mode none application cephfs
This is really annoying, because it will not only lead to repeated
redundant data movements, but I also need to rebalance this pool
onto fewer OSDs, which cannot hold the 1946 PGs it will be merged to
in the interim. How can I override the autoscaler interfering with
admin operations in such tight corners?
As you can see, we disabled autoscaler on all pools and also
globally. Still, it interferes with admin commands in an unsolicited
way. I would like the PG merge to happen on the fly as the data moves
to the new OSDs.
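For anyone who wants to watch such an intermediate state from a script, here is a minimal sketch that pulls the four PG counters out of a pool line like the one quoted above and checks whether the pool has reached its targets. The line is pasted in as a string (shortened after last_change); running the ceph command and the helper name pg_fields are left out or made up for this example.

```python
import re

# Pool line copied from the output above, shortened after last_change.
POOL_LINE = (
    "pool 13 'con-fs2-meta2' replicated size 4 min_size 2 crush_rule 3 "
    "object_hash rjenkins pg_num 2048 pgp_num 1946 pg_num_target 512 "
    "pgp_num_target 512 autoscale_mode off last_change 916710"
)


def pg_fields(line: str) -> dict:
    """Extract pg_num, pgp_num and their targets as integers."""
    pattern = r"\b(pg_num|pgp_num|pg_num_target|pgp_num_target) (\d+)"
    return {name: int(value) for name, value in re.findall(pattern, line)}


fields = pg_fields(POOL_LINE)
# The merge is still running while either counter differs from its target.
in_progress = (fields["pg_num"] != fields["pg_num_target"]
               or fields["pgp_num"] != fields["pgp_num_target"])
print(fields)
print("merge in progress:", in_progress)
```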
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx