Re: Question about expanding an existing Ceph cluster - adding OSDs

Hi Ansgar, Frank, all,

Thanks for the feedback in the first place.

In the meantime, I've added all the disks and the cluster is rebalancing
itself... which will take ages, as you mentioned. Last week, after this
conversation, it was around 50% (a little bit more); today it's around 44.5%.
Every day I have to take the cluster down to run a manual compaction on some
disks :-(, but that's a known bug that Igor is working on. (Kudos to him once
I get my sleep back at night after this one...)
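
For anyone hitting the same problem: as far as I know, the offline compaction
is done with ceph-kvstore-tool against the stopped OSD. A rough sketch of how
it can be scripted, assuming the OSD daemon has been stopped first (e.g. with
systemctl stop ceph-osd@<id>) and the default data path; the OSD ids to
compact are passed on the command line:

#!/usr/bin/env python3
"""Offline compaction of a BlueStore OSD's RocksDB.

Sketch only: assumes the OSD daemon is already stopped, ceph-kvstore-tool is
installed, and the default /var/lib/ceph/osd/ceph-<id> data path is used.
"""
import subprocess
import sys

def compact_osd(osd_id):
    path = f"/var/lib/ceph/osd/ceph-{osd_id}"
    # "bluestore-kv" opens the OSD's embedded RocksDB; "compact" triggers a
    # full manual compaction, which can take quite a while on large OSDs.
    subprocess.check_call(["ceph-kvstore-tool", "bluestore-kv", path, "compact"])

if __name__ == "__main__":
    for osd in sys.argv[1:]:
        compact_osd(osd)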

Though, I'm still having an issue which I don't completely understand.
When I look at the Ceph dashboard under OSDs, I can see the number of PGs for
a specific OSD. Does anyone know how this is calculated? Because it seems
incorrect...
E.g. a specific disk shows 189 PGs in the dashboard...? However, examining
the pg dump output I can see that for that particular disk there are 145
PGs where the disk is in the "up" list, and 168 PGs where that particular
disk is in the "acting" list... Of those two lists, 135 are in common,
meaning 10 PGs will need to be moved to that disk, while 33 PGs will need
to be moved away...
I can't figure out how the dashboard arrives at the figure of 189...
The same delta between the pg dump output and the dashboard figure shows up
on other disks as well.
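
To make the comparison reproducible, this is roughly how I counted the
up/acting PGs per OSD from the pg dump output. It's only a sketch: it assumes
`ceph pg dump --format json` and tries both the Nautilus-style "pg_map"
wrapper and the older flat JSON layout. One guess I haven't verified is that
the dashboard shows the OSD's own num_pgs counter, i.e. PGs still physically
present on the disk (including ones not yet deleted after being backfilled
away), which would be larger than both the up and acting counts.

#!/usr/bin/env python3
"""Count the PGs per OSD from `ceph pg dump` and compare up vs. acting."""
import json
import subprocess
import sys

def pg_stats():
    raw = subprocess.check_output(["ceph", "pg", "dump", "--format", "json"])
    data = json.loads(raw)
    # Nautilus wraps the stats in "pg_map"; older releases keep them top-level.
    return data.get("pg_map", data)["pg_stats"]

def report(osd_id):
    up, acting = set(), set()
    for pg in pg_stats():
        if osd_id in pg["up"]:
            up.add(pg["pgid"])
        if osd_id in pg["acting"]:
            acting.add(pg["pgid"])
    print(f"osd.{osd_id}: up={len(up)} acting={len(acting)} common={len(up & acting)}")
    print(f"  still to move in : {len(up - acting)}")   # in up but not yet acting
    print(f"  still to move out: {len(acting - up)}")   # acting but no longer up
    print(f"  union (up or acting): {len(up | acting)}")

if __name__ == "__main__":
    report(int(sys.argv[1]))

Running it with the OSD id as the only argument prints the counts, so the same
check also works for the weight-0 disk mentioned below.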

Another example is a disk which I've set to weight 0, as it's predicted to
fail in the near future... So its "up" count is 0 (which is correct), and the
number of PGs where this disk is in the acting list is 49. This seems correct,
as those 49 PGs need to be moved away. However... looking at the Ceph
dashboard, the UI says there are 71 PGs on that disk...

So:
- How does the Ceph dashboard arrive at that number in the first place?
- Is it possible that "orphaned" PG parts are left behind on a particular OSD?
- If orphaned parts of a PG can indeed be left behind on a disk, how do I
clean them up?
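
To at least check the orphaned-parts suspicion, my understanding is that the
PGs (or EC shards) physically present on an OSD can be listed with
ceph-objectstore-tool while the OSD is stopped, and compared against what
pg dump says should be there. A rough sketch, assuming a stopped OSD daemon
and the default /var/lib/ceph/osd/ceph-<id> data path:

#!/usr/bin/env python3
"""Compare PG shards physically present on an OSD with what the cluster maps to it."""
import json
import subprocess
import sys

def physical_pgs(osd_id):
    # Lists every PG the OSD actually stores on disk; EC shards show up with a
    # suffix such as "1.7fs0", which is stripped so they compare against pg dump.
    out = subprocess.check_output([
        "ceph-objectstore-tool",
        "--data-path", f"/var/lib/ceph/osd/ceph-{osd_id}",
        "--op", "list-pgs",
    ])
    return {line.strip().split("s", 1)[0]
            for line in out.decode().splitlines() if line.strip()}

def mapped_pgs(osd_id):
    # PGs the cluster currently maps to this OSD (up or acting).
    raw = subprocess.check_output(["ceph", "pg", "dump", "--format", "json"])
    data = json.loads(raw)
    stats = data.get("pg_map", data)["pg_stats"]
    return {pg["pgid"] for pg in stats
            if osd_id in pg["up"] or osd_id in pg["acting"]}

if __name__ == "__main__":
    osd = int(sys.argv[1])
    stray = physical_pgs(osd) - mapped_pgs(osd)
    print(f"shards on osd.{osd} that are neither up nor acting: {sorted(stray)}")

Anything this reports should normally be cleaned up by the OSD itself over
time; ceph-objectstore-tool does have an --op remove for individual PGs, but
I'd treat that strictly as a last resort.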

I've also tried examining the osdmap, however the output seems to be
limited(??). I only see the PGs for pools 1 and 2. (I don't know whether the
file gets truncated when exporting the osd map, or by osdmaptool --print.)
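
If I understand the osdmap correctly, the map itself only stores the exception
tables (pg_temp and pg_upmap entries), not the full PG-to-OSD mapping, so
osdmaptool --print would only show PG lines for pools that currently have such
entries; that might explain why only pools 1 and 2 show up. The full mapping
has to be computed from the map, for example like this (a sketch, assuming
ceph and osdmaptool are in PATH and /tmp/osdmap.bin is an acceptable scratch
file):

#!/usr/bin/env python3
"""Export the current osdmap and dump the computed PG-to-OSD mappings."""
import subprocess

OSDMAP = "/tmp/osdmap.bin"

# Grab the current osdmap in binary form.
subprocess.check_call(["ceph", "osd", "getmap", "-o", OSDMAP])

# --print only shows what is stored in the map itself (pools, osds, pg_temp /
# pg_upmap entries); --test-map-pgs-dump computes and prints the mapping for
# every PG in every pool.
print(subprocess.check_output(["osdmaptool", OSDMAP, "--test-map-pgs-dump"]).decode())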

The cluster is running Nautilus v14.2.11, all on the same version.

I'll make some time to write up documentation and the findings from everything
I've run into on this journey over the last two weeks... Kristof in Ceph's
wonderland...

Thanks for all your input so far!

Regards,

Kristof



On Wed 21 Oct 2020 at 14:01, Frank Schilder <frans@xxxxxx> wrote:

> There have been threads on exactly this. Might depend a bit on your ceph
> version. We are running mimic and have no issues doing:
>
> - set noout, norebalance, nobackfill
> - add all OSDs (with weight 1)
> - wait for peering to complete
> - unset all flags and let the rebalance loose
>
> Starting with nautilus there seem to be issues with this procedure; mainly,
> the peering phase can cause the cluster to collapse. In your case, it
> sounds like you added the OSDs already. You should be able to do the
> following relatively safely:
>
> - set noout, norebalance, nobackfill
> - set weight of OSDs to 1 one by one and wait for peering to complete
> every time
> - unset all flags and let the rebalance loose
>
> I believe that once peering has succeeded without crashes, the rebalancing
> will just work fine. You can easily control how much rebalancing is going on.
>
> I noted that ceph seems to have a strange concept of priority, though. I
> needed to gain capacity by adding OSDs, and ceph was very consistent about
> moving PGs off the fullest OSDs last, which is the opposite of what should
> happen. Thus, it took ages for the additional capacity to become available,
> and the backfill_toofull warnings persisted the whole time. You can influence
> this to some degree by using force-recovery commands on PGs on the fullest
> OSDs.
>
> Best regards and good luck,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Kristof Coucke <kristof.coucke@xxxxxxxxx>
> Sent: 21 October 2020 13:29:00
> To: ceph-users@xxxxxxx
> Subject:  Question about expanding an existing Ceph cluster - adding OSDs
>
> Hi,
>
> I have a cluster with 182 OSDs; this has been expanded to 282 OSDs.
> Some disks were near full.
> The new disks have been added with an initial weight of 0.
> The original plan was to increase this slowly towards their full weight
> using the gentle reweight script. However, this is going way too slowly, and
> I'm now also having issues with "backfill_toofull".
> Can I just add all the OSDs with their full weight, or will I run into a lot
> of issues if I do that?
> I know that a lot of PGs will have to be moved, but increasing the
> weight slowly will take a year at the current speed. I'm already playing
> with the max backfill to increase the speed, but every time I increase the
> weight it takes a lot of time again...
> I can live with the fact that there will be a performance decrease.
>
> Looking forward to your comments!
>
> Regards,
>
> Kristof
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



