Re: Question about expanding an existing Ceph cluster - adding OSDs

Okay, so far I've figured out that the value in the Ceph dashboard comes from
a Prometheus metric (*ceph_osd_numpg*). Does anyone here know how this metric
is populated?
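
For reference, this is how I'm pulling the metric out of Prometheus to compare
it against other sources (a rough sketch; the Prometheus URL is a placeholder
for your own setup):

    # Sketch: query ceph_osd_numpg via the Prometheus HTTP API, per OSD.
    import json
    import urllib.parse
    import urllib.request

    PROM_URL = "http://prometheus.example.com:9090"  # placeholder

    query = urllib.parse.urlencode({"query": "ceph_osd_numpg"})
    with urllib.request.urlopen(f"{PROM_URL}/api/v1/query?{query}") as resp:
        result = json.load(resp)["data"]["result"]

    for sample in result:
        # the mgr prometheus module labels each series with ceph_daemon="osd.N"
        print(sample["metric"].get("ceph_daemon"), sample["value"][1])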


On Mon, 26 Oct 2020 at 12:52, Kristof Coucke <kristof.coucke@xxxxxxxxx>
wrote:

> Hi Frank,
>
> We have a lot of small objects in the cluster... RocksDB has issues with
> compaction, causing high disk load... That's why we are performing manual
> compaction...
> See https://github.com/ceph/ceph/pull/37496
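>
> (For reference, the offline compaction itself boils down to running
> ceph-kvstore-tool against the stopped OSD - a sketch, assuming the default
> data path:)
>
>     # Sketch: offline RocksDB compaction of a BlueStore OSD.
>     # The OSD daemon must be stopped first.
>     import subprocess
>
>     osd_id = 12  # hypothetical OSD id
>     subprocess.run(["ceph-kvstore-tool", "bluestore-kv",
>                     f"/var/lib/ceph/osd/ceph-{osd_id}", "compact"],
>                    check=True)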
>
> Br,
>
> Kristof
>
>
> On Mon, 26 Oct 2020 at 12:14, Frank Schilder <frans@xxxxxx> wrote:
>
>> Hi Kristof,
>>
>> I missed that: why do you need to do manual compaction?
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Kristof Coucke <kristof.coucke@xxxxxxxxx>
>> Sent: 26 October 2020 11:33:52
>> To: Frank Schilder; a.jazdzewski@xxxxxxxxxxxxxx
>> Cc: ceph-users@xxxxxxx
>> Subject: Re: Question about expanding an existing Ceph cluster
>> - adding OSDs
>>
>> Hi Ansgar, Frank, all,
>>
>> Thanks for the feedback in the first place.
>>
>> In the meantime, I've added all the disks and the cluster is rebalancing
>> itself... which will take ages, as you mentioned. Last week after this
>> conversation it was at a little over 50%; today it's around 44.5%.
>> Every day I have to take the cluster down to run manual compaction on
>> some disks :-(, but that's a known bug that Igor is working on. (Kudos to
>> him once I get my sleep back at night over this one...)
>>
>> Though, I'm still having an issue which I don't completely understand.
>> When I look into the Ceph dashboard - OSDs, I can see the #PGs for a
>> specific OSD. Does anyone know how this is calculated? It seems to be
>> incorrect...
>> E.g. a specific disk shows 189 PGs in the dashboard...? However,
>> examining the pg dump output, I count 145 PGs where that disk is in the
>> "up" list and 168 PGs where it is in the "acting" list... Of those two
>> lists, 135 are in common, meaning 10 PGs still need to be moved onto the
>> disk, while 33 PGs need to be moved away...
>> I can't figure out how the dashboard arrives at 189... The same delta
>> between the pg dump output and the Ceph dashboard shows up on other disks
>> as well.
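>>
>> For what it's worth, this is roughly how I'm counting (a sketch; I'm
>> assuming the Nautilus JSON layout, where the PG stats live under
>> "pg_map" -> "pg_stats"; older releases have "pg_stats" at the top level):
>>
>>     # Sketch: count per-OSD "up" and "acting" PG membership from pg dump.
>>     import json
>>     import subprocess
>>
>>     raw = subprocess.run(["ceph", "pg", "dump", "--format", "json"],
>>                          capture_output=True, check=True).stdout
>>     dump = json.loads(raw)
>>     pg_stats = dump.get("pg_map", dump).get("pg_stats", [])
>>
>>     osd = 12  # hypothetical OSD id to inspect
>>     up = {p["pgid"] for p in pg_stats if osd in p["up"]}
>>     acting = {p["pgid"] for p in pg_stats if osd in p["acting"]}
>>     print(len(up), "up,", len(acting), "acting,",
>>           len(up & acting), "in common")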
>>
>> Another example is a disk which I've set to weight 0, as it's predicted
>> to fail in the near future... Its "up" count is 0 (which is correct), and
>> it is in the "acting" list of 49 PGs. That also seems correct, as those
>> 49 PGs need to be moved away. However, the Ceph dashboard says there are
>> 71 PGs on that disk...
>>
>> So:
>> - How does the Ceph dashboard get that number in the first place?
>> - Is it possible that "orphaned" parts of PGs are left behind on a
>> particular OSD?
>> - If so, how do I clean them up?
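>>
>> (Once I can briefly stop an OSD, I was thinking of checking for leftovers
>> along these lines - a sketch; ceph-objectstore-tool can list the PGs
>> physically present on a stopped OSD, which can then be compared against
>> its up/acting sets:)
>>
>>     # Sketch: list PGs physically present on a stopped OSD.
>>     # Assumes the default data path; the OSD daemon must be stopped first.
>>     import subprocess
>>
>>     osd_id = 12  # hypothetical OSD id
>>     out = subprocess.run(
>>         ["ceph-objectstore-tool",
>>          "--data-path", f"/var/lib/ceph/osd/ceph-{osd_id}",
>>          "--op", "list-pgs"],
>>         capture_output=True, check=True, text=True).stdout
>>     on_disk = set(out.split())
>>     print(len(on_disk), "PGs physically on the OSD")
>>     # anything on disk that is in neither the up nor the acting set of
>>     # this OSD would be leftover data that has not been cleaned up yet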
>>
>> I've also tried examining the osdmap; however, the output seems to be
>> limited(??). I only see the PGs for pools 1 and 2. (I don't know whether
>> the file gets cut off when exporting the osdmap or by the osdmaptool
>> --print step.)
>>
>> The cluster is running Nautilus v14.2.11, with all daemons on the same
>> version.
>>
>> I'll make some time to write documentation and record the findings from
>> my journey of the last two weeks... Kristof in Ceph's wonderland...
>>
>> Thanks for all your input so far!
>>
>> Regards,
>>
>> Kristof
>>
>>
>>
>> On Wed, 21 Oct 2020 at 14:01, Frank Schilder <frans@xxxxxx> wrote:
>> There have been threads on exactly this. It might depend a bit on your
>> Ceph version. We are running mimic and have no issues doing:
>>
>> - set noout, norebalance, nobackfill
>> - add all OSDs (with weight 1)
>> - wait for peering to complete
>> - unset all flags and let the rebalance loose (roughly as sketched below)
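>>
>> In command form that's roughly the following (a sketch, not a tested
>> script; the peering check just polls `ceph status`):
>>
>>     # Sketch: set the flags, add the OSDs out-of-band, wait until no PG
>>     # is peering any more, then unset the flags again.
>>     import json
>>     import subprocess
>>     import time
>>
>>     FLAGS = ("noout", "norebalance", "nobackfill")
>>
>>     def peering_pgs():
>>         raw = subprocess.run(["ceph", "status", "--format", "json"],
>>                              capture_output=True, check=True).stdout
>>         states = json.loads(raw)["pgmap"].get("pgs_by_state", [])
>>         return sum(s["count"] for s in states
>>                    if "peering" in s["state_name"])
>>
>>     for flag in FLAGS:
>>         subprocess.run(["ceph", "osd", "set", flag], check=True)
>>
>>     # ... add the OSDs here (ceph-volume / your deployment tool) ...
>>
>>     while peering_pgs():
>>         time.sleep(10)
>>
>>     for flag in FLAGS:
>>         subprocess.run(["ceph", "osd", "unset", flag], check=True)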
>>
>> Starting with nautilus there seem to be issues with this procedure;
>> mainly, the peering phase can cause a collapse of the cluster. In your
>> case, it sounds like you have added the OSDs already. You should be able
>> to do the following relatively safely:
>>
>> - set noout, norebalance, nobackfill
>> - set weight of OSDs to 1 one by one and wait for peering to complete
>> every time
>> - unset all flags and let the rebalance loose (see the sketch below)
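>>
>> The one-by-one variant would look something like this (again a sketch;
>> the OSD ids and the target weight are placeholders for your values):
>>
>>     # Sketch: bring each new OSD up to full crush weight one at a time,
>>     # waiting for peering to settle in between.
>>     import json
>>     import subprocess
>>     import time
>>
>>     def peering_pgs():
>>         raw = subprocess.run(["ceph", "status", "--format", "json"],
>>                              capture_output=True, check=True).stdout
>>         states = json.loads(raw)["pgmap"].get("pgs_by_state", [])
>>         return sum(s["count"] for s in states
>>                    if "peering" in s["state_name"])
>>
>>     new_osds = [182, 183, 184]  # hypothetical ids of the new OSDs
>>     for osd in new_osds:
>>         subprocess.run(["ceph", "osd", "crush", "reweight",
>>                         f"osd.{osd}", "1.0"], check=True)
>>         time.sleep(5)  # give peering a moment to start
>>         while peering_pgs():
>>             time.sleep(10)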
>>
>> I believe that once peering has succeeded without crashes, the
>> rebalancing will work just fine. You can easily control how much
>> rebalancing is going on.
>>
>> I noted that ceph seems to have a strange concept of priority, though. I
>> needed to gain capacity by adding OSDs, and ceph was very consistent about
>> moving PGs off the fullest OSDs last - the opposite of what should happen.
>> Thus, it took ages for additional capacity to become available, and the
>> backfill_toofull warnings persisted the whole time. You can influence this
>> to some degree by using force_recovery commands on the PGs of the fullest
>> OSDs (something like the sketch below).
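>>
>> (A sketch, assuming the Nautilus JSON field names; force-backfill rather
>> than force-recovery is what applies to PGs that are waiting for backfill:)
>>
>>     # Sketch: find the fullest OSD and push its PGs to the front of the
>>     # backfill queue with `ceph pg force-backfill`.
>>     import json
>>     import subprocess
>>
>>     def cjson(*args):
>>         raw = subprocess.run(["ceph", *args, "--format", "json"],
>>                              capture_output=True, check=True).stdout
>>         return json.loads(raw)
>>
>>     nodes = cjson("osd", "df")["nodes"]
>>     fullest = max(nodes, key=lambda n: n["utilization"])["id"]
>>
>>     dump = cjson("pg", "dump")
>>     pg_stats = dump.get("pg_map", dump).get("pg_stats", [])
>>     pgids = [p["pgid"] for p in pg_stats if fullest in p["acting"]]
>>
>>     if pgids:  # force-backfill takes one or more pg ids
>>         subprocess.run(["ceph", "pg", "force-backfill", *pgids],
>>                        check=True)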
>>
>> Best regards and good luck,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Kristof Coucke <kristof.coucke@xxxxxxxxx>
>> Sent: 21 October 2020 13:29:00
>> To: ceph-users@xxxxxxx
>> Subject: Question about expanding an existing Ceph cluster -
>> adding OSDs
>>
>> Hi,
>>
>> I have a cluster with 182 OSDs; it has been expanded to 282 OSDs.
>> Some disks were near full.
>> The new disks were added with an initial weight of 0.
>> The original plan was to increase their weight slowly towards full weight
>> using the gentle reweight script. However, this is going way too slowly,
>> and I'm now also running into "backfill_toofull" issues.
>> Can I just bring all the OSDs up to their full weight, or will I get a
>> lot of issues if I do that?
>> I know that a lot of PGs will have to be moved, but increasing the weight
>> slowly will take a year at the current speed. I'm already playing with
>> the max backfill setting to increase the speed, but every time I increase
>> the weight it takes a lot of time again...
>> I can accept a performance decrease.
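>>
>> (For reference, this is how I'm raising the backfill throttle at runtime
>> - a sketch; the value 4 is just what I'm experimenting with, the default
>> being 1:)
>>
>>     # Sketch: raise osd_max_backfills on all OSDs via injectargs.
>>     import subprocess
>>
>>     subprocess.run(["ceph", "tell", "osd.*", "injectargs",
>>                     "--osd_max_backfills 4"], check=True)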
>>
>> Looking forward to your comments!
>>
>> Regards,
>>
>> Kristof
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



