Re: Question about expansion existing Ceph cluster - adding OSDs

Hi Frank,

We have a lot of small objects in the cluster, and RocksDB has issues with
compaction that cause high disk load... That's why we are performing
manual compaction.
See https://github.com/ceph/ceph/pull/37496
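
In case it helps others: this is roughly how the compaction can be run
(a sketch; osd.12 and the data path are only examples):

  # online, on the host running the OSD, via its admin socket:
  ceph daemon osd.12 compact

  # or offline, with the OSD stopped:
  systemctl stop ceph-osd@12
  ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-12 compact
  systemctl start ceph-osd@12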

Br,

Kristof


On Mon 26 Oct 2020 at 12:14, Frank Schilder <frans@xxxxxx> wrote:

> Hi Kristof,
>
> I missed that: why do you need to do manual compaction?
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Kristof Coucke <kristof.coucke@xxxxxxxxx>
> Sent: 26 October 2020 11:33:52
> To: Frank Schilder; a.jazdzewski@xxxxxxxxxxxxxx
> Cc: ceph-users@xxxxxxx
> Subject: Re:  Question about expansion existing Ceph cluster -
> adding OSDs
>
> Hi Ansgar, Frank, all,
>
> Thanks for the feedback in the first place.
>
> In the meantime, I've added all the disks and the cluster is rebalancing
> itself... which will take ages, as you mentioned. Last week, after this
> conversation, it was at a little over 50%; today it's around 44.5%.
> Every day I have to take the cluster down to run manual compaction on
> some disks :-(, but that's a known bug that Igor is working on. (Kudos to
> him once I get my sleep back at night thanks to this one...)
>
> Still, I'm having an issue which I don't completely understand.
> When I look at the Ceph dashboard under OSDs, I can see the number of PGs
> for a specific OSD. Does anyone know how this number is calculated? It
> seems incorrect...
> E.g. a specific disk shows 189 PGs in the dashboard...? However, examining
> the pg dump output, I can see that for that particular disk there are 145
> PGs where the disk is in the "up" set, and 168 PGs where that particular
> disk is in the "acting" set... Of those two lists, 135 are in common,
> meaning 10 PGs still need to be moved to that disk, while 33 PGs need
> to be moved away...
> I can't figure out how the dashboard arrives at the figure of 189...
> The same happens on other disks (a delta between the pg dump output and
> the info in the Ceph dashboard).
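>
> (For anyone who wants to reproduce the numbers: the per-OSD up/acting
> counts can be pulled from the pg dump roughly like this; a sketch, osd id
> 123 is only an example and column positions may differ per version:
>
>   ceph pg dump pgs_brief 2>/dev/null | awk \
>     '$3 ~ /[[,]123[],]/ {up++} $5 ~ /[[,]123[],]/ {act++} END {print "up:", up, "acting:", act}'
>
> "ceph osd df tree" also shows a PGS column per OSD, which may or may not
> match what the dashboard reports.)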
>
> Another example is a disk which I've set to weight 0 because it is
> predicted to fail in the future... The number of PGs where it is "up" is 0
> (which is correct), and the number of PGs where this disk is acting is 49.
> That seems correct, as these 49 PGs need to be moved away. However,
> looking at the Ceph dashboard, the UI says that there are 71 PGs on
> that disk...
>
> So:
> - How does the Ceph dashboard get to that number in the first place?
> - Is it possible that there are "orphaned" PG parts left behind on
> a particular OSD?
> - If orphaned parts of a PG can be left behind on a disk, how do I clean
> them up?
>
> I've also tried examining the osdmap; however, the output seems to be
> limited(??). I only see the PGs for pools 1 and 2. (I don't know whether
> the file gets truncated when exporting the osd map, or by osdmaptool
> --print.)
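>
> Roughly what I ran, plus a variant that should compute the full PG-to-OSD
> mapping for a single pool (an untested sketch; paths and pool id are
> examples):
>
>   ceph osd getmap -o /tmp/osdmap
>   osdmaptool --print /tmp/osdmap
>   osdmaptool --test-map-pgs-dump --pool 1 /tmp/osdmap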
>
> The cluster is running Nautilus v14.2.11, all on the same version.
>
> I'll make some time to write documentation and note down everything I've
> run into during the journey of the last two weeks... Kristof in
> Ceph's wunderland...
>
> Thanks for all your input so far!
>
> Regards,
>
> Kristof
>
>
>
> On Wed 21 Oct 2020 at 14:01, Frank Schilder <frans@xxxxxx> wrote:
> There have been threads on exactly this. Might depend a bit on your ceph
> version. We are running mimic and have no issues doing:
>
> - set noout, norebalance, nobackfill
> - add all OSDs (with weight 1)
> - wait for peering to complete
> - unset all flags and let the rebalance loose
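>
> In commands, roughly (a sketch; adjust to your deployment):
>
>   ceph osd set noout
>   ceph osd set norebalance
>   ceph osd set nobackfill
>   # ... add the OSDs, then wait until 'ceph -s' shows peering has finished ...
>   ceph osd unset nobackfill
>   ceph osd unset norebalance
>   ceph osd unset noout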
>
> Starting with nautilus there seem to be issues with this procedure;
> mainly, the peering phase can cause a collapse of the cluster. In your
> case, it sounds like you have already added the OSDs. You should be able
> to do the following relatively safely:
>
> - set noout, norebalance, nobackfill
> - set weight of OSDs to 1 one by one and wait for peering to complete
> every time
> - unset all flags and let the rebalance loose
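>
> For the reweighting step, something along these lines (a sketch; the OSD
> ids and target weight are examples, and the sleep is only a crude
> stand-in for watching 'ceph -s' until peering has completed):
>
>   for id in 182 183 184; do
>     ceph osd crush reweight osd.$id 1.0
>     sleep 60
>   done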
>
> I believe that once peering succeeds without crashes, the rebalancing
> will just work fine. You can easily control how much rebalancing is going
> on.
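>
> For example by throttling backfill (a sketch; the values are not a
> recommendation):
>
>   ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'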
>
> I noted that Ceph seems to have a strange concept of priority, though. I
> needed to gain capacity by adding OSDs, and Ceph was very consistent about
> moving PGs off the fullest OSDs last, the opposite of what should happen.
> Thus, it took ages for additional capacity to become available, and the
> backfill_toofull warnings stayed the whole time. You can influence this
> to some degree by using force_recovery commands on PGs on the fullest OSDs.
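>
> For example (a sketch; the osd and pg ids are made up):
>
>   ceph pg ls-by-osd osd.57 backfill_wait   # PGs on a full OSD still waiting
>   ceph pg force-backfill 6.1a
>   ceph pg force-recovery 6.1a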
>
> Best regards and good luck,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Kristof Coucke <kristof.coucke@xxxxxxxxx>
> Sent: 21 October 2020 13:29:00
> To: ceph-users@xxxxxxx
> Subject:  Question about expansion existing Ceph cluster -
> adding OSDs
>
> Hi,
>
> I have a cluster with 182 OSDs; it has been expanded to 282 OSDs.
> Some disks were near full.
> The new disks have been added with an initial weight of 0.
> The original plan was to increase their weight slowly towards the full
> weight using the gentle reweight script. However, this is going way too
> slowly, and I'm now also running into "backfill_toofull" issues.
> Can I just add all the OSDs with their full weight, or will I get a lot of
> issues if I do that?
> I know that a lot of PGs will have to be moved, but increasing the weight
> slowly will take a year at the current speed. I'm already playing with the
> max backfill settings to increase the speed, but every time I increase the
> weight it takes a lot of time again...
> I can live with the fact that there will be a performance decrease.
>
> Looking forward to your comments!
>
> Regards,
>
> Kristof
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



