Hi Rok,

On Sun, 22 Dec 2024 at 16:08, Rok Jaklič <rjaklic@xxxxxxxxx> wrote:
> Thank you all for your suggestions.
>
> I've increased full ratio to 0.96 and rgw started to work again.
>
> However:
>
> I've tried to set for e.g. osd.122 reweight, crush reweight and then
> also with:
> ceph osd pg-upmap-items 9.169 122 187

A little too much at once. ;)

1) ceph osd reweight
This overrides the OSD weight temporarily (like a bias) but keeps the
crush weight of the bucket the same, which means Ceph will try to keep
the PGs within the node.

2) ceph osd crush reweight
This sets the weight of the item in crush, which also tells Ceph that
the node has a different weight and starts a new calculation for all
the PGs connected to the OSDs of this node, which means more data
movement.

3) ceph osd pg-upmap-items
This overrides the placement of a PG chosen by the crush algorithm. You
can see these upmap entries in `ceph osd dump`.

I do not recommend 1). With 3) you have the most control over where PGs
should go. And it is best to keep the balancer off, otherwise it may
interfere with your placements.

> and then also
> [root@ctplmon1 plankton-swarm]# bash ./plankton-swarm.sh source-osds osd.122 3
> Using custom source OSDs: osd.122
> Underused OSDs (<65%):
> 63,108,200,199,143,144,198,142,140,146,195,141,148,194,196,197,103,158,191,147,125,25,164,54,193,126,192,145,46,19,68,157,128,15,60,53,134,129,87,0,102,131,6,43,78,127,120,33,81,119,3,5,74,190,70,8,160,24,58,156,114,29,186,82,96,116,182,48,84,28,18,44,178,39,4,75,115,76,79,130,72,86,159,40,184,22,26,35,71,171,88,64,175,187,170,165,41,110,94,150,111,17,83,14,27,49,2,37,124,172,177,98,152,104,95,118,168,189,132,52,105,32,59,21,101,9,107
> Will now find ways to move 3 pgs in each OSD respecting node failure domain.
> Processing OSD osd.122...
> dumped all
> No active and clean pgs found for osd.122, skipping.
> Balance pgs commands written to swarm-file - review and let planktons
> swarm with 'bash swarm-file'.

In another post of yours you show 6x PGs being processed for osd.122,
which is usually quite a lot for HDDs. Reducing osd_max_backfills (for
wpq) to 1 will reduce the number of parallel backfills, so that at most
2x backfills per OSD will happen. This might reduce the extra space
needed during the rebalance of these PGs, since the time needed to
complete the rebalance of a particular PG only increases the more PGs
are processed in parallel. Once a PG has backfilled successfully, it
will be removed from the old OSD.

You can also use `ceph pg force-backfill` on particular PGs to let Ceph
know to process them before other PGs.

Can you please post a `ceph osd df tree` and a `ceph df` to give us a
picture of how things are?

Cheers,
Alwin
croit GmbH, https://croit.io
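
A minimal command sketch of the three options discussed above, for
reference. osd.122, pg 9.169 and osd.187 come from the quoted example;
the weight values are only placeholders to adapt to the actual crush
tree:

  ceph balancer off                      # keep the balancer from undoing manual placements
  ceph osd reweight 122 0.90             # 1) temporary bias, crush weight of the bucket stays the same
  ceph osd crush reweight osd.122 1.8    # 2) changes the crush weight, triggers recalculation (more data movement)
  ceph osd pg-upmap-items 9.169 122 187  # 3) pin pg 9.169 from osd.122 to osd.187
  ceph osd dump | grep pg_upmap_items    # show the resulting upmap entries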
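
And a sketch for throttling and prioritizing the backfills mentioned
above, assuming a release with the central config store (`ceph config
set`); pg 9.169 again stands in for whichever PG you want ahead of the
queue:

  ceph config set osd osd_max_backfills 1   # wpq: fewer parallel backfills, at most 2x per OSD
  ceph pg force-backfill 9.169              # process this PG before other backfills
  ceph osd df tree                          # per-OSD utilisation within the failure domains
  ceph df                                   # pool and cluster usage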