Hi,
I appreciate your message; that really sounds tough (9 months,
really?!). But thanks for the reassurance :-)
They don't have any other options, so we'll have to start that process
anyway, probably tomorrow. We'll see how it goes…
Quoting Konstantin Shalygin <k0ste@xxxxxxxx>:
Hi Eugene!
I had a case where PGs held millions of objects each, like this:
```
root@host# ./show_osd_pool_pg_usage.sh <pool> | less | head
id used_mbytes used_objects omap_used_mbytes omap_used_keys
-- ----------- ------------ ---------------- --------------
17.c91 1213.2482748031616 2539152 0 0
17.9ae 1213.3145303726196 2539025 0 0
17.1a4 1213.432228088379 2539752 0 0
17.8f4 1213.4958791732788 2539831 0 0
17.f9 1213.5339193344116 2539837 0 0
17.c9d 1213.564414024353 2540014 0 0
17.89 1213.6339054107666 2540183 0 0
17.412 1213.6393299102783 2539797 0 0
```
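In case it is useful, a similar per-PG listing can be derived straight from `ceph pg ls-by-pool` — this is only a sketch, not the actual show_osd_pool_pg_usage.sh, and the JSON field names (pg_stats, stat_sum, num_omap_bytes, ...) can differ between Ceph releases, so check them on your cluster first:
```
# List per-PG usage (MB, objects, omap MB, omap keys), largest first.
ceph pg ls-by-pool <pool> -f json | jq -r '
  .pg_stats[]
  | [ .pgid,
      (.stat_sum.num_bytes / 1048576),
      .stat_sum.num_objects,
      (.stat_sum.num_omap_bytes / 1048576),
      .stat_sum.num_omap_keys ]
  | @tsv
' | sort -t$'\t' -k2 -rn | head
```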
And the OSDs were very small, about 1TB, with RocksDB at ~150-200GB.
What you see above are already split PGs; before the split, one OSD
was serving 64 PGs * 4M objects = 256,000,000 objects...
The main problem was: to remove something, you first need to move
something, and while the move is in progress, nothing gets deleted.
Also, deleting is slower than writing, so doing it all in one
operation was impossible. I did it manually over 9 months. After the
splitting of some PGs was completed, I moved other PGs away from
the most crowded (from the operator's point of view, problematic)
OSDs. The pgremapper [1] helped me with this. As far as I remember,
in this way I got from 2048 to 3000 PGs, then I was able to set 4096
PGs, after which it became possible to move to 4TB NVMe.
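For illustration, the kind of manual move that pgremapper automates can also be done with explicit upmaps; the OSD ids below are placeholders, not from a real cluster:
```
# Place the copy of PG 17.c91 that currently sits on osd.12 onto osd.45 instead:
ceph osd pg-upmap-items 17.c91 12 45
# Remove the exception again once the backfill has finished:
ceph osd rm-pg-upmap-items 17.c91
```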
Your case doesn't look that scary. Firstly, your 85% means that you
still have hundreds of free gigabytes on those 8TB OSDs. If no new
data arrives, the reservation mechanism is sufficient and the process
will finish after some time. On the other hand, I had a replicated
pool, so compared to EC my case was simpler.
In any case, it's worth trying and making full use of what upmap can do.
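If the balancer is not already running in upmap mode, enabling it looks roughly like this (this assumes all clients are Luminous or newer, which upmap requires):
```
# upmap requires Luminous or newer clients
ceph osd set-require-min-compat-client luminous
ceph balancer mode upmap
ceph balancer on
# check what the balancer is currently doing
ceph balancer status
```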
Good luck,
k
[1] https://github.com/digitalocean/pgremapper
On 9 Apr 2024, at 11:39, Eugen Block <eblock@xxxxxx> wrote:
I'm trying to estimate the possible impact when large PGs are
split. Here's one example of such a PG:
PG_STAT  OBJECTS  BYTES         OMAP_BYTES*  OMAP_KEYS*  LOG   DISK_LOG  UP
86.3ff   277708   414403098409  0            0           3092  3092      [187,166,122,226,171,234,177,163,155,34,81,239,101,13,117,8,57,111]
Their main application is RGW on EC (currently 1024 PGs on 240
OSDs), 8TB HDDs backed by SSDs. There are 6 RGWs running behind
HAProxies. It took me a while to convince them to do a PG split, and
now they're trying to assess how big the impact could be. The
fullest OSD is already at 85% usage, the least filled one at 59%,
so there is definitely room for better balancing, which will be
necessary until the new hardware arrives. The current distribution
is around 100 PGs per OSD, which usually would be fine, but since
the PGs are that large, a difference of only a few PGs has a huge
impact on OSD utilization.
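Just as a quick check (not output from their cluster), the summary at the bottom of `ceph osd df` shows how uneven the utilization currently is:
```
# The last two lines of `ceph osd df` print the TOTAL usage plus the
# MIN/MAX VAR and the standard deviation of utilization across OSDs.
ceph osd df | tail -2
```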
I'm targeting 2048 PGs for that pool for now, and will probably do
another split once the new hardware has been integrated.
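For what it's worth, here is my rough sketch of how the split would be started and throttled on a Nautilus-or-later cluster (the values are examples, not recommendations):
```
# Limit how much data may be misplaced at once while the mgr raises pgp_num
# step by step (0.05 is the default for target_max_misplaced_ratio):
ceph config set mgr target_max_misplaced_ratio 0.05
# Cap concurrent backfills per OSD to soften the impact:
ceph config set osd osd_max_backfills 1
# Kick off the split; pgp_num follows gradually:
ceph osd pool set <pool> pg_num 2048
# Watch progress:
ceph osd pool get <pool> pg_num
ceph osd pool get <pool> pgp_num
ceph -s
```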
Any comments are appreciated!