Re: pg's stuck activating on osd create

Hi,

it depends a bit on the actual OSD layout on the node and on your procedure, but there's a chance you hit the PG overdose protection. I would expect it to be logged in the OSD logs, though; two years ago, in a Nautilus cluster, the message looked like this:

maybe_wait_for_max_pg withhold creation of pg ...

According to the source on GitHub, in 16.2.15 it could look like this:

maybe_wait_for_max_pg hit max pg, dropping ...

But I'm not sure; I haven't seen it in newer clusters (yet).
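
If you want to check, a grep over the OSD logs on that node should show it, assuming the default log location (the OSD ids below are just the ones from your mail):

  grep maybe_wait_for_max_pg /var/log/ceph/ceph-osd.112.log
  grep maybe_wait_for_max_pg /var/log/ceph/ceph-osd.113.log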

Regards,
Eugen

Quoting Richard Bade <hitrich@xxxxxxxxx>:

Hi Everyone,
I had an issue last night when I was bringing online some OSDs that I
was rebuilding. When the OSDs were created and came online, 15 PGs got
stuck in activating. The first OSD (osd.112) seemed to come online ok,
but the second one (osd.113) triggered the issue. All the PGs stuck in
activating included osd.112 in their PG map. I resolved it by using
pg-upmap-items to map each PG back from osd.112 to the OSD it was
currently on, but it was painful having 10 minutes of stuck I/O on an
RBD pool with VMs running.
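
For reference, the workaround was upmap exceptions along these lines
(the PG id and the target OSD below are just placeholders, not the
real ones):

  # hypothetical PG 5.1a that got mapped to osd.112, pinned back to osd.87
  ceph osd pg-upmap-items 5.1a 112 87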

Some details about the cluster:
Pacific 16.2.15, upgraded fairly recently from Nautilus, and from
Luminous further back. All OSDs were rebuilt on BlueStore in Nautilus,
as were the mons.
The disks in question are Intel DC P4510 8TB NVMe. I'm rebuilding them
because I previously had 4x 2TB OSDs per disk and now want to
consolidate down to one OSD per disk.
There are around 300 OSDs in the pool, which has 16384 PGs, so the
2TB OSDs had about 157 PGs each. That means the 8TB OSDs now have
about 615 PGs each, and I'm wondering if this is the cause of the
problem.
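
(For anyone following along: the per-OSD counts above can be checked
in the PGS column of ceph osd df tree.)

  ceph osd df tree   # the PGS column shows the current number of PGs on each OSD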

There are no warnings about too many PGs per OSD in the logs or in
ceph status. I have the default value of 250 for mon_max_pg_per_osd
and the default value of 3.0 for osd_max_pg_per_osd_hard_ratio.
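
For reference, this is roughly how the values can be checked (a sketch
assuming the centralised config database is in use; ceph config show
queries a running daemon instead):

  ceph config get osd mon_max_pg_per_osd              # should report 250 with the defaults
  ceph config get osd osd_max_pg_per_osd_hard_ratio   # should report 3.0 with the defaults
  ceph config show osd.112 mon_max_pg_per_osd         # effective value on a running daemon

If I understand the overdose check correctly, that puts the hard limit
at 250 * 3.0 = 750 PGs per OSD.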

My plan is to reduce the number of PGs in the pool, but I want to
understand and prove what happened here. Is it likely I've hit PG
overdose protection? If I have, how would I tell? I can't see anything
in the cluster logs.

Thanks,
Rich
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
