Re: PG overdose protection causing PG unavailability

On 24.02.2018 at 07:14, David Turner wrote:
> There was another part to my suggestion, which was to set the initial crush weight to 0 in ceph.conf. After you add all of your OSDs, you could download the crush map, weight the new OSDs to what they should be, and upload the crush map to give them all the ability to take PGs at the same time. With this method, none of the OSDs on the host can take PGs until all of them can.

I did indeed miss this part of the suggestion. 
Up to now I have refrained from any manual edits of the CRUSH map and have instead relied on device classes and automatic CRUSH location updates - 
the general direction Ceph seems to be moving in is to make it unnecessary to ever touch the CRUSH map, 
and even to obsolete ceph.conf at some point in the near future. 
Since the first tools are already adjusting the weights automatically (such as the balancer), having to intervene manually 
in this regard would also not be nice. 

Still, it seems very likely that manually adapting the weights would avoid the issue completely. 
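To be concrete, this is roughly what I understand the zero-weight approach to look like (the OSD IDs 160-191 and the
weight of 3.64 are just placeholders for one host's disks, and osd_crush_initial_weight is, as far as I can tell, the
stock option that makes freshly created OSDs come up with CRUSH weight 0):

    # ceph.conf on the OSD hosts, so recreated OSDs start with CRUSH weight 0
    [global]
    osd_crush_initial_weight = 0

    # once all 32 OSDs of the host exist again, either edit the map directly ...
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    # (set the item weights of the new OSDs in crushmap.txt, then recompile)
    crushtool -c crushmap.txt -o crushmap.new
    ceph osd setcrushmap -i crushmap.new

    # ... or skip the decompile step and reweight them in one go
    # (OSD IDs and the 3.64 weight are placeholders)
    for id in $(seq 160 191); do ceph osd crush reweight osd.$id 3.64; done

Either way, all OSDs of the host become able to take PGs at the same time, which is the point of the exercise. 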

However, until the issue is fixed I'd still prefer my hack (osd_max_pg_per_osd_hard_ratio = 32), which effectively disables the hard overdose protection,
over manual CRUSH map editing. In a cluster with almost 200 OSDs, the latter would mean editing the CRUSH map each time 
I purge an OSD and add it anew. Right now the HDDs are fresh, but as soon as they age and start to fail, this would become a cumbersome
(and technically unnecessary) task. 
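For completeness, the hack itself is just this one line. With mon_max_pg_per_osd at its default of 200, the hard limit
becomes 200 * 32 = 6400 PGs per OSD, so even the 2048 PGs a single rebuilt OSD temporarily receives stay far below it,
whereas the default ratio of 2 only allows 400:

    # ceph.conf on the redeployed OSD hosts
    [global]
    osd_max_pg_per_osd_hard_ratio = 32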

I'll re-trigger the issue and upload logs as suggested by Greg soon-ish; maybe the issue will even be fixed before we have the first failing disk ;-). 

Cheers,
	Oliver

> 
> On Thu, Feb 22, 2018, 7:14 PM Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
> 
>     On 23.02.2018 at 01:05, Gregory Farnum wrote:
>     >
>     >
>     > On Wed, Feb 21, 2018 at 2:46 PM Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>     >
>     >     Dear Cephalopodians,
>     >
>     >     in a Luminous 12.2.3 cluster with a pool with:
>     >     - 192 Bluestore OSDs total
>     >     - 6 hosts (32 OSDs per host)
>     >     - 2048 total PGs
>     >     - EC profile k=4, m=2
>     >     - CRUSH failure domain = host
>     >     which results in 2048*6/192 = 64 PGs per OSD on average, I run into issues with PG overdose protection.
>     >
>     >     In case I reinstall one OSD host (zapping all disks), and recreate the OSDs one by one with ceph-volume,
>     >     they will usually come back "slowly", i.e. one after the other.
>     >
>     >     This means the first OSD will initially be assigned all 2048 PGs (to fulfill the "failure domain host" requirement),
>     >     thus breaking through the default osd_max_pg_per_osd_hard_ratio of 2.
>     >     We also use mon_max_pg_per_osd default, i.e. 200.
>     >
>     >     This appears to cause the previously active (but of course undersized+degraded) PGs to enter an "activating+remapped" state,
>     >     and hence they become unavailable.
>     >     Thus, data availability is reduced. All this is caused by adding an OSD!
>     >
>     >     Of course, as more and more OSDs are added until all 32 are back online, this situation is relaxed.
>     >     Still, I observe that some PGs get stuck in this "activating" state, and can't seem to figure out from logs or by dumping them
>     >     what's the actual reason. Waiting does not help, PGs stay "activating", data stays inaccessible.
>     >
>     >
>     > Can you upload logs from each of the OSDs that are (and should be, but aren't) involved with one of the PGs that this happens to? (ceph-post-file) And create a ticket about it?
> 
>     I'll reproduce it over the weekend and then capture the logs; so far I have not seen anything in them, but I am also not yet very used to reading them.
> 
>     What I can already confirm for sure is that after I set:
>     osd_max_pg_per_osd_hard_ratio = 32
>     in ceph.conf (global) and deploy new OSD hosts with that, the problem has fully vanished. I have already tested this with two machines.
> 
>     Cheers,
>     Oliver
> 
>     >
>     > Once you have a good map, all the PGs should definitely activate themselves.
>     > -Greg
>     >
>     >
>     >     Waiting a bit and manually restarting the ceph-osd services on the reinstalled host seems to bring them back.
>     >     Also, adjusting osd_max_pg_per_osd_hard_ratio to something large (e.g. 10) appears to prevent the issue.
>     >
>     >     So my best guess is that this is related to PG overdose protection.
>     >     Any ideas on how to best overcome this / similar observations?
>     >
>     >     It would be nice to be able to reinstall an OSD host without temporarily making data unavailable;
>     >     right now the only thing that comes to my mind is to effectively disable PG overdose protection.
>     >
>     >     Cheers,
>     >             Oliver
>     >
>     >
> 
> 



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
