Re: PG overdose protection causing PG unavailability

Am 22.02.2018 um 02:54 schrieb David Turner:
> You could set the flag noin to prevent the new OSDs from being considered by CRUSH until you are ready for all of them in the host to be marked in.
> You can also set the initial CRUSH weight to 0 for new OSDs so that they won't receive any PGs until you're ready for it.
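(For the crush-weight part of that suggestion, which I have not tried yet, my reading as concrete steps would be roughly the following; the option name osd_crush_initial_weight and the example weight are my assumption of how this is usually done, so treat it as a sketch:)

# In ceph.conf on the OSD host, set before creating the new OSDs,
# so they enter the CRUSH map with weight 0 and receive no PGs:
[osd]
osd_crush_initial_weight = 0

# Once all OSDs of the host have been created, raise them to their real
# weight (roughly the disk size in TiB, e.g. ~3.64 for a 4 TB disk):
$ for i in {68..99}; do ceph osd crush reweight osd.${i} 3.64; done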

I tried this just now for the next reinstallation and it did not help. Here's what I did:

$ ceph osd set noin
# Shutdown to-be-reinstalled host, purge old OSDs, reinstall host, create new OSDs
$ ceph osd unset noin
=> Nothing happens: the new OSDs are "up", as expected, but not "in".

Now I have to put them in somehow. 
What I did was:
$ for i in {68..99}; do ceph osd in osd.${i}; done
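(As an aside, "ceph osd in" also accepts a list of ids, so a single invocation like the following would be possible; whether that actually avoids the intermediate OSD maps with only part of the OSDs "in" is something I have not verified:)

$ ceph osd in $(seq 68 99)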

And I ended up with the very same problem, since there is of course a delay between the first OSD going "in"
and the second OSD going "in". Our mons seem to be fast enough to recalculate the CRUSH map within this small delay,
so "PG overdose protection" kicks in (via osd_max_pg_per_osd_hard_ratio),
many PGs enter the "activating+undersized+degraded+remapped" or "activating+remapped" state and get stuck there,
and I end up with about 100 inactive PGs and reduced data availability (just from adding a host!).
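(For completeness, this is roughly how one can look at such a situation; osd.68 is just one of the new OSDs as an example, and the "ceph daemon" queries assume they are run on the host carrying that OSD:)

$ ceph health detail                  # lists the PGs stuck in "activating"
$ ceph pg dump_stuck inactive         # lists the stuck/inactive PGs and where they map
$ ceph daemon osd.68 config get osd_max_pg_per_osd_hard_ratio
$ ceph daemon osd.68 config get mon_max_pg_per_osd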

So it seems to me the only way to prevent data unavailability in such a (probably common?) setup, when you want to reinstall a host,
is to effectively disable overdose protection, or at least the osd_max_pg_per_osd_hard_ratio check.

If that really is the case, maybe the documentation should contain a prominent warning that this has to be done when reinstalling a full OSD host
if the total number of OSD hosts matches k+m of an EC pool.
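(For the record, the workaround I would apply for the next reinstallation, as a sketch; the value 10 is simply the "something large" from my original mail, and I have not verified whether injectargs is enough for this option or whether the OSDs need a restart to pick it up:)

# Persistently, in ceph.conf on the OSD hosts:
[osd]
osd_max_pg_per_osd_hard_ratio = 10

# Or at runtime, before marking the new OSDs "in":
$ ceph tell osd.* injectargs '--osd_max_pg_per_osd_hard_ratio 10'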

Alternatively, it would be nice if the "activating" PGs would at least recover at some point without manual intervention. 

Cheers,
	Oliver

> On Wed, Feb 21, 2018, 5:46 PM Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
> 
>     Dear Cephalopodians,
> 
>     in a Luminous 12.2.3 cluster with a pool with:
>     - 192 Bluestore OSDs total
>     - 6 hosts (32 OSDs per host)
>     - 2048 total PGs
>     - EC profile k=4, m=2
>     - CRUSH failure domain = host
>     which results in 2048*6/192 = 64 PGs per OSD on average, I run into issues with PG overdose protection.
> 
>     In case I reinstall one OSD host (zapping all disks), and recreate the OSDs one by one with ceph-volume,
>     they will usually come back "slowly", i.e. one after the other.
> 
>     This means the first OSD will initially be assigned all 2048 PGs (to fulfill the "failure domain host" requirement),
>     thus breaking through the default osd_max_pg_per_osd_hard_ratio of 2.
>     We also use mon_max_pg_per_osd default, i.e. 200.
> 
>     This appears to cause the previously active (but of course undersized+degraded) PGs to enter an "activating+remapped" state,
>     and hence they become unavailable.
>     Thus, data availability is reduced. All this is caused by adding an OSD!
> 
>     Of course, as more and more OSDs are added until all 32 are back online, this situation is relaxed.
>     Still, I observe that some PGs get stuck in this "activating" state, and I can't figure out the actual reason from the logs or by dumping the PGs.
>     Waiting does not help: the PGs stay "activating" and the data stays inaccessible.
> 
>     Waiting a bit and manually restarting the ceph-osd services on the reinstalled host seems to bring them back.
>     Also, adjusting osd_max_pg_per_osd_hard_ratio to something large (e.g. 10) appears to prevent the issue.
> 
>     So my best guess is that this is related to PG overdose protection.
>     Any ideas on how to best overcome this / similar observations?
> 
>     It would be nice to be able to reinstall an OSD host without temporarily making data unavailable;
>     right now the only thing which comes to my mind is to effectively disable PG overdose protection.
> 
>     Cheers,
>             Oliver
> 
>     _______________________________________________
>     ceph-users mailing list
>     ceph-users@xxxxxxxxxxxxxx
>     http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
