Hi Ugis,

On Mon, 8 Jan 2018, Ugis wrote:
> Hi,
>
> Suggestion first: the ceph.com site could have some best-practice rules
> for adding new OSDs. Googling this topic reveals that people have
> questions like:
> - may I add several OSDs at once?

Yes.

> - may I completely change the crushmap online so that pgs get completely relocated?

Yes.

> - what config parameters help to reduce backfill load?

osd_max_backfills (default: 1) controls how many concurrent backfill (or
recovery) operations an OSD will work on. (The effective limit is
actually 2x this value, since we track recoveries for which the OSD is
primary separately from those for which it is a replica participant;
this is to avoid deadlock in our relatively simplistic approach to
reservation.)

Suggestions for where this type of summary info would fit into the docs
structure would be helpful!

> Until then, I still have this theoretical question on the CRUSH algorithm.
> We have a ceph cluster with 5 osd hosts, and a CRUSH rule that orders
> ceph to put replicas one copy per host.
>
> If we add 2 osds simultaneously in different hosts - how does CRUSH
> guarantee that some existing pg that should now be located on those
> 2 new osds does not become unavailable? It should be something with
> epochs, I suppose?
>
> I have found a thread mentioning that people have tested completely
> remapping pgs: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-July/019577.html
> Still, it is not clear what the theoretical constraints on adding
> bunches of OSDs are (aside from backfill load). For example, if a pg
> gets relocated several times in a row (in case osds get added without
> waiting for degradation to resolve) - how long can that chain of
> previously allocated pgs be?

The key thing to keep in mind here is that CRUSH only tells us where
things "should" be as of a given point in time. RADOS is responsible for
keeping track of where things are and have been recently, and for making
a safe migration to the desired location.
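To make that split concrete, here is a toy stand-in for CRUSH -- not the
real algorithm (which walks a hierarchy of straw2 buckets), just a
rendezvous hash. It shows the property that matters here: placement is a
pure, deterministic function of the current OSD list, so "adding two
OSDs at once" simply means the function is recomputed and only some PGs
map somewhere new. The data itself does not teleport; RADOS migrates it.

```python
import hashlib

def place(pg_id, osds, replicas=3):
    """Toy stand-in for CRUSH: rank OSDs by a hash of (pg, osd) and
    take the top `replicas`. Deterministic for a given OSD list."""
    def score(osd):
        h = hashlib.sha256(f"{pg_id}:{osd}".encode()).hexdigest()
        return int(h, 16)
    return sorted(osds, key=score, reverse=True)[:replicas]

osds = [f"osd.{i}" for i in range(5)]
before = {pg: place(pg, osds) for pg in range(8)}

# Add two OSDs "simultaneously": placement is recomputed from scratch,
# but only the PGs whose top-3 ranking changed get new homes.
osds += ["osd.5", "osd.6"]
after = {pg: place(pg, osds) for pg in range(8)}

moved = [pg for pg in before if before[pg] != after[pg]]
print(f"{len(moved)} of {len(before)} PGs remapped")
```

Real CRUSH adds weights, failure-domain rules (one replica per host,
as in your case), and tunables on top of this idea, but the
determinism-per-map-epoch is the same.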
Generally speaking, the amount of history it will remember is
unbounded--you could feed the cluster a million CRUSH map changes faster
than it can move data and it won't stop you. In theory, the amount of
state that has to be tracked is bounded by the size of the cluster... in
the truly degenerate case it will think that every PG existed at some
point on every other OSD. In practice (as of luminous) the amount of
state needed is very small due to the recent PastIntervals work (see
this blog post for some more background if you're interested:
http://ceph.com/community/new-luminous-pg-overdose-protection/).

Hope that helps!
sage
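The bound Sage describes -- history can grow without limit, but the set
of OSDs a PG may have lived on cannot exceed the cluster size -- can be
sketched with a deliberately simplified toy model. (This is only the
idea; the real PastIntervals structure in Ceph is far more compact and
handles peering subtleties this toy ignores.)

```python
class ToyPG:
    """Toy model of past-interval tracking for a single PG."""

    def __init__(self, acting):
        self.acting = list(acting)
        self.might_have_data = set(acting)  # bounded by cluster size

    def map_change(self, new_acting):
        """A new CRUSH map moved this PG; remember where data may live."""
        self.might_have_data |= set(new_acting)
        self.acting = list(new_acting)

    def backfill_complete(self):
        """Once the acting set is fully recovered, old history can be
        pruned -- which is why keeping up with backfill keeps state small."""
        self.might_have_data = set(self.acting)

pg = ToyPG(["osd.0", "osd.1", "osd.2"])
for i in range(1_000_000):          # a million map changes in a 7-OSD cluster...
    pg.map_change([f"osd.{(i + j) % 7}" for j in range(3)])
print(len(pg.might_have_data))      # ...but the tracked set is capped at 7 OSDs
```

However long the chain of relocations gets, the state collapses to the
degenerate "every OSD" case at worst, and pruning after recovery keeps
it near the replica count in practice.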