Re: Bounding OSD memory requirements during peering/recovery

On 09/02/15 15:31, Gregory Farnum wrote:

> So, memory usage of an OSD is usually linear in the number of PGs it
> hosts. However, that memory can also grow based on at least one other
> thing: the number of OSD Maps required to go through peering. It
> *looks* to me like this is what you're running into, not growth in
> the number of state machines. In particular, those past_intervals you
> mentioned. ;)

Hi Greg,

Right, that sounds entirely plausible, and is very helpful.

In practice, that means I'll need to be careful to avoid this situation arising in production; but given that it's unlikely to occur except in cases of non-trivial neglect, I don't think I need be particularly concerned.

(Happily, I'm in the situation that my existing cluster is purely for testing purposes; the data is expendable.)

That said, for my own peace of mind, it would be valuable to have a procedure that can be used to recover from this state, even if it's unlikely to occur in practice.

I'm currently running an experiment in which I augment the RAM of each OSD node with a 10GB swapfile on each spinning OSD disk, so that there's a large enough backing store to complete log reconstruction.
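
Roughly, that's something along these lines per OSD (the path is purely illustrative; adjust for wherever your OSD data filesystems are mounted):

    # Create and enable a 10GB swapfile on the filesystem backing an OSD.
    dd if=/dev/zero of=/var/lib/ceph/osd/ceph-0/swapfile bs=1M count=10240
    chmod 600 /var/lib/ceph/osd/ceph-0/swapfile
    mkswap /var/lib/ceph/osd/ceph-0/swapfile
    swapon /var/lib/ceph/osd/ceph-0/swapfile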

(You obviously wouldn't want to operate in this manner during normal production: the loss of a single drive would cause a hard machine crash, and performance will be fairly diabolical, particularly if you allow client workloads to carry on in the background.)

I did try enabling zswap on the Utopic LTS kernel supplied as an option in Ubuntu 14.04; however, the kernel was not stable in that configuration, and several machines crashed under memory pressure.
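
(For anyone wanting to try the same: zswap is normally switched on via kernel boot parameters, roughly as below; the pool percentage shown is just the upstream default, not a tuned value.)

    # Append to the kernel command line, e.g. GRUB_CMDLINE_LINUX_DEFAULT in
    # /etc/default/grub, then run update-grub and reboot:
    zswap.enabled=1 zswap.max_pool_percent=20

    # It can also be toggled at runtime:
    echo 1 > /sys/module/zswap/parameters/enabled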

I do have OSDs committing suicide periodically, probably because they're insufficiently responsive to heartbeats as they start to hit swap. This is before experimenting with the various OSD tuning dials for timeouts, so some improvement may be possible.
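
The dials I have in mind are the heartbeat and thread-suicide timeouts, i.e. something along the following lines in ceph.conf; the values are purely illustrative, and I haven't yet verified which, if any, actually help:

    [osd]
        # Give peers longer before they report this OSD down for missing
        # heartbeats (illustrative value; the default is much lower).
        osd heartbeat grace = 60
        # Give internal worker threads longer before the OSD declares them
        # hung and aborts (again, illustrative values only).
        osd op thread suicide timeout = 600
        filestore op thread suicide timeout = 600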

In the meantime, I've configured the ceph-osd Upstart jobs with a post-stop `sleep 3600` to reduce the rate at which they're respawned.
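
Concretely, that's a post-stop stanza in an Upstart override, along these lines (assuming the stock ceph-osd job shipped with the Ubuntu packages):

    # /etc/init/ceph-osd.override
    # Sleep for an hour after each exit, so a crashing OSD is respawned at
    # most once an hour rather than immediately.
    post-stop exec sleep 3600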

So far, the resulting configuration seems to be making progress, albeit moderately slowly.

Cheers,
David
--
David McBride <dwm37@xxxxxxxxx>
Unix Specialist, University Information Services