Re: OOMs on the Ceph client machine

On Thu, 21 Oct 2010, Ted Ts'o wrote:
> On Thu, Oct 21, 2010 at 02:46:11PM -0700, Sage Weil wrote:
> > 
> > Unfortunately it's not obvious to me from dmesg where the problem is, 
> > other than that it looks like some of the osds aren't responding (but are 
> > apparently still up).  There is a known regression in v0.22 that can cause 
> > crashes in the osd cluster; we should have a fix pushed later today.  
> > That would look a bit different, though (you'd see osd down messages).  
> > I'll post an update (and probably v0.22.1) when that's been tested.
> 
> I looked earlier in the logs, and I do see some "osd down", "osd up",
> and "osd socket closed" messages.  So it looks like the v0.22
> regression you mentioned.  I'll wait for the git update and try
> rebuilding the server.  Thanks!!

Phew!  :)

> > > Also, it seems that there are issues moving back and forth between
> > > 0.21 and 0.22 without reformatting the Ceph client.  Is that accurate?
> > 
> > Yeah, that isn't expected to work.  In general, rolling backward isn't 
> > supported.  In this case we forgot to add an incompat flag to generate a 
> > nice error message to that effect.
> 
> Is rolling forward between 0.21 and 0.22 expected to work?  Or should
> I just run mkcephfs to be safe?  It's not a data-preservation issue,
> but rather the time it takes to run mkcephfs.

Rolling forward is always supposed to work.  (And if we do end up changing 
things in a non-backward compatible way, we'll make some noise about it.)
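
The incompat flag I mentioned above is nothing fancy: when a newer version 
changes the on-disk format it records a feature name in the store, and an 
older version that doesn't know that name refuses to start with a clear 
error instead of misbehaving.  A rough sketch of the idea in Python (made-up 
feature names, and pretend the superblock is JSON purely for illustration; 
this is not our actual on-disk format):

import json
import sys

# Features this build understands.  A release that changes the on-disk
# format adds a name here; older builds won't have it.
SUPPORTED_INCOMPAT = set(["v0.22-collection-format"])   # made-up name

def check_store(superblock_path):
    """Refuse to start if the store was written with incompat features
    this build doesn't know about (i.e. someone rolled backward)."""
    with open(superblock_path) as f:
        sb = json.load(f)   # e.g. {"incompat": ["v0.22-collection-format"]}
    unknown = set(sb.get("incompat", [])) - SUPPORTED_INCOMPAT
    if unknown:
        sys.exit("%s uses incompatible feature(s) %s; refusing to start"
                 % (superblock_path, ", ".join(sorted(unknown))))

if __name__ == "__main__":
    check_store(sys.argv[1])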

> Random
> question: how do you feel about using Python?  Making a version of
> mkcephfs that runs in parallel would probably be easier if we could
> port the shell script to a Python script.  I don't think there are
> any Python dependencies in Ceph right now, though.

Python's fine.  There's an issue in the tracker relating to this, btw.  
The goal will be to create discrete steps that let you use whatever 
cluster-specific tools you have for launching parallel jobs. 
	http://tracker.newdream.net/issues/400
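
The rough idea is to split mkcephfs into discrete per-host commands (e.g. 
generating keys and maps, initializing each osd's data directory) so that 
any fan-out tool you already have, whether pdsh, a shell loop, or a few 
lines of Python, can run them in parallel.  As a sketch of what the Python 
side could look like, with a placeholder command name and host list rather 
than a real interface:

# Fan a per-host initialization step out in parallel over ssh.
# "init-osd-step" and the host names are placeholders for whatever
# discrete commands come out of issue #400.
import subprocess
from concurrent.futures import ThreadPoolExecutor

OSD_HOSTS = ["osd0", "osd1", "osd2", "osd3"]

def init_osd(host):
    # Returns the remote command's exit code (0 on success).
    return subprocess.call(["ssh", host, "init-osd-step"])

with ThreadPoolExecutor(max_workers=len(OSD_HOSTS)) as pool:
    failures = [host for host, rc in zip(OSD_HOSTS, pool.map(init_osd, OSD_HOSTS)) if rc]

if failures:
    raise SystemExit("initialization failed on: %s" % ", ".join(failures))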

sage