Re: Hit suicide timeout after adding new osd

On Thu, 24 Jan 2013, Andrey Korolyov wrote:
> On Thu, Jan 24, 2013 at 8:39 AM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> > On Thu, 24 Jan 2013, Andrey Korolyov wrote:
> >> On Thu, Jan 24, 2013 at 12:59 AM, Jens Kristian Søgaard
> >> <jens@xxxxxxxxxxxxxxxxxxxx> wrote:
> >> > Hi Sage,
> >> >
> >> >>> I think the problem now is just that 'osd target transaction size' is
> >> >>
> >> >> I set it to 50, and that seems to have solved all my problems.
> >> >>
> >> >> After a day or so my cluster got to a HEALTH_OK state again. It has been
> >> >> running for a few days now without any crashes!
> >> >
> >> >
> >> > Hmm, one of the OSDs crashed again, sadly.
> >> >
> >> > It logs:
> >> >
> >> >    -2> 2013-01-23 18:01:23.563624 7f67524da700  1 heartbeat_map is_healthy
> >> > 'FileStore::op_tp thread 0x7f673affd700' had timed out after 60
> >> >     -1> 2013-01-23 18:01:23.563657 7f67524da700  1 heartbeat_map is_healthy
> >> > 'FileStore::op_tp thread 0x7f673affd700' had suicide timed out after 180
> >> >      0> 2013-01-23 18:01:24.257996 7f67524da700 -1 common/HeartbeatMap.cc:
> >> > In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*,
> >> > const char*, time_t)' thread 7f67524da700 time 2013-01-23 18:01:23.563677
> >> >
> >> > common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
> >> >
> >> >
> >> > With this stack trace:
> >> >
> >> >  ceph version 0.56.1-26-g3bd8f6b (3bd8f6b7235eb14cab778e3c6dcdc636aff4f539)
> >> >  1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*,
> >> > long)+0x2eb) [0x846ecb]
> >> >  2: (ceph::HeartbeatMap::is_healthy()+0x8e) [0x8476ae]
> >> >  3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0x8478d8]
> >> >  4: (CephContextServiceThread::entry()+0x55) [0x8e0f45]
> >> >  5: /lib64/libpthread.so.0() [0x3cbc807d14]
> >> >  6: (clone()+0x6d) [0x3cbc0f167d]
> >> >
> >> >
> >> > I have saved the core file, if there's anything in there you need?
> >> >
> >> > Or do you think I just need to set the target transaction size even lower
> >> > than 50?
> >> >
> >> >
> >>
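
For reference, the 'osd target transaction size' setting discussed above goes in
the [osd] section of ceph.conf and can also be changed on a running osd over the
admin socket; the value of 50 simply mirrors the one tried above and is not a
recommendation:

  # ceph.conf -- cap how much work the osd folds into a single
  # filestore transaction (e.g. during pg removal/recovery)
  [osd]
      osd target transaction size = 50

  # or, on a running osd, via the admin socket
  ceph --admin-daemon /var/run/ceph-osd.NN.asok config set osd_target_transaction_size 50
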
> >> I was able to catch this too when rejoining a very busy cluster, and it
> >> seems I need to lower this value, at least at start time. Also,
> >> c5fe0965572c074a2a33660719ce3222d18c1464 has increased the overall time
> >> before a restarted or new osd joins a cluster; with 2M objects/3T of
> >> replicated data, a restart of the cluster took almost an hour before it
> >> actually began to work. The worst part is that a single osd, if restarted,
> >> is marked up after a couple of minutes, then after almost half an hour
> >> (eating 100 percent of one cpu) is marked down, and the cluster starts to
> >> redistribute data after the 300s timeout while the osd is still doing
> >> something.
> >
> > Okay, something is very wrong.  Can you reproduce this with a log?  Or
> > even a partial log while it is spinning?  You can adjust the log level on
> > a running process with
> >
> >   ceph --admin-daemon /var/run/ceph-osd.NN.asok config set debug_osd 20
> >   ceph --admin-daemon /var/run/ceph-osd.NN.asok config set debug_ms 1
> >
> > We haven't been able to reproduce this, so I'm very much interested in any
> > light you can shine here.
> >
> 
> Unfortunately the cluster finally hit the ``suicide timeout'' on every osd,
> so there are no logs, only some backtraces[1].
> Yesterday, after an osd was not able to join the cluster within an hour, I
> decided to wait until the data was remapped, then tried to restart the
> cluster and left it overnight; by morning all osd processes were dead, with
> the same backtraces. Before that, after a silly node crash (related to
> deadlocks in the kernel kvm code), some pgs remained stuck in the peering
> state without any blocker in the json output, so I had decided to restart
> the osd holding the primary copy, because that had helped before. So the
> most interesting part is missing, but I'll reformat the cluster soon and
> will try to catch this again after filling in some data.
> 
> [1]. http://xdel.ru/downloads/ceph-log/osd-heartbeat/

Thanks, I believe I see the problem.  The peering workqueue is way behind, 
and it is trying to do it all in one lump, which times out the work queue.  
The workaround is to increase the timeout.  We'll put together a proper fix.
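
The 60s/180s values in the FileStore::op_tp lines above are the defaults for
'filestore op thread timeout' and 'filestore op thread suicide timeout', so one
way to apply that workaround, with purely illustrative values, is:

  # ceph.conf -- give the filestore op threads more headroom before the
  # heartbeat map flags them as hung or hits the suicide assert
  [osd]
      filestore op thread timeout = 180
      filestore op thread suicide timeout = 600

  # or inject into a running osd over the admin socket, as with the
  # debug settings above
  ceph --admin-daemon /var/run/ceph-osd.NN.asok config set filestore_op_thread_suicide_timeout 600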

sage

