Re: Hit suicide timeout after adding new osd

On Thu, Jan 24, 2013 at 10:01 PM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> On Thu, 24 Jan 2013, Andrey Korolyov wrote:
>> On Thu, Jan 24, 2013 at 8:39 AM, Sage Weil <sage@xxxxxxxxxxx> wrote:
>> > On Thu, 24 Jan 2013, Andrey Korolyov wrote:
>> >> On Thu, Jan 24, 2013 at 12:59 AM, Jens Kristian Søgaard
>> >> <jens@xxxxxxxxxxxxxxxxxxxx> wrote:
>> >> > Hi Sage,
>> >> >
>> >> >>> I think the problem now is just that 'osd target transaction size' is
>> >> >>
>> >> >> I set it to 50, and that seems to have solved all my problems.
>> >> >>
>> >> >> After a day or so my cluster got to a HEALTH_OK state again. It has been
>> >> >> running for a few days now without any crashes!
>> >> >
>> >> >
>> >> > Hmm, one of the OSDs crashed again, sadly.
>> >> >
>> >> > It logs:
>> >> >
>> >> >    -2> 2013-01-23 18:01:23.563624 7f67524da700  1 heartbeat_map is_healthy
>> >> > 'FileStore::op_tp thread 0x7f673affd700' had timed out after 60
>> >> >     -1> 2013-01-23 18:01:23.563657 7f67524da700  1 heartbeat_map is_healthy
>> >> > 'FileStore::op_tp thread 0x7f673affd700' had suicide timed out after 180
>> >> >      0> 2013-01-23 18:01:24.257996 7f67524da700 -1 common/HeartbeatMap.cc:
>> >> > In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*,
>> >> > const char*, time_t)' thread 7f67524da700 time 2013-01-23 18:01:23.563677
>> >> >
>> >> > common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
>> >> >
>> >> >
>> >> > With this stack trace:
>> >> >
>> >> >  ceph version 0.56.1-26-g3bd8f6b (3bd8f6b7235eb14cab778e3c6dcdc636aff4f539)
>> >> >  1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*,
>> >> > long)+0x2eb) [0x846ecb]
>> >> >  2: (ceph::HeartbeatMap::is_healthy()+0x8e) [0x8476ae]
>> >> >  3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0x8478d8]
>> >> >  4: (CephContextServiceThread::entry()+0x55) [0x8e0f45]
>> >> >  5: /lib64/libpthread.so.0() [0x3cbc807d14]
>> >> >  6: (clone()+0x6d) [0x3cbc0f167d]
>> >> >
>> >> >
>> >> > I have saved the core file, if there's anything in there you need?
>> >> >
>> >> > Or do you think I just need to set the target transaction size even lower
>> >> > than 50?
>> >> >
>> >> >
>> >>
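
For reference, a minimal sketch of where this knob lives, assuming the
bobtail-era option name ('osd target transaction size', i.e.
osd_target_transaction_size) and the same admin-socket path Sage uses below:

  # ceph.conf, [osd] section
  [osd]
      osd target transaction size = 50

  # or change it on a running daemon (osd.NN) via the admin socket
  ceph --admin-daemon /var/run/ceph-osd.NN.asok config set osd_target_transaction_size 50
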
>> >> I was able to hit this too when an osd rejoined a very busy cluster, and it
>> >> seems I need to lower this value, at least at start time. Also,
>> >> c5fe0965572c074a2a33660719ce3222d18c1464 has increased the overall time
>> >> before a restarted or new osd joins the cluster: for 2M objects / 3T
>> >> of replicated data, a restart of the cluster took almost an hour
>> >> before it actually began to work. The worst part is that a single
>> >> osd, if restarted, is marked up after a couple of minutes, then after
>> >> almost half an hour (eating 100 percent of one cpu) is marked down, and then
>> >> the cluster starts to redistribute data after the 300s timeout, while the osd
>> >> is still doing something.
>> >
>> > Okay, something is very wrong.  Can you reproduce this with a log?  Or
>> > even a partial log while it is spinning?  You can adjust the log level on
>> > a running process with
>> >
>> >   ceph --admin-daemon /var/run/ceph-osd.NN.asok config set debug_osd 20
>> >   ceph --admin-daemon /var/run/ceph-osd.NN.asok config set debug_ms 1
>> >
>> > We haven't been able to reproduce this, so I'm very much interested in any
>> > light you can shine here.
>> >
>>
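
The same debug levels can also be made persistent across osd restarts via
ceph.conf; a minimal sketch, assuming the usual [osd] section syntax (the
admin-socket commands above cover a live process):

  [osd]
      debug osd = 20
      debug ms = 1
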
>> Unfortunately the cluster finally hit the ``suicide timeout'' on every osd, so
>> there are no logs, only some backtraces [1].
>> Yesterday, after an osd was not able to join the cluster within an hour, I
>> decided to wait until the data was remapped, then tried to restart the cluster,
>> leaving it overnight; by morning all osd processes were dead, with the
>> same backtraces. Before that, after a silly node crash (related to
>> deadlocks in kernel kvm code), some pgs remained stuck in the peering
>> state without any blocker in the json output, so I decided to restart the
>> osd holding the primary copy, because that had helped before. So the most
>> interesting part is missing, but I`ll reformat the cluster soon and will
>> try to catch this again after filling in some data.
>>
>> [1]. http://xdel.ru/downloads/ceph-log/osd-heartbeat/
>
> Thanks, I believe I see the problem.  The peering workqueue is way behind,
> and it is trying to do it all in one lump, timing out the work queue.  The
> workaround is to increase the timeout.  We'll put together a proper fix.
>
> sage
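
For reference, the two values in the backtrace above (timed out after 60,
suicide after 180) correspond to the FileStore op thread timeouts. A hedged
sketch of raising them in ceph.conf, assuming the bobtail-era option names
filestore_op_thread_timeout and filestore_op_thread_suicide_timeout; as Sage
says, this is only a workaround until a proper fix lands:

  [osd]
      # warning threshold, default 60s ("had timed out after 60")
      filestore op thread timeout = 180
      # assert threshold, default 180s ("hit suicide timeout")
      filestore op thread suicide timeout = 600
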

Hi Sage,

Single OSDs are still not able to join the cluster after a restart: the osd
process eats one core and reads the disk in long continuous stretches of a few
hundred seconds, then keeps eating 100% of a core, then repeats. On a
relatively new cluster this is not reproducible even with almost the same
amount of data committed; only a week or two of writes, snapshot creation,
etc. exposes it. Please see the log below:

2013-02-17 12:08:17.503992 7fbe8795c780  0 ceph version 0.56.3-2-g290a352 (290a352c3f9e241deac562e980ac8c6a74033ba6), process ceph-osd, pid 29283
starting osd.26 at :/0 osd_data /var/lib/ceph/osd/26 /var/lib/ceph/osd/journal/journal26
2013-02-17 12:08:17.508193 7fbe8795c780  1 accepter.accepter.bind my_inst.addr is 0.0.0.0:6803/29283 need_addr=1
2013-02-17 12:08:17.508222 7fbe8795c780  1 accepter.accepter.bind my_inst.addr is 0.0.0.0:6804/29283 need_addr=1
2013-02-17 12:08:17.508244 7fbe8795c780  1 accepter.accepter.bind my_inst.addr is 0.0.0.0:6805/29283 need_addr=1
2013-02-17 12:08:17.508665 7fbe8795c780  5 filestore(/var/lib/ceph/osd/26) basedir /var/lib/ceph/osd/26 journal /var/lib/ceph/osd/journal/journal26
2013-02-17 12:08:17.508712 7fbe8795c780 10 filestore(/var/lib/ceph/osd/26) mount fsid is 67783a66-178e-430a-a031-757bd45ae718
2013-02-17 12:08:17.512110 7fbe8795c780  0 filestore(/var/lib/ceph/osd/26) mount FIEMAP ioctl is supported and appears to work
2013-02-17 12:08:17.512122 7fbe8795c780  0 filestore(/var/lib/ceph/osd/26) mount FIEMAP ioctl is disabled via 'filestore fiemap' config option
2013-02-17 12:08:17.512590 7fbe8795c780  0 filestore(/var/lib/ceph/osd/26) mount did NOT detect btrfs
2013-02-17 12:08:17.532726 7fbe8795c780  0 filestore(/var/lib/ceph/osd/26) mount syncfs(2) syscall fully supported (by glibc and kernel)
2013-02-17 12:08:17.532901 7fbe8795c780  0 filestore(/var/lib/ceph/osd/26) mount found snaps <>
2013-02-17 12:08:17.532929 7fbe8795c780  5 filestore(/var/lib/ceph/osd/26) mount op_seq is 53467469
2013-02-17 12:08:17.534953 7fbe8795c780 20 filestore (init)dbobjectmap: seq is 852347
2013-02-17 12:08:17.534981 7fbe8795c780 10 filestore(/var/lib/ceph/osd/26) open_journal at /var/lib/ceph/osd/journal/journal26
2013-02-17 12:08:17.535035 7fbe8795c780  0 filestore(/var/lib/ceph/osd/26) mount: enabling WRITEAHEAD journal mode: btrfs not detected
2013-02-17 12:08:17.535046 7fbe8795c780 10 filestore(/var/lib/ceph/osd/26) list_collections
2013-02-17 12:08:17.668919 7fbe83ba2700 20 filestore(/var/lib/ceph/osd/26) sync_entry waiting for max_interval 5.000000
2013-02-17 12:08:17.669993 7fbe80b9c700 20 filestore(/var/lib/ceph/osd/26) flusher_entry start
2013-02-17 12:08:17.670012 7fbe80b9c700 20 filestore(/var/lib/ceph/osd/26) flusher_entry sleeping
2013-02-17 12:08:17.670104 7fbe8795c780  5 filestore(/var/lib/ceph/osd/26) umount /var/lib/ceph/osd/26
2013-02-17 12:08:17.670126 7fbe83ba2700 20 filestore(/var/lib/ceph/osd/26) sync_entry force_sync set
2013-02-17 12:08:17.670162 7fbe80b9c700 20 filestore(/var/lib/ceph/osd/26) flusher_entry awoke
2013-02-17 12:08:17.670169 7fbe80b9c700 20 filestore(/var/lib/ceph/osd/26) flusher_entry finish

