Re: osd suicide timeout


 



Hi Deneau,
We set filestore_queue_max_ops=1000, but unfortunately we still hit the OSD suicide.
We are testing random 4K writes, and the osd_op_tp thread still hits the suicide
timeout even with osd_op_thread_timeout=120.
The threads in osd_op_tp reset their timeout handle after submitting an op to the
FileJournal's workqueue, so we have no idea why a thread could stay blocked for
that long; the debug log shows that during that time the other threads in
osd_op_tp were working fine.
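
One possibility, consistent with the throttle line Greg quotes further down in the
thread: the handle is only reset *after* the blocking submit returns, so if the
submit itself stalls behind the journal/filestore queue throttle, nothing refreshes
the handle and the suicide timer still expires. Note also that osd_op_thread_timeout
only controls the warning grace; the kill comes from osd_op_thread_suicide_timeout
(150 s by default, matching the "timed out after 150" in the log). Below is a minimal
sketch of that pattern with made-up names; it is not the actual OSD code, just an
illustration of why raising the non-suicide timeout does not help.

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

using Clock = std::chrono::steady_clock;

// Stand-in for the per-thread heartbeat handle: the watchdog fires if the
// owning thread has not called reset_timeout() within the suicide grace.
struct HeartbeatHandle {
    std::atomic<Clock::rep> last_reset{Clock::now().time_since_epoch().count()};

    void reset_timeout() {
        last_reset = Clock::now().time_since_epoch().count();
    }

    bool suicide_expired(std::chrono::seconds grace) const {
        auto last = Clock::time_point(Clock::duration(last_reset.load()));
        return Clock::now() - last > grace;
    }
};

// Hypothetical blocking submit: stands in for queueing an op to the journal,
// which can block behind a full ops/bytes throttle.
void submit_to_journal_queue() {
    std::this_thread::sleep_for(std::chrono::seconds(5));
}

void op_thread(HeartbeatHandle& hb) {
    for (int i = 0; i < 2; ++i) {
        submit_to_journal_queue();  // if this stalls, nothing below runs...
        hb.reset_timeout();         // ...so the handle is never refreshed
    }
}

int main() {
    HeartbeatHandle hb;
    std::thread worker(op_thread, std::ref(hb));
    // Watchdog loop, standing in for the periodic heartbeat_map check.
    for (int t = 0; t < 6; ++t) {
        std::this_thread::sleep_for(std::chrono::seconds(1));
        if (hb.suicide_expired(std::chrono::seconds(3)))
            std::printf("watchdog: op thread exceeded its suicide grace\n");
    }
    worker.join();
    return 0;
}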

2015-07-14 7:12 GMT+08:00 Deneau, Tom <tom.deneau@xxxxxxx>:
> Greg --
>
> Not sure how to tell whether rebalancing occurred at that time.
> I do see in other osd logs complaints that they do not get a reply from
> osd.8 starting around 15:52 on that day.
>
> I see the deep-scrub of pool 14 but that was almost 30 minutes earlier.
>
> -- Tom
>
>
>> -----Original Message-----
>> From: Gregory Farnum [mailto:greg@xxxxxxxxxxx]
>> Sent: Monday, July 13, 2015 11:45 AM
>> To: Deneau, Tom
>> Cc: ceph-devel
>> Subject: Re: osd suicide timeout
>>
>> heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x3ff6eb0efd0' had suicide
>> timed out after 150
>>
>> So that's the OSD's "op thread", which is the one that does most of the work.
>> You often see the FileStore::op_tp when it's the disk or filesystem breaking,
>> but I do see the line
>>
>> waiting 51 > 50 ops || 57248008 > 104857600
>>
>> which leaves me feeling pretty confident that the disk is just getting more
>> work than it can keep up with. It looks like there was some rebalancing
>> happening around this time?
>> -Greg
>>
>>
>>
>> On Mon, Jul 13, 2015 at 5:04 PM, Deneau, Tom <tom.deneau@xxxxxxx> wrote:
>> > Greg --
>> >
>> > Thanks.  I put the osd.log file at
>> >
>> > https://drive.google.com/file/d/0B_rfwWh40kPwQjZ3OXdjLUZNRVU/view?usp=
>> > sharing
>> >
>> > I noticed the following from journalctl output around that time, so other
>> nodes were complaining they could not reach osd.8.
>> >
>> > Jul 09 15:53:04 seattle-04-ausisv bash[8486]: 2015-07-09
>> > 15:53:03.905386 3ffa0d9efd0 -1 osd.9 2487 heartbeat_check: no reply
>> > from osd.8 since back 2015-07-09 15:52:43.256581 front 2015-07-09
>> > 15:52:43.256581 (cutoff 2015-07-09 15:52:43.905384) Jul 09 15:53:06
>> > seattle-04-ausisv bash[1060]: 2015-07-09 15:53:06.784069 3ff916fefd0
>> > -1 osd.7 2487 heartbeat_check: no reply from osd.8 since back
>> > 2015-07-09 15:52:46.474273 front 2015-07-09 15:52:46.474273 (cutoff
>> > 2015-07-09 15:52:46.784066)
>> >
>> > and here is some sar data for the disk that osd.8 was controlling
>> > (sde1=journal partition, sde2=data partition)
>> >
>> > 03:40:02 PM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz
>> await     svctm     %util
>> > 03:50:17 PM      sde1      9.67      0.00  19046.40   1970.32      0.35
>> 35.86     30.62     29.60
>> > 03:50:17 PM      sde2     60.47   1524.27  14597.67    266.63     24.45
>> 404.30      8.54     51.67
>> > 03:50:32 PM      sde1     12.13      0.00  18158.93   1496.62      0.25
>> 20.66     17.58     21.33
>> > 03:50:32 PM      sde2     28.00   1550.93  17958.33    696.76     10.54
>> 376.50     13.52     37.87
>> > 03:50:47 PM      sde1     12.73      0.00  25446.40   1998.41      0.31
>> 24.19     22.30     28.40
>> > 03:50:47 PM      sde2     51.60    338.67  18091.73    357.18     13.05
>> 252.91      8.02     41.40
>> > 03:51:02 PM      sde1     12.27      0.00  18790.40   1531.83      0.31
>> 25.33     18.53     22.73
>> > 03:51:02 PM      sde2     33.13   2635.20  18026.67    623.60      5.02
>> 151.57     10.99     36.40
>> > 03:51:17 PM      sde1     10.13      0.00  14557.87   1436.63      0.16
>> 16.18     12.76     12.93
>> > 03:51:17 PM      sde2     46.73   1107.73  12067.00    281.91      8.55
>> 182.88      5.46     25.53
>> > 03:51:32 PM      sde1     11.93      0.00  18594.13   1558.17      0.35
>> 29.27     16.42     19.60
>> > 03:51:32 PM      sde2     22.20    555.20  18834.33    873.40      4.24
>> 191.08     13.51     30.00
>> > 03:51:47 PM      sde1     18.00      0.00  13926.40    773.69      0.19
>> 10.78     10.07     18.13
>> > 03:51:47 PM      sde2     47.27   1652.80  10775.53    262.94     12.24
>> 259.01      6.66     31.47
>> > 03:52:02 PM      sde1     21.60      0.00  10845.87    502.12      0.24
>> 11.08      9.75     21.07
>> > 03:52:02 PM      sde2     34.33   1652.80   9089.13    312.87      7.43
>> 216.41      8.45     29.00
>> > 03:52:17 PM      sde1     19.87      0.00  20198.40   1016.70      0.33
>> 16.85     13.46     26.73
>> > 03:52:17 PM      sde2     35.60   2752.53  16355.53    536.74     11.90
>> 333.33     10.90     38.80
>> > 03:52:32 PM      sde1     22.54      0.00   8434.04    374.18      0.15
>> 6.67      6.17     13.90
>> > 03:52:32 PM      sde2     35.84   2738.30   4586.30    204.38      2.01
>> 28.11      6.53     23.40
>> > 03:52:47 PM      sde1      0.00      0.00      0.00      0.00      0.00
>> 0.00      0.00      0.00
>> > 03:52:47 PM      sde2     13.37     35.83   1101.80     85.09      1.87
>> 218.65      5.75      7.69
>> > 03:53:02 PM      sde1      0.00      0.00      0.00      0.00      0.00
>> 0.00      0.00      0.00
>> > 03:53:02 PM      sde2      0.00      0.00      0.00      0.00      0.00
>> 0.00      0.00      0.00
>> > 03:53:17 PM      sde1      0.00      0.00      0.00      0.00      0.00
>> 0.00      0.00      0.00
>> > 03:53:17 PM      sde2      0.13      0.00      0.20      1.50      0.00
>> 20.00     20.00      0.27
>> > 03:53:32 PM      sde1      0.00      0.00      0.00      0.00      0.00
>> 0.00      0.00      0.00
>> > 03:53:32 PM      sde2      0.00      0.00      0.00      0.00      0.00
>> 0.00      0.00      0.00
>> > 03:53:47 PM      sde1      0.00      0.00      0.00      0.00      0.00
>> 0.00      0.00      0.00
>> > 03:53:47 PM      sde2      0.13      0.00      0.20      1.50      0.00
>> 5.00      5.00      0.07
>> > 03:54:02 PM      sde1      0.00      0.00      0.00      0.00      0.00
>> 0.00      0.00      0.00
>> > 03:54:02 PM      sde2      0.00      0.00      0.00      0.00      0.00
>> 0.00      0.00      0.00
>> > 03:54:17 PM      sde1      0.00      0.00      0.00      0.00      0.00
>> 0.00      0.00      0.00
>> > 03:54:17 PM      sde2      0.00      0.00      0.00      0.00      0.00
>> 0.00      0.00      0.00
>> > 03:54:32 PM      sde1      0.00      0.00      0.00      0.00      0.00
>> 0.00      0.00      0.00
>> > 03:54:32 PM      sde2      0.00      0.00      0.00      0.00      0.00
>> 0.00      0.00      0.00
>> >
>> > -- Tom
>> >
>> >> -----Original Message-----
>> >> From: Gregory Farnum [mailto:greg@xxxxxxxxxxx]
>> >> Sent: Monday, July 13, 2015 5:07 AM
>> >> To: Deneau, Tom
>> >> Cc: ceph-devel
>> >> Subject: Re: osd suicide timeout
>> >>
>> >> On Fri, Jul 10, 2015 at 10:45 PM, Deneau, Tom <tom.deneau@xxxxxxx> wrote:
>> >> > I have an osd log file from an osd that hit a suicide timeout (with
>> >> > the
>> >> previous 10000 events logged).
>> >> > (On this node I have also seen this suicide timeout happen once
>> >> > before and
>> >> also a sync_entry timeout.
>> >> >
>> >> > I can see that 6 minutes or so before that osd died, other osds on
>> >> > the same node were logging messages such as
>> >> >     heartbeat_check: no reply from osd.8 so it appears that osd8
>> >> > stopped responding quite some time before it died.
>> >>
>> >> It does this deliberately when its disk threads look like they might
>> >> be stuck.
>> >>
>> >> >
>> >> > I'm wondering if there is enough information in the osd8 log file
>> >> > to deduce
>> >> why osd 8 stopped responding?
>> >> > I don't know enough to figure it out myself.
>> >> >
>> >> > Is there any expert who would be willing to take a look at the log file?
>> >>
>> >> The logs will have a backtrace in them; if you can include that and
>> >> the last hundred lines or so prior in a pastebin that you email the
>> >> list with several people can give you a pretty good idea of what's going
>> on.
>> >>
>> >> In general though, it's just going to be that the disk can't keep up
>> >> with the load being applied to it. That could be because it's
>> >> failing, or because you're pushing too much work on to it in some fashion.
>> >> -Greg



-- 
thanks
huangjun


