RE: osd suicide timeout

"Deneau, Tom" <tom.deneau@xxxxxxx> · Mon, 13 Jul 2015 16:04:36 +0000

Greg --

Thanks.  I put the osd.log file at

https://drive.google.com/file/d/0B_rfwWh40kPwQjZ3OXdjLUZNRVU/view?usp=sharing

I noticed the following in the journalctl output from around that time; other OSDs were complaining that they could not reach osd.8.

Jul 09 15:53:04 seattle-04-ausisv bash[8486]: 2015-07-09 15:53:03.905386 3ffa0d9efd0 -1 osd.9 2487 heartbeat_check: no reply from osd.8 since back 2015-07-09 15:52:43.256581 front 2015-07-09 15:52:43.256581 (cutoff 2015-07-09 15:52:43.905384)
Jul 09 15:53:06 seattle-04-ausisv bash[1060]: 2015-07-09 15:53:06.784069 3ff916fefd0 -1 osd.7 2487 heartbeat_check: no reply from osd.8 since back 2015-07-09 15:52:46.474273 front 2015-07-09 15:52:46.474273 (cutoff 2015-07-09 15:52:46.784066)
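
For anyone who wants to pull the timing out of lines like these, here is a minimal Python sketch. The regular expression is only matched against the two lines shown above, and the function name and the returned field names are my own, not anything from Ceph. It reports how long the reporting OSD had gone without a reply and how wide the cutoff window was.

```python
import re
from datetime import datetime

# Timestamp format used inside the OSD log lines (not the journalctl prefix).
TS = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+"

# Assumed layout, derived only from the two heartbeat_check lines quoted above.
HEARTBEAT_RE = re.compile(
    rf"(?P<now>{TS}) \S+ -1 (?P<reporter>osd\.\d+) \d+ heartbeat_check: "
    rf"no reply from (?P<target>osd\.\d+) since back (?P<back>{TS}) "
    rf"front (?P<front>{TS}) \(cutoff (?P<cutoff>{TS})\)"
)


def _parse_ts(text):
    """Convert an OSD log timestamp into a datetime."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")


def parse_heartbeat_check(line):
    """Return timing details for a heartbeat_check line, or None if no match."""
    m = HEARTBEAT_RE.search(line)
    if m is None:
        return None
    now = _parse_ts(m.group("now"))
    cutoff = _parse_ts(m.group("cutoff"))
    # Use the older of the back/front timestamps as the last reply seen.
    last_reply = min(_parse_ts(m.group("back")), _parse_ts(m.group("front")))
    return {
        "reporter": m.group("reporter"),
        "target": m.group("target"),
        # Seconds between the last reply and the moment of the check.
        "silent_for": (now - last_reply).total_seconds(),
        # Seconds between the cutoff and the moment of the check.
        "window": (now - cutoff).total_seconds(),
    }
```

On the two lines above, the window comes out at about 20 seconds in both cases, which looks like a 20-second heartbeat grace period; that is an inference from the log lines rather than something I have checked in the configuration.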

And here is some sar data for the disk that osd.8 was controlling (sde1 = journal partition, sde2 = data partition):

03:40:02 PM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
03:50:17 PM      sde1      9.67      0.00  19046.40   1970.32      0.35     35.86     30.62     29.60
03:50:17 PM      sde2     60.47   1524.27  14597.67    266.63     24.45    404.30      8.54     51.67
03:50:32 PM      sde1     12.13      0.00  18158.93   1496.62      0.25     20.66     17.58     21.33
03:50:32 PM      sde2     28.00   1550.93  17958.33    696.76     10.54    376.50     13.52     37.87
03:50:47 PM      sde1     12.73      0.00  25446.40   1998.41      0.31     24.19     22.30     28.40
03:50:47 PM      sde2     51.60    338.67  18091.73    357.18     13.05    252.91      8.02     41.40
03:51:02 PM      sde1     12.27      0.00  18790.40   1531.83      0.31     25.33     18.53     22.73
03:51:02 PM      sde2     33.13   2635.20  18026.67    623.60      5.02    151.57     10.99     36.40
03:51:17 PM      sde1     10.13      0.00  14557.87   1436.63      0.16     16.18     12.76     12.93
03:51:17 PM      sde2     46.73   1107.73  12067.00    281.91      8.55    182.88      5.46     25.53
03:51:32 PM      sde1     11.93      0.00  18594.13   1558.17      0.35     29.27     16.42     19.60
03:51:32 PM      sde2     22.20    555.20  18834.33    873.40      4.24    191.08     13.51     30.00
03:51:47 PM      sde1     18.00      0.00  13926.40    773.69      0.19     10.78     10.07     18.13
03:51:47 PM      sde2     47.27   1652.80  10775.53    262.94     12.24    259.01      6.66     31.47
03:52:02 PM      sde1     21.60      0.00  10845.87    502.12      0.24     11.08      9.75     21.07
03:52:02 PM      sde2     34.33   1652.80   9089.13    312.87      7.43    216.41      8.45     29.00
03:52:17 PM      sde1     19.87      0.00  20198.40   1016.70      0.33     16.85     13.46     26.73
03:52:17 PM      sde2     35.60   2752.53  16355.53    536.74     11.90    333.33     10.90     38.80
03:52:32 PM      sde1     22.54      0.00   8434.04    374.18      0.15      6.67      6.17     13.90
03:52:32 PM      sde2     35.84   2738.30   4586.30    204.38      2.01     28.11      6.53     23.40
03:52:47 PM      sde1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
03:52:47 PM      sde2     13.37     35.83   1101.80     85.09      1.87    218.65      5.75      7.69
03:53:02 PM      sde1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
03:53:02 PM      sde2      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
03:53:17 PM      sde1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
03:53:17 PM      sde2      0.13      0.00      0.20      1.50      0.00     20.00     20.00      0.27
03:53:32 PM      sde1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
03:53:32 PM      sde2      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
03:53:47 PM      sde1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
03:53:47 PM      sde2      0.13      0.00      0.20      1.50      0.00      5.00      5.00      0.07
03:54:02 PM      sde1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
03:54:02 PM      sde2      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
03:54:17 PM      sde1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
03:54:17 PM      sde2      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
03:54:32 PM      sde1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
03:54:32 PM      sde2      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
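
In case it helps with reading the table, below is a rough Python sketch that parses sar output in this layout and reports, per device, the mean await while the device was busy and the first sample at which it went idle. The column names are taken from the header above; the function names are made up for illustration, and the parser assumes the 12-hour "HH:MM:SS PM" time format shown here.

```python
# Column names as they appear in the sar header above.
COLUMNS = ["tps", "rd_sec/s", "wr_sec/s", "avgrq-sz",
           "avgqu-sz", "await", "svctm", "%util"]


def parse_sar_disk(text):
    """Parse sar disk lines (time, AM/PM, device, 8 numbers) into dicts."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines, the header row, and anything with an odd shape.
        if len(parts) != 3 + len(COLUMNS) or parts[2] == "DEV":
            continue
        try:
            values = [float(p) for p in parts[3:]]
        except ValueError:
            continue
        row = {"time": parts[0] + " " + parts[1], "dev": parts[2]}
        row.update(zip(COLUMNS, values))
        rows.append(row)
    return rows


def mean_await_while_busy(rows, dev):
    """Average await (ms) over samples where the device did any I/O."""
    values = [r["await"] for r in rows if r["dev"] == dev and r["tps"] > 0]
    return sum(values) / len(values) if values else None


def first_idle_sample(rows, dev):
    """Time of the first sample where the device reports zero tps."""
    for r in rows:
        if r["dev"] == dev and r["tps"] == 0.0:
            return r["time"]
    return None
```

Reading the table by eye, await on sde2 is mostly in the 150-400 ms range while sde1 stays below 40 ms, and the journal partition goes idle at 03:52:47 PM with the data partition following at 03:53:02 PM, which is right around the time of the heartbeat_check complaints.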

-- Tom

> -----Original Message-----
> From: Gregory Farnum [mailto:greg@xxxxxxxxxxx]
> Sent: Monday, July 13, 2015 5:07 AM
> To: Deneau, Tom
> Cc: ceph-devel
> Subject: Re: osd suicide timeout
> 
> On Fri, Jul 10, 2015 at 10:45 PM, Deneau, Tom <tom.deneau@xxxxxxx> wrote:
> > I have an osd log file from an osd that hit a suicide timeout (with the
> > previous 10000 events logged).
> > (On this node I have also seen this suicide timeout happen once before,
> > and also a sync_entry timeout.)
> >
> > I can see that 6 minutes or so before that osd died, other osds on the
> > same node were logging messages such as
> >     heartbeat_check: no reply from osd.8 so it appears that osd8
> > stopped responding quite some time before it died.
> 
> It does this deliberately when its disk threads look like they might be
> stuck.
> 
> >
> > I'm wondering if there is enough information in the osd8 log file to
> > deduce why osd 8 stopped responding?
> > I don't know enough to figure it out myself.
> >
> > Is there any expert who would be willing to take a look at the log file?
> 
> The logs will have a backtrace in them; if you can include that and the last
> hundred lines or so prior in a pastebin that you email to the list, several
> people can give you a pretty good idea of what's going on.
> 
> In general though, it's just going to be that the disk can't keep up with the
> load being applied to it. That could be because it's failing, or because
> you're pushing too much work on to it in some fashion.
> -Greg