Greg -- Thanks. I put the osd.log file at https://drive.google.com/file/d/0B_rfwWh40kPwQjZ3OXdjLUZNRVU/view?usp=sharing I noticed the following from journalctl output around that time, so other nodes were complaining they could not reach osd.8. Jul 09 15:53:04 seattle-04-ausisv bash[8486]: 2015-07-09 15:53:03.905386 3ffa0d9efd0 -1 osd.9 2487 heartbeat_check: no reply from osd.8 since back 2015-07-09 15:52:43.256581 front 2015-07-09 15:52:43.256581 (cutoff 2015-07-09 15:52:43.905384) Jul 09 15:53:06 seattle-04-ausisv bash[1060]: 2015-07-09 15:53:06.784069 3ff916fefd0 -1 osd.7 2487 heartbeat_check: no reply from osd.8 since back 2015-07-09 15:52:46.474273 front 2015-07-09 15:52:46.474273 (cutoff 2015-07-09 15:52:46.784066) and here is some sar data for the disk that osd.8 was controlling (sde1=journal partition, sde2=data partition) 03:40:02 PM DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util 03:50:17 PM sde1 9.67 0.00 19046.40 1970.32 0.35 35.86 30.62 29.60 03:50:17 PM sde2 60.47 1524.27 14597.67 266.63 24.45 404.30 8.54 51.67 03:50:32 PM sde1 12.13 0.00 18158.93 1496.62 0.25 20.66 17.58 21.33 03:50:32 PM sde2 28.00 1550.93 17958.33 696.76 10.54 376.50 13.52 37.87 03:50:47 PM sde1 12.73 0.00 25446.40 1998.41 0.31 24.19 22.30 28.40 03:50:47 PM sde2 51.60 338.67 18091.73 357.18 13.05 252.91 8.02 41.40 03:51:02 PM sde1 12.27 0.00 18790.40 1531.83 0.31 25.33 18.53 22.73 03:51:02 PM sde2 33.13 2635.20 18026.67 623.60 5.02 151.57 10.99 36.40 03:51:17 PM sde1 10.13 0.00 14557.87 1436.63 0.16 16.18 12.76 12.93 03:51:17 PM sde2 46.73 1107.73 12067.00 281.91 8.55 182.88 5.46 25.53 03:51:32 PM sde1 11.93 0.00 18594.13 1558.17 0.35 29.27 16.42 19.60 03:51:32 PM sde2 22.20 555.20 18834.33 873.40 4.24 191.08 13.51 30.00 03:51:47 PM sde1 18.00 0.00 13926.40 773.69 0.19 10.78 10.07 18.13 03:51:47 PM sde2 47.27 1652.80 10775.53 262.94 12.24 259.01 6.66 31.47 03:52:02 PM sde1 21.60 0.00 10845.87 502.12 0.24 11.08 9.75 21.07 03:52:02 PM sde2 34.33 1652.80 9089.13 312.87 7.43 216.41 8.45 29.00 03:52:17 PM sde1 19.87 0.00 20198.40 1016.70 0.33 16.85 13.46 26.73 03:52:17 PM sde2 35.60 2752.53 16355.53 536.74 11.90 333.33 10.90 38.80 03:52:32 PM sde1 22.54 0.00 8434.04 374.18 0.15 6.67 6.17 13.90 03:52:32 PM sde2 35.84 2738.30 4586.30 204.38 2.01 28.11 6.53 23.40 03:52:47 PM sde1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 03:52:47 PM sde2 13.37 35.83 1101.80 85.09 1.87 218.65 5.75 7.69 03:53:02 PM sde1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 03:53:02 PM sde2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 03:53:17 PM sde1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 03:53:17 PM sde2 0.13 0.00 0.20 1.50 0.00 20.00 20.00 0.27 03:53:32 PM sde1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 03:53:32 PM sde2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 03:53:47 PM sde1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 03:53:47 PM sde2 0.13 0.00 0.20 1.50 0.00 5.00 5.00 0.07 03:54:02 PM sde1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 03:54:02 PM sde2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 03:54:17 PM sde1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 03:54:17 PM sde2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 03:54:32 PM sde1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 03:54:32 PM sde2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -- Tom > -----Original Message----- > From: Gregory Farnum [mailto:greg@xxxxxxxxxxx] > Sent: Monday, July 13, 2015 5:07 AM > To: Deneau, Tom > Cc: ceph-devel > Subject: Re: osd suicide timeout > > On Fri, Jul 10, 2015 at 10:45 PM, Deneau, Tom <tom.deneau@xxxxxxx> wrote: > > I have an osd log file from an osd that hit a suicide timeout (with the > previous 10000 events logged). > > (On this node I have also seen this suicide timeout happen once before and > also a sync_entry timeout. > > > > I can see that 6 minutes or so before that osd died, other osds on the > > same node were logging messages such as > > heartbeat_check: no reply from osd.8 so it appears that osd8 > > stopped responding quite some time before it died. > > It does this deliberately when its disk threads look like they might be > stuck. > > > > > I'm wondering if there is enough information in the osd8 log file to deduce > why osd 8 stopped responding? > > I don't know enough to figure it out myself. > > > > Is there any expert who would be willing to take a look at the log file? > > The logs will have a backtrace in them; if you can include that and the last > hundred lines or so prior in a pastebin that you email the list with several > people can give you a pretty good idea of what's going on. > > In general though, it's just going to be that the disk can't keep up with the > load being applied to it. That could be because it's failing, or because > you're pushing too much work on to it in some fashion. > -Greg ��.n��������+%������w��{.n����z��u���ܨ}���Ơz�j:+v�����w����ޙ��&�)ߡ�a����z�ޗ���ݢj��w�f