Re: OSDs not coming up on one host

On Wed, Apr 08, 2015 at 03:42:29PM +0000, Gregory Farnum wrote:
> I'm on my phone so can't check exactly what those threads are trying to do,
> but the osd has several threads which are stuck. The FileStore threads are
> certainly trying to access the disk/local filesystem. You may not have a
> hardware fault, but it looks like something in your stack is not behaving
> when the osd asks the filesystem to do something. Check dmesg, etc.
> -Greg


I noticed some messages in dmesg that appear to be controller-related (HP Smart Array P420i) and match a known issue where I/O hangs in some cases [1]. I addressed that by updating the controller from version 5.42 to 6.00.

[1] http://h20564.www2.hp.com/hpsc/doc/public/display?docId=emr_na-c03555882

In dmesg:
[11775.779477] hpsa 0000:08:00.0: ABORT REQUEST on C1:B0:T0:L0 Tag:0x00000000:00000010 Command:0x2a SN:0x49fb  REQUEST SUCCEEDED.
[11812.170350] hpsa 0000:08:00.0: Abort request on C1:B0:T0:L0
[11817.386773] hpsa 0000:08:00.0: cp ffff880522bff000 is reported invalid (probably means target device no longer present)
[11817.386784] hpsa 0000:08:00.0: ABORT REQUEST on C1:B0:T0:L0 Tag:0x00000000:00000010 Command:0x2a SN:0x4a13  REQUEST SUCCEEDED.

The problem still appears to persist in the cluster. Although I am no longer seeing the disk-related errors in dmesg, I am still getting errors in the OSD logs:

2015-04-08 17:24:15.024820 7f0f21e9f700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
2015-04-08 17:24:15.025043 7f0f2169e700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
2015-04-08 17:48:33.146399 7f0f21e9f700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
2015-04-08 17:48:33.146439 7f0f2169e700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
2015-04-08 18:55:31.107727 7f0f16740700  1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
2015-04-08 18:55:31.107774 7f0f2169e700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
2015-04-08 18:55:31.107789 7f0f21e9f700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
2015-04-08 18:55:31.108225 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
2015-04-08 18:55:31.108268 7f0f15f3f700  1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
2015-04-08 18:55:31.108272 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
2015-04-08 18:55:31.108281 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
2015-04-08 18:55:31.108285 7f0f1573e700  1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
2015-04-08 18:55:31.108345 7f0f16f41700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
2015-04-08 18:55:31.108378 7f0f17742700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
2015-04-08 19:01:20.694897 7f0f15f3f700  1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
2015-04-08 19:01:20.694928 7f0f17742700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
2015-04-08 19:01:20.694970 7f0f16f41700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
2015-04-08 19:01:20.695544 7f0f1573e700  1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
2015-04-08 19:01:20.695665 7f0f16740700  1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
2015-04-08 19:01:34.979288 7f0f1573e700  1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
2015-04-08 19:01:34.979498 7f0f21e9f700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
2015-04-08 19:01:34.979513 7f0f16f41700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
2015-04-08 19:01:34.979535 7f0f2169e700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
2015-04-08 19:01:34.980021 7f0f15f3f700  1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
2015-04-08 19:01:34.980051 7f0f17742700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
2015-04-08 19:01:34.980392 7f0f16740700  1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
2015-04-08 19:03:34.731872 7f0f1573e700  1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
2015-04-08 19:03:34.731972 7f0f21e9f700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
2015-04-08 19:03:34.732686 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
2015-04-08 19:03:34.732717 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
2015-04-08 19:03:34.732736 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
2015-04-08 19:03:34.732740 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
2015-04-08 19:03:34.732744 7f0f29eaf700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
2015-04-08 19:03:34.733145 7f0f2169e700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
2015-04-08 19:03:34.734826 7f0f16f41700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
2015-04-08 19:03:34.734857 7f0f17742700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
2015-04-08 19:03:34.734875 7f0f15f3f700  1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
2015-04-08 19:03:34.734892 7f0f16740700  1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
2015-04-08 19:04:19.294759 7f0f15f3f700  1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
2015-04-08 19:04:19.294790 7f0f16f41700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
2015-04-08 19:04:19.294807 7f0f17742700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
2015-04-08 19:04:19.294823 7f0f16740700  1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
2015-04-08 19:04:19.294837 7f0f1573e700  1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
2015-04-08 19:04:49.917763 7f0f17742700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
2015-04-08 19:04:49.917791 7f0f15f3f700  1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
2015-04-08 19:04:49.917809 7f0f16f41700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
2015-04-08 19:04:49.917842 7f0f16740700  1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
2015-04-08 19:04:49.917879 7f0f1573e700  1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
2015-04-08 19:07:22.139097 7f0f1573e700  1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
2015-04-08 19:07:22.139258 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
2015-04-08 19:07:22.139274 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
2015-04-08 19:07:22.139279 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
2015-04-08 19:07:22.139284 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
2015-04-08 19:07:22.139287 7f0f29eaf700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
2015-04-08 19:07:22.139293 7f0f29eaf700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
2015-04-08 19:07:22.139323 7f0f16740700  1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
2015-04-08 19:07:22.139341 7f0f21e9f700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
2015-04-08 19:07:22.139358 7f0f2169e700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
2015-04-08 19:07:22.139851 7f0f16f41700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
2015-04-08 19:07:22.139875 7f0f17742700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
2015-04-08 19:07:22.139894 7f0f15f3f700  1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
2015-04-08 19:07:31.648896 7f0f16f41700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
2015-04-08 19:07:31.648993 7f0f1573e700  1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
2015-04-08 19:07:31.649019 7f0f16740700  1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
2015-04-08 19:07:31.649050 7f0f15f3f700  1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
2015-04-08 19:07:31.649130 7f0f17742700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
2015-04-08 19:08:52.319900 7f0f15f3f700  1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
2015-04-08 19:08:52.319934 7f0f1573e700  1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
2015-04-08 19:08:52.319951 7f0f16740700  1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
2015-04-08 19:08:52.320478 7f0f17742700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
2015-04-08 19:08:52.320533 7f0f16f41700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
2015-04-08 19:09:18.455058 7f0f16f41700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
2015-04-08 19:09:18.455176 7f0f17742700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
2015-04-08 19:09:18.455243 7f0f1573e700  1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
2015-04-08 19:09:18.455247 7f0f15f3f700  1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
2015-04-08 19:09:18.455261 7f0f21e9f700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
2015-04-08 19:09:18.455283 7f0f16740700  1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
2015-04-08 19:09:18.455300 7f0f2169e700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
2015-04-08 19:09:50.180556 7f0f2169e700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
2015-04-08 19:09:50.180797 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
2015-04-08 19:09:50.180822 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
2015-04-08 19:09:50.180829 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
2015-04-08 19:09:50.180859 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
2015-04-08 19:09:50.180863 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
2015-04-08 19:09:50.181066 7f0f29eaf700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
2015-04-08 19:09:50.181089 7f0f16f41700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
2015-04-08 19:09:50.181385 7f0f1573e700  1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
2015-04-08 19:09:50.181635 7f0f21e9f700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
2015-04-08 19:09:50.181637 7f0f16740700  1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
2015-04-08 19:09:50.181653 7f0f15f3f700  1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
2015-04-08 19:09:50.181666 7f0f17742700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
2015-04-08 19:10:35.118758 7f0f16740700  1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
2015-04-08 19:10:35.118792 7f0f17742700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
2015-04-08 19:10:35.119429 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
2015-04-08 19:10:35.119455 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
2015-04-08 19:10:35.119479 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
2015-04-08 19:10:35.119484 7f0f29eaf700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
2015-04-08 19:10:35.119488 7f0f29eaf700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
2015-04-08 19:10:35.119506 7f0f21e9f700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
2015-04-08 19:10:35.119526 7f0f16f41700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
2015-04-08 19:10:35.119541 7f0f2169e700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
2015-04-08 19:10:35.120129 7f0f15f3f700  1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
2015-04-08 19:10:35.120164 7f0f1573e700  1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
2015-04-08 19:10:50.073367 7f0f21e9f700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
2015-04-08 19:10:50.073413 7f0f2169e700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
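
To see which thread pools are affected most often, the heartbeat_map lines above can be tallied per pool. The following is only a rough sketch: the function name count_timeouts and the regular expression are my own, and the line format is inferred purely from the excerpt above, not from any Ceph documentation.

```python
import re
from collections import Counter

# Pattern inferred from the log excerpt above (an assumption, not an
# official format), e.g.:
#   ... heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
HEARTBEAT_RE = re.compile(
    r"heartbeat_map (?P<event>\w+) "
    r"'(?P<pool>[\w:]+) thread (?P<tid>0x[0-9a-f]+)' "
    r"had timed out after (?P<secs>\d+)"
)


def count_timeouts(lines):
    """Count heartbeat timeout messages, keyed by (thread pool, event)."""
    counts = Counter()
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match:
            counts[(match.group("pool"), match.group("event"))] += 1
    return counts


# A few lines copied from the excerpt above.
sample = [
    "2015-04-08 17:24:15.024820 7f0f21e9f700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4",
    "2015-04-08 17:24:15.025043 7f0f2169e700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4",
    "2015-04-08 18:55:31.107727 7f0f16740700  1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4",
    "2015-04-08 18:55:31.108225 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4",
]

counts = count_timeouts(sample)
for (pool, event), n in sorted(counts.items()):
    print(f"{pool:20s} {event:15s} {n}")
```

In practice the lines would come from the OSD's log file rather than a hard-coded list; the path to that file depends on the installation.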

The OSDs on that host are still down, even though they have caught up to the current osdmap epoch (3814):

# ceph osd stat
     osdmap e3814: 16 osds: 10 up, 10 in
# ceph daemon osd.15 status
{ "cluster_fsid": "****",
  "osd_fsid": "****",
  "whoami": 15,
  "state": "booting",
  "oldest_map": 2527,
  "newest_map": 3814,
  "num_pgs": 0}
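
The comparison I am making by eye here (the OSD's newest_map against the cluster's osdmap epoch) could be scripted roughly as follows. This is a sketch under assumptions: the helper name summarize_osd_status is made up, the JSON is the status output shown above with the redacted fsids left as they are, and the cluster epoch is passed in by hand rather than queried.

```python
import json


def summarize_osd_status(status_json, cluster_epoch):
    """Report whether an OSD has the current map but is still 'booting'.

    status_json is the text printed by `ceph daemon osd.N status`;
    cluster_epoch is the epoch from `ceph osd stat` (e.g. 3814 for e3814).
    """
    status = json.loads(status_json)
    caught_up = status["newest_map"] >= cluster_epoch
    return {
        "osd": status["whoami"],
        "state": status["state"],
        "caught_up": caught_up,
        # Has the latest map, yet never left the booting state.
        "stuck_booting": caught_up and status["state"] == "booting",
    }


# Status output copied from above (fsids are redacted in the original).
sample = """
{ "cluster_fsid": "****",
  "osd_fsid": "****",
  "whoami": 15,
  "state": "booting",
  "oldest_map": 2527,
  "newest_map": 3814,
  "num_pgs": 0}
"""

summary = summarize_osd_status(sample, cluster_epoch=3814)
print(summary)
```

For osd.15 this flags the OSD as caught up but still booting, which is the situation described above.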

Any further ideas?