On Wed, Apr 08, 2015 at 03:42:29PM +0000, Gregory Farnum wrote:
> Im on my phone so can't check exactly what those threads are trying to do,
> but the osd has several threads which are stuck. The FileStore threads are
> certainly trying to access the disk/local filesystem. You may not have a
> hardware fault, but it looks like something in your stack is not behaving
> when the osd asks the filesystem to do something. Check dmesg, etc.
> -Greg

Noticed a bit in dmesg that seems to be controller-related (HP Smart Array
P420i) where I/O was hanging in some cases[1]; fixed by updating from 5.42
to 6.00

[1] http://h20564.www2.hp.com/hpsc/doc/public/display?docId=emr_na-c03555882

In dmesg:

[11775.779477] hpsa 0000:08:00.0: ABORT REQUEST on C1:B0:T0:L0 Tag:0x00000000:00000010 Command:0x2a SN:0x49fb REQUEST SUCCEEDED.
[11812.170350] hpsa 0000:08:00.0: Abort request on C1:B0:T0:L0
[11817.386773] hpsa 0000:08:00.0: cp ffff880522bff000 is reported invalid (probably means target device no longer present)
[11817.386784] hpsa 0000:08:00.0: ABORT REQUEST on C1:B0:T0:L0 Tag:0x00000000:00000010 Command:0x2a SN:0x4a13 REQUEST SUCCEEDED.
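For anyone wanting to check whether the controller is still aborting requests after the update, here is a rough sketch that counts hpsa abort lines per device. The regular expression is inferred only from the four dmesg lines above, and the function name `count_hpsa_aborts` is made up for illustration; other driver versions may word these messages differently.

```python
import re
from collections import Counter

# Pattern is an assumption based solely on the dmesg excerpt above.
ABORT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+hpsa\s+(?P<pci>\S+):\s+"
    r"abort request on\s+(?P<target>C\d+:B\d+:T\d+:L\d+)",
    re.IGNORECASE,
)

def count_hpsa_aborts(lines):
    """Count abort requests per (PCI address, target) pair."""
    counts = Counter()
    for line in lines:
        match = ABORT_RE.search(line)
        if match:
            counts[(match.group("pci"), match.group("target"))] += 1
    return counts

sample = [
    "[11775.779477] hpsa 0000:08:00.0: ABORT REQUEST on C1:B0:T0:L0 Tag:0x00000000:00000010 Command:0x2a SN:0x49fb REQUEST SUCCEEDED.",
    "[11812.170350] hpsa 0000:08:00.0: Abort request on C1:B0:T0:L0",
    "[11817.386773] hpsa 0000:08:00.0: cp ffff880522bff000 is reported invalid (probably means target device no longer present)",
    "[11817.386784] hpsa 0000:08:00.0: ABORT REQUEST on C1:B0:T0:L0 Tag:0x00000000:00000010 Command:0x2a SN:0x4a13 REQUEST SUCCEEDED.",
]

for (pci, target), n in count_hpsa_aborts(sample).items():
    print(pci, target, n)
```

Feeding it the lines of a saved dmesg dump instead of `sample` should show whether any new aborts appear after the update.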

The problem still appears to persist in the cluster. Although I am no
longer seeing the disk-related errors in dmesg, I am still getting errors
in the osd logs:

2015-04-08 17:24:15.024820 7f0f21e9f700 1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
2015-04-08 17:24:15.025043 7f0f2169e700 1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
2015-04-08 17:48:33.146399 7f0f21e9f700 1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
2015-04-08 17:48:33.146439 7f0f2169e700 1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
2015-04-08 18:55:31.107727 7f0f16740700 1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
2015-04-08 18:55:31.107774 7f0f2169e700 1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
2015-04-08 18:55:31.107789 7f0f21e9f700 1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
2015-04-08 18:55:31.108225 7f0f29eaf700 1 heartbeat_map is_healthy 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
2015-04-08 18:55:31.108268 7f0f15f3f700 1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
2015-04-08 18:55:31.108272 7f0f29eaf700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
2015-04-08 18:55:31.108281 7f0f29eaf700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
2015-04-08 18:55:31.108285 7f0f1573e700 1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
2015-04-08 18:55:31.108345 7f0f16f41700 1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
2015-04-08 18:55:31.108378 7f0f17742700 1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
2015-04-08 19:01:20.694897 7f0f15f3f700 1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
2015-04-08 19:01:20.694928 7f0f17742700 1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
2015-04-08 19:01:20.694970 7f0f16f41700 1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
2015-04-08 19:01:20.695544 7f0f1573e700 1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
2015-04-08 19:01:20.695665 7f0f16740700 1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
2015-04-08 19:01:34.979288 7f0f1573e700 1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
2015-04-08 19:01:34.979498 7f0f21e9f700 1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
2015-04-08 19:01:34.979513 7f0f16f41700 1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
2015-04-08 19:01:34.979535 7f0f2169e700 1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
2015-04-08 19:01:34.980021 7f0f15f3f700 1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
2015-04-08 19:01:34.980051 7f0f17742700 1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
2015-04-08 19:01:34.980392 7f0f16740700 1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
2015-04-08 19:03:34.731872 7f0f1573e700 1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
2015-04-08 19:03:34.731972 7f0f21e9f700 1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
2015-04-08 19:03:34.732686 7f0f29eaf700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
2015-04-08 19:03:34.732717 7f0f29eaf700 1 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
2015-04-08 19:03:34.732736 7f0f29eaf700 1 heartbeat_map is_healthy 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
2015-04-08 19:03:34.732740 7f0f29eaf700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
2015-04-08 19:03:34.732744 7f0f29eaf700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
2015-04-08 19:03:34.733145 7f0f2169e700 1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
2015-04-08 19:03:34.734826 7f0f16f41700 1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
2015-04-08 19:03:34.734857 7f0f17742700 1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
2015-04-08 19:03:34.734875 7f0f15f3f700 1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
2015-04-08 19:03:34.734892 7f0f16740700 1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
2015-04-08 19:04:19.294759 7f0f15f3f700 1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
2015-04-08 19:04:19.294790 7f0f16f41700 1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
2015-04-08 19:04:19.294807 7f0f17742700 1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
2015-04-08 19:04:19.294823 7f0f16740700 1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
2015-04-08 19:04:19.294837 7f0f1573e700 1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
2015-04-08 19:04:49.917763 7f0f17742700 1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
2015-04-08 19:04:49.917791 7f0f15f3f700 1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
2015-04-08 19:04:49.917809 7f0f16f41700 1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
2015-04-08 19:04:49.917842 7f0f16740700 1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
2015-04-08 19:04:49.917879 7f0f1573e700 1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
2015-04-08 19:07:22.139097 7f0f1573e700 1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
2015-04-08 19:07:22.139258 7f0f29eaf700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
2015-04-08 19:07:22.139274 7f0f29eaf700 1 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
2015-04-08 19:07:22.139279 7f0f29eaf700 1 heartbeat_map is_healthy 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
2015-04-08 19:07:22.139284 7f0f29eaf700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
2015-04-08 19:07:22.139287 7f0f29eaf700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
2015-04-08 19:07:22.139293 7f0f29eaf700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
2015-04-08 19:07:22.139323 7f0f16740700 1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
2015-04-08 19:07:22.139341 7f0f21e9f700 1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
2015-04-08 19:07:22.139358 7f0f2169e700 1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
2015-04-08 19:07:22.139851 7f0f16f41700 1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
2015-04-08 19:07:22.139875 7f0f17742700 1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
2015-04-08 19:07:22.139894 7f0f15f3f700 1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
2015-04-08 19:07:31.648896 7f0f16f41700 1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
2015-04-08 19:07:31.648993 7f0f1573e700 1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
2015-04-08 19:07:31.649019 7f0f16740700 1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
2015-04-08 19:07:31.649050 7f0f15f3f700 1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
2015-04-08 19:07:31.649130 7f0f17742700 1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
2015-04-08 19:08:52.319900 7f0f15f3f700 1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
2015-04-08 19:08:52.319934 7f0f1573e700 1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
2015-04-08 19:08:52.319951 7f0f16740700 1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
2015-04-08 19:08:52.320478 7f0f17742700 1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
2015-04-08 19:08:52.320533 7f0f16f41700 1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
2015-04-08 19:09:18.455058 7f0f16f41700 1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
2015-04-08 19:09:18.455176 7f0f17742700 1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
2015-04-08 19:09:18.455243 7f0f1573e700 1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
2015-04-08 19:09:18.455247 7f0f15f3f700 1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
2015-04-08 19:09:18.455261 7f0f21e9f700 1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
2015-04-08 19:09:18.455283 7f0f16740700 1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
2015-04-08 19:09:18.455300 7f0f2169e700 1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
2015-04-08 19:09:50.180556 7f0f2169e700 1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
2015-04-08 19:09:50.180797 7f0f29eaf700 1 heartbeat_map is_healthy 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
2015-04-08 19:09:50.180822 7f0f29eaf700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
2015-04-08 19:09:50.180829 7f0f29eaf700 1 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
2015-04-08 19:09:50.180859 7f0f29eaf700 1 heartbeat_map is_healthy 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
2015-04-08 19:09:50.180863 7f0f29eaf700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
2015-04-08 19:09:50.181066 7f0f29eaf700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
2015-04-08 19:09:50.181089 7f0f16f41700 1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
2015-04-08 19:09:50.181385 7f0f1573e700 1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
2015-04-08 19:09:50.181635 7f0f21e9f700 1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
2015-04-08 19:09:50.181637 7f0f16740700 1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
2015-04-08 19:09:50.181653 7f0f15f3f700 1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
2015-04-08 19:09:50.181666 7f0f17742700 1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
2015-04-08 19:10:35.118758 7f0f16740700 1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
2015-04-08 19:10:35.118792 7f0f17742700 1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
2015-04-08 19:10:35.119429 7f0f29eaf700 1 heartbeat_map is_healthy 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
2015-04-08 19:10:35.119455 7f0f29eaf700 1 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
2015-04-08 19:10:35.119479 7f0f29eaf700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
2015-04-08 19:10:35.119484 7f0f29eaf700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
2015-04-08 19:10:35.119488 7f0f29eaf700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
2015-04-08 19:10:35.119506 7f0f21e9f700 1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
2015-04-08 19:10:35.119526 7f0f16f41700 1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
2015-04-08 19:10:35.119541 7f0f2169e700 1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
2015-04-08 19:10:35.120129 7f0f15f3f700 1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
2015-04-08 19:10:35.120164 7f0f1573e700 1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
2015-04-08 19:10:50.073367 7f0f21e9f700 1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
2015-04-08 19:10:50.073413 7f0f2169e700 1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4

The OSDs are still down despite having caught up to the current osdmap
(newest_map 3814 matches e3814):

# ceph osd stat
osdmap e3814: 16 osds: 10 up, 10 in

# ceph daemon osd.15 status
{ "cluster_fsid": "****",
  "osd_fsid": "****",
  "whoami": 15,
  "state": "booting",
  "oldest_map": 2527,
  "newest_map": 3814,
  "num_pgs": 0}

Any further ideas?