Re: OSDs not coming up on one host

You can turn up debugging ("debug osd = 10" and "debug filestore = 10"
are probably enough, or maybe 20 each) and see what comes out to get
more information about why the threads are stuck.
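
As a rough sketch, those two settings would go in the [osd] section of
ceph.conf on the affected host, followed by a restart of the OSD. The
section placement and the level 10 are only a starting point; raise both
to 20 if the output is not detailed enough.

```ini
# Sketch only: raise OSD and FileStore log verbosity on the affected host.
[osd]
    debug osd = 10
    debug filestore = 10
```

Since the admin socket is answering on that host, it should also be possible
to set the same values at runtime without a restart, for example with
"ceph daemon osd.15 config set debug_osd 10" and the matching
debug_filestore option. Remember to turn the levels back down afterwards,
because the logs grow quickly.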

But just from the log my answer is the same as before, and now I don't
trust that controller (or maybe its disks), regardless of what it's
admitting to. ;)
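
If it helps to see which thread pools are timing out most often, the
heartbeat_map lines can be tallied with a few lines of Python. This is only
a sketch: the function name and the regular expression are my own, written
against the log format quoted below, and nothing here is part of Ceph.

```python
import re
import sys
from collections import Counter

# Matches log lines such as:
#   ... heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
# Groups: reporting function, thread pool name, thread id, timeout value.
PATTERN = re.compile(
    r"heartbeat_map (\w+) '(\S+) thread (0x[0-9a-f]+)' had timed out after (\d+)"
)


def tally_timeouts(lines):
    """Count heartbeat timeout messages per thread pool name."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


if __name__ == "__main__":
    # Read an OSD log from standard input and print pools by timeout count.
    for pool, count in tally_timeouts(sys.stdin).most_common():
        print(f"{count:6d}  {pool}")
```

Saved under a name of your choosing (tally.py here is hypothetical), it can
be fed the OSD log on standard input, e.g.
"python3 tally.py < /var/log/ceph/ceph-osd.15.log", assuming the log is in
the default location.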
-Greg

On Thu, Apr 9, 2015 at 1:28 AM, Jacob Reid <lists-ceph@xxxxxxxxxxxxxxxx> wrote:
> On Wed, Apr 08, 2015 at 03:42:29PM +0000, Gregory Farnum wrote:
>> I'm on my phone so I can't check exactly what those threads are trying to do,
>> but the osd has several threads which are stuck. The FileStore threads are
>> certainly trying to access the disk/local filesystem. You may not have a
>> hardware fault, but it looks like something in your stack is not behaving
>> when the osd asks the filesystem to do something. Check dmesg, etc.
>> -Greg
>
>
> I noticed some messages in dmesg that seem to be controller-related (HP Smart Array P420i), where I/O was hanging in some cases [1]; I fixed this by updating from 5.42 to 6.00.
>
> [1] http://h20564.www2.hp.com/hpsc/doc/public/display?docId=emr_na-c03555882
>
> In dmesg:
> [11775.779477] hpsa 0000:08:00.0: ABORT REQUEST on C1:B0:T0:L0 Tag:0x00000000:00000010 Command:0x2a SN:0x49fb  REQUEST SUCCEEDED.
> [11812.170350] hpsa 0000:08:00.0: Abort request on C1:B0:T0:L0
> [11817.386773] hpsa 0000:08:00.0: cp ffff880522bff000 is reported invalid (probably means target device no longer present)
> [11817.386784] hpsa 0000:08:00.0: ABORT REQUEST on C1:B0:T0:L0 Tag:0x00000000:00000010 Command:0x2a SN:0x4a13  REQUEST SUCCEEDED.
>
> The problem still appears to persist in the cluster, though. I am no longer seeing the disk-related errors in dmesg, but I am still getting errors in the osd logs:
>
> 2015-04-08 17:24:15.024820 7f0f21e9f700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
> 2015-04-08 17:24:15.025043 7f0f2169e700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
> 2015-04-08 17:48:33.146399 7f0f21e9f700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
> 2015-04-08 17:48:33.146439 7f0f2169e700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
> 2015-04-08 18:55:31.107727 7f0f16740700  1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
> 2015-04-08 18:55:31.107774 7f0f2169e700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
> 2015-04-08 18:55:31.107789 7f0f21e9f700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
> 2015-04-08 18:55:31.108225 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
> 2015-04-08 18:55:31.108268 7f0f15f3f700  1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
> 2015-04-08 18:55:31.108272 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
> 2015-04-08 18:55:31.108281 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
> 2015-04-08 18:55:31.108285 7f0f1573e700  1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
> 2015-04-08 18:55:31.108345 7f0f16f41700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
> 2015-04-08 18:55:31.108378 7f0f17742700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
> 2015-04-08 19:01:20.694897 7f0f15f3f700  1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
> 2015-04-08 19:01:20.694928 7f0f17742700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
> 2015-04-08 19:01:20.694970 7f0f16f41700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
> 2015-04-08 19:01:20.695544 7f0f1573e700  1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
> 2015-04-08 19:01:20.695665 7f0f16740700  1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
> 2015-04-08 19:01:34.979288 7f0f1573e700  1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
> 2015-04-08 19:01:34.979498 7f0f21e9f700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
> 2015-04-08 19:01:34.979513 7f0f16f41700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
> 2015-04-08 19:01:34.979535 7f0f2169e700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
> 2015-04-08 19:01:34.980021 7f0f15f3f700  1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
> 2015-04-08 19:01:34.980051 7f0f17742700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
> 2015-04-08 19:01:34.980392 7f0f16740700  1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
> 2015-04-08 19:03:34.731872 7f0f1573e700  1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
> 2015-04-08 19:03:34.731972 7f0f21e9f700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
> 2015-04-08 19:03:34.732686 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
> 2015-04-08 19:03:34.732717 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
> 2015-04-08 19:03:34.732736 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
> 2015-04-08 19:03:34.732740 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
> 2015-04-08 19:03:34.732744 7f0f29eaf700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
> 2015-04-08 19:03:34.733145 7f0f2169e700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
> 2015-04-08 19:03:34.734826 7f0f16f41700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
> 2015-04-08 19:03:34.734857 7f0f17742700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
> 2015-04-08 19:03:34.734875 7f0f15f3f700  1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
> 2015-04-08 19:03:34.734892 7f0f16740700  1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
> 2015-04-08 19:04:19.294759 7f0f15f3f700  1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
> 2015-04-08 19:04:19.294790 7f0f16f41700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
> 2015-04-08 19:04:19.294807 7f0f17742700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
> 2015-04-08 19:04:19.294823 7f0f16740700  1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
> 2015-04-08 19:04:19.294837 7f0f1573e700  1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
> 2015-04-08 19:04:49.917763 7f0f17742700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
> 2015-04-08 19:04:49.917791 7f0f15f3f700  1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
> 2015-04-08 19:04:49.917809 7f0f16f41700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
> 2015-04-08 19:04:49.917842 7f0f16740700  1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
> 2015-04-08 19:04:49.917879 7f0f1573e700  1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
> 2015-04-08 19:07:22.139097 7f0f1573e700  1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
> 2015-04-08 19:07:22.139258 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
> 2015-04-08 19:07:22.139274 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
> 2015-04-08 19:07:22.139279 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
> 2015-04-08 19:07:22.139284 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
> 2015-04-08 19:07:22.139287 7f0f29eaf700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
> 2015-04-08 19:07:22.139293 7f0f29eaf700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
> 2015-04-08 19:07:22.139323 7f0f16740700  1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
> 2015-04-08 19:07:22.139341 7f0f21e9f700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
> 2015-04-08 19:07:22.139358 7f0f2169e700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
> 2015-04-08 19:07:22.139851 7f0f16f41700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
> 2015-04-08 19:07:22.139875 7f0f17742700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
> 2015-04-08 19:07:22.139894 7f0f15f3f700  1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
> 2015-04-08 19:07:31.648896 7f0f16f41700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
> 2015-04-08 19:07:31.648993 7f0f1573e700  1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
> 2015-04-08 19:07:31.649019 7f0f16740700  1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
> 2015-04-08 19:07:31.649050 7f0f15f3f700  1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
> 2015-04-08 19:07:31.649130 7f0f17742700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
> 2015-04-08 19:08:52.319900 7f0f15f3f700  1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
> 2015-04-08 19:08:52.319934 7f0f1573e700  1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
> 2015-04-08 19:08:52.319951 7f0f16740700  1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
> 2015-04-08 19:08:52.320478 7f0f17742700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
> 2015-04-08 19:08:52.320533 7f0f16f41700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
> 2015-04-08 19:09:18.455058 7f0f16f41700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
> 2015-04-08 19:09:18.455176 7f0f17742700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
> 2015-04-08 19:09:18.455243 7f0f1573e700  1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
> 2015-04-08 19:09:18.455247 7f0f15f3f700  1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
> 2015-04-08 19:09:18.455261 7f0f21e9f700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
> 2015-04-08 19:09:18.455283 7f0f16740700  1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
> 2015-04-08 19:09:18.455300 7f0f2169e700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
> 2015-04-08 19:09:50.180556 7f0f2169e700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
> 2015-04-08 19:09:50.180797 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
> 2015-04-08 19:09:50.180822 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
> 2015-04-08 19:09:50.180829 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
> 2015-04-08 19:09:50.180859 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
> 2015-04-08 19:09:50.180863 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
> 2015-04-08 19:09:50.181066 7f0f29eaf700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
> 2015-04-08 19:09:50.181089 7f0f16f41700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
> 2015-04-08 19:09:50.181385 7f0f1573e700  1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
> 2015-04-08 19:09:50.181635 7f0f21e9f700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
> 2015-04-08 19:09:50.181637 7f0f16740700  1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
> 2015-04-08 19:09:50.181653 7f0f15f3f700  1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
> 2015-04-08 19:09:50.181666 7f0f17742700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
> 2015-04-08 19:10:35.118758 7f0f16740700  1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f0f16740700' had timed out after 4
> 2015-04-08 19:10:35.118792 7f0f17742700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f17742700' had timed out after 4
> 2015-04-08 19:10:35.119429 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
> 2015-04-08 19:10:35.119455 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
> 2015-04-08 19:10:35.119479 7f0f29eaf700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
> 2015-04-08 19:10:35.119484 7f0f29eaf700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
> 2015-04-08 19:10:35.119488 7f0f29eaf700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
> 2015-04-08 19:10:35.119506 7f0f21e9f700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
> 2015-04-08 19:10:35.119526 7f0f16f41700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f0f16f41700' had timed out after 4
> 2015-04-08 19:10:35.119541 7f0f2169e700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
> 2015-04-08 19:10:35.120129 7f0f15f3f700  1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f0f15f3f700' had timed out after 4
> 2015-04-08 19:10:35.120164 7f0f1573e700  1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f0f1573e700' had timed out after 4
> 2015-04-08 19:10:50.073367 7f0f21e9f700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f21e9f700' had timed out after 4
> 2015-04-08 19:10:50.073413 7f0f2169e700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f0f2169e700' had timed out after 4
>
> The OSDs are still down despite having caught up:
>
> # ceph osd stat
>      osdmap e3814: 16 osds: 10 up, 10 in
> # ceph daemon osd.15 status
> { "cluster_fsid": "****",
>   "osd_fsid": "****",
>   "whoami": 15,
>   "state": "booting",
>   "oldest_map": 2527,
>   "newest_map": 3814,
>   "num_pgs": 0}
>
> Any further ideas?
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



