jewel OSDs refuse to start up again

The three OSDs holding the 3 replicas of one PG here are only half-starting, and hence that single PG is stuck as "stale+active+clean".
All of them died of a suicide timeout while walking over a huge omap (pool 7, 'default.rgw.buckets.index'), and they will not bring PG 7.b back online again.
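
(For anyone who wants to look at the same PG, these are the obvious queries for it from a mon node; I'm leaving their output out for brevity:)

ceph health detail | grep 7.b
ceph pg dump_stuck stale
ceph pg map 7.b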

From the logs, they start up normally, do a bit of leveldb recovery, open the journal, and then say nothing more.

2019-11-19 15:15:46.967543 7fe644fad840  0 set uid:gid to 167:167 (ceph:ceph)
2019-11-19 15:15:46.967600 7fe644fad840  0 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-osd, pid 5149
2019-11-19 15:15:47.026065 7fe644fad840  0 pidfile_write: ignore empty --pid-file
2019-11-19 15:15:47.078291 7fe644fad840  0 filestore(/var/lib/ceph/osd/ceph-22) backend xfs (magic 0x58465342)
2019-11-19 15:15:47.079317 7fe644fad840  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-22) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2019-11-19 15:15:47.079331 7fe644fad840  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-22) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2019-11-19 15:15:47.079352 7fe644fad840  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-22) detect_features: splice is supported
2019-11-19 15:15:47.080287 7fe644fad840  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-22) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2019-11-19 15:15:47.080529 7fe644fad840  0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-22) detect_feature: extsize is disabled by conf
2019-11-19 15:15:47.095819 7fe644fad840  1 leveldb: Recovering log #2731809
2019-11-19 15:15:47.119792 7fe644fad840  1 leveldb: Level-0 table #2731812: started
2019-11-19 15:15:47.132107 7fe644fad840  1 leveldb: Level-0 table #2731812: 140642 bytes OK
2019-11-19 15:15:47.143782 7fe644fad840  1 leveldb: Delete type=0 #2731809

2019-11-19 15:15:47.147198 7fe644fad840  1 leveldb: Delete type=3 #2731792

2019-11-19 15:15:47.159339 7fe644fad840  0 filestore(/var/lib/ceph/osd/ceph-22) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2019-11-19 15:15:47.243262 7fe644fad840  1 journal _open /var/lib/ceph/osd/ceph-22/journal fd 18: 21472739328 bytes, block size 4096 bytes, directio = 1, aio = 1

At this point they consume a ton of CPU, systemd thinks all is fine, and this has been going on for some five hours.
ceph -s thinks they are down; I can't talk to the OSDs remotely from a mon, but ceph daemon on the OSD hosts works normally, except that I can't do anything from there other than get config or perf numbers.
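
(For what it's worth, this is the kind of thing that still answers over the admin socket; osd.22 is just the example here, and the two timeouts are the ones I assume are behind the suicides:)

ceph daemon osd.22 config get osd_op_thread_suicide_timeout
ceph daemon osd.22 config get filestore_op_thread_suicide_timeout
ceph daemon osd.22 perf dump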

Strace shows they all keep looping over the same sequence:
machine1:

stat("/var/lib/ceph/osd/ceph-270/current/7.b_head/DIR_B/DIR_4", {st_mode=S_IFDIR|0755, st_size=24576, ...}) = 0
stat("/var/lib/ceph/osd/ceph-270/current/7.b_head/DIR_B/DIR_4/DIR_D", 0x7fffd7c98080) = -1 ENOENT (No such file or directory)
stat("/var/lib/ceph/osd/ceph-270/current/7.b_head/DIR_B/DIR_4/\\.dir.31716e6b-28c9-42e6-81ed-d27e3b714a9c.47687923.1711__head_6D57DD4B__7", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
stat("/var/lib/ceph/osd/ceph-270/current/7.b_head", {st_mode=S_IFDIR|0755, st_size=8192, ...}) = 0
stat("/var/lib/ceph/osd/ceph-270/current/7.b_head/DIR_B", {st_mode=S_IFDIR|0755, st_size=8192, ...}) = 0
stat("/var/lib/ceph/osd/ceph-270/current/7.b_head/DIR_B/DIR_4", {st_mode=S_IFDIR|0755, st_size=24576, ...}) = 0
stat("/var/lib/ceph/osd/ceph-270/current/7.b_head/DIR_B/DIR_4/DIR_D", 0x7fffd7c98080) = -1 ENOENT (No such file or directory)
stat("/var/lib/ceph/osd/ceph-270/current/7.b_head/DIR_B/DIR_4/\\.dir.31716e6b-28c9-42e6-81ed-d27e3b714a9c.47687923.1711__head_6D57DD4B__7", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0

machine2:

stat("/var/lib/ceph/osd/ceph-243/current/7.b_head/DIR_B/DIR_4", {st_mode=S_IFDIR|0755, st_size=24576, ...}) = 0
stat("/var/lib/ceph/osd/ceph-243/current/7.b_head/DIR_B/DIR_4/DIR_D", 0x7ffe0b664240) = -1 ENOENT (No such file or directory)
stat("/var/lib/ceph/osd/ceph-243/current/7.b_head/DIR_B/DIR_4/\\.dir.31716e6b-28c9-42e6-81ed-d27e3b714a9c.47687923.1711__head_6D57DD4B__7", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
stat("/var/lib/ceph/osd/ceph-243/current/7.b_head", {st_mode=S_IFDIR|0755, st_size=8192, ...}) = 0
stat("/var/lib/ceph/osd/ceph-243/current/7.b_head/DIR_B", {st_mode=S_IFDIR|0755, st_size=8192, ...}) = 0
stat("/var/lib/ceph/osd/ceph-243/current/7.b_head/DIR_B/DIR_4", {st_mode=S_IFDIR|0755, st_size=24576, ...}) = 0
stat("/var/lib/ceph/osd/ceph-243/current/7.b_head/DIR_B/DIR_4/DIR_D", 0x7ffe0b664240) = -1 ENOENT (No such file or directory)
stat("/var/lib/ceph/osd/ceph-243/current/7.b_head/DIR_B/DIR_4/\\.dir.31716e6b-28c9-42e6-81ed-d27e3b714a9c.47687923.1711__head_6D57DD4B__7", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0

machine3:

stat("/var/lib/ceph/osd/ceph-22/current/7.b_head/DIR_B/DIR_4", {st_mode=S_IFDIR|0755, st_size=24576, ...}) = 0
stat("/var/lib/ceph/osd/ceph-22/current/7.b_head/DIR_B/DIR_4/DIR_D", 0x7ffc63518650) = -1 ENOENT (No such file or directory)
stat("/var/lib/ceph/osd/ceph-22/current/7.b_head/DIR_B/DIR_4/\\.dir.31716e6b-28c9-42e6-81ed-d27e3b714a9c.47687923.1711__head_6D57DD4B__7", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
stat("/var/lib/ceph/osd/ceph-22/current/7.b_head", {st_mode=S_IFDIR|0755, st_size=8192, ...}) = 0
stat("/var/lib/ceph/osd/ceph-22/current/7.b_head/DIR_B", {st_mode=S_IFDIR|0755, st_size=8192, ...}) = 0
stat("/var/lib/ceph/osd/ceph-22/current/7.b_head/DIR_B/DIR_4", {st_mode=S_IFDIR|0755, st_size=24576, ...}) = 0
stat("/var/lib/ceph/osd/ceph-22/current/7.b_head/DIR_B/DIR_4/DIR_D", 0x7ffc63518650) = -1 ENOENT (No such file or directory)
stat("/var/lib/ceph/osd/ceph-22/current/7.b_head/DIR_B/DIR_4/\\.dir.31716e6b-28c9-42e6-81ed-d27e3b714a9c.47687923.1711__head_6D57DD4B__7", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0

Help wanted.

--
May the most significant bit of your life be positive.
