Hi all,
What could be the reason that all PGs of an entire erasure-coded pool are
stuck stale? All OSDs have been restarted and are up.
The details:
We have a setup with 14 OSD hosts with dedicated OSDs for an erasure-coded
pool and 2 SSDs for a cache pool, plus 3 separate monitor/metadata nodes
with SSDs for the metadata pool.
This afternoon I had to reboot some OSD nodes because they weren't
reachable anymore. After the cluster recovered, some PGs were stuck stale.
With `ceph health detail` I saw that they were all PGs of 2 specific
EC-pool OSDs. I tried restarting those OSDs, but that didn't solve the
problem. I then restarted all OSDs on those nodes, but after that all PGs
on the EC OSDs of those nodes were stuck stale. The docs say a PG goes
stale when its OSDs stop reporting to the monitors, so I restarted the
monitors. Since that did not solve it either, I tried restarting
everything.
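Roughly, the checks I'm using to map the stale PGs to their OSDs look like
this (all monitor-side commands, a sketch rather than an exact transcript;
2.1b0 is the same example PG id I use further down):

    # list PGs stuck in the stale state (answered by the monitors)
    ceph pg dump_stuck stale
    # map one PG to its up/acting OSD set
    ceph pg map 2.1b0
    # confirm the OSDs involved are marked up/in
    ceph osd tree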
When the cluster had recovered again, all other PGs were back to
active+clean, except for the PGs in the EC pool; those are still
stale+active+clean or even stale+active+clean+scrubbing+deep.
When I try to query such a PG (e.g. `ceph pg 2.1b0 query`), it just hangs
there. That is not the case for the other pools.
If I interrupt it, I get: Error EINTR: problem getting command descriptions
from pg.2.1b0
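As far as I understand, `ceph pg <pgid> query` is answered by the PG's
primary OSD rather than by the monitors, so the hang suggests that OSD
isn't actually servicing the PG. If it helps, these are the monitor-side
and admin-socket checks one could run instead (a sketch; osd.29 is just
taken from the attached log, and the daemon id/socket path may differ):

    # monitor-side mapping, answers even when the primary OSD hangs
    ceph pg map 2.1b0
    # ask the OSD directly over its admin socket
    ceph daemon osd.29 status
    ceph daemon osd.29 dump_ops_in_flight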
I can't see anything strange in the logs of these OSDs (attached).
Does anyone have an idea?
Help would be very much appreciated!
Thanks!
Kenneth
2015-11-13 17:07:38.362392 7fe857b73900 0 ceph version 9.0.3 (7295612d29f953f46e6e88812ef372b89a43b9da), process ceph-osd, pid 16956
2015-11-13 17:07:38.489267 7fe857b73900 0 filestore(/var/lib/ceph/osd/ceph-29) backend xfs (magic 0x58465342)
2015-11-13 17:07:38.494638 7fe857b73900 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-29) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-11-13 17:07:38.494646 7fe857b73900 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-29) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2015-11-13 17:07:38.494696 7fe857b73900 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-29) detect_features: splice is supported
2015-11-13 17:07:38.538539 7fe857b73900 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-29) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2015-11-13 17:07:38.561220 7fe857b73900 0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-29) detect_features: extsize is supported and your kernel >= 3.5
2015-11-13 17:07:38.790119 7fe857b73900 0 filestore(/var/lib/ceph/osd/ceph-29) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2015-11-13 17:07:39.038637 7fe857b73900 1 journal _open /var/lib/ceph/osd/ceph-29/journal fd 21: 10737418240 bytes, block size 4096 bytes, directio = 1, aio = 1
2015-11-13 17:07:39.055782 7fe857b73900 1 journal _open /var/lib/ceph/osd/ceph-29/journal fd 21: 10737418240 bytes, block size 4096 bytes, directio = 1, aio = 1
2015-11-13 17:07:39.059490 7fe857b73900 0 <cls> cls/cephfs/cls_cephfs.cc:136: loading cephfs_size_scan
2015-11-13 17:07:39.059702 7fe857b73900 0 <cls> cls/hello/cls_hello.cc:271: loading cls_hello
2015-11-13 17:07:39.066342 7fe857b73900 0 osd.29 10582 crush map has features 104186773504, adjusting msgr requires for clients
2015-11-13 17:07:39.066349 7fe857b73900 0 osd.29 10582 crush map has features 379064680448 was 8705, adjusting msgr requires for mons
2015-11-13 17:07:39.066354 7fe857b73900 0 osd.29 10582 crush map has features 379064680448, adjusting msgr requires for osds
2015-11-13 17:08:00.020520 7fe857b73900 0 osd.29 10582 load_pgs
2015-11-13 17:08:04.948021 7fe857b73900 0 osd.29 10582 load_pgs opened 254 pgs
2015-11-13 17:08:04.959217 7fe857b73900 -1 osd.29 10582 log_to_monitors {default=true}
2015-11-13 17:08:04.963778 7fe83d9a2700 0 osd.29 10582 ignoring osdmap until we have initialized
2015-11-13 17:08:04.963814 7fe83d9a2700 0 osd.29 10582 ignoring osdmap until we have initialized
2015-11-13 17:08:04.996676 7fe857b73900 0 osd.29 10582 done with init, starting boot process
2015-11-13 17:08:11.360655 7fe826e4f700 0 -- 10.143.16.13:6812/16956 >> 10.143.16.13:6816/2822 pipe(0x4c259000 sd=181 :6812 s=0 pgs=0 cs=0 l=0 c=0x4c1a1e40).accept connect_seq 0 vs existing 0 state connecting
2015-11-13 17:08:11.360716 7fe826f50700 0 -- 10.143.16.13:6812/16956 >> 10.143.16.13:6814/2729 pipe(0x4c254000 sd=180 :6812 s=0 pgs=0 cs=0 l=0 c=0x4c1a1ce0).accept connect_seq 0 vs existing 0 state connecting
2015-11-13 17:08:11.360736 7fe826b4c700 0 -- 10.143.16.13:6812/16956 >> 10.143.16.13:6800/1002914 pipe(0x4c26f000 sd=183 :6812 s=0 pgs=0 cs=0 l=0 c=0x4c1a2260).accept connect_seq 0 vs existing 0 state connecting
2015-11-13 17:08:11.361034 7fe82694a700 0 -- 10.143.16.13:6812/16956 >> 10.143.16.14:6808/13526 pipe(0x4c292000 sd=185 :6812 s=0 pgs=0 cs=0 l=0 c=0x4c1a23c0).accept connect_seq 0 vs existing 0 state connecting