All PGs of erasure coded pool stuck stale

Hi all,

What could be the reason that all PGs of an entire erasure coded pool are stuck stale? All OSDs have been restarted and are up.

The details:
We have a setup with 14 OSD hosts, each with dedicated OSDs for an erasure coded pool and 2 SSDs for a cache pool, plus 3 separate monitor/metadata nodes with SSDs for the metadata pool.
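
For context, the EC pool and the cache tier in front of it were created more or less like this (a rough sketch from memory; the profile name, k/m values, pool names and PG counts below are placeholders, not our exact settings):

    ceph osd erasure-code-profile set ecprofile k=10 m=4
    ceph osd pool create ecdata 1024 1024 erasure ecprofile
    ceph osd pool create cache 1024
    ceph osd tier add ecdata cache
    ceph osd tier cache-mode cache writeback
    ceph osd tier set-overlay ecdata cache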

This afternoon I had to reboot some OSD nodes because they weren't reachable anymore. After the cluster recovered, some PGs were stuck stale. `ceph health detail` showed that they were all PGs on 2 specific EC-pool OSDs. I tried restarting those OSDs, but that didn't solve the problem. I then restarted all OSDs on those nodes, but after that all PGs on the EC OSDs of those nodes were stuck stale. I read in the docs that this state is reached when the OSDs hosting a PG stop reporting to the monitors, so I restarted the monitors. Since that didn't solve it either, I tried restarting everything.
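
Concretely, the sequence was roughly the following (a sketch; the OSD id and mon hostname are examples, and the restart command depends on whether the node runs systemd or sysvinit):

    ceph health detail | grep stale        # all stale PGs mapped to the same two EC OSDs
    systemctl restart ceph-osd@29          # on sysvinit: service ceph restart osd.29
    systemctl restart ceph-mon@mon01       # mon01 is a placeholder hostname
    ceph -s                                # watch the cluster recover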

When the cluster had recovered again, all other PGs were back to active+clean, except for the PGs in the EC pool; those are still stale+active+clean or even stale+active+clean+scrubbing+deep.
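
Listing the stuck PGs from the monitor side still works, which is how I can see they all belong to the EC pool (a sketch; this assumes the EC pool has pool id 2, as in the example PG below):

    ceph pg dump_stuck stale
    ceph pg dump pgs_brief | grep '^2\.'   # only PGs of pool 2 show up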

When I try to query such a PG (e.g. `ceph pg 2.1b0 query`), it just hangs; that is not the case for PGs of the other pools. If I interrupt it, I get: Error EINTR: problem getting command descriptions from pg.2.1b0
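
As far as I understand, `ceph pg <pgid> query` is forwarded to the PG's primary OSD, so the hang itself suggests the mapped OSD is not actually serving that PG. Monitor-side lookups do return, e.g. (a sketch):

    ceph pg map 2.1b0                # answered by the monitors; shows the up/acting OSD sets
    timeout 30 ceph pg 2.1b0 query   # bound the hang instead of interrupting by hand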

I can't see anything strange in the logs of the OSDs hosting these PGs (attached below).

Does anyone have an idea?

Help very much appreciated!

Thanks!

Kenneth
2015-11-13 17:07:38.362392 7fe857b73900  0 ceph version 9.0.3 (7295612d29f953f46e6e88812ef372b89a43b9da), process ceph-osd, pid 16956
2015-11-13 17:07:38.489267 7fe857b73900  0 filestore(/var/lib/ceph/osd/ceph-29) backend xfs (magic 0x58465342)
2015-11-13 17:07:38.494638 7fe857b73900  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-29) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-11-13 17:07:38.494646 7fe857b73900  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-29) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2015-11-13 17:07:38.494696 7fe857b73900  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-29) detect_features: splice is supported
2015-11-13 17:07:38.538539 7fe857b73900  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-29) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2015-11-13 17:07:38.561220 7fe857b73900  0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-29) detect_features: extsize is supported and your kernel >= 3.5
2015-11-13 17:07:38.790119 7fe857b73900  0 filestore(/var/lib/ceph/osd/ceph-29) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2015-11-13 17:07:39.038637 7fe857b73900  1 journal _open /var/lib/ceph/osd/ceph-29/journal fd 21: 10737418240 bytes, block size 4096 bytes, directio = 1, aio = 1
2015-11-13 17:07:39.055782 7fe857b73900  1 journal _open /var/lib/ceph/osd/ceph-29/journal fd 21: 10737418240 bytes, block size 4096 bytes, directio = 1, aio = 1
2015-11-13 17:07:39.059490 7fe857b73900  0 <cls> cls/cephfs/cls_cephfs.cc:136: loading cephfs_size_scan
2015-11-13 17:07:39.059702 7fe857b73900  0 <cls> cls/hello/cls_hello.cc:271: loading cls_hello
2015-11-13 17:07:39.066342 7fe857b73900  0 osd.29 10582 crush map has features 104186773504, adjusting msgr requires for clients
2015-11-13 17:07:39.066349 7fe857b73900  0 osd.29 10582 crush map has features 379064680448 was 8705, adjusting msgr requires for mons
2015-11-13 17:07:39.066354 7fe857b73900  0 osd.29 10582 crush map has features 379064680448, adjusting msgr requires for osds
2015-11-13 17:08:00.020520 7fe857b73900  0 osd.29 10582 load_pgs
2015-11-13 17:08:04.948021 7fe857b73900  0 osd.29 10582 load_pgs opened 254 pgs
2015-11-13 17:08:04.959217 7fe857b73900 -1 osd.29 10582 log_to_monitors {default=true}
2015-11-13 17:08:04.963778 7fe83d9a2700  0 osd.29 10582 ignoring osdmap until we have initialized
2015-11-13 17:08:04.963814 7fe83d9a2700  0 osd.29 10582 ignoring osdmap until we have initialized
2015-11-13 17:08:04.996676 7fe857b73900  0 osd.29 10582 done with init, starting boot process
2015-11-13 17:08:11.360655 7fe826e4f700  0 -- 10.143.16.13:6812/16956 >> 10.143.16.13:6816/2822 pipe(0x4c259000 sd=181 :6812 s=0 pgs=0 cs=0 l=0 c=0x4c1a1e40).accept connect_seq 0 vs existing 0 state connecting
2015-11-13 17:08:11.360716 7fe826f50700  0 -- 10.143.16.13:6812/16956 >> 10.143.16.13:6814/2729 pipe(0x4c254000 sd=180 :6812 s=0 pgs=0 cs=0 l=0 c=0x4c1a1ce0).accept connect_seq 0 vs existing 0 state connecting
2015-11-13 17:08:11.360736 7fe826b4c700  0 -- 10.143.16.13:6812/16956 >> 10.143.16.13:6800/1002914 pipe(0x4c26f000 sd=183 :6812 s=0 pgs=0 cs=0 l=0 c=0x4c1a2260).accept connect_seq 0 vs existing 0 state connecting
2015-11-13 17:08:11.361034 7fe82694a700  0 -- 10.143.16.13:6812/16956 >> 10.143.16.14:6808/13526 pipe(0x4c292000 sd=185 :6812 s=0 pgs=0 cs=0 l=0 c=0x4c1a23c0).accept connect_seq 0 vs existing 0 state connecting
