On Monday, 15 April 2013 at 10:57 -0700, Gregory Farnum wrote:
> On Mon, Apr 15, 2013 at 10:19 AM, Olivier Bonvalet <ceph.list@xxxxxxxxx> wrote:
> > On Monday, 15 April 2013 at 10:16 -0700, Gregory Farnum wrote:
> >> Are you saying you saw this problem more than once, and so you
> >> completely wiped the OSD in question, then brought it back into the
> >> cluster, and now it's seeing this error again?
> >
> > Yes, it's exactly that.
> >
> >> Are any other OSDs experiencing this issue?
> >
> > No, only this one has the problem.
>
> Did you run scrubs while this node was out of the cluster? If you
> wiped the data and this is recurring then this is apparently an issue
> with the cluster state, not just one node, and any other primary for
> the broken PG(s) should crash as well. Can you verify by taking this
> one down and then doing a full scrub?
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com

So, I marked this OSD as "out" to rebalance the data and be able to re-run a scrub (a rough sketch of the commands is at the end of this message). You are probably right, since I now have 3 other OSDs on the same host which are down. I still don't have any PGs flagged in error (the cluster is in HEALTH_WARN status), but something is going wrong. In syslog I have:

Apr 16 02:07:08 alim ceph-osd: 2013-04-16 02:07:06.742915 7fe651131700 -1 filestore(/var/lib/ceph/osd/ceph-31) could not find d452999/rb.0.2d76b.238e1f29.00000000162d/f99//4 in index: (2) No such file or directory
Apr 16 02:07:08 alim ceph-osd: 2013-04-16 02:07:06.742999 7fe651131700 -1 filestore(/var/lib/ceph/osd/ceph-31) could not find 85242299/rb.0.1367.2ae8944a.000000001f9c/f98//4 in index: (2) No such file or directory
Apr 16 03:41:11 alim ceph-osd: 2013-04-16 03:41:11.758150 7fe64f12d700 -1 osd.31 48020 heartbeat_check: no reply from osd.5 since 2013-04-16 03:40:50.349130 (cutoff 2013-04-16 03:40:51.758149)
Apr 16 04:27:40 alim ceph-osd: 2013-04-16 04:27:40.529492 7fe65e14b700 -1 osd.31 48416 heartbeat_check: no reply from osd.26 since 2013-04-16 04:27:20.203868 (cutoff 2013-04-16 04:27:20.529489)
Apr 16 04:27:41 alim ceph-osd: 2013-04-16 04:27:41.529609 7fe65e14b700 -1 osd.31 48416 heartbeat_check: no reply from osd.26 since 2013-04-16 04:27:20.203868 (cutoff 2013-04-16 04:27:21.529605)
Apr 16 05:01:43 alim ceph-osd: 2013-04-16 05:01:43.440257 7fe64f12d700 -1 osd.31 48602 heartbeat_check: no reply from osd.4 since 2013-04-16 05:01:22.529918 (cutoff 2013-04-16 05:01:23.440257)
Apr 16 05:01:43 alim ceph-osd: 2013-04-16 05:01:43.523985 7fe65e14b700 -1 osd.31 48602 heartbeat_check: no reply from osd.4 since 2013-04-16 05:01:22.529918 (cutoff 2013-04-16 05:01:23.523984)
Apr 16 05:55:28 alim ceph-osd: 2013-04-16 05:55:27.770327 7fe65e14b700 -1 osd.31 48847 heartbeat_check: no reply from osd.26 since 2013-04-16 05:55:07.392502 (cutoff 2013-04-16 05:55:07.770323)
Apr 16 05:55:28 alim ceph-osd: 2013-04-16 05:55:28.497600 7fe64f12d700 -1 osd.31 48847 heartbeat_check: no reply from osd.26 since 2013-04-16 05:55:07.392502 (cutoff 2013-04-16 05:55:08.497598)
Apr 16 06:04:13 alim ceph-osd: 2013-04-16 06:04:13.051839 7fe65012f700 -1 osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::_scrub(ScrubMap&)' thread 7fe65012f700 time 2013-04-16 06:04:12.843977#012osd/ReplicatedPG.cc: 7188: FAILED assert(head != hobject_t())#012#012 ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)#012 1: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]#012 2: (PG::scrub_compare_maps()+0xeb8) [0x696c18]#012 3: (PG::chunky_scrub()+0x2d9) [0x6c37f9]#012 4: (PG::scrub()+0x145) [0x6c4e55]#012 5: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]#012 6: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]#012 7: (ThreadPool::WorkThread::entry()+0x10) [0x817980]#012 8: (()+0x68ca) [0x7fe6626558ca]#012 9: (clone()+0x6d) [0x7fe661184b6d]#012 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Apr 16 06:04:13 alim ceph-osd: 0> 2013-04-16 06:04:13.051839 7fe65012f700 -1 osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::_scrub(ScrubMap&)' thread 7fe65012f700 time 2013-04-16 06:04:12.843977#012osd/ReplicatedPG.cc: 7188: FAILED assert(head != hobject_t())#012#012 ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)#012 1: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]#012 2: (PG::scrub_compare_maps()+0xeb8) [0x696c18]#012 3: (PG::chunky_scrub()+0x2d9) [0x6c37f9]#012 4: (PG::scrub()+0x145) [0x6c4e55]#012 5: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]#012 6: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]#012 7: (ThreadPool::WorkThread::entry()+0x10) [0x817980]#012 8: (()+0x68ca) [0x7fe6626558ca]#012 9: (clone()+0x6d) [0x7fe661184b6d]#012 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Apr 16 06:04:13 alim ceph-osd: 2013-04-16 06:04:13.277072 7fe65012f700 -1 *** Caught signal (Aborted) **#012 in thread 7fe65012f700#012#012 ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)#012 1: /usr/bin/ceph-osd() [0x7a6289]#012 2: (()+0xeff0) [0x7fe66265dff0]#012 3: (gsignal()+0x35) [0x7fe6610e71b5]#012 4: (abort()+0x180) [0x7fe6610e9fc0]#012 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fe66197bdc5]#012 6: (()+0xcb166) [0x7fe66197a166]#012 7: (()+0xcb193) [0x7fe66197a193]#012 8: (()+0xcb28e) [0x7fe66197a28e]#012 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7c9) [0x8f9549]#012 10: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]#012 11: (PG::scrub_compare_maps()+0xeb8) [0x696c18]#012 12: (PG::chunky_scrub()+0x2d9) [0x6c37f9]#012 13: (PG::scrub()+0x145) [0x6c4e55]#012 14: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]#012 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]#012 16: (ThreadPool::WorkThread::entry()+0x10) [0x817980]#012 17: (()+0x68ca) [0x7fe6626558ca]#012 18: (clone()+0x6d) [0x7fe661184b6d]#012 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Apr 16 06:04:13 alim ceph-osd: 0> 2013-04-16 06:04:13.277072 7fe65012f700 -1 *** Caught signal (Aborted) **#012 in thread 7fe65012f700#012#012 ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)#012 1: /usr/bin/ceph-osd() [0x7a6289]#012 2: (()+0xeff0) [0x7fe66265dff0]#012 3: (gsignal()+0x35) [0x7fe6610e71b5]#012 4: (abort()+0x180) [0x7fe6610e9fc0]#012 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fe66197bdc5]#012 6: (()+0xcb166) [0x7fe66197a166]#012 7: (()+0xcb193) [0x7fe66197a193]#012 8: (()+0xcb28e) [0x7fe66197a28e]#012 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7c9) [0x8f9549]#012 10: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]#012 11: (PG::scrub_compare_maps()+0xeb8) [0x696c18]#012 12: (PG::chunky_scrub()+0x2d9) [0x6c37f9]#012 13: (PG::scrub()+0x145) [0x6c4e55]#012 14: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]#012 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]#012 16: (ThreadPool::WorkThread::entry()+0x10) [0x817980]#012 17: (()+0x68ca) [0x7fe6626558ca]#012 18: (clone()+0x6d) [0x7fe661184b6d]#012 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

and the last lines from osd.24.log are:

   -10> 2013-04-16 08:08:54.991371 7f5bb4569700 2 osd.24 pg_epoch: 49397 pg[3.7c( v 49397'11387809 (49355'11386808,49397'11387809] local-les=49382 n=18795 ec=56 les/c 49382/49397 49375/49375/49375) [24,13,6] r=0 lpr=49375 mlcod 49397'11387808 active+clean+scrubbing+deep snaptrimq=[1f86~5,1f8c~8,2bd8~e]] scrub osd.6 has 10 items
    -9> 2013-04-16 08:08:54.991876 7f5bb4569700 2 osd.24 pg_epoch: 49397 pg[3.7c( v 49397'11387809 (49355'11386808,49397'11387809] local-les=49382 n=18795 ec=56 les/c 49382/49397 49375/49375/49375) [24,13,6] r=0 lpr=49375 mlcod 49397'11387808 active+clean+scrubbing+deep snaptrimq=[1f86~5,1f8c~8,2bd8~e]] 3.7c osd.24 inconsistent snapcolls on 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 found expected 12d7
3.7c osd.13 inconsistent snapcolls on 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 found expected 12d7
3.7c osd.13: soid 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 size 4194304 != known size 0, digest 1360833101 != known digest 0
3.7c osd.6 inconsistent snapcolls on 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 found expected 12d7
3.7c osd.6: soid 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 size 4194304 != known size 0, digest 1360833101 != known digest 0
    -8> 2013-04-16 08:08:54.991906 7f5bb4569700 0 log [ERR] : 3.7c osd.24 inconsistent snapcolls on 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 found expected 12d7
    -7> 2013-04-16 08:08:54.991913 7f5bb4569700 0 log [ERR] : 3.7c osd.13 inconsistent snapcolls on 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 found expected 12d7
    -6> 2013-04-16 08:08:54.991915 7f5bb4569700 0 log [ERR] : 3.7c osd.13: soid 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 size 4194304 != known size 0, digest 1360833101 != known digest 0
    -5> 2013-04-16 08:08:54.991917 7f5bb4569700 0 log [ERR] : 3.7c osd.6 inconsistent snapcolls on 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 found expected 12d7
    -4> 2013-04-16 08:08:54.991919 7f5bb4569700 0 log [ERR] : 3.7c osd.6: soid 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 size 4194304 != known size 0, digest 1360833101 != known digest 0
    -3> 2013-04-16 08:08:54.991986 7f5bb4569700 0 log [ERR] : deep-scrub 3.7c 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 on disk size (0) does not match object info size (4194304)
    -2> 2013-04-16 08:08:54.993813 7f5bbbd78700 5 --OSD::tracker-- reqid: client.1811920.1:164200641, seq: 4575, time: 2013-04-16 08:08:54.993812, event: op_applied, request: osd_op(client.1811920.1:164200641 rb.0.2d766.238e1f29.000000002c80 [write 442368~4096] 3.31aa16ec snapc 2bc6=[2bc6,2b84,2b33,2ada,2a0e,2940,285f,2775,2633,25e8,24d1,2452,2363,22c6,21a9,20db,1f98])
    -1> 2013-04-16 08:08:54.993901 7f5bbbd78700 5 --OSD::tracker-- reqid: client.1811920.1:164200641, seq: 4575, time: 2013-04-16 08:08:54.993901, event: done, request: osd_op(client.1811920.1:164200641 rb.0.2d766.238e1f29.000000002c80 [write 442368~4096] 3.31aa16ec snapc 2bc6=[2bc6,2b84,2b33,2ada,2a0e,2940,285f,2775,2633,25e8,24d1,2452,2363,22c6,21a9,20db,1f98])
     0> 2013-04-16 08:08:54.994227 7f5bb4569700 -1 osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::_scrub(ScrubMap&)' thread 7f5bb4569700 time 2013-04-16 08:08:54.991990
osd/ReplicatedPG.cc: 7188: FAILED assert(head != hobject_t())

 ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)
 1: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]
 2: (PG::scrub_compare_maps()+0xeb8) [0x696c18]
 3: (PG::chunky_scrub()+0x2d9) [0x6c37f9]
 4: (PG::scrub()+0x145) [0x6c4e55]
 5: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]
 6: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]
 7: (ThreadPool::WorkThread::entry()+0x10) [0x817980]
 8: (()+0x68ca) [0x7f5bc72908ca]
 9: (clone()+0x6d) [0x7f5bc5dbfb6d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   0/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 hadoop
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -1/-1 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 10000
  max_new 1000
  log_file /var/log/ceph/osd.24.log
--- end dump of recent events ---

2013-04-16 08:08:55.163535 7f5bb4569700 -1 *** Caught signal (Aborted) **
 in thread 7f5bb4569700

 ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)
 1: /usr/bin/ceph-osd() [0x7a6289]
 2: (()+0xeff0) [0x7f5bc7298ff0]
 3: (gsignal()+0x35) [0x7f5bc5d221b5]
 4: (abort()+0x180) [0x7f5bc5d24fc0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f5bc65b6dc5]
 6: (()+0xcb166) [0x7f5bc65b5166]
 7: (()+0xcb193) [0x7f5bc65b5193]
 8: (()+0xcb28e) [0x7f5bc65b528e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7c9) [0x8f9549]
 10: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]
 11: (PG::scrub_compare_maps()+0xeb8) [0x696c18]
 12: (PG::chunky_scrub()+0x2d9) [0x6c37f9]
 13: (PG::scrub()+0x145) [0x6c4e55]
 14: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]
 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]
 16: (ThreadPool::WorkThread::entry()+0x10) [0x817980]
 17: (()+0x68ca) [0x7f5bc72908ca]
 18: (clone()+0x6d) [0x7f5bc5dbfb6d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events ---
     0> 2013-04-16 08:08:55.163535 7f5bb4569700 -1 *** Caught signal (Aborted) **
 in thread 7f5bb4569700

 ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)
 1: /usr/bin/ceph-osd() [0x7a6289]
 2: (()+0xeff0) [0x7f5bc7298ff0]
 3: (gsignal()+0x35) [0x7f5bc5d221b5]
 4: (abort()+0x180) [0x7f5bc5d24fc0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f5bc65b6dc5]
 6: (()+0xcb166) [0x7f5bc65b5166]
 7: (()+0xcb193) [0x7f5bc65b5193]
 8: (()+0xcb28e) [0x7f5bc65b528e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7c9) [0x8f9549]
 10: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]
 11: (PG::scrub_compare_maps()+0xeb8) [0x696c18]
 12: (PG::chunky_scrub()+0x2d9) [0x6c37f9]
 13: (PG::scrub()+0x145) [0x6c4e55]
 14: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]
 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]
 16: (ThreadPool::WorkThread::entry()+0x10) [0x817980]
 17: (()+0x68ca) [0x7f5bc72908ca]
 18: (clone()+0x6d) [0x7f5bc5dbfb6d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   0/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 hadoop
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -1/-1 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 10000
  max_new 1000
  log_file /var/log/ceph/osd.24.log
--- end dump of recent events ---
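
PS: for reference, here is the kind of commands I mean above when I talk about marking the OSD "out" and re-running a scrub. This is only a rough sketch with the usual 0.56 (bobtail) CLI; osd.31 and pg 3.7c are just the IDs taken from the logs above, and the init-script path is what I use on Debian, so adjust to whatever OSD/PG is actually affected:

    # mark the suspect OSD out so data rebalances away from it
    ceph osd out 31

    # optionally stop the daemon too, as Greg suggested, so another primary takes over
    /etc/init.d/ceph stop osd.31

    # ask the (new) primary to scrub the suspect PG and check the result
    ceph pg scrub 3.7c
    ceph pg deep-scrub 3.7c
    ceph pg 3.7c query
    ceph health detail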