On Tuesday, 16 April 2013 at 08:56 +0200, Olivier Bonvalet wrote:
> On Monday, 15 April 2013 at 10:57 -0700, Gregory Farnum wrote:
> > On Mon, Apr 15, 2013 at 10:19 AM, Olivier Bonvalet <ceph.list@xxxxxxxxx> wrote:
> > > On Monday, 15 April 2013 at 10:16 -0700, Gregory Farnum wrote:
> > >> Are you saying you saw this problem more than once, and so you
> > >> completely wiped the OSD in question, then brought it back into the
> > >> cluster, and now it's seeing this error again?
> > >
> > > Yes, it's exactly that.
> > >
> > >> Are any other OSDs experiencing this issue?
> > >
> > > No, only this one has the problem.
> >
> > Did you run scrubs while this node was out of the cluster? If you
> > wiped the data and this is recurring then this is apparently an issue
> > with the cluster state, not just one node, and any other primary for
> > the broken PG(s) should crash as well. Can you verify by taking this
> > one down and then doing a full scrub?
> > -Greg
> > Software Engineer #42 @ http://inktank.com | http://ceph.com
>
> So, I marked this OSD as "out" to rebalance the data and be able to redo a
> scrub. You are probably right, since I now have 3 other OSDs on the same
> host which are down.
>
> I still don't have any PG in error (and the cluster is in HEALTH_WARN
> status), but something is going wrong.
>
> In syslog I have:
>
> Apr 16 02:07:08 alim ceph-osd: 2013-04-16 02:07:06.742915 7fe651131700 -1 filestore(/var/lib/ceph/osd/ceph-31) could not find d452999/rb.0.2d76b.238e1f29.00000000162d/f99//4 in index: (2) No such file or directory
> Apr 16 02:07:08 alim ceph-osd: 2013-04-16 02:07:06.742999 7fe651131700 -1 filestore(/var/lib/ceph/osd/ceph-31) could not find 85242299/rb.0.1367.2ae8944a.000000001f9c/f98//4 in index: (2) No such file or directory
> Apr 16 03:41:11 alim ceph-osd: 2013-04-16 03:41:11.758150 7fe64f12d700 -1 osd.31 48020 heartbeat_check: no reply from osd.5 since 2013-04-16 03:40:50.349130 (cutoff 2013-04-16 03:40:51.758149)
> Apr 16 04:27:40 alim ceph-osd: 2013-04-16 04:27:40.529492 7fe65e14b700 -1 osd.31 48416 heartbeat_check: no reply from osd.26 since 2013-04-16 04:27:20.203868 (cutoff 2013-04-16 04:27:20.529489)
> Apr 16 04:27:41 alim ceph-osd: 2013-04-16 04:27:41.529609 7fe65e14b700 -1 osd.31 48416 heartbeat_check: no reply from osd.26 since 2013-04-16 04:27:20.203868 (cutoff 2013-04-16 04:27:21.529605)
> Apr 16 05:01:43 alim ceph-osd: 2013-04-16 05:01:43.440257 7fe64f12d700 -1 osd.31 48602 heartbeat_check: no reply from osd.4 since 2013-04-16 05:01:22.529918 (cutoff 2013-04-16 05:01:23.440257)
> Apr 16 05:01:43 alim ceph-osd: 2013-04-16 05:01:43.523985 7fe65e14b700 -1 osd.31 48602 heartbeat_check: no reply from osd.4 since 2013-04-16 05:01:22.529918 (cutoff 2013-04-16 05:01:23.523984)
> Apr 16 05:55:28 alim ceph-osd: 2013-04-16 05:55:27.770327 7fe65e14b700 -1 osd.31 48847 heartbeat_check: no reply from osd.26 since 2013-04-16 05:55:07.392502 (cutoff 2013-04-16 05:55:07.770323)
> Apr 16 05:55:28 alim ceph-osd: 2013-04-16 05:55:28.497600 7fe64f12d700 -1 osd.31 48847 heartbeat_check: no reply from osd.26 since 2013-04-16 05:55:07.392502 (cutoff 2013-04-16 05:55:08.497598)
> Apr 16 06:04:13 alim ceph-osd: 2013-04-16 06:04:13.051839 7fe65012f700 -1 osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::_scrub(ScrubMap&)' thread 7fe65012f700 time 2013-04-16 06:04:12.843977#012osd/ReplicatedPG.cc: 7188: FAILED assert(head != hobject_t())#012#012 ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)#012 1: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]#012 2: (PG::scrub_compare_maps()+0xeb8) [0x696c18]#012 3: (PG::chunky_scrub()+0x2d9) [0x6c37f9]#012 4: (PG::scrub()+0x145) [0x6c4e55]#012 5: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]#012 6: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]#012 7: (ThreadPool::WorkThread::entry()+0x10) [0x817980]#012 8: (()+0x68ca) [0x7fe6626558ca]#012 9: (clone()+0x6d) [0x7fe661184b6d]#012 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> Apr 16 06:04:13 alim ceph-osd: 0> 2013-04-16 06:04:13.051839 7fe65012f700 -1 osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::_scrub(ScrubMap&)' thread 7fe65012f700 time 2013-04-16 06:04:12.843977#012osd/ReplicatedPG.cc: 7188: FAILED assert(head != hobject_t())#012#012 ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)#012 1: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]#012 2: (PG::scrub_compare_maps()+0xeb8) [0x696c18]#012 3: (PG::chunky_scrub()+0x2d9) [0x6c37f9]#012 4: (PG::scrub()+0x145) [0x6c4e55]#012 5: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]#012 6: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]#012 7: (ThreadPool::WorkThread::entry()+0x10) [0x817980]#012 8: (()+0x68ca) [0x7fe6626558ca]#012 9: (clone()+0x6d) [0x7fe661184b6d]#012 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> Apr 16 06:04:13 alim ceph-osd: 2013-04-16 06:04:13.277072 7fe65012f700 -1 *** Caught signal (Aborted) **#012 in thread 7fe65012f700#012#012 ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)#012 1: /usr/bin/ceph-osd() [0x7a6289]#012 2: (()+0xeff0) [0x7fe66265dff0]#012 3: (gsignal()+0x35) [0x7fe6610e71b5]#012 4: (abort()+0x180) [0x7fe6610e9fc0]#012 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fe66197bdc5]#012 6: (()+0xcb166) [0x7fe66197a166]#012 7: (()+0xcb193) [0x7fe66197a193]#012 8: (()+0xcb28e) [0x7fe66197a28e]#012 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7c9) [0x8f9549]#012 10: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]#012 11: (PG::scrub_compare_maps()+0xeb8) [0x696c18]#012 12: (PG::chunky_scrub()+0x2d9) [0x6c37f9]#012 13: (PG::scrub()+0x145) [0x6c4e55]#012 14: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]#012 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]#012 16: (ThreadPool::WorkThread::entry()+0x10) [0x817980]#012 17: (()+0x68ca) [0x7fe6626558ca]#012 18: (clone()+0x6d) [0x7fe661184b6d]#012 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> Apr 16 06:04:13 alim ceph-osd: 0> 2013-04-16 06:04:13.277072 7fe65012f700 -1 *** Caught signal (Aborted) **#012 in thread 7fe65012f700#012#012 ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)#012 1: /usr/bin/ceph-osd() [0x7a6289]#012 2: (()+0xeff0) [0x7fe66265dff0]#012 3: (gsignal()+0x35) [0x7fe6610e71b5]#012 4: (abort()+0x180) [0x7fe6610e9fc0]#012 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fe66197bdc5]#012 6: (()+0xcb166) [0x7fe66197a166]#012 7: (()+0xcb193) [0x7fe66197a193]#012 8: (()+0xcb28e) [0x7fe66197a28e]#012 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7c9) [0x8f9549]#012 10: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]#012 11: (PG::scrub_compare_maps()+0xeb8) [0x696c18]#012 12: (PG::chunky_scrub()+0x2d9) [0x6c37f9]#012 13: (PG::scrub()+0x145) [0x6c4e55]#012 14: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]#012 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]#012 16: (ThreadPool::WorkThread::entry()+0x10) [0x817980]#012 17: (()+0x68ca) [0x7fe6626558ca]#012 18: (clone()+0x6d) [0x7fe661184b6d]#012 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
>
> and last lines from osd.24.log are:
>
> -10> 2013-04-16 08:08:54.991371 7f5bb4569700 2 osd.24 pg_epoch: 49397 pg[3.7c( v 49397'11387809 (49355'11386808,49397'11387809] local-les=49382 n=18795 ec=56 les/c 49382/49397 49375/49375/49375) [24,13,6] r=0 lpr=49375 mlcod 49397'11387808 active+clean+scrubbing+deep snaptrimq=[1f86~5,1f8c~8,2bd8~e]] scrub osd.6 has 10 items
> -9> 2013-04-16 08:08:54.991876 7f5bb4569700 2 osd.24 pg_epoch: 49397 pg[3.7c( v 49397'11387809 (49355'11386808,49397'11387809] local-les=49382 n=18795 ec=56 les/c 49382/49397 49375/49375/49375) [24,13,6] r=0 lpr=49375 mlcod 49397'11387808 active+clean+scrubbing+deep snaptrimq=[1f86~5,1f8c~8,2bd8~e]] 3.7c osd.24 inconsistent snapcolls on 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 found expected 12d7
> 3.7c osd.13 inconsistent snapcolls on 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 found expected 12d7
> 3.7c osd.13: soid 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 size 4194304 != known size 0, digest 1360833101 != known digest 0
> 3.7c osd.6 inconsistent snapcolls on 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 found expected 12d7
> 3.7c osd.6: soid 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 size 4194304 != known size 0, digest 1360833101 != known digest 0
>
> -8> 2013-04-16 08:08:54.991906 7f5bb4569700 0 log [ERR] : 3.7c osd.24 inconsistent snapcolls on 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 found expected 12d7
> -7> 2013-04-16 08:08:54.991913 7f5bb4569700 0 log [ERR] : 3.7c osd.13 inconsistent snapcolls on 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 found expected 12d7
> -6> 2013-04-16 08:08:54.991915 7f5bb4569700 0 log [ERR] : 3.7c osd.13: soid 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 size 4194304 != known size 0, digest 1360833101 != known digest 0
> -5> 2013-04-16 08:08:54.991917 7f5bb4569700 0 log [ERR] : 3.7c osd.6 inconsistent snapcolls on 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 found expected 12d7
> -4> 2013-04-16 08:08:54.991919 7f5bb4569700 0 log [ERR] : 3.7c osd.6: soid 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 size 4194304 != known size 0, digest 1360833101 != known digest 0
> -3> 2013-04-16 08:08:54.991986 7f5bb4569700 0 log [ERR] : deep-scrub 3.7c 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 on disk size (0) does not match object info size (4194304)
> -2> 2013-04-16 08:08:54.993813 7f5bbbd78700 5 --OSD::tracker-- reqid: client.1811920.1:164200641, seq: 4575, time: 2013-04-16 08:08:54.993812, event: op_applied, request: osd_op(client.1811920.1:164200641 rb.0.2d766.238e1f29.000000002c80 [write 442368~4096] 3.31aa16ec snapc 2bc6=[2bc6,2b84,2b33,2ada,2a0e,2940,285f,2775,2633,25e8,24d1,2452,2363,22c6,21a9,20db,1f98])
> -1> 2013-04-16 08:08:54.993901 7f5bbbd78700 5 --OSD::tracker-- reqid: client.1811920.1:164200641, seq: 4575, time: 2013-04-16 08:08:54.993901, event: done, request: osd_op(client.1811920.1:164200641 rb.0.2d766.238e1f29.000000002c80 [write 442368~4096] 3.31aa16ec snapc 2bc6=[2bc6,2b84,2b33,2ada,2a0e,2940,285f,2775,2633,25e8,24d1,2452,2363,22c6,21a9,20db,1f98])
> 0> 2013-04-16 08:08:54.994227 7f5bb4569700 -1 osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::_scrub(ScrubMap&)' thread 7f5bb4569700 time 2013-04-16 08:08:54.991990
> osd/ReplicatedPG.cc: 7188: FAILED assert(head != hobject_t())
>
> ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)
> 1: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]
> 2: (PG::scrub_compare_maps()+0xeb8) [0x696c18]
> 3: (PG::chunky_scrub()+0x2d9) [0x6c37f9]
> 4: (PG::scrub()+0x145) [0x6c4e55]
> 5: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]
> 6: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]
> 7: (ThreadPool::WorkThread::entry()+0x10) [0x817980]
> 8: (()+0x68ca) [0x7f5bc72908ca]
> 9: (clone()+0x6d) [0x7f5bc5dbfb6d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> --- logging levels ---
> 0/ 5 none
> 0/ 1 lockdep
> 0/ 1 context
> 1/ 1 crush
> 1/ 5 mds
> 1/ 5 mds_balancer
> 1/ 5 mds_locker
> 1/ 5 mds_log
> 1/ 5 mds_log_expire
> 1/ 5 mds_migrator
> 0/ 1 buffer
> 0/ 1 timer
> 0/ 1 filer
> 0/ 1 striper
> 0/ 1 objecter
> 0/ 5 rados
> 0/ 5 rbd
> 0/ 5 journaler
> 0/ 5 objectcacher
> 0/ 5 client
> 0/ 5 osd
> 0/ 5 optracker
> 0/ 5 objclass
> 1/ 3 filestore
> 1/ 3 journal
> 0/ 5 ms
> 1/ 5 mon
> 0/10 monc
> 0/ 5 paxos
> 0/ 5 tp
> 1/ 5 auth
> 1/ 5 crypto
> 1/ 1 finisher
> 1/ 5 heartbeatmap
> 1/ 5 perfcounter
> 1/ 5 rgw
> 1/ 5 hadoop
> 1/ 5 javaclient
> 1/ 5 asok
> 1/ 1 throttle
> -1/-1 (syslog threshold)
> -1/-1 (stderr threshold)
> max_recent 10000
> max_new 1000
> log_file /var/log/ceph/osd.24.log
> --- end dump of recent events ---
> 2013-04-16 08:08:55.163535 7f5bb4569700 -1 *** Caught signal (Aborted) **
> in thread 7f5bb4569700
>
> ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)
> 1: /usr/bin/ceph-osd() [0x7a6289]
> 2: (()+0xeff0) [0x7f5bc7298ff0]
> 3: (gsignal()+0x35) [0x7f5bc5d221b5]
> 4: (abort()+0x180) [0x7f5bc5d24fc0]
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f5bc65b6dc5]
> 6: (()+0xcb166) [0x7f5bc65b5166]
> 7: (()+0xcb193) [0x7f5bc65b5193]
> 8: (()+0xcb28e) [0x7f5bc65b528e]
> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7c9) [0x8f9549]
> 10: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]
> 11: (PG::scrub_compare_maps()+0xeb8) [0x696c18]
> 12: (PG::chunky_scrub()+0x2d9) [0x6c37f9]
> 13: (PG::scrub()+0x145) [0x6c4e55]
> 14: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]
> 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]
> 16: (ThreadPool::WorkThread::entry()+0x10) [0x817980]
> 17: (()+0x68ca) [0x7f5bc72908ca]
> 18: (clone()+0x6d) [0x7f5bc5dbfb6d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> --- begin dump of recent events ---
> 0> 2013-04-16 08:08:55.163535 7f5bb4569700 -1 *** Caught signal (Aborted) **
> in thread 7f5bb4569700
>
> ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)
> 1: /usr/bin/ceph-osd() [0x7a6289]
> 2: (()+0xeff0) [0x7f5bc7298ff0]
> 3: (gsignal()+0x35) [0x7f5bc5d221b5]
> 4: (abort()+0x180) [0x7f5bc5d24fc0]
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f5bc65b6dc5]
> 6: (()+0xcb166) [0x7f5bc65b5166]
> 7: (()+0xcb193) [0x7f5bc65b5193]
> 8: (()+0xcb28e) [0x7f5bc65b528e]
> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7c9) [0x8f9549]
> 10: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]
> 11: (PG::scrub_compare_maps()+0xeb8) [0x696c18]
> 12: (PG::chunky_scrub()+0x2d9) [0x6c37f9]
> 13: (PG::scrub()+0x145) [0x6c4e55]
> 14: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]
> 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]
> 16: (ThreadPool::WorkThread::entry()+0x10) [0x817980]
> 17: (()+0x68ca) [0x7f5bc72908ca]
> 18: (clone()+0x6d) [0x7f5bc5dbfb6d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> --- logging levels ---
> 0/ 5 none
> 0/ 1 lockdep
> 0/ 1 context
> 1/ 1 crush
> 1/ 5 mds
> 1/ 5 mds_balancer
> 1/ 5 mds_locker
> 1/ 5 mds_log
> 1/ 5 mds_log_expire
> 1/ 5 mds_migrator
> 0/ 1 buffer
> 0/ 1 timer
> 0/ 1 filer
> 0/ 1 striper
> 0/ 1 objecter
> 0/ 5 rados
> 0/ 5 rbd
> 0/ 5 journaler
> 0/ 5 objectcacher
> 0/ 5 client
> 0/ 5 osd
> 0/ 5 optracker
> 0/ 5 objclass
> 1/ 3 filestore
> 1/ 3 journal
> 0/ 5 ms
> 1/ 5 mon
> 0/10 monc
> 0/ 5 paxos
> 0/ 5 tp
> 1/ 5 auth
> 1/ 5 crypto
> 1/ 1 finisher
> 1/ 5 heartbeatmap
> 1/ 5 perfcounter
> 1/ 5 rgw
> 1/ 5 hadoop
> 1/ 5 javaclient
> 1/ 5 asok
> 1/ 1 throttle
> -1/-1 (syslog threshold)
> -1/-1 (stderr threshold)
> max_recent 10000
> max_new 1000
> log_file /var/log/ceph/osd.24.log
> --- end dump of recent events ---

So, what can I do to fix that?
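
To see at a glance which objects the scrub complains about, the `log [ERR]` lines in the osd.24.log excerpt above can be grouped by PG and object. This is only a rough sketch, assuming the two line shapes seen in that excerpt (`<pg> osd.N inconsistent snapcolls on <object>` and `<pg> osd.N: soid <object>`); the regular expression and the function name `summarize_scrub_errors` are my own guesses for illustration, not anything Ceph provides.

```python
import re
from collections import defaultdict

# Assumption: only the two per-OSD error shapes from the excerpt are matched;
# other [ERR] lines (such as the "deep-scrub ..." summary) are ignored.
ERR_RE = re.compile(
    r"log \[ERR\] : (?P<pg>[0-9a-f]+\.[0-9a-f]+) (?P<osd>osd\.\d+):? "
    r"(?:soid |inconsistent snapcolls on )(?P<obj>\S+)"
)


def summarize_scrub_errors(lines):
    """Map (pg, object) to the set of OSDs that reported an error for it."""
    by_object = defaultdict(set)
    for line in lines:
        match = ERR_RE.search(line)
        if match:
            key = (match.group("pg"), match.group("obj"))
            by_object[key].add(match.group("osd"))
    return dict(by_object)


# Shortened copies of three lines from the excerpt.
sample_lines = [
    "-8> 2013-04-16 08:08:54.991906 7f5bb4569700 0 log [ERR] : 3.7c osd.24 "
    "inconsistent snapcolls on 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 "
    "found expected 12d7",
    "-6> 2013-04-16 08:08:54.991915 7f5bb4569700 0 log [ERR] : 3.7c osd.13: "
    "soid 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 size 4194304 != "
    "known size 0, digest 1360833101 != known digest 0",
    "-3> 2013-04-16 08:08:54.991986 7f5bb4569700 0 log [ERR] : deep-scrub 3.7c "
    "7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 on disk size (0) does "
    "not match object info size (4194304)",
]

for (pg, obj), osds in summarize_scrub_errors(sample_lines).items():
    print(pg, obj, sorted(osds))
```

On the excerpt above, every matched line points at the same object in PG 3.7c, so the summary would collapse to a single entry listing the OSDs involved.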