On one of my test clusters that I¹ve upgraded from Infernalis to Jewel (10.2.1), and I¹m having a problem where reads are resulting in unfound objects. I¹m using cephfs on top of a erasure coded pool with cache tiering which I believe is related. >From what I can piece together, here is what the sequence of events looks like: Try doing an md5sum on all files in a directory: $ date Tue May 24 16:06:01 MDT 2016 $ md5sum * Shortly afterward I see this in the logs: 2016-05-24 16:06:20.406701 mon.0 172.24.88.20:6789/0 222796 : cluster [INF] osd.24 172.24.88.54:6814/26253 failed (2 reporters from different host after 21.000162 >= grace 20.000000) 2016-05-24 16:06:22.626169 mon.0 172.24.88.20:6789/0 222805 : cluster [INF] osd.24 172.24.88.54:6813/21502 boot 2016-05-24 16:06:22.760512 mon.0 172.24.88.20:6789/0 222809 : cluster [INF] osd.21 172.24.88.56:6828/26011 failed (2 reporters from different host after 21.000314 >= grace 20.000000) 2016-05-24 16:06:24.980100 osd.23 172.24.88.54:6803/15322 120 : cluster [WRN] pg 4.3d has invalid (post-split) stats; must scrub before tier agent can activate 2016-05-24 16:06:24.935090 osd.16 172.24.88.56:6824/25830 8 : cluster [WRN] pg 4.2e has invalid (post-split) stats; must scrub before tier agent can activate 2016-05-24 16:06:25.023810 osd.16 172.24.88.56:6824/25830 9 : cluster [WRN] pg 4.15 has invalid (post-split) stats; must scrub before tier agent can activate 2016-05-24 16:06:25.063645 osd.16 172.24.88.56:6824/25830 10 : cluster [WRN] pg 4.21 has invalid (post-split) stats; must scrub before tier agent can activate 2016-05-24 16:06:25.326786 osd.16 172.24.88.56:6824/25830 11 : cluster [WRN] pg 4.3e has invalid (post-split) stats; must scrub before tier agent can activate 2016-05-24 16:06:25.887230 osd.26 172.24.88.56:6808/10047 56 : cluster [WRN] pg 4.f has invalid (post-split) stats; must scrub before tier agent can activate 2016-05-24 16:06:31.413173 osd.12 172.24.88.56:6820/3496 509 : cluster [WRN] pg 4.a has invalid (post-split) stats; must scrub before tier agent can activate 2016-05-24 16:06:24.758508 osd.25 172.24.88.54:6801/25977 34 : cluster [WRN] pg 4.11 has invalid (post-split) stats; must scrub before tier agent can activate 2016-05-24 16:06:25.307177 osd.24 172.24.88.54:6813/21502 1 : cluster [WRN] pg 4.13 has invalid (post-split) stats; must scrub before tier agent can activate 2016-05-24 16:06:25.061032 osd.18 172.24.88.20:6806/23166 65 : cluster [WRN] pg 4.3 has invalid (post-split) stats; must scrub before tier agent can activate 2016-05-24 16:06:25.216812 osd.22 172.24.88.20:6816/32656 24 : cluster [WRN] pg 4.12 has invalid (post-split) stats; must scrub before tier agent can activate 2016-05-24 16:07:15.393004 mon.0 172.24.88.20:6789/0 222885 : cluster [INF] osd.21 172.24.88.56:6800/27171 boot 2016-05-24 16:07:30.986037 osd.12 172.24.88.56:6820/3496 510 : cluster [WRN] pg 4.a has invalid (post-split) stats; must scrub before tier agent can activate 2016-05-24 16:07:30.606189 osd.24 172.24.88.54:6813/21502 2 : cluster [WRN] pg 4.13 has invalid (post-split) stats; must scrub before tier agent can activate 2016-05-24 16:07:30.011658 osd.22 172.24.88.20:6816/32656 27 : cluster [WRN] pg 4.12 has invalid (post-split) stats; must scrub before tier agent can activate 2016-05-24 16:07:30.744757 osd.18 172.24.88.20:6806/23166 66 : cluster [WRN] pg 4.3 has invalid (post-split) stats; must scrub before tier agent can activate 2016-05-24 16:07:30.160872 osd.23 172.24.88.54:6803/15322 121 : cluster [WRN] pg 4.3d has invalid (post-split) stats; must scrub before tier agent can activate 2016-05-24 16:07:30.945012 osd.21 172.24.88.56:6800/27171 2 : cluster [WRN] pg 4.11 has invalid (post-split) stats; must scrub before tier agent can activate 2016-05-24 16:07:30.974664 osd.21 172.24.88.56:6800/27171 3 : cluster [WRN] pg 4.21 has invalid (post-split) stats; must scrub before tier agent can activate 2016-05-24 16:07:30.978548 osd.21 172.24.88.56:6800/27171 4 : cluster [WRN] pg 4.2e has invalid (post-split) stats; must scrub before tier agent can activate 2016-05-24 16:07:32.394111 osd.21 172.24.88.56:6800/27171 5 : cluster [WRN] pg 4.f has invalid (post-split) stats; must scrub before tier agent can activate 2016-05-24 16:07:29.828650 osd.16 172.24.88.56:6824/25830 12 : cluster [WRN] pg 4.3e has invalid (post-split) stats; must scrub before tier agent can activate 2016-05-24 16:07:30.024493 osd.16 172.24.88.56:6824/25830 13 : cluster [WRN] pg 4.15 has invalid (post-split) stats; must scrub before tier agent can activate Then I see the following in ceph health detail: pg 4.2e is active+degraded, acting [21,16,13], 1 unfound pg 4.13 is active+degraded, acting [24,17,21], 1 unfound pg 3.6 is active+degraded, acting [18,14,21], 1 unfound The relevant logs on osd.24 appear to be: 0> 2016-05-24 16:06:01.365111 7fa1753e1700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::hit_set_trim(ReplicatedPG::OpContextUPtr&, unsigned int)' thread 7fa1753e1700 time 2016-05-24 16:06:01.361148 osd/ReplicatedPG.cc: 11757: FAILED assert(obc) ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x55ef2b835e2b] 2: (ReplicatedPG::hit_set_trim(std::unique_ptr<ReplicatedPG::OpContext, std::default_delete<ReplicatedPG::OpContext> >&, unsigned int)+0x54e) [0x55ef2b33dd2e] 3: (ReplicatedPG::hit_set_persist()+0xcb2) [0x55ef2b340262] 4: (ReplicatedPG::do_op(std::shared_ptr<OpRequest>&)+0x1f0e) [0x55ef2b35ed9e] 5: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x70c) [0x55ef2b31c92c] 6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f5) [0x55ef2b1db285] 7: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x5d) [0x55ef2b1db4ad] 8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x869) [0x55ef2b1dfec9] 9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x885) [0x55ef2b826195] 10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55ef2b8280d0] 11: (()+0x8182) [0x7fa1990dd182] 12: (clone()+0x6d) [0x7fa19720b47d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 2016-05-24 16:06:01.422963 7fa1753e1700 -1 *** Caught signal (Aborted) ** in thread 7fa1753e1700 thread_name:tp_osd_tp ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269) 1: (()+0x8eac12) [0x55ef2b73ec12] 2: (()+0x10340) [0x7fa1990e5340] 3: (gsignal()+0x39) [0x7fa197147cc9] 4: (abort()+0x148) [0x7fa19714b0d8] 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x265) [0x55ef2b836005] 6: (ReplicatedPG::hit_set_trim(std::unique_ptr<ReplicatedPG::OpContext, std::default_delete<ReplicatedPG::OpContext> >&, unsigned int)+0x54e) [0x55ef2b33dd2e] 7: (ReplicatedPG::hit_set_persist()+0xcb2) [0x55ef2b340262] 8: (ReplicatedPG::do_op(std::shared_ptr<OpRequest>&)+0x1f0e) [0x55ef2b35ed9e] 9: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x70c) [0x55ef2b31c92c] 10: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f5) [0x55ef2b1db285] 11: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x5d) [0x55ef2b1db4ad] 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x869) [0x55ef2b1dfec9] 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x885) [0x55ef2b826195] 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55ef2b8280d0] 15: (()+0x8182) [0x7fa1990dd182] 16: (clone()+0x6d) [0x7fa19720b47d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. The relevant logs on osd.21 appear to be: 0> 2016-05-24 16:06:04.627770 7f3c87965700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::hit_set_trim(ReplicatedPG::OpContextUPtr&, unsigned int)' thread 7f3c87965700 time 2016-05-24 16:06:04.620212 osd/ReplicatedPG.cc: 11757: FAILED assert(obc) ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x556e01cf9e2b] 2: (ReplicatedPG::hit_set_trim(std::unique_ptr<ReplicatedPG::OpContext, std::default_delete<ReplicatedPG::OpContext> >&, unsigned int)+0x54e) [0x556e01801d2e] 3: (ReplicatedPG::hit_set_persist()+0xcb2) [0x556e01804262] 4: (ReplicatedPG::do_op(std::shared_ptr<OpRequest>&)+0x1f0e) [0x556e01822d9e] 5: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x70c) [0x556e017e092c] 6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f5) [0x556e0169f285] 7: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x5d) [0x556e0169f4ad] 8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x869) [0x556e016a3ec9] 9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x885) [0x556e01cea195] 10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x556e01cec0d0] 11: (()+0x8182) [0x7f3caac76182] 12: (clone()+0x6d) [0x7f3ca8da447d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 2016-05-24 16:06:04.884712 7f3c87965700 -1 *** Caught signal (Aborted) ** in thread 7f3c87965700 thread_name:tp_osd_tp ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269) 1: (()+0x8eac12) [0x556e01c02c12] 2: (()+0x10340) [0x7f3caac7e340] 3: (gsignal()+0x39) [0x7f3ca8ce0cc9] 4: (abort()+0x148) [0x7f3ca8ce40d8] 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x265) [0x556e01cfa005] 6: (ReplicatedPG::hit_set_trim(std::unique_ptr<ReplicatedPG::OpContext, std::default_delete<ReplicatedPG::OpContext> >&, unsigned int)+0x54e) [0x556e01801d2e] 7: (ReplicatedPG::hit_set_persist()+0xcb2) [0x556e01804262] 8: (ReplicatedPG::do_op(std::shared_ptr<OpRequest>&)+0x1f0e) [0x556e01822d9e] 9: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x70c) [0x556e017e092c] 10: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f5) [0x556e0169f285] 11: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x5d) [0x556e0169f4ad] 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x869) [0x556e016a3ec9] 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x885) [0x556e01cea195] 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x556e01cec0d0] 15: (()+0x8182) [0x7f3caac76182] 16: (clone()+0x6d) [0x7f3ca8da447d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. Any ideas what might be going on? Thanks, Bryan _______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com