Re: pg has invalid (post-split) stats; must scrub before tier agent can activate

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



I wanted to report back what the solution was to this problem.  It appears
like I was running into this bug:

http://tracker.ceph.com/issues/16113


After running 'ceph osd unset sortbitwise' all the unfound objects were
found!  Which makes me happy again.  :)

Bryan

On 5/24/16, 4:27 PM, "Stillwell, Bryan J" <Bryan.Stillwell@xxxxxxxxxxx>
wrote:

>On one of my test clusters that I¹ve upgraded from Infernalis to Jewel
>(10.2.1), and I¹m having a problem where reads are resulting in unfound
>objects.
>
>I¹m using cephfs on top of a erasure coded pool with cache tiering which I
>believe is related.
>
>From what I can piece together, here is what the sequence of events looks
>like:
>
>Try doing an md5sum on all files in a directory:
>
>$ date
>Tue May 24 16:06:01 MDT 2016
>$ md5sum *
>
>
>Shortly afterward I see this in the logs:
>
>2016-05-24 16:06:20.406701 mon.0 172.24.88.20:6789/0 222796 : cluster
>[INF] osd.24 172.24.88.54:6814/26253 failed (2 reporters from different
>host after 21.000162 >= grace 20.000000)
>2016-05-24 16:06:22.626169 mon.0 172.24.88.20:6789/0 222805 : cluster
>[INF] osd.24 172.24.88.54:6813/21502 boot
>
>2016-05-24 16:06:22.760512 mon.0 172.24.88.20:6789/0 222809 : cluster
>[INF] osd.21 172.24.88.56:6828/26011 failed (2 reporters from different
>host after 21.000314 >= grace 20.000000)
>2016-05-24 16:06:24.980100 osd.23 172.24.88.54:6803/15322 120 : cluster
>[WRN] pg 4.3d has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:06:24.935090 osd.16 172.24.88.56:6824/25830 8 : cluster
>[WRN] pg 4.2e has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:06:25.023810 osd.16 172.24.88.56:6824/25830 9 : cluster
>[WRN] pg 4.15 has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:06:25.063645 osd.16 172.24.88.56:6824/25830 10 : cluster
>[WRN] pg 4.21 has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:06:25.326786 osd.16 172.24.88.56:6824/25830 11 : cluster
>[WRN] pg 4.3e has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:06:25.887230 osd.26 172.24.88.56:6808/10047 56 : cluster
>[WRN] pg 4.f has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:06:31.413173 osd.12 172.24.88.56:6820/3496 509 : cluster
>[WRN] pg 4.a has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:06:24.758508 osd.25 172.24.88.54:6801/25977 34 : cluster
>[WRN] pg 4.11 has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:06:25.307177 osd.24 172.24.88.54:6813/21502 1 : cluster
>[WRN] pg 4.13 has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:06:25.061032 osd.18 172.24.88.20:6806/23166 65 : cluster
>[WRN] pg 4.3 has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:06:25.216812 osd.22 172.24.88.20:6816/32656 24 : cluster
>[WRN] pg 4.12 has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:07:15.393004 mon.0 172.24.88.20:6789/0 222885 : cluster
>[INF] osd.21 172.24.88.56:6800/27171 boot
>2016-05-24 16:07:30.986037 osd.12 172.24.88.56:6820/3496 510 : cluster
>[WRN] pg 4.a has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:07:30.606189 osd.24 172.24.88.54:6813/21502 2 : cluster
>[WRN] pg 4.13 has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:07:30.011658 osd.22 172.24.88.20:6816/32656 27 : cluster
>[WRN] pg 4.12 has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:07:30.744757 osd.18 172.24.88.20:6806/23166 66 : cluster
>[WRN] pg 4.3 has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:07:30.160872 osd.23 172.24.88.54:6803/15322 121 : cluster
>[WRN] pg 4.3d has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:07:30.945012 osd.21 172.24.88.56:6800/27171 2 : cluster
>[WRN] pg 4.11 has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:07:30.974664 osd.21 172.24.88.56:6800/27171 3 : cluster
>[WRN] pg 4.21 has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:07:30.978548 osd.21 172.24.88.56:6800/27171 4 : cluster
>[WRN] pg 4.2e has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:07:32.394111 osd.21 172.24.88.56:6800/27171 5 : cluster
>[WRN] pg 4.f has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:07:29.828650 osd.16 172.24.88.56:6824/25830 12 : cluster
>[WRN] pg 4.3e has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:07:30.024493 osd.16 172.24.88.56:6824/25830 13 : cluster
>[WRN] pg 4.15 has invalid (post-split) stats; must scrub before tier agent
>can activate
>
>
>
>
>Then I see the following in ceph health detail:
>
>pg 4.2e is active+degraded, acting [21,16,13], 1 unfound
>pg 4.13 is active+degraded, acting [24,17,21], 1 unfound
>pg 3.6 is active+degraded, acting [18,14,21], 1 unfound
>
>
>
>The relevant logs on osd.24 appear to be:
>
>     0> 2016-05-24 16:06:01.365111 7fa1753e1700 -1 osd/ReplicatedPG.cc: In
>function 'void ReplicatedPG::hit_set_trim(ReplicatedPG::OpContextUPtr&,
>unsigned int)' thread 7fa1753e1700 time 2016-05-24 16:06:01.361148
>osd/ReplicatedPG.cc: 11757: FAILED assert(obc)
>
> ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>const*)+0x8b) [0x55ef2b835e2b]
> 2: (ReplicatedPG::hit_set_trim(std::unique_ptr<ReplicatedPG::OpContext,
>std::default_delete<ReplicatedPG::OpContext> >&, unsigned int)+0x54e)
>[0x55ef2b33dd2e]
> 3: (ReplicatedPG::hit_set_persist()+0xcb2) [0x55ef2b340262]
> 4: (ReplicatedPG::do_op(std::shared_ptr<OpRequest>&)+0x1f0e)
>[0x55ef2b35ed9e]
> 5: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
>ThreadPool::TPHandle&)+0x70c) [0x55ef2b31c92c]
> 6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>,
>ThreadPool::TPHandle&)+0x3f5) [0x55ef2b1db285]
> 7: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x5d)
>[0x55ef2b1db4ad]
> 8: (OSD::ShardedOpWQ::_process(unsigned int,
>ceph::heartbeat_handle_d*)+0x869) [0x55ef2b1dfec9]
> 9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x885)
>[0x55ef2b826195]
> 10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55ef2b8280d0]
> 11: (()+0x8182) [0x7fa1990dd182]
> 12: (clone()+0x6d) [0x7fa19720b47d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
>to interpret this.
>
>
>2016-05-24 16:06:01.422963 7fa1753e1700 -1 *** Caught signal (Aborted) **
> in thread 7fa1753e1700 thread_name:tp_osd_tp
>
> ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)
> 1: (()+0x8eac12) [0x55ef2b73ec12]
> 2: (()+0x10340) [0x7fa1990e5340]
> 3: (gsignal()+0x39) [0x7fa197147cc9]
> 4: (abort()+0x148) [0x7fa19714b0d8]
> 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>const*)+0x265) [0x55ef2b836005]
> 6: (ReplicatedPG::hit_set_trim(std::unique_ptr<ReplicatedPG::OpContext,
>std::default_delete<ReplicatedPG::OpContext> >&, unsigned int)+0x54e)
>[0x55ef2b33dd2e]
> 7: (ReplicatedPG::hit_set_persist()+0xcb2) [0x55ef2b340262]
> 8: (ReplicatedPG::do_op(std::shared_ptr<OpRequest>&)+0x1f0e)
>[0x55ef2b35ed9e]
> 9: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
>ThreadPool::TPHandle&)+0x70c) [0x55ef2b31c92c]
> 10: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
>std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f5) [0x55ef2b1db285]
> 11: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x5d)
>[0x55ef2b1db4ad]
> 12: (OSD::ShardedOpWQ::_process(unsigned int,
>ceph::heartbeat_handle_d*)+0x869) [0x55ef2b1dfec9]
> 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x885)
>[0x55ef2b826195]
> 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55ef2b8280d0]
> 15: (()+0x8182) [0x7fa1990dd182]
> 16: (clone()+0x6d) [0x7fa19720b47d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
>to interpret this.
>
>
>
>The relevant logs on osd.21 appear to be:
>
>     0> 2016-05-24 16:06:04.627770 7f3c87965700 -1 osd/ReplicatedPG.cc: In
>function 'void ReplicatedPG::hit_set_trim(ReplicatedPG::OpContextUPtr&,
>unsigned int)' thread 7f3c87965700 time 2016-05-24 16:06:04.620212
>osd/ReplicatedPG.cc: 11757: FAILED assert(obc)
>
> ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>const*)+0x8b) [0x556e01cf9e2b]
> 2: (ReplicatedPG::hit_set_trim(std::unique_ptr<ReplicatedPG::OpContext,
>std::default_delete<ReplicatedPG::OpContext> >&, unsigned int)+0x54e)
>[0x556e01801d2e]
> 3: (ReplicatedPG::hit_set_persist()+0xcb2) [0x556e01804262]
> 4: (ReplicatedPG::do_op(std::shared_ptr<OpRequest>&)+0x1f0e)
>[0x556e01822d9e]
> 5: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
>ThreadPool::TPHandle&)+0x70c) [0x556e017e092c]
> 6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>,
>ThreadPool::TPHandle&)+0x3f5) [0x556e0169f285]
> 7: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x5d)
>[0x556e0169f4ad]
> 8: (OSD::ShardedOpWQ::_process(unsigned int,
>ceph::heartbeat_handle_d*)+0x869) [0x556e016a3ec9]
> 9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x885)
>[0x556e01cea195]
> 10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x556e01cec0d0]
> 11: (()+0x8182) [0x7f3caac76182]
> 12: (clone()+0x6d) [0x7f3ca8da447d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
>to interpret this.
>
>
>
>2016-05-24 16:06:04.884712 7f3c87965700 -1 *** Caught signal (Aborted) **
> in thread 7f3c87965700 thread_name:tp_osd_tp
>
> ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)
> 1: (()+0x8eac12) [0x556e01c02c12]
> 2: (()+0x10340) [0x7f3caac7e340]
> 3: (gsignal()+0x39) [0x7f3ca8ce0cc9]
> 4: (abort()+0x148) [0x7f3ca8ce40d8]
> 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>const*)+0x265) [0x556e01cfa005]
> 6: (ReplicatedPG::hit_set_trim(std::unique_ptr<ReplicatedPG::OpContext,
>std::default_delete<ReplicatedPG::OpContext> >&, unsigned int)+0x54e)
>[0x556e01801d2e]
> 7: (ReplicatedPG::hit_set_persist()+0xcb2) [0x556e01804262]
> 8: (ReplicatedPG::do_op(std::shared_ptr<OpRequest>&)+0x1f0e)
>[0x556e01822d9e]
> 9: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
>ThreadPool::TPHandle&)+0x70c) [0x556e017e092c]
> 10: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
>std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f5) [0x556e0169f285]
> 11: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x5d)
>[0x556e0169f4ad]
> 12: (OSD::ShardedOpWQ::_process(unsigned int,
>ceph::heartbeat_handle_d*)+0x869) [0x556e016a3ec9]
> 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x885)
>[0x556e01cea195]
> 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x556e01cec0d0]
> 15: (()+0x8182) [0x7f3caac76182]
> 16: (clone()+0x6d) [0x7f3ca8da447d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
>to interpret this.
>
>
>
>Any ideas what might be going on?
>
>Thanks,
>Bryan
>
>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




[Index of Archives]     [Information on CEPH]     [Linux Filesystem Development]     [Ceph Development]     [Ceph Large]     [Linux USB Development]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [xfs]


  Powered by Linux