pg has invalid (post-split) stats; must scrub before tier agent can activate

On one of my test clusters that I've upgraded from Infernalis to Jewel
(10.2.1), I'm having a problem where reads are resulting in unfound
objects.

I'm using CephFS on top of an erasure-coded pool with cache tiering, which
I believe is related.
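
For context, the cache tier was set up more or less the standard way from
the docs. The commands were roughly along these lines (pool names, pg
counts, and hit_set values here are illustrative placeholders, not my exact
settings):

$ ceph osd pool create cephfs-ec-data 128 128 erasure
$ ceph osd pool create cephfs-cache 128 128
$ ceph osd tier add cephfs-ec-data cephfs-cache
$ ceph osd tier cache-mode cephfs-cache writeback
$ ceph osd tier set-overlay cephfs-ec-data cephfs-cache
$ ceph osd pool set cephfs-cache hit_set_type bloom
$ ceph osd pool set cephfs-cache hit_set_count 4
$ ceph osd pool set cephfs-cache hit_set_period 1200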

From what I can piece together, here is what the sequence of events looks
like:

I try running md5sum on all of the files in a directory:

$ date
Tue May 24 16:06:01 MDT 2016
$ md5sum *


Shortly afterward I see this in the logs:

2016-05-24 16:06:20.406701 mon.0 172.24.88.20:6789/0 222796 : cluster
[INF] osd.24 172.24.88.54:6814/26253 failed (2 reporters from different
host after 21.000162 >= grace 20.000000)
2016-05-24 16:06:22.626169 mon.0 172.24.88.20:6789/0 222805 : cluster
[INF] osd.24 172.24.88.54:6813/21502 boot

2016-05-24 16:06:22.760512 mon.0 172.24.88.20:6789/0 222809 : cluster
[INF] osd.21 172.24.88.56:6828/26011 failed (2 reporters from different
host after 21.000314 >= grace 20.000000)
2016-05-24 16:06:24.980100 osd.23 172.24.88.54:6803/15322 120 : cluster
[WRN] pg 4.3d has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:06:24.935090 osd.16 172.24.88.56:6824/25830 8 : cluster
[WRN] pg 4.2e has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:06:25.023810 osd.16 172.24.88.56:6824/25830 9 : cluster
[WRN] pg 4.15 has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:06:25.063645 osd.16 172.24.88.56:6824/25830 10 : cluster
[WRN] pg 4.21 has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:06:25.326786 osd.16 172.24.88.56:6824/25830 11 : cluster
[WRN] pg 4.3e has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:06:25.887230 osd.26 172.24.88.56:6808/10047 56 : cluster
[WRN] pg 4.f has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:06:31.413173 osd.12 172.24.88.56:6820/3496 509 : cluster
[WRN] pg 4.a has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:06:24.758508 osd.25 172.24.88.54:6801/25977 34 : cluster
[WRN] pg 4.11 has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:06:25.307177 osd.24 172.24.88.54:6813/21502 1 : cluster
[WRN] pg 4.13 has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:06:25.061032 osd.18 172.24.88.20:6806/23166 65 : cluster
[WRN] pg 4.3 has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:06:25.216812 osd.22 172.24.88.20:6816/32656 24 : cluster
[WRN] pg 4.12 has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:07:15.393004 mon.0 172.24.88.20:6789/0 222885 : cluster
[INF] osd.21 172.24.88.56:6800/27171 boot
2016-05-24 16:07:30.986037 osd.12 172.24.88.56:6820/3496 510 : cluster
[WRN] pg 4.a has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:07:30.606189 osd.24 172.24.88.54:6813/21502 2 : cluster
[WRN] pg 4.13 has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:07:30.011658 osd.22 172.24.88.20:6816/32656 27 : cluster
[WRN] pg 4.12 has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:07:30.744757 osd.18 172.24.88.20:6806/23166 66 : cluster
[WRN] pg 4.3 has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:07:30.160872 osd.23 172.24.88.54:6803/15322 121 : cluster
[WRN] pg 4.3d has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:07:30.945012 osd.21 172.24.88.56:6800/27171 2 : cluster
[WRN] pg 4.11 has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:07:30.974664 osd.21 172.24.88.56:6800/27171 3 : cluster
[WRN] pg 4.21 has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:07:30.978548 osd.21 172.24.88.56:6800/27171 4 : cluster
[WRN] pg 4.2e has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:07:32.394111 osd.21 172.24.88.56:6800/27171 5 : cluster
[WRN] pg 4.f has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:07:29.828650 osd.16 172.24.88.56:6824/25830 12 : cluster
[WRN] pg 4.3e has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:07:30.024493 osd.16 172.24.88.56:6824/25830 13 : cluster
[WRN] pg 4.15 has invalid (post-split) stats; must scrub before tier agent
can activate
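
Presumably these warnings mean the affected PGs need to be scrubbed before
the tier agent will run again, i.e. something like this for each pg listed
above (deep-scrub being the heavier option):

$ ceph pg scrub 4.3d
$ ceph pg deep-scrub 4.3d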




Then I see the following in ceph health detail:

pg 4.2e is active+degraded, acting [21,16,13], 1 unfound
pg 4.13 is active+degraded, acting [24,17,21], 1 unfound
pg 3.6 is active+degraded, acting [18,14,21], 1 unfound
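
If I understand the unfound-objects troubleshooting docs correctly, the
individual missing objects can be listed per PG with something like:

$ ceph pg 4.2e list_missing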



The relevant logs on osd.24 appear to be:

     0> 2016-05-24 16:06:01.365111 7fa1753e1700 -1 osd/ReplicatedPG.cc: In
function 'void ReplicatedPG::hit_set_trim(ReplicatedPG::OpContextUPtr&,
unsigned int)' thread 7fa1753e1700 time 2016-05-24 16:06:01.361148
osd/ReplicatedPG.cc: 11757: FAILED assert(obc)

 ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x8b) [0x55ef2b835e2b]
 2: (ReplicatedPG::hit_set_trim(std::unique_ptr<ReplicatedPG::OpContext,
std::default_delete<ReplicatedPG::OpContext> >&, unsigned int)+0x54e)
[0x55ef2b33dd2e]
 3: (ReplicatedPG::hit_set_persist()+0xcb2) [0x55ef2b340262]
 4: (ReplicatedPG::do_op(std::shared_ptr<OpRequest>&)+0x1f0e)
[0x55ef2b35ed9e]
 5: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
ThreadPool::TPHandle&)+0x70c) [0x55ef2b31c92c]
 6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>,
ThreadPool::TPHandle&)+0x3f5) [0x55ef2b1db285]
 7: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x5d)
[0x55ef2b1db4ad]
 8: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x869) [0x55ef2b1dfec9]
 9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x885)
[0x55ef2b826195]
 10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55ef2b8280d0]
 11: (()+0x8182) [0x7fa1990dd182]
 12: (clone()+0x6d) [0x7fa19720b47d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.


2016-05-24 16:06:01.422963 7fa1753e1700 -1 *** Caught signal (Aborted) **
 in thread 7fa1753e1700 thread_name:tp_osd_tp

 ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)
 1: (()+0x8eac12) [0x55ef2b73ec12]
 2: (()+0x10340) [0x7fa1990e5340]
 3: (gsignal()+0x39) [0x7fa197147cc9]
 4: (abort()+0x148) [0x7fa19714b0d8]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x265) [0x55ef2b836005]
 6: (ReplicatedPG::hit_set_trim(std::unique_ptr<ReplicatedPG::OpContext,
std::default_delete<ReplicatedPG::OpContext> >&, unsigned int)+0x54e)
[0x55ef2b33dd2e]
 7: (ReplicatedPG::hit_set_persist()+0xcb2) [0x55ef2b340262]
 8: (ReplicatedPG::do_op(std::shared_ptr<OpRequest>&)+0x1f0e)
[0x55ef2b35ed9e]
 9: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
ThreadPool::TPHandle&)+0x70c) [0x55ef2b31c92c]
 10: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f5) [0x55ef2b1db285]
 11: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x5d)
[0x55ef2b1db4ad]
 12: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x869) [0x55ef2b1dfec9]
 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x885)
[0x55ef2b826195]
 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55ef2b8280d0]
 15: (()+0x8182) [0x7fa1990dd182]
 16: (clone()+0x6d) [0x7fa19720b47d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.



The relevant logs on osd.21 appear to be:

     0> 2016-05-24 16:06:04.627770 7f3c87965700 -1 osd/ReplicatedPG.cc: In
function 'void ReplicatedPG::hit_set_trim(ReplicatedPG::OpContextUPtr&,
unsigned int)' thread 7f3c87965700 time 2016-05-24 16:06:04.620212
osd/ReplicatedPG.cc: 11757: FAILED assert(obc)

 ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x8b) [0x556e01cf9e2b]
 2: (ReplicatedPG::hit_set_trim(std::unique_ptr<ReplicatedPG::OpContext,
std::default_delete<ReplicatedPG::OpContext> >&, unsigned int)+0x54e)
[0x556e01801d2e]
 3: (ReplicatedPG::hit_set_persist()+0xcb2) [0x556e01804262]
 4: (ReplicatedPG::do_op(std::shared_ptr<OpRequest>&)+0x1f0e)
[0x556e01822d9e]
 5: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
ThreadPool::TPHandle&)+0x70c) [0x556e017e092c]
 6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>,
ThreadPool::TPHandle&)+0x3f5) [0x556e0169f285]
 7: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x5d)
[0x556e0169f4ad]
 8: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x869) [0x556e016a3ec9]
 9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x885)
[0x556e01cea195]
 10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x556e01cec0d0]
 11: (()+0x8182) [0x7f3caac76182]
 12: (clone()+0x6d) [0x7f3ca8da447d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.



2016-05-24 16:06:04.884712 7f3c87965700 -1 *** Caught signal (Aborted) **
 in thread 7f3c87965700 thread_name:tp_osd_tp

 ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)
 1: (()+0x8eac12) [0x556e01c02c12]
 2: (()+0x10340) [0x7f3caac7e340]
 3: (gsignal()+0x39) [0x7f3ca8ce0cc9]
 4: (abort()+0x148) [0x7f3ca8ce40d8]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x265) [0x556e01cfa005]
 6: (ReplicatedPG::hit_set_trim(std::unique_ptr<ReplicatedPG::OpContext,
std::default_delete<ReplicatedPG::OpContext> >&, unsigned int)+0x54e)
[0x556e01801d2e]
 7: (ReplicatedPG::hit_set_persist()+0xcb2) [0x556e01804262]
 8: (ReplicatedPG::do_op(std::shared_ptr<OpRequest>&)+0x1f0e)
[0x556e01822d9e]
 9: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
ThreadPool::TPHandle&)+0x70c) [0x556e017e092c]
 10: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f5) [0x556e0169f285]
 11: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x5d)
[0x556e0169f4ad]
 12: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x869) [0x556e016a3ec9]
 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x885)
[0x556e01cea195]
 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x556e01cec0d0]
 15: (()+0x8182) [0x7f3caac76182]
 16: (clone()+0x6d) [0x7f3ca8da447d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.
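
One thing I haven't been able to confirm is whether the unfound objects
above are actual file data or the cache tier's hit-set archive objects that
hit_set_trim() was trying to remove when it hit the FAILED assert(obc). If
the hit-set archives follow the naming I think they do, that should show up
in the list_missing output, e.g. (the object-name pattern is a guess on my
part):

$ ceph pg 4.2e list_missing | grep -i hit_set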



Any ideas what might be going on?

Thanks,
Bryan





