Re: Ceph Luminous - pg is down due to src/osd/SnapMapper.cc: 246: FAILED assert(r == -2)

On 1/18/2018 5:24 PM, Sage Weil wrote:
On Thu, 18 Jan 2018, Stefan Priebe - Profihost AG wrote:
Hi Sage,

On 18.01.2018 at 14:16, Sage Weil wrote:
On Thu, 18 Jan 2018, Stefan Priebe - Profihost AG wrote:
Hi,

it also crashes in (marked with HERE):

int SnapMapper::get_snaps(
   const hobject_t &oid,
   object_snaps *out)
{
   assert(check(oid));
   set<string> keys;
   map<string, bufferlist> got;
   keys.insert(to_object_key(oid));
   int r = backend.get_keys(keys, &got);
   if (r < 0)
     return r;
   if (got.empty())
     return -ENOENT;
   if (out) {
     bufferlist::iterator bp = got.begin()->second.begin();
     ::decode(*out, bp);
     dout(20) << __func__ << " " << oid << " " << out->snaps << dendl;
     assert(!out->snaps.empty());            ########### HERE ###########
   } else {
     dout(20) << __func__ << " " << oid << " (out == NULL)" << dendl;
   }
   return 0;
}

Is it safe to comment out that assert?
I think so.  All of these asserts are just about ensuring the snapmapper
state is consistent, and its only purpose is to find snaps to trim.
Since your snapmapper metadata is clearly not consistent, there isn't a
whole lot of risk here.  You might want to set nosnaptrim for the time
being so you don't get a surprise if trimming kicks in.  The next step is
probably going to be to do a scrub and see what it finds, repairs, or
can't repair.
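
For illustration, a minimal sketch of what switching that assert to a warning in SnapMapper::get_snaps() could look like; returning -ENOENT for an empty snap list is an assumption here, not necessarily the upstream change:

   if (out) {
     bufferlist::iterator bp = got.begin()->second.begin();
     ::decode(*out, bp);
     dout(20) << __func__ << " " << oid << " " << out->snaps << dendl;
     if (out->snaps.empty()) {
       // was: assert(!out->snaps.empty());
       derr << __func__ << " " << oid << " empty snaps in snapmapper" << dendl;
       return -ENOENT;  // report "no usable mapping" instead of aborting the OSD
     }
   } else {
     dout(20) << __func__ << " " << oid << " (out == NULL)" << dendl;
   }
   return 0;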
Snap trimming works fine; I already trimmed and removed some snaps.

I was able to get the cluster into a state where everything is backfilled,
but enabling/running deep scrubs results in this one:
      0> 2018-01-18 13:00:54.843076 7fbd253ff700 -1
/build/ceph/src/osd/SnapMapper.cc: In function 'int
SnapMapper::get_snaps(const hobject_t&, SnapMapper::object_snaps*)'
thread 7fbd253ff700 time 2018-01-18 13:00:54.840396
/build/ceph/src/osd/SnapMapper.cc: 154: FAILED assert(!out->snaps.empty())
I think if you switch that assert to a warning it will repair...

We observed a similar issue with that assert as well.
Removing the assert partially resolves the issue.
Another segmentation fault was observed in PrimaryLogPG::find_object_assert() when accessing an empty snapshot list on object removal (specifically p->second.back() in the code below).
...
auto p = obc->ssc->snapset.clone_snaps.find(soid.snap);
assert(p != obc->ssc->snapset.clone_snaps.end());
first = p->second.back();
^^^^^^

This indicates (and is confirmed by our manual omap/attrs verification) that both the SnapMapper's and the per-object snapshot lists are empty. The current workaround is to check p->second.empty() and return an error from the function. This still prevents object removal (a cache tier flush in our case) but has fixed the frequent OSD crashes for now.
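
A minimal sketch of that workaround, assuming it is applied around the quoted lines and that -ENOENT is an acceptable error to hand back to the caller (both are assumptions, not verbatim source):

   auto p = obc->ssc->snapset.clone_snaps.find(soid.snap);
   // Workaround sketch: treat a missing or empty clone_snaps list as an
   // error for this request instead of asserting and taking the OSD down.
   if (p == obc->ssc->snapset.clone_snaps.end() || p->second.empty()) {
     derr << __func__ << " " << soid << " has no clone_snaps" << dendl;
     return -ENOENT;  // fail the op (e.g. the cache tier flush) but stay up
   }
   snapid_t first = p->second.back();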

sage

  ceph version 12.2.2-94-g92923ef
(92923ef323d32d8321e86703ce1f9016f19472fb) luminous (stable)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x102) [0x5561ce28d1f2]
  2: (SnapMapper::get_snaps(hobject_t const&,
SnapMapper::object_snaps*)+0x46b) [0x5561cdef755b]
  3: (SnapMapper::get_snaps(hobject_t const&, std::set<snapid_t,
std::less<snapid_t>, std::allocator<snapid_t> >*)+0xd7) [0x5561cdef7
697]
  4: (PG::_scan_snaps(ScrubMap&)+0x692) [0x5561cdd8ad22]
  5: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool,
unsigned int, ThreadPool::TPHandle&)+0x22f) [0x5561cdd8bc5f]
  6: (PG::replica_scrub(boost::intrusive_ptr<OpRequest>,
ThreadPool::TPHandle&)+0x624) [0x5561cdd8c584]
  7: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&,
ThreadPool::TPHandle&)+0x8b6) [0x5561cde4b326]
  8: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f7)
[0x5561cdcd7bc7]
  9: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest>
const&)+0x57) [0x5561cdf4f957]
  10: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x108c) [0x5561cdd06d1c]
  11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x88d)
[0x5561ce292e7d]
  12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5561ce294e40]
  13: (()+0x8064) [0x7fbd70d86064]
  14: (clone()+0x6d) [0x7fbd6fe7a62d]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.


Greets,
Stefan

sage


Stefan
On 17.01.2018 at 19:56, Sage Weil wrote:
On Wed, 17 Jan 2018, Stefan Priebe - Profihost AG wrote:
On 17.01.2018 at 19:48, Sage Weil wrote:
On Wed, 17 Jan 2018, Stefan Priebe - Profihost AG wrote:
Hi Sage,

this gives me another crash while that pg is recovering:

      0> 2018-01-17 19:25:09.328935 7f48f8fff700 -1
/build/ceph/src/osd/PrimaryLogPG.cc: In function 'virtual void
PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&,
ObjectContextRef, bool, ObjectStore::Transaction*)'
thread 7f48f8fff700 time 2018-01-17 19:25:09.322287
/build/ceph/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p != recovery_info.ss.clone_snaps.end())
Is this a cache tiering pool?
No, a normal 3-replica pool, but the PG is degraded:
ceph pg dump | grep 3.80e

3.80e      1709                  0     1579         0       0 6183674368
10014    10014 active+undersized+degraded+remapped+backfill_wait
2018-01-17 19:42:55.840884  918403'70041375  918403:77171331 [50,54,86]
        50       [39]             39  907737'69776430 2018-01-14
22:19:54.110864  907737'69776430 2018-01-14 22:19:54.110864
Hrm, no real clues on the root cause then.  Something like this will work
around the current assert:

diff --git a/src/osd/PrimaryLogPG.cc b/src/osd/PrimaryLogPG.cc
index d42f3a401b..0f76134f74 100644
--- a/src/osd/PrimaryLogPG.cc
+++ b/src/osd/PrimaryLogPG.cc
@@ -372,8 +372,11 @@ void PrimaryLogPG::on_local_recover(
      set<snapid_t> snaps;
      dout(20) << " snapset " << recovery_info.ss << dendl;
      auto p = recovery_info.ss.clone_snaps.find(hoid.snap);
-    assert(p != recovery_info.ss.clone_snaps.end());  // hmm, should we warn?
-    snaps.insert(p->second.begin(), p->second.end());
+    if (p == recovery_info.ss.clone_snaps.end()) {
+      derr << __func__ << " no clone_snaps for " << hoid << dendl;
+    } else {
+      snaps.insert(p->second.begin(), p->second.end());
+    }
      dout(20) << " snaps " << snaps << dendl;
      snap_mapper.add_oid(
        recovery_info.soid,


Stefan

s
  ceph version 12.2.2-94-g92923ef
(92923ef323d32d8321e86703ce1f9016f19472fb) luminous (stable)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x102) [0x55addb5eb1f2]
  2: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo
const&, std::shared_ptr<ObjectContext>, bool,
ObjectStore::Transaction*)+0x11f0) [0x55addb1957a0]
  3: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&,
PushReplyOp*, ObjectStore::Transaction*)+0x31d) [0x55addb3071ed]
  4: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x18f)
[0x55addb30748f]
  5:
(ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x2d1)
[0x55addb317531]
  6: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50)
[0x55addb23cf10]
  7: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&,
ThreadPool::TPHandle&)+0x77b) [0x55addb1a91eb]
  8: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f7)
[0x55addb035bc7]
  9: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest>
const&)+0x57) [0x55addb2ad957]
  10: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x108c) [0x55addb064d1c]
  11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x88d)
[0x55addb5f0e7d]
  12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55addb5f2e40]
  13: (()+0x8064) [0x7f4955b68064]
  14: (clone()+0x6d) [0x7f4954c5c62d]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.

Greets,
Stefan

On 17.01.2018 at 15:28, Sage Weil wrote:
On Wed, 17 Jan 2018, Stefan Priebe - Profihost AG wrote:
Hi,

Is there any chance to fix this instead of manually removing all the clones?
I believe you can avoid the immediate problem and get the PG up by
commenting out the assert.  set_snaps() will overwrite the object->snap
list mapping.
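
For illustration, a minimal sketch of that relaxation in SnapMapper::add_oid(), paraphrased rather than copied from the tree (the warning text is illustrative):

   object_snaps out;
   int r = get_snaps(oid, &out);
   if (r != -ENOENT) {
     // was: assert(r == -2);  // -2 == -ENOENT, i.e. "no existing mapping"
     derr << __func__ << " " << oid << " unexpected existing snap mapping,"
          << " overwriting" << dendl;
   }
   // set_snaps() below then overwrites the stale object->snaps entry.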

The problem is you'll probably still a stray snapid -> object mapping, so
when snaptrimming runs you might end up with a PG in the snaptrim_error
state that won't trim (although from a quick look at the code it won't
crash).  I'd probably remove the assert and deal with that if/when it
happens.

I'm adding a ticket to relax these asserts for production but keep them
enabled for qa.  This isn't something that needs to take down the OSD!

sage



Stefan

On 16.01.2018 at 23:24, Gregory Farnum wrote:
On Mon, Jan 15, 2018 at 5:23 PM, Stefan Priebe - Profihost AG
<s.priebe@xxxxxxxxxxxx> wrote:
Hello,

currently one of my clusters is missing a whole pg due to all 3 osds
being down.

All of them fail with:
     0> 2018-01-16 02:05:33.353293 7f944dbfe700 -1
/build/ceph/src/osd/SnapMapper.cc: In function 'void
SnapMapper::add_oid(const hobject_t&, const std::set<snapid_t>&,
MapCacher::Transaction<std::basic_string<char>, ceph::buffer::list>*)'
thread 7f944dbfe700 time 2018-01-16 02:05:33.349946
/build/ceph/src/osd/SnapMapper.cc: 246: FAILED assert(r == -2)

  ceph version 12.2.2-93-gd6da8d7
(d6da8d77a4b2220e6bdd61e4bdd911a9cd91946c) luminous (stable)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x102) [0x561f9ff0b1e2]
  2: (SnapMapper::add_oid(hobject_t const&, std::set<snapid_t,
std::less<snapid_t>, std::allocator<snapid_t> > const&,
MapCacher::Transaction<std::string, ceph::buffer::list>*)+0x64b)
[0x561f9fb76f3b]
  3: (PG::update_snap_map(std::vector<pg_log_entry_t,
std::allocator<pg_log_entry_t> > const&,
ObjectStore::Transaction&)+0x38f) [0x561f9fa0ae3f]
  4: (PG::append_log(std::vector<pg_log_entry_t,
std::allocator<pg_log_entry_t> > const&, eversion_t, eversion_t,
ObjectStore::Transaction&, bool)+0x538) [0x561f9fa31018]
  5: (PrimaryLogPG::log_operation(std::vector<pg_log_entry_t,
std::allocator<pg_log_entry_t> > const&,
boost::optional<pg_hit_set_history_t> const&, eversion_t const&,
eversion_t const&, bool, ObjectStore::Transaction&)+0x64) [0x561f9fb25d64]
  6: (ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)+0xa92)
[0x561f9fc314b2]
  7:
(ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x2a4)
[0x561f9fc374f4]
  8: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50)
[0x561f9fb5cf10]
  9: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&,
ThreadPool::TPHandle&)+0x77b) [0x561f9fac91eb]
  10: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f7)
[0x561f9f955bc7]
  11: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest>
const&)+0x57) [0x561f9fbcd947]
  12: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x108c) [0x561f9f984d1c]
  13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x88d)
[0x561f9ff10e6d]
  14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x561f9ff12e30]
  15: (()+0x8064) [0x7f949afcb064]
  16: (clone()+0x6d) [0x7f949a0bf62d]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
By the time it gets there, something else has gone wrong. The OSD is
adding a snapid/object pair to its "SnapMapper", and discovering that
there are already entries (which it thinks there shouldn't be).
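
For context, a paraphrased sketch (not verbatim source) of the check that fails: add_oid() expects get_snaps() to report that no mapping exists yet for the object being added:

   object_snaps out;
   int r = get_snaps(oid, &out);
   assert(r == -2);  // -ENOENT; this is the SnapMapper.cc:246 assert that fails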

You'll need to post more of a log, along with background, if anybody's
going to diagnose it: is there cache tiering on the cluster? What is
this pool used for? Were there other errors on this PG in the past?

I also notice a separate email about deleting the data; I don't have
any experience with this but you'd probably have to export the PG
using ceph-objectstore-tool and then find a way to delete the object
out of it. I see options to remove both an object and
"remove-clone-metadata" on a particular ID, but I've not used any of
them myself.
-Greg

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


