Re: OSDs crash after deleting unfound object in Luminous 12.2.8

Re-adding the list.

I'm glad to hear you got things back to a working state. One thing you might want to check is the hit_set_history in the PG data. If the missing hit sets are no longer in the history, then it is probably safe to go back to the normal builds, at least until you have to mark another hit set missing. :)  I think the code that removes the hit set from the PG data runs before that assert, so it's possible it was still removed from the history.

Mike

On Thu, Oct 18, 2018 at 9:11 AM Lawrence Smith <lawrence.smith@xxxxxxxxxxxxxxx> wrote:

Hi Mike,

Thanks a bunch for your writeup, that was exactly the problem and solution! All I did was comment out the assert and add an if (obc) { } check after it to make sure I don't run into a segfault, and now the cluster is healthy once again. I am not sure whether Ceph will report a byte-count mismatch while scrubbing due to the missing object, but I don't think so.

Anyway, I just wanted to thank you for your help!

Best wishes,

Lawrence


On 10/13/2018 02:00 AM, Mike Lovell wrote:
What was the object name that you marked lost? Was it one of the cache tier hit_sets?

The trace you have does seem to show it failing while the OSD is trying to remove a hit set that is no longer needed. I ran into a similar problem, which might be why the bug you listed was created. Maybe sharing what I have since discovered about hit sets will help.

The hit sets are what the cache tier uses to know which objects have been accessed in a given period of time. These hit sets are stored in the object store under a generated object name. For the version you're running, the code for that generation is at https://github.com/ceph/ceph/blob/v12.2.8/src/osd/PrimaryLogPG.cc#L12667. It's basically "hit_set_<pgid>_archive_<begin time>_<end time>", where the times are the ones recorded in the hit set history. That hit set history is stored as part of the PG metadata. You can get a list of all the hit sets a PG has by running 'ceph pg <pgid> query' and looking at the ['info']['hit_set_history']['history'] array. Each entry in that array describes one hit set for the PG, and its times are what is used to generate the object name. There should be one Ceph object for each hit set listed in that array.
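For illustration, here is a minimal standalone sketch of that naming scheme. It is not the actual Ceph generator (see the PrimaryLogPG.cc link for that), and the pgid and timestamps below are made-up examples; it assumes the begin/end strings are taken verbatim from the hit_set_history entries in the pg query output.

#include <iostream>
#include <sstream>
#include <string>

// build the expected archive object name from a pgid and the begin/end
// times of one entry in ['info']['hit_set_history']['history']
std::string hit_set_archive_name(const std::string& pgid,
                                 const std::string& begin,
                                 const std::string& end) {
    std::ostringstream oss;
    oss << "hit_set_" << pgid << "_archive_" << begin << "_" << end;
    return oss.str();
}

int main() {
    // example values only; substitute the fields from your own pg query output
    std::cout << hit_set_archive_name("1.fa",
                                      "2018-10-12 13:00:00.000000",
                                      "2018-10-12 13:15:00.000000")
              << "\n";
    return 0;
}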

if you told the cluster to mark one of the hit set objects as lost, its possible the OSD cannot get that object and is hitting the assert(obc) near the end of PrimaryLogPG::hit_set_trim in the same source file referenced above. you can potentially verify this by a couple methods. i think if you set debug_osd at 20 that it should log a line saying something like "hit_set_trim removing hit_set_<pgid>_archive....." if that name matches one of the ones you marked lost, then is this almost certainly the cause. you can also do a find on the OSD directory, if you're using file store, and look for the right file name. something like 'find /var/lib/ceph/osd/ceph-<osd num>/current/<pgid>_head -name hit\*set\*<pgid>\*archive\*' should work. include the \ to escape the * so bash doesn't interpret it. if you're using bluestore, i think you can use the ceph-objectstore-tool while the osd is stopped to get a list of objects. you'll probably want to only look in the .ceph-internal namespace since the hit sets are stored in that namespace.
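If find isn't handy, here is a small standalone C++17 sketch of the same filestore check. The data-directory layout and the on-disk name mangling are assumptions (filestore escapes object names, which is why the find pattern above uses wildcards between the words), so it just matches loosely on the name fragments.

#include <filesystem>
#include <iostream>
#include <string>

namespace fs = std::filesystem;

// list files under <osd data dir>/current/<pgid>_head whose names look
// like hit set archive objects (same idea as the find command above)
int main(int argc, char** argv) {
    if (argc != 3) {
        std::cerr << "usage: " << argv[0] << " <osd-data-dir> <pgid>\n";
        return 1;
    }
    fs::path head = fs::path(argv[1]) / "current" / (std::string(argv[2]) + "_head");
    if (!fs::exists(head)) {
        std::cerr << head << " does not exist\n";
        return 1;
    }
    for (const auto& entry : fs::recursive_directory_iterator(head)) {
        const std::string name = entry.path().filename().string();
        // match the "hit", "set", <pgid>, and "archive" fragments loosely,
        // since filestore mangles the separators in object names
        if (name.find("hit") != std::string::npos &&
            name.find("set") != std::string::npos &&
            name.find(argv[2]) != std::string::npos &&
            name.find("archive") != std::string::npos) {
            std::cout << entry.path() << "\n";
        }
    }
    return 0;
}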

There are a couple of potential ways to get around this. What we did when we had the problem was run a custom build of ceph-osd in which we commented out the assert(obc); line in hit_set_trim. That build was run only long enough to get the cluster back online, then we flushed and evicted the entire cache, removed the cache, restarted using the normal Ceph builds, and recreated the cache.
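For illustration only, here is a self-contained sketch of that kind of change. The real fix is applied by hand inside PrimaryLogPG::hit_set_trim in the Ceph source (as Lawrence describes above); the types and the lookup function below are stand-ins, not the real OSD code, and exist only so the snippet compiles on its own. The important part is just the guard.

#include <iostream>
#include <memory>
#include <string>

// stand-ins for the OSD's object context machinery
struct ObjectContext { std::string name; };
using ObjectContextRef = std::shared_ptr<ObjectContext>;

// returns nullptr for an object that cannot be loaded, e.g. a hit set
// that was marked lost
ObjectContextRef get_object_context(const std::string& /*oid*/) {
    return nullptr;
}

void hit_set_trim_sketch(const std::string& oid) {
    ObjectContextRef obc = get_object_context(oid);
    // assert(obc);            // original behaviour: FAILED assert(obc)
    if (obc) {                 // guarded: only trim hit sets that can be loaded
        std::cout << "trimming " << obc->name << "\n";
    } else {
        std::cout << "hit set " << oid << " is already gone, skipping\n";
    }
}

int main() {
    hit_set_trim_sketch("hit_set_<pgid>_archive_<begin>_<end>");
    return 0;
}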

The other options are things I don't know for sure will work. If you're using filestore, you might be able to just copy another hit set to the file name of the missing hit set object. This should be pretty benign, since it's just going to remove the object a moment later anyway. Also, I'm not entirely sure how to work out which directory to put the object in if the OSD has done any directory splitting; maybe someone on the list will know how to do this. There might be a way to write the object in with ceph-objectstore-tool, but I couldn't find one in my testing on hammer.

The last option I can think of is that, if you can completely stop all traffic to the pools in question, it's possible the OSDs won't crash. hit_set_trim doesn't appear to get called if no client traffic is reaching the OSDs and the hit sets aren't being updated. If you can stop anything from using the pools in question and guarantee nothing will come in, it might be possible to keep the OSDs up long enough to flush everything from the cache tier, remove it, and recreate it. This option seems like a long shot, and I don't know for sure it'll work; it just seemed to me like the OSDs would stay up in a similar scenario on my hammer test cluster. It's possible things have changed in Luminous and hit_set_trim gets called more often. I also didn't test whether flushing and evicting the objects in the cache caused hit_set_trim to get called.

Hopefully that gives you some more info on what might be going on and ways around it. I'm not entirely sure why the assert(obc); is still in hit_set_trim. There was a little bit of discussion about removing it, since hitting it means the object it's trying to remove is already gone anyway, but I think that discussion only happened briefly in IRC and I guess nothing came of it because no one followed up.

Good luck, and hopefully you won't blame me if things get worse. :)
Mike

On Fri, Oct 12, 2018 at 7:34 AM Lawrence Smith <lawrence.smith@xxxxxxxxxxxxxxx> wrote:
Hi all,

we are running a Luminous 12.2.8 cluster with a 3-fold replicated cache pool with a min_size of 2. We recently encountered an "object unfound" error in one of the PGs in this pool. After marking this object lost, two of the acting OSDs crashed and were unable to start up again, with only the primary OSD staying up. Hoping the cluster might remap the copies of this PG, we marked the two crashed OSDs as out. Now the primary OSD of this PG has also gone down, again leaving only one active OSD, with the cluster reporting a degraded filesystem. All the affected OSDs are running filestore, while about half the cluster has already been upgraded to bluestore OSDs.

All three of the crashed OSDs fail to restart, reporting the following error during startup:

Oct 12 13:19:12 kaa-109 ceph-osd[166266]:      0> 2018-10-12 13:19:12.782652 7f1f2d79b700 -1 /var/tmp/portage/sys-cluster/ceph-12.2.8/work/ceph-12.2.8/src/osd/PrimaryLogPG.cc: In function 'void PrimaryLogPG::hit_set_trim(PrimaryLogPG::OpContextUPtr&, unsigned int)' thread 7f1f2d79b700 time 2018-10-12 13:19:12.779813
/var/tmp/portage/sys-cluster/ceph-12.2.8/work/ceph-12.2.8/src/osd/PrimaryLogPG.cc: 12985: FAILED assert(obc)

  ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x109) [0x562265bfda9c]
  2: (PrimaryLogPG::hit_set_trim(std::unique_ptr<PrimaryLogPG::OpContext, std::default_delete<PrimaryLogPG::OpContext> >&, unsigned int)+0x942) [0x5622657d6cea]
  3: (PrimaryLogPG::hit_set_persist()+0xa4b) [0x5622657e5fab]
  4: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x426a) [0x562265800c64]
  5: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xc1f) [0x5622657b94ed]
  6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x396) [0x562265655cf8]
  7: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x5a) [0x5622658c09a6]
  8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x1ab6) [0x562265657918]
  9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5cd) [0x562265c026f5]
  10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x562265c05e72]
  11: (()+0x751e) [0x7f1f4fd7f51e]
  12: (clone()+0x3f) [0x7f1f4ed7ef0f]

A search in the bug tracker revealed that a similar error was resolved for Jewel (http://tracker.ceph.com/issues/19223), but I don't know whether it is relevant here.

We are currently at a loss as to how to get these OSDs back up. Any suggestions on how to approach this would be very welcome. If any further information or additional context is needed, please let me know.

Thanks,

Lawrence


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
