Re: Luminous 12.2.2 OSDs with Bluestore crashing randomly

On Tue, Jan 30, 2018 at 5:49 AM Alessandro De Salvo <Alessandro.DeSalvo@xxxxxxxxxxxxx> wrote:
Hi,

Several times a day, different OSDs running Luminous 12.2.2 with
Bluestore crash with errors like this:


starting osd.2 at - osd_data /var/lib/ceph/osd/ceph-2
/var/lib/ceph/osd/ceph-2/journal
2018-01-30 13:45:28.440883 7f1e193cbd00 -1 osd.2 107082 log_to_monitors
{default=true}
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/PrimaryLogPG.cc:
In function 'void
PrimaryLogPG::hit_set_trim(PrimaryLogPG::OpContextUPtr&, unsigned int)'
thread 7f1dfd734700 time 2018-01-30 13:45:29.498133
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/PrimaryLogPG.cc:
12819: FAILED assert(obc)
  ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba)
luminous (stable)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x110) [0x556c6df51550]
  2:
(PrimaryLogPG::hit_set_trim(std::unique_ptr<PrimaryLogPG::OpContext,
std::default_delete<PrimaryLogPG::OpContext> >&, unsigned int)+0x3b6)
[0x556c6db5e106]
  3: (PrimaryLogPG::hit_set_persist()+0xb67) [0x556c6db61fb7]
  4: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x2389)
[0x556c6db78d39]
  5: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&,
ThreadPool::TPHandle&)+0xeba) [0x556c6db368aa]
  6: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f9)
[0x556c6d9c0899]
  7: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest>
const&)+0x57) [0x556c6dc38897]
  8: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0xfce) [0x556c6d9ee43e]
  9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x839)
[0x556c6df57069]
  10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x556c6df59000]
  11: (()+0x7e25) [0x7f1e16c17e25]
  12: (clone()+0x6d) [0x7f1e15d0b34d]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
2018-01-30 13:45:29.505317 7f1dfd734700 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/PrimaryLogPG.cc:
In function 'void
PrimaryLogPG::hit_set_trim(PrimaryLogPG::OpContextUPtr&, unsigned int)'
thread 7f1dfd734700 time 2018-01-30 13:45:29.498133
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/PrimaryLogPG.cc:
12819: FAILED assert(obc)

  ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba)
luminous (stable)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x110) [0x556c6df51550]
  2:
(PrimaryLogPG::hit_set_trim(std::unique_ptr<PrimaryLogPG::OpContext,
std::default_delete<PrimaryLogPG::OpContext> >&, unsigned int)+0x3b6)
[0x556c6db5e106]
  3: (PrimaryLogPG::hit_set_persist()+0xb67) [0x556c6db61fb7]
  4: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x2389)
[0x556c6db78d39]
  5: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&,
ThreadPool::TPHandle&)+0xeba) [0x556c6db368aa]
  6: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f9)
[0x556c6d9c0899]
  7: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest>
const&)+0x57) [0x556c6dc38897]
  8: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0xfce) [0x556c6d9ee43e]
  9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x839)
[0x556c6df57069]
  10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x556c6df59000]
  11: (()+0x7e25) [0x7f1e16c17e25]
  12: (clone()+0x6d) [0x7f1e15d0b34d]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.


Is this a known issue? How can we fix it?


Hmm, it looks a lot like http://tracker.ceph.com/issues/19185, but that wasn't supposed to be a problem in Luminous. When was this cluster created?

There was a thread in October titled "[ceph-users] [Jewel] Crash Osd with void Hit_set_trim" that had instructions for diagnosing and dealing with it in Jewel; you might investigate that.
-Greg
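
Since the crashes hit different OSDs several times a day, it may help to confirm that every crash is the same assertion before digging further. What follows is only a rough sketch, not an official Ceph tool: the names summarize_asserts and SAMPLE are hypothetical, the sample path is shortened for illustration, and the patterns assume the log wording shown in the trace above ("In function '...'" followed later by "<line>: FAILED assert(...)").

```python
import re
from collections import Counter

# Function signature inside "In function '...'"; it may be wrapped over lines.
FUNC_RE = re.compile(r"In function '([^']+)'")
# Source line number and failed condition, assumed to end their log line.
ASSERT_RE = re.compile(r"(\d+): FAILED assert\((.+)\)\s*$", re.MULTILINE)


def summarize_asserts(log_text):
    """Count assertion failures, keyed by (function, line, condition)."""
    counts = Counter()
    for func in FUNC_RE.finditer(log_text):
        # Pair each function header with the first assert line after it.
        failed = ASSERT_RE.search(log_text, func.end())
        if failed is None:
            continue
        # Collapse wrapped whitespace so identical signatures compare equal.
        name = " ".join(func.group(1).split())
        counts[(name, int(failed.group(1)), failed.group(2))] += 1
    return counts


# Shortened, illustrative excerpt modelled on the trace in this thread.
SAMPLE = """\
src/osd/PrimaryLogPG.cc:
In function 'void
PrimaryLogPG::hit_set_trim(PrimaryLogPG::OpContextUPtr&, unsigned int)'
thread 7f1dfd734700 time 2018-01-30 13:45:29.498133
src/osd/PrimaryLogPG.cc:
12819: FAILED assert(obc)
"""

if __name__ == "__main__":
    for (func, line, cond), n in summarize_asserts(SAMPLE).most_common():
        print(f"{n}x assert({cond}) at line {line} in {func}")
```

To use it on real logs, pass in the text of each OSD log file. Note that the OSD writes each assertion twice, as in the log above, so the counts are roughly double the number of crashes.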
