Re: Luminous 12.2.2 OSDs with Bluestore crashing randomly

Hi Greg,

many thanks. This is a new cluster, created initially with Luminous 12.2.0, so I'm not sure the Jewel instructions really apply to my case, and all the machines have NTP enabled, but I'll have a look; many thanks for the link. All machines are set to CET, although I'm running in Docker containers which use UTC internally, but they are all consistent.
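A minimal sketch of the clock sanity checks, assuming systemd hosts and a ceph CLI that can reach a monitor:

  # ask the monitors how far apart they believe their clocks are
  ceph time-sync-status

  # on each host (or inside each container), confirm NTP sync and the timezone
  timedatectl status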

At the moment, after marking 5 of the OSDs out, the cluster has resumed, and I'm now recreating those OSDs to be on the safe side.
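For reference, the drain-and-recreate sequence is roughly the following; osd.2 and /dev/sdb are placeholders, and in a containerized deployment the stop step is whatever stops the OSD container rather than systemd:

  # mark the OSD out and let backfill move the data off it
  ceph osd out 2

  # once the cluster is healthy again, stop the daemon and remove the OSD
  systemctl stop ceph-osd@2
  ceph osd purge 2 --yes-i-really-mean-it

  # recreate it as a Bluestore OSD on the same device
  ceph-volume lvm create --bluestore --data /dev/sdb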

Thanks,


    Alessandro


On 31/01/18 19:26, Gregory Farnum wrote:
On Tue, Jan 30, 2018 at 5:49 AM Alessandro De Salvo <Alessandro.DeSalvo@xxxxxxxxxxxxx> wrote:
Hi,

several times a day, different OSDs running Luminous 12.2.2 with
Bluestore are crashing with errors like this:


starting osd.2 at - osd_data /var/lib/ceph/osd/ceph-2 /var/lib/ceph/osd/ceph-2/journal
2018-01-30 13:45:28.440883 7f1e193cbd00 -1 osd.2 107082 log_to_monitors {default=true}
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/PrimaryLogPG.cc: In function 'void PrimaryLogPG::hit_set_trim(PrimaryLogPG::OpContextUPtr&, unsigned int)' thread 7f1dfd734700 time 2018-01-30 13:45:29.498133
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/PrimaryLogPG.cc: 12819: FAILED assert(obc)
  ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x556c6df51550]
  2: (PrimaryLogPG::hit_set_trim(std::unique_ptr<PrimaryLogPG::OpContext, std::default_delete<PrimaryLogPG::OpContext> >&, unsigned int)+0x3b6) [0x556c6db5e106]
  3: (PrimaryLogPG::hit_set_persist()+0xb67) [0x556c6db61fb7]
  4: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x2389) [0x556c6db78d39]
  5: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xeba) [0x556c6db368aa]
  6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f9) [0x556c6d9c0899]
  7: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x556c6dc38897]
  8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xfce) [0x556c6d9ee43e]
  9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x839) [0x556c6df57069]
  10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x556c6df59000]
  11: (()+0x7e25) [0x7f1e16c17e25]
  12: (clone()+0x6d) [0x7f1e15d0b34d]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2018-01-30 13:45:29.505317 7f1dfd734700 -1
[... the same FAILED assert(obc) message and backtrace, repeated in the log dump ...]


Is it a known issue? How can we fix that?


Hmm, it looks a lot like http://tracker.ceph.com/issues/19185, but that wasn't supposed to be a problem in Luminous. When was this cluster created?

There was a thread in October titled "[Jewel] Crash Osd with void Hit_set_trim" with instructions for diagnosing and dealing with this on Jewel; you might investigate that.
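For example, a first look at the cache tier's hit set state might be along these lines (pool name "cachepool" is a placeholder; --all includes the internal namespace the hit_set objects are normally stored in):

  # how the pool's hit sets are configured
  ceph osd pool get cachepool hit_set_count
  ceph osd pool get cachepool hit_set_period

  # list the archived hit_set objects that hit_set_trim operates on
  rados -p cachepool ls --all | grep hit_set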
-Greg

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
