Hi Greg,
many thanks. This is a new cluster, created initially with Luminous 12.2.0. I'm not sure the instructions for Jewel really apply to my case too, and all the machines have NTP enabled, but I'll have a look; many thanks for the link. All machines are set to CET, although I'm running in Docker containers which use UTC internally, but they are all consistent.
At the moment, after setting 5 of the OSDs out, the cluster resumed, and now I'm recreating those OSDs to be on the safe side.
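For reference, the out-and-recreate sequence described above looks roughly like the following sketch (the OSD id and device path are placeholders; the systemctl step assumes a systemd-managed OSD rather than a containerized one, so adapt it to your deployment):

```shell
# Mark the crashing OSD out so data rebalances away from it
ceph osd out 2

# Once backfill completes, stop the daemon and remove the OSD entirely
# (purge removes it from the CRUSH map and deletes its auth key)
systemctl stop ceph-osd@2
ceph osd purge 2 --yes-i-really-mean-it

# Recreate the OSD on the wiped device (/dev/sdb is an example path)
ceph-volume lvm zap /dev/sdb
ceph-volume lvm create --data /dev/sdb
```

`ceph osd purge` and `ceph-volume` are both available starting with Luminous, which matches the 12.2.x cluster discussed here.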
Thanks,
Alessandro
On 31/01/18 19:26, Gregory Farnum wrote:
On Tue, Jan 30, 2018 at 5:49 AM Alessandro De Salvo <Alessandro.DeSalvo@xxxxxxxxxxxxx> wrote:
Hi,
we have, several times a day, different OSDs running Luminous 12.2.2 and BlueStore crashing with errors like this:
starting osd.2 at - osd_data /var/lib/ceph/osd/ceph-2 /var/lib/ceph/osd/ceph-2/journal
2018-01-30 13:45:28.440883 7f1e193cbd00 -1 osd.2 107082 log_to_monitors {default=true}
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/PrimaryLogPG.cc: In function 'void PrimaryLogPG::hit_set_trim(PrimaryLogPG::OpContextUPtr&, unsigned int)' thread 7f1dfd734700 time 2018-01-30 13:45:29.498133
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/PrimaryLogPG.cc: 12819: FAILED assert(obc)
ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x556c6df51550]
2: (PrimaryLogPG::hit_set_trim(std::unique_ptr<PrimaryLogPG::OpContext, std::default_delete<PrimaryLogPG::OpContext> >&, unsigned int)+0x3b6) [0x556c6db5e106]
3: (PrimaryLogPG::hit_set_persist()+0xb67) [0x556c6db61fb7]
4: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x2389) [0x556c6db78d39]
5: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xeba) [0x556c6db368aa]
6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f9) [0x556c6d9c0899]
7: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x556c6dc38897]
8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xfce) [0x556c6d9ee43e]
9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x839) [0x556c6df57069]
10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x556c6df59000]
11: (()+0x7e25) [0x7f1e16c17e25]
12: (clone()+0x6d) [0x7f1e15d0b34d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2018-01-30 13:45:29.505317 7f1dfd734700 -1 [the same assertion failure and stack trace are then repeated in the log]
Is this a known issue? How can we fix it?
There was a thread in October titled "[Jewel] Crash Osd with void Hit_set_trim" that had instructions for diagnosing and dealing with it in Jewel; you might investigate that.
-Greg
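Since hit_set_trim only runs for pools with hit sets configured (i.e. cache tiering), one hedged starting point for diagnosis is to inspect the hit-set settings of the cache pool; "cachepool" below is a placeholder name, not taken from this thread:

```shell
# Cache-tier pools show their hit_set parameters in the detailed pool listing
ceph osd pool ls detail

# Inspect the hit-set configuration of the suspected cache pool
ceph osd pool get cachepool hit_set_type
ceph osd pool get cachepool hit_set_count
ceph osd pool get cachepool hit_set_period
```

The assert fires while trimming archived hit-set objects, so knowing how many hit sets the pool keeps (hit_set_count) and how often they roll over (hit_set_period) helps correlate the crash times with hit-set persistence activity.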