Hi,
Another OSD went down, and it's pretty scary how easy it is to break
the cluster. This time it seems to be something related to the journal.
/usr/bin/ceph-osd -f
--cluster ceph --id 6 --setuser ceph --setgroup ceph
starting osd.6 at :/0 osd_data /var/lib/ceph/osd/ceph-6
/var/lib/ceph/osd/ceph-6/journal
2017-12-05 13:19:03.473082 7f24515148c0 -1 osd.6 10538
log_to_monitors {default=true}
os/filestore/FileStore.cc: In function 'void
FileStore::_do_transaction(ObjectStore::Transaction&,
uint64_t, int, ThreadPool::TPHandle*)' thread 7f243d1a0700 time
2017-12-05 13:19:04.433036
os/filestore/FileStore.cc: 2930: FAILED assert(0 == "unexpected
error")
ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
1: (ceph::__ceph_assert_fail(char const*, char const*, int,
char const*)+0x80) [0x55569c1ff790]
2: (FileStore::_do_transaction(ObjectStore::Transaction&,
unsigned long, int, ThreadPool::TPHandle*)+0xb8e)
[0x55569be9d58e]
3:
(FileStore::_do_transactions(std::vector<ObjectStore::Transaction,
std::allocator<ObjectStore::Transaction> >&,
unsigned long, ThreadPool::TPHandle*)+0x3b) [0x55569bea3a1b]
4: (FileStore::_do_op(FileStore::OpSequencer*,
ThreadPool::TPHandle&)+0x39d) [0x55569bea3ded]
5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb1)
[0x55569c1f1961]
6: (ThreadPool::WorkThread::entry()+0x10) [0x55569c1f2a60]
7: (()+0x76ba) [0x7f24503e36ba]
8: (clone()+0x6d) [0x7f244e45b3dd]
NOTE: a copy of the executable, or `objdump -rdS
<executable>` is needed to interpret this.
[The same FAILED assert and identical backtrace repeat for thread
7f243d9a1700 (time 2017-12-05 13:19:04.435362), and both traces are
printed again in the dump of recent events; duplicates trimmed.]
*** Caught signal (Aborted) **
in thread 7f243d1a0700 thread_name:tp_fstore_op
I tried to boot it several times.
I zeroed the journal:
dd if=/dev/zero of=/dev/sde2
created a new journal:
ceph-osd --mkjournal -i 6
and flushed it. It's empty, so that's fine:
/usr/bin/ceph-osd -f
--cluster ceph --id 6 --setuser ceph --setgroup ceph
--flush-journal
and booted the OSD manually:
/usr/bin/ceph-osd -f
--cluster ceph --id 6 --setuser ceph --setgroup ceph
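For the record, I believe the canonical Jewel-era sequence for replacing
a FileStore journal flushes the old journal *before* switching to the
new partition, so I may have done the steps in the wrong order. What I
think it should look like (assuming osd.6 and /dev/sde2 as above):

```shell
# Stop the daemon so the journal is quiescent
systemctl stop ceph-osd@6
# Flush any pending journal entries down to the data disk
ceph-osd -i 6 --flush-journal
# Point the journal symlink at the new partition
ln -sf /dev/sde2 /var/lib/ceph/osd/ceph-6/journal
# Initialize the new journal, then restart the OSD
ceph-osd -i 6 --mkjournal
systemctl start ceph-osd@6
```

Zeroing the partition first, the way I did it, throws away whatever was
still sitting in the old journal, which could itself leave the store
inconsistent.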
Then it breaks. I pasted my whole configuration at
https://pastebin.com/QfrE71Dg.
I also changed the journal partition from sde4 to sde2 to see whether
that had anything to do with it; sde is an SSD, so I wanted to rule out
a bad block corrupting everything.
No luck: it breaks 100% of the time after a while. I'm desperate to
find out why it breaks. I should mention that this is another OSD that
failed earlier and that I recovered. A long SMART scan comes back
clean, and xfs_repair reports the disk is OK; everything seems correct,
but it keeps crashing.
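To get more detail, I'm thinking of re-running the OSD with FileStore
debugging turned up, so the failing operation shows in the log before
the assert fires (same command as before, plus debug overrides):

```shell
/usr/bin/ceph-osd -f --cluster ceph --id 6 \
    --setuser ceph --setgroup ceph \
    --debug-filestore 20 --debug-journal 20
```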
Any advice?
Can I run the OSD without a journal for a while, until all PGs are
backfilled to the other disks? I just increased the size and min_size
of the pools, and I need this disk in order to recover all the
information.
Best regards,