Re: Have I configured erasure coding wrong?

On 15/01/2018 7:46 AM, Christian Wuerdig wrote:
> Depends on what you mean with "your pool overloads"? What's your
> hardware setup (CPU, RAM, how many nodes, network etc.)? What can you
> see when you monitor the system resources with atop or the likes?
Single node, 8-core (16 threads with hyperthreading) CPU, 32 GB ECC RAM
(I'll be upgrading this once RAM prices drop).

I adjusted the OSD memory cache down from 3 GB to 1 GB, as the machine
was swapping like mad.
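
For reference, this is roughly the change I made in ceph.conf; I'm
assuming the BlueStore cache option is the right knob here (the value
is 1 GiB in bytes):

    [osd]
    # Cap BlueStore's per-OSD cache at 1 GiB. Luminous defaults are
    # 1 GiB for HDD and 3 GiB for SSD OSDs (bluestore_cache_size_hdd
    # / bluestore_cache_size_ssd) when this is left at 0.
    bluestore_cache_size = 1073741824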

The issue I have is that if I write data to the cluster at about 2 MB/s
(particularly with small files), I get messages about OSDs being down
and slow requests.
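
A minimal sketch of how I'm watching for these warnings:

    # follow cluster events as they happen
    ceph -w
    # snapshot of current warnings (slow requests, down OSDs)
    ceph health detail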

I'm fairly confident that the hardware is good: I've checked the LSI
SAS cards for interface errors and found nothing. SMART is not reporting
any issues. The slow requests can end up on any drive, so it's not the
older ones causing the issue.
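
For what it's worth, this is the kind of per-drive check I ran with
smartmontools (/dev/sdb is just an example device):

    # quick health verdict for one drive
    smartctl -H /dev/sdb
    # full attribute and error-log dump
    smartctl -a /dev/sdb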

As a result, I was wondering if I'd got the erasure-code settings wrong
and that was causing the issue.
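
In case it helps, the profile can be dumped like this (the profile name
below is a placeholder; the ls subcommand lists the real ones):

    # list the configured erasure-code profiles
    ceph osd erasure-code-profile ls
    # show the k/m, plugin and crush settings of one profile
    ceph osd erasure-code-profile get <profile-name>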

When the OSD goes down, I get the following in syslog:

Jan 15 06:33:17 pve ceph-osd[18628]: 2018-01-15 06:33:17.284797
7fbd74e71700 -1 abort: Corruption: block checksum mismatch
Jan 15 06:33:17 pve ceph-osd[18628]: *** Caught signal (Aborted) **
Jan 15 06:33:17 pve ceph-osd[18628]:  in thread 7fbd74e71700
thread_name:tp_osd_tp
Jan 15 06:33:17 pve ceph-osd[18628]:  ceph version 12.2.2
(215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)
Jan 15 06:33:17 pve ceph-osd[18628]:  1: (()+0xa16664) [0x56218ee28664]
Jan 15 06:33:17 pve ceph-osd[18628]:  2: (()+0x110c0) [0x7fbd8e2640c0]
Jan 15 06:33:17 pve ceph-osd[18628]:  3: (gsignal()+0xcf) [0x7fbd8d22bfcf]
Jan 15 06:33:17 pve ceph-osd[18628]:  4: (abort()+0x16a) [0x7fbd8d22d3fa]
Jan 15 06:33:17 pve ceph-osd[18628]:  5:
(RocksDBStore::get(std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > const&, char const*,
unsigned long, ceph::buffer::list*)+0x29f) [0x56218ed6695f]
Jan 15 06:33:17 pve ceph-osd[18628]:  6:
(BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x5ae)
[0x56218ecea2ae]
Jan 15 06:33:17 pve ceph-osd[18628]:  7:
(BlueStore::_txc_add_transaction(BlueStore::TransContext*,
ObjectStore::Transaction*)+0xa19) [0x56218ed21059]
Jan 15 06:33:17 pve ceph-osd[18628]:  8:
(BlueStore::queue_transactions(ObjectStore::Sequencer*,
std::vector<ObjectStore::Transaction,
std::allocator<ObjectStore::Transaction> >&,
boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x546)
[0x56218ed232a6]
Jan 15 06:33:17 pve ceph-osd[18628]:  9:
(PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction,
std::allocator<ObjectStore::Transaction> >&,
boost::intrusive_ptr<OpRequest>)+0x66) [0x56218ea45f46]
Jan 15 06:33:17 pve ceph-osd[18628]:  10:
(ReplicatedBackend::submit_transaction(hobject_t const&,
object_stat_sum_t const&, eversion_t const&,
std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&,
eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t,
std::allocator<pg_log_entry_t> > const&,
boost::optional<pg_hit_set_history_t>&, Context*, Context*, Context*,
unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0xcd1)
[0x56218eb78491]
Jan 15 06:33:17 pve ceph-osd[18628]:  11:
(PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*,
PrimaryLogPG::OpContext*)+0x9fa) [0x56218e9e164a]
Jan 15 06:33:17 pve ceph-osd[18628]:  12:
(PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x134d)
[0x56218ea2bb3d]
Jan 15 06:33:17 pve ceph-osd[18628]:  13:
(PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x2ef5)
[0x56218ea2f375]
Jan 15 06:33:17 pve ceph-osd[18628]:  14:
(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&,
ThreadPool::TPHandle&)+0xec6) [0x56218e9e9436]
Jan 15 06:33:17 pve ceph-osd[18628]:  15:
(OSD::dequeue_op(boost::intrusive_ptr<PG>,
boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3ab)
[0x56218e8669eb]
Jan 15 06:33:17 pve ceph-osd[18628]:  16:
(PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest>
const&)+0x5a) [0x56218eb04eba]
Jan 15 06:33:17 pve ceph-osd[18628]:  17:
(OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x103d) [0x56218e88df4d]
Jan 15 06:33:17 pve ceph-osd[18628]:  18:
(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x8ef)
[0x56218ee7506f]
Jan 15 06:33:17 pve ceph-osd[18628]:  19:
(ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x56218ee78370]
Jan 15 06:33:17 pve ceph-osd[18628]:  20: (()+0x7494) [0x7fbd8e25a494]
Jan 15 06:33:17 pve ceph-osd[18628]:  21: (clone()+0x3f) [0x7fbd8d2e1aff]
Jan 15 06:33:17 pve ceph-osd[18628]:  NOTE: a copy of the executable, or
`objdump -rdS <executable>` is needed to interpret this.
[the same backtrace is dumped twice more in the log; snipped]
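
Given the abort is a RocksDB "Corruption: block checksum mismatch", my
next step is to fsck the OSD's store offline; a sketch, assuming the
crashed OSD is id 3 (substitute the real one):

    # stop the affected OSD so its store can be opened offline
    systemctl stop ceph-osd@3
    # check the BlueStore/RocksDB metadata for corruption
    ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-3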
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



