On 15/01/2018 7:46 AM, Christian Wuerdig wrote:
> Depends on what you mean with "your pool overloads"? What's your
> hardware setup (CPU, RAM, how many nodes, network etc.)? What can you
> see when you monitor the system resources with atop or the likes?

Single node, 8-core (16 threads with hyperthreading) CPU, 32G ECC RAM (I will be upgrading this once RAM prices drop).

I adjusted the OSD memory cache down from 3G to 1G as the machine was swapping like mad.

The issue I have is that if I write data to the cluster at about 2Mb a second (particularly with small files), I get messages about OSDs being down and slow requests.

I'm fairly confident that the hardware is good: I've checked the LSI SAS cards for interface errors and found nothing, and SMART is not reporting any issues. The slow requests seem to end up on any drive, so it's not the older ones causing an issue.

As a result, I was wondering whether I got the erasure code settings wrong and that is causing the issue.

When an OSD goes down I get the following in syslog:

Jan 15 06:33:17 pve ceph-osd[18628]: 2018-01-15 06:33:17.284797 7fbd74e71700 -1 abort: Corruption: block checksum mismatch
Jan 15 06:33:17 pve ceph-osd[18628]: *** Caught signal (Aborted) **
Jan 15 06:33:17 pve ceph-osd[18628]: in thread 7fbd74e71700 thread_name:tp_osd_tp
Jan 15 06:33:17 pve ceph-osd[18628]: ceph version 12.2.2 (215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)
Jan 15 06:33:17 pve ceph-osd[18628]: 1: (()+0xa16664) [0x56218ee28664]
Jan 15 06:33:17 pve ceph-osd[18628]: 2: (()+0x110c0) [0x7fbd8e2640c0]
Jan 15 06:33:17 pve ceph-osd[18628]: 3: (gsignal()+0xcf) [0x7fbd8d22bfcf]
Jan 15 06:33:17 pve ceph-osd[18628]: 4: (abort()+0x16a) [0x7fbd8d22d3fa]
Jan 15 06:33:17 pve ceph-osd[18628]: 5: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char const*, unsigned long, ceph::buffer::list*)+0x29f) [0x56218ed6695f]
Jan 15 06:33:17 pve ceph-osd[18628]: 6: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x5ae) [0x56218ecea2ae]
Jan 15 06:33:17 pve ceph-osd[18628]: 7: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0xa19) [0x56218ed21059]
Jan 15 06:33:17 pve ceph-osd[18628]: 8: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x546) [0x56218ed232a6]
Jan 15 06:33:17 pve ceph-osd[18628]: 9: (PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x66) [0x56218ea45f46]
Jan 15 06:33:17 pve ceph-osd[18628]: 10: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, boost::optional<pg_hit_set_history_t>&, Context*, Context*, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0xcd1) [0x56218eb78491]
Jan 15 06:33:17 pve ceph-osd[18628]: 11: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0x9fa) [0x56218e9e164a]
Jan 15 06:33:17 pve ceph-osd[18628]: 12: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x134d) [0x56218ea2bb3d]
Jan 15 06:33:17 pve ceph-osd[18628]: 13: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x2ef5) [0x56218ea2f375]
Jan 15 06:33:17 pve ceph-osd[18628]: 14: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xec6) [0x56218e9e9436]
Jan 15 06:33:17 pve ceph-osd[18628]: 15: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3ab) [0x56218e8669eb]
Jan 15 06:33:17 pve ceph-osd[18628]: 16: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x5a) [0x56218eb04eba]
Jan 15 06:33:17 pve ceph-osd[18628]: 17: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x103d) [0x56218e88df4d]
Jan 15 06:33:17 pve ceph-osd[18628]: 18: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x8ef) [0x56218ee7506f]
Jan 15 06:33:17 pve ceph-osd[18628]: 19: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x56218ee78370]
Jan 15 06:33:17 pve ceph-osd[18628]: 20: (()+0x7494) [0x7fbd8e25a494]
Jan 15 06:33:17 pve ceph-osd[18628]: 21: (clone()+0x3f) [0x7fbd8d2e1aff]
Jan 15 06:33:17 pve ceph-osd[18628]: 2018-01-15 06:33:17.290018 7fbd74e71700 -1 *** Caught signal (Aborted) **
Jan 15 06:33:17 pve ceph-osd[18628]: in thread 7fbd74e71700 thread_name:tp_osd_tp
Jan 15 06:33:17 pve ceph-osd[18628]: ceph version 12.2.2 (215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)
Jan 15 06:33:17 pve ceph-osd[18628]: 1: (()+0xa16664) [0x56218ee28664]
Jan 15 06:33:17 pve ceph-osd[18628]: 2: (()+0x110c0) [0x7fbd8e2640c0]
Jan 15 06:33:17 pve ceph-osd[18628]: 3: (gsignal()+0xcf) [0x7fbd8d22bfcf]
Jan 15 06:33:17 pve ceph-osd[18628]: 4: (abort()+0x16a) [0x7fbd8d22d3fa]
Jan 15 06:33:17 pve ceph-osd[18628]: 5: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char const*, unsigned long, ceph::buffer::list*)+0x29f) [0x56218ed6695f]
Jan 15 06:33:17 pve ceph-osd[18628]: 6: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x5ae) [0x56218ecea2ae]
Jan 15 06:33:17 pve ceph-osd[18628]: 7: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0xa19) [0x56218ed21059]
Jan 15 06:33:17 pve ceph-osd[18628]: 8: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x546) [0x56218ed232a6]
Jan 15 06:33:17 pve ceph-osd[18628]: 9: (PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x66) [0x56218ea45f46]
Jan 15 06:33:17 pve ceph-osd[18628]: 10: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, boost::optional<pg_hit_set_history_t>&, Context*, Context*, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0xcd1) [0x56218eb78491]
Jan 15 06:33:17 pve ceph-osd[18628]: 11: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0x9fa) [0x56218e9e164a]
Jan 15 06:33:17 pve ceph-osd[18628]: 12: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x134d) [0x56218ea2bb3d]
Jan 15 06:33:17 pve ceph-osd[18628]: 13: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x2ef5) [0x56218ea2f375]
Jan 15 06:33:17 pve ceph-osd[18628]: 14: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xec6) [0x56218e9e9436]
Jan 15 06:33:17 pve ceph-osd[18628]: 15: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3ab) [0x56218e8669eb]
Jan 15 06:33:17 pve ceph-osd[18628]: 16: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x5a) [0x56218eb04eba]
Jan 15 06:33:17 pve ceph-osd[18628]: 17: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x103d) [0x56218e88df4d]
Jan 15 06:33:17 pve ceph-osd[18628]: 18: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x8ef) [0x56218ee7506f]
Jan 15 06:33:17 pve ceph-osd[18628]: 19: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x56218ee78370]
Jan 15 06:33:17 pve ceph-osd[18628]: 20: (()+0x7494) [0x7fbd8e25a494]
Jan 15 06:33:17 pve ceph-osd[18628]: 21: (clone()+0x3f) [0x7fbd8d2e1aff]
Jan 15 06:33:17 pve ceph-osd[18628]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Jan 15 06:33:17 pve ceph-osd[18628]: -2> 2018-01-15 06:33:17.284797 7fbd74e71700 -1 abort: Corruption: block checksum mismatch
Jan 15 06:33:17 pve ceph-osd[18628]: 0> 2018-01-15 06:33:17.290018 7fbd74e71700 -1 *** Caught signal (Aborted) **
Jan 15 06:33:17 pve ceph-osd[18628]: in thread 7fbd74e71700 thread_name:tp_osd_tp
Jan 15 06:33:17 pve ceph-osd[18628]: ceph version 12.2.2 (215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)
Jan 15 06:33:17 pve ceph-osd[18628]: 1: (()+0xa16664) [0x56218ee28664]
Jan 15 06:33:17 pve ceph-osd[18628]: 2: (()+0x110c0) [0x7fbd8e2640c0]
Jan 15 06:33:17 pve ceph-osd[18628]: 3: (gsignal()+0xcf) [0x7fbd8d22bfcf]
Jan 15 06:33:17 pve ceph-osd[18628]: 4: (abort()+0x16a) [0x7fbd8d22d3fa]
Jan 15 06:33:17 pve ceph-osd[18628]: 5: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char const*, unsigned long, ceph::buffer::list*)+0x29f) [0x56218ed6695f]
Jan 15 06:33:17 pve ceph-osd[18628]: 6: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x5ae) [0x56218ecea2ae]
Jan 15 06:33:17 pve ceph-osd[18628]: 7: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0xa19) [0x56218ed21059]
Jan 15 06:33:17 pve ceph-osd[18628]: 8: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x546) [0x56218ed232a6]
Jan 15 06:33:17 pve ceph-osd[18628]: 9: (PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x66) [0x56218ea45f46]
Jan 15 06:33:17 pve ceph-osd[18628]: 10: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, boost::optional<pg_hit_set_history_t>&, Context*, Context*, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0xcd1) [0x56218eb78491]
Jan 15 06:33:17 pve ceph-osd[18628]: 11: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0x9fa) [0x56218e9e164a]
Jan 15 06:33:17 pve ceph-osd[18628]: 12: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x134d) [0x56218ea2bb3d]
Jan 15 06:33:17 pve ceph-osd[18628]: 13: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x2ef5) [0x56218ea2f375]
Jan 15 06:33:17 pve ceph-osd[18628]: 14: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xec6) [0x56218e9e9436]
Jan 15 06:33:17 pve ceph-osd[18628]: 15: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3ab) [0x56218e8669eb]
Jan 15 06:33:17 pve ceph-osd[18628]: 16: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x5a) [0x56218eb04eba]
Jan 15 06:33:17 pve ceph-osd[18628]: 17: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x103d) [0x56218e88df4d]
Jan 15 06:33:17 pve ceph-osd[18628]: 18: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x8ef) [0x56218ee7506f]
Jan 15 06:33:17 pve ceph-osd[18628]: 19: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x56218ee78370]
Jan 15 06:33:17 pve ceph-osd[18628]: 20: (()+0x7494) [0x7fbd8e25a494]
Jan 15 06:33:17 pve ceph-osd[18628]: 21: (clone()+0x3f) [0x7fbd8d2e1aff]
Jan 15 06:33:17 pve ceph-osd[18628]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
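
The syslog excerpt above repeats the same backtrace several times, which makes it hard to see how many distinct failures there are. A minimal sketch, assuming the excerpt is available with one syslog entry per line, that collapses it to the abort reasons and the unique stack frames. The names `summarize`, `SYSLOG_RE` and `FRAME_RE` are hypothetical helpers for illustration, not part of Ceph.

```python
import re
from collections import Counter

# Syslog entry written by ceph-osd: "<Mon> <day> <time> <host> ceph-osd[<pid>]: <message>"
SYSLOG_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\s"
    r"ceph-osd\[(?P<pid>\d+)\]:\s(?P<msg>.*)$"
)
# Numbered backtrace frame, e.g. "4: (abort()+0x16a) [0x7fbd8d22d3fa]"
FRAME_RE = re.compile(r"^\s*(?P<num>\d+):\s\((?P<sym>.*)\)\s\[0x[0-9a-f]+\]$")


def summarize(lines):
    """Return (Counter of abort reasons, unique frames in first-seen order)."""
    reasons = Counter()
    frames = []
    seen = set()
    for line in lines:
        entry = SYSLOG_RE.match(line.rstrip("\n"))
        if entry is None:
            continue  # not a ceph-osd syslog entry
        msg = entry.group("msg")
        if "abort:" in msg:
            # Everything after "abort:" is the reason text
            reasons[msg.split("abort:", 1)[1].strip()] += 1
            continue
        frame = FRAME_RE.match(msg)
        if frame:
            key = (int(frame.group("num")), frame.group("sym"))
            if key not in seen:  # drop frames repeated by later dumps
                seen.add(key)
                frames.append(key)
    return reasons, frames
```

It could be called as `summarize(open("osd-crash.log"))`, where the file name is only a placeholder for wherever the excerpt was saved.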