12.2.5 on Proxmox cluster.
6 nodes, about 50 OSDs, bluestore and cache tiering on an EC pool. Mostly spinners with an SSD OSD drive and an SSD WAL DB drive on each node. PM863 SSDs with ~75%+ endurance remaning.
Has been running relatively okay besides some spinner failures until I checked today and found 5-6 OSDs flapping. I remember reading about some issues with 12.2.5, so I upgraded one node to 12.2.10 but no change.
Seeing:
2018-12-20 00:27:42.754485 7f578f68a700 -1 bluestore(/var/lib/ceph/osd/ceph-33) _txc_add_transaction error (2) No such file or directory not handled on operation 30 (op 0, counting from 0)
-3> 2018-12-20 00:27:42.754503 7f578f68a700 -1 bluestore(/var/lib/ceph/osd/ceph-33) ENOENT on clone suggests osd bug
in the logs for each of them.
I've found several bugs in the tracker related to these, but nothing with a resolution I could apply besides upgrading, which doesn't appear to have helped.
Suggestions welcome.
Snippet of the last few lines:
rbd_data.17.afb3726b8b4567.00000000000db0d8:head (bitwise) local-lis/les=10138/10139 n=163802 ec=408/408 lis/c 15778/4121 les/c/f 15781/4127/0 59341/59343/15778) [28,27,17,19,32,33,14,22,7,9,25,23]/[28,27,17,19,2147483647,23,2147483647,13,2147483647,9,32,33]p28(0) r=-1 lpr=59343 pi=[4121,59343)/12 crt=15786'3211273 lcod 0'0 remapped NOTIFY mbc={}] enter Started/ReplicaActive
-6> 2018-12-20 00:27:42.753106 7f578f68a700 5 osd.33 pg_epoch: 59344 pg[18.10s5( v 15786'3211273 (15781'3201224,15786'3211273] lb 18:0af1fd67:::rbd_data.17.afb3726b8b4567.00000000000db0d8:head (bitwise) local-lis/les=10138/10139 n=163802 ec=408/408 lis/c 15778/4121 les/c/f 15781/4127/0 59341/59343/15778) [28,27,17,19,32,33,14,22,7,9,25,23]/[28,27,17,19,2147483647,23,2147483647,13,2147483647,9,32,33]p28(0) r=-1 lpr=59343 pi=[4121,59343)/12 crt=15786'3211273 lcod 0'0 remapped NOTIFY mbc={}] enter Started/ReplicaActive/RepNotRecovering
-5> 2018-12-20 00:27:42.753186 7f578f68a700 5 write_log_and_missing with: dirty_to: 0'0, dirty_from: 15786'3211274, writeout_from: 4294967295'18446744073709551615, trimmed: , trimmed_dups: , clear_divergent_priors: 0
-4> 2018-12-20 00:27:42.754485 7f578f68a700 -1 bluestore(/var/lib/ceph/osd/ceph-33) _txc_add_transaction error (2) No such file or directory not handled on operation 30 (op 0, counting from 0)
-3> 2018-12-20 00:27:42.754503 7f578f68a700 -1 bluestore(/var/lib/ceph/osd/ceph-33) ENOENT on clone suggests osd bug
-2> 2018-12-20 00:27:42.754507 7f578f68a700 0 bluestore(/var/lib/ceph/osd/ceph-33) transaction dump:
{
"ops": [
{
"op_num": 0,
"op_name": "clonerange2",
"collection": "18.10s5_head",
"src_oid": "5#18:0a10e4e8:::rbd_data.17.afb3726b8b4567.00000000002ad142:head#310014",
"dst_oid": "5#18:0a10e4e8:::rbd_data.17.afb3726b8b4567.00000000002ad142:head#",
"src_offset": 512000,
"len": 8192,
"dst_offset": 512000
},
{
"op_num": 1,
"op_name": "remove",
"collection": "18.10s5_head",
"oid": "5#18:0a10e4e8:::rbd_data.17.afb3726b8b4567.00000000002ad142:head#310014"
},
{
"op_num": 2,
"op_name": "setattrs",
"collection": "18.10s5_head",
"oid": "5#18:0a10e4e8:::rbd_data.17.afb3726b8b4567.00000000002ad142:head#",
"attr_lens": {
"_": 298,
"hinfo_key": 18,
"snapset": 35
}
},
{
"op_num": 3,
"op_name": "nop"
},
{
"op_num": 4,
"op_name": "op_omap_rmkeyrange",
"collection": "18.10s5_head",
"oid": "5#18:08000000::::head#",
"first": "0000015786.00000000000003211274",
"last": "4294967295.18446744073709551615"
},
{
"op_num": 5,
"op_name": "omap_setkeys",
"collection": "18.10s5_head",
"oid": "5#18:08000000::::head#",
"attr_lens": {
"_biginfo": 1646,
"_epoch": 4,
"_info": 1014,
"can_rollback_to": 12,
"rollback_info_trimmed_to": 12
}
}
]
}
-1> 2018-12-20 00:27:42.757231 7f5795696700 1 -- 10.10.145.105:6801/29516 --> 10.10.145.100:6818/5468876 -- pg_info((query:59344 sent:59344 18.1es9( v 15786'3322304 (15786'3312225,15786'3322304] local-lis/les=59343/59344 n=163868 ec=408/408 lis/c 59343/3966 les/c/f 59344/3980/0 59341/59343/15773) 9->0)=(empty) epoch 59344) v5 -- 0x55c2f7e7be00 con 0
0> 2018-12-20 00:27:42.758519 7f578f68a700 -1 /home/builder/source/ceph-12.2.10/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7f578f68a700 time 2018-12-20 00:27:42.754596
/home/builder/source/ceph-12.2.10/src/os/bluestore/BlueStore.cc: 9621: FAILED assert(0 == "unexpected error")
ceph version 12.2.10 (fc2b1783e3727b66315cc667af9d663d30fe7ed4) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55c2d9a2d9b2]
2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x15fa) [0x55c2d98d6c2a]
3: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x546) [0x55c2d98d8296]
4: (ObjectStore::queue_transaction(ObjectStore::Sequencer*, ObjectStore::Transaction&&, Context*, Context*, Context*, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x14f) [0x55c2d945e0ef]
5: (OSD::dispatch_context_transaction(PG::RecoveryCtx&, PG*, ThreadPool::TPHandle*)+0x6c) [0x55c2d93df8ec]
6: (OSD::process_peering_events(std::__cxx11::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x442) [0x55c2d940de22]
7: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x2c) [0x55c2d947e66c]
8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xeb8) [0x55c2d9a34b58]
9: (ThreadPool::WorkThread::entry()+0x10) [0x55c2d9a35cf0]
10: (()+0x7494) [0x7f57a7275494]
11: (clone()+0x3f) [0x7f57a62fcacf]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
1/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 1 reserver
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 kinetic
1/ 5 fuse
1/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.33.log
--- end dump of recent events ---
2018-12-20 00:27:42.795716 7f578f68a700 -1 *** Caught signal (Aborted) **
in thread 7f578f68a700 thread_name:tp_peering
ceph version 12.2.10 (fc2b1783e3727b66315cc667af9d663d30fe7ed4) luminous (stable)
1: (()+0xa56bd4) [0x55c2d99e5bd4]
2: (()+0x110c0) [0x7f57a727f0c0]
3: (gsignal()+0xcf) [0x7f57a6246fff]
4: (abort()+0x16a) [0x7f57a624842a]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x55c2d9a2db3e]
6: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x15fa) [0x55c2d98d6c2a]
7: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x546) [0x55c2d98d8296]
8: (ObjectStore::queue_transaction(ObjectStore::Sequencer*, ObjectStore::Transaction&&, Context*, Context*, Context*, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x14f) [0x55c2d945e0ef]
9: (OSD::dispatch_context_transaction(PG::RecoveryCtx&, PG*, ThreadPool::TPHandle*)+0x6c) [0x55c2d93df8ec]
10: (OSD::process_peering_events(std::__cxx11::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x442) [0x55c2d940de22]
11: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x2c) [0x55c2d947e66c]
12: (ThreadPool::worker(ThreadPool::WorkThread*)+0xeb8) [0x55c2d9a34b58]
13: (ThreadPool::WorkThread::entry()+0x10) [0x55c2d9a35cf0]
14: (()+0x7494) [0x7f57a7275494]
15: (clone()+0x3f) [0x7f57a62fcacf]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events ---
-13> 2018-12-20 00:27:42.764248 7f57a3ed5700 5 -- 10.10.145.105:6802/29516 >> 10.10.145.102:0/436138 conn(0x55c2f8090000 :6802 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=54007 cs=1 l=1). rx osd.8 seq 2 0x55c2f7e7be00 osd_ping(ping e59343 stamp 2018-12-20 00:27:42.752504) v4
-12> 2018-12-20 00:27:42.764269 7f57a3ed5700 1 -- 10.10.145.105:6802/29516 <== osd.8 10.10.145.102:0/436138 2 ==== osd_ping(ping e59343 stamp 2018-12-20 00:27:42.752504) v4 ==== 2004+0+0 (2799076138 0 0) 0x55c2f7e7be00 con 0x55c2f8090000
-11> 2018-12-20 00:27:42.764282 7f57a3ed5700 1 -- 10.10.145.105:6802/29516 --> 10.10.145.102:0/436138 -- osd_ping(ping_reply e59344 stamp 2018-12-20 00:27:42.752504) v4 -- 0x55c2f7f12200 con 0
-10> 2018-12-20 00:27:42.764391 7f57a2ed3700 5 -- 10.10.145.105:6803/29516 >> 10.10.145.102:0/436138 conn(0x55c2f8091800 :6803 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=54004 cs=1 l=1). rx osd.8 seq 2 0x55c2f76c8000 osd_ping(ping e59343 stamp 2018-12-20 00:27:42.752504) v4
-9> 2018-12-20 00:27:42.764413 7f57a2ed3700 1 -- 10.10.145.105:6803/29516 <== osd.8 10.10.145.102:0/436138 2 ==== osd_ping(ping e59343 stamp 2018-12-20 00:27:42.752504) v4 ==== 2004+0+0 (2799076138 0 0) 0x55c2f76c8000 con 0x55c2f8091800
-8> 2018-12-20 00:27:42.764430 7f57a2ed3700 1 -- 10.10.145.105:6803/29516 --> 10.10.145.102:0/436138 -- osd_ping(ping_reply e59344 stamp 2018-12-20 00:27:42.752504) v4 -- 0x55c2f7df0800 con 0
-7> 2018-12-20 00:27:42.773561 7f57a2ed3700 5 -- 10.10.145.105:6802/29516 >> 10.10.145.104:0/3815033 conn(0x55c2f80eb000 :6802 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=50456 cs=1 l=1). rx osd.24 seq 1 0x55c2f7f97800 osd_ping(ping e59343 stamp 2018-12-20 00:27:42.760326) v4
-6> 2018-12-20 00:27:42.773580 7f57a2ed3700 1 -- 10.10.145.105:6802/29516 <== osd.24 10.10.145.104:0/3815033 1 ==== osd_ping(ping e59343 stamp 2018-12-20 00:27:42.760326) v4 ==== 2004+0+0 (1752418694 0 0) 0x55c2f7f97800 con 0x55c2f80eb000
-5> 2018-12-20 00:27:42.773591 7f57a2ed3700 1 -- 10.10.145.105:6802/29516 --> 10.10.145.104:0/3815033 -- osd_ping(ping_reply e59344 stamp 2018-12-20 00:27:42.760326) v4 -- 0x55c2f7df0800 con 0
-4> 2018-12-20 00:27:42.773632 7f57a2ed3700 1 -- 10.10.145.105:6801/29516 --> 10.10.145.104:6807/23815033 -- osd_map(59344..59344 src has 3756..59344) v3 -- 0x55c2f892a080 con 0
-3> 2018-12-20 00:27:42.773737 7f57a36d4700 5 -- 10.10.145.105:6803/29516 >> 10.10.145.104:0/3815033 conn(0x55c2f80ec800 :6803 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=50470 cs=1 l=1). rx osd.24 seq 1 0x55c2f7fc6c00 osd_ping(ping e59343 stamp 2018-12-20 00:27:42.760326) v4
-2> 2018-12-20 00:27:42.773757 7f57a36d4700 1 -- 10.10.145.105:6803/29516 <== osd.24 10.10.145.104:0/3815033 1 ==== osd_ping(ping e59343 stamp 2018-12-20 00:27:42.760326) v4 ==== 2004+0+0 (1752418694 0 0) 0x55c2f7fc6c00 con 0x55c2f80ec800
-1> 2018-12-20 00:27:42.773772 7f57a36d4700 1 -- 10.10.145.105:6803/29516 --> 10.10.145.104:0/3815033 -- osd_ping(ping_reply e59344 stamp 2018-12-20 00:27:42.760326) v4 -- 0x55c2f822d600 con 0
0> 2018-12-20 00:27:42.795716 7f578f68a700 -1 *** Caught signal (Aborted) **
in thread 7f578f68a700 thread_name:tp_peering
ceph version 12.2.10 (fc2b1783e3727b66315cc667af9d663d30fe7ed4) luminous (stable)
1: (()+0xa56bd4) [0x55c2d99e5bd4]
2: (()+0x110c0) [0x7f57a727f0c0]
3: (gsignal()+0xcf) [0x7f57a6246fff]
4: (abort()+0x16a) [0x7f57a624842a]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x55c2d9a2db3e]
6: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x15fa) [0x55c2d98d6c2a]
7: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x546) [0x55c2d98d8296]
8: (ObjectStore::queue_transaction(ObjectStore::Sequencer*, ObjectStore::Transaction&&, Context*, Context*, Context*, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x14f) [0x55c2d945e0ef]
9: (OSD::dispatch_context_transaction(PG::RecoveryCtx&, PG*, ThreadPool::TPHandle*)+0x6c) [0x55c2d93df8ec]
10: (OSD::process_peering_events(std::__cxx11::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x442) [0x55c2d940de22]
11: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x2c) [0x55c2d947e66c]
12: (ThreadPool::worker(ThreadPool::WorkThread*)+0xeb8) [0x55c2d9a34b58]
13: (ThreadPool::WorkThread::entry()+0x10) [0x55c2d9a35cf0]
14: (()+0x7494) [0x7f57a7275494]
15: (clone()+0x3f) [0x7f57a62fcacf]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
1/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 1 reserver
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 kinetic
1/ 5 fuse
1/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.33.log
--- end dump of recent events ---
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com