12.2.5 multiple OSDs crashing

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



12.2.5 on Proxmox cluster. 

6 nodes, about 50 OSDs, bluestore and cache tiering on an EC pool. Mostly spinners with an SSD OSD drive and an SSD WAL DB drive on each node. PM863 SSDs with ~75%+ endurance remaning.

Has been running relatively okay besides some spinner failures until I checked today and found 5-6 OSDs flapping. I remember reading about some issues with 12.2.5, so I upgraded one node to 12.2.10 but no change.

Seeing:
2018-12-20 00:27:42.754485 7f578f68a700 -1 bluestore(/var/lib/ceph/osd/ceph-33) _txc_add_transaction error (2) No such file or directory not handled on operation 30 (op 0, counting from 0)
    -3> 2018-12-20 00:27:42.754503 7f578f68a700 -1 bluestore(/var/lib/ceph/osd/ceph-33) ENOENT on clone suggests osd bug

in the logs for each of them. 

I've found several bugs in the tracker related to these, but nothing with a resolution I could apply besides upgrading, which doesn't appear to have helped.

Suggestions welcome.

 
Snippet of the last few lines:

rbd_data.17.afb3726b8b4567.00000000000db0d8:head (bitwise) local-lis/les=10138/10139 n=163802 ec=408/408 lis/c 15778/4121 les/c/f 15781/4127/0 59341/59343/15778) [28,27,17,19,32,33,14,22,7,9,25,23]/[28,27,17,19,2147483647,23,2147483647,13,2147483647,9,32,33]p28(0) r=-1 lpr=59343 pi=[4121,59343)/12 crt=15786'3211273 lcod 0'0 remapped NOTIFY mbc={}] enter Started/ReplicaActive
    -6> 2018-12-20 00:27:42.753106 7f578f68a700  5 osd.33 pg_epoch: 59344 pg[18.10s5( v 15786'3211273 (15781'3201224,15786'3211273] lb 18:0af1fd67:::rbd_data.17.afb3726b8b4567.00000000000db0d8:head (bitwise) local-lis/les=10138/10139 n=163802 ec=408/408 lis/c 15778/4121 les/c/f 15781/4127/0 59341/59343/15778) [28,27,17,19,32,33,14,22,7,9,25,23]/[28,27,17,19,2147483647,23,2147483647,13,2147483647,9,32,33]p28(0) r=-1 lpr=59343 pi=[4121,59343)/12 crt=15786'3211273 lcod 0'0 remapped NOTIFY mbc={}] enter Started/ReplicaActive/RepNotRecovering
    -5> 2018-12-20 00:27:42.753186 7f578f68a700  5 write_log_and_missing with: dirty_to: 0'0, dirty_from: 15786'3211274, writeout_from: 4294967295'18446744073709551615, trimmed: , trimmed_dups: , clear_divergent_priors: 0
    -4> 2018-12-20 00:27:42.754485 7f578f68a700 -1 bluestore(/var/lib/ceph/osd/ceph-33) _txc_add_transaction error (2) No such file or directory not handled on operation 30 (op 0, counting from 0)
    -3> 2018-12-20 00:27:42.754503 7f578f68a700 -1 bluestore(/var/lib/ceph/osd/ceph-33) ENOENT on clone suggests osd bug
    -2> 2018-12-20 00:27:42.754507 7f578f68a700  0 bluestore(/var/lib/ceph/osd/ceph-33)  transaction dump:
{
    "ops": [
        {
            "op_num": 0,
            "op_name": "clonerange2",
            "collection": "18.10s5_head",
            "src_oid": "5#18:0a10e4e8:::rbd_data.17.afb3726b8b4567.00000000002ad142:head#310014",
            "dst_oid": "5#18:0a10e4e8:::rbd_data.17.afb3726b8b4567.00000000002ad142:head#",
            "src_offset": 512000,
            "len": 8192,
            "dst_offset": 512000
        },
        {
            "op_num": 1,
            "op_name": "remove",
            "collection": "18.10s5_head",
            "oid": "5#18:0a10e4e8:::rbd_data.17.afb3726b8b4567.00000000002ad142:head#310014"
        },
        {
            "op_num": 2,
            "op_name": "setattrs",
            "collection": "18.10s5_head",
            "oid": "5#18:0a10e4e8:::rbd_data.17.afb3726b8b4567.00000000002ad142:head#",
            "attr_lens": {
                "_": 298,
                "hinfo_key": 18,
                "snapset": 35
            }
        },
        {
            "op_num": 3,
            "op_name": "nop"
        },
        {
            "op_num": 4,
            "op_name": "op_omap_rmkeyrange",
            "collection": "18.10s5_head",
            "oid": "5#18:08000000::::head#",
            "first": "0000015786.00000000000003211274",
            "last": "4294967295.18446744073709551615"
        },
        {
            "op_num": 5,
            "op_name": "omap_setkeys",
            "collection": "18.10s5_head",
            "oid": "5#18:08000000::::head#",
            "attr_lens": {
                "_biginfo": 1646,
                "_epoch": 4,
                "_info": 1014,
                "can_rollback_to": 12,
                "rollback_info_trimmed_to": 12
            }
        }
    ]
}

    -1> 2018-12-20 00:27:42.757231 7f5795696700  1 -- 10.10.145.105:6801/29516 --> 10.10.145.100:6818/5468876 -- pg_info((query:59344 sent:59344 18.1es9( v 15786'3322304 (15786'3312225,15786'3322304] local-lis/les=59343/59344 n=163868 ec=408/408 lis/c 59343/3966 les/c/f 59344/3980/0 59341/59343/15773) 9->0)=(empty) epoch 59344) v5 -- 0x55c2f7e7be00 con 0
     0> 2018-12-20 00:27:42.758519 7f578f68a700 -1 /home/builder/source/ceph-12.2.10/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7f578f68a700 time 2018-12-20 00:27:42.754596
/home/builder/source/ceph-12.2.10/src/os/bluestore/BlueStore.cc: 9621: FAILED assert(0 == "unexpected error")

 ceph version 12.2.10 (fc2b1783e3727b66315cc667af9d663d30fe7ed4) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55c2d9a2d9b2]
 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x15fa) [0x55c2d98d6c2a]
 3: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x546) [0x55c2d98d8296]
 4: (ObjectStore::queue_transaction(ObjectStore::Sequencer*, ObjectStore::Transaction&&, Context*, Context*, Context*, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x14f) [0x55c2d945e0ef]
 5: (OSD::dispatch_context_transaction(PG::RecoveryCtx&, PG*, ThreadPool::TPHandle*)+0x6c) [0x55c2d93df8ec]
 6: (OSD::process_peering_events(std::__cxx11::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x442) [0x55c2d940de22]
 7: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x2c) [0x55c2d947e66c]
 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xeb8) [0x55c2d9a34b58]
 9: (ThreadPool::WorkThread::entry()+0x10) [0x55c2d9a35cf0]
 10: (()+0x7494) [0x7f57a7275494]
 11: (clone()+0x3f) [0x7f57a62fcacf]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.33.log
--- end dump of recent events ---
2018-12-20 00:27:42.795716 7f578f68a700 -1 *** Caught signal (Aborted) **
 in thread 7f578f68a700 thread_name:tp_peering

 ceph version 12.2.10 (fc2b1783e3727b66315cc667af9d663d30fe7ed4) luminous (stable)
 1: (()+0xa56bd4) [0x55c2d99e5bd4]
 2: (()+0x110c0) [0x7f57a727f0c0]
 3: (gsignal()+0xcf) [0x7f57a6246fff]
 4: (abort()+0x16a) [0x7f57a624842a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x55c2d9a2db3e]
 6: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x15fa) [0x55c2d98d6c2a]
 7: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x546) [0x55c2d98d8296]
 8: (ObjectStore::queue_transaction(ObjectStore::Sequencer*, ObjectStore::Transaction&&, Context*, Context*, Context*, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x14f) [0x55c2d945e0ef]
 9: (OSD::dispatch_context_transaction(PG::RecoveryCtx&, PG*, ThreadPool::TPHandle*)+0x6c) [0x55c2d93df8ec]
 10: (OSD::process_peering_events(std::__cxx11::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x442) [0x55c2d940de22]
 11: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x2c) [0x55c2d947e66c]
 12: (ThreadPool::worker(ThreadPool::WorkThread*)+0xeb8) [0x55c2d9a34b58]
 13: (ThreadPool::WorkThread::entry()+0x10) [0x55c2d9a35cf0]
 14: (()+0x7494) [0x7f57a7275494]
 15: (clone()+0x3f) [0x7f57a62fcacf]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
   -13> 2018-12-20 00:27:42.764248 7f57a3ed5700  5 -- 10.10.145.105:6802/29516 >> 10.10.145.102:0/436138 conn(0x55c2f8090000 :6802 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=54007 cs=1 l=1). rx osd.8 seq 2 0x55c2f7e7be00 osd_ping(ping e59343 stamp 2018-12-20 00:27:42.752504) v4
   -12> 2018-12-20 00:27:42.764269 7f57a3ed5700  1 -- 10.10.145.105:6802/29516 <== osd.8 10.10.145.102:0/436138 2 ==== osd_ping(ping e59343 stamp 2018-12-20 00:27:42.752504) v4 ==== 2004+0+0 (2799076138 0 0) 0x55c2f7e7be00 con 0x55c2f8090000
   -11> 2018-12-20 00:27:42.764282 7f57a3ed5700  1 -- 10.10.145.105:6802/29516 --> 10.10.145.102:0/436138 -- osd_ping(ping_reply e59344 stamp 2018-12-20 00:27:42.752504) v4 -- 0x55c2f7f12200 con 0
   -10> 2018-12-20 00:27:42.764391 7f57a2ed3700  5 -- 10.10.145.105:6803/29516 >> 10.10.145.102:0/436138 conn(0x55c2f8091800 :6803 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=54004 cs=1 l=1). rx osd.8 seq 2 0x55c2f76c8000 osd_ping(ping e59343 stamp 2018-12-20 00:27:42.752504) v4
    -9> 2018-12-20 00:27:42.764413 7f57a2ed3700  1 -- 10.10.145.105:6803/29516 <== osd.8 10.10.145.102:0/436138 2 ==== osd_ping(ping e59343 stamp 2018-12-20 00:27:42.752504) v4 ==== 2004+0+0 (2799076138 0 0) 0x55c2f76c8000 con 0x55c2f8091800
    -8> 2018-12-20 00:27:42.764430 7f57a2ed3700  1 -- 10.10.145.105:6803/29516 --> 10.10.145.102:0/436138 -- osd_ping(ping_reply e59344 stamp 2018-12-20 00:27:42.752504) v4 -- 0x55c2f7df0800 con 0
    -7> 2018-12-20 00:27:42.773561 7f57a2ed3700  5 -- 10.10.145.105:6802/29516 >> 10.10.145.104:0/3815033 conn(0x55c2f80eb000 :6802 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=50456 cs=1 l=1). rx osd.24 seq 1 0x55c2f7f97800 osd_ping(ping e59343 stamp 2018-12-20 00:27:42.760326) v4
    -6> 2018-12-20 00:27:42.773580 7f57a2ed3700  1 -- 10.10.145.105:6802/29516 <== osd.24 10.10.145.104:0/3815033 1 ==== osd_ping(ping e59343 stamp 2018-12-20 00:27:42.760326) v4 ==== 2004+0+0 (1752418694 0 0) 0x55c2f7f97800 con 0x55c2f80eb000
    -5> 2018-12-20 00:27:42.773591 7f57a2ed3700  1 -- 10.10.145.105:6802/29516 --> 10.10.145.104:0/3815033 -- osd_ping(ping_reply e59344 stamp 2018-12-20 00:27:42.760326) v4 -- 0x55c2f7df0800 con 0
    -4> 2018-12-20 00:27:42.773632 7f57a2ed3700  1 -- 10.10.145.105:6801/29516 --> 10.10.145.104:6807/23815033 -- osd_map(59344..59344 src has 3756..59344) v3 -- 0x55c2f892a080 con 0
    -3> 2018-12-20 00:27:42.773737 7f57a36d4700  5 -- 10.10.145.105:6803/29516 >> 10.10.145.104:0/3815033 conn(0x55c2f80ec800 :6803 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=50470 cs=1 l=1). rx osd.24 seq 1 0x55c2f7fc6c00 osd_ping(ping e59343 stamp 2018-12-20 00:27:42.760326) v4
    -2> 2018-12-20 00:27:42.773757 7f57a36d4700  1 -- 10.10.145.105:6803/29516 <== osd.24 10.10.145.104:0/3815033 1 ==== osd_ping(ping e59343 stamp 2018-12-20 00:27:42.760326) v4 ==== 2004+0+0 (1752418694 0 0) 0x55c2f7fc6c00 con 0x55c2f80ec800
    -1> 2018-12-20 00:27:42.773772 7f57a36d4700  1 -- 10.10.145.105:6803/29516 --> 10.10.145.104:0/3815033 -- osd_ping(ping_reply e59344 stamp 2018-12-20 00:27:42.760326) v4 -- 0x55c2f822d600 con 0
     0> 2018-12-20 00:27:42.795716 7f578f68a700 -1 *** Caught signal (Aborted) **
 in thread 7f578f68a700 thread_name:tp_peering

 ceph version 12.2.10 (fc2b1783e3727b66315cc667af9d663d30fe7ed4) luminous (stable)
 1: (()+0xa56bd4) [0x55c2d99e5bd4]
 2: (()+0x110c0) [0x7f57a727f0c0]
 3: (gsignal()+0xcf) [0x7f57a6246fff]
 4: (abort()+0x16a) [0x7f57a624842a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x55c2d9a2db3e]
 6: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x15fa) [0x55c2d98d6c2a]
 7: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x546) [0x55c2d98d8296]
 8: (ObjectStore::queue_transaction(ObjectStore::Sequencer*, ObjectStore::Transaction&&, Context*, Context*, Context*, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x14f) [0x55c2d945e0ef]
 9: (OSD::dispatch_context_transaction(PG::RecoveryCtx&, PG*, ThreadPool::TPHandle*)+0x6c) [0x55c2d93df8ec]
 10: (OSD::process_peering_events(std::__cxx11::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x442) [0x55c2d940de22]
 11: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x2c) [0x55c2d947e66c]
 12: (ThreadPool::worker(ThreadPool::WorkThread*)+0xeb8) [0x55c2d9a34b58]
 13: (ThreadPool::WorkThread::entry()+0x10) [0x55c2d9a35cf0]
 14: (()+0x7494) [0x7f57a7275494]
 15: (clone()+0x3f) [0x7f57a62fcacf]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.33.log
--- end dump of recent events ---


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[Index of Archives]     [Information on CEPH]     [Linux Filesystem Development]     [Ceph Development]     [Ceph Large]     [Linux USB Development]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [xfs]


  Powered by Linux