Hello
I've got a small cluster of 3 OSD servers with 30 OSDs between them, running Jewel 10.2.2 on Ubuntu 16.04 LTS with the stock 4.4.0-34-generic kernel.
I am experiencing fairly frequent OSD crashes, a few times a month on random OSDs. The latest one left the following in the log:
2016-08-30 06:26:29.861106 7f8ed54f1700 -1 journal aio to 13085011968~8192 wrote 18446744073709551615
2016-08-30 06:26:29.862558 7f8ed54f1700 -1 os/filestore/FileJournal.cc: In function 'void FileJournal::write_finish_thread_entry()' thread 7f8ed54f1700 time 2016-08-30 06:26:29.861122
os/filestore/FileJournal.cc: 1541: FAILED assert(0 == "unexpected aio error")
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x563563844040]
2: (FileJournal::write_finish_thread_entry()+0x991) [0x5635635c9701]
3: (FileJournal::WriteFinisher::entry()+0xd) [0x5635634f5b6d]
4: (()+0x76fa) [0x7f8ee354f6fa]
5: (clone()+0x6d) [0x7f8ee15c8b5d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
1/ 5 compressor
1/ 5 newstore
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
1/ 5 kinetic
1/ 5 fuse
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.30.log
--- end dump of recent events ---
2016-08-30 06:26:29.897346 7f8ed54f1700 -1 *** Caught signal (Aborted) **
in thread 7f8ed54f1700 thread_name:journal_wrt_fin
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
1: (()+0x96bdde) [0x56356373fdde]
2: (()+0x113d0) [0x7f8ee35593d0]
3: (gsignal()+0x38) [0x7f8ee14f7418]
4: (abort()+0x16a) [0x7f8ee14f901a]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x26b) [0x56356384422b]
6: (FileJournal::write_finish_thread_entry()+0x991) [0x5635635c9701]
7: (FileJournal::WriteFinisher::entry()+0xd) [0x5635634f5b6d]
8: (()+0x76fa) [0x7f8ee354f6fa]
9: (clone()+0x6d) [0x7f8ee15c8b5d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events ---
-6> 2016-08-30 06:26:29.864455 7f8e913db700 1 -- 192.168.168.203:6821/6750 <== osd.17 192.168.168.202:6814/6009 90921 ==== osd_repop_reply(client.156984580.0:1576297 5.d6) v1 ==== 83+0+0 (394499027 0 0) 0x56359a299480 con 0x56358db22600
-5> 2016-08-30 06:26:29.864480 7f8e913db700 5 -- op tracker -- seq: 4070377, time: 2016-08-30 06:26:29.864480, event: queued_for_pg, op: osd_repop_reply(client.156984580.0:1576297 5.d6)
-4> 2016-08-30 06:26:29.864534 7f8eae5a6700 5 -- op tracker -- seq: 4070377, time: 2016-08-30 06:26:29.864533, event: reached_pg, op: osd_repop_reply(client.156984580.0:1576297 5.d6)
-3> 2016-08-30 06:26:29.864547 7f8eae5a6700 5 -- op tracker -- seq: 4070377, time: 2016-08-30 06:26:29.864547, event: started, op: osd_repop_reply(client.156984580.0:1576297 5.d6 ondisk, result = 0)
-2> 2016-08-30 06:26:29.864556 7f8eae5a6700 5 -- op tracker -- seq: 4070376, time: 2016-08-30 06:26:29.864554, event: sub_op_commit_rec from 17, op: osd_op(client.156984580.0:1576297 5.6579f8d6 rbd_data.95a339c56ad606.00000000000000ad [stat,set-alloc-hint object_size 4194304 write_size 4194304,write 999424~4096] snapc 0=[] ack+ondisk+write+known_if_redirected e95553)
-1> 2016-08-30 06:26:29.864569 7f8eae5a6700 5 -- op tracker -- seq: 4070377, time: 2016-08-30 06:26:29.864569, event: done, op: osd_repop_reply(client.156984580.0:1576297 5.d6 ondisk, result = 0)
0> 2016-08-30 06:26:29.897346 7f8ed54f1700 -1 *** Caught signal (Aborted) **
in thread 7f8ed54f1700 thread_name:journal_wrt_fin
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
1: (()+0x96bdde) [0x56356373fdde]
2: (()+0x113d0) [0x7f8ee35593d0]
3: (gsignal()+0x38) [0x7f8ee14f7418]
4: (abort()+0x16a) [0x7f8ee14f901a]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x26b) [0x56356384422b]
6: (FileJournal::write_finish_thread_entry()+0x991) [0x5635635c9701]
7: (FileJournal::WriteFinisher::entry()+0xd) [0x5635634f5b6d]
8: (()+0x76fa) [0x7f8ee354f6fa]
9: (clone()+0x6d) [0x7f8ee15c8b5d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
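One thing I noticed, in case it helps narrow this down: the byte count in the "journal aio to 13085011968~8192 wrote 18446744073709551615" line is UINT64_MAX, which looks like a -1 error return reinterpreted as an unsigned 64-bit value rather than a real byte count. A quick sanity check of that reading (just my interpretation, not anything from the Ceph source):

```python
import struct

# Value reported by the journal aio log line.
wrote = 18446744073709551615

# It is exactly UINT64_MAX...
assert wrote == 2**64 - 1

# ...and reinterpreted as a signed 64-bit integer it is -1,
# the conventional error return from a failed write.
(signed,) = struct.unpack("<q", struct.pack("<Q", wrote))
print(signed)  # -1
```

If that is right, the aio write itself failed, which would make me suspect the journal device or the block layer underneath rather than Ceph.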
Does anyone know what is causing this? Is it a known issue currently being addressed?
Cheers
Andrei
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com