Hi everyone,
I run a Ceph Jewel cluster on 3 nodes, each with 3 Samsung 256 GB SSDs
(9 OSDs total).
I use it for RBD disks for my VMs.
It ran fine for a few weeks, then suddenly the whole cluster became
extremely slow: Ceph reports blocked requests, and recovery is endless
("ceph -s" doesn't even show a recovery speed, so it's crawling...).
If I copy RBD images to another cluster, reading from RBD is fine and fast.
But if I try to remove an image after copying it, the slow requests come back.
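For reference, the copy/remove sequence is roughly the following (image and
pool names are just examples, not the real ones):

rbd export rbd/vm-disk-1 - | ssh other-cluster-host rbd import - rbd/vm-disk-1   # read/export side is fast
rbd rm rbd/vm-disk-1                                                             # removal brings the slow requests back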
Here is my Ceph status with no VMs running:
root@host105:~# ceph -s
    cluster 853be806-f101-45ec-9926-73df7e159838
     health HEALTH_WARN
            noout,require_jewel_osds flag(s) set
     monmap e9: 3 mons at {3=10.111.0.105:6789/0,4=10.111.0.106:6789/0,5=10.111.0.107:6789/0}
            election epoch 3770, quorum 0,1,2 3,4,5
      fsmap e1103917: 1/1/1 up {0=host107=up:active}, 2 up:standby
     osdmap e1113536: 9 osds: 9 up, 9 in
            flags noout,require_jewel_osds
      pgmap v36244619: 168 pgs, 6 pools, 467 GB data, 120 kobjects
            743 GB used, 1192 GB / 1935 GB avail
                 168 active+clean
If I start a single VM, everything is OK, no slow requests. But if I write
heavily inside it, the slow requests come back.
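By "write heavily" I mean something like this inside the guest (file path
and size are just an example):

dd if=/dev/zero of=/root/bigfile bs=4M count=2048 oflag=direct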
If I look in ceph.log, I can see that the slow requests are due to
"currently waiting for subops from", and it is not always the same OSDs:
2017-01-24 16:29:51.576115 osd.7 10.111.0.105:6807/2675 74 : cluster
[WRN] slow request 120.096960 seconds old, received at 2017-01-24
16:27:51.479116: osd_op(client.116307465.0:6786 1.ac8f524e
rbd_data.30e6a6430d938c1.00000000000000ef [set-alloc-hint object_size
4194304 write_size 4194304,writefull 0~4194304] snapc 0=[]
ack+ondisk+write+known_if_redirected e1113684) currently waiting for
subops from 9,13
2017-01-24 16:29:52.300764 osd.10 10.111.0.106:6802/18968 57 : cluster
[WRN] 1 slow requests, 1 included below; oldest blocked for > 120.311863
secs
2017-01-24 16:29:52.300766 osd.10 10.111.0.106:6802/18968 58 : cluster
[WRN] slow request 120.311863 seconds old, received at 2017-01-24
16:27:51.988838: osd_op(client.116307465.0:6787 1.e9020b2b
rbd_data.30e6a6430d938c1.00000000000000f0 [set-alloc-hint object_size
4194304 write_size 4194304,writefull 0~4194304] snapc 0=[]
ack+ondisk+write+known_if_redirected e1113684) currently waiting for
subops from 8,12
2017-01-24 16:29:54.088301 osd.6 10.111.0.105:6803/2493 124 : cluster
[WRN] 3 slow requests, 2 included below; oldest blocked for > 120.298857
secs
2017-01-24 16:29:54.088304 osd.6 10.111.0.105:6803/2493 125 : cluster
[WRN] slow request 120.298857 seconds old, received at 2017-01-24
16:27:53.789391: osd_op(client.116307465.0:6793 1.454a3d14
rbd_data.30e6a6430d938c1.00000000000000f5 [set-alloc-hint object_size
4194304 write_size 4194304,writefull 0~4194304] snapc 0=[]
ack+ondisk+write+known_if_redirected e1113686) currently waiting for
subops from 10,12
2017-01-24 16:29:54.088305 osd.6 10.111.0.105:6803/2493 126 : cluster
[WRN] slow request 120.196257 seconds old, received at 2017-01-24
16:27:53.891992: osd_op(client.116307465.0:6794 1.f585dbe2
rbd_data.30e6a6430d938c1.00000000000000f6 [set-alloc-hint object_size
4194304 write_size 4194304,writefull 0~4194304] snapc 0=[]
ack+ondisk+write+known_if_redirected e1113686) currently waiting for
subops from 10,12
2017-01-24 16:29:57.088860 osd.6 10.111.0.105:6803/2493 127 : cluster
[WRN] 3 slow requests, 1 included below; oldest blocked for > 123.299409
secs
2017-01-24 16:29:57.088863 osd.6 10.111.0.105:6803/2493 128 : cluster
[WRN] slow request 120.554780 seconds old, received at 2017-01-24
16:27:56.534020: osd_op(client.116307465.0:6798 1.b5e4e0cf
rbd_data.30e6a6430d938c1.00000000000000f8 [set-alloc-hint object_size
4194304 write_size 4194304,writefull 0~4194304] snapc 0=[]
ack+ondisk+write+known_if_redirected e1113688) currently waiting for
subops from 9,12
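If it helps, I can map the objects from these log lines to their PG and
acting set with something like this (assuming pool 1 is my rbd pool, and
using one of the object names from above):

ceph osd map rbd rbd_data.30e6a6430d938c1.00000000000000ef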
Sometimes I can dump in-flight operations from an OSD, for example:
root@host106:~# ceph daemon osd.10 dump_ops_in_flight
{
    "ops": [
        {
            "description": "osd_op(client.116307465.0:6787 1.e9020b2b rbd_data.30e6a6430d938c1.00000000000000f0 [set-alloc-hint object_size 4194304 write_size 4194304,writefull 0~4194304] snapc 0=[] ack+ondisk+write+known_if_redirected e1113684)",
            "initiated_at": "2017-01-24 16:27:51.988838",
            "age": 126.079344,
            "duration": 126.079362,
            "type_data": [
                "waiting for sub ops",
                {
                    "client": "client.116307465",
                    "tid": 6787
                },
                [
                    {
                        "time": "2017-01-24 16:27:51.988838",
                        "event": "initiated"
                    },
                    {
                        "time": "2017-01-24 16:27:51.990699",
                        "event": "queued_for_pg"
                    },
                    {
                        "time": "2017-01-24 16:27:51.990761",
                        "event": "reached_pg"
                    },
                    {
                        "time": "2017-01-24 16:27:51.990835",
                        "event": "started"
                    },
                    {
                        "time": "2017-01-24 16:27:51.991043",
                        "event": "waiting for subops from 8,12"
                    },
                    {
                        "time": "2017-01-24 16:27:51.991919",
                        "event": "commit_queued_for_journal_write"
                    },
                    {
                        "time": "2017-01-24 16:27:51.991928",
                        "event": "write_thread_in_journal_buffer"
                    },
                    {
                        "time": "2017-01-24 16:27:52.048083",
                        "event": "journaled_completion_queued"
                    },
                    {
                        "time": "2017-01-24 16:27:52.048153",
                        "event": "op_commit"
                    },
                    {
                        "time": "2017-01-24 16:27:52.049780",
                        "event": "op_applied"
                    },
                    {
                        "time": "2017-01-24 16:27:52.140422",
                        "event": "sub_op_commit_rec from 12"
                    }
                ]
            ]
        }
    ],
    "num_ops": 1
}
What does this mean exactly?
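If it is useful, I can also capture recently completed ops with their full
event timelines, e.g.:

ceph daemon osd.10 dump_historic_ops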
The network seems OK: every host can ping every other host, and I don't see any problem there.
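So far I have only checked with plain ping between the hosts. I could also
run something like the following between the nodes to rule out MTU or
bandwidth problems, if that sounds useful (the -s 8972 value assumes a 9000
MTU on the cluster network; with a 1500 MTU it would be -s 1472):

ping -M do -s 8972 -c 5 10.111.0.106
iperf -s                      # on 10.111.0.106
iperf -c 10.111.0.106 -t 30   # on 10.111.0.105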
The disks are recent, and their Media_Wearout_Indicator never goes below 80.
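That wearout value comes from SMART, read roughly like this (device name is
just an example):

smartctl -A /dev/sda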
I tried switching the OSD journal to another partition (on the same disk),
but the problem stays the same.
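For the journal move I followed roughly this procedure per OSD (the OSD id
and the target partition are just examples; noout was already set as shown
above):

systemctl stop ceph-osd@7
ceph-osd -i 7 --flush-journal
ln -sf /dev/disk/by-partuuid/<new-journal-partition> /var/lib/ceph/osd/ceph-7/journal
ceph-osd -i 7 --mkjournal
systemctl start ceph-osd@7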
I ran fstrim on every OSD and added "discard" to the mount options. Still
the same problem.
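Concretely, that was something like this on each OSD mount (mount point
shown for osd.0 as an example), plus "discard" appended to the OSD
filesystem's options in fstab:

fstrim -v /var/lib/ceph/osd/ceph-0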
Ceph OSD bench reports speeds between 115 and 280 MB/s, but sometimes some
OSDs fall well below that, down to around 60 MB/s.
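Those numbers come from something like this, repeated for every OSD id
(it writes 1 GB by default):

ceph tell osd.0 bench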
Can someone help me understand where the problem is? Are there any commands
I can run to debug this further?