Re: Suddenly having slow writes

Mark Nelson <mnelson@xxxxxxxxxx> · Tue, 24 Jan 2017 12:26:40 -0600

On 01/24/2017 09:38 AM, Florent B wrote:
Hi everyone,

I run a Ceph Jewel cluster over 3 nodes having 3 Samsung 256GB SSD each
(9 OSD total).

I use it for RBD disks for my VMs.

It has run nice for a few weeks, then suddenly whole cluster is
extremely slow, Ceph is reporting blocked requests, recovery is endless
("ceph -s" is not showing recovery speed so it's slow...).

If I copy RBD images to another cluster, read of RBD is OK and fast. But
if I try to remove it after copy, slow requests are back.

Here is my  Ceph status without any VM running :

root@host105:~# ceph -s
    cluster 853be806-f101-45ec-9926-73df7e159838
     health HEALTH_WARN
            noout,require_jewel_osds flag(s) set
     monmap e9: 3 mons at
{3=10.111.0.105:6789/0,4=10.111.0.106:6789/0,5=10.111.0.107:6789/0}
            election epoch 3770, quorum 0,1,2 3,4,5
      fsmap e1103917: 1/1/1 up {0=host107=up:active}, 2 up:standby
     osdmap e1113536: 9 osds: 9 up, 9 in
            flags noout,require_jewel_osds
      pgmap v36244619: 168 pgs, 6 pools, 467 GB data, 120 kobjects
            743 GB used, 1192 GB / 1935 GB avail
                 168 active+clean

If I start a single VM, it's OK, no slow request. But if I heavily write
inside it, slow requests are back.

If I look in ceph.log, I can see that slow requests are due to
"currently waiting for subops from" and never from same OSD.

2017-01-24 16:29:51.576115 osd.7 10.111.0.105:6807/2675 74 : cluster
[WRN] slow request 120.096960 seconds old, received at 2017-01-24
16:27:51.479116: osd_op(client.116307465.0:6786 1.ac8f524e
rbd_data.30e6a6430d938c1.00000000000000ef [set-alloc-hint object_size
4194304 write_size 4194304,writefull 0~4194304] snapc 0=[]
ack+ondisk+write+known_if_redirected e1113684) currently waiting for
subops from 9,13
2017-01-24 16:29:52.300764 osd.10 10.111.0.106:6802/18968 57 : cluster
[WRN] 1 slow requests, 1 included below; oldest blocked for > 120.311863
secs
2017-01-24 16:29:52.300766 osd.10 10.111.0.106:6802/18968 58 : cluster
[WRN] slow request 120.311863 seconds old, received at 2017-01-24
16:27:51.988838: osd_op(client.116307465.0:6787 1.e9020b2b
rbd_data.30e6a6430d938c1.00000000000000f0 [set-alloc-hint object_size
4194304 write_size 4194304,writefull 0~4194304] snapc 0=[]
ack+ondisk+write+known_if_redirected e1113684) currently waiting for
subops from 8,12
2017-01-24 16:29:54.088301 osd.6 10.111.0.105:6803/2493 124 : cluster
[WRN] 3 slow requests, 2 included below; oldest blocked for > 120.298857
secs
2017-01-24 16:29:54.088304 osd.6 10.111.0.105:6803/2493 125 : cluster
[WRN] slow request 120.298857 seconds old, received at 2017-01-24
16:27:53.789391: osd_op(client.116307465.0:6793 1.454a3d14
rbd_data.30e6a6430d938c1.00000000000000f5 [set-alloc-hint object_size
4194304 write_size 4194304,writefull 0~4194304] snapc 0=[]
ack+ondisk+write+known_if_redirected e1113686) currently waiting for
subops from 10,12
2017-01-24 16:29:54.088305 osd.6 10.111.0.105:6803/2493 126 : cluster
[WRN] slow request 120.196257 seconds old, received at 2017-01-24
16:27:53.891992: osd_op(client.116307465.0:6794 1.f585dbe2
rbd_data.30e6a6430d938c1.00000000000000f6 [set-alloc-hint object_size
4194304 write_size 4194304,writefull 0~4194304] snapc 0=[]
ack+ondisk+write+known_if_redirected e1113686) currently waiting for
subops from 10,12
2017-01-24 16:29:57.088860 osd.6 10.111.0.105:6803/2493 127 : cluster
[WRN] 3 slow requests, 1 included below; oldest blocked for > 123.299409
secs
2017-01-24 16:29:57.088863 osd.6 10.111.0.105:6803/2493 128 : cluster
[WRN] slow request 120.554780 seconds old, received at 2017-01-24
16:27:56.534020: osd_op(client.116307465.0:6798 1.b5e4e0cf
rbd_data.30e6a6430d938c1.00000000000000f8 [set-alloc-hint object_size
4194304 write_size 4194304,writefull 0~4194304] snapc 0=[]
ack+ondisk+write+known_if_redirected e1113688) currently waiting for
subops from 9,12

Sometime I can dump operations from osd, for example :

root@host106:~# ceph daemon osd.10 dump_ops_in_flight
{
    "ops": [
        {
            "description": "osd_op(client.116307465.0:6787 1.e9020b2b
rbd_data.30e6a6430d938c1.00000000000000f0 [set-alloc-hint object_size
4194304 write_size 4194304,writefull 0~4194304] snapc 0=[]
ack+ondisk+write+known_if_redirected e1113684)",
            "initiated_at": "2017-01-24 16:27:51.988838",
            "age": 126.079344,
            "duration": 126.079362,
            "type_data": [
                "waiting for sub ops",
                {
                    "client": "client.116307465",
                    "tid": 6787
                },
                [
                    {
                        "time": "2017-01-24 16:27:51.988838",
                        "event": "initiated"
                    },
                    {
                        "time": "2017-01-24 16:27:51.990699",
                        "event": "queued_for_pg"
                    },
                    {
                        "time": "2017-01-24 16:27:51.990761",
                        "event": "reached_pg"
                    },
                    {
                        "time": "2017-01-24 16:27:51.990835",
                        "event": "started"
                    },
                    {
                        "time": "2017-01-24 16:27:51.991043",
                        "event": "waiting for subops from 8,12"
                    },
                    {
                        "time": "2017-01-24 16:27:51.991919",
                        "event": "commit_queued_for_journal_write"
                    },
                    {
                        "time": "2017-01-24 16:27:51.991928",
                        "event": "write_thread_in_journal_buffer"
                    },
                    {
                        "time": "2017-01-24 16:27:52.048083",
                        "event": "journaled_completion_queued"
                    },
                    {
                        "time": "2017-01-24 16:27:52.048153",
                        "event": "op_commit"
                    },
                    {
                        "time": "2017-01-24 16:27:52.049780",
                        "event": "op_applied"
                    },
                    {
                        "time": "2017-01-24 16:27:52.140422",
                        "event": "sub_op_commit_rec from 12"
                    }
                ]
            ]
        }
    ],
    "num_ops": 1
}

What does it mean exactly ?

Network seams OK, every host can ping each other, I don't have any problem.

Disks are recent, and Media_Wearout_Indicator is never under 80.

I tried to switch OSD journal to another partition (of the same disk),
but always same problem.

I did fstrim on every OSD, I added "discard" on mount options. Always
same problem.

Ceph OSD bench is reporting speeds between 115 and 280MB/s, but
sometimes some OSDs are under, like 60MB/s.

Can someone help me to understand where problem is ? If you have some
commands that I can run to debug ?

My first thought is that PGs are splitting.  You only appear to have 
168PGs for 9 OSDs, that's not nearly enough.  Beyond the poor data 
distribution and associated performance imbalance, your PGs will split 
very quickly since by default PGs start splitting at 320 Objects each. 
Typically this is less of a problem with RBD since by default it uses 
4MB objects (and thus there are fewer bigger objects), but with only 168 
PGs you are likely to be heavily splitting by the time you hit ~218GB of 
data (make sure to take replication into account).

Normally PG splitting shouldn't be terribly expensive since it's 
basically just reading a directory xattr, readdir on a small directory, 
then a bunch of link/unlink operations.  When SELinux is enable it 
appears that link/unlink might require an xattr read on each object/file 
to determine if the link/unlink can happen.  That's a ton of extra seek 
overhead.  On spinning disks this is especially bad with XFS since 
subdirs may not be in the same AG as a parent dir, so after subsequent 
splits, the directories become fragmented and those reads happen all 
over the disk (not as much of a problem with SSDs though).

Anyway, that's my guess as to what's going on, but it could be something 
else.  blktrace and/or xfs's kernel debugging stuff would probably lend 
some supporting evidence if this is what's going on.

Mark

Thank you a lot.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com