Re: slow requests from rados bench with small writes

On 02/16/2014 05:18 PM, Sage Weil wrote:
Good catch!

It sounds like what is needed here is for the deb and rpm packages to add
/var/lib/ceph to the PRUNEPATHS in /etc/updatedb.conf.  Unfortunately
there isn't a /etc/updatedb.conf.d type file, so that promises to be
annoying.
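
Roughly speaking, a maintainer script would have to edit the file in
place -- a minimal sketch, assuming the stock mlocate updatedb.conf
where PRUNEPATHS is a single double-quoted line (the grep/sed here is
only illustrative, not the actual packaging change):

# add /var/lib/ceph to PRUNEPATHS unless it is already listed
grep -q '/var/lib/ceph' /etc/updatedb.conf ||
    sed -i 's|^\(PRUNEPATHS="[^"]*\)"|\1 /var/lib/ceph"|' /etc/updatedb.conf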

Has anyone done this before?


No, I haven't, but I've seen this before. With Puppet I also overwrite this file.

Btw, I suggest we also contact Canonical to add 'ceph' to PRUNEFS, otherwise clients will start indexing CephFS filesystems later.
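
For the record, that edit would amount to something like this on the
clients (again just a sketch against the stock updatedb.conf;
'fuse.ceph-fuse' for ceph-fuse mounts would presumably want the same
treatment):

# add the CephFS filesystem type to PRUNEFS unless it is already there
grep -q '^PRUNEFS=.*ceph' /etc/updatedb.conf ||
    sed -i 's|^\(PRUNEFS="[^"]*\)"|\1 ceph"|' /etc/updatedb.conf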

Wido

sage


On Sun, 16 Feb 2014, Dan van der Ster wrote:

After some further digging I realized that updatedb was crawling over
the PGs, indexing all the objects. (According to iostat, updatedb was
keeping the indexed disk 100% busy!) Oops!
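
For anyone who wants to check for the same thing on their OSD hosts,
something along these lines is enough to spot it (device names are
placeholders):

iostat -x 1          # %util pinned near 100 on the OSD data disks
pidof updatedb       # confirm the indexer is actually running
iotop -obn1 | head   # if iotop is installed: updatedb tops the I/O list
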
Since the disks are using the deadline elevator (which by default
prioritizes reads over writes, and gives writes a deadline of 5
seconds!), it is perhaps conceivable (yet still surprising) that the
queues on a few disks were so full of reads that the writes were
starved for many tens of seconds.
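
For context, those defaults can be checked per disk via sysfs (sdX is
a placeholder):

cat /sys/block/sdX/queue/scheduler               # noop [deadline] cfq
cat /sys/block/sdX/queue/iosched/read_expire     # 500 (ms)
cat /sys/block/sdX/queue/iosched/write_expire    # 5000 (ms)
cat /sys/block/sdX/queue/iosched/writes_starved  # 2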

I've killed updatedb everywhere and now the rados bench below isn't
triggering slow requests.
So now I'm planning to tune deadline so it doesn't prioritize reads so
much, namely by decreasing write_expire to equal read_expire at 500ms,
and setting writes_starved to 1. Initial tests are showing that this
further decreases latency a bit -- but my hope is that this will
eliminate the possibility of a very long tail of writes. I hope that
someone will chip in if they've already been down this path and have
advice/warnings.
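
Concretely, what I'm testing is just the following (sdX is a
placeholder, and it would of course need a udev rule or init script to
survive reboots):

echo 500 > /sys/block/sdX/queue/iosched/write_expire    # same deadline as reads
echo 1   > /sys/block/sdX/queue/iosched/writes_starved  # at most one read batch before a write batch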

Cheers,
dan

-- Dan van der Ster || Data & Storage Services || CERN IT Department --

On Sat, Feb 15, 2014 at 11:48 PM, Dan van der Ster
<daniel.vanderster@xxxxxxx> wrote:
Dear Ceph experts,

We've found that a single client running rados bench can drive other
users, e.g. RBD users, into slow requests.

Starting with a cluster that is not particularly busy, e.g.:

2014-02-15 23:14:33.714085 mon.0 xx:6789/0 725224 : [INF] pgmap v6561996: 27952 pgs: 27952 active+clean; 66303 GB data, 224 TB used, 2850 TB / 3075 TB avail; 4880KB/s rd, 28632KB/s wr, 271op/s

We then start a rados bench writing many small objects:
rados bench -p test 60 write -t 500 -b 1024 --no-cleanup

which gives these results (note the >60s max latency!!):

Total time run: 86.351424
Total writes made: 91425
Write size: 1024
Bandwidth (MB/sec): 1.034
Stddev Bandwidth: 1.26486
Max bandwidth (MB/sec): 7.14941
Min bandwidth (MB/sec): 0
Average Latency: 0.464847
Stddev Latency: 3.04961
Max latency: 66.4363
Min latency: 0.003188

30 seconds into this bench we start seeing slow requests, not only
from bench writes but also from some poor RBD clients, e.g.:

2014-02-15 23:16:02.820507 osd.483 xx:6804/46799 2201 : [WRN] slow
request 30.195634 seconds old, received at 2014-02-15 23:15:32.624641:
osd_sub_op(client.18535427.0:3922272 4.d42
4eb00d42/rbd_data.11371325138b774.0000000000006577/head//4 [] v
42083'71453 snapset=0=[]:[] snapc=0=[]) v7 currently commit sent

During a longer, many-hour instance of this small-write test, some of
these RBD slow writes became very user-visible, with disk flushes
being blocked long enough (>120s) for the VM kernels to start
complaining.

A rados bench from a 10GbE client writing 4MB objects doesn't show
the same long tail of latency:

# rados bench -p test 60 write -t 500 --no-cleanup
...
Total time run: 62.811466
Total writes made: 8553
Write size: 4194304
Bandwidth (MB/sec): 544.678

Stddev Bandwidth: 173.163
Max bandwidth (MB/sec): 1000
Min bandwidth (MB/sec): 0
Average Latency: 3.50719
Stddev Latency: 0.309876
Max latency: 8.04493
Min latency: 0.166138

and there are zero slow requests, at least during this 60s duration.

While the vast majority of small writes are completing with a
reasonable sub-second latency, what is causing the very long tail
(60-120s!) seen by a few writes? Can someone advise us where to look
in the perf dump, etc. to find which resource/queue is being exhausted
during these tests?
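
In case it helps, we can of course pull whatever is needed from the
OSD admin sockets, e.g. (N and the socket path stand in for the usual
defaults):

ceph --admin-daemon /var/run/ceph/ceph-osd.N.asok perf dump
ceph --admin-daemon /var/run/ceph/ceph-osd.N.asok dump_ops_in_flight
ceph --admin-daemon /var/run/ceph/ceph-osd.N.asok dump_historic_ops  # if this release has it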

Oh yeah, we're running the latest dumpling stable release, 0.67.5, on the servers.

Best Regards, Thanks in advance!
Dan

-- Dan van der Ster || Data & Storage Services || CERN IT Department --

--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



