Re: slow requests from rados bench with small writes

After some further digging I realized that updatedb was crawling over
the PG directories, indexing all of the objects. (According to iostat,
updatedb was keeping the disk it was indexing 100% busy!) Oops!
Since the disks use the deadline elevator (which by default prioritizes
reads over writes, and gives writes a deadline of 5 seconds!), it is
perhaps conceivable (yet still surprising) that the queues on a few
disks were so full of reads that writes were starved for many tens of
seconds.
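
(For anyone wanting to check their own OSD nodes, something like this
shows the elevator in use and its current deadline tunables -- just a
sketch, with /dev/sdb standing in for one OSD data disk; the exact
output will vary by kernel:)

# cat /sys/block/sdb/queue/scheduler
noop [deadline] cfq
# grep . /sys/block/sdb/queue/iosched/*
/sys/block/sdb/queue/iosched/fifo_batch:16
/sys/block/sdb/queue/iosched/front_merges:1
/sys/block/sdb/queue/iosched/read_expire:500
/sys/block/sdb/queue/iosched/write_expire:5000
/sys/block/sdb/queue/iosched/writes_starved:2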

I've killed updatedb everywhere and now the rados bench below isn't
triggering slow requests.
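
(A less drastic alternative to killing it is to exclude the OSD mounts
from indexing in /etc/updatedb.conf, e.g. the line below -- /var/lib/ceph
is only an assumption about where your OSDs are mounted, and you'd keep
whatever paths your distro already prunes:)

PRUNEPATHS="/tmp /var/spool /media /var/lib/ceph"
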
So now I'm planning to tune deadline so it doesn't prioritize reads so
much, namely by decreasing write_expire to equal read_expire at 500ms,
and setting writes_starved to 1. Initial tests are showing that this
further decreases latency a bit -- but my hope is that this will
eliminate the possibility of a very long tail of writes. I hope that
someone will chip in if they've already been down this path and have
advice/warnings.
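
(In case anyone wants to try the same, something along these lines
applies it -- a sketch only, assuming every sd* device is an OSD data
disk already on the deadline elevator, and sysfs tunables reset on
reboot so they still need to be persisted somewhere:)

for q in /sys/block/sd*/queue/iosched; do
    echo 500 > $q/write_expire    # down from the 5000 ms default, equal to read_expire
    echo 1   > $q/writes_starved  # down from the default of 2
done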

Cheers,
dan

-- Dan van der Ster || Data & Storage Services || CERN IT Department --

On Sat, Feb 15, 2014 at 11:48 PM, Dan van der Ster
<daniel.vanderster@xxxxxxx> wrote:
> Dear Ceph experts,
>
> We've found that a single client running rados bench can drive other
> users, e.g. RBD clients, into slow requests.
>
> Starting with a cluster that is not particularly busy, e.g. :
>
> 2014-02-15 23:14:33.714085 mon.0 xx:6789/0 725224 : [INF] pgmap
> v6561996: 27952 pgs: 27952 active+clean; 66303 GB data, 224 TB used,
> 2850 TB / 3075 TB avail; 4880KB/s rd, 28632KB/s wr, 271op/s
>
> We then start a rados bench writing many small objects:
> rados bench -p test 60 write -t 500 -b 1024 --no-cleanup
>
> which gives these results (note the >60s max latency!!):
>
> Total time run: 86.351424
> Total writes made: 91425
> Write size: 1024
> Bandwidth (MB/sec): 1.034
> Stddev Bandwidth: 1.26486
> Max bandwidth (MB/sec): 7.14941
> Min bandwidth (MB/sec): 0
> Average Latency: 0.464847
> Stddev Latency: 3.04961
> Max latency: 66.4363
> Min latency: 0.003188
>
> 30 seconds into this bench we start seeing slow requests, not only
> from bench writes but also some poor RBD clients, e.g.:
>
> 2014-02-15 23:16:02.820507 osd.483 xx:6804/46799 2201 : [WRN] slow
> request 30.195634 seconds old, received at 2014-02-15 23:15:32.624641:
> osd_sub_op(client.18535427.0:3922272 4.d42
> 4eb00d42/rbd_data.11371325138b774.0000000000006577/head//4 [] v
> 42083'71453 snapset=0=[]:[] snapc=0=[]) v7 currently commit sent
>
> During a longer, many-hour instance of this small write test, some of
> these RBD slow writes became very user visible, with disk flushes
> being blocked long enough (>120s) for the VM kernels to start
> complaining.
>
> A rados bench from a 10Gig-e client writing 4MB objects doesn't have
> the same long tail of latency, namely:
>
> # rados bench -p test 60 write -t 500 --no-cleanup
> ...
> Total time run: 62.811466
> Total writes made: 8553
> Write size: 4194304
> Bandwidth (MB/sec): 544.678
>
> Stddev Bandwidth: 173.163
> Max bandwidth (MB/sec): 1000
> Min bandwidth (MB/sec): 0
> Average Latency: 3.50719
> Stddev Latency: 0.309876
> Max latency: 8.04493
> Min latency: 0.166138
>
> and there are zero slow requests, at least during this 60s duration.
>
> While the vast majority of small writes are completing with a
> reasonable sub-second latency, what is causing the very long tail seen
> by a few writes?? -- 60-120s!! Can someone advise us where to look in
> the perf dump, etc... to find which resource/queue is being exhausted
> during these tests?
>
> Oh yeah, we're running latest dumpling stable, 0.67.5, on the servers.
>
> Best Regards, Thanks in advance!
> Dan
>
> -- Dan van der Ster || Data & Storage Services || CERN IT Department --
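
(Re: the perf dump question quoted above, for anyone else chasing
something similar: the OSD admin socket shows both the in-flight ops and
the per-OSD counters -- a rough sketch, assuming the default socket path,
with osd.483 taken from the slow request log as the example:)

ceph --admin-daemon /var/run/ceph/ceph-osd.483.asok dump_ops_in_flight
ceph --admin-daemon /var/run/ceph/ceph-osd.483.asok perf dump
ceph --admin-daemon /var/run/ceph/ceph-osd.483.asok dump_historic_ops  # if your build has it
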
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



