Re: CephFS and slow requests

It's really bizarre, since we can easily pump ~1GB/s into the cluster with rados bench from a single 10Gig-E client. We only observe this with kernel CephFS on that host -- which is why our original theory was something like this (rough numbers below):
   - client caches 4GB of writes
   - client starts many IOs in parallel to flush that cache
   - each individual 4MB write is taking longer than 30s to send from the client to the OSD, due to the 1Gig-E network interface on the client.
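
A rough back-of-envelope check of that (assuming the flush drains through the 1Gig-E link at roughly 110 MB/s, and that each queued 4MB op effectively waits behind the whole cache):

   dirty_mb=4096    # dirty data buffered on the client before the flush
   link_mbs=110     # usable 1Gig-E throughput, roughly
   echo "$((dirty_mb / link_mbs)) s to drain"   # ~37 s, already past the 30 s slow-request warning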

But that theory assumes quite a lot about the implementations of librados and the OSD. Still, something like this would also explain why only the CephFS writes become slow -- the ~2kHz of other (mostly RBD) IOs are not affected by this "overload".
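
One way to sanity-check the buffered-flush theory (just a sketch) would be to watch the client's dirty page counters from a second shell while the dd runs:

   watch -n1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'

If Dirty climbs to several GB and then takes tens of seconds to drain through Writeback, that would line up with the 30+ second slow requests.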

Cheers, Dan


-- Dan van der Ster || Data & Storage Services || CERN IT Department --


On Tue, Feb 25, 2014 at 7:25 AM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
I'm with Zheng on this one. I'm a little confused though, because I
thought this was a pretty large cluster that should be able to absorb
that much data pretty easily. But if you're using a custom striping
strategy and pushing it all through one OSD, that could do it. Or
anything else with that sort of outcome, because obviously you've got
OSDs that are simply getting overloaded by the traffic pattern.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Fri, Feb 21, 2014 at 4:06 PM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> On Sat, Feb 22, 2014 at 12:04 AM, Dan van der Ster
> <daniel.vanderster@xxxxxxx> wrote:
>> Hi Greg,
>> Yes, this still happens after the updatedb fix.
>>
>> [root@xxx dan]# mount
>> ...
>> zzz:6789:/ on /mnt/ceph type ceph (name=cephfs,key=client.cephfs)
>>
>> [root@xxx dan]# pwd
>> /mnt/ceph/dan
>>
>> [root@xxx dan]# dd if=/dev/zero of=yyy bs=4M count=2000
>> 2000+0 records in
>> 2000+0 records out
>> 8388608000 bytes (8.4 GB) copied, 9.21217 s, 911 MB/s
>>
>>
>> Then 30s later:
>>
>> 2014-02-21 16:16:11.315110 osd.326 x:6836/31929 683 : [WRN] 1 slow requests,
>> 1 included below; oldest blocked for > 32.432401 secs
>> 2014-02-21 16:16:11.315317 osd.326 x:6836/31929 684 : [WRN] slow request
>> 32.432401 seconds old, received at 2014-02-21 16:15:38.882584:
>> osd_op(client.16735018.1:22522476 100000352bf.000002a4 [write 0~4194304
>> [8@0],startsync 0~0] 0.5447d769 snapc 1=[] e42655) v4 currently waiting for
>> subops from [357,191]
>>
>> And no slow requests for other active clients.
>>
>> Reminder, this is a 1GigE client, 64GB RAM, kernel 3.13.0-1.el6.elrepo.x86_64,
>> kernel-mounted CephFS. I can't reproduce this on a 1GigE client with only
>> 8GB RAM, 3.11.0-15-generic and 3.13.4-031304-generic. (The smaller-RAM
>> client writes at 110-120MB/s vs the 900MB/s seen on the big-RAM machine --
>> obviously the writes are all buffered on the big-RAM machine.) Maybe the
>> RAM isn't related, though, as with fdatasync mode we still see the slow
>> requests:
>>
>> [root@xxx dan]# dd if=/dev/zero of=yyy bs=4M count=2000 conv=fdatasync
>> 2000+0 records in
>> 2000+0 records out
>> 8388608000 bytes (8.4 GB) copied, 78.26 s, 107 MB/s
>
> It's likely this issue is related to big RAM. Big RAM allows the kernel
> to cache a large amount of dirty data, so the kernel creates lots of OSD
> requests when it flushes that dirty data. (conv=fdatasync doesn't help
> here because dd calls fdatasync only after all the buffered writes finish.)
>
> Regards
> Yan, Zheng
>
>>
>> 2014-02-21 16:26:15.202047 osd.818 x:6803/128164 1219 : [WRN] 1 slow
>> requests, 1 included below; oldest blocked for > 30.446683 secs
>> 2014-02-21 16:26:15.202194 osd.818 x:6803/128164 1220 : [WRN] slow request
>> 30.446683 seconds old, received at 2014-02-21 16:25:44.754914:
>> osd_op(client.16735018.1:22524842 100000352bf.00000355 [write 0~4194304
>> [12@0],startsync 0~0] 0.c36d4557 snapc 1=[] e42655) v4 currently waiting for
>> subops from [558,827]
>>
>>
>> Cheers, Dan
>>
>>
>>
>> -- Dan van der Ster || Data & Storage Services || CERN IT Department --
>>
>>
>> On Thu, Feb 20, 2014 at 4:02 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>>
>>> Arne,
>>> Sorry this got dropped -- I had it marked in my mail but didn't have
>>> the chance to think about it seriously when you sent it. Does this
>>> still happen after the updatedb config change you guys made recently?
>>> -Greg
>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>
>>>
>>> On Fri, Jan 31, 2014 at 5:52 AM, Arne Wiebalck <Arne.Wiebalck@xxxxxxx>
>>> wrote:
>>> > Hi,
>>> >
>>> > We observe that we can easily create slow requests with a simple dd on
>>> > CephFS:
>>> >
>>> > -->
>>> > [root@p05153026953834 dd]# dd if=/dev/zero of=xxx bs=4M count=1000
>>> > 1000+0 records in
>>> > 1000+0 records out
>>> > 4194304000 bytes (4.2 GB) copied, 4.27824 s, 980 MB/s
>>> >
>>> > ceph -w:
>>> > 2014-01-31 14:28:44.009543 osd.450 [WRN] 1 slow requests, 1 included
>>> > below;
>>> > oldest blocked for > 31.088950 secs
>>> > 2014-01-31 14:28:44.009676 osd.450 [WRN] slow request 31.088950 seconds
>>> > old,
>>> > received at 2014-01-31 14:28:12.920423:
>>> > osd_op(client.16735018.1:22493091
>>> > 100000352b3.000002e9 [write 0~4194304,startsync 0~0] 0.518f2eef snapc
>>> > 1=[]
>>> > e32400) v4 currently waiting for subops from [87,1190]
>>> > <---
>>> >
>>> > From what we see, the OSDs are not busy, so we suspect that it is the
>>> > client starting all the requests, which then take longer than 30 secs
>>> > to finish writing, i.e. to flush the client-side buffers.
>>> >
>>> > Is our understanding correct?
>>> > Do these slow requests have an impact on requests from other clients,
>>> > i.e. are some OSD resources consumed by these slow requests?
>>> >
>>> > The setup is:
>>> > Client: kernel 3.13.0, 1GbE
>>> > MDS Emperor 0.72.2
>>> > OSDs Dumpling 0.67.5
>>> >
>>> > Thanks!
>>> >  Dan & Arne
>>> >
>>> >
>>> > --
>>> > Arne Wiebalck
>>> > CERN IT
>>> >
>>> >

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
