Re: ceph-fuse CPU and Memory usage vs CephFS kclient

On 04/10/2018 09:45 PM, Gregory Farnum wrote:
> On Tue, Apr 10, 2018 at 12:36 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
>>
>>
>> On 04/10/2018 09:22 PM, Gregory Farnum wrote:
>>> On Tue, Apr 10, 2018 at 6:32 AM Wido den Hollander <wido@xxxxxxxx
>>> <mailto:wido@xxxxxxxx>> wrote:
>>>
>>>     Hi,
>>>
>>>     There have been numerous threads about this in the past, but I wanted to
>>>     bring this up again in a new situation.
>>>
>>>     Running with Luminous v12.2.4 I'm seeing some odd Memory and CPU usage
>>>     when using the ceph-fuse client to mount a multi-MDS CephFS filesystem.
>>>
>>>         health: HEALTH_OK
>>>
>>>       services:
>>>         mon: 3 daemons, quorum luvil,sanomat,tide
>>>         mgr: luvil(active), standbys: tide, sanomat
>>>         mds: svw-2/2/2 up  {0=luvil=up:active,1=tide=up:active}, 1
>>>     up:standby
>>>         osd: 112 osds: 111 up, 111 in
>>>
>>>       data:
>>>         pools:   2 pools, 4352 pgs
>>>         objects: 85549k objects, 4415 GB
>>>         usage:   50348 GB used, 772 TB / 821 TB avail
>>>         pgs:     4352 active+clean
>>>
>>>     After running an rsync with millions of files (and some directories
>>>     having 1M files) a ceph-fuse process was using 44GB RSS and between
>>>     100% and 200% CPU.
>>>
>>>     Looking at this FUSE client through the admin socket, the objecter was
>>>     one of my first suspects, but it claimed to only be using ~300M of data
>>>     in its cache, spread out over tens of thousands of files.
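>>>
>>>     (For reference, this is roughly how I looked at it; the socket path is
>>>     an example and depends on where the client's admin socket lives:
>>>
>>>       ceph daemon /var/run/ceph/ceph-client.admin.asok objecter_requests
>>>       ceph daemon /var/run/ceph/ceph-client.admin.asok perf dump
>>>
>>>     The first lists the in-flight objecter requests, the second dumps the
>>>     client's perf counters.)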
>>>
>>>     After unmounting and mounting again the Memory usage was gone and we
>>>     tried the rsync again, but it wasn't reproducible.
>>>
>>>     The CPU usage, however, is reproducible: a "simple" rsync would cause
>>>     ceph-fuse to use up to 100% CPU.
>>>
>>>     Switching to the kernel client (4.16 kernel) seems to solve this, but
>>>     the reasons for using ceph-fuse in this case are the lack of a recent
>>>     kernel in Debian 9 and the ease of upgrading the FUSE client.
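>>>
>>>     (For comparison, the two mounts look roughly like this; monitor names,
>>>     paths and credentials are just examples:
>>>
>>>       ceph-fuse -n client.admin /mnt/cephfs
>>>       mount -t ceph mon1,mon2,mon3:/ /mnt/cephfs \
>>>         -o name=admin,secretfile=/etc/ceph/admin.secret
>>>
>>>     The kernel client obviously depends on the kernel shipped with the
>>>     distribution, which is the problem on stock Debian 9.)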
>>>
>>>     I've tried to disable all logging inside the FUSE client, but that
>>>     didn't help.
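>>>
>>>     (By "disable all logging" I mean [client] settings along these lines;
>>>     a sketch, not necessarily the exact set I used:
>>>
>>>       debug client = 0/0
>>>       debug objecter = 0/0
>>>       debug objectcacher = 0/0
>>>       debug ms = 0/0)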
>>>
>>>     When checking the FUSE client's admin socket I saw that rename()
>>>     operations were hanging, and renames are something rsync does a lot.
>>>
>>>     At the same time I saw a getfattr() being done to the same inode by the
>>>     FUSE client, but to a different MDS:
>>>
>>>     rename(): mds rank 0
>>>     getfattr: mds rank 1
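>>>
>>>     (This came from the list of in-flight MDS requests on the client's
>>>     admin socket, i.e. "ceph daemon <socket> mds_requests", with the
>>>     socket path being the same example as above.)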
>>>
>>>     Although the kernel client seems to perform better, it shows the same
>>>     behavior when looking at the mdsc file in /sys:
>>>
>>>     216729  mds0    create  (unsafe)
>>>     #100021abbd9/.ddd.010236269.mpeg21.a0065.folia.xml.gz.AuxBQj
>>>     (reddata2/.ddd.010236269.mpeg21.a0065.folia.xml.gz.AuxBQj)
>>>
>>>     216731  mds1    rename
>>>      #100021abbd9/ddd.010236269.mpeg21.a0065.folia.xml.gz
>>>     (reddata2/ddd.010236269.mpeg21.a0065.folia.xml.gz)
>>>     #100021abbd9/.ddd.010236269.mpeg21.a0065.folia.xml.gz.AuxBQj
>>>     (reddata2/.ddd.010236269.mpeg21.a0065.folia.xml.gz.AuxBQj)
>>>
>>>     So this is rsync talking to two MDSes, one for a create and one for a
>>>     rename.
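>>>
>>>     (For reference, the mdsc file lives in debugfs; the fsid and client id
>>>     below are placeholders:
>>>
>>>       cat /sys/kernel/debug/ceph/<fsid>.client<id>/mdsc)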
>>>
>>>     Is this normal? Is this expected behavior?
>>>
>>>
>>> If the directory got large enough to be sharded across MDSes, yes, it's
>>> expected behavior. There are filesystems that attempt to recognize rsync
>>> and change their normal behavior specifically to deal with this case,
>>> but CephFS isn't one of them (yet, anyway).
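>>>
>>> You can see how the tree is currently split from the MDS admin socket,
>>> e.g. "ceph daemon mds.<name> get subtrees" (name is a placeholder), which
>>> shows the subtrees and dirfrags that rank is authoritative for.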
>>>
>>
>> Yes, that directory is rather large.
>>
>> I've set max_mds to 1 for now and suddenly both FUSE and the kclient are
>> a lot faster, not by 10% but by something like 80 to 100%.
>>
>> It seems like that directory was being balanced between two MDSes and that
>> caused a 'massive' slowdown.
>>
>> This can probably be influenced by tuning the MDS balancer settings, but
>> I am not sure yet where to start, any suggestions?
> 
> Well, you can disable directory fragmentation, but if it's happening
> automatically that means it's probably necessary. You can also pin the
> directory to a specific MDS, which will prevent the balancer from
> moving it or its descendants around. I'd try that; it should have the
> same impact on the client.

Yes, I understand. But when running with one MDS it goes just fine; when
running with 2 active MDSes the performance takes a serious hit.

A create() is sent to the MDS with rank 0 and a rename() for the same
file to rank 1.

That causes a massive slowdown where the rename sometimes takes up to 10
seconds.

So it seems to be caused by the balancing between the two MDSes.
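
If pinning is the way to go I'll give that a try; as far as I understand
it is set with an extended attribute on the directory, something like this
(the path is just an example):

  setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/reddata2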

Wido

> -Greg
> 
>>
>> Wido
>>
>>> Not sure about the specifics of the client memory or CPU usage; I think
>>> you'd have to profile. rsync is a pretty pessimal CephFS workload though
>>> and I think I've heard about this before...
>>> -Greg
>>>
>>>
>>>
>>>     To me it seems like the subtree partitioning might be interfering
>>>     here, but I wanted to double-check.
>>>
>>>     Apart from that, the CPU and memory usage of the FUSE client seem very
>>>     high, and that might be related to this.
>>>
>>>     Any ideas?
>>>
>>>     Thanks,
>>>
>>>     Wido
>>>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


