On 04/10/2018 09:45 PM, Gregory Farnum wrote:
> On Tue, Apr 10, 2018 at 12:36 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
>>
>> On 04/10/2018 09:22 PM, Gregory Farnum wrote:
>>> On Tue, Apr 10, 2018 at 6:32 AM Wido den Hollander <wido@xxxxxxxx
>>> <mailto:wido@xxxxxxxx>> wrote:
>>>
>>>     Hi,
>>>
>>>     There have been numerous threads about this in the past, but I
>>>     wanted to bring this up again in a new situation.
>>>
>>>     Running with Luminous v12.2.4 I'm seeing some odd memory and CPU
>>>     usage when using the ceph-fuse client to mount a multi-MDS CephFS
>>>     filesystem.
>>>
>>>       health: HEALTH_OK
>>>
>>>       services:
>>>         mon: 3 daemons, quorum luvil,sanomat,tide
>>>         mgr: luvil(active), standbys: tide, sanomat
>>>         mds: svw-2/2/2 up {0=luvil=up:active,1=tide=up:active}, 1 up:standby
>>>         osd: 112 osds: 111 up, 111 in
>>>
>>>       data:
>>>         pools:   2 pools, 4352 pgs
>>>         objects: 85549k objects, 4415 GB
>>>         usage:   50348 GB used, 772 TB / 821 TB avail
>>>         pgs:     4352 active+clean
>>>
>>>     After running an rsync with millions of files (and some directories
>>>     having 1M files) a ceph-fuse process was using 44GB RSS and between
>>>     100% and 200% CPU.
>>>
>>>     Looking at this FUSE client through the admin socket, the objecter
>>>     was one of my first suspects, but it claimed to only hold ~300M of
>>>     data in its cache, spread out over tens of thousands of files.
>>>
>>>     After unmounting and mounting again the memory usage was gone and
>>>     we tried the rsync again, but it wasn't reproducible.
>>>
>>>     The CPU usage however is reproducible: a "simple" rsync will cause
>>>     ceph-fuse to use up to 100% CPU.
>>>
>>>     Switching to the kernel client (4.16 kernel) seems to solve this,
>>>     but the reasons for using ceph-fuse here are the lack of a recent
>>>     kernel in Debian 9 and the ease of upgrading the FUSE client.
>>>
>>>     I've tried to disable all logging inside the FUSE client, but that
>>>     didn't help.
>>>
>>>     When checking the FUSE client's admin socket I saw that rename()
>>>     operations were hanging, and that's something rsync does a lot.
>>>
>>>     At the same time I saw a getfattr() being done on the same inode by
>>>     the FUSE client, but to a different MDS:
>>>
>>>     rename(): mds rank 0
>>>     getfattr:  mds rank 1
>>>
>>>     Although the kernel client seems to perform better, it shows the
>>>     same behavior when looking at the mdsc file in /sys:
>>>
>>>     216729  mds0  create (unsafe)
>>>         #100021abbd9/.ddd.010236269.mpeg21.a0065.folia.xml.gz.AuxBQj
>>>         (reddata2/.ddd.010236269.mpeg21.a0065.folia.xml.gz.AuxBQj)
>>>
>>>     216731  mds1  rename
>>>         #100021abbd9/ddd.010236269.mpeg21.a0065.folia.xml.gz
>>>         (reddata2/ddd.010236269.mpeg21.a0065.folia.xml.gz)
>>>         #100021abbd9/.ddd.010236269.mpeg21.a0065.folia.xml.gz.AuxBQj
>>>         (reddata2/.ddd.010236269.mpeg21.a0065.folia.xml.gz.AuxBQj)
>>>
>>>     So this is rsync talking to two MDSes: one for the create and one
>>>     for the rename.
>>>
>>>     Is this normal? Is this expected behavior?
>>>
>>> If the directory got large enough to be sharded across MDSes, yes, it's
>>> expected behavior. There are filesystems that attempt to recognize
>>> rsync and change their normal behavior specifically to deal with this
>>> case, but CephFS isn't one of them (yet, anyway).
>>
>> Yes, that directory is rather large.
>>
>> I've set max_mds to 1 for now and suddenly both FUSE and the kclient are
>> a lot faster; not 10%, but something like 80 to 100% faster.
>>
>> It seems like that directory was being balanced between two MDSes and
>> that caused a 'massive' slowdown.
>>
>> This can probably be influenced by tuning the MDS balancer settings, but
>> I am not sure yet where to start. Any suggestions?
>
> Well, you can disable directory fragmentation, but if it's happening
> automatically that means it's probably necessary. You can also pin the
> directory to a specific MDS, which will prevent the balancer from moving
> it or its descendants around. I'd try that; it should have the same
> impact on the client.

Yes, I understand. But when running with one MDS it goes just fine; when
running with 2 active MDS the performance takes a serious hit.

A create() is sent to the MDS with rank 0 and a rename() for the same file
to rank 1. That causes a massive slowdown where the rename sometimes takes
up to 10 seconds.

So it seems to be something like balancing between the two MDSes.
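
For reference, this is how I'm watching the in-flight requests on both
clients (the admin socket path is only an example; the exact file name
depends on the client name and PID):

    # ceph-fuse: dump the in-flight MDS requests via the client admin socket
    ceph daemon /var/run/ceph/ceph-client.admin.asok mds_requests

    # kernel client: the same information lives in debugfs (debugfs mounted)
    cat /sys/kernel/debug/ceph/*/mdsc

That's where the hanging rename() calls show up.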
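
I'll give the pinning a try as well. If I read it correctly it would be
something along these lines (mountpoint and directory are only examples for
the reddata2 tree here):

    # pin the subtree to MDS rank 0 so the balancer leaves it alone
    setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/reddata2

    # setting the pin to -1 afterwards hands the tree back to the balancer
    setfattr -n ceph.dir.pin -v -1 /mnt/cephfs/reddata2

That way the create() and the rename() should at least end up at the same
rank for everything under that directory.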

Wido

> -Greg
>
>>
>> Wido
>>
>>> Not sure about the specifics of the client memory or CPU usage; I think
>>> you'd have to profile. rsync is a pretty pessimal CephFS workload
>>> though and I think I've heard about this before...
>>> -Greg
>>>
>>>
>>>     To me it seems that the subtree partitioning might possibly be
>>>     interfering here, but I wanted to double check.
>>>
>>>     Apart from that, the CPU and memory usage of the FUSE client seems
>>>     very high and that might be related to this.
>>>
>>>     Any ideas?
>>>
>>>     Thanks,
>>>
>>>     Wido
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com