I am using the Giant release. The OSDs and MON/MDS are using the default
RHEL 7 kernel. The client is using the elrepo 3.19 kernel. I am also using
cephaux.

I may have found something. I did the build manually, and as such I did
_NOT_ set up these config settings:

filestore xattr use omap = false
filestore max inline xattr size = 65536
filestore_max_inline_xattr_size_xfs = 65536
filestore_max_inline_xattr_size_other = 512
filestore_max_inline_xattrs_xfs = 10

I just changed these settings to see if they make a difference.

I copied data from one directory holding files I created before I set these
values (time cp small1/* small2/.), and it takes 2 mins 30 secs to copy 1600
files. If I take the files I just copied into small2 and copy them to a
different directory (time cp small2/* small3/.), it only takes 5 mins to copy
10000 files!

Could this be part of the problem?
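For completeness, here is roughly how those values sit in my ceph.conf now. I
have them under [osd]; whether they belong there or in [global] for your
layout is my assumption, and the numbers are simply the ones listed above:

    [osd]
    filestore xattr use omap = false
    filestore max inline xattr size = 65536
    filestore_max_inline_xattr_size_xfs = 65536
    filestore_max_inline_xattr_size_other = 512
    filestore_max_inline_xattrs_xfs = 10
    # I believe the OSDs have to be restarted for filestore to pick these up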
On Thu, Apr 2, 2015 at 6:03 AM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> On Wed, Apr 1, 2015 at 12:31 AM, Barclay Jameson
> <almightybeeij@xxxxxxxxx> wrote:
>> Here is the mds output from the command you requested. I did this
>> during the small data run (time cp small1/* small2/).
>> It is 20MB in size, so I couldn't find a place online that would accept
>> that much data.
>>
>> Please find attached file.
>>
>> Thanks,
>
> In the log file, each 'create' request is followed by several
> 'getattr' requests. I guess these 'getattr' requests resulted from
> some kind of permission check, but I can't reproduce this situation
> locally.
>
> Which version of ceph/kernel are you using? Do you use ceph-fuse or
> the kernel client, and what are the mount options?
>
> Regards
> Yan, Zheng
>
>
>>
>> Beeij
>>
>>
>> On Mon, Mar 30, 2015 at 10:59 PM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
>>> On Sun, Mar 29, 2015 at 1:12 AM, Barclay Jameson
>>> <almightybeeij@xxxxxxxxx> wrote:
>>>> I redid my entire Ceph build, going back to CentOS 7 and hoping to
>>>> get the same performance I did last time.
>>>> The rados bench test was the best I have ever had, with 740 MB/s
>>>> write and 1300 MB/s read. This was even better than the first rados
>>>> bench test, which had performance equal to PanFS. I find that this
>>>> does not translate to my CephFS. Even with the following tweaking it
>>>> is still at least twice as slow as PanFS and my first *magical* build
>>>> (which had absolutely no tweaking):
>>>>
>>>> OSD
>>>> osd_op_threads 8
>>>> /sys/block/sd*/queue/nr_requests 4096
>>>> /sys/block/sd*/queue/read_ahead_kb 4096
>>>>
>>>> Client
>>>> rsize=16777216
>>>> readdir_max_bytes=16777216
>>>> readdir_max_entries=16777216
>>>>
>>>> ~160 mins to copy 100000 (1MB) files for CephFS vs ~50 mins for PanFS.
>>>> Throughput on CephFS is about 10 MB/s vs 30 MB/s for PanFS.
>>>>
>>>> The strange thing is that none of the resources are taxed.
>>>> CPU, RAM, network, and disks are not even close to being taxed on the
>>>> client, the MON/MDS, or the OSD nodes.
>>>> The PanFS client node was on a 10Gb network, the same as the CephFS
>>>> client, but you can see the huge difference in speed.
>>>>
>>>> As per Greg's questions before:
>>>> There is only one client reading and writing (time cp Small1/*
>>>> Small2/.), but three clients have CephFS mounted, although they aren't
>>>> doing anything on the filesystem.
>>>>
>>>> I have done another test where I stream data into a file as fast as
>>>> the processor can put it there
>>>> (for (i=0; i < 1000000001; i++){ fprintf (out_file, "I is : %d\n",i);} ),
>>>> and it is faster than PanFS. CephFS writes 16GB in 105 seconds with
>>>> the above tuning vs 130 seconds for PanFS. Without the tuning it takes
>>>> 230 seconds for CephFS, although the first build did it in 130 seconds
>>>> without any tuning.
>>>>
>>>> This leads me to believe the bottleneck is the MDS. Does anybody have
>>>> any thoughts on this?
>>>> Are there any tuning parameters that I would need to speed up the MDS?
>>>
>>> could you enable mds debugging for a few seconds (ceph daemon mds.x
>>> config set debug_mds 10; sleep 10; ceph daemon mds.x config set
>>> debug_mds 0) and upload /var/log/ceph/mds.x.log to somewhere?
>>>
>>> Regards
>>> Yan, Zheng
>>>
>>>>
>>>> On Fri, Mar 27, 2015 at 4:50 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>>>> On Fri, Mar 27, 2015 at 2:46 PM, Barclay Jameson
>>>>> <almightybeeij@xxxxxxxxx> wrote:
>>>>>> Yes, it's the exact same hardware except for the MDS server (although I
>>>>>> tried using the MDS on the old node).
>>>>>> I have not tried moving the MON back to the old node.
>>>>>>
>>>>>> My default cache size is "mds cache size = 10000000".
>>>>>> The OSD nodes (3 of them) have 16 disks with 4 SSD journal disks.
>>>>>> I created 2048 PGs for data and metadata:
>>>>>> ceph osd pool create cephfs_data 2048 2048
>>>>>> ceph osd pool create cephfs_metadata 2048 2048
>>>>>>
>>>>>>
>>>>>> To your point on clients competing against each other... how would I
>>>>>> check that?
>>>>>
>>>>> Do you have multiple clients mounted? Are they both accessing files in
>>>>> the directory(ies) you're testing? Were they accessing the same
>>>>> pattern of files for the old cluster?
>>>>>
>>>>> If you happen to be running a hammer rc or something pretty new you
>>>>> can use the MDS admin socket to explore a bit what client sessions
>>>>> there are and what they have permissions on and check; otherwise
>>>>> you'll have to figure it out from the client side.
>>>>> -Greg
>>>>>
>>>>>>
>>>>>> Thanks for the input!
>>>>>>
>>>>>>
>>>>>> On Fri, Mar 27, 2015 at 3:04 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>>>>>> So this is exactly the same test you ran previously, but now it's on
>>>>>>> faster hardware and the test is slower?
>>>>>>>
>>>>>>> Do you have more data in the test cluster? One obvious possibility is
>>>>>>> that previously you were working entirely in the MDS' cache, but now
>>>>>>> you've got more dentries and so it's kicking data out to RADOS and
>>>>>>> then reading it back in.
>>>>>>>
>>>>>>> If you've got the memory (you appear to) you can pump up the "mds
>>>>>>> cache size" config option quite dramatically from its default of 100000.
>>>>>>>
>>>>>>> Other things to check are that you've got an appropriately-sized
>>>>>>> metadata pool, that you've not got clients competing against each
>>>>>>> other inappropriately, etc.
>>>>>>> -Greg
>>>>>>>
>>>>>>> On Fri, Mar 27, 2015 at 9:47 AM, Barclay Jameson
>>>>>>> <almightybeeij@xxxxxxxxx> wrote:
>>>>>>>> Oops, I should have said that I am not just writing the data but
>>>>>>>> copying it:
>>>>>>>>
>>>>>>>> time cp Small1/* Small2/*
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> BJ
>>>>>>>>
>>>>>>>> On Fri, Mar 27, 2015 at 11:40 AM, Barclay Jameson
>>>>>>>> <almightybeeij@xxxxxxxxx> wrote:
>>>>>>>>> I did a Ceph cluster install 2 weeks ago where I was getting great
>>>>>>>>> performance (~= PanFS), where I could write 100,000 1MB files in 61
>>>>>>>>> mins (it took PanFS 59 mins). I thought I could increase the
>>>>>>>>> performance by adding a better MDS server, so I redid the entire
>>>>>>>>> build.
>>>>>>>>>
>>>>>>>>> Now it takes 4 times as long to write the same data as it did before.
>>>>>>>>> The only thing that changed was the MDS server. (I even tried moving
>>>>>>>>> the MDS back to the old, slower node and the performance was the same.)
>>>>>>>>>
>>>>>>>>> The first install was on CentOS 7. I tried going down to CentOS 6.6
>>>>>>>>> and the results are the same.
>>>>>>>>> I use the same scripts to install the OSDs (which I created because I
>>>>>>>>> can never get ceph-deploy to behave correctly; I did, however, use
>>>>>>>>> ceph-deploy to create the MDS and MON and for the initial cluster
>>>>>>>>> creation).
>>>>>>>>>
>>>>>>>>> I use btrfs on the OSDs, as I can get 734 MB/s write and 1100 MB/s read
>>>>>>>>> with rados bench -p cephfs_data 500 write --no-cleanup && rados bench
>>>>>>>>> -p cephfs_data 500 seq (xfs was 734 MB/s write but only 200 MB/s read).
>>>>>>>>>
>>>>>>>>> Can anybody think of a reason why I am now getting a huge regression?
>>>>>>>>>
>>>>>>>>> Hardware Setup:
>>>>>>>>> [OSDs]
>>>>>>>>> 64 GB 2133 MHz
>>>>>>>>> Dual Proc E5-2630 v3 @ 2.40GHz (16 Cores)
>>>>>>>>> 40Gb Mellanox NIC
>>>>>>>>>
>>>>>>>>> [MDS/MON new]
>>>>>>>>> 128 GB 2133 MHz
>>>>>>>>> Dual Proc E5-2650 v3 @ 2.30GHz (20 Cores)
>>>>>>>>> 40Gb Mellanox NIC
>>>>>>>>>
>>>>>>>>> [MDS/MON old]
>>>>>>>>> 32 GB 800 MHz
>>>>>>>>> Dual Proc E5472 @ 3.00GHz (8 Cores)
>>>>>>>>> 10Gb Intel NIC
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
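For reference, the diagnostic commands mentioned up-thread, gathered in one
place as a rough sketch rather than a recipe (mds.x is a stand-in for the
actual MDS daemon name, and "session ls" assumes a hammer-or-newer MDS admin
socket, per Greg's note):

    # capture ~10 seconds of verbose MDS logging (Yan's suggestion above)
    ceph daemon mds.x config set debug_mds 10
    sleep 10
    ceph daemon mds.x config set debug_mds 0
    # the resulting log is /var/log/ceph/mds.x.log

    # list client sessions via the MDS admin socket (hammer rc or newer)
    ceph daemon mds.x session ls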