Re: CephFS Slow writes with 1MB files

On Thu, Apr 2, 2015 at 11:18 PM, Barclay Jameson
<almightybeeij@xxxxxxxxx> wrote:
> I am using the Giant release. The OSDs and MON/MDS are using the
> default RHEL 7 kernel. The client is using the elrepo 3.19 kernel. I
> am also using cephaux.

I reproduced this issue using the Giant release. It's a bug in the MDS
code. Could you try the newest development version of Ceph (it
includes the fix), or apply the attached patch to the Giant release
source?
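
For reference, a rough sketch of applying the patch by hand (the
checkout location and build steps are my assumptions; adjust to your
setup):

    # check out the giant branch and apply the fix
    git clone -b giant https://github.com/ceph/ceph.git
    cd ceph
    git apply /path/to/patch

    # rebuild (giant-era ceph uses autotools)
    ./autogen.sh && ./configure && make -j8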

Regards
Yan, Zheng

>
> I may have found something.
> I did the build manually, and as such I did _NOT_ set up these config settings:
> filestore xattr use omap = false
> filestore max inline xattr size = 65536
> filestore_max_inline_xattr_size_xfs = 65536
> filestore_max_inline_xattr_size_other = 512
> filestore_max_inline_xattrs_xfs = 10
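>
> (For illustration, a sketch of how these would be set in ceph.conf --
> putting them under [osd] is my assumption:
>
>     [osd]
>     filestore xattr use omap = false
>     filestore max inline xattr size = 65536
>     filestore_max_inline_xattr_size_xfs = 65536
>     filestore_max_inline_xattr_size_other = 512
>     filestore_max_inline_xattrs_xfs = 10
> )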
>
> I just changed these settings to see if it would make a difference.
> I copied data from one directory that had files I created before I
> set these values (time cp small1/* small2/.) and it takes 2 min 30
> secs to copy 1600 files.
> If I take the files I just copied into small2 and copy them to a
> different directory (time cp small2/* small3/.), it only takes 5 mins
> to copy 10000 files!
>
> Could this be part of the problem?
>
>
> On Thu, Apr 2, 2015 at 6:03 AM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
>> On Wed, Apr 1, 2015 at 12:31 AM, Barclay Jameson
>> <almightybeeij@xxxxxxxxx> wrote:
>>> Here is the mds output from the command you requested. I did this
>>> during the small data run (time cp small1/* small2/).
>>> It is 20MB in size, so I couldn't find a place online that would
>>> accept that much data.
>>>
>>> Please find the attached file.
>>>
>>> Thanks,
>>
>> In the log file, each 'create' request is followed by several
>> 'getattr' requests. I guess these 'getattr' requests result from
>> some kind of permission check, but I can't reproduce this situation
>> locally.
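>>
>> (For anyone following along, the pattern is easy to see by grepping
>> the debug log -- the exact line format varies by version, so treat
>> this as a sketch:
>>
>>     grep 'client_request(' /var/log/ceph/mds.x.log | grep -E 'create|getattr'
>> )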
>>
>> Which version of ceph/kernel are you using? Do you use ceph-fuse or
>> the kernel client, and what are the mount options?
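>>
>> (For the kernel client the options are visible in the mount command
>> or /proc/mounts; e.g., with made-up addresses and paths:
>>
>>     mount -t ceph 10.0.0.1:6789:/ /mnt/cephfs \
>>         -o name=admin,secretfile=/etc/ceph/admin.secret
>>     grep ceph /proc/mounts
>> )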
>>
>> Regards
>> Yan, Zheng
>>
>>
>>>
>>> Beeij
>>>
>>>
>>> On Mon, Mar 30, 2015 at 10:59 PM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
>>>> On Sun, Mar 29, 2015 at 1:12 AM, Barclay Jameson
>>>> <almightybeeij@xxxxxxxxx> wrote:
>>>>> I redid my entire Ceph build, going back to CentOS 7, hoping to
>>>>> get the same performance I did last time.
>>>>> The rados bench test was the best I have ever had, at 740 MB/s
>>>>> write and 1300 MB/s read. This was even better than the first
>>>>> rados bench test that had performance equal to PanFS. I find that
>>>>> this does not translate to my CephFS. Even with the following
>>>>> tweaking it is still at least twice as slow as PanFS and my first
>>>>> *Magical* build (which had absolutely no tweaking):
>>>>>
>>>>> OSD
>>>>>  osd_op_threads 8
>>>>>  /sys/block/sd*/queue/nr_requests 4096
>>>>>  /sys/block/sd*/queue/read_ahead_kb 4096
>>>>>
>>>>> Client
>>>>>  rsize=16777216
>>>>>  readdir_max_bytes=16777216
>>>>>  readdir_max_entries=16777216
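>>>>>
>>>>> (Those were applied roughly like this -- the device name and
>>>>> monitor address below are placeholders:
>>>>>
>>>>>     # on each OSD node, for every data disk
>>>>>     echo 4096 > /sys/block/sdb/queue/nr_requests
>>>>>     echo 4096 > /sys/block/sdb/queue/read_ahead_kb
>>>>>
>>>>>     # on the client, a kernel mount with larger buffers
>>>>>     mount -t ceph mon1:6789:/ /mnt/cephfs \
>>>>>         -o rsize=16777216,readdir_max_bytes=16777216,readdir_max_entries=16777216
>>>>> )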
>>>>>
>>>>> ~160 mins to copy 100000 (1MB) files on CephFS vs ~50 mins on PanFS.
>>>>> Throughput on CephFS is about 10 MB/s vs 30 MB/s on PanFS.
>>>>>
>>>>> The strange thing is that none of the resources are taxed.
>>>>> CPU, RAM, network, and disks are not even close to being taxed on
>>>>> the client, mon/mds, or osd nodes.
>>>>> The PanFS client node was on a 10Gb network, the same as the CephFS
>>>>> client, but you can see the huge difference in speed.
>>>>>
>>>>> As per Greg's questions before:
>>>>> There is only one client reading and writing (time cp Small1/*
>>>>> Small2/.), but three clients have cephfs mounted, although they
>>>>> aren't doing anything on the filesystem.
>>>>>
>>>>> I have done another test where I stream data into a file as fast
>>>>> as the processor can put it there
>>>>> (for (i = 0; i < 1000000001; i++) { fprintf(out_file, "I is : %d\n", i); })
>>>>> and it is faster than PanFS: CephFS writes 16GB in 105 seconds with
>>>>> the above tuning vs 130 seconds for PanFS. Without the tuning it
>>>>> takes 230 seconds for CephFS, although the first build did it in
>>>>> 130 seconds without any tuning.
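>>>>>
>>>>> (A self-contained version of that streaming test, in case anyone
>>>>> wants to reproduce it -- the output path is a placeholder:
>>>>>
>>>>>     #include <stdio.h>
>>>>>
>>>>>     int main(void)
>>>>>     {
>>>>>         /* stream ~16GB of formatted text into one file */
>>>>>         FILE *out_file = fopen("/mnt/cephfs/stream_test.txt", "w");
>>>>>         if (!out_file)
>>>>>             return 1;
>>>>>         for (int i = 0; i < 1000000001; i++)
>>>>>             fprintf(out_file, "I is : %d\n", i);
>>>>>         fclose(out_file);
>>>>>         return 0;
>>>>>     }
>>>>> )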
>>>>>
>>>>> This leads me to believe the bottleneck is the mds. Does anybody have
>>>>> any thoughts on this?
>>>>> Are there any tuning parameters that I would need to speed up the mds?
>>>>
>>>> Could you enable mds debugging for a few seconds and upload
>>>> /var/log/ceph/mds.x.log somewhere?
>>>>
>>>>     ceph daemon mds.x config set debug_mds 10
>>>>     sleep 10
>>>>     ceph daemon mds.x config set debug_mds 0
>>>>
>>>> Regards
>>>> Yan, Zheng
>>>>
>>>>>
>>>>> On Fri, Mar 27, 2015 at 4:50 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>>>>> On Fri, Mar 27, 2015 at 2:46 PM, Barclay Jameson
>>>>>> <almightybeeij@xxxxxxxxx> wrote:
>>>>>>> Yes it's the exact same hardware except for the MDS server (although I
>>>>>>> tried using the MDS on the old node).
>>>>>>> I have not tried moving the MON back to the old node.
>>>>>>>
>>>>>>> My default cache size is "mds cache size = 10000000".
>>>>>>> The OSDs (3 nodes) have 16 disks each, with 4 SSD journal disks.
>>>>>>> I created 2048 PGs for data and metadata:
>>>>>>> ceph osd pool create cephfs_data 2048 2048
>>>>>>> ceph osd pool create cephfs_metadata 2048 2048
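>>>>>>>
>>>>>>> (That matches the usual rule of thumb, assuming 3 nodes x 16
>>>>>>> disks = 48 OSDs and 3x replication:
>>>>>>>     48 OSDs * 100 / 3 replicas = 1600,
>>>>>>>     rounded up to the next power of two = 2048 PGs per pool.)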
>>>>>>>
>>>>>>>
>>>>>>> To your point on clients competing against each other... how would I check that?
>>>>>>
>>>>>> Do you have multiple clients mounted? Are they both accessing files in
>>>>>> the directory(ies) you're testing? Were they accessing the same
>>>>>> pattern of files for the old cluster?
>>>>>>
>>>>>> If you happen to be running a hammer rc or something pretty new,
>>>>>> you can use the MDS admin socket to explore what client sessions
>>>>>> there are and what permissions they have; otherwise you'll have
>>>>>> to figure it out from the client side.
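>>>>>>
>>>>>> (Via the admin socket, something like the following -- "mds.a" is
>>>>>> a placeholder for your MDS id, and the exact command set depends
>>>>>> on your version:
>>>>>>
>>>>>>     ceph daemon mds.a session ls
>>>>>> )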
>>>>>> -Greg
>>>>>>
>>>>>>>
>>>>>>> Thanks for the input!
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Mar 27, 2015 at 3:04 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>>>>>>> So this is exactly the same test you ran previously, but now it's on
>>>>>>>> faster hardware and the test is slower?
>>>>>>>>
>>>>>>>> Do you have more data in the test cluster? One obvious
>>>>>>>> possibility is that previously you were working entirely in the
>>>>>>>> MDS's cache, but now you've got more dentries and so it's
>>>>>>>> kicking data out to RADOS and then reading it back in.
>>>>>>>>
>>>>>>>> If you've got the memory (you appear to), you can pump up the
>>>>>>>> "mds cache size" config option quite dramatically from its
>>>>>>>> default of 100000.
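>>>>>>>>
>>>>>>>> (For example, in ceph.conf -- the value here is only
>>>>>>>> illustrative:
>>>>>>>>
>>>>>>>>     [mds]
>>>>>>>>     mds cache size = 1000000
>>>>>>>> )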
>>>>>>>>
>>>>>>>> Other things to check are that you've got an appropriately-sized
>>>>>>>> metadata pool, that you've not got clients competing against each
>>>>>>>> other inappropriately, etc.
>>>>>>>> -Greg
>>>>>>>>
>>>>>>>> On Fri, Mar 27, 2015 at 9:47 AM, Barclay Jameson
>>>>>>>> <almightybeeij@xxxxxxxxx> wrote:
>>>>>>>>> Oops, I should have said that I am not just writing the data but copying it:
>>>>>>>>>
>>>>>>>>> time cp Small1/* Small2/.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> BJ
>>>>>>>>>
>>>>>>>>> On Fri, Mar 27, 2015 at 11:40 AM, Barclay Jameson
>>>>>>>>> <almightybeeij@xxxxxxxxx> wrote:
>>>>>>>>>> I did a Ceph cluster install 2 weeks ago and was getting
>>>>>>>>>> great performance (~= PanFS): I could write 100,000 1MB files
>>>>>>>>>> in 61 mins (it took PanFS 59 mins). I thought I could increase
>>>>>>>>>> the performance by adding a better MDS server, so I redid the
>>>>>>>>>> entire build.
>>>>>>>>>>
>>>>>>>>>> Now it takes 4 times as long to write the same data as it did
>>>>>>>>>> before. The only thing that changed was the MDS server. (I
>>>>>>>>>> even tried moving the MDS back to the old slower node, and the
>>>>>>>>>> performance was the same.)
>>>>>>>>>>
>>>>>>>>>> The first install was on CentOS 7. I tried going down to
>>>>>>>>>> CentOS 6.6 and got the same results.
>>>>>>>>>> I use the same scripts to install the OSDs (which I created
>>>>>>>>>> because I can never get ceph-deploy to behave correctly,
>>>>>>>>>> although I did use ceph-deploy to create the MDS and MON and
>>>>>>>>>> for the initial cluster creation).
>>>>>>>>>>
>>>>>>>>>> I use btrfs on the OSDs, as I can get 734 MB/s write and 1100
>>>>>>>>>> MB/s read with rados bench -p cephfs_data 500 write
>>>>>>>>>> --no-cleanup && rados bench -p cephfs_data 500 seq (xfs was
>>>>>>>>>> 734 MB/s write but only 200 MB/s read).
>>>>>>>>>>
>>>>>>>>>> Can anybody think of a reason why I am now seeing such a huge
>>>>>>>>>> regression?
>>>>>>>>>>
>>>>>>>>>> Hardware Setup:
>>>>>>>>>> [OSDs]
>>>>>>>>>> 64 GB 2133 MHz
>>>>>>>>>> Dual Proc E5-2630 v3 @ 2.40GHz (16 Cores)
>>>>>>>>>> 40Gb Mellanox NIC
>>>>>>>>>>
>>>>>>>>>> [MDS/MON new]
>>>>>>>>>> 128 GB 2133 MHz
>>>>>>>>>> Dual Proc E5-2650 v3 @ 2.30GHz (20 Cores)
>>>>>>>>>> 40Gb Mellanox NIC
>>>>>>>>>>
>>>>>>>>>> [MDS/MON old]
>>>>>>>>>> 32 GB 800 MHz
>>>>>>>>>> Dual Proc E5472 @ 3.00GHz (8 Cores)
>>>>>>>>>> 10Gb Intel NIC

Attachment: patch
Description: Binary data

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
