Re: CephFS Slow writes with 1MB files

On Wed, Apr 1, 2015 at 12:31 AM, Barclay Jameson
<almightybeeij@xxxxxxxxx> wrote:
> Here is the MDS output from the command you requested. I did this
> during the small data run (time cp small1/* small2/).
> It is 20MB in size, so I couldn't find a place online that would accept
> that much data.
>
> Please find attached file.
>
> Thanks,

In the log file, each 'create' request is followed by several
'getattr' requests. I guess these 'getattr' requests resulted from
some kind of permission check, but I can't reproduce this situation
locally.

Which version of Ceph and which kernel are you using? Do you use ceph-fuse
or the kernel client, and what are the mount options?
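
For example, output along these lines would tell me what I need (paths
assume a default install):

  ceph --version                # ceph release on the client
  uname -r                      # kernel version, relevant for the kernel client
  grep ceph /proc/mounts        # kernel-client mounts and their options
  ps aux | grep [c]eph-fuse     # whether ceph-fuse is running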

Regards
Yan, Zheng


>
> Beeij
>
>
> On Mon, Mar 30, 2015 at 10:59 PM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
>> On Sun, Mar 29, 2015 at 1:12 AM, Barclay Jameson
>> <almightybeeij@xxxxxxxxx> wrote:
>>> I redid my entire Ceph build, going back to CentOS 7, hoping to get
>>> the same performance I did last time.
>>> The rados bench test was the best I have ever had, with 740 MB/s
>>> write and 1300 MB/s read. This was even better than the first rados bench
>>> test that had performance equal to PanFS. I find that this does not
>>> translate to my CephFS. Even with the following tweaking it is still at
>>> least twice as slow as PanFS and my first *Magical* build (which had
>>> absolutely no tweaking):
>>>
>>> OSD
>>>  osd_op_threads 8
>>>  /sys/block/sd*/queue/nr_requests 4096
>>>  /sys/block/sd*/queue/read_ahead_kb 4096
>>>
>>> Client
>>>  rsize=16777216
>>>  readdir_max_bytes=16777216
>>>  readdir_max_entries=16777216
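>>>
>>> (For reference, these settings map to something like the following; the
>>> device name and monitor address are placeholders.)
>>>
>>>  # ceph.conf, [osd] section
>>>  osd op threads = 8
>>>
>>>  # per-disk queue settings, repeated for each data disk
>>>  echo 4096 > /sys/block/sdX/queue/nr_requests
>>>  echo 4096 > /sys/block/sdX/queue/read_ahead_kb
>>>
>>>  # kernel-client mount with the larger read/readdir sizes
>>>  mount -t ceph 10.0.0.1:6789:/ /mnt/cephfs \
>>>      -o rsize=16777216,readdir_max_bytes=16777216,readdir_max_entries=16777216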
>>>
>>> ~160 mins to copy 100000 (1MB) files for CephFS vs ~50 mins for PanFS.
>>> Throughput on CephFS is about 10MB/s vs PanFS 30 MB/s.
>>>
>>> The strange thing is that none of the resources are taxed.
>>> CPU, RAM, network, and disks are not even close to being taxed on the
>>> client, the mon/mds, or the osd nodes.
>>> The PanFS client node was on a 10Gb network, the same as the CephFS
>>> client, but you can see the huge difference in speed.
>>>
>>> As per Greg's questions before:
>>> There is only one client reading and writing (time cp Small1/*
>>> Small2/.) but three clients have cephfs mounted, although they aren't
>>> doing anything on the filesystem.
>>>
>>> I have done another test where I stream data into a file as fast as
>>> the processor can put it there
>>> (for (int i = 0; i < 1000000001; i++) { fprintf(out_file, "I is : %d\n", i); })
>>> and it is faster than PanFS: CephFS writes 16GB in 105 seconds with the
>>> above tuning vs 130 seconds for PanFS. Without the tuning it takes 230
>>> seconds for CephFS, although the first build did it in 130 seconds
>>> without any tuning.
>>>
>>> This leads me to believe the bottleneck is the mds. Does anybody have
>>> any thoughts on this?
>>> Are there any tuning parameters that I would need to speed up the mds?
>>
>> Could you enable MDS debugging for a few seconds (ceph daemon mds.x
>> config set debug_mds 10; sleep 10; ceph daemon mds.x config set
>> debug_mds 0) and upload /var/log/ceph/mds.x.log somewhere?
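>>
>> Something like this should do it (mds.x being whatever your MDS is
>> named); gzip the log before uploading, since it will be large:
>>
>>  ceph daemon mds.x config set debug_mds 10
>>  sleep 10
>>  ceph daemon mds.x config set debug_mds 0
>>  gzip -c /var/log/ceph/mds.x.log > /tmp/mds.x.log.gz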
>>
>> Regards
>> Yan, Zheng
>>
>>>
>>> On Fri, Mar 27, 2015 at 4:50 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>>> On Fri, Mar 27, 2015 at 2:46 PM, Barclay Jameson
>>>> <almightybeeij@xxxxxxxxx> wrote:
>>>>> Yes it's the exact same hardware except for the MDS server (although I
>>>>> tried using the MDS on the old node).
>>>>> I have not tried moving the MON back to the old node.
>>>>>
>>>>> My default cache size is "mds cache size = 10000000"
>>>>> The OSDs (3 of them) have 16 Disks with 4 SSD Journal Disks.
>>>>> I created 2048 PGs each for data and metadata:
>>>>> ceph osd pool create cephfs_data 2048 2048
>>>>> ceph osd pool create cephfs_metadata 2048 2048
>>>>>
>>>>>
>>>>> To your point on clients competing against each other... how would I check that?
>>>>
>>>> Do you have multiple clients mounted? Are they both accessing files in
>>>> the directory(ies) you're testing? Were they accessing the same
>>>> pattern of files for the old cluster?
>>>>
>>>> If you happen to be running a hammer rc or something pretty new, you
>>>> can use the MDS admin socket to explore what client sessions
>>>> there are and what they have permissions on; otherwise
>>>> you'll have to figure it out from the client side.
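>>>>
>>>> For example, on a hammer-era MDS something along these lines should
>>>> work via the admin socket (the daemon name is a placeholder):
>>>>
>>>>  ceph daemon mds.a session ls    # list client sessions and their state
>>>>  ceph daemon mds.a perf dump     # MDS perf counters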
>>>> -Greg
>>>>
>>>>>
>>>>> Thanks for the input!
>>>>>
>>>>>
>>>>> On Fri, Mar 27, 2015 at 3:04 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>>>>> So this is exactly the same test you ran previously, but now it's on
>>>>>> faster hardware and the test is slower?
>>>>>>
>>>>>> Do you have more data in the test cluster? One obvious possibility is
>>>>>> that previously you were working entirely in the MDS' cache, but now
>>>>>> you've got more dentries and so it's kicking data out to RADOS and
>>>>>> then reading it back in.
>>>>>>
>>>>>> If you've got the memory (you appear to) you can pump up the "mds
>>>>>> cache size" config option quite dramatically from its default of 100000.
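>>>>>>
>>>>>> For example (the value here is just an illustration), in ceph.conf on
>>>>>> the MDS node:
>>>>>>
>>>>>>  [mds]
>>>>>>      mds cache size = 5000000
>>>>>>
>>>>>> or at runtime via the admin socket:
>>>>>>
>>>>>>  ceph daemon mds.x config set mds_cache_size 5000000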
>>>>>>
>>>>>> Other things to check are that you've got an appropriately-sized
>>>>>> metadata pool, that you've not got clients competing against each
>>>>>> other inappropriately, etc.
>>>>>> -Greg
>>>>>>
>>>>>> On Fri, Mar 27, 2015 at 9:47 AM, Barclay Jameson
>>>>>> <almightybeeij@xxxxxxxxx> wrote:
>>>>>>> Oops, I should have said that I am not just writing the data but copying it:
>>>>>>>
>>>>>>> time cp Small1/* Small2/
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> BJ
>>>>>>>
>>>>>>> On Fri, Mar 27, 2015 at 11:40 AM, Barclay Jameson
>>>>>>> <almightybeeij@xxxxxxxxx> wrote:
>>>>>>>> I did a Ceph cluster install 2 weeks ago and was getting great
>>>>>>>> performance (roughly equal to PanFS): I could write 100,000 1MB files
>>>>>>>> in 61 mins (PanFS took 59 mins). I thought I could increase the
>>>>>>>> performance by adding a better MDS server, so I redid the entire build.
>>>>>>>>
>>>>>>>> Now it takes 4 times as long to write the same data as it did before.
>>>>>>>> The only thing that changed was the MDS server. (I even tried moving
>>>>>>>> the MDS back to the old slower node and the performance was the same.)
>>>>>>>>
>>>>>>>> The first install was on CentOS 7. I tried going down to CentOS 6.6
>>>>>>>> and got the same results.
>>>>>>>> I use the same scripts to install the OSDs (which I wrote because I
>>>>>>>> can never get ceph-deploy to behave correctly), although I did use
>>>>>>>> ceph-deploy to create the MDS and MON and for the initial cluster creation.
>>>>>>>>
>>>>>>>> I use btrfs on the OSDs as I can get 734 MB/s write and 1100 MB/s read
>>>>>>>> with "rados bench -p cephfs_data 500 write --no-cleanup && rados bench
>>>>>>>> -p cephfs_data 500 seq" (xfs was 734 MB/s write but only 200 MB/s read).
>>>>>>>>
>>>>>>>> Can anybody think of a reason why I am now getting a huge regression?
>>>>>>>>
>>>>>>>> Hardware Setup:
>>>>>>>> [OSDs]
>>>>>>>> 64 GB 2133 MHz
>>>>>>>> Dual Proc E5-2630 v3 @ 2.40GHz (16 Cores)
>>>>>>>> 40Gb Mellanox NIC
>>>>>>>>
>>>>>>>> [MDS/MON new]
>>>>>>>> 128 GB 2133 MHz
>>>>>>>> Dual Proc E5-2650 v3 @ 2.30GHz (20 Cores)
>>>>>>>> 40Gb Mellanox NIC
>>>>>>>>
>>>>>>>> [MDS/MON old]
>>>>>>>> 32 GB 800 MHz
>>>>>>>> Dual Proc E5472  @ 3.00GHz (8 Cores)
>>>>>>>> 10Gb Intel NIC
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



