Re: CephFS Slow writes with 1MB files

I will take a look into the perf counters.
Thanks for the pointers!
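
If I'm reading the docs right, something like this against the MDS admin
socket should show the counters and any slow in-flight ops (the daemon
name and socket path here are placeholders for whatever my MDS host uses):

  ceph daemon mds.<name> perf dump
  ceph daemon mds.<name> dump_ops_in_flight

or, going through the socket file directly:

  ceph --admin-daemon /var/run/ceph/ceph-mds.<name>.asok perf dump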

On Mon, Mar 30, 2015 at 1:30 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
> On Sat, Mar 28, 2015 at 10:12 AM, Barclay Jameson
> <almightybeeij@xxxxxxxxx> wrote:
>> I redid my entire Ceph build, going back to CentOS 7, hoping to get
>> the same performance I did last time.
>> The rados bench test was the best I have ever had, at 740 MB/s write
>> and 1300 MB/s read. This was even better than the first rados bench
>> test that had performance equal to PanFS. I find that this does not
>> translate to my CephFS. Even with the following tweaking it is still
>> at least twice as slow as PanFS and my first *Magical* build (which
>> had absolutely no tweaking):
>>
>> OSD
>>  osd_op_threads 8
>>  /sys/block/sd*/queue/nr_requests 4096
>>  /sys/block/sd*/queue/read_ahead_kb 4096
>>
>> Client
>>  rsize=16777216
>>  readdir_max_bytes=16777216
>>  readdir_max_entries=16777216
>>
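>> Those client values are kernel-client mount options; on my client the
>> mount ends up looking roughly like this (the monitor address and mount
>> point are just placeholders):
>>
>>  mount -t ceph <mon-host>:6789:/ /mnt/cephfs -o name=admin,rsize=16777216,readdir_max_bytes=16777216,readdir_max_entries=16777216
>>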
>> ~160 mins to copy 100000 (1MB) files for CephFS vs ~50 mins for PanFS.
>> Throughput on CephFS is about 10 MB/s vs 30 MB/s for PanFS.
>>
>> Strange thing is none of the resources are taxed.
>> CPU, RAM, network, and disks are not even close to being taxed on the
>> client, the mon/mds, or the OSD nodes.
>> The PanFS client node was on a 10Gb network, the same as the CephFS
>> client, but you can see the huge difference in speed.
>>
>> As per Greg's questions before:
>> There is only one client reading and writing (time cp Small1/*
>> Small2/.) but three clients have cephfs mounted, although they aren't
>> doing anything on the filesystem.
>>
>> I have done another test where I stream data into a file as fast as
>> the processor can put it there:
>> (for (i = 0; i < 1000000001; i++) { fprintf(out_file, "I is : %d\n", i); })
>> and it is faster than PanFS: CephFS writes 16GB in 105 seconds with the
>> above tuning vs 130 seconds for PanFS. Without the tuning it takes 230
>> seconds for CephFS, although the first build did it in 130 seconds
>> without any tuning.
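>>
>> (A roughly equivalent way to reproduce that streaming test without the
>> C snippet, with the mount point as a placeholder, would be:
>>  time dd if=/dev/zero of=/mnt/cephfs/streamtest bs=1M count=16384 conv=fsync
>> 16GB in the 105 seconds above works out to roughly 155 MB/s.)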
>>
>> This leads me to believe the bottleneck is the mds. Does anybody have
>> any thoughts on this?
>> Are there any tuning parameters that I would need to speed up the mds?
>
> This is pretty likely, but 10 creates/second is just impossibly slow.
> The only other thing I can think of is that you might have had directory
> fragmentation enabled before but don't have it enabled now, which might
> make an impact on a directory with 100k entries.
>
> Or else your hardware is just totally wonky, which we've seen in the
> past, but your server doesn't look quite large enough to be hitting
> any of the nasty NUMA stuff... but that's something else to look at,
> which I can't help you with, although maybe somebody else can.
>
> If you're interested in diving into it and depending on the Ceph
> version you're running you can also examine the mds perfcounters
> (http://ceph.com/docs/master/dev/perf_counters/) and the op history
> (dump_ops_in_flight etc) and look for any operations which are
> noticeably slow.
> -Greg
>
>>
>> On Fri, Mar 27, 2015 at 4:50 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>> On Fri, Mar 27, 2015 at 2:46 PM, Barclay Jameson
>>> <almightybeeij@xxxxxxxxx> wrote:
>>>> Yes it's the exact same hardware except for the MDS server (although I
>>>> tried using the MDS on the old node).
>>>> I have not tried moving the MON back to the old node.
>>>>
>>>> My default cache size is "mds cache size = 10000000"
>>>> The OSDs (3 of them) have 16 disks with 4 SSD journal disks.
>>>> I created 2048 PGs each for the data and metadata pools:
>>>> ceph osd pool create cephfs_data 2048 2048
>>>> ceph osd pool create cephfs_metadata 2048 2048
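>>>>
>>>> To double-check how those pools actually ended up I can run:
>>>>  ceph df
>>>>  ceph osd dump | grep pool
>>>> which should show per-pool usage and the pg_num actually in effect.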
>>>>
>>>>
>>>> To your point on clients competing against each other... how would I check that?
>>>
>>> Do you have multiple clients mounted? Are they both accessing files in
>>> the directory(ies) you're testing? Were they accessing the same
>>> pattern of files for the old cluster?
>>>
>>> If you happen to be running a hammer rc or something pretty new you
>>> can use the MDS admin socket to explore a bit what client sessions
>>> there are and what they have permissions on, and check that; otherwise
>>> you'll have to figure it out from the client side.
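>>> (On hammer, something like "ceph daemon mds.<name> session ls" on the
>>> MDS host should list the client sessions, though the exact command may
>>> vary with your build.)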
>>> -Greg
>>>
>>>>
>>>> Thanks for the input!
>>>>
>>>>
>>>> On Fri, Mar 27, 2015 at 3:04 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>>>> So this is exactly the same test you ran previously, but now it's on
>>>>> faster hardware and the test is slower?
>>>>>
>>>>> Do you have more data in the test cluster? One obvious possibility is
>>>>> that previously you were working entirely in the MDS' cache, but now
>>>>> you've got more dentries and so it's kicking data out to RADOS and
>>>>> then reading it back in.
>>>>>
>>>>> If you've got the memory (you appear to) you can pump up the "mds
>>>>> cache size" config option quite dramatically from its default of 100000.
>>>>>
>>>>> Other things to check are that you've got an appropriately-sized
>>>>> metadata pool, that you've not got clients competing against each
>>>>> other inappropriately, etc.
>>>>> -Greg
>>>>>
>>>>> On Fri, Mar 27, 2015 at 9:47 AM, Barclay Jameson
>>>>> <almightybeeij@xxxxxxxxx> wrote:
>>>>>> Oops, I should have said that I am not just writing the data but copying it:
>>>>>>
>>>>>> time cp Small1/* Small2/*
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> BJ
>>>>>>
>>>>>> On Fri, Mar 27, 2015 at 11:40 AM, Barclay Jameson
>>>>>> <almightybeeij@xxxxxxxxx> wrote:
>>>>>>> I did a Ceph cluster install 2 weeks ago and was getting great
>>>>>>> performance (~= PanFS): I could write 100,000 1MB files in 61
>>>>>>> mins (PanFS took 59 mins). I thought I could increase the performance
>>>>>>> by adding a better MDS server, so I redid the entire build.
>>>>>>>
>>>>>>> Now it takes 4 times as long to write the same data as it did before.
>>>>>>> The only thing that changed was the MDS server. (I even tried moving
>>>>>>> the MDS back onto the old, slower node and the performance was the same.)
>>>>>>>
>>>>>>> The first install was on CentOS 7. I tried going down to CentOS 6.6
>>>>>>> and I get the same results.
>>>>>>> I use the same scripts to install the OSDs (which I created because I
>>>>>>> can never get ceph-deploy to behave correctly), although I did use
>>>>>>> ceph-deploy to create the MDS and MON and for the initial cluster creation.
>>>>>>>
>>>>>>> I use btrfs on the OSDs as I can get 734 MB/s write and 1100 MB/s read
>>>>>>> with rados bench -p cephfs_data 500 write --no-cleanup && rados bench
>>>>>>> -p cephfs_data 500 seq (xfs was 734 MB/s write but only 200 MB/s read)
>>>>>>>
>>>>>>> Can anybody think of a reason why I am now getting such a huge regression?
>>>>>>>
>>>>>>> Hardware Setup:
>>>>>>> [OSDs]
>>>>>>> 64 GB 2133 MHz
>>>>>>> Dual Proc E5-2630 v3 @ 2.40GHz (16 Cores)
>>>>>>> 40Gb Mellanox NIC
>>>>>>>
>>>>>>> [MDS/MON new]
>>>>>>> 128 GB 2133 MHz
>>>>>>> Dual Proc E5-2650 v3 @ 2.30GHz (20 Cores)
>>>>>>> 40Gb Mellanox NIC
>>>>>>>
>>>>>>> [MDS/MON old]
>>>>>>> 32 GB 800 MHz
>>>>>>> Dual Proc E5472  @ 3.00GHz (8 Cores)
>>>>>>> 10Gb Intel NIC



