Re: Extremely slow small files rewrite performance

"Yan, Zheng" <ukernel@xxxxxxxxx> · Fri, 24 Oct 2014 09:35:33 -0700

On Fri, Oct 24, 2014 at 8:47 AM, Sergey Nazarov <natarajaya@xxxxxxxxx> wrote:
> Any update?
>

The short answer is that when the command is executed for second time,
the MDS needs to truncate the file zero length. The speed of truncate
a file is limited by the OSD speed. (creating file and write data to
the file are async operations, but truncating a file is sync
operation)

Regards
Yan, Zheng

> On Tue, Oct 21, 2014 at 3:32 PM, Sergey Nazarov <natarajaya@xxxxxxxxx> wrote:
>> Ouch, I think client log is missing.
>> Here it goes:
>> https://www.dropbox.com/s/650mjim2ldusr66/ceph-client.admin.log.gz?dl=0
>>
>> On Tue, Oct 21, 2014 at 3:22 PM, Sergey Nazarov <natarajaya@xxxxxxxxx> wrote:
>>> I enabled logging and performed same tests.
>>> Here is the link on archive with logs, they are only from one node
>>> (from the node where active MDS was sitting):
>>> https://www.dropbox.com/s/80axovtoofesx5e/logs.tar.gz?dl=0
>>>
>>> Rados bench results:
>>>
>>> # rados bench -p test 10 write
>>>  Maintaining 16 concurrent writes of 4194304 bytes for up to 10
>>> seconds or 0 objects
>>>  Object prefix: benchmark_data_atl-fs11_4630
>>>    sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>>>      0       0         0         0         0         0         -         0
>>>      1      16        46        30   119.967       120  0.201327  0.348463
>>>      2      16        88        72   143.969       168  0.132983  0.353677
>>>      3      16       124       108   143.972       144  0.930837  0.383018
>>>      4      16       155       139   138.976       124  0.899468  0.426396
>>>      5      16       203       187   149.575       192  0.236534  0.400806
>>>      6      16       243       227   151.309       160  0.835213  0.397673
>>>      7      16       276       260   148.549       132  0.905989  0.406849
>>>      8      16       306       290   144.978       120  0.353279  0.422106
>>>      9      16       335       319   141.757       116   1.12114  0.428268
>>>     10      16       376       360    143.98       164  0.418921   0.43351
>>>     11      16       377       361   131.254         4  0.499769  0.433693
>>>  Total time run:         11.206306
>>> Total writes made:      377
>>> Write size:             4194304
>>> Bandwidth (MB/sec):     134.567
>>>
>>> Stddev Bandwidth:       60.0232
>>> Max bandwidth (MB/sec): 192
>>> Min bandwidth (MB/sec): 0
>>> Average Latency:        0.474923
>>> Stddev Latency:         0.376038
>>> Max latency:            1.82171
>>> Min latency:            0.060877
>>>
>>>
>>> # rados bench -p test 10 seq
>>>    sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>>>      0       0         0         0         0         0         -         0
>>>      1      16        61        45   179.957       180  0.010405   0.25243
>>>      2      16       109        93   185.962       192  0.908263  0.284303
>>>      3      16       151       135   179.965       168  0.255312  0.297283
>>>      4      16       191       175    174.97       160  0.836727  0.330659
>>>      5      16       236       220   175.971       180  0.009995  0.330832
>>>      6      16       275       259   172.639       156   1.06855  0.345418
>>>      7      16       311       295   168.545       144  0.907648  0.361689
>>>      8      16       351       335   167.474       160  0.947688  0.363552
>>>      9      16       390       374   166.196       156  0.140539  0.369057
>>>  Total time run:        9.755367
>>> Total reads made:     401
>>> Read size:            4194304
>>> Bandwidth (MB/sec):    164.422
>>>
>>> Average Latency:       0.387705
>>> Max latency:           1.33852
>>> Min latency:           0.008064
>>>
>>> # rados bench -p test 10 rand
>>>    sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>>>      0       0         0         0         0         0         -         0
>>>      1      16        55        39   155.938       156  0.773716  0.257267
>>>      2      16        93        77   153.957       152  0.006573  0.339199
>>>      3      16       135       119   158.629       168  0.009851  0.359675
>>>      4      16       171       155   154.967       144  0.892027  0.359015
>>>      5      16       209       193   154.369       152   1.13945  0.378618
>>>      6      16       256       240    159.97       188  0.009965  0.368439
>>>      7      16       295       279     159.4       156  0.195812  0.371259
>>>      8      16       343       327   163.472       192  0.880587  0.370759
>>>      9      16       380       364    161.75       148  0.113111  0.377983
>>>     10      16       424       408   163.173       176  0.772274  0.379497
>>>  Total time run:        10.518482
>>> Total reads made:     425
>>> Read size:            4194304
>>> Bandwidth (MB/sec):    161.620
>>>
>>> Average Latency:       0.393978
>>> Max latency:           1.36572
>>> Min latency:           0.006448
>>>
>>> On Tue, Oct 21, 2014 at 2:03 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>>> Can you enable debugging on the client ("debug ms = 1", "debug client
>>>> = 20") and mds ("debug ms = 1", "debug mds = 20"), run this test
>>>> again, and post them somewhere for me to look at?
>>>>
>>>> While you're at it, can you try rados bench and see what sort of
>>>> results you get?
>>>> -Greg
>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>>
>>>>
>>>> On Tue, Oct 21, 2014 at 10:57 AM, Sergey Nazarov <natarajaya@xxxxxxxxx> wrote:
>>>>> It is CephFS mounted via ceph-fuse.
>>>>> I am getting the same results not depending on how many other clients
>>>>> are having this fs mounted and their activity.
>>>>> Cluster is working on Debian Wheezy, kernel 3.2.0-4-amd64.
>>>>>
>>>>> On Tue, Oct 21, 2014 at 1:44 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>>>>> Are these tests conducted using a local fs on RBD, or using CephFS?
>>>>>> If CephFS, do you have multiple clients mounting the FS, and what are
>>>>>> they doing? What client (kernel or ceph-fuse)?
>>>>>> -Greg
>>>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 21, 2014 at 9:05 AM, Sergey Nazarov <natarajaya@xxxxxxxxx> wrote:
>>>>>>> Hi
>>>>>>>
>>>>>>> I just built a new cluster using this quickstart instructions:
>>>>>>> http://ceph.com/docs/master/start/
>>>>>>>
>>>>>>> And here is what I am seeing:
>>>>>>>
>>>>>>> # time for i in {1..10}; do echo $i > $i.txt ; done
>>>>>>> real 0m0.081s
>>>>>>> user 0m0.000s
>>>>>>> sys 0m0.004s
>>>>>>>
>>>>>>> And if I try to repeat the same command (when files already created):
>>>>>>>
>>>>>>> # time for i in {1..10}; do echo $i > $i.txt ; done
>>>>>>> real 0m48.894s
>>>>>>> user 0m0.000s
>>>>>>> sys 0m0.004s
>>>>>>>
>>>>>>> I was very surprised and then just tried to rewrite a single file:
>>>>>>>
>>>>>>> # time echo 1 > 1.txt
>>>>>>> real 0m3.133s
>>>>>>> user 0m0.000s
>>>>>>> sys 0m0.000s
>>>>>>>
>>>>>>> BTW, I dont think it is the problem with OSD speed or network:
>>>>>>>
>>>>>>> # time sysbench --num-threads=1 --test=fileio --file-total-size=1G
>>>>>>> --file-test-mode=rndrw prepare
>>>>>>> 1073741824 bytes written in 23.52 seconds (43.54 MB/sec).
>>>>>>>
>>>>>>> Here is my ceph cluster status and verion:
>>>>>>>
>>>>>>> # ceph -w
>>>>>>>     cluster d3dcacc3-89fb-4db0-9fa9-f1f6217280cb
>>>>>>>      health HEALTH_OK
>>>>>>>      monmap e4: 4 mons at
>>>>>>> {atl-fs10=10.44.101.70:6789/0,atl-fs11=10.44.101.91:6789/0,atl-fs12=10.44.101.92:6789/0,atl-fs9=10.44.101.69:6789/0},
>>>>>>> election epoch 40, quorum 0,1,2,3 atl-fs9,atl-fs10,atl-fs11,atl-fs12
>>>>>>>      mdsmap e33: 1/1/1 up {0=atl-fs12=up:active}, 3 up:standby
>>>>>>>      osdmap e92: 4 osds: 4 up, 4 in
>>>>>>>       pgmap v8091: 192 pgs, 3 pools, 123 MB data, 1658 objects
>>>>>>>             881 GB used, 1683 GB / 2564 GB avail
>>>>>>>                  192 active+clean
>>>>>>>   client io 1820 B/s wr, 1 op/s
>>>>>>>
>>>>>>> # ceph -v
>>>>>>> ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
>>>>>>>
>>>>>>> All nodes connected with gigabit network.
>>>>>>> _______________________________________________
>>>>>>> ceph-users mailing list
>>>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com