I am using the Giant release. The OSDs and MON/MDS are using the default
RHEL 7 kernel. The client is using the elrepo 3.19 kernel. I am also
using cephaux.

I may have found something. I did the build manually, so I did _NOT_
set these config settings:

filestore xattr use omap = false
filestore max inline xattr size = 65536
filestore_max_inline_xattr_size_xfs = 65536
filestore_max_inline_xattr_size_other = 512
filestore_max_inline_xattrs_xfs = 10

I just changed these settings to see if it would make a difference.
I copied data from one directory that had files I created before I set
these values ( time cp small1/* small2/. ) and it takes 2 min 30 secs to
copy 1600 files.
If I take the files I just copied into small2 and copy them to a
different directory ( time cp small2/* small3/. ) it only takes 5 mins to
copy 10000 files!

Could this be part of the problem?

On Thu, Apr 2, 2015 at 6:03 AM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> On Wed, Apr 1, 2015 at 12:31 AM, Barclay Jameson
> <almightybeeij@xxxxxxxxx> wrote:
>> Here is the mds output from the command you requested. I did this
>> during the small data run . ( time cp small1/* small2/ )
>> It is 20MB in size so I couldn't find a place online that would accept
>> that much data.
>>
>> Please find attached file.
>>
>> Thanks,
>
> In the log file, each 'create' request is followed by several
> 'getattr' requests. I guess these 'getattr' requests resulted from
> some kinds of permission check, but I can't reproduce this situation
> locally.
>
> which version of ceph/kernel are you using? do you use ceph-fuse or
> kernel client, what's the mount options?
>
> Regards
> Yan, Zheng
>
>
>>
>> Beeij
>>
>>
>> On Mon, Mar 30, 2015 at 10:59 PM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
>>> On Sun, Mar 29, 2015 at 1:12 AM, Barclay Jameson
>>> <almightybeeij@xxxxxxxxx> wrote:
>>>> I redid my entire Ceph build going back to CentOS 7 hoping to
>>>> get the same performance I did last time.
>>>> The rados bench test was the best I have ever had with a result of 740
>>>> MB wr and 1300 MB rd. This was even better than the first rados bench
>>>> test that had performance equal to PanFS. I find that this does not
>>>> translate to my CephFS. Even with the following tweaking it is still at
>>>> least twice as slow as PanFS and my first *Magical* build (that had
>>>> absolutely no tweaking):
>>>>
>>>> OSD
>>>> osd_op_threads 8
>>>> /sys/block/sd*/queue/nr_requests 4096
>>>> /sys/block/sd*/queue/read_ahead_kb 4096
>>>>
>>>> Client
>>>> rsize=16777216
>>>> readdir_max_bytes=16777216
>>>> readdir_max_entries=16777216
>>>>
>>>> ~160 mins to copy 100000 (1MB) files for CephFS vs ~50 mins for PanFS.
>>>> Throughput on CephFS is about 10MB/s vs PanFS 30 MB/s.
>>>>
>>>> Strange thing is none of the resources are taxed.
>>>> CPU, ram, network, disks, are not even close to being taxed on either
>>>> the client, mon/mds, or the osd nodes.
>>>> The PanFS client node was on a 10Gb network the same as the CephFS client
>>>> but you can see the huge difference in speed.
>>>>
>>>> As per Greg's questions before:
>>>> There is only one client reading and writing (time cp Small1/*
>>>> Small2/.) but three clients have cephfs mounted, although they aren't
>>>> doing anything on the filesystem.
>>>>
>>>> I have done another test where I stream data into a file as fast as
>>>> the processor can put it there.
>>>> (for (i=0; i < 1000000001; i++){ fprintf (out_file, "I is : %d\n",i);}
>>>> ) and it is faster than the PanFS. CephFS 16GB in 105 seconds with the
>>>> above tuning vs 130 seconds for PanFS. Without the tuning it takes 230
>>>> seconds for CephFS although the first build did it in 130 seconds
>>>> without any tuning.
>>>>
>>>> This leads me to believe the bottleneck is the mds. Does anybody have
>>>> any thoughts on this?
>>>> Are there any tuning parameters that I would need to speed up the mds?
>>>
>>> could you enable mds debugging for a few seconds (ceph daemon mds.x
>>> config set debug_mds 10; sleep 10; ceph daemon mds.x config set
>>> debug_mds 0). and upload /var/log/ceph/mds.x.log to somewhere.
>>>
>>> Regards
>>> Yan, Zheng
>>>
>>>>
>>>> On Fri, Mar 27, 2015 at 4:50 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>>>> On Fri, Mar 27, 2015 at 2:46 PM, Barclay Jameson
>>>>> <almightybeeij@xxxxxxxxx> wrote:
>>>>>> Yes it's the exact same hardware except for the MDS server (although I
>>>>>> tried using the MDS on the old node).
>>>>>> I have not tried moving the MON back to the old node.
>>>>>>
>>>>>> My default cache size is "mds cache size = 10000000"
>>>>>> The OSDs (3 of them) have 16 Disks with 4 SSD Journal Disks.
>>>>>> I created 2048 for data and metadata:
>>>>>> ceph osd pool create cephfs_data 2048 2048
>>>>>> ceph osd pool create cephfs_metadata 2048 2048
>>>>>>
>>>>>>
>>>>>> To your point on clients competing against each other... how would I check that?
>>>>>
>>>>> Do you have multiple clients mounted? Are they both accessing files in
>>>>> the directory(ies) you're testing? Were they accessing the same
>>>>> pattern of files for the old cluster?
>>>>>
>>>>> If you happen to be running a hammer rc or something pretty new you
>>>>> can use the MDS admin socket to explore a bit what client sessions
>>>>> there are and what they have permissions on and check; otherwise
>>>>> you'll have to figure it out from the client side.
>>>>> -Greg
>>>>>
>>>>>>
>>>>>> Thanks for the input!
>>>>>>
>>>>>>
>>>>>> On Fri, Mar 27, 2015 at 3:04 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>>>>>> So this is exactly the same test you ran previously, but now it's on
>>>>>>> faster hardware and the test is slower?
>>>>>>>
>>>>>>> Do you have more data in the test cluster?
>>>>>>> One obvious possibility is
>>>>>>> that previously you were working entirely in the MDS' cache, but now
>>>>>>> you've got more dentries and so it's kicking data out to RADOS and
>>>>>>> then reading it back in.
>>>>>>>
>>>>>>> If you've got the memory (you appear to) you can pump up the "mds
>>>>>>> cache size" config option quite dramatically from its default 100000.
>>>>>>>
>>>>>>> Other things to check are that you've got an appropriately-sized
>>>>>>> metadata pool, that you've not got clients competing against each
>>>>>>> other inappropriately, etc.
>>>>>>> -Greg
>>>>>>>
>>>>>>> On Fri, Mar 27, 2015 at 9:47 AM, Barclay Jameson
>>>>>>> <almightybeeij@xxxxxxxxx> wrote:
>>>>>>>> Oops, I should have said that I am not just writing the data but copying it:
>>>>>>>>
>>>>>>>> time cp Small1/* Small2/*
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> BJ
>>>>>>>>
>>>>>>>> On Fri, Mar 27, 2015 at 11:40 AM, Barclay Jameson
>>>>>>>> <almightybeeij@xxxxxxxxx> wrote:
>>>>>>>>> I did a Ceph cluster install 2 weeks ago where I was getting great
>>>>>>>>> performance (~= PanFS) where I could write 100,000 1MB files in 61
>>>>>>>>> Mins (Took PanFS 59 Mins). I thought I could increase the performance
>>>>>>>>> by adding a better MDS server so I redid the entire build.
>>>>>>>>>
>>>>>>>>> Now it takes 4 times as long to write the same data as it did before.
>>>>>>>>> The only thing that changed was the MDS server. (I even tried moving
>>>>>>>>> the MDS back on the old slower node and the performance was the same.)
>>>>>>>>>
>>>>>>>>> The first install was on CentOS 7. I tried going down to CentOS 6.6
>>>>>>>>> and it's the same results.
>>>>>>>>> I use the same scripts to install the OSDs (which I created because I
>>>>>>>>> can never get ceph-deploy to behave correctly. Although, I did use
>>>>>>>>> ceph-deploy to create the MDS and MON and initial cluster creation.)
>>>>>>>>>
>>>>>>>>> I use btrfs on the OSDs as I can get 734 MB/s write and 1100 MB/s read
>>>>>>>>> with rados bench -p cephfs_data 500 write --no-cleanup && rados bench
>>>>>>>>> -p cephfs_data 500 seq (xfs was 734 MB/s write but only 200 MB/s read)
>>>>>>>>>
>>>>>>>>> Could anybody think of a reason as to why I am now getting a huge regression?
>>>>>>>>>
>>>>>>>>> Hardware Setup:
>>>>>>>>> [OSDs]
>>>>>>>>> 64 GB 2133 MHz
>>>>>>>>> Dual Proc E5-2630 v3 @ 2.40GHz (16 Cores)
>>>>>>>>> 40Gb Mellanox NIC
>>>>>>>>>
>>>>>>>>> [MDS/MON new]
>>>>>>>>> 128 GB 2133 MHz
>>>>>>>>> Dual Proc E5-2650 v3 @ 2.30GHz (20 Cores)
>>>>>>>>> 40Gb Mellanox NIC
>>>>>>>>>
>>>>>>>>> [MDS/MON old]
>>>>>>>>> 32 GB 800 MHz
>>>>>>>>> Dual Proc E5472 @ 3.00GHz (8 Cores)
>>>>>>>>> 10Gb Intel NIC
>>>>>>>> _______________________________________________
>>>>>>>> ceph-users mailing list
>>>>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
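
The copy tests in this thread report only a total wall-clock time from
"time cp". As a minimal sketch (not part of the original messages), the
Python below copies a directory file by file and records how long each
copy takes, which might help show whether the slowdown is spread evenly
or concentrated in a few slow creates. The function names timed_copy and
summarize are made up for illustration, and the directory names in the
note after the code simply mirror the small1/small2 directories used in
the tests.

```python
import shutil
import time
from pathlib import Path


def timed_copy(src_dir, dst_dir):
    """Copy every regular file in src_dir to dst_dir, timing each copy.

    Returns a list of per-file durations in seconds.
    (Hypothetical helper; it does not reproduce cp's exact syscalls.)
    """
    src = Path(src_dir)
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    durations = []
    for path in sorted(src.iterdir()):
        if not path.is_file():
            continue  # skip subdirectories and other non-regular entries
        start = time.perf_counter()
        shutil.copy(path, dst / path.name)  # copies data and permission bits
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return (file count, total seconds, files per second)."""
    total = sum(durations)
    rate = len(durations) / total if total > 0 else float("inf")
    return len(durations), total, rate
```

For example, summarize(timed_copy("small1", "small2")) returns the number
of files copied, the total seconds spent copying, and the resulting
files per second. Running it once on files created before the xattr
settings were changed and once on files created afterwards would give
the same comparison as the two cp runs, with the per-file durations
available for a closer look.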