Re: Ceph performance, empty vs part full

There are a lot of factors that play into all of this. The more PGs you have, the more total objects you can store before you hit the thresholds. More PGs also means slightly better random distribution across OSDs (not really affected by the size of the OSD, assuming all OSDs are uniform). You have to be careful increasing the PG count, though. I've tested about a million PGs and things more or less worked, but the mons were pretty laggy and I didn't test recovery. For small clusters I personally like to use more PGs than our guidelines indicate, and for very large clusters I suspect you might have to under-allocate but then use larger directory splitting thresholds to at least balance that part of the equation out.
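As a back-of-the-envelope sketch (assuming the usual FileStore rule that a PG subdirectory splits once it exceeds filestore_split_multiple * filestore_merge_threshold * 16 files, i.e. 320 files with the defaults of 2 and 10):

  echo $(( 1024 * 320 ))   # ~327k objects in a 1024-PG pool before the first splits
  echo $(( 4096 * 320 ))   # ~1.3M objects in a 4096-PG pool before the first splits

So quadrupling the PG count roughly quadruples how many objects you can write before the directory trees start fanning out.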

Mark

On 09/04/2015 07:18 AM, Nick Fisk wrote:
Actually, just thinking about this some more: shouldn't the PGs-per-OSD "golden rule" also depend on the size of the OSD? If this directory splitting is a big deal, then an 8TB OSD is going to need a lot more PGs than, say, a 1TB OSD.
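Crude sketch with made-up numbers (4 MB average objects, 100 PGs per OSD):

  echo $(( 1024 * 1024 / 4 / 100 ))       # ~2,600 objects per PG on a 1TB OSD
  echo $(( 8 * 1024 * 1024 / 4 / 100 ))   # ~21,000 objects per PG on an 8TB OSD

Same PGs-per-OSD figure, but roughly 8x the files per PG directory tree on the bigger disk, so it hits the split thresholds much sooner.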

Any thoughts Mark?

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
Nick Fisk
Sent: 04 September 2015 13:08
To: 'Wang, Warren' <Warren_Wang@xxxxxxxxxxxxxxxxx>; 'Mark Nelson'
<mnelson@xxxxxxxxxx>; 'Ben Hines' <bhines@xxxxxxxxx>
Cc: 'ceph-users' <ceph-users@xxxxxxxxxxxxxx>
Subject: Re:  Ceph performance, empty vs part full

I've just made the same change (4 and 40 for now) on my cluster, which is a similar size to yours. I didn't see any merging happening, although most of the directories I looked at had more files in them than the new merge threshold, so I guess this is to be expected.

I'm currently splitting my PGs from 1024 to 2048 to see if that helps to bring things back into order.
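(For reference, the split itself is just the usual pool resize, something along the lines of:

  ceph osd pool set <pool> pg_num 2048
  ceph osd pool set <pool> pgp_num 2048

with pgp_num bumped after pg_num so the new PGs actually get rebalanced.)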

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
Of Wang, Warren
Sent: 04 September 2015 01:21
To: Mark Nelson <mnelson@xxxxxxxxxx>; Ben Hines <bhines@xxxxxxxxx>
Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
Subject: Re:  Ceph performance, empty vs part full

I'm about to change it on a big cluster too. It totals around 30 million objects, so I'm a bit nervous about changing it. As far as I understood, it would indeed move them around if you can get underneath the threshold, but that may be hard to do. They're two more settings that I highly recommend changing on a big prod cluster; I'm in favor of bumping both of them up in the defaults.
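(If it helps to sanity-check first, per-pool object counts are visible with something like:

  ceph df detail
  rados df

so you can see how those ~30 million objects are spread across pools.)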

Warren

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
Of Mark Nelson
Sent: Thursday, September 03, 2015 6:04 PM
To: Ben Hines <bhines@xxxxxxxxx>
Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
Subject: Re:  Ceph performance, empty vs part full

Hrm, I think it will follow the merge/split rules if it's out of whack
given the new settings, but I don't know that I've ever tested it on
an existing cluster to see that it actually happens.  I guess let it
sit for a while and then check the OSD PG directories to see if the
object counts make sense given the new settings? :D
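Something like this, picking on one PG (hypothetical PG id and OSD number, standard FileStore layout assumed):

  find /var/lib/ceph/osd/ceph-0/current/1.2f_head -type f | wc -l              # objects in the PG
  find /var/lib/ceph/osd/ceph-0/current/1.2f_head -mindepth 1 -type d | wc -l  # how far it has split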

Mark

On 09/03/2015 04:31 PM, Ben Hines wrote:
Hey Mark,

I've just tweaked these filestore settings for my cluster -- after changing this, is there a way to make Ceph move existing objects around to the new filestore locations, or will this only apply to newly created objects? (I would assume the latter.)

thanks,

-Ben

On Wed, Jul 8, 2015 at 6:39 AM, Mark Nelson <mnelson@xxxxxxxxxx>
wrote:
Basically, for each PG there's a directory tree where only a certain number of objects are allowed in a given directory before it splits into new branches/leaves. The problem is that this has a fair amount of overhead, and there are also extra associated dentry lookups to get at any given object.

You may want to try something like:

"filestore merge threshold = 40"
"filestore split multiple = 8"

This will dramatically increase the number of objects per directory
allowed.
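Roughly speaking (assuming the usual split rule of filestore_split_multiple * filestore_merge_threshold * 16 files per directory):

  echo $(( 8 * 40 * 16 ))   # 5120 files per directory with the settings above
  echo $(( 2 * 10 * 16 ))   # 320 files per directory with the defaults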

Another thing you may want to try is telling the kernel to greatly
favor retaining dentries and inodes in cache:

echo 1 | sudo tee /proc/sys/vm/vfs_cache_pressure
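If that helps, you could make it persist across reboots with a sysctl drop-in (file name here is just an example):

  echo 'vm.vfs_cache_pressure = 1' | sudo tee /etc/sysctl.d/90-vfs-cache.conf
  sudo sysctl --system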

Mark


On 07/08/2015 08:13 AM, MATHIAS, Bryn (Bryn) wrote:

If I create a new pool it is generally fast for a short amount of time. Not as fast as if I had a blank cluster, but close to it.

Bryn

On 8 Jul 2015, at 13:55, Gregory Farnum <greg@xxxxxxxxxxx> wrote:

I think you're probably running into the internal PG/collection
splitting here; try searching for those terms and seeing what
your OSD folder structures look like. You could test by creating
a new pool and seeing if it's faster or slower than the one
you've already filled
up.
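For example, something roughly like this (pool name and PG count are just placeholders):

  ceph osd pool create testpool 2048 2048
  rados -p testpool bench 60 write -t 50
  rados -p <existing-pool> bench 60 write -t 50   # compare against the part-full pool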
-Greg

On Wed, Jul 8, 2015 at 1:25 PM, MATHIAS, Bryn (Bryn)
<bryn.mathias@xxxxxxxxxxxxxxxxxx> wrote:

Hi All,


I'm perf testing a cluster again. This time I have re-built the cluster and am filling it for testing.

On a 10 min run I get the following results from 5 load generators, each writing through 7 iocontexts, with a queue depth of 50 async writes.


Gen1
Percentile 100 = 0.729775905609
Max latencies = 0.729775905609, Min = 0.0320818424225, mean =
0.0750389684542
Total objects written = 113088 in time 604.259738207s gives
187.151307376/s (748.605229503 MB/s)

Gen2
Percentile 100 = 0.735981941223
Max latencies = 0.735981941223, Min = 0.0340068340302, mean =
0.0745198070711
Total objects written = 113822 in time 604.437897921s gives
188.310495407/s (753.241981627 MB/s)

Gen3
Percentile 100 = 0.828994989395
Max latencies = 0.828994989395, Min = 0.0349340438843, mean =
0.0745455575197
Total objects written = 113670 in time 604.352181911s gives
188.085694736/s (752.342778944 MB/s)

Gen4
Percentile 100 = 1.06834602356
Max latencies = 1.06834602356, Min = 0.0333499908447, mean =
0.0752239764659
Total objects written = 112744 in time 604.408732891s gives
186.536020849/s (746.144083397 MB/s)

Gen5
Percentile 100 = 0.609658002853
Max latencies = 0.609658002853, Min = 0.032968044281, mean =
0.0744482759499
Total objects written = 113918 in time 604.671534061s gives
188.396498897/s (753.585995589 MB/s)

example ceph -w output:
2015-07-07 15:50:16.507084 mon.0 [INF] pgmap v1077: 2880 pgs: 2880 active+clean; 1996 GB data, 2515 GB used, 346 TB / 348 TB avail; 2185 MB/s wr, 572 op/s


However, when the cluster gets over 20% full I see the following results; this gets worse as the cluster fills up:

Gen1
Percentile 100 = 6.71176099777
Max latencies = 6.71176099777, Min = 0.0358741283417, mean =
0.161760483485
Total objects written = 52196 in time 604.488474131s gives
86.347386648/s
(345.389546592 MB/s)

Gen2
Max latencies = 4.09169006348, Min = 0.0357890129089, mean =
0.163243938477
Total objects written = 51702 in time 604.036739111s gives
85.5941313704/s (342.376525482 MB/s)

Gen3
Percentile 100 = 7.32526683807
Max latencies = 7.32526683807, Min = 0.0366668701172, mean =
0.163992217926
Total objects written = 51476 in time 604.684302092s gives
85.1287189397/s (340.514875759 MB/s)

Gen4
Percentile 100 = 7.56094503403
Max latencies = 7.56094503403, Min = 0.0355761051178, mean =
0.162109421231
Total objects written = 52092 in time 604.769910812s gives
86.1352376642/s (344.540950657 MB/s)


Gen5
Percentile 100 = 6.99595499039
Max latencies = 6.99595499039, Min = 0.0364680290222, mean =
0.163651215426
Total objects written = 51566 in time 604.061977148s gives
85.3654127404/s (341.461650961 MB/s)






Cluster details:
5 x HP DL380s, each with 13 x 6TB OSDs
128GB RAM
2 x Intel 2620v3
10Gbit Ceph public network
10Gbit Ceph private network

Load generators connected via a 20Gbit bond to the Ceph public network.


Is this likely to be something happening to the journals, or is there something else going on?

I have run fio and iperf tests, and the disk and network performance is very high.


Kind Regards,
Bryn Mathias

















