On 29/05/2012 19:50, Mark Nelson wrote:
1 1GbE Client node
3 1GbE Mon nodes
2 1GbE OSD nodes with 1 OSD on each mounted on a 7200rpm SAS drive.
btrfs with -l 64k -n64k, mounted using noatime. H700 Raid controller
with each drive in a 1 disk raid0. Journals are partitioned on a
separate drive.
Hello,
I forgot to mention that I'm using 10 GbE, and the filesystems are btrfs created with -l 64k -n 64k,
but also mounted with space_cache,compress=lzo,nobarrier,noatime.
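If it helps to reproduce the setup, creating and mounting a filesystem with those options would look roughly like this (the device and mount point below are placeholders, not my actual paths):

mkfs.btrfs -l 64k -n 64k /dev/sdb    # 64k leaf and node size, example device
mount -o space_cache,compress=lzo,nobarrier,noatime /dev/sdb /data/osd.0    # example mount point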
The journal is on tmpfs:
osd journal = /dev/shm/journal
osd journal size = 6144
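For context, those two lines go in the [osd] section of ceph.conf, roughly like this (the data path is only an illustration, not necessarily what I use):

[osd]
    osd journal = /dev/shm/journal
    osd journal size = 6144    ; 6 GB journal in tmpfs, gone after a reboot
    ;osd data = /data/osd.$id  ; example data path

Of course a tmpfs journal is volatile, which is acceptable only because this is a test setup.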
Remember, it's not a production system for the moment. I'm just trying to
evaluate the best performance I can get (and whether the system is stable
enough to start alpha/pre-production services). BTW, I noticed that OSDs
using XFS are much, much slower than OSDs with btrfs right now,
particularly in rbd tests. btrfs has some stability problems, even if it
seems better with newer kernels.
/proc/version:
Linux version 3.4.0-ceph (autobuild-ceph@gitbuilder-kernel-amd64)
rados -p data bench 120 write:
Total time run: 120.601286
Total writes made: 2979
Write size: 4194304
Bandwidth (MB/sec): 98.805
Average Latency: 0.647507
Max latency: 1.39966
Min latency: 0.181663
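For comparison, the numbers above come from the stock rados bench; the read side can be measured the same way, assuming the objects written during the write phase are still in the pool:

rados -p data bench 120 write    # 2 minutes of 4 MB writes, 16 concurrent ops by default
rados -p data bench 120 seq      # sequential read of the objects left by the write run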
Once I get these nodes up to 0.47 and get them switched over to 10GbE
I'll redo the btrfs tests and try out xfs as well with longer running
tests.
As you can see, much more stable bandwidth with this pool.
That's pretty strange...
Indeed, that is very strange! Can you check to see how many pgs are
in each? Any difference in replication level? You can check with:
ceph osd pool get <pool> size
root@label5:~# ceph osd pool get data size
don't know how to get pool field size
root@label5:~# ceph osd pool get rbd size
don't know how to get pool field size
Is 'size' the right name for the field? In the wiki, 'size' isn't listed
as a valid field.
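If 'size' isn't a valid field for 'pool get', maybe the replication level shows up in the osd dump instead; something like this should list it (the exact field names in the output probably depend on the version):

ceph osd dump | grep size    # each pool line should include its replication size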
ceph osd pool get <pool> pg_num
root@label5:~# ceph osd pool get rbd pg_num
PG_NUM: 576
root@label5:~# ceph osd pool get data pg_num
PG_NUM: 576
The pg num is quite low because I started with small OSDs (9 OSDs with 200 G
each, internal disks) when I formatted. Now I have reduced to 8 OSDs (osd.4
is out), but with much larger (and faster) storage. 6 OSDs have 5 T each, 2
still have 200 G, but they are planned to migrate before the end of the week.
For the moment I try to keep the OSDs similar. Replication is set to 2.
No OSD is full; I don't have much data stored for the moment.
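For the record, I think the replication level can be set per pool with commands along these lines (from memory, so the syntax may be slightly off on 0.47); the last line just illustrates how a new pool with a higher pg count could be created, the name and the number being arbitrary:

ceph osd pool set data size 2        # 2 replicas on the data pool
ceph osd pool set rbd size 2         # same for the rbd pool
ceph osd pool create testpool 1024   # example: new pool with more placement groups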
Concerning the crush map, I'm not using the default one: the 8 nodes are in
3 different locations (some kilometers apart). 2 are in one place, 2 in
another, and the last 4 in the main place.
I try to group hosts together to avoid problems when I lose a location
(an electrical problem, for example). I'm not sure I really customized the
crush map as I should have.
Here is the map:
# begin crush map
# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 device4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
# types
type 0 osd
type 1 host
type 2 rack
type 3 pool
# buckets
host karuizawa {
id -5 # do not change unnecessarily
# weight 1.000
alg straw
hash 0 # rjenkins1
item osd.2 weight 1.000
}
host hazelburn {
id -6 # do not change unnecessarily
# weight 1.000
alg straw
hash 0 # rjenkins1
item osd.3 weight 1.000
}
rack loire {
id -3 # do not change unnecessarily
# weight 2.000
alg straw
hash 0 # rjenkins1
item karuizawa weight 1.000
item hazelburn weight 1.000
}
host carsebridge {
id -8 # do not change unnecessarily
# weight 1.000
alg straw
hash 0 # rjenkins1
item osd.5 weight 1.000
}
host cameronbridge {
id -9 # do not change unnecessarily
# weight 1.000
alg straw
hash 0 # rjenkins1
item osd.6 weight 1.000
}
rack chantrerie {
id -7 # do not change unnecessarily
# weight 2.000
alg straw
hash 0 # rjenkins1
item carsebridge weight 1.000
item cameronbridge weight 1.000
}
host chichibu {
id -2 # do not change unnecessarily
# weight 1.000
alg straw
hash 0 # rjenkins1
item osd.0 weight 1.000
}
host glenesk {
id -4 # do not change unnecessarily
# weight 1.000
alg straw
hash 0 # rjenkins1
item osd.1 weight 1.000
}
host braeval {
id -10 # do not change unnecessarily
# weight 1.000
alg straw
hash 0 # rjenkins1
item osd.7 weight 1.000
}
host hanyu {
id -11 # do not change unnecessarily
# weight 1.000
alg straw
hash 0 # rjenkins1
item osd.8 weight 1.000
}
rack lombarderie {
id -12 # do not change unnecessarily
# weight 4.000
alg straw
hash 0 # rjenkins1
item chichibu weight 1.000
item glenesk weight 1.000
item braeval weight 1.000
item hanyu weight 1.000
}
pool default {
id -1 # do not change unnecessarily
# weight 8.000
alg straw
hash 0 # rjenkins1
item loire weight 2.000
item chantrerie weight 2.000
item lombarderie weight 4.000
}
# rules
rule data {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
rule metadata {
ruleset 1
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
rule rbd {
ruleset 2
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
# end crush map
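For reference, that's the decompiled text form; editing it goes through the usual crushtool round trip, roughly like this (the file names below are arbitrary):

ceph osd getcrushmap -o crushmap.bin       # fetch the compiled map from the monitors
crushtool -d crushmap.bin -o crushmap.txt  # decompile to the text form shown above
# ... edit crushmap.txt ...
crushtool -c crushmap.txt -o crushmap.new  # recompile
ceph osd setcrushmap -i crushmap.new       # inject the new map into the cluster

If the goal is to always keep the 2 copies in different locations, I suppose the rules would need 'step chooseleaf firstn 0 type rack' instead of 'type host', but I haven't tried that yet.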
Hope it helps,
cheers
--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@xxxxxxxxxxxxxx