Re: newstore performance update

Hi,

ceph.conf file attached. It's a little ugly because I've been playing with various parameters. You'll probably want to enable debug newstore = 30 if you plan to do any debugging. Also, the code has been changing quickly so performance may have changed if you haven't tested within the last week.
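
(For reference, the attached file ships with debug newstore = "0/0"; getting the logging Mark mentions is just a matter of raising that level in ceph.conf, e.g.:

[global]
        debug newstore = 30

and then restarting the OSDs, or injecting the setting at runtime.)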

Mark

On 04/28/2015 09:59 PM, kernel neophyte wrote:
Hi Mark,

I am trying to measure 4k RW performance on Newstore, and I am not
anywhere close to the numbers you are getting!

Could you share your ceph.conf for these tests?

-Neo
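
(For reference, a 4k random-write run of the sort described above can be driven with an fio job along these lines; fio's rbd engine is an assumption here, since the paths in the attached ceph.conf suggest Mark's numbers came from cbt, and the pool and image names are placeholders:

[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=fio-test
direct=1
time_based=1
runtime=300

[4k-randwrite]
rw=randwrite
bs=4k
iodepth=32

Run it as "fio 4k-randwrite.fio" against an image created beforehand with "rbd create --size 10240 fio-test".)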

On Tue, Apr 28, 2015 at 5:07 PM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
Nothing official, though roughly from memory:

~1.7GB/s and something crazy like 100K IOPS for the SSD.

~150MB/s and ~125-150 IOPS for the spinning disk.

Mark


On 04/28/2015 07:00 PM, Venkateswara Rao Jujjuri wrote:

Thanks for sharing; the newstore numbers look a lot better.

Wondering if we have any baseline numbers to put things into perspective,
like what it is on XFS or on librados?

JV

On Tue, Apr 28, 2015 at 4:25 PM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:

Hi Guys,

Sage has been furiously working away at fixing bugs in newstore and
improving performance.  Specifically, we've been focused on write
performance, as newstore was previously lagging filestore by quite a bit.
A lot of work has gone into implementing libaio behind the scenes, and as
a result performance on spinning disks with an SSD WAL (and SSD-backed
rocksdb) has improved pretty dramatically.  It's now often beating filestore:
http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
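
For the spinning-disk-plus-SSD layouts, the relevant knobs are the ones left commented out in the attached ceph.conf; a minimal sketch, assuming the SSD is mounted at /wal:

[global]
        rocksdb wal dir = "/wal"
        newstore db path = "/wal"

With those unset (as in the attached file), the rocksdb WAL and DB simply live on the same device as the OSD data.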

On the other hand, sequential writes are slower than random writes when the
OSD, DB, and WAL are all on the same device, be it a spinning disk or an SSD.
In this situation newstore does better with random writes and sometimes
beats filestore (such as in the everything-on-spinning-disk tests, and when
IO sizes are small in the everything-on-SSD tests).

Newstore is changing daily, so keep in mind that these results are almost
assuredly going to change.  An interesting area of investigation will be why
sequential writes are slower than random writes, and whether (and how) we
are being limited by rocksdb ingest speed.

I've also uploaded a quick perf call-graph I grabbed during the "all-SSD"
32KB sequential write test to see if rocksdb was starving one of the cores,
but found something that looks quite a bit different:

http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf

Mark

[global]
        osd pool default size = 1

        osd crush chooseleaf type = 0
        enable experimental unrecoverable data corrupting features = newstore rocksdb
        osd objectstore = newstore
#        newstore aio max queue depth = 4096 
#        newstore overlay max length = 8388608 
#        rocksdb wal dir = "/wal"
#        newstore db path = "/wal"
        newstore overlay max = 0
        newstore_wal_threads = 8
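        # rocksdb tuning for the newstore KV/WAL workload (buffer and level sizes are in bytes)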
        rocksdb_write_buffer_size = 536870912
        rocksdb_write_buffer_num = 4
        rocksdb_min_write_buffer_number_to_merge = 2
        rocksdb_log = /home/nhm/tmp/cbt/ceph/log/rocksdb.log
        rocksdb_max_background_compactions = 4
        rocksdb_compaction_threads = 4
        rocksdb_level0_file_num_compaction_trigger = 4
        rocksdb_max_bytes_for_level_base = 104857600  # 100MB
        rocksdb_target_file_size_base = 10485760      # 10MB
        rocksdb_num_levels = 3
        rocksdb_compression = none

        keyring = /home/nhm/tmp/cbt/ceph/keyring
        osd pg bits = 8  
        osd pgp bits = 8
	auth supported = none
        log to syslog = false
        log file = /home/nhm/tmp/cbt/ceph/log/$name.log
        filestore xattr use omap = true
        auth cluster required = none
        auth service required = none
        auth client required = none

        public network = 192.168.10.0/24
        cluster network = 192.168.10.0/24
        rbd cache = true
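        # effectively disable scrubbing during benchmark runs (intervals are ~2^37 seconds)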
        osd scrub load threshold = 0.01
        osd scrub min interval = 137438953472
        osd scrub max interval = 137438953472
        osd deep scrub interval = 137438953472
        osd max scrubs = 16

        filestore merge threshold = 40
        filestore split multiple = 8
        osd op threads = 8

        debug newstore = "0/0" 

        debug_lockdep = "0/0" 
        debug_context = "0/0"
        debug_crush = "0/0"
        debug_mds = "0/0"
        debug_mds_balancer = "0/0"
        debug_mds_locker = "0/0"
        debug_mds_log = "0/0"
        debug_mds_log_expire = "0/0"
        debug_mds_migrator = "0/0"
        debug_buffer = "0/0"
        debug_timer = "0/0"
        debug_filer = "0/0"
        debug_objecter = "0/0"
        debug_rados = "0/0"
        debug_rbd = "0/0"
        debug_journaler = "0/0"
        debug_objectcacher = "0/0"
        debug_client = "0/0"
        debug_osd = "0/0"
        debug_optracker = "0/0"
        debug_objclass = "0/0"
        debug_filestore = "0/0"
        debug_journal = "0/0"
        debug_ms = "0/0"
        debug_mon = "0/0"
        debug_monc = "0/0"
        debug_paxos = "0/0"
        debug_tp = "0/0"
        debug_auth = "0/0"
        debug_finisher = "0/0"
        debug_heartbeatmap = "0/0"
        debug_perfcounter = "0/0"
        debug_rgw = "0/0"
        debug_hadoop = "0/0"
        debug_asok = "0/0"
        debug_throttle = "0/0"

        mon pg warn max object skew = 100000
        mon pg warn min per osd = 0
        mon pg warn max per osd = 32768


#        debug optracker = 30
#        debug tp = 5
#        objecter infilght op bytes = 1073741824
#        objecter inflight ops = 8192
 
#        filestore wbthrottle enable = false
#        debug osd = 20

#        filestore wbthrottle xfs ios start flusher = 500
#        filestore wbthrottle xfs ios hard limit = 5000
#        filestore wbthrottle xfs inodes start flusher = 500
#        filestore wbthrottle xfs inodes hard limit = 5000
#        filestore wbthrottle xfs bytes start flusher = 41943040
#        filestore wbthrottle xfs bytes hard limit = 419430400

#        filestore wbthrottle btrfs ios start flusher = 500
#        filestore wbthrottle btrfs ios hard limit = 5000
#        filestore wbthrottle btrfs inodes start flusher = 500
#        filestore wbthrottle btrfs inodes hard limit = 5000
#        filestore wbthrottle btrfs bytes start flusher = 41943040
#        filestore wbthrottle btrfs bytes hard limit = 419430400

[mon]
	mon data = /home/nhm/tmp/cbt/ceph/mon.$id
        
[mon.a]
	host = burnupiX 
        mon addr = 127.0.0.1:6789

[osd.0]
	host = burnupiX
        osd data = /home/nhm/tmp/cbt/mnt/osd-device-0-data
        osd journal = /dev/disk/by-partlabel/osd-device-0-journal
#        osd journal = /dev/sds1

