I used to advocate that users favor the dentry/inode cache, but it turns out that's not necessarily a good idea if you are also using syncfs. When syncfs is used, the kernel iterates over all cached inodes rather than just the dirty ones, and with high numbers of cached inodes that can impact performance enough to become a problem. See Sage's post here:
http://www.spinics.net/lists/ceph-devel/msg25644.html
I don't remember whether we ended up ripping syncfs out completely. Bluestore ditches the filesystem, so we don't have to deal with this anymore regardless. It's something to be aware of, though.
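If you want to see how much of this is in play on a node, a rough sketch (the sysctl value at the end is illustrative, not a recommendation):

# How big are the dentry and XFS inode slab caches right now?
sudo slabtop -o | grep -E 'dentry|xfs_inode'

# Kernel-reported dentry cache stats (total, unused, age limit, ...)
cat /proc/sys/fs/dentry-state

# Bias reclaim toward dropping cached dentries/inodes sooner than the default of 100
sudo sysctl vm.vfs_cache_pressure=200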
Mark
On 05/02/2017 07:24 AM, George Mihaiescu wrote:
Hi Patrick,
You could add more RAM to the servers, which probably would not increase the cost too much.
You could change the swappiness value, or use something like https://hoytech.com/vmtouch/ to pre-cache inode entries.
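A rough sketch of both of those ideas; the paths and values here are placeholders, not recommendations:

# Discourage swapping so more RAM stays available for the page/inode caches
sudo sysctl vm.swappiness=10

# Crawl and pre-warm filestore metadata: -t touches it into cache,
# -l additionally mlocks it (the vmtouch process stays running to hold the lock)
vmtouch -t /var/lib/ceph/osd/ceph-0/current
vmtouch -l /var/lib/ceph/osd/ceph-0/current/meta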
You could tarball the smaller files before loading them into Ceph maybe.
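If you go the tarball route, a minimal sketch of the idea (pool name, paths, and batch layout are made up):

# Bundle a batch of small files into one archive, then store the archive as a
# single RADOS object instead of thousands of tiny ones
tar -cf /staging/batch-0001.tar -C /incoming/batch-0001 .
rados -p <pool> put batch-0001.tar /staging/batch-0001.tar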
How are the ten clients accessing Ceph by the way?
On May 1, 2017, at 14:23, Patrick Dinnen <pdinnen@xxxxxxxxx> wrote:
One additional detail, we also did filestore testing using Jewel and
saw substantially similar results to those on Kraken.
On Mon, May 1, 2017 at 2:07 PM, Patrick Dinnen <pdinnen@xxxxxxxxx> wrote:
Hello Ceph-users,
Florian has been helping with the issues we've been experiencing on our proof-of-concept cluster. Thanks for the replies so far. I wanted to jump in with some extra details. All of our testing has been done with scrubbing turned off, to remove that as a factor.
Our use case requires a Ceph cluster to indefinitely store ~10
billion files 20-60KB in size. We’ll begin with 4 billion files
migrated from a legacy storage system. Ongoing writes will be
handled by ~10 client machines and come in at a fairly steady
10-20 million files/day. Every file (excluding the legacy 4 billion) will be read once by a single client within hours of its initial write to the cluster. Future read requests will come from a single server, with a long-tail distribution: popular files read thousands of times a year, but most read rarely or never.
Our “production” design has 6 nodes and 24 OSDs (expandable to 48 OSDs), with SSD journals at a 1:4 ratio to HDDs. Each node looks like this:
* 2 x E5-2660 8-core Xeons
* 64GB RAM DDR-3 PC1600
* 10Gb ceph-internal network (SFP+)
* LSI 9210-8i controller (IT mode)
* 4 x OSD 8TB HDDs, mix of two types
  o Seagate ST8000DM002
  o HGST HDN728080ALE604
  o Mount options = xfs (rw,noatime,attr2,inode64,noquota)
* 1 x SSD journal: Intel 200GB DC S3700
Running Kraken 11.2.0 on Ubuntu 16.04. All testing has been done with a replication level of 2. We’re using rados bench to shotgun a lot of files into our test pools, specifically following these two steps:

ceph osd pool create poolofhopes 2048 2048 replicated "" replicated_ruleset 500000000
rados -p poolofhopes bench -t 32 -b 20000 30000000 write --no-cleanup
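In case the argument soup isn't obvious, the same two commands annotated (my reading of the parameters; worth double-checking against the docs for your release):

# pool name, pg_num, pgp_num, a replicated pool, an empty placeholder for the
# erasure-code-profile slot, the CRUSH rule to use, and expected_num_objects
# (which is what triggers the filestore directory pre-split at creation time)
ceph osd pool create poolofhopes 2048 2048 replicated "" replicated_ruleset 500000000

# 32 concurrent ops, 20000-byte objects, a run time of 30000000 seconds
# (effectively until we kill it), keeping the written objects afterwards
rados -p poolofhopes bench -t 32 -b 20000 30000000 write --no-cleanup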
We leave the bench running for days at a time and watch the objects-in-cluster count. We see performance that starts off decent and degrades over time. There's a very brief initial surge in write performance, after which things settle into a downward trend:
1st hour - 2 million objects/hour
20th hour - 1.9 million objects/hour
40th hour - 1.7 million objects/hour
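For anyone wanting to reproduce the measurement, a minimal sketch of one way to sample the object count over time (the interval and log path are arbitrary):

# Log the per-pool object count once an hour; objects/hour is the delta
# between consecutive samples
while true; do
    echo "$(date -u '+%F %T') $(rados df | grep poolofhopes)" >> poolofhopes-growth.log
    sleep 3600
done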
This performance is not encouraging for us. We need to be writing 40 million objects per day (20 million files x 2 replicas). The rate we're seeing at the 40th hour of the bench, 1.7 million objects/hour, works out to roughly 41 million objects/day, which would just be sufficient. But those write rates are still falling, and we're only at a fraction of the number of objects in cluster that we will eventually need to handle, so the trend in performance suggests we shouldn't count on having the write performance we need for long.
If we repeat the process of creating a new pool and running the bench, the same pattern holds: good initial performance that gradually degrades.
https://postimg.org/image/ovymk7n2d/
[caption:90 million objects written to a brand new, pre-split pool
(poolofhopes). There are already 330 million objects on the
cluster in other pools.]
Our working theory is that the degradation over time may be related to inode or dentry lookups that miss cache and lead to additional disk reads and seek activity. There's a suggestion that filestore directory splitting may exacerbate that problem, as additional/longer disk seeks occur depending on which XFS allocation group things land in. We have found pre-split pools useful in one major way: they avoid the periods of near-zero write performance that we have put down to the active splitting of directories (the "thundering herd" effect). The overall downward curve seems to remain the same whether we pre-split or not.
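For reference, the pre-splitting depends on expected_num_objects at pool creation together with the filestore split/merge settings. A sketch of the relevant ceph.conf knobs; these particular values are illustrative, not what we've settled on:

[osd]
# A negative merge threshold disables directory merging and, combined with
# expected_num_objects at pool creation, pre-creates the directory tree up front
filestore merge threshold = -10
# A subdirectory splits once it holds more than roughly
# filestore_split_multiple * abs(filestore_merge_threshold) * 16 objects
filestore split multiple = 8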
The thundering herd seems to be kept in check by an appropriate pre-split. Bluestore may or may not be a solution, but uncertainty about its stability within our fairly tight timeline doesn't recommend it to us. Right now our big question is: how can we avoid the gradual degradation in write performance over time?
Thank you, Patrick
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com