Re: Maintaining write performance under a steady intake of small objects

That's interesting, Mark. It would be great if anyone has a definitive
answer on the potential syncfs-related downside of caching a lot of
inodes. A lot of our testing so far has been done on the assumption that
caching more inodes is a pure win.

On Tue, May 2, 2017 at 9:19 AM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
> I used to advocate that users favor the dentry/inode cache, but it turns
> out that's not necessarily a good idea if you are also using syncfs.  When
> syncfs is called, the kernel iterates over all cached inodes rather than
> just the dirty ones.  With high numbers of cached inodes, that can hurt
> performance enough to become a problem.
> See Sage's post here:
>
> http://www.spinics.net/lists/ceph-devel/msg25644.html
>
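> As a rough way to see how big those caches actually are on an OSD node
> (the slab names below assume XFS; exact names vary by kernel), something
> like this gives a quick read:
>
>   cat /proc/sys/fs/inode-nr                  # allocated vs. unused inodes
>   grep -E 'dentry|xfs_inode' /proc/slabinfo  # slab objects behind those caches
>   slabtop -o -s c | head -20                 # same data, sorted by cache size
>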
> I don't remember if we ended up ripping syncfs out completely. Bluestore
> ditches the filesystem so we don't have to deal with this anymore
> regardless.  It's something to be aware of though.
>
> Mark
>
> On 05/02/2017 07:24 AM, George Mihaiescu wrote:
>>
>> Hi Patrick,
>>
>> You could add more RAM to the servers, which probably would not increase
>> the cost too much.
>>
>> You could change the swappiness value, or use something
>> like https://hoytech.com/vmtouch/ to pre-cache inode entries.
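>>
>> For example (illustrative values only; the path assumes a default
>> filestore layout, and vmtouch warms the page cache while pulling dentries
>> and inodes in as it crawls the tree):
>>
>>   sysctl -w vm.swappiness=10                    # example value; swap less eagerly
>>   vmtouch -t /var/lib/ceph/osd/ceph-0/current   # crawl and touch one OSD's PG dirs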
>>
>> Maybe you could tarball the smaller files before loading them into Ceph.
>>
>> How are the ten clients accessing Ceph by the way?
>>
>> On May 1, 2017, at 14:23, Patrick Dinnen <pdinnen@xxxxxxxxx
>> <mailto:pdinnen@xxxxxxxxx>> wrote:
>>
>>> One additional detail, we also did filestore testing using Jewel and
>>> saw substantially similar results to those on Kraken.
>>>
>>> On Mon, May 1, 2017 at 2:07 PM, Patrick Dinnen <pdinnen@xxxxxxxxx
>>> <mailto:pdinnen@xxxxxxxxx>> wrote:
>>>
>>>     Hello Ceph-users,
>>>
>>>     Florian has been helping with our proof-of-concept cluster, where
>>>     we've been experiencing these issues. Thanks for the replies so
>>>     far; I wanted to jump in with some extra details.
>>>
>>>     All of our testing has been with scrubbing turned off, to remove
>>>     that as a factor.
>>>
>>>     Our use case requires a Ceph cluster to indefinitely store ~10
>>>     billion files of 20-60KB in size. We’ll begin with 4 billion files
>>>     migrated from a legacy storage system. Ongoing writes will be
>>>     handled by ~10 client machines and will come in at a fairly steady
>>>     10-20 million files/day. Every file (excluding the legacy 4
>>>     billion) will be read once by a single client within hours of its
>>>     initial write to the cluster. Future read requests will come from
>>>     a single server, with a long-tail distribution: popular files will
>>>     be read thousands of times a year, but most will rarely or never
>>>     be read.
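>>>
>>>     (Back-of-envelope, at the top of those ranges: 20 million
>>>     files/day x 60KB is roughly 1.2TB/day of logical data, or about
>>>     2.4TB/day of raw writes at replication 2, before counting
>>>     filestore journal writes.)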
>>>
>>>     Our “production” design has 6 nodes and 24 OSDs (expandable to 48
>>>     OSDs), with SSD journals at a 1:4 SSD:HDD ratio. Each node looks
>>>     like this:
>>>
>>>      * 2 x E5-2660 8-core Xeons
>>>      * 64GB RAM DDR-3 PC1600
>>>      * 10Gb ceph-internal network (SFP+)
>>>      * LSI 9210-8i controller (IT mode)
>>>      * 4 x OSD 8TB HDDs, mix of two types
>>>          o Seagate ST8000DM002
>>>          o HGST HDN728080ALE604
>>>          o Mount options = xfs (rw,noatime,attr2,inode64,noquota)
>>>      * 1 x SSD journal Intel 200GB DC S3700
>>>
>>>
>>>     Running Kraken 11.2.0 on Ubuntu 16.04. All testing has been done
>>>     at a replication level of 2. We’re using rados bench to shotgun a
>>>     lot of files into our test pools, specifically following these two
>>>     steps:
>>>     ceph osd pool create poolofhopes 2048 2048 replicated "" replicated_ruleset 500000000
>>>     rados -p poolofhopes bench -t 32 -b 20000 30000000 write --no-cleanup
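>>>
>>>     (To spell out the arguments as we understand them: 2048/2048 are
>>>     pg_num/pgp_num, the empty string fills the erasure-code-profile
>>>     slot, replicated_ruleset is the CRUSH ruleset, and 500000000 is
>>>     expected_num_objects, which is what triggers the pre-splitting of
>>>     PG directories at pool creation. For the bench, -t 32 is the
>>>     number of concurrent ops, -b 20000 writes 20000-byte objects, and
>>>     30000000 is the run time in seconds, i.e. effectively until we
>>>     stop it.)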
>>>
>>>     We leave the bench running for days at a time and watch the
>>>     objects-in-cluster count. Performance starts off decent and
>>>     degrades over time: there’s a very brief initial surge in write
>>>     throughput, after which things settle into a downward trend.
>>>
>>>     1st hour - 2 million objects/hour
>>>     20th hour - 1.9 million objects/hour
>>>     40th hour - 1.7 million objects/hour
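>>>
>>>     One crude way to track that rate, for anyone repeating this, is to
>>>     sample the per-pool object counts on an interval and diff the
>>>     samples afterwards; the interval and log name here are arbitrary:
>>>
>>>         while sleep 3600; do
>>>             echo "=== $(date -u +%FT%TZ) ==="
>>>             rados df    # per-pool object counts
>>>         done >> poolofhopes_objects.log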
>>>
>>>     This performance is not encouraging for us. We need to be writing
>>>     40 million objects per day (20 million files at replication 2),
>>>     which works out to roughly 1.7 million objects/hour, so the rate
>>>     we’re seeing at the 40th hour of our bench would be just
>>>     sufficient. Those write rates are still falling, though, and we’re
>>>     only at a fraction of the number of objects in cluster that we
>>>     need to handle. So the trend suggests we shouldn’t count on having
>>>     the write performance we need for much longer.
>>>
>>>     If we repeat the process of creating a new pool and running the
>>>     bench, the same pattern holds: good initial performance that
>>>     gradually degrades.
>>>
>>>     https://postimg.org/image/ovymk7n2d/
>>>     [caption:90 million objects written to a brand new, pre-split pool
>>>     (poolofhopes). There are already 330 million objects on the
>>>     cluster in other pools.]
>>>
>>>     Our working theory is that the degradation over time may be
>>>     related to inode or dentry lookups that miss the cache and lead to
>>>     additional disk reads and seek activity. There’s a suggestion that
>>>     filestore directory splitting may exacerbate that problem, as
>>>     additional/longer disk seeks occur depending on what lands in
>>>     which XFS allocation group. We have found pre-split pools useful
>>>     in one major way: they avoid the periods of near-zero write
>>>     performance that we have put down to the active splitting of
>>>     directories (the "thundering herd" effect). The overall downward
>>>     curve seems to remain the same whether we pre-split or not.
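>>>
>>>     For reference, the filestore options that govern when those splits
>>>     happen (values below are illustrative, not a recommendation; the
>>>     Kraken defaults are a merge threshold of 10 and a split multiple
>>>     of 2, and as we read it a subdirectory splits at roughly
>>>     16 * split_multiple * abs(merge_threshold) files):
>>>
>>>         [osd]
>>>         filestore merge threshold = 40
>>>         filestore split multiple = 8
>>>         # -> split at ~16 * 8 * 40 = 5120 files per subdirectory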
>>>
>>>     The thundering herd seems to be kept in check by an appropriate
>>>     pre-split. Bluestore may or may not be a solution, but questions
>>>     about its maturity and stability, set against our fairly tight
>>>     timeline, don't recommend it to us. Right now our big question is:
>>>     how can we avoid the gradual degradation in write performance over
>>>     time?
>>>
>>>     Thank you, Patrick
>>>
>>>
>>>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



