Our “production” design has 6 nodes and 24 OSDs (expandable to 48 OSDs), with SSD journals at a 1:4 ratio to HDDs. Each node looks like this:
- 2 x E5-2660 8-core Xeons
- 64GB DDR3-1600 RAM
- 10Gb ceph-internal network (SFP+)
- LSI 9210-8i controller (IT mode)
- 4 x 8TB OSD HDDs, a mix of two models:
  - Seagate ST8000DM002
  - HGST HDN728080ALE604
- Mount options = xfs (rw,noatime,attr2,inode64,noquota) (example fstab entry below)
- 1 x Intel 200GB DC S3700 SSD journal
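For reference, one of those data-disk mounts would look roughly like the line below in /etc/fstab. The device name and mount point are hypothetical (ceph-disk normally mounts OSD data partitions by UUID); this is only to show the mount options in context:

  /dev/sdb1  /var/lib/ceph/osd/ceph-0  xfs  rw,noatime,attr2,inode64,noquota  0  2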
Running Kraken 11.2.0 on Ubuntu 16.04. All testing has been done with a replication level of 2. We’re using rados bench to shotgun a lot of objects into our test pools, specifically following these two steps:
ceph osd pool create poolofhopes 2048 2048 replicated "" replicated_ruleset 500000000
rados -p poolofhopes bench -t 32 -b 20000 30000000 write --no-cleanup
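For completeness, a minimal sketch of how the object count can be logged while the bench runs; the log file name and one-minute interval are arbitrary, and any per-pool stats command would do:

  while true; do
      # rados df prints one row per pool, including its object count
      echo "$(date +%s) $(rados df | grep poolofhopes)" >> poolofhopes-objects.log
      sleep 60
  done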
We leave the bench running for days at a time and watch the "objects in cluster" count. We see performance that starts off decent and degrades over time. There’s a very brief initial surge in write performance, after which things settle into the downward-trending pattern:
- 1st hour: 2 million objects/hour
- 20th hour: 1.9 million objects/hour
- 40th hour: 1.7 million objects/hour
This performance is not encouraging for us. We need to be writing 40 million objects per day (20 million files at 2x replication), which works out to roughly 1.7 million objects per hour. The rates we’re seeing at the 40th hour of our bench would be sufficient to achieve that. Those write rates are still falling, though, and we’re only at a fraction of the number of objects in cluster that we need to handle. So the trend in performance suggests we shouldn’t count on having the write performance we need for much longer.
If we repeat the process of creating a new pool and running the bench, the same pattern holds: good initial performance that gradually degrades.
[caption: 90 million objects written to a brand-new, pre-split pool (poolofhopes). There are already 330 million objects on the cluster in other pools.]
Our working theory is that the degradation over time may be related to inode or dentry lookups that miss cache and lead to additional disk reads and seek activity. There’s a suggestion that filestore directory splitting may exacerbate that problem, as additional/longer disk seeks occur depending on which XFS allocation group things land in. We have found pre-split pools useful in one major way: they avoid the periods of near-zero write performance that we have put down to the active splitting of directories (the "thundering herd" effect). The overall downward curve seems to remain the same whether we pre-split or not.
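As far as we understand it, the filestore settings below are what govern splitting: a subdirectory is split once it holds more than filestore_split_multiple * abs(filestore_merge_threshold) * 16 files, and a negative merge threshold disables merging so a pool created with expected_num_objects stays pre-split. The values here are illustrative, not a recommendation:

  [osd]
  # illustrative values: split at 2 * 10 * 16 = 320 files per subdirectory,
  # and never merge subdirectories back together
  filestore merge threshold = -10
  filestore split multiple = 2

A quick way to sanity-check the cache-miss side of the theory on an OSD node (standard Linux tools, nothing Ceph-specific):

  slabtop -o | grep -E 'dentry|xfs_inode'   # how large the dentry/inode slabs are
  sysctl vm.vfs_cache_pressure              # values >100 reclaim dentries/inodes more aggressively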
The thundering herd seems to be kept in check by an appropriate pre-split. Bluestore may or may not be a solution, but given our fairly tight timeline, uncertainty about its stability doesn't recommend it to us. Right now our big question is “how can we avoid the gradual degradation in write performance over time?”.
Thank you, Patrick