That's interesting, Mark. It would be great if anyone has a definitive
answer on the potential syncfs-related downside of caching a lot of
inodes. A lot of our testing so far has been on the assumption that more
cached inodes is a pure good.

On Tue, May 2, 2017 at 9:19 AM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
> I used to advocate that users favor dentry/inode cache, but it turns out
> that it's not necessarily a good idea if you are also using syncfs. When
> syncfs is used, the kernel will iterate through all cached inodes, rather
> than just dirty inodes. With high numbers of cached inodes, it can impact
> performance enough that it ends up being a problem. See Sage's post here:
>
> http://www.spinics.net/lists/ceph-devel/msg25644.html
>
> I don't remember if we ended up ripping syncfs out completely. Bluestore
> ditches the filesystem, so we don't have to deal with this anymore
> regardless. It's something to be aware of though.
>
> Mark
>
> On 05/02/2017 07:24 AM, George Mihaiescu wrote:
>>
>> Hi Patrick,
>>
>> You could add more RAM to the servers, which probably would not
>> increase the cost too much.
>>
>> You could change the swappiness value, or use something like
>> https://hoytech.com/vmtouch/ to pre-cache inode entries.
>>
>> You could tarball the smaller files before loading them into Ceph,
>> maybe.
>>
>> How are the ten clients accessing Ceph, by the way?
>>
>> On May 1, 2017, at 14:23, Patrick Dinnen <pdinnen@xxxxxxxxx> wrote:
>>
>>> One additional detail: we also did filestore testing using Jewel and
>>> saw substantially similar results to those on Kraken.
>>>
>>> On Mon, May 1, 2017 at 2:07 PM, Patrick Dinnen <pdinnen@xxxxxxxxx> wrote:
>>>
>>>     Hello Ceph-users,
>>>
>>>     Florian has been helping with some issues on our proof-of-concept
>>>     cluster. Thanks for the replies so far; I wanted to jump in with
>>>     some extra details.
>>>
>>>     All of our testing has been with scrubbing turned off, to remove
>>>     that as a factor.
>>>
>>>     Our use case requires a Ceph cluster to indefinitely store ~10
>>>     billion files of 20-60KB each. We’ll begin with 4 billion files
>>>     migrated from a legacy storage system. Ongoing writes will be
>>>     handled by ~10 client machines and come in at a fairly steady
>>>     10-20 million files/day. Every file (excluding the legacy 4
>>>     billion) will be read once by a single client within hours of its
>>>     initial write to the cluster. Future read requests will come from
>>>     a single server, with a long-tail distribution: popular files will
>>>     be read thousands of times a year, while most will rarely or never
>>>     be read.
>>>
>>>     Our “production” design has 6 nodes and 24 OSDs (expandable to 48
>>>     OSDs), with SSD journals at a 1:4 ratio to HDDs. Each node looks
>>>     like this:
>>>
>>>     * 2 x E5-2660 8-core Xeons
>>>     * 64GB RAM DDR-3 PC1600
>>>     * 10Gb ceph-internal network (SFP+)
>>>     * LSI 9210-8i controller (IT mode)
>>>     * 4 x OSD 8TB HDDs, mix of two types
>>>       o Seagate ST8000DM002
>>>       o HGST HDN728080ALE604
>>>       o Mount options = xfs (rw,noatime,attr2,inode64,noquota)
>>>     * 1 x SSD journal, Intel 200GB DC S3700
>>>
>>>     Running Kraken 11.2.0 on Ubuntu 16.04. All testing has been done
>>>     with a replication level of 2. We’re using rados bench to shotgun
>>>     a lot of files into our test pools, specifically following these
>>>     two steps:
>>>
>>>     ceph osd pool create poolofhopes 2048 2048 replicated "" replicated_ruleset 500000000
>>>     rados -p poolofhopes bench -t 32 -b 20000 30000000 write --no-cleanup
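For reference alongside the numbers below, one rough way to sample how the
objects-in-cluster count grows during such a run is a loop like the
following. This is only a sketch, not part of the original setup: the pool
name matches the command above, but the one-hour interval and the log path
are arbitrary choices, and it assumes admin credentials on the node.

    # Append a timestamped per-pool object count once an hour, so the
    # write rate over time can be plotted afterwards.
    while true; do
        echo "=== $(date -u '+%Y-%m-%dT%H:%M:%SZ') ==="
        rados df -p poolofhopes     # objects and space used in this pool
        sleep 3600
    done >> ~/poolofhopes-growth.log

Watching ceph -s at the same time gives the cluster-wide client op/s for
comparison.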
>>>     We leave the bench running for days at a time and watch the
>>>     objects-in-cluster count. We see performance that starts off
>>>     decent and degrades over time. There’s a very brief initial surge
>>>     in write performance, after which things settle into a downward
>>>     trend:
>>>
>>>     1st hour  - 2 million objects/hour
>>>     20th hour - 1.9 million objects/hour
>>>     40th hour - 1.7 million objects/hour
>>>
>>>     This performance is not encouraging for us. We need to be writing
>>>     40 million objects per day (20 million files, each stored twice
>>>     for replication). The rates we’re seeing at the 40th hour of our
>>>     bench would be sufficient to achieve that, but those write rates
>>>     are still falling, and we’re only at a fraction of the number of
>>>     objects the cluster will eventually need to hold. So the trend in
>>>     performance suggests we shouldn’t count on having the write
>>>     performance we need for long.
>>>
>>>     If we repeat the process of creating a new pool and running the
>>>     bench, the same pattern holds: good initial performance that
>>>     gradually degrades.
>>>
>>>     https://postimg.org/image/ovymk7n2d/
>>>     [caption: 90 million objects written to a brand-new, pre-split
>>>     pool (poolofhopes). There are already 330 million objects on the
>>>     cluster in other pools.]
>>>
>>>     Our working theory is that the degradation over time may be
>>>     related to inode or dentry lookups that miss the cache and lead
>>>     to additional disk reads and seek activity. There’s a suggestion
>>>     that filestore directory splitting may exacerbate that problem,
>>>     as additional/longer disk seeks occur depending on what lands in
>>>     which XFS allocation group. We have found pre-split pools useful
>>>     in one major way: they avoid the periods of near-zero write
>>>     performance that we have put down to the active splitting of
>>>     directories (the "thundering herd" effect). The overall downward
>>>     curve seems to remain the same whether we pre-split or not.
>>>
>>>     So the thundering herd seems to be kept in check by an
>>>     appropriate pre-split. Bluestore may or may not be a solution,
>>>     but uncertainty about its stability within our fairly tight
>>>     timeline doesn’t recommend it to us. Right now our big question
>>>     is: "how can we avoid the gradual degradation in write
>>>     performance over time?"
>>>
>>>     Thank you, Patrick
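On the open question at the top of the thread, the size of the cached
inode/dentry population and the knob that controls how aggressively the
kernel reclaims it can both be inspected directly on an OSD node. A
minimal sketch, assuming XFS filestores on a stock Ubuntu 16.04 kernel;
the value in the last line is purely illustrative:

    # How many XFS inodes and dentries are currently cached
    # (slab name, then active object count):
    sudo grep -E '^(xfs_inode|dentry)' /proc/slabinfo | awk '{print $1, $2}'

    # Current reclaim bias. The default is 100; values above 100 make the
    # kernel reclaim dentries/inodes more aggressively, values below 100
    # make it hold on to them longer.
    sysctl vm.vfs_cache_pressure
    sudo sysctl -w vm.vfs_cache_pressure=200   # illustrative, not a recommendation

Whether a larger or a smaller inode cache wins here depends on the syncfs
behaviour Mark describes, so any change would need to be measured against
the same bench rather than assumed.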
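On the pre-split side: for filestore, the point at which a PG subdirectory
splits is governed by two OSD options, and the 500000000 passed to
ceph osd pool create above is the expected_num_objects hint that lets the
splitting happen at pool-creation time instead of mid-write (it only takes
effect together with a negative merge threshold). A sketch of how to check
what a running OSD is actually using; osd.0 is just an example id, and the
numbers in the comment are the defaults, not a recommendation:

    # Run on an OSD host, against the admin socket:
    sudo ceph daemon osd.0 config get filestore_merge_threshold
    sudo ceph daemon osd.0 config get filestore_split_multiple

    # A subdirectory splits once it holds roughly
    #   filestore_split_multiple * abs(filestore_merge_threshold) * 16
    # files, i.e. 2 * 10 * 16 = 320 files with the defaults. A negative
    # merge threshold (e.g. "filestore merge threshold = -40" under [osd])
    # disables merging and, with expected_num_objects set, pre-splits the
    # directories when the pool is created.

That is consistent with the observation above that pre-splitting removes
the stalls but not the gradual decline, which fits the cache-miss theory
rather than the splits themselves.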