Hello,

this has cropped up before; see my thread "Bluestore caching, flawed by
design?" for starters, if you haven't already.

I'll have to build a new Ceph cluster next year and am also less than
impressed with the choices at this time:

1. Bluestore is the new shiny, filestore is going to die (and already did
   with regard to EXT4, which in my experience and use case was MUCH
   better than XFS). Never mind the very bleeding-edge vibe I'm still
   getting from bluestore two major releases in.

2. Caching is currently not up to snuff. To get the same level of caching
   as with the pagecache AND be safe in heavy recovery/rebalance
   situations, one needs a lot more RAM; 30% more at least would be my
   guess (see the sketch right after this list for the knob involved).

3. Small IOPS (for my use case) can't be guaranteed to behave as they did
   with filestore and journals (some IOPS will go to disk directly).

4. Cache tiers are deprecated, despite working beautifully for my use case
   and probably for yours as well. The proposed alternatives leave me
   cold; I tried them all a year ago. LVM dm-cache was a nightmare to
   configure (documentation and complexity) and performed abysmally
   compared to bcache. While bcache is fast (I'm using it on 2 non-Ceph
   systems), it has other issues: occasional load spikes (often at
   near-idle times), it can be crashed by querying its sysfs counters, and
   it doesn't honor IO priorities...
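For what it's worth, the knob behind point 2 is the per-OSD BlueStore
cache, which on HDD OSDs defaults to a measly 1 GiB. A minimal sketch,
assuming Luminous/Mimic-era HDD OSDs; the 8 GiB value is purely
illustrative and has to fit your actual RAM budget and OSD count per host:

    # /etc/ceph/ceph.conf on the OSD hosts (illustrative values only)
    [osd]
    # BlueStore caches inside the OSD process instead of relying on the
    # kernel page cache; the HDD default of 1 GiB per OSD means a large
    # hot set never sticks. 8 GiB is an example, not a recommendation.
    bluestore_cache_size_hdd = 8589934592

The OSDs need a restart to pick that up, and later releases add
osd_memory_target, which caps the whole OSD process rather than just the
cache. Either way you pay for that memory on every OSD, which is exactly
the "a lot more RAM" from point 2 above.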
In short, if you're willing to basically treat your Ceph cluster as an
appliance that can be disposed of after 5 years, instead of a continuously
upgraded and expanded storage solution, you can do just that: install
filestore OSDs and/or cache tiering now and never upgrade (at least not to
a version where they become unsupported).

In my case I'm currently undecided between doing something like the above
or going all-SSD (if that's affordable; maybe the RAM savings will help)
and thus bypassing at least the bluestore performance issues.

Regards,

Christian

On Tue, 2 Oct 2018 19:28:13 +0200 jesper@xxxxxxxx wrote:

> Hi.
>
> Based on some recommendations we have set up our CephFS installation
> using bluestore*. We're trying to get a strong replacement for a "huge"
> xfs+NFS server - 100TB-ish size.
>
> The current setup is a sizeable Linux host with 512GB of memory, one
> large Dell MD1200 or MD1220 (100TB), plus a Linux kernel NFS server.
>
> Since our "hot" dataset is < 400GB we can actually serve the hot data
> directly out of the host page cache and never really touch the "slow"
> underlying drives, except when new bulk data are written, where a Perc
> with BBWC is consuming the data.
>
> In the CephFS + Bluestore world, Ceph is "deliberately" bypassing the
> host OS page cache, so even when we have 4-5 x 256GB memory** in the OSD
> hosts it is really hard to create a synthetic test where the hot data
> does not end up being read from the underlying disks. Yes, the client
> side page cache works very well, but in our scenario we have 30+ hosts
> pulling the same data over NFS.
>
> Is bluestore just a "bad fit", and would filestore "do the right thing"?
> Is the recommendation to make an SSD "overlay" on the slow drives?
>
> Thoughts?
>
> Jesper
>
> * Bluestore should be the new and shiny future - right?
> ** Total mem 1TB+

-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Rakuten Communications