We’re prototyping a native SMR object store now to run alongside Ceph, which has been our only object store backend for the last 3 years and has been problematic for us on some metrics. I believe trying to use SMR drives with a file system
in the architecture (as in Ceph) is a non-starter; the solution is really to treat them like tape drives. A strictly sequential access model across the volume is everything, so the tape metaphor matters all the way down to how you handle free space collection.
Unfortunately, exposing this new architecture through a libRADOS API isn’t something we’re interested in doing, since we don’t use Ceph for anything above the libRADOS layer of the stack, but our code will be open source if someone
else wants to take a crack at it. It wouldn’t be easy: our implementation reuses the sequential-access (tape) command set of the SCSI protocol, and our host transport is iSER (iSCSI Extensions for RDMA), so it’s not a natural overlay for libRADOS, to put it mildly,
particularly since we’re doing erasure coding (EC) well above this layer of our application stack.
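To make the tape metaphor concrete, here is a minimal toy sketch in Python (purely illustrative, with made-up names, and not our actual SCSI/iSER implementation) of sequential-only placement plus tape-style free space collection: zones are append-only, deletes just drop index entries, and space only comes back by copying a zone's live blocks forward and then resetting the whole zone.

class Zone:
    """Toy stand-in for one SMR zone: append-only, reset only as a whole."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.write_ptr = 0                 # strictly increasing, like a tape head
        self.blocks = {}                   # offset -> (object_id, data)

    def append(self, obj_id, data):
        if self.write_ptr + len(data) > self.capacity:
            return None                    # zone full; caller moves to an empty zone
        off = self.write_ptr
        self.blocks[off] = (obj_id, data)
        self.write_ptr += len(data)
        return off

    def reset(self):                       # the only way space is ever reclaimed
        self.blocks.clear()
        self.write_ptr = 0


class TapeLikeStore:
    """Sequential-only placement with tape-style free space collection."""
    def __init__(self, n_zones=8, zone_size=1 << 20):
        self.zones = [Zone(zone_size) for _ in range(n_zones)]
        self.active = 0                    # writes only ever go to the active zone
        self.index = {}                    # object_id -> (zone_no, offset)

    def put(self, obj_id, data):
        assert len(data) <= self.zones[self.active].capacity   # toy: objects fit a zone
        off = self.zones[self.active].append(obj_id, data)
        if off is None:                    # active zone full: roll to an empty one
            self.active = next(i for i, z in enumerate(self.zones) if z.write_ptr == 0)
            off = self.zones[self.active].append(obj_id, data)
        self.index[obj_id] = (self.active, off)

    def delete(self, obj_id):
        self.index.pop(obj_id, None)       # never an in-place overwrite on the media

    def reclaim(self, zone_no):
        """Copy the victim zone's live blocks forward, then reset the zone."""
        assert zone_no != self.active      # never collect the zone being written
        victim = self.zones[zone_no]
        for off, (obj_id, data) in sorted(victim.blocks.items()):
            if self.index.get(obj_id) == (zone_no, off):   # block still live?
                self.put(obj_id, data)                     # rewritten sequentially
        victim.reset()

The only point of the sketch is that nothing ever writes behind a write pointer, which is the one access pattern SMR media (like tape) rewards.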
Steve Cranage
Principal Architect, Co-Founder
DeepSpace Storage
Dear Janne,
On 06.05.20 at 09:18, Janne Johansson wrote:
> On Wed, 6 May 2020 at 00:58, Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>
> Dear Cephalopodians,
> seeing the recent moves of major HDD vendors to sell SMR disks targeted for use in consumer NAS devices (including RAID systems),
> I got curious and wonder what the current status of SMR support in Bluestore is.
> Of course, I'd expect disk vendors to give us host-managed SMR disks for data center use cases (and to tell us when they actually do so...),
> but in that case, Bluestore surely needs some new intelligence for best performance in the shingled ages.
>
>
> I've only done filestore on SMRs, and it did work for a while in normal cases for us, but it broke down horribly as soon as recovery needed to be done.
> I have no idea whether filestore was doing the worst possible thing for SMRs, whether bluestore will do better, or whether patches are going to help bluestore become useful, but all in all, I can't say anything else to people wanting to experiment with SMRs than
> "if you must use SMRs, make sure you test the most evil of corner cases".
Thanks for the input and especially for sharing the hands-on experience; that's very helpful (and "expensive" to gather)!
After my "small-scale" experiences, I would indeed have expected exactly that. My sincere hope is that this hardware can be made usable by employing Copy-on-Write semantics
to align writes into larger, consecutive batches.
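Roughly what I have in mind, as a toy Python illustration only (this is not how BlueStore currently works, and the names and buffer sizes are made up): small object writes are never applied in place; new versions are buffered and flushed as one long sequential append, and only an in-memory index is remapped afterwards.

class AppendOnlyDevice:
    """Stand-in for an SMR zone: data can only be appended at the write pointer."""
    def __init__(self):
        self.data = bytearray()

    def append(self, buf):
        off = len(self.data)
        self.data.extend(buf)
        return off


class CowBatcher:
    """Copy-on-write batching: buffer small writes, flush one large sequential one."""
    def __init__(self, device, batch_bytes=64 << 20):
        self.device = device               # anything with append(bytes) -> offset
        self.batch_bytes = batch_bytes     # flush threshold, e.g. one zone's worth
        self.pending = {}                  # object_id -> newest buffered version
        self.pending_size = 0
        self.index = {}                    # object_id -> (offset, length) on device

    def write(self, obj_id, data):
        # An overwrite only replaces the buffered copy; the old on-disk version
        # is left untouched until space reclamation (the copy-on-write part).
        self.pending_size += len(data) - len(self.pending.get(obj_id, b""))
        self.pending[obj_id] = data
        if self.pending_size >= self.batch_bytes:
            self.flush()

    def flush(self):
        # One long, consecutive write instead of many scattered small ones.
        if not self.pending:
            return
        off = self.device.append(b"".join(self.pending.values()))
        for obj_id, data in self.pending.items():
            self.index[obj_id] = (off, len(data))
            off += len(data)
        self.pending.clear()
        self.pending_size = 0


dev = AppendOnlyDevice()
batcher = CowBatcher(dev, batch_bytes=4096)
for i in range(1000):
    batcher.write(f"obj-{i}", b"x" * 64)   # many tiny writes...
batcher.flush()                            # ...reach the device as a few big appends

Ideally recovery and backfill would funnel through the same path, so that even a storm of small object rewrites reaches the disk as long linear writes.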
>
> As you noted, one can easily get down to < 1 MB/s with SMRs by doing anything other than long linear writes, and you don't want to be in a place where several hundred TBs of data are doing recovery at that speed.
>
> To me, SMR is a con: it's a trick to sell cheap crap to people who can't or won't test properly. It doesn't matter whether it's Ceph recovery/backfill, btrfs deletes or someone's NAS RAID sync job that places the final straw on the camel's back and breaks it;
> the point is that filesystems do lots more than just easy, nice, long linear writes. No matter if it is fsck, defrags or Ceph PG splits/reshardings, there will be disk meta-operations that need to be done, and they involve tons of random small writes; SMR drives will
> punish you for them when you need the drive up the most. 8-(
>
> If I had some very special system which used cheap disks to pretend to be a tape device and only did 10 GB-sized reads/writes like a tape would, then I could see a use case for SMR.
I agree that in many cases SMR is not the correct hardware to use, and never will be. Indeed, I also agree that in most cases the "trick to sell cheap crap to people who can't or won't test properly"
applies, even more so with drive-managed SMR, which in some cases gives you zero control and maximum frustration.
Still, my hope would be that, especially for archiving purposes (think of a pure Ceph-RGW cluster fed with Restic, Duplicati or similar tools), we can make good use of the cheaper hardware
(but then, this would of course need to be host-managed SMR, and the file system would have to know about it). Currently, Dropbox is the only one I know of actively doing this
(and I guess they can do it easily, since they deduplicate data and probably rarely delete), and they seem to have essentially developed their own file system to deal with it.
It would be cool to have this with Ceph. One might also think about having a separate, SMR-backed pool for "colder" objects
(likely coupled with SSDs / NVMes for WAL / BlockDB); a rough sketch of such a layout is further below. In short, we'd never even think about using it with CephFS in our HPC cluster
(unless some admin-controllable write-once-read-many use cases evolve, which we could think about for centrally managed high-energy physics data),
or with RBD in our virtualization cluster.
We're more interested in it for our backup cluster, which mostly sees data ingest and where the chunking into larger batches is even done client-side (Restic, Duplicati etc.).
Of course, your point about resharding and PG splits fully applies, so this certainly needs careful development (and testing!) to reduce the randomness as far as possible
(if we want to make use of this hardware for the use cases it may fit).
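And to make the "cold pool" idea from above concrete, here is a rough, untested sketch of how such a pool could be carved out via a dedicated CRUSH device class; the OSD ids, pool name and PG counts are made-up placeholders, and the script just drives the standard ceph CLI from Python:

import subprocess

def ceph(*args):
    # Thin wrapper around the ceph CLI; raises if a command fails.
    subprocess.run(["ceph", *args], check=True)

SMR_OSDS = ["osd.10", "osd.11", "osd.12"]   # placeholder ids of SMR-backed OSDs

# Device classes are auto-assigned (hdd/ssd/nvme), so clear them first and
# tag the SMR OSDs with a custom "smr" class instead.
ceph("osd", "crush", "rm-device-class", *SMR_OSDS)
ceph("osd", "crush", "set-device-class", "smr", *SMR_OSDS)

# A replicated CRUSH rule that only places data on the "smr" class,
# with host as the failure domain.
ceph("osd", "crush", "rule", "create-replicated", "cold-smr", "default", "host", "smr")

# A separate pool for cold objects (e.g. an RGW data pool) using that rule.
ceph("osd", "pool", "create", "cold.rgw.data", "128", "128", "replicated", "cold-smr")

The WAL / BlockDB for those OSDs would still live on SSD / NVMe, but that is decided when the OSDs are created (e.g. with ceph-volume) and is orthogonal to the pool layout above.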
Cheers and thanks for your input,
Oliver
>
> --
> May the most significant bit of your life be positive.