Re: State of SMR support in Ceph?

Dear Janne,

On 06.05.20 at 09:18, Janne Johansson wrote:
On Wed, 6 May 2020 at 00:58, Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:

    Dear Cephalopodians,
    seeing the recent moves of major HDD vendors to sell SMR disks targeted for use in consumer NAS devices (including RAID systems),
    I got curious and wonder what the current status of SMR support in Bluestore is.
    Of course, I'd expect disk vendors to give us host-managed SMR disks for data center use cases (and to tell us when they actually do so...),
    but in that case, Bluestore surely needs some new intelligence for best performance in the shingled ages.


I've only run filestore on SMRs, and it worked for a while in normal cases for us, but it broke down horribly as soon as recovery needed to be done.
I have no idea whether filestore was doing the worst possible thing for SMRs, whether bluestore will do better, or whether patches are going to help bluestore become useful, but all in all, I can't say anything to people wanting to experiment with SMRs other than "if you must use SMRs, make sure you test the most evil of corner cases".

Thanks for the input and especially the hands-on experience! That's very helpful (and "expensive" to gather), so thanks for sharing!

After my "small-scale" experiences, I would indeed have expected exactly that. My sincere hope is that this hardware will become useable by making use of Copy-on-Write semantics
to align writes into larger, consecutive batches.
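
To make that hope a bit more concrete, here is a minimal, purely illustrative Python sketch of the write pattern I have in mind: small random application writes get staged and then flushed as one large sequential append into the currently open zone, with an index mapping each logical key to its newest on-disk location (older extents are simply superseded, CoW-style, and reclaimed later by resetting whole zones). All names and sizes (ZoneWriter, ZONE_SIZE, BATCH_SIZE) are made up for the example; this is of course not Bluestore code.

# Illustrative sketch only: how CoW-style batching could turn small random
# writes into large sequential zone appends on host-managed SMR.
# All names (ZoneWriter, ZONE_SIZE, ...) are hypothetical.

ZONE_SIZE = 256 * 1024 * 1024   # typical SMR zone size, 256 MiB
BATCH_SIZE = 4 * 1024 * 1024    # flush once 4 MiB of dirty data has accumulated


class ZoneWriter:
    def __init__(self):
        self.zone = 0            # currently open zone
        self.write_pointer = 0   # next sequential offset inside that zone
        self.staged = []         # (logical_key, payload) waiting to be flushed
        self.staged_bytes = 0
        self.index = {}          # logical_key -> (zone, offset, length), newest wins

    def write(self, logical_key, payload):
        """Stage a small random write; nothing is written in place."""
        self.staged.append((logical_key, payload))
        self.staged_bytes += len(payload)
        if self.staged_bytes >= BATCH_SIZE:
            self.flush()

    def flush(self):
        """Emit one large, purely sequential append covering all staged writes."""
        if not self.staged:
            return
        if self.write_pointer + self.staged_bytes > ZONE_SIZE:
            self.zone += 1          # open the next zone instead of rewriting this one
            self.write_pointer = 0
        offset = self.write_pointer
        for key, payload in self.staged:
            # A real implementation would issue a single zone append here.
            self.index[key] = (self.zone, offset, len(payload))
            offset += len(payload)
        self.write_pointer = offset
        self.staged.clear()
        self.staged_bytes = 0
        # Superseded extents (older index entries) would be reclaimed later
        # by garbage-collecting and resetting whole zones.


if __name__ == "__main__":
    zw = ZoneWriter()
    for i in range(2000):
        zw.write(f"obj-{i % 100}", b"x" * 4096)   # 4 KiB "random" writes
    zw.flush()
    print(len(zw.index), "live extents, write pointer at", zw.write_pointer)

The point is only that the drive never sees an in-place overwrite; everything lands as large appends at the zone's write pointer.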


As you noted, one can easily drop below 1 MB/s with SMRs by doing anything other than long linear writes, and you don't want to be in a place where several hundred TB of data are doing recovery at that speed.

To me, SMR is a con, it's a trick to sell cheap crap to people who can't or won't test properly. It doesn't matter whether it's ceph recovery/backfill, btrfs deletes or someone's NAS RAID sync job that places the final straw on the camel's back and breaks it; the fact is that filesystems do lots more than just easy, nice, long linear writes. No matter whether it is fsck, defrags or ceph PG splits/reshardings, there will be disk meta-operations that need to be done which include tons of random small writes, and SMR drives will punish you for them when you need the drive up the most. 8-(

If I had some very special system which used cheap disks to pretend to be a tape device and only did 10G-sized reads/writes like a tape would, then I could see a use case for SMR.

I agree that in many cases SMR is not the correct hardware to use, and never will be. Indeed, I also agree that in most cases the "trick to sell cheap crap to people who can't or won't test properly"
applies, even more so with drive-managed SMR, which in some cases gives you zero control and maximum frustration.

Still, my hope would be that especially for archiving purposes (think of a pure Ceph-RGW cluster fed with Restic, Duplicati or similar tools), we can make good use of the cheaper hardware
(but then, this would of course need to be host-managed SMR, and the file system would need to know about it). I currently only know of Dropbox actively doing that
(and I guess they can do so easily, since they deduplicate data and probably rarely delete), and they seem to have developed essentially their own file system to deal with this.
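
To illustrate the ingest side of such an archive cluster: tools like Restic or Duplicati already pack many small files into large blobs client-side and push them over S3, so RGW (and ultimately the OSDs) would mostly see large object writes. A quick boto3 sketch, with a hypothetical RGW endpoint, bucket and credentials:

import boto3

# Hypothetical RGW endpoint and credentials; a real Restic/Duplicati setup
# would be configured with the same kind of S3 target.
s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.com:7480",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

s3.create_bucket(Bucket="backup-archive")   # fails harmlessly if it already exists and is ours

# Backup tools pack many small files into large "pack" blobs before upload,
# so the cluster mostly sees big sequential object writes like this one.
pack_blob = b"\0" * (16 * 1024 * 1024)      # stand-in for a 16 MiB pack file
s3.put_object(Bucket="backup-archive", Key="data/pack-0001", Body=pack_blob)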

It would be cool to have this with Ceph. You might also think about having a separate pool for "colder" objects which is SMR-backed
(likely coupled with SSDs / NVMes for WAL / BlockDB); a rough device-class sketch follows below. To be clear, we'd never even think about using it with CephFS in our HPC cluster
(unless some admin-controllable write-once-read-many use cases evolve, which we could consider for centrally managed high-energy physics data),
or with RBD in our virtualization cluster.
We're more interested in it for our backup cluster, which mostly sees data ingest, and where the chunking into larger batches is even done client-side (Restic, Duplicati etc.).
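
For the "colder pool on SMR-backed OSDs" part, I think the existing device-class mechanism would already get us most of the way. Here is a rough sketch; the OSD ids, class name, pool name and PG counts are made up, and WAL/BlockDB on NVMe would be handled at OSD creation time (e.g. via ceph-volume's --block.db option):

import subprocess

def ceph(*args):
    """Thin wrapper around the ceph CLI (sketch only; handle errors properly in real use)."""
    print("+", "ceph", *args)
    subprocess.run(["ceph", *args], check=True)

# Hypothetical OSD ids backed by host-managed SMR drives.
smr_osds = ["osd.10", "osd.11", "osd.12"]

# Tag the SMR OSDs with a custom device class so CRUSH can tell them apart.
for osd in smr_osds:
    ceph("osd", "crush", "rm-device-class", osd)
    ceph("osd", "crush", "set-device-class", "smr", osd)

# CRUSH rule that only places data on the "smr" class, failure domain = host.
ceph("osd", "crush", "rule", "create-replicated", "cold_smr", "default", "host", "smr")

# A dedicated pool for cold / archive objects, bound to that rule.
ceph("osd", "pool", "create", "rgw-archive", "128", "128", "replicated", "cold_smr")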

Of course, your point about resharding and PG splits fully applies, so this for sure needs careful development (and testing!) to reduce the randomness as far as possible
(if we want to make use of this hardware for the use cases it may fit).

Cheers and thanks for your input,
	Oliver


--
May the most significant bit of your life be positive.



_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
