On 04/02/2021 12:08, Lionel Bouton wrote:
> Hi,
>
> On 04/02/2021 at 08:41, Loïc Dachary wrote:
>> Hi Federico,
>>
>> On 04/02/2021 05:51, Federico Lucifredi wrote:
>>> Hi Loïc,
>>> I am intrigued, but am missing something: why not use RGW and store
>>> the source code files as objects? RGW has native compression and can
>>> take care of that behind the scenes.
>> Excellent question!
>>> Is the desire to use RBD only due to minimum allocation sizes?
>> I *assume* that since RGW does have
> If I understand correctly I assume that you are missing a "not" here.

Yes :-)

>> specific strategies to take advantage of the fact that objects are
>> immutable and will never be removed:
>>
>> * It will be slower to add artifacts in RGW than in an RBD image + index
>> * The metadata in RGW will be larger than with an RBD image + index
>>
>> However I have not verified this and if you have an opinion I'd love
>> to hear it :-)
> Reading the exchanges I believe you are focused on the reading speed and
> space efficiency. Did you consider the writing speed with such a scheme?

Yes, and the goal is to achieve 100MB/s write speed.

> Depending on how you store the index, you could block on each write and
> would have to consider Ceph latency (i.e. if your writer fails, recovering
> can be tricky without having waited for writes to update your index).
> With your 100TB target and 3kB artifact size, a 1ms latency and blocking
> writes translate to a whole year spent writing. If you manage to get to
> a 0.1ms latency (not sure if this is achievable with Ceph yet) you end
> up with a month. Depending on how you plan to populate the store this
> could be a problem. You'll have to consider whether the artifact write
> rate can become a bottleneck during normal use too.
>
> You can probably design a scheme supporting storing multiple values in a
> single write, but it seems to add complexity which might come with
> unwanted performance problems and space use of its own.
>
> I'm not familiar with space efficiency on modern Ceph versions (still
> using filestore on Hammer...), do you have a ballpark estimation of the
> cost of storing artifacts as simple objects? Unless you have already
> worked out the whole design, that would be my first concern: it could
> end up being an inefficiency worth the trade-off for simplicity.

I did not measure the overhead and I'm assuming it is significant enough
to justify the packing RGW implements.

> I'm unfamiliar with the gateway and how well and easily it can scale, so
> my first impulse was to bypass RGW and use the librados interface
> directly.

Using librados directly would work, but the caller would have to implement
packing in the same way RBD or RGW does. It is a lot of work to do that
properly.

> You can definitely begin with an RGW solution as it is a bit
> easier to implement and switch to librados later if RGW ever becomes a
> bottleneck. If you need speed either writing or reading, both RGW and
> librados would work: you can have as many clients managing objects in
> parallel without any lock on writes on your end to manage. This is a
> very simple storage design and simplicity can't be overrated :-)
> The only potential downside (in addition to space inefficiency) that I
> can see would be walking the list of objects. This is doable but with
> billions of them this could be very slow. Not sure if it could become a
> need given your use case though.

I'll research more and try to figure out a way to compare write/read
speed in both cases.
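For the librados side, something along these lines is what I have in mind
(a rough, untested sketch: it assumes the python-rados bindings, a local
ceph.conf and a pool named "artifacts", all of which are placeholders);
the same loop through an S3 client would give the RGW numbers to compare
against:

    #!/usr/bin/env python3
    # Untested sketch: time N blocking writes of ~3kB objects with librados.
    # Assumes the python-rados bindings, /etc/ceph/ceph.conf and a pool
    # named "artifacts" (placeholders, adjust to taste).
    #
    # Back of the envelope, matching Lionel's numbers: 100TB / 3kB is about
    # 3.3e10 artifacts; at 1ms per blocking write that is ~385 days of pure
    # writing, at 0.1ms about 38 days.
    import os
    import time

    import rados

    N = 10_000
    ARTIFACT = os.urandom(3 * 1024)  # ~3kB, the average artifact size

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx('artifacts')
        try:
            start = time.monotonic()
            for i in range(N):
                # one blocking write per artifact: every write pays the
                # full cluster round-trip latency
                ioctx.write_full('artifact-%d' % i, ARTIFACT)
            elapsed = time.monotonic() - start
            print('%d writes in %.1fs: %.2f ms/write, %.2f MB/s' % (
                N, elapsed, 1000 * elapsed / N,
                N * len(ARTIFACT) / elapsed / 1e6))
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()

A single blocking writer like this is of course the worst case Lionel
describes; several writers in parallel, or packing many artifacts per
write, would be needed to get anywhere near 100MB/s with 3kB artifacts.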
> For reference, I just found the results of a test with a moderately
> comparable test set:
> https://www.redhat.com/en/blog/scaling-ceph-billion-objects-and-beyond.
> I didn't finish reading it yet but the volume seems comparable to your
> use case, although with 64kB objects.

That's a significant difference but the benchmark results are still very
useful.

> Note: I've seen questions about 100TB RBDs in the thread. We use such
> beasts in two clusters: they work fine but are a pain when deleting or
> downsizing them. During one downsize on the slowest cluster we had to
> pause the operation manually (SIGSTOP to the rbd process) during periods
> of high load and let it continue afterwards. This took about a week (but
> the cluster was admittedly underpowered for its use at the time).

Interesting! In this use case having a single RBD image does not seem to
be a good idea. Ceph is designed to scale out, but RBD images are not
designed to grow indefinitely. Having multiple 1TB images sounds like a
sane tradeoff (a rough sketch of what that could look like is in the P.S.
below).

> Best regards,

Thanks for taking the time to think about this use case :-)

Cheers

> --
> Lionel Bouton

--
Loïc Dachary, Artisan Logiciel Libre
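P.S. To make the "multiple fixed-size images" idea a bit more concrete,
here is the kind of layout I have in mind: artifacts are packed into one
of K 1TB images and a small index remembers where each one landed. This
is only a rough, untested sketch; it assumes the python-rados and
python-rbd bindings and a pool named "artifacts" (placeholders), keeps
the index in memory and ignores concurrency entirely.

    #!/usr/bin/env python3
    # Untested sketch: pack ~3kB artifacts into K fixed-size (1TB) RBD
    # images and keep an index of (image, offset, length) per artifact.
    # Assumes the python-rados and python-rbd bindings and a pool named
    # "artifacts"; a real implementation would persist the index and
    # handle concurrency.
    import hashlib

    import rados
    import rbd

    K = 100               # 100 images of 1TB each, roughly the 100TB target
    IMAGE_SIZE = 1 << 40  # 1TB

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('artifacts')

    def image_name(i):
        return 'artifacts-%03d' % i

    def ensure_images():
        # create the K images on first use
        existing = set(rbd.RBD().list(ioctx))
        for i in range(K):
            if image_name(i) not in existing:
                rbd.RBD().create(ioctx, image_name(i), IMAGE_SIZE)

    write_offsets = [0] * K  # next free byte in each image
    index = {}               # artifact id -> (image number, offset, length)

    def store(artifact_id, data):
        # pick an image by hashing the artifact id, append the data to it
        # and remember where it went
        i = int(hashlib.sha256(artifact_id.encode()).hexdigest(), 16) % K
        offset = write_offsets[i]
        with rbd.Image(ioctx, image_name(i)) as image:
            image.write(data, offset)
        write_offsets[i] = offset + len(data)
        index[artifact_id] = (i, offset, len(data))

    def fetch(artifact_id):
        i, offset, length = index[artifact_id]
        with rbd.Image(ioctx, image_name(i)) as image:
            return image.read(offset, length)

Opening an image for every artifact would of course be too slow in
practice; a real writer would keep the images open and batch many
artifacts per write, which is exactly the packing work mentioned above.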