Re: Using RBD to pack billions of small files

On 04/02/2021 12:08, Lionel Bouton wrote:
> Hi,
>
> On 04/02/2021 at 08:41, Loïc Dachary wrote:
>> Hi Federico,
>>
>> On 04/02/2021 05:51, Federico Lucifredi wrote:
>>> Hi Loïc,
>>>    I am intrigued, but am missing something: why not use RGW and store the source code files as objects? RGW has native compression and can take care of that behind the scenes.
>> Excellent question!
>>>    Is the desire to use RBD only due to minimum allocation sizes?
>> I *assume* that since RGW does have
> If I understand correctly I assume that you are missing a "not" here.
Yes :-)
>
>>  specific strategies to take advantage of the fact that objects are immutable and will never be removed:
>>
>> * It will be slower to add artifacts to RGW than to an RBD image + index
>> * The metadata in RGW will be larger than with an RBD image + index
>>
>> However, I have not verified this, and if you have an opinion I'd love to hear it :-)
> Reading the exchanges, I believe you are focused on reading speed and
> space efficiency. Did you consider the writing speed with such a scheme?
Yes and the goal is to achieve 100MB/s write speed.
>
> Depending on how you store the index, you could block on each write and
> would have to consider Ceph latency (i.e. if your writer fails, recovery
> can be tricky unless you waited for each write before updating your index).
> With your 100TB target and 3 kB artifact size, a 1 ms latency with blocking
> writes translates to roughly a whole year spent writing. If you manage to get
> to a 0.1 ms latency (not sure if this is achievable with Ceph yet) you end up
> with about a month. Depending on how you plan to populate the store this could
> be a problem. You'll also have to consider whether the artifact write rate
> can become a bottleneck during normal use.
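For what it's worth, those figures check out with a tiny back-of-the-envelope
Python sketch (the 100TB target, 3 kB artifact size and the two latencies are
just the assumptions quoted above, nothing more):

    # Rough time-to-populate estimate for strictly serialized, blocking writes.
    total_bytes   = 100e12                       # 100 TB target
    artifact_size = 3e3                          # ~3 kB per artifact
    artifacts     = total_bytes / artifact_size  # ~3.3e10 objects

    for latency_s in (1e-3, 1e-4):               # 1 ms and 0.1 ms per blocking write
        days = artifacts * latency_s / 86400
        print(f"{latency_s * 1e3:.1f} ms latency -> ~{days:.0f} days")
    # ~386 days at 1 ms, ~39 days at 0.1 ms -- hence the year/month figures.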

>
> You can probably design a scheme that stores multiple values in a single
> write, but it adds complexity which might bring its own performance problems
> and space overhead.
>
> I'm not familiar with space efficiency on modern Ceph versions (still
> using filestore on Hammer...), but do you have a ballpark estimate of the
> cost of storing artifacts as simple objects? Unless you have already worked
> out the whole design, that would be my first concern: the inefficiency could
> turn out to be a trade-off worth making for the sake of simplicity.
I did not measure the overhead and I'm assuming it is significant
enough to justify RGW-implemented packing.
>
> I'm unfamiliar with the gateway and how well and easily it can scale, so
> my first impulse was to bypass RGW and use the librados interface
> directly.
Using librados directly would work but the caller would have to implement
packing in the same way RBD or RGW does. It is a lot of work to do that
properly.
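To give an idea of what "packing" involves, here is a very rough sketch on top
of the librados Python bindings. The pool name, object naming scheme, pack size
and the purely in-memory index are all illustrative assumptions; making that
index durable and crash-safe is precisely the hard part that RBD and RGW already
solve:

    import rados

    PACK_TARGET = 4 * 1024 * 1024    # pack ~4 MB of artifacts per RADOS object (assumption)

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')  # config path is an assumption
    cluster.connect()
    ioctx = cluster.open_ioctx('artifacts')                # pool name is an assumption

    index = {}           # artifact id -> (rados object, offset, length); in-memory only!
    pack_id, pack_off = 0, 0

    def put(artifact_id, data):
        """Append one immutable artifact to the current pack object."""
        global pack_id, pack_off
        obj = 'pack.%08d' % pack_id
        ioctx.write(obj, data, pack_off)        # single write at a known offset
        index[artifact_id] = (obj, pack_off, len(data))
        pack_off += len(data)
        if pack_off >= PACK_TARGET:             # current pack is full, start a new one
            pack_id, pack_off = pack_id + 1, 0

    def get(artifact_id):
        obj, off, length = index[artifact_id]
        return ioctx.read(obj, length, off)

Everything this sketch leaves out (persisting the index, concurrent writers,
recovery after a crashed writer, compression) is where the real work is.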
> You can definitely begin with an RGW solution, as it is a bit
> easier to implement, and switch to librados later if RGW ever becomes a
> bottleneck. If you need speed for either writing or reading, both RGW and
> librados would work: you can have as many clients as you like managing
> objects in parallel, without any write locks to manage on your end. This is
> a very simple storage design and simplicity can't be overrated :-)
> The only potential downside (in addition to space inefficiency) that I
> can see would be walking the list of objects. This is doable, but with
> billions of them it could be very slow. Not sure whether that would ever be
> needed in your use case though.
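As an aside on that last point, enumerating objects through librados is a plain
iteration over the pool, so with billions of entries it is inherently slow; a
minimal illustration (same hypothetical pool and config path as the sketch above):

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')  # config path is an assumption
    cluster.connect()
    ioctx = cluster.open_ioctx('artifacts')                # pool name is an assumption

    # O(number of objects): with ~3e10 objects this takes a very long time,
    # so any "list everything" need is better answered from a separate index.
    count = sum(1 for _ in ioctx.list_objects())
    print(count, "objects enumerated")

    ioctx.close()
    cluster.shutdown()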
I'll research more and try to figure out a way to compare write/read speed in both
cases.
> For reference, I just found the results of a test with a moderately
> comparable test set:
> https://www.redhat.com/en/blog/scaling-ceph-billion-objects-and-beyond.
> I haven't finished reading it yet, but the volume seems comparable to your
> use case, although with 64 kB objects.
That's a significant difference, but the benchmark results are still
very useful.
> Note: I've seen questions about 100TB RBDs in the thread. We use such
> beasts in two clusters: they work fine but are a pain to delete or
> downsize. During one downsize on the slowest cluster we had to
> pause the operation manually (SIGSTOP to the rbd process) during periods
> of high load and let it continue afterwards. This took about a week (but the
> cluster was admittedly underpowered for its use at the time).
Interesting! In this use case, having a single RBD image does not
seem to be a good idea. Ceph is designed to scale out, but RBD images
are not designed to grow indefinitely. Having multiple 1TB images sounds like
a sane tradeoff.
>
> Best regards,
Thanks for taking the time to think about this use case :-)

Cheers
> --
> Lionel Bouton

-- 
Loïc Dachary, Artisan Logiciel Libre



