On Fri, Sep 22, 2017 at 2:49 PM, Danny Al-Gaaf <danny.al-gaaf@xxxxxxxxx> wrote:
> On 22.09.2017 at 22:59, Gregory Farnum wrote:
> [..]
>> This is super cool! Is there anything written down that explains this
>> for Ceph developers who aren't familiar with the workings of Dovecot?
>> I've got some questions from going through it, but they may be very
>> dumb.
>>
>> *) Why are indexes going on CephFS? Is this just about wanting a local
>> cache, or about the existing Dovecot implementations, or something
>> else? Almost seems like you could just store the whole thing in a
>> CephFS filesystem if that's safe. ;)
>
> This is, if everything works as expected, only an intermediate step. One
> idea (https://dalgaaf.github.io/CephMeetUpBerlin20170918-librmb/#/status-3)
> is to use omap to store the index/metadata.
>
> We chose a step-by-step approach, and since we are currently not sure
> whether omap would work performance-wise, we use CephFS for now (which
> also requires no changes to Dovecot). Our current focus is on developing
> the first version of librmb, but the code to use omap is already there.
> It still needs integration, testing, and performance tuning to verify
> that it meets our requirements.
>
>> *) It looks like each email is getting its own object in RADOS, and I
>> assume those are small messages, which leads me to
>
> The mail distribution looks like this:
> https://dalgaaf.github.io/CephMeetUpBerlin20170918-librmb/#/mailplatform-mails-dist
>
> Yes, the majority of the mails are under 500 KB, but most objects are
> around 50 KB. There are not that many very small objects.

Ah, that slide makes more sense with that context; I was paging through
it in bed last night and thought it was about the number of emails per
user or something weird. So those mail objects are definitely bigger than
I expected; interesting.

>> *) Is it really cost-acceptable to not use EC pools on email data?
>
> We will use EC pools for the mail objects and replication for CephFS.
>
> But even without EC there would be a cost benefit compared to the
> current system. We will save a large number of IOPS in the new platform
> since the (NFS) POSIX layer is removed from the IO path (at least for
> the mail objects). And we expect that with Ceph and commodity hardware
> we can compete with a traditional enterprise NAS/NFS anyway.
>
>> *) Isn't per-object metadata overhead a big cost compared to the
>> actual stored data?
>
> I assume not. The metadata/index is small compared to the size of the
> mails (currently around 10% on NFS, I would say). In the classic
> NFS-based Dovecot the number of index/cache/metadata files is an issue
> anyway. With 6.7 billion mails we have 1.2 billion index/cache/metadata
> files
> (https://dalgaaf.github.io/CephMeetUpBerlin20170918-librmb/#/mailplatform-mails-nums).

I was unclear; I meant the RADOS metadata cost of storing an object. I
haven't quantified that in a while, but it was big enough to make 4 KB
objects pretty expensive, which I was incorrectly assuming would be the
case for most emails. EC pools have the same issue: if you erasure-code
a 40 KB object into 5+3, you pay the metadata overhead for each 8 KB
(40 KB / 5) of data, but again that matters less in practice than my
initial assumptions suggested.

This is super cool!
-Greg
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
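
To make the omap idea Danny mentions above concrete, here is a minimal
sketch using the librados C++ API: per-mailbox index entries stored as
omap key/value pairs on a single index object. The pool name, object
name, key layout, and value encoding are hypothetical illustrations,
not librmb's actual format.

    #include <rados/librados.hpp>
    #include <map>
    #include <string>

    int main() {
      // Error handling omitted for brevity in this sketch.
      librados::Rados cluster;
      cluster.init(nullptr);                          // default client id
      cluster.conf_read_file("/etc/ceph/ceph.conf");
      cluster.connect();

      librados::IoCtx io;
      cluster.ioctx_create("mail_index", io);         // hypothetical pool

      // One RADOS object per mailbox; each mail's index entry is an
      // omap key/value pair, so index reads and updates are key/value
      // operations that never touch the mail objects themselves.
      librados::ObjectWriteOperation op;
      std::map<std::string, librados::bufferlist> entries;
      librados::bufferlist val;
      val.append(std::string("uid=4711 flags=\\Seen size=51200"));
      entries["mail/4711"] = val;                     // hypothetical key
      op.omap_set(entries);
      io.operate("user42.mailbox.INBOX", &op);        // hypothetical oid

      cluster.shutdown();
      return 0;
    }

Compile with -lrados. The appeal of omap here is exactly Danny's IOPS
point: index access becomes key/value operations on one object, with no
POSIX layer in the path.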
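Similarly, the one-object-per-mail write path discussed in the thread
might look roughly like this; again only a sketch, with the xattr name
and object naming purely illustrative and no claim to match librmb's
real on-disk layout.

    #include <rados/librados.hpp>
    #include <string>

    // Store one email as one RADOS object, keeping a piece of immutable
    // message metadata in an xattr next to the data (a sketch, not
    // librmb's actual format).
    int store_mail(librados::IoCtx& io, const std::string& oid,
                   const std::string& rfc822, const std::string& recv_time) {
      librados::bufferlist body;
      body.append(rfc822);
      int r = io.write_full(oid, body);   // whole mail in a single write
      if (r < 0)
        return r;

      librados::bufferlist ts;
      ts.append(recv_time);
      return io.setxattr(oid, "mail.received", ts);  // hypothetical name
    }

With a ~50 KB average object, a single write_full per mail keeps the
write path to one round trip plus the xattr update.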