Re: librmb: Mail storage on RADOS with Dovecot

Wido den Hollander <wido@xxxxxxxx> · Mon, 25 Sep 2017 09:19:47 +0200 (CEST)

> On 22 September 2017 at 23:56, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> 
> 
> On Fri, Sep 22, 2017 at 2:49 PM, Danny Al-Gaaf <danny.al-gaaf@xxxxxxxxx> wrote:
> > On 22.09.2017 at 22:59, Gregory Farnum wrote:
> > [..]
> >> This is super cool! Is there anything written down that explains this
> >> for Ceph developers who aren't familiar with the workings of Dovecot?
> >> I've got some questions I see going through it, but they may be very
> >> dumb.
> >>
> >> *) Why are indexes going on CephFS? Is this just about wanting a local
> >> cache, or about the existing Dovecot implementations, or something
> >> else? Almost seems like you could just store the whole thing in a
> >> CephFS filesystem if that's safe. ;)
> >
> > This is, if everything works as expected, only an intermediate step. One
> > idea
> > (https://dalgaaf.github.io/CephMeetUpBerlin20170918-librmb/#/status-3)
> > is to use omap to store the index/metadata.
> >
> > We chose a step-by-step approach and since we are currently not sure if
> > using omap would work performance wise, we use CephFS (also since this
> > requires no changes in Dovecot). Currently we put our focus on the
> > development of the first version of librmb, but the code to use omap is
> > already there. It needs integration, testing, and performance tuning to
> > verify if it would work with our requirements.
> >
> >> *) It looks like each email is getting its own object in RADOS, and I
> >> assume those are small messages, which leads me to
> >
> > The mail distribution looks like this:
> > https://dalgaaf.github.io/CephMeetUpBerlin20170918-librmb/#/mailplatform-mails-dist
> >
> >
> > Yes, the majority of the mails are under 500k, but most objects are
> > around 50k. There are not many very small objects.
> 
> Ah, that slide makes more sense with that context — I was paging
> through it in bed last night and thought it was about the number of
> emails per user or something weird.
> 
> So those mail objects are definitely bigger than I expected; interesting.
> 
> >
> >>   *) is it really cost-acceptable to not use EC pools on email data?
> >
> > We will use EC pools for the mail objects and replication for CephFS.
> >
> > But even without EC there would be a cost case compared to the current
> > system. We will save a large amount of IOPs in the new platform since
> > the (NFS) POSIX layer is removed from the IO path (at least for the mail
> > objects). And we expect with Ceph and commodity hardware we can compete
> > with a traditional enterprise NAS/NFS anyway.
> >
> >>   *) isn't per-object metadata overhead a big cost compared to the
> >> actual stored data?
> >
> > I assume not. The metadata/index is not so much compared to the size of
> > the mails (currently with NFS around 10% I would say). In the classic
> > NFS-based Dovecot the number of index/cache/metadata files is an issue
> > anyway. With 6.7 billion mails we have 1.2 billion index/cache/metadata
> > files
> > (https://dalgaaf.github.io/CephMeetUpBerlin20170918-librmb/#/mailplatform-mails-nums).
> 
> I was unclear; I meant the RADOS metadata cost of storing an object. I
> haven't quantified that in a while but it was big enough to make 4KB
> objects pretty expensive, which I was incorrectly assuming would be
> the case for most emails.
> EC pools have the same issue; if you want to erasure-code a 40KB
> object into 5+3 then you pay the metadata overhead for each 8KB
> (40KB/5) of data, but again that's more on the practical side of
> things than my initial assumptions placed it.

Yes, it is. But combining objects isn't easy either. RGW has the same limitation: its objects are striped across RADOS objects, and the EC overhead can become large.

At this moment the price/GB (correct me if needed, Danny!) isn't the biggest problem. It could be that all mails will be stored on a replicated pool.

There might also be some per-object overhead in BlueStore, but the numbers from Deutsche Telekom show that mails usually aren't 4KB. Only a small portion of the e-mails are that small.
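
To make the arithmetic in Greg's 5+3 example concrete, here is a rough Python sketch. It is not librmb or Ceph code: the function names and the per-shard overhead value are hypothetical placeholders, not measured BlueStore numbers, and the model ignores stripe-unit padding and allocation granularity. It only shows how a fixed per-shard cost weighs more as the shards get smaller.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Footprint:
    shard_bytes: int      # payload bytes stored in each shard/copy
    shards: int           # number of RADOS-level pieces written
    data_bytes: int       # total payload bytes written across all shards
    overhead_bytes: int   # total fixed per-shard cost (assumed, not measured)

    @property
    def overhead_ratio(self) -> float:
        """Fixed overhead relative to the payload bytes actually written."""
        return self.overhead_bytes / self.data_bytes


def ec_footprint(object_bytes: int, k: int, m: int, per_shard_overhead: int) -> Footprint:
    """Model a k+m erasure-coded object: k data shards plus m coding shards."""
    shard_bytes = math.ceil(object_bytes / k)
    shards = k + m
    return Footprint(shard_bytes, shards, shard_bytes * shards, per_shard_overhead * shards)


def replicated_footprint(object_bytes: int, copies: int, per_shard_overhead: int) -> Footprint:
    """Model a replicated object: every copy holds the full payload."""
    return Footprint(object_bytes, copies, object_bytes * copies, per_shard_overhead * copies)


# Placeholder value purely for illustration; substitute a measured figure.
PER_SHARD_OVERHEAD = 1024

ec = ec_footprint(40 * 1024, k=5, m=3, per_shard_overhead=PER_SHARD_OVERHEAD)
rep = replicated_footprint(40 * 1024, copies=3, per_shard_overhead=PER_SHARD_OVERHEAD)

print(f"EC 5+3: {ec.shards} shards of {ec.shard_bytes} bytes, overhead ratio {ec.overhead_ratio:.3f}")
print(f"3x rep: {rep.shards} copies of {rep.shard_bytes} bytes, overhead ratio {rep.overhead_ratio:.3f}")
```

With a 40KB mail, the EC case pays the fixed cost once per 8KB shard, as Greg described, while replication pays it once per 40KB copy. Whether that matters in practice depends on the real per-object cost, which this sketch does not know.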

We will see how this turns out.

Wido

> 
> This is super cool!
> -Greg
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com