Re: librmb: Mail storage on RADOS with Dovecot

"Marc Roos" <M.Roos@xxxxxxxxxxxxxxxxx> · Thu, 30 Aug 2018 16:26:00 +0200

How is it going with this? Are we getting close to a state where we can 
store a mailbox on ceph with this librmb?

-----Original Message-----
From: Wido den Hollander [mailto:wido@xxxxxxxx] 
Sent: Monday 25 September 2017 9:20
To: Gregory Farnum; Danny Al-Gaaf
Cc: ceph-users
Subject: Re:  librmb: Mail storage on RADOS with Dovecot

> On 22 September 2017 at 23:56, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> 
> 
> On Fri, Sep 22, 2017 at 2:49 PM, Danny Al-Gaaf <danny.al-gaaf@xxxxxxxxx> wrote:
> > On 22.09.2017 at 22:59, Gregory Farnum wrote:
> > [..]
> >> This is super cool! Is there anything written down that explains
> >> this for Ceph developers who aren't familiar with the workings of
> >> Dovecot?
> >> I've got some questions I see going through it, but they may be 
> >> very dumb.
> >>
> >> *) Why are indexes going on CephFS? Is this just about wanting a 
> >> local cache, or about the existing Dovecot implementations, or 
> >> something else? Almost seems like you could just store the whole 
> >> thing in a CephFS filesystem if that's safe. ;)
> >
> > This is, if everything works as expected, only an intermediate step.
> > One idea
> > (https://dalgaaf.github.io/CephMeetUpBerlin20170918-librmb/#/status-3)
> > is to use omap to store the index/metadata.
> >
> > We chose a step-by-step approach, and since we are currently not sure
> > whether using omap would work performance-wise, we use CephFS (also
> > because this requires no changes in Dovecot). Currently our focus is on
> > the development of the first version of librmb, but the code to use
> > omap is already there. It needs integration, testing, and
> > performance tuning to verify whether it would meet our requirements.
> >
> >> *) It looks like each email is getting its own object in RADOS, and
> >> I assume those are small messages, which leads me to
> >
> > The mail distribution looks like this:
> > https://dalgaaf.github.io/CephMeetUpBerlin20170918-librmb/#/mailplatform-mails-dist
> >
> >
> > Yes, the majority of the mails are under 500k, but most objects
> > are around 50k. There are not many very small objects.
> 
> Ah, that slide makes more sense with that context; I was paging
> through it in bed last night and thought it was about the number of 
> emails per user or something weird.
> 
> So those mail objects are definitely bigger than I expected;
> interesting.
> 
> >
> >>   *) is it really cost-acceptable to not use EC pools on email
> >> data?
> >
> > We will use EC pools for the mail objects and replication for
> > CephFS.
> >
> > But even without EC there would be a cost case compared to the
> > current system. We will save a large number of IOPS in the new
> > platform since the (NFS) POSIX layer is removed from the IO path (at
> > least for the mail objects). And we expect that with Ceph and commodity
> > hardware we can compete with a traditional enterprise NAS/NFS
> > anyway.
> >
> >>   *) isn't per-object metadata overhead a big cost compared to the 
> >> actual stored data?
> >
> > I assume not. The metadata/index is not so much compared to the size
> > of the mails (currently with NFS around 10%, I would say). In the
> > classic NFS-based Dovecot the number of index/cache/metadata files
> > is an issue anyway. With 6.7 billion mails we have 1.2 billion
> > index/cache/metadata files
> > (https://dalgaaf.github.io/CephMeetUpBerlin20170918-librmb/#/mailplatform-mails-nums).
> 
> I was unclear; I meant the RADOS metadata cost of storing an object. I
> haven't quantified that in a while but it was big enough to make 4KB
> objects pretty expensive, which I was incorrectly assuming would be 
> the case for most emails.
> EC pools have the same issue; if you want to erasure-code a 40KB 
> object into 5+3 then you pay the metadata overhead for each 8KB
> (40KB/5) of data, but again that's more on the practical side of 
> things than my initial assumptions placed it.

Yes, it is. But combining objects isn't easy either. RGW also has this 
limitation where objects are striped in RADOS and the EC overhead can 
become large.
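
To make the 5+3 arithmetic quoted above concrete, here is a minimal sketch.
The helper name ec_footprint is made up for illustration and is not part of
librmb or Ceph. It assumes the object is split evenly into k data chunks and
ignores stripe-unit padding and any per-chunk metadata, so real numbers will
differ.

```python
def ec_footprint(object_size_kb, k, m):
    """Return (data_chunk_kb, total_chunks, raw_kb) for a k+m EC object.

    Hypothetical helper for illustration only: assumes an even split
    into k data chunks and ignores padding and per-chunk metadata.
    """
    data_chunk_kb = object_size_kb / k      # size of each data chunk
    total_chunks = k + m                    # data chunks plus coding chunks
    raw_kb = data_chunk_kb * total_chunks   # raw space before any metadata
    return data_chunk_kb, total_chunks, raw_kb


# The example from the thread: a 40KB mail in a 5+3 pool.
chunk_kb, chunks, raw_kb = ec_footprint(40, k=5, m=3)
print(chunk_kb, chunks, raw_kb)
```

With these assumptions a 40KB mail becomes eight chunks of 8KB each, so any
fixed per-object cost is paid eight times, compared with three times on a
3x replicated pool.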

At this moment the price/GB (correct me if needed Danny!) isn't the 
biggest problem. It could be that all mails will be stored on a 
replicated pool.

There also might be some overhead in BlueStore per object, but the 
numbers from Deutsche Telekom show that mails usually aren't 4KB. Only a 
small portion of e-mails is 4KB.

We will see how this turns out.

Wido

> 
> This is super cool!
> -Greg
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
