Re: Ceph for online file storage

Hello,

On Sun, 10 Jul 2016 14:33:36 +0000 (GMT) m.danai@xxxxxxxxxx wrote:

> Hello,
> 
> >Those 2 servers are running Ceph?
> >If so, be more specific, what's the HW like, CPU, RAM, network, journal
> >SSDs?
> 
> Yes, I was hesitating between GlusterFS and Ceph but the latter is much
> more scalable and is future-proof.
> 
> Both have the same configuration, namely E5 2628L (6c/12t @ 1.9GHz),
> 8x16G 2133MHz, 2x10G bonded (we only use 10G and fiber links), multiple
> 120G SSDs available for journals and caching.
> 
With two of these CPUs (and SSD journals) definitely not more than 24 OSDs
per node.
RAM is plentiful.
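
To show where that 24 OSD figure comes from, the back-of-the-envelope I
usually do looks like this (the ~1GHz of CPU per HDD-backed OSD is just
the rule of thumb commonly quoted on this list, not a hard number):

    # Rough CPU budget per node at ~1 GHz per HDD-backed OSD (assumed
    # rule of thumb; EC or SSD-backed OSDs want noticeably more).
    sockets, cores, ghz = 2, 6, 1.9           # 2x E5 2628L, 6c @ 1.9 GHz each
    cpu_budget_ghz = sockets * cores * ghz    # ~22.8 GHz in total
    print(int(cpu_budget_ghz / 1.0))          # -> 22, i.e. right around that
                                              #    24 OSD ceiling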

Which exact SSD models?
None of the 120GB models I can think of would make good journals.
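
To make the "why" concrete, a quick sketch of the write path (the HDDs
per journal SSD and the per-disk throughput are assumptions for
illustration, adjust to your actual layout):

    # Every byte written to an OSD also goes through its journal, so a
    # journal SSD in front of several HDDs has to keep up with all of them.
    hdds_per_journal_ssd = 6        # e.g. 2 journal SSDs for a 12 HDD node
    hdd_write_mb_s = 100            # rough sustained write of an 8TB HDD
    print(hdds_per_journal_ssd * hdd_write_mb_s)   # -> 600 MB/s of sync
                                                   #    (O_DSYNC) writes
    # ...plus the endurance to absorb that volume 24/7, which is where
    # small consumer SSDs fall over long before capacity is the problem.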

> >Also, 2 servers indicate a replication of 2, something I'd avoid in
> >production.
> 
> This is true. I was thinking about EC instead of replication.
> 
With EC you need to keep several things in mind:

1. Performance, especially IOPS, is worse than replicated.
2. More CPU power is needed.
3. A cache tier is mandatory. 
4. Most importantly, you can't start small. 
   With something akin to RAID6 levels of redundancy, you probably want
   nothing smaller than 8 nodes (K=6,M=2). 
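
To put rough numbers on point 4 (nothing Ceph-specific about the math,
just the usable-space ratio and the minimum host count for a host
failure domain):

    # EC(k,m): k data chunks + m coding chunks, one chunk per host.
    def usable_ratio(k, m):
        return k / float(k + m)

    k, m = 6, 2
    print(usable_ratio(k, m))   # -> 0.75 of raw space usable
    print(k + m)                # -> 8 hosts minimum, plus room to recover into
    print(1 / 2.0, 1 / 3.0)     # replica 2 / replica 3 for comparison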

> >Your first and foremost way to improve IOPS is to have SSD journals,
> >everybody who deployed Ceph w/o them in any serious production
> >environment came to regret it.
> 
> I think it is clear that journals are a must, especially since many small
> files will be read and written to.
> 
> >Doubling the OSDs while halving the size will give you the same
> >space but at a much better performance.
> 
> It's true, but then the $/TB or even $/PB ratio is much higher. It would
> be interesting to compare the outcome with more lower-density disks vs
> less higher-density disks but with more (aggressive) caching/journaling.
> 
You may find that it's a zero-sum game, more or less.

Basically you have the costs for chassis/MB/network cards per node that
push you towards higher density nodes to save costs.
OTOH cache-tier nodes (SSDs, NVMEs, CPUs) don't come cheap either.
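
If you want to run those scenarios quickly, I'd compare layouts with
something like the sketch below. The ~120 write IOPS per journaled HDD
is the rough figure from my earlier mail quoted further down, and
replica 2 is assumed; swap in your own numbers:

    # Fewer big HDDs vs. more small HDDs per node, same raw space.
    def layout(hdds, tb_each, iops_per_hdd=120, replicas=2):
        return {"raw_tb": hdds * tb_each,
                "client_write_iops": hdds * iops_per_hdd // replicas}

    print(layout(12, 8))   # {'raw_tb': 96, 'client_write_iops': 720}
    print(layout(24, 4))   # {'raw_tb': 96, 'client_write_iops': 1440}

Same space, double the write IOPS, which is the "more, smaller OSDs"
point from my earlier mail; whether the extra spindles, bays and
journals eat up the savings is the zero-sum part.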


> Your overview of the whole system definitely helps sort things out.
> As you suggested, it's best I try some combinations to find what suits
> my use case best.
> 
> >If you were to use CephFS for storage, putting the metadata on SSDs will
> >be beneficial, too.
> 
> All OS drives are SSDs, and considering the system will never use the
> SSD in full I think it would be safe to partition it for MDS, cache and
> journal data.
> 
Again, it needs to be the right kind of SSD for this to work, but in
general, yes.
I do share OS/journal SSDs all the time.
Note that the MDS in and of itself doesn't hold any persistent (on-disk)
data; the metadata all lives in the CephFS metadata pool, and that's the
one you want to put on SSDs.
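
Purely as a sanity check from the client side, a minimal python-rados
sketch to see how small the metadata pool is next to the data pool
(the pool names are assumptions, use whatever yours are called):

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    for pool in ('cephfs_metadata', 'cephfs_data'):   # assumed pool names
        if cluster.pool_exists(pool):
            ioctx = cluster.open_ioctx(pool)
            stats = ioctx.get_stats()   # num_bytes, num_objects, ...
            print(pool, stats['num_bytes'], stats['num_objects'])
            ioctx.close()
    cluster.shutdown()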

Christian
> --
> Sincères salutations,
> 
> Moïn Danai.
> ----Original Message----
> From : chibi@xxxxxxx
> Date : 01/07/2016 - 04:26 (CEST)
> To : ceph-users@xxxxxxxxxxxxxx
> Cc : m.danai@xxxxxxxxxx
> Subject : Re:  Ceph for online file storage
> 
> 
> Hello,
> 
> On Thu, 30 Jun 2016 08:34:12 +0000 (GMT) m.danai@xxxxxxxxxx wrote:
> 
> > Thank you all for your prompt answers.
> > 
> > >firstly, wall of text, makes things incredibly hard to read.
> > >Use paragraphs/returns liberally.
> > 
> > I actually made sure to use paragraphs. For some reason, the formatting
> > was removed.
> > 
> > >Is that your entire experience with Ceph, ML archives and docs?
> > 
> > Of course not, I have already been through the whole documentation many
> > times. It's just that I couldn't really decide between the choices I
> > was given.
> > 
> > >What's an "online storage"?
> > >I assume you're talking about what is commonly referred to as "cloud
> > >storage".
> > 
> > I try not to use the term "cloud", but if you must, then yes that's the
> > idea behind it. Basically an online hard disk.
> > 
> While I can certainly agree that "cloud" is overused and often mis-used
> as well, it makes things clearer in this context.
> 
> > >10MB is not a small file in my book, 1-4KB (your typical mail) are
> > >small files.
> > >How much data (volume/space) are you looking at initially and within a
> > >year of deployment?
> > 
> > 10MB is small compared to the larger files, but it is indeed bigger
> > than smaller, IOPS-intensive files (like the emails you pointed out).
> > 
> > Right now there are two servers, each with 12x8TB. I expect a growth
> > rate of about the same size every 2-3 months.
> > 
> Those 2 servers are running Ceph?
> If so, be more specific, what's the HW like, CPU, RAM, network, journal
> SSDs?
> 
> Also, 2 servers indicate a replication of 2, something I'd avoid in
> production.
> 
> 
> > >What usage patterns are you looking at, expecting?
> > 
> > Since my customers will put their files on this "cloud", it's generally
> > write once, read many (or at least more reads than writes). As they
> > most likely will store private documents, but some bigger files too,
> > the smaller files are predominant.
> >
> Reads are helped by having plenty of RAM in your storage servers.
>  
> > >That's quite the blanket statement and sounds like it's from a sales
> > >brochure. SSDs for OSD journals are always a good idea.
> > >Ceph scales first and foremost by adding more storage nodes and OSDs.
> > 
> > What I meant by scaling is that as the number of customers grows, the
> > more small files there will be, and so in order to have decent
> > performance at that point, SSDs are a must. I can add many OSDs, but if
> > they are all struggling with IOPS then it's no use (except having more
> > space).
> > 
> You seem to grasp the fact that IOPS are likely to be your bottleneck,
> yet are going for 8TB HDDs.
> Which, as Oliver mentioned and as plenty of experience shared on this ML
> shows, is a poor choice unless it's for very low-IOPS, large-data use
> cases.
> 
> Now while I certainly understand the appeal of dense storage nodes from
> cost/space perspective you will want to run several scenarios and
> calculations to see what actually turns out to be the best fit.
> 
> Your HDDs can do about 150 IOPS, half of that if they have no SSD
> journals and then some 30% more lost to FS journals, LevelDB updates,
> etc. Let's call it 60 IOPS w/o SSD journals and 120 with.
> 
> Your first and foremost way to improve IOPS is to have SSD journals,
> everybody who deployed Ceph w/o them in any serious production
> environment came to regret it.
> 
> After that match your IOPS needs to your space needs. 
> So your 2 x 12 HDD cluster up there (with SSDs) can hope to achieve
> about 1500 IOPS. Doubling the OSDs while halving the size will give you
> the same space but at a much better performance.
> 
> Cache-tiering can help immensely, but as I said it depends on the usage
> patterns and cache size.
> Since your use case involves relatively large reads or writes, normal
> straightforward caching (especially before Jewel) would quickly dirty
> your cache w/o much gain.
> Very refined tuning and selection of cache modes may work out well with
> your scenario, but it won't be trivial.
> 
> For example if your typical writes per day were to be 2TB, I'd put a
> 5-6TB cache tier in place, and run it in read-forward mode. 
> So all reads would come from the HDD OSDs, but they'd be undisturbed
> (full IOPS/bandwidth) by writes, which all go to the cache. 
> That is, until the cache pool gets full of course.
> And if you have low usage times, you could drop free ratios at those
> times and flush the cache w/o impacting performance too much.
> 
> > >Are we talking about existing HW or what you're planning?
> > 
> > That is existing hardware. Given the high capacity of the drives, I
> > went with a more powerful CPU to save myself future headaches.
> > 
> The CPU power required is mostly tied to the IOPS capacity of the device,
> which is basically the same for all HDD sizes. 
> 
> > >Also, avoid large variations in your storage nodes if anyhow possible,
> > especially in your OSD sizes.
> > 
> > Say I have two nodes, one with 12 OSDs and  the other with 24. All
> > drives are the same size. Would that cause any issue ? (except for the
> > failure domain)
> > 
> Identical drive (OSD) size helps, but now node B (the large one) gets
> twice the action of node A.
> Meaning it needs double the CPU power and RAM and, in many cases more
> importantly, double the network bandwidth.
> The latter part can be particularly tricky and expensive.
>  
> > I think it is clear that native calls are the way to go, even the docs
> > point you in that direction. Now the issue is that the clients need to
> > have a file directory structure.
> > 
> > The access topology is as follows:
> > 
> > Customer <-> customer application <-> server application <-> Ceph
> > cluster
> > 
> > The customer has to be able to make directories, as with an FTP server
> > for example. Using CephFS would make this task very easy, though at the
> > expense of some performance. With native calls, since everything is
> > considered as an object, it gets trickier to provide this feature.
> > Perhaps some naming scheme would make this possible.
> > 
> That's pretty much beyond me, basically a question of the effort you want
> to put into the application, but there are examples for these approaches
> out there.
> 
> If you were to use CephFS for storage, putting the metadata on SSDs will
> be beneficial, too.
> 
> Christian
> 
> > Kind regards,
> > 
> > Moïn Danai.
> > 
> > ----Original Message----
> > From : chibi@xxxxxxx
> > Date : 27/06/2016 - 02:45 (CEST)
> > To : ceph-users@xxxxxxxxxxxxxx
> > Cc : m.danai@xxxxxxxxxx
> > Subject : Re:  Ceph for online file storage
> > 
> > 
> > Hello,
> > 
> > firstly, wall of text, makes things incredibly hard to read.
> > Use paragraphs/returns liberally.
> > 
> > Secondly, what Yang wrote.
> > 
> > More inline.
> > On Sun, 26 Jun 2016 18:30:35 +0000 (GMT+00:00) m.danai@xxxxxxxxxx
> > wrote:
> > 
> > > Hi all,
> > > After a quick review of the mailing list archive, I have a question
> > > that is left unanswered: 
> > 
> > Is that your entire experience with Ceph, ML archives and docs?
> > 
> > >Is Ceph suitable for online file storage, and if
> > > yes, shall I use RGW/librados or CephFS ? 
> > 
> > What's an "online storage"? 
> > >I assume you're talking about what is commonly referred to as "cloud
> > >storage".
> > Which also typically tends to use HTTP, S3 and thus RGW would be the
> > classic fit. 
> > 
> > But that's up to you really.
> > 
> > For example OwnCloud (and thus NextCloud) can use Ceph RGW as a storage
> > backend. 
> > 
> > >The typical workload here is
> > > mostly small files 50kB-10MB and some bigger ones 100MB+ up to 4TB
> > > max (roughly 70/30 split). 
> > 10MB is not a small file in my book, 1-4KB (your typical mail) are
> > small files.
> > How much data (volume/space) are you looking at initially and within a
> > year of deployment?
> > 
> > What usage patterns are you looking at, expecting?
> > 
> > >Caching with SSDs is critical in achieving
> > > scalable performance as OSD hosts increase (and files as well). 
> > 
> > > That's quite the blanket statement and sounds like it's from a sales
> > > brochure. SSDs for OSD journals are always a good idea.
> > Ceph scales first and foremost by adding more storage nodes and OSDs.
> > 
> > SSD based cache-tiers (quite a different beast to journals) can help,
> > but that's highly dependent on your usage patterns as well as correct
> > sizing and configuration of the cache pool.
> > 
> > > For example one of your 4TB files above could potentially wreak havoc
> > with a cache pool of similar size.
> > 
> > >OSD
> > > nodes have between 12 and 48 8TB drives. 
> > 
> > Are we talking about existing HW or what you're planning?
> > 12 OSDs per node are a good start and what I aim for usually, 24 are
> > feasible if you have some idea what you're doing.
> > More than 24 OSDs per node requires quite the insight and significant
> > investments in CPU and RAM. Tons of threads about this here.
> > 
> > Read the current thread "Dramatic performance drop at certain number of
> > objects in pool" for example.
> > 
> > Also, avoid large variations in your storage nodes if anyhow possible,
> > especially in your OSD sizes.
> > 
> > Christian
> > 
> > >If using CephFS, the hierarchy
> > > would include alphabet letters at the root and then a user's
> > > directory in the appropriate subfolder. With native calls,
> > > I'm not quite sure how to retrieve file A from user A and not
> > > user B.
> > >
> > > Note that the software which processes user data is written
> > > in Java and deployed on multiple client-facing servers, so rados
> > > integration should be easy.
> > >
> > > Kind regards,
> > > Moïn Danai.
> > 
> > 
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



