Re: Ceph for online file storage

Hello,

On Thu, 30 Jun 2016 08:34:12 +0000 (GMT) m.danai@xxxxxxxxxx wrote:

> Thank you all for your prompt answers.
> 
> >firstly, wall of text, makes things incredibly hard to read.
> >Use paragraphs/returns liberally.
> 
> I actually made sure to use paragraphs. For some reason, the formatting
> was removed.
> 
> >Is that your entire experience with Ceph, ML archives and docs?
> 
> Of course not, I have already been through the whole documentation many
> times. It's just that I couldn't really decide between the choices I was
> given.
> 
> >What's an "online storage"?
> >I assume you're talking about what is commonly referred to as "cloud
> >storage".
> 
> I try not to use the term "cloud", but if you must, then yes that's the
> idea behind it. Basically an online hard disk.
> 
While I can certainly agree that "cloud" is overused and often mis-used as
well, it makes things clearer in this context.

> >10MB is not a small file in my book, 1-4KB (your typical mail) are small
> >files.
> >How much data (volume/space) are you looking at initially and within a
> >year of deployment?
> 
> 10MB is small compared to the larger files, but it is indeed bigger than
> smaller, IOPS-intensive files (like the emails you pointed out).
> 
> Right now there are two servers, each with 12x8TB. I expect a growth
> rate of about the same size every 2-3 months.
> 
Those 2 servers are running Ceph?
If so, be more specific: what's the HW like (CPU, RAM, network, journal
SSDs)?

Also, 2 servers indicate a replication of 2, something I'd avoid in
production.


> >What usage patterns are you looking at, expecting?
> 
> Since my customers will put their files on this "cloud", it's generally
> write once, read many (or at least more reads than writes). As they most
> likely will store private documents, but some bigger files too, the
> smaller files are predominant.
>
Reads are helped by having plenty of RAM in your storage servers.
 
> >That's quite the blanket statement and sounds like it's from a sales
> >brochure. SSDs for OSD journals are always a good idea.
> >Ceph scales first and foremost by adding more storage nodes and OSDs.
> 
> What I meant by scaling is that as the number of customers grows, the
> more small files there will be, and so in order to have decent
> performance at that point, SSDs are a must. I can add many OSDs, but if
> they are all struggling with IOPS then it's no use (except having more
> space).
> 
You seem to grasp the fact that IOPS are likely to be your bottleneck, yet
you are going for 8TB HDDs.
Which, as Oliver mentioned and as plenty of experience shared on this ML
shows, is a poor choice unless it's for a very low IOPS, large data use case.

Now, while I certainly understand the appeal of dense storage nodes from a
cost/space perspective, you will want to run several scenarios and
calculations to see what actually turns out to be the best fit.

Your HDDs can do about 150 IOPS each, half of that if they have no SSD
journals, and then some 30% more is lost to FS journals, LevelDB updates, etc.
Let's call it 60 IOPS w/o SSD journals and 120 with.

Your first and foremost way to improve IOPS is to have SSD journals;
everybody who deployed Ceph w/o them in any serious production environment
came to regret it.

After that, match your IOPS needs to your space needs.
So your 2 x 12 HDD cluster up there (with SSD journals) can hope to achieve
about 1500 IOPS. Doubling the OSDs while halving their size will give you the
same space but much better performance.
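
A back-of-the-envelope sketch of that estimate. It assumes a replication of
2 (which the two-node setup suggests), so that every client write costs two
OSD writes; that is how the per-HDD figures above land near 1500. The
function and parameter names are made up for illustration.

```python
def cluster_write_iops(num_osds, iops_per_osd, replication):
    """Rough client-visible write IOPS: raw OSD IOPS divided by the replica count."""
    return num_osds * iops_per_osd / replication

# Per-HDD figures from above: ~120 IOPS with SSD journals, ~60 without.
with_ssd = cluster_write_iops(24, 120, 2)     # 2 nodes x 12 HDDs
without_ssd = cluster_write_iops(24, 60, 2)
# Twice as many HDDs of half the size: same raw space, double the spindles.
doubled = cluster_write_iops(48, 120, 2)

print(with_ssd, without_ssd, doubled)  # 1440.0 720.0 2880.0
```

With a replication of 3 the same hardware would come out lower still, which
is one more reason to run these numbers before buying drives.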

Cache-tiering can help immensely, but as I said it depends on the usage
patterns and cache size.
Since your use case involves relatively large reads or writes, normal
straightforward caching (especially before Jewel) would quickly dirty your
cache w/o much gain.
Very refined tuning and selection of cache modes may work out well with
your scenario, but it won't be trivial.

For example if your typical writes per day were to be 2TB, I'd put a 5-6TB
cache tier in place, and run it in read-forward mode. 
So all reads would come from the HDD OSDs, but they'd be undisturbed (full
IOPS/bandwidth) by writes, which all go to the cache.
That is, until the cache pool gets full of course.
And if you have low usage times, you could drop free ratios at those times
and flush the cache w/o impacting performance too much.
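
To see why a 5-6TB cache fits 2TB of daily writes, here is a minimal sizing
sketch. It assumes pure new writes with no flushing in between, and the 0.8
usable fraction is only an illustrative threshold for when the cache counts
as full, not a recommendation; the function name is hypothetical.

```python
def days_until_full(cache_tb, daily_writes_tb, usable_fraction):
    """Days of pure write traffic a cache tier can absorb before flushing must start."""
    return cache_tb * usable_fraction / daily_writes_tb

# 2 TB of writes per day against the 5-6 TB cache suggested above.
for cache_tb in (5, 6):
    print(cache_tb, "TB cache:", days_until_full(cache_tb, 2, 0.8), "days")
```

Under these assumptions the cache holds roughly two days of writes, which
leaves room to do the flushing during a quiet period.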

> >Are we talking about existing HW or what you're planning?
> 
> That is existing hardware. Given the high capacity of the drives, I went
> with a more powerful CPU to save myself future headaches.
> 
The CPU power required is mostly tied to the IOPS capacity of the device,
which is basically the same for all HDD sizes. 

> >Also, avoid large variations in your storage nodes if anyhow possible,
> >especially in your OSD sizes.
> 
> Say I have two nodes, one with 12 OSDs and  the other with 24. All
> drives are the same size. Would that cause any issue ? (except for the
> failure domain)
> 
Identical drive (OSD) size helps, but now node B (the large one) gets
twice the action of node A.
Meaning it needs double the CPU power, RAM and, in many cases more
importantly, double the network bandwidth.
The latter part can be particularly tricky and expensive.
 
> I think it is clear that native calls are the way to go, even the docs
> point you in that direction. Now the issue is that the clients need to
> have a file directory structure.
> 
> The access topology is as follows:
> 
> Customer <-> customer application <-> server application <-> Ceph cluster
> 
> The customer has to be able to make directories, as with an FTP server
> for example. Using CephFS would make this task very easy, though at the
> expense of some performance. With native calls, since everything is
> considered as an object, it gets trickier to provide this feature.
> Perhaps some naming scheme would make this possible.
> 
That's pretty much beyond me, basically a question of the effort you want
to put into the application, but there are examples for these approaches
out there.
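
One possible naming scheme, as a minimal sketch only: each object name is
the user ID followed by the user's virtual path, and a "directory" is just a
shared name prefix. Everything here is hypothetical (the function names, the
assumption that user IDs contain no "/"), and it deliberately leaves out the
actual storage calls and the question of how to enumerate names efficiently.

```python
import posixpath

def object_name(user_id, path):
    """Map a user's virtual path to a flat object name, e.g. 'alice/docs/a.txt'."""
    # Normalize against a virtual root so '..' cannot escape the user's namespace.
    clean = posixpath.normpath("/" + path).lstrip("/")
    if not clean:
        raise ValueError("path must name a file")
    return f"{user_id}/{clean}"

def list_dir(object_names, user_id, directory=""):
    """Emulate a directory listing over flat object names by prefix matching."""
    prefix = f"{user_id}/"
    d = posixpath.normpath("/" + directory).lstrip("/")
    if d:
        prefix += d + "/"
    entries = set()
    for name in object_names:
        if name.startswith(prefix):
            head, sep, _ = name[len(prefix):].partition("/")
            entries.add(head + sep)  # a trailing '/' marks a sub-directory
    return sorted(entries)

names = [
    object_name("alice", "docs/a.txt"),
    object_name("alice", "docs/sub/b.txt"),
    object_name("alice", "c.bin"),
    object_name("bob", "docs/a.txt"),
]
print(list_dir(names, "alice"))          # ['c.bin', 'docs/']
print(list_dir(names, "alice", "docs"))  # ['a.txt', 'sub/']
```

Because the user ID is part of every name, file A of user A and file A of
user B can never collide. Scanning all names like this will not scale, so a
real application would likely keep a per-user index of some kind.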

If you were to use CephFS for storage, putting the metadata on SSDs will
be beneficial, too.

Christian

> Kind regards,
> 
> Moïn Danai.
> 
> ----Original Message----
> From : chibi@xxxxxxx
> Date : 27/06/2016 - 02:45 (CEST)
> To : ceph-users@xxxxxxxxxxxxxx
> Cc : m.danai@xxxxxxxxxx
> Subject : Re:  Ceph for online file storage
> 
> 
> Hello,
> 
> firstly, wall of text, makes things incredibly hard to read.
> Use paragraphs/returns liberally.
> 
> Secondly, what Yang wrote.
> 
> More inline.
> On Sun, 26 Jun 2016 18:30:35 +0000 (GMT+00:00) m.danai@xxxxxxxxxx wrote:
> 
> > Hi all,
> > After a quick review of the mailing list archive, I have a question
> > that is left unanswered: 
> 
> Is that your entire experience with Ceph, ML archives and docs?
> 
> >Is Ceph suitable for online file storage, and if
> > yes, shall I use RGW/librados or CephFS ? 
> 
> What's an "online storage"? 
> I assume you're talking about what is commonly referred to as "cloud
> storage".
> Which also typically tends to use HTTP, S3 and thus RGW would be the
> classic fit. 
> 
> But that's up to you really.
> 
> For example OwnCloud (and thus NextCloud) can use Ceph RGW as a storage
> backend. 
> 
> >The typical workload here is
> > mostly small files 50kB-10MB and some bigger ones 100MB+ up to 4TB max
> > (roughly 70/30 split). 
> 10MB is not a small file in my book, 1-4KB (your typical mail) are small
> files.
> How much data (volume/space) are you looking at initially and within a
> year of deployment?
> 
> What usage patterns are you looking at, expecting?
> 
> >Caching with SSDs is critical in achieving
> > scalable performance as OSD hosts increase (and files as well). 
> 
> That's quite the blanket statement and sounds like it's from a sales
> brochure. SSDs for OSD journals are always a good idea.
> Ceph scales first and foremost by adding more storage nodes and OSDs.
> 
> SSD based cache-tiers (quite a different beast to journals) can help, but
> that's highly dependent on your usage patterns as well as correct sizing
> and configuration of the cache pool.
> 
> For example, one of your 4TB files above could potentially wreak havoc
> with a cache pool of similar size.
> 
> >OSD
> > nodes have between 12 and 48 8TB drives. 
> 
> Are we talking about existing HW or what you're planning?
> 12 OSDs per node are a good start and what I aim for usually, 24 are
> feasible if you have some idea what you're doing.
> More than 24 OSDs per node requires quite the insight and significant
> investments in CPU and RAM. Tons of threads about this here.
> 
> Read the current thread "Dramatic performance drop at certain number of
> objects in pool" for example.
> 
> Also, avoid large variations in your storage nodes if anyhow possible,
> especially in your OSD sizes.
> 
> Christian
> 
> >If using CephFS, the hierarchy
> > would include alphabet letters at the root and then a user's directory
> > in the appropriate subfolder. With native calls, I'm not quite
> > sure on how to retrieve file A from user A and not user B. Note that
> > the software which processes user data is written in Java and deployed
> > on multiple client-facing servers, so rados integration should be easy.
> > Kind regards, Moïn Danai.
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com