Hello,

On Sun, 10 Jul 2016 14:33:36 +0000 (GMT) m.danai@xxxxxxxxxx wrote:

> Hello,
>
> >Those 2 servers are running Ceph?
> >If so, be more specific, what's the HW like, CPU, RAM, network, journal
> >SSDs?
>
> Yes, I was hesitating between GlusterFS and Ceph but the latter is much
> more scalable and is future-proof.
>
> Both have the same configuration, namely E5 2628L (6c/12t @ 1.9GHz),
> 8x16G 2133MHz, 2x10G bonded (we only use 10G and fiber links), multiple
> 120G SSDs available for journals and caching.
>
With two of these CPUs (and SSD journals) definitely not more than 24
OSDs per node. RAM is plentiful.

Which exact SSD models? None of the 120GB ones I can think of would make
good journal ones.

> >Also, 2 servers indicate a replication of 2, something I'd avoid in
> >production.
>
> This is true. I was thinking about EC instead of replication.
>
With EC you need to keep several things in mind:

1. Performance, especially IOPS, is worse than replicated.
2. More CPU power is needed.
3. A cache tier is mandatory.
4. Most importantly, you can't start small. With something akin to RAID6
   levels of redundancy, you probably want nothing smaller than 8 nodes
   (K=6,M=2); a quick usable-vs-raw space comparison follows further down.

> >Your first and foremost way to improve IOPS is to have SSD journals,
> >everybody who deployed Ceph w/o them in any serious production
> >environment came to regret it.
>
> I think it is clear that journals are a must, especially since many small
> files will be read and written to.
>
> >Doubling the OSDs while halving the size will give you the same
> >space but at a much better performance.
>
> It's true, but then the $/TB or even $/PB ratio is much higher. It would
> be interesting to compare the outcome with more lower-density disks vs
> fewer higher-density disks but with more (aggressive) caching/journaling.
>
You may find that it's a zero-sum game, more or less.
Basically you have the costs for chassis/MB/network cards per node that
push you towards higher-density nodes to save costs. OTOH, cache-tier
nodes (SSDs, NVMes, CPUs) don't come cheap either.

> Your overview of the whole system definitely helps sorting things out.
> As you suggested, it's best I try some combinations to find what suits
> my use case best.
>
> >If you were to use CephFS for storage, putting the metadata on SSDs will
> >be beneficial, too.
>
> All OS drives are SSDs, and considering the system will never use the
> SSD in full I think it would be safe to partition it for MDS, cache and
> journal data.
>
Again, it needs to be the right kind of SSD for this to work, but in
general, yes. I do share OS/journal SSDs all the time.
Note that the MDS in and by itself doesn't hold any persistent (on-disk)
data; the metadata is all in the CephFS metadata pool, and that's the one
you want to put on SSDs.

Christian

> --
> Kind regards,
>
> Moïn Danai.
>
> ----Original Message----
> From : chibi@xxxxxxx
> Date : 01/07/2016 - 04:26 (CEST)
> To : ceph-users@xxxxxxxxxxxxxx
> Cc : m.danai@xxxxxxxxxx
> Subject : Re: Ceph for online file storage
>
>
> Hello,
>
> On Thu, 30 Jun 2016 08:34:12 +0000 (GMT) m.danai@xxxxxxxxxx wrote:
>
> > Thank you all for your prompt answers.
> >
> > >firstly, wall of text, makes things incredibly hard to read.
> > >Use paragraphs/returns liberally.
> >
> > I actually made sure to use paragraphs. For some reason, the formatting
> > was removed.
> >
> > >Is that your entire experience with Ceph, ML archives and docs?
> >
> > Of course not, I have already been through the whole documentation many
> > times. It's just that I couldn't really decide between the choices I
> > was given.
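
(Coming back to the EC point above: a quick usable-vs-raw space
comparison. This is plain arithmetic, not Ceph output, and it ignores
fill margins and filesystem overhead.)

  # rough fraction of raw capacity that ends up usable
  k, m = 6, 2                                  # the K=6,M=2 profile above
  print("EC %d+%d : %.0f%%" % (k, m, 100.0 * k / (k + m)))   # 75%
  print("2x repl : %.0f%%" % (100.0 / 2))                    # 50%
  print("3x repl : %.0f%%" % (100.0 / 3))                    # 33%

So EC buys you space efficiency, but per points 1-4 above you pay for it
in IOPS, CPU and minimum cluster size.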
> >
> > >What's an "online storage"?
> > >I assume you're talking about what is commonly referred to as "cloud
> > >storage".
> >
> > I try not to use the term "cloud", but if you must, then yes that's the
> > idea behind it. Basically an online hard disk.
> >
> While I can certainly agree that "cloud" is overused and often mis-used
> as well, it makes things clearer in this context.
>
> > >10MB is not a small file in my book, 1-4KB (your typical mail) are
> > >small files.
> > >How much data (volume/space) are you looking at initially and within a
> > >year of deployment?
> >
> > 10MB is small compared to the larger files, but it is indeed bigger
> > than smaller, IOPS-intensive files (like the emails you pointed out).
> >
> > Right now there are two servers, each with 12x8TB. I expect a growth
> > rate of about the same size every 2-3 months.
> >
> Those 2 servers are running Ceph?
> If so, be more specific, what's the HW like, CPU, RAM, network, journal
> SSDs?
>
> Also, 2 servers indicate a replication of 2, something I'd avoid in
> production.
>
> > >What usage patterns are you looking at, expecting?
> >
> > Since my customers will put their files on this "cloud", it's generally
> > write once, read many (or at least more reads than writes). As they
> > most likely will store private documents, but some bigger files too,
> > the smaller files are predominant.
> >
> Reads are helped by having plenty of RAM in your storage servers.
>
> > >That's quite the blanket statement and sounds like it's from a sales
> > >brochure. SSDs for OSD journals are always a good idea.
> > >Ceph scales first and foremost by adding more storage nodes and OSDs.
> >
> > What I meant by scaling is that as the number of customers grows, the
> > more small files there will be, and so in order to have decent
> > performance at that point, SSDs are a must. I can add many OSDs, but if
> > they are all struggling with IOPS then it's no use (except having more
> > space).
> >
> You seem to grasp the fact that IOPS are likely to be your bottleneck,
> yet are going for 8TB HDDs.
> Which, as Oliver mentioned and plenty of experience shared on this ML
> shows, is a poor choice unless it's for very low IOPS, large data use
> cases.
>
> Now while I certainly understand the appeal of dense storage nodes from
> a cost/space perspective, you will want to run several scenarios and
> calculations to see what actually turns out to be the best fit.
>
> Your HDDs can do about 150 IOPS, half of that if they have no SSD
> journals, and then some 30% more is lost to FS journals, LevelDB updates,
> etc. Let's call it 60 IOPS w/o SSD journals and 120 with.
>
> Your first and foremost way to improve IOPS is to have SSD journals;
> everybody who deployed Ceph w/o them in any serious production
> environment came to regret it.
>
> After that, match your IOPS needs to your space needs.
> So your 2 x 12 HDD cluster up there (with SSDs) can hope to achieve
> about 1500 IOPS. Doubling the OSDs while halving the size will give you
> the same space but at a much better performance.
>
> Cache-tiering can help immensely, but as I said it depends on the usage
> patterns and cache size.
> Since your use case involves relatively large reads or writes, normal
> straightforward caching (especially before Jewel) would quickly dirty
> your cache w/o much gain.
> Very refined tuning and selection of cache modes may work out well with
> your scenario, but it won't be trivial.
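
(For reference, the plumbing for the read-forward setup described in the
next quoted paragraph boils down to a handful of commands. A rough sketch
driving the ceph CLI from Python; the pool names "hdd-data" and
"ssd-cache" are made up, both pools must already exist -- the cache pool
on SSD OSDs via its own CRUSH rule -- and option names should be checked
against your running release.)

  import subprocess

  def ceph(*args):
      # thin wrapper around the ceph CLI; raises on a non-zero exit code
      subprocess.check_call(("ceph",) + args)

  # attach the SSD pool as a cache tier in front of the HDD pool
  ceph("osd", "tier", "add", "hdd-data", "ssd-cache")
  # read-forward: reads go to the backing pool, writes land in the cache
  # (some releases want --yes-i-really-mean-it for non-writeback modes)
  ceph("osd", "tier", "cache-mode", "ssd-cache", "readforward")
  ceph("osd", "tier", "set-overlay", "hdd-data", "ssd-cache")

  # minimal cache behaviour: bloom hit sets, a ~5TB size cap and
  # flush/evict thresholds
  ceph("osd", "pool", "set", "ssd-cache", "hit_set_type", "bloom")
  ceph("osd", "pool", "set", "ssd-cache", "target_max_bytes", str(5 * 1024**4))
  ceph("osd", "pool", "set", "ssd-cache", "cache_target_dirty_ratio", "0.4")
  ceph("osd", "pool", "set", "ssd-cache", "cache_target_full_ratio", "0.8")

Lowering cache_target_dirty_ratio during quiet hours is also how you would
do the off-peak flushing mentioned below.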
>
> For example, if your typical writes per day were to be 2TB, I'd put a
> 5-6TB cache tier in place, and run it in read-forward mode.
> So all reads would come from the HDD OSDs, but they'd be undisturbed
> (full IOPS/bandwidth) by writes, which all go to the cache.
> That is, until the cache pool gets full of course.
> And if you have low usage times, you could drop the free ratios at those
> times and flush the cache w/o impacting performance too much.
>
> > >Are we talking about existing HW or what you're planning?
> >
> > That is existing hardware. Given the high capacity of the drives, I
> > went with a more powerful CPU to avoid myself future headaches.
> >
> The CPU power required is mostly tied to the IOPS capacity of the device,
> which is basically the same for all HDD sizes.
>
> > >Also, avoid large variations in your storage nodes if anyhow possible,
> > >especially in your OSD sizes.
> >
> > Say I have two nodes, one with 12 OSDs and the other with 24. All
> > drives are the same size. Would that cause any issue? (except for the
> > failure domain)
> >
> Identical drive (OSD) size helps, but now node B (the large one) gets
> twice the action of node A.
> Meaning it needs double the CPU power, RAM and, in many cases more
> importantly, double the network bandwidth.
> The latter part can be particularly tricky and expensive.
>
> > I think it is clear that native calls are the way to go, even the docs
> > point you in that direction. Now the issue is that the client needs to
> > have a file directory structure.
> >
> > The access topology is as follows:
> >
> > Customer <-> customer application <-> server application <-> Ceph
> > cluster
> >
> > The customer has to be able to make directories, as with an FTP server
> > for example. Using CephFS would make this task very easy, though at the
> > expense of some performance. With native calls, since everything is
> > considered as an object, it gets trickier to provide this feature.
> > Perhaps some naming scheme would make this possible.
> >
> That's pretty much beyond me, basically a question of the effort you want
> to put into the application, but there are examples for these approaches
> out there.
>
> If you were to use CephFS for storage, putting the metadata on SSDs will
> be beneficial, too.
>
> Christian
>
> > Kind regards,
> >
> > Moïn Danai.
> >
> > ----Original Message----
> > From : chibi@xxxxxxx
> > Date : 27/06/2016 - 02:45 (CEST)
> > To : ceph-users@xxxxxxxxxxxxxx
> > Cc : m.danai@xxxxxxxxxx
> > Subject : Re: Ceph for online file storage
> >
> >
> > Hello,
> >
> > firstly, wall of text, makes things incredibly hard to read.
> > Use paragraphs/returns liberally.
> >
> > Secondly, what Yang wrote.
> >
> > More inline.
> > On Sun, 26 Jun 2016 18:30:35 +0000 (GMT+00:00) m.danai@xxxxxxxxxx
> > wrote:
> >
> > > Hi all,
> > > After a quick review of the mailing list archive, I have a question
> > > that is left unanswered:
> >
> > Is that your entire experience with Ceph, ML archives and docs?
> >
> > >Is Ceph suitable for online file storage, and if
> > >yes, shall I use RGW/librados or CephFS?
> >
> > What's an "online storage"?
> > I assume you're talking about what is commonly referred to as "cloud
> > storage".
> > Which also typically tends to use HTTP, S3 and thus RGW would be the
> > classic fit.
> >
> > But that's up to you really.
> >
> > For example, OwnCloud (and thus NextCloud) can use Ceph RGW as a storage
> > backend.
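
(Re the per-user naming question further up -- retrieving user A's file
and not user B's with native calls: the usual approach is to encode the
"directory" into the object name and treat it purely as a prefix
convention. A rough librados sketch in Python; the pool and object names
are made up, and since the application in this thread is Java, the
rados-java bindings should offer equivalent calls.)

  import rados

  cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
  cluster.connect()
  ioctx = cluster.open_ioctx("userdata")     # pool name is made up
  try:
      # "directories" exist only as a naming convention in the object key
      ioctx.write_full("u/alice/docs/report.pdf", b"...file contents...")
      head = ioctx.read("u/alice/docs/report.pdf")   # first 8KB by default
      # listing a "directory" = filtering keys by prefix; fine for a sketch,
      # but a real application would keep its own index rather than scan
      # the whole pool
      alice_docs = [o.key for o in ioctx.list_objects()
                    if o.key.startswith("u/alice/docs/")]
  finally:
      ioctx.close()
      cluster.shutdown()

Keeping user A out of user B's objects is then enforced by the application
(or by separate pools or RADOS namespaces per tenant), not by the naming
scheme itself.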
> >
> > >The typical workload here is
> > >mostly small files 50kB-10MB and some bigger ones 100MB+ up to 4TB
> > >max (roughly 70/30 split).
> >
> > 10MB is not a small file in my book, 1-4KB (your typical mail) are
> > small files.
> > How much data (volume/space) are you looking at initially and within a
> > year of deployment?
> >
> > What usage patterns are you looking at, expecting?
> >
> > >Caching with SSDs is critical in achieving
> > >scalable performance as OSD hosts increase (and files as well).
> >
> > That's quite the blanket statement and sounds like it's from a sales
> > brochure. SSDs for OSD journals are always a good idea.
> > Ceph scales first and foremost by adding more storage nodes and OSDs.
> >
> > SSD-based cache tiers (quite a different beast to journals) can help,
> > but that's highly dependent on your usage patterns as well as correct
> > sizing and configuration of the cache pool.
> >
> > For example, one of your 4TB files above could potentially wreak havoc
> > with a cache pool of similar size.
> >
> > >OSD
> > >nodes have between 12 and 48 8TB drives.
> >
> > Are we talking about existing HW or what you're planning?
> > 12 OSDs per node are a good start and what I aim for usually, 24 are
> > feasible if you have some idea what you're doing.
> > More than 24 OSDs per node requires quite the insight and significant
> > investments in CPU and RAM. Tons of threads about this here.
> >
> > Read the current thread "Dramatic performance drop at certain number of
> > objects in pool" for example.
> >
> > Also, avoid large variations in your storage nodes if anyhow possible,
> > especially in your OSD sizes.
> >
> > Christian
> >
> > >If using CephFS, the hierarchy
> > >would include alphabet letters at the root and then a user's
> > >directory in the appropriate subfolder. With native calls,
> > >I'm not quite sure on how to retrieve file A from user A and not
> > >user B. Note that the software which processes user data is written
> > >in Java and deployed on multiple client-facing servers, so rados
> > >integration should be easy.
> > >
> > >Kind regards,
> > >
> > >Moïn Danai.
> >
>

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com