Re: Fast Ceph Cluster with PB storage

Hello


Tuesday, 9 August 2016, 14:56 +03:00, from Christian Balzer <chibi@xxxxxxx>:


Hello,

[re-added the list]

Also, try to leave a line break (a blank paragraph) between quoted and new text;
your mail looked like it was all written by me...

On Tue, 09 Aug 2016 11:00:27 +0300 Александр Пивушков wrote:

> Thank you for your response!
>
>
> >Tuesday, 9 August 2016, 5:11 +03:00, from Christian Balzer <chibi@xxxxxxx>:
> >
> >
> >Hello,
> >
> >On Mon, 08 Aug 2016 17:39:07 +0300 Александр Пивушков wrote:
> >
> >>
> >> Hello dear community!
> >> I'm new to Ceph and only recently started working on building clusters, so your opinion is very important to me.
> >> We need to create a cluster with 1.2 PB of storage and very fast access to the data. Previously we used "Intel® SSD DC P3608 Series 1.6TB NVMe PCIe 3.0 x4 Solid State Drive" disks; their speed is entirely satisfactory, but as the storage volume grows the price of such a cluster rises very steeply, hence the idea to use Ceph.
> >
> >You may want to tell us more about your environment, use case and in
> >particular what your clients are.
> >Large amounts of data usually means graphical or scientific data,
> >extremely high speed (IOPS) requirements usually mean database
> >like applications, which one is it, or is it a mix?
>
>This is a mixed project, combining graphics and science: it links a vast array of image data, like Google Maps :)
> Previously, the clients were Windows machines connected directly to powerful servers.
> Now a Ceph cluster connected over FC to the virtualization servers is planned. Virtualization: oVirt.

Stop right there. oVirt, despite being from RedHat, doesn't really support
Ceph directly all that well, last I checked.
That is probably where you get the idea/need for FC from.

If anyhow possible, you do NOT want another layer and protocol conversion
between Ceph and the VMs, like a FC gateway or iSCSI or NFS.

So if you're free to choose your virtualization platform, use KVM/qemu at
the bottom and something like OpenStack, OpenNebula, ganeti, or Pacemaker with
KVM resource agents on top.
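
As an illustration (a minimal, untested sketch using the python-rados/python-rbd bindings; the pool name 'vms', the image name and size are just examples): with KVM/qemu a VM disk is simply an RBD image that the hypervisor opens through librbd, with no gateway in between.

import rados
import rbd

# Connect using the usual config/keyring (paths assumed).
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('vms')            # pre-existing RBD pool (example name)
    rbd.RBD().create(ioctx, 'client-disk-01', 100 * 1024**3)  # 100 GiB VM disk image
    ioctx.close()
finally:
    cluster.shutdown()

qemu/libvirt then attaches vms/client-disk-01 directly via librbd, so the write path is hypervisor -> librbd -> OSDs, with no FC or iSCSI hop in between.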


I have worked with Proxmox.



>Clients are connected to the virtualization servers over 40 Gb Ethernet.

Your VM clients (if using RBD instead of FC) and the end-users could use
the same network infrastructure.

>The clients run Windows.
> The customers use their own software, which they wrote themselves. As for a database, I don't know; probably there isn't one. The processing results are stored in ordinary files, about 160 GB in total.

1 image file being 160GB?

No, a lot of files of different sizes, from 1 MB to 1 GB.



> We need to process these images very quickly, so as not to cause dissatisfaction among the customers. :) Within about a minute.

Explain.
Writing 160GB/minute is going to be a challenge on many levels.
Even with 40Gb/s networks this assumes no contention on the network OR the
storage backend...

For Fibre Channel:
link speed 16 Gbit/s, data size 160 GB, so
160*8/16 = 80 seconds over 1 channel (theoretical speed).
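
Rough line-rate arithmetic (a sketch: single link, no protocol overhead, no contention, no replication traffic):

DATA_GB = 160
for name, gbit_per_s in (("16 Gbit/s FC", 16), ("40 GbE", 40)):
    seconds = DATA_GB * 8 / gbit_per_s
    print("%s: %.0f s to move %d GB" % (name, seconds, DATA_GB))
# 16 Gbit/s FC -> 80 s, 40 GbE -> 32 s, best case per link.
# With 3x replication the cluster network carries roughly three
# times the client write volume on top of that.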




> >
> >
> >For example, how were the above NVMes deployed and how did they serve data
> >to the clients?
> >The fiber channel bit in your HW list below makes me think you're using
> >VMware, FC and/or iSCSI right now.
>
>The data is stored on the 1.6TB NVMe SSD and processed directly on it, all in one powerful server dedicated to this task, using 40 Gb Ethernet. The server runs CentOS 7.

So you're going from a single server with all NVMe storage to a
distributed storage.

You will be disappointed by the cost/performance in direct comparison.

Nevertheless, there are that many users, and they need to be provided with a single shared repository for their data.




>
> >
> >
> >> There are following requirements:
> >> - The 160 GB of data should be read and written at the speed of the SSD P3608
> >Again, how are they serving data now?
> >The speeds (latency!) a local NVMe can reach is of course impossible with
> >a network attached SDS like Ceph.
>
>That is sad. Doesn't parallelizing across 13 servers help? And FC?
>
Ceph does not do FC internally.
It only uses IP (so you can use IPoIB if you want).
Never mind that the problem is that the replication (x3) is causing the
largest part of the latency.

Can it be configured so that replication happens in the background?



> >
> >160GB is tiny, are you sure about this number?
>
>Yes, it's small, and the figure is exact. But this is the data whose processing time matters most. The larger data sets can be processed more slowly, even in the background; clients are not as nervous about those.

Still not getting it, but it seems more and more like 160GB/s.

no, 160 GB.



> >
> >
> >> - A high-speed storage pool of SSD drives, 36 TB in volume, must be created, with read/write speed approaching that of the SSD P3608
> >How is that different from the point above?  The data in this volume can be processed in the background, in parallel with the processing of the 160 GB, so its processing speed is not as important. Previously the entire amount was placed in a server on lower-performance SSDs; that is why I specified SSD drives of the same volume in the Ceph cluster, so it can read and write data quickly.
> >
> >
> >> - A 1.2 PB store must be created; the higher the access speed, the better...
> >Ceph scales well.
> >> - Must have triple redundancy
> >Also not an issue, depending on how you define this.
>
>Standard Ceph means, the simplest configuration. As far as I understand, by default it has dual redundancy.

Default is 3 replicas for about 2 years now. And trust me, you don't want
less when dealing with HDDs.
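
If you want to be explicit rather than rely on the default, here is a sketch with the python-rados bindings (the pool name 'data' is just an example, and the mon_command form of "osd pool set" may vary slightly between releases):

import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    if not cluster.pool_exists('data'):
        cluster.create_pool('data')
    # Pin the pool to 3 replicas explicitly.
    cmd = json.dumps({"prefix": "osd pool set", "pool": "data",
                      "var": "size", "val": "3"})
    ret, outbuf, outs = cluster.mon_command(cmd, b'')
    print(ret, outs)
finally:
    cluster.shutdown()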

> >
> >
> >> I do not really understand yet how to create a configuration with the SSD P3608 disks. Of course, the configuration needs to be changed; it is very expensive.
> >
> >There are HW guides and plenty of discussion about how to design largish
> >clusters, find and read them.
> >Like the ML threads:
> >"800TB - Ceph Physical Architecture Proposal"
> >"dense storage nodes" Thank you. I found and read
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-April/008775.html
>
> >
> >
> >Also read up on Ceph cache-tiering.
> >
> >> InfiniBand will be used, and 40 Gb Ethernet.
> >> We will also use virtualization on high-performance hardware to optimize the number of physical servers.
> >
> >What VM stack/environment?
> >If it is VMware, Ceph is a bad fit as the most stable way to export Ceph
> >storage to this platform is NFS, which is also the least performant
> >(AFAIK).  oVirt is being used.
> >
> >
> >> I'm not tied to specific server models or manufacturers. I have only created the cluster scheme, which should be criticized :)
> >>
> >> 1. OSD - 13 pieces.
> >>      a. 1.4 TB SSD-drive analogue Intel® SSD DC P3608 Series - 2 pieces
> >
> >For starters, that's not how you'd likely deploy high speed storage, see
> >CPU below.
> >Also this gives you 36TB un-replicated capacity, so you'll need 2-3 times
> >the amount to be safe.
>
>this capacity is only for processing; nothing is stored there

Don't understand what you mean here.

To increase the rate at which data is returned to and received from the users, I plan to use 36 TB of SSDs. It's like a second-level cache ...
The data itself must be stored on the HDDs, and those drives need (3x) replication.
For the 36 TB cache, replication is not needed: data is not stored there, it is only spooled there to return it to the user, or to temporarily take a large upload from the user.
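
A quick sanity check on the numbers (a sketch using the nominal drive counts quoted elsewhere in this mail: 13 OSD nodes, 2x 1.4 TB NVMe and 36x 8 TB SATA each):

nodes = 13
nvme_raw_tb = nodes * 2 * 1.4    # 2x 1.4 TB NVMe per node  -> ~36 TB hot tier
hdd_raw_tb = nodes * 36 * 8      # 36x 8 TB SATA per node   -> 3744 TB raw
hdd_usable_tb = hdd_raw_tb / 3   # with 3x replication      -> ~1248 TB, i.e. ~1.2 PB
print(nvme_raw_tb, hdd_raw_tb, hdd_usable_tb)

One caveat on the un-replicated cache idea: a writeback cache tier holds the only copy of dirty objects until they are flushed to the base pool, so running the hot tier without redundancy does risk losing recently written data.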





> >
> >
> >>      b. Fiber Channel 16 Gbit / c - 2 port.
> >What for?
> >If you need a FC GW (bad idea), 2 (dedicated if possible) machines will do.  One port is for the Ceph servers, the other connects to the virtualization servers (clients), to ensure the network is not a bottleneck.
> >

See above, it doesn't work like this.

> >
> >And where/what is your actual network HW?  The cluster is still being designed; everything will be purchased.
> Brocade 6510 FC SAN
> >
> >
> >>      c. An array (not RAID) of 284 TB of SATA-based drives (36 drives of 8 TB each);
> >
> >Ceph works better with not overly large storage nodes and OSDs.
> >I know you're trying to minimize rack space and cost, but something with
> >less OSDs per node and 4TB per OSD is going to be easier to get right.
>
> Well, yes....
> I understand.
> >
> >
> >>      d. 360 GB SSD, analogue of Intel SSD DC S3500 - 1 piece
> >What is that for?
> >Ceph only performs decently (with the current filestore) when using SSDs
> >as journals for the HDD based OSDs, a single SSD won't cut it and a 3500
> >has likely insufficient endurance anyway.
> >
> >For 36 OSDs you're looking at 7 400GB DC S3710s or 3 400GB DC P3700s...

>I don't understand, I really don't. I do in fact have a large HDD store. Ceph needs a separate disk for the logs, which is why I set this disk aside for them; it is relatively inexpensive and of sufficient size, approximately 10 GB per OSD. The OSD drives themselves are SATA HDDs.

Logs aren't that big really with OSDs.

Read up on "Ceph SSD journals".

OK



> >
> >
> >>      e. SATA drive 40 GB for installation of the operating system (or booting from the network, which is preferable)
> >>      f. RAM 288 GB
> >Generous, but will help reads.

> however, during recovery they need significantly more RAM (e.g., ~1GB per 1TB of storage per daemon). Generally, more RAM is better.
> http://docs.ceph.com/docs/master/start/hardware-recommendations/   and how much is enough?
>

For 36 OSDs? 128GB if you're feeling lucky, 256GB if you want to play it
safe and have enhanced read performance.

That rule is for worst case scenarios and from when 2TB disks were big.
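
As arithmetic (a sketch; the ~2 GB per OSD daemon figure is a rule of thumb for normal operation):

osds, tb_per_osd = 36, 8
old_rule_gb = osds * tb_per_osd   # old "~1 GB RAM per TB per daemon" rule -> 288 GB
daemons_gb = osds * 2             # ~2 GB per OSD daemon in normal operation -> 72 GB
print(old_rule_gb, daemons_gb)
# 288 GB is where the figure in the spec above comes from; ~72 GB for the
# daemons plus a generous page cache is why 128-256 GB per node is enough.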

ok



> >
> >
> >>      g. 2 x CPU - 9 core 2 Ghz. - E-5-2630v4
> >Firstly that's a 10 core, 2.2GHz CPU.  Oh, yes, it is 10-core... But the price! A Freudian slip on my part.
> >
> >Secondly, most likely underpowered if serving both NVMes and 36 HDD OSDs.
> >A 400GB DC S3610 (so slower SATA, not NVMe) will eat about 3 2.2GHz cores
> >when doing small write IOPS.  I took into account the Ceph recommendation of 1 GHz / 1 core per OSD.
>

That's for pure HDD. Which will give you nowhere the performance you want.
HDD plus SSD journal I figure 1-2 GHz, pure SSD or NVMe, as much as you
can afford.
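
A rough budget for the node above (a sketch; the per-OSD GHz figures for HDD plus journal are from this thread, the NVMe figure is an assumption):

total_ghz = 2 * 10 * 2.2               # 2x E5-2630 v4: 20 cores @ 2.2 GHz ~= 44 GHz
hdd_osds, ghz_per_hdd_osd = 36, 1.5    # HDD + SSD journal: ~1-2 GHz each
nvme_osds, ghz_per_nvme_osd = 2, 6     # NVMe OSDs: several cores each (assumed)
needed_ghz = hdd_osds * ghz_per_hdd_osd + nvme_osds * ghz_per_nvme_osd
print(total_ghz, needed_ghz)           # ~44 GHz available vs ~66 GHz wanted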

ok




> >
> >
> >There are several saner approaches I can think of, but these depend on the
> >answers to the questions above.
> >
> >
> >> 2. MON - 3 pieces. All virtual server:
> >Virtual server can work, I prefer real (even if shared) HW.
> >3 is the absolute minimum, 5 would be a good match. ok
> >
> >
> >>      a. 1 Gbps Ethernet / c - 1 port.
> >While the MONs don't have much data traffic, the lower latency of a faster
> >network would be helpful.
>what, for example?

Information exchange between the MONs (or MONs and OSDs) will have more
latency at 1Gb/s, thus be slower.

Which network is sufficient for the MONs, then?



> >
> >
> >If you actually need MDS, make those (real) servers also MONs and put
> >the rest on OSD nodes or VMs.  No, really, I do not know yet whether CephFS is needed. It is an open question.

From all I can see, you don't need it.
And thus no MDS.

We want to mount a user folder that is located on Ceph, but we do not want to use CephFS. Can this be done, and is an MDS really not needed in that case?


--
Александр Пивушков
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
