Re: Fast Ceph a Cluster with PB storage

Christian, 

  thanks a lot for your time. Please see below.


2016-08-17 5:41 GMT+05:00 Christian Balzer <chibi@xxxxxxx>:

Hello,

On Wed, 17 Aug 2016 00:09:14 +0500 Дробышевский, Владимир wrote:

>   So demands look like these:
>
> 1. He has a number of clients which need to periodically write a data set
> as big as 160GB to the storage. The acceptable write time for that amount
> is about a minute, so around 2700-2800MB per second. Each write session
> will happen in a dedicated manner.
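
As a quick check of the figure quoted above, here is a minimal Python sketch of the arithmetic. The function name is illustrative and not from the thread; it is also an assumption that "160GB" could mean either decimal GB or binary GiB, since the post does not say. The two readings bracket the quoted 2700-2800MB per second.

```python
def required_mb_per_s(size_bytes: float, seconds: float) -> float:
    """Sustained throughput (decimal MB/s) needed to move size_bytes in seconds."""
    return size_bytes / 1e6 / seconds

GB = 10**9    # decimal gigabyte
GiB = 2**30   # binary gibibyte

# 160 "GB" written in one minute, under both interpretations of the unit.
print(round(required_mb_per_s(160 * GB, 60)), "MB/s if 160GB is decimal")
print(round(required_mb_per_s(160 * GiB, 60)), "MB/s if 160GB is binary")
```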

Let me confirm that "dedicated" here means non-concurrent, sequential.
So not more than one client at a time, and the cluster and network would be
good enough if they can do 3GB/s?

Yes, this is what I meant.
 

Note that with IPoIB and QDR 3GB/s is about the best you can hope for,
that's with a single client of course.

I understand, thank you. Alexander doesn't have any setup yet and would like to build a cost-effective one (not exactly 'cheap', but with minimal cost to satisfy the requirements). So I've recommended QDR IB as a minimal setup if they are able to live with used hardware (which is pretty cheap in general and would allow an inexpensive multi-port-per-server setup with bonding, but is hard to get in Russia), or FDR if only new network hardware is an option.
 

>Data read should also be
> pretty fast. The written data must be shared after the write.
Fast reading might be achieved by these factors:
a) lots of RAM, to hold all FS SLAB data and of course page cache.
b) splitting writes and reads amongst the pools by using readforward cache
mode, so writes go (primarily, initially) to the SSD cache pool and
(cold) reads come from the HDD base pool.
c) having a large cache pool.

>Clients OS -
> Windows.
So what server(s) are they writing to?
I don't think that the Windows RBD port (dokan) is a well tested
implementation, besides not having been updated for a year or so.

This is a question I haven't asked (I hope Alexander will read this and send me an answer, which I will relay here), but I believe they use local P3608s for this at the moment. The main problem is that P3608s are pretty expensive and a local setup doesn't provide enough reliability, so they would like to build a cost-effective, reliable setup with less expensive drives that also provides network storage for other data.
The situation with dokan is exactly what I thought and told Alexander. So the only way is to set up intermediate servers, which will significantly reduce speed.


> 2. Regular storage is needed as well. He is thinking about 1.2PB of HDD
> storage with a 34TB SSD cache tier at the moment.
>
A 34TB cache pool with (at the very least) 2x replication will not be
cheap.
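
To put a rough number on that, here is a small Python sketch of the raw capacity a replicated cache pool needs. The 1.6TB drive size is only an assumption for illustration (the thread does not fix a drive model for the cache pool), the function names are made up, and free-space headroom is ignored.

```python
import math

def raw_capacity_tb(usable_tb: float, replication: int) -> float:
    """Raw capacity consumed by a replicated pool of the given usable size."""
    return usable_tb * replication

def drives_needed(usable_tb: float, replication: int, drive_tb: float) -> int:
    """Whole drives required to hold the raw capacity (no headroom included)."""
    return math.ceil(raw_capacity_tb(usable_tb, replication) / drive_tb)

# 34TB usable cache pool; 1.6TB per drive is an assumed, illustrative size.
for replication in (2, 3):
    raw = raw_capacity_tb(34, replication)
    count = drives_needed(34, replication, 1.6)
    print(f"{replication}x replication: {raw} TB raw, {count} drives")
```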

> The main question, which I can't answer, is how to calculate/predict the
> per-client write speed for a Ceph cluster?
This question has been asked before and in fact quite recently, see the
very short lived "Ceph performance calculator" thread.

Thank you, I've found it. I've been following the list for a pretty long time, but it seems I missed this discussion.
 

In short, too many variables.

>For example, if there will be a
> cache tier or even a dedicated SSD-only pool with Intel S3710 or Samsung
> SM863 drives - how does one approximate the write speed? Concurrent
> writes to 6-8 good SSD drives could probably give such speed, but is that
> true for the cluster in general?

Since we're looking here at one of the relatively few use cases where
bandwidth/throughput is the main factor and not IOPS, this calculation
becomes a bit easier and more predictable.
For an example, see my recent post:
"Better late than never, some XFS versus EXT4 test results"

Found it too, thanks! Very useful tests. Apart from the current topic, wouldn't btrfs give some advantages in the case of a pure SSD pool with inline (on the same drive) journals?
 
Which basically shows that with sufficient network bandwidth all available
drive speed can be utilized.

With fio randwrite and 4MB blocks the above setup gives me 440MB/s and
with 4K blocks 8000 IOPS.
So throughput wise, 100% utilization, full speed present.
IOPS, less than a third (the SSDs are at 33% utilization, the delays are
caused by Ceph and network latencies).
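
The two fio results are related by bandwidth = IOPS x block size. A minimal Python sketch of that conversion follows, assuming that 4K and 4MB mean 4KiB and 4MiB and that the MB/s figure is in MiB/s (the helper names are illustrative, not part of fio):

```python
def bandwidth_mib_s(iops: float, block_kib: float) -> float:
    """Bandwidth in MiB/s produced by a given IOPS rate at a given block size."""
    return iops * block_kib / 1024

def iops_at(mib_s: float, block_kib: float) -> float:
    """Operations per second needed to reach a bandwidth at a given block size."""
    return mib_s * 1024 / block_kib

# 4K test: 8000 IOPS expressed as bandwidth.
print(bandwidth_mib_s(8000, 4), "MiB/s at 4K blocks")
# 4MB test: 440MB/s expressed as operations per second.
print(iops_at(440, 4096), "ops/s at 4MB blocks")
```

So the 4K run moves only about 31MB/s despite 8000 IOPS, while the 4MB run needs just 110 operations per second to fill the drives, which fits the point that small-block results are limited by latency and large-block results by drive bandwidth.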

>3 sets per 8 drives in 13 servers (with an
> additional overhead for the network operations, ACKs and placement
> calculations), QDR or FDR InfiniBand or 40GbE; we know the drive specs, so
> does a formula exist to calculate the expected speed from the raw speed
> and/or IOPS point of view?
>

Let's look at a simplified example:
10 nodes (with fast enough CPU cores to fully utilize those SSDs/NVMes),
40Gb/s (QDR, Ether) interconnects.
Each node with 2 1.6TB P3608s, which are rated at 2000MB/s write speed.
Of course journals need to go somewhere, so the effective speed is half
of that.
Thus we get a top speed per node of 2GB/s.
With a replication of 2 we would get a 10GB/s write capable cluster, with
3 it's down to a theoretical 6.6GB/s.
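
The arithmetic above can be sketched in a few lines of Python. This is only the simplified model from this example (a journal on the same device halves the write speed, and every client write is multiplied by the replica count). The function names are illustrative, and latency, ACK overhead and CPU limits are not modelled. The 3000MB/s client link is the single-client IPoIB/QDR ceiling mentioned earlier in this mail.

```python
def node_write_mb_s(drives: int, drive_write_mb_s: float,
                    journal_penalty: float = 2.0) -> float:
    """Per-node write speed; inline journals cost a factor of journal_penalty."""
    return drives * drive_write_mb_s / journal_penalty

def cluster_write_mb_s(nodes: int, per_node_mb_s: float, replication: int) -> float:
    """Client-visible cluster write speed after replication overhead."""
    return nodes * per_node_mb_s / replication

def single_client_mb_s(cluster_mb_s: float, client_link_mb_s: float) -> float:
    """A single client cannot write faster than its own link."""
    return min(cluster_mb_s, client_link_mb_s)

per_node = node_write_mb_s(drives=2, drive_write_mb_s=2000)
for replication in (2, 3):
    cluster = cluster_write_mb_s(10, per_node, replication)
    client = single_client_mb_s(cluster, client_link_mb_s=3000)
    seconds = 160_000 / client  # time to write 160GB (decimal) from one client
    print(f"replication {replication}: cluster {cluster / 1000:.1f} GB/s, "
          f"single client {client / 1000:.1f} GB/s, 160GB in {seconds:.0f} s")
```

In this model the cluster is comfortably faster than the target at either replication level, so the client link becomes the limiting factor for a single writer.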

I'm ignoring the latency, ACK overhead up there, which has a significantly
lower impact on throughput than on IOPS. 

Having a single client or intermediary file server write all that to the
Ceph cluster over a single link is the bit I'd be more worried about.

I totally agree.
 

Christian

> Or, from the other side, if there are prerequisites, how can one be sure
> the projected cluster meets them? I'm pretty sure it's a typical task; how
> would you solve it?
>
> Thanks a lot in advance and best regards,
> Vladimir
>
>
> Best regards,
> Дробышевский Владимир
> "АйТи Город" company
> +7 343 2222192
>
> Hardware and software
> IBM, Microsoft, Eset
> Turnkey project delivery
> IT services outsourcing
>
> 2016-08-08 19:39 GMT+05:00 Александр Пивушков <pivu@xxxxxxx>:
>
> > Hello dear community!
> > I'm new to Ceph and only recently took up the subject of building
> > clusters, so your opinion is very important to me.
> >
> > I need to create a cluster with 1.2 PB of storage and very fast access
> > to data. Previously, "Intel® SSD DC P3608 Series 1.6TB NVMe PCIe 3.0 x4
> > Solid State Drive" disks were used; their speed satisfies everyone, but
> > as the storage volume grows the price of such a cluster rises steeply,
> > hence the idea of using Ceph.
> > The requirements are as follows:
> >
> > - 160 GB of data should be read and written at the speed of the SSD
> > P3608
> > - A high-speed SSD storage of 36 TB must be created, with read/write
> > speed approaching that of the SSD P3608
> > - A 1.2 PB store must be created; the faster the access, the better
> > - Triple redundancy is required
> > I do not yet really understand how to create a configuration with SSD
> > P3608 disks. Of course, the configuration needs to be changed; it is very
> > expensive.
> >
> > InfiniBand will be used, and 40 Gb Ethernet.
> > We will also use virtualization on high-performance hardware to optimize
> > the number of physical servers.
> > I'm not tied to specific server models or manufacturers. I have only
> > drawn up the cluster scheme, which should be criticized :)
> >
> > 1. OSD - 13 pieces.
> >      a. 1.4 TB SSD drive, analogue of Intel® SSD DC P3608 Series - 2 pieces
> >      b. Fibre Channel 16 Gbit/s - 2 ports.
> >      c. An array (not RAID) of up to 284 TB of SATA drives (36 drives of
> > 8TB each);
> >      d. 360 GB SSD, analogue of Intel SSD DC S3500 - 1 piece
> >      e. SATA drive 40 GB for installation of the operating system (or
> > booting from the network, which is preferable)
> >      f. RAM 288 GB
> >      g. 2 x CPU - 9 cores, 2 GHz - E5-2630v4
> > 2. MON - 3 pieces. All virtual servers:
> >      a. 1 Gbit/s Ethernet - 1 port.
> >      b. SATA drive 40 GB for installation of the operating system (or
> > booting from the network, which is preferable)
> >      c. SATA drive 40 GB
> >      d. 6GB RAM
> >      e. 1 x CPU - 2 cores at 1.9 GHz
> > 3. MDS - 2 pieces. All virtual servers:
> >      a. 1 Gbit/s Ethernet - 1 port.
> >      b. SATA drive 40 GB for installation of the operating system (or
> > booting from the network, which is preferable)
> >      c. SATA drive 40 GB
> >      d. 6GB RAM
> >      e. 1 x CPU - min. 2 cores at 1.9 GHz
> >
> > I plan to use the SSDs for acceleration, as a cache and for the OSD
> > journal.
> >
> > --
> > Alexander Pushkov
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >


--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/

--
Best regards,
Vladimir
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
