Re: Fast Ceph a Cluster with PB storage


 



On Mon, 22 Aug 2016 10:18:51 +0300 Александр Пивушков wrote:

>  Hello,
> Several answers below
> 
> >Среда, 17 августа 2016, 8:57 +03:00 от Christian Balzer <chibi@xxxxxxx>:
> >
> >
> >Hello,
> >
> >On Wed, 17 Aug 2016 09:27:30 +0500 Дробышевский, Владимир wrote:
> >
> >> Christian,
> >> 
> >>   thanks a lot for your time. Please see below.
> >> 
> >> 
> >> 2016-08-17 5:41 GMT+05:00 Christian Balzer < chibi@xxxxxxx >:
> >> 
> >> >
> >> > Hello,
> >> >
> >> > On Wed, 17 Aug 2016 00:09:14 +0500 Дробышевский, Владимир wrote:
> >> >
> >> > >   So demands look like these:
> >> > >
> >> > > 1. He has a number of clients which need to periodically write a set of
> >> > > data as big as 160GB to a storage. The acceptable write speed is about a
> >> > > minute for such an amount, so around 2700-2800MB per second. Each
> >> > > write session will happen in a dedicated manner.
> >> >
> >> > Let me confirm that "dedicated" here means non-concurrent, sequential.
> >> > So no more than one client at a time, and the cluster and network would
> >> > be fine if they can do 3GB/s?
> >> >
> >> Yes, this is what I meant.
> >>
> >That's good to know, it makes that data dump from a single client/server
> >at least marginally possible, without resorting to even more expensive
> >network infrastructure.
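As a quick sanity check on the requirement above, the 160GB-per-minute figure works out to the quoted 2700-2800MB/s (using binary megabytes; vendors' MB/s figures vary):

```python
# Required sustained write speed for the 160 GB / 60 s data dump.
data_gb = 160    # size of one write session, GB
window_s = 60    # acceptable time for one session, seconds

mb_per_s = data_gb * 1024 / window_s
print(f"{mb_per_s:.0f} MB/s")  # ~2731 MB/s, matching the 2700-2800MB/s above
```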
> >
> >> 
> >> >
> >> > Note that with IPoIB and QDR 3GB/s is about the best you can hope for,
> >> > that's with a single client of course.
> >> >
> >> I understand, thank you. Alexander doesn't have any setup yet and would
> >> like to build a cost-effective one (not exactly 'cheap', but with minimal
> >> costs to satisfy the requirements), so I've recommended QDR IB as a minimal
> >> setup if they can live with used hardware (which is pretty cheap in
> >> general and would allow an inexpensive multi-port-per-server setup with
> >> bonding, but is hard to get in Russia), or FDR if only new network
> >> hardware can be obtained.
> >> 
> >Single link QDR should do the trick.
> >Bonding via a Linux bondN interface with IPoIB currently only supports
> >failover (active-standby), not load balancing.
> >Never mind that load balancing may still not improve bandwidth for a
> >single client talking to a single target (it would help on a server
> >talking to Ceph, thus multiple OSD nodes).
> >
> >There are of course other ways of using 2 interfaces to achieve higher
> >bandwidth, like using routing to the host. 
> >But that gets more involved. 
> We decided to test and buy 40GbE.
> There will be two links: one on the external network, another on the internal network.

Splitting Ceph into an internal (cluster, replication) and external
(client) network only makes sense in your case if your local storage has
more bandwidth than a single link.
That would mean more than 4x 1.6TB DC P3608s per node, i.e. over 4GB/s.
I don't think you need, or want to afford, that.

Also, having just 1 link w/o failover instead of 2 links to 2 switches
(active-active with MC-LAG, or active-backup) is a bad idea.

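The reasoning above can be put into numbers: a separate cluster network only pays off once a node's local storage can outrun a single link. A minimal sketch, assuming the 2x P3608-per-node layout from the example later in the thread and the rated drive speed:

```python
# Does a separate cluster (replication) network pay off for one OSD node?
# Only once local storage bandwidth exceeds a single link's bandwidth.
link_gbs = 40 / 8          # 40GbE link ~= 5 GB/s raw; less in practice
p3608_write_gbs = 2.0      # rated sequential write of a 1.6TB DC P3608
drives_per_node = 2        # assumed layout from the thread

storage_gbs = p3608_write_gbs * drives_per_node  # 4.0 GB/s per node
needs_split = storage_gbs > link_gbs
print(storage_gbs, needs_split)  # 4.0 False: one link is still enough
```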
> >
> >
> >> 
> >> >
> >> > >Data read should also be
> >> > > pretty fast. The written data must be shared after the write.
> >> > Fast reading might be achieved by these factors:
> >> > a) lots of RAM, to hold all FS SLAB data and of course page cache.
> >> > b) splitting writes and reads amongst the pools by using readforward
> >> > cache mode, so writes go (primarily, initially) to the SSD cache pool and
> What is "readforward cache mode"?
> 
This (read the tracker link on that page) is, unfortunately, still
un-documented.

> >
> >> > (cold) reads come from the HDD base pool.
> >> > c) having a large cache pool.
> >> >
> >> > >Clients OS -
> >> > > Windows.
> >> > So what server(s) are they writing to?
> >> > I don't think the Windows RBD port (dokan) is a well-tested
> >> > implementation, besides not having been updated for a year or so.
> Now everything is written to a local Intel NVMe P3608
> 
Yes, I gathered that. 
The question is, what servers between the Windows clients and the final
Ceph storage are you planning to use.

> >
> >> >
> >> This is the question I haven't asked (I hope Alexander will read this and
> >> write me an answer, and I'll answer here), but I believe they use local
> >> P3608s for this at the moment. The main problem is that P3608s are pretty
> >> expensive, and a local setup doesn't provide enough reliability, so they
> >> would like to build a cost-effective, reliable setup with less expensive
> >> drives, as well as providing network storage for other data.
> >> The situation with dokan is exactly what I thought and told Alexander. So
> >> the only way is to set up intermediate servers, which will significantly
> >> reduce speed.
> >> 
> >I haven't even tried to use Samba or NFS on top of RBD or CephFS, but
> >given that fio (with direct=1!) gives me the full speed of the OSDs, same
> >as with a "cp -ar", I'd hope that such file servers wouldn't be
> >significantly slower than their storage system.
> Can you tell us more about the use of Samba?
> Do you use something special, or all defaults?
> 
As I said, I haven't used it (with Ceph), but it should be able to use most
of the speed, based on my own experience with local disks and all the
tuning guides out there, like this one:
http://www.eggplant.pro/blog/faster-samba-smb-cifs-share-performance/

> >
> >
> >> 
> >> > > 2. It is necessary to have regular storage as well. He is thinking
> >> > > about 1.2PB of HDD storage with a 34TB SSD cache tier at the moment.
> >> > >
> >> > A 34TB cache pool with (at the very least) 2x replication will not be
> >> > cheap.
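To illustrate why that cache pool "will not be cheap", here is a rough drive count, under the assumption of 1.6TB P3608-class drives as elsewhere in the thread and ignoring the headroom Ceph needs below its near-full ratios:

```python
import math

# Raw SSD capacity needed for a 34 TB usable cache pool at 2x replication.
usable_tb = 34
replication = 2
drive_tb = 1.6                    # assumed 1.6 TB NVMe drive

raw_tb = usable_tb * replication  # 68 TB of raw SSD capacity
drives = math.ceil(raw_tb / drive_tb)
print(raw_tb, drives)             # 68 TB raw -> 43 drives minimum
```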
> >> >
> >> > > The main question with an answer I don't have is how to calculate\predict
> >> > > per client write speed for a ceph cluster?
> >> > This question has been asked before and in fact quite recently, see the
> >> > very short lived "Ceph performance calculator" thread.
> >> >
> >> Thank you, I've found it. I've been following the list for a pretty
> >> long time, but it seems I missed this discussion.
> >> 
> >> 
> >> >
> >> > In short, too many variables.
> >> >
> >> > >For example, if there will be a
> >> > > cache tier or even a dedicated SSD-only pool with Intel S3710 or Samsung
> >> > > SM863 drives - how to get approximation for the write speed? Concurent
> >> > > writes to the 6-8 good SSD drives could probably give such speed, but is
> >> > it
> >> > > true for the cluster in general?
> >> >
> >> > Since we're looking here at one of the relatively few use case where
> >> > bandwidth/throughput is the main factor and not IOPS, this calculation
> >> > becomes a bit easier and predictable.
> >> > For an example, see my recent post:
> >> > "Better late than never, some XFS versus EXT4 test results"
> >> >
> >> Found it too, thanks! Very useful tests. Beside of the current topic,
> >> wouldn't btrfs give some advantages in case of pure SSD pool with inline
> >> (on the same drive) journals?
> >> 
> >In theory yes, but I think the bigger win here is with IOPS, as opposed to
> >throughput.
> >With BTRFS you could use filestore_journal_parallel, but AFAIK that will
> >still result in 2 writes, so the full speed of the drive won't be
> >available either. 
> >Main advantage here would be that either a successful journal or FS
> >write will result in an ACK, so if the FS is faster you get some speedup. 
> >
> >The question is, how well tested is this code path, by the automatic Ceph
> >build tests and users out there?
> >At least fragmentation wouldn't matter with SSDs. ^o^
> >
> >At this point in time, I'd go with "well supported" and migrate to
> >Bluestore once that becomes trustworthy.
> Do I understand correctly that BlueStore can now be safely and advantageously used in production?
> 
Definitely not. BlueStore is not production ready and won't be for at
least 1-2 more releases, so sometime next year at the earliest.

Christian
> >
> >
> >Christian
> >> 
> >> > Which basically shows that with sufficient network bandwidth all available
> >> > drive speed can be utilized.
> >> >
> >> > With fio randwrite and 4MB blocks the above setup gives me 440MB/s and
> >> > with 4K blocks 8000 IOPS.
> >> > So throughput wise, 100% utilization, full speed present.
> >> > IOPS, less than a third (the SSDs are at 33% utilization, the delays are
> >> > caused by Ceph and network latencies).
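The two fio numbers above describe the same cluster from two angles; converting between them (a small arithmetic check on the figures quoted in that post) shows why this workload is throughput-bound rather than IOPS-bound:

```python
# 4MB randwrite at 440 MB/s: how many 4MB ops per second is that?
big_iops = 440 / 4            # = 110 ops/s at 4MB blocks

# 4KB randwrite at 8000 IOPS: how much bandwidth is that?
small_mbs = 8000 * 4 / 1024   # = 31.25 MB/s at 4KB blocks

# Large blocks saturate the drives' bandwidth; small blocks leave them
# mostly idle, limited by Ceph and network latency per operation.
print(big_iops, small_mbs)
```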
> >> >
> >> > >3 sets of 8 drives in 13 servers (with an
> >> > > additional overhead for network operations, ACKs and placement
> >> > > calculations), QDR or FDR InfiniBand or 40GbE; we know the drive specs;
> >> > > does a formula exist to calculate speed expectations from the raw speed
> >> > > and/or IOPS point of view?
> >> > >
> >> >
> >> > Let's look at a simplified example:
> >> > 10 nodes (with fast enough CPU cores to fully utilize those SSDs/NVMes),
> >> > 40Gb/s (QDR, Ether) interconnects.
> >> > Each node with 2x 1.6TB P3608s, which are rated at 2000MB/s write speed.
> >> > Of course journals need to go somewhere, so the effective speed is half
> >> > of that.
> >> > Thus we get a top speed per node of 2GB/s.
> >> > With a replication of 2 we would get a 10GB/s write capable cluster, with
> >> > 3 it's down to a theoretical 6.6GB/s.
> >> >
> >> > I'm ignoring the latency, ACK overhead up there, which has a significantly
> >> > lower impact on throughput than on IOPS.
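The simplified example above can be written out as a short calculation (same assumptions: co-located filestore journals halve effective drive speed; latency and ACK overhead ignored):

```python
# Simplified cluster write-throughput estimate from the example above.
nodes = 10
drives_per_node = 2
drive_write_gbs = 2.0    # rated P3608 sequential write, GB/s
journal_penalty = 2      # co-located filestore journal: every byte written twice

node_gbs = drives_per_node * drive_write_gbs / journal_penalty  # 2 GB/s per node
for replication in (2, 3):
    cluster_gbs = nodes * node_gbs / replication
    print(replication, round(cluster_gbs, 1))  # 2 -> 10.0 GB/s, 3 -> ~6.7 GB/s
```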
> >> 
> >> 
> >> > Having a single client or intermediary file server write all that to the
> >> > Ceph cluster over a single link is the bit I'd be more worried about.
> >> >
> >> I totally agree.
> >> 
> >> 
> >> >
> >> > Christian
> >> >
> >> > > Or, from another side, given such prerequisites, how can one be sure
> >> > > the projected cluster meets them? I'm pretty sure it's a typical
> >> > > task; how would you solve it?
> >> > >
> >> > > Thanks a lot in advance and best regards,
> >> > > Vladimir
> >> > >
> >> > >
> >> > > Best regards,
> >> > > Vladimir Drobyshevskiy
> >> > > "АйТи Город" company
> >> > > +7 343 2222192
> >> > >
> >> > > Hardware and software
> >> > > IBM, Microsoft, Eset
> >> > > Turnkey project delivery
> >> > > IT services outsourcing
> >> > >
> >> > > 2016-08-08 19:39 GMT+05:00 Александр Пивушков < pivu@xxxxxxx >:
> >> > >
> >> > > > Hello dear community!
> >> > > > I'm new to Ceph and only recently took up the topic of building
> >> > > > clusters.
> >> > > > Therefore your opinion is very important to me.
> >> > > >
> >> > > > We need to create a cluster with 1.2 PB of storage and very rapid
> >> > > > access to the data. Previously, "Intel® SSD DC P3608 Series 1.6TB NVMe
> >> > > > PCIe 3.0 x4 Solid State Drive" disks were used; their speed satisfies
> >> > > > everyone, but as the storage volume grows the price of such a cluster
> >> > > > rises very steeply, hence the idea to use Ceph.
> >> > > > There are the following requirements:
> >> > > > There are following requirements:
> >> > > >
> >> > > > - 160 GB of data should be read and written at the speed of the SSD
> >> > > > P3608
> >> > > > - A high-speed store of 36 TB must be created from SSD drives, with
> >> > > > read/write speed approaching the SSD P3608
> >> > > > - A 1.2 PB store must be created, with access speed the higher, the
> >> > > > better...
> >> > > > - It must have triple redundancy
> >> > > > I do not fully understand this yet, so I created a configuration with
> >> > > > P3608 SSD disks. Of course, the configuration needs to be changed; it
> >> > > > is very expensive.
> >> > > >
> >> > > > InfiniBand will be used, as well as 40 Gb Ethernet.
> >> > > > We will also use virtualization on high-performance hardware to
> >> > > > optimize the number of physical servers.
> >> > > > I'm not tied to specific server models or manufacturers. I have only
> >> > > > created the cluster scheme, which should be criticized :)
> >> > > >
> >> > > > 1. OSD - 13 pieces.
> >> > > >      a. 1.4 TB SSD-drive analogue Intel® SSD DC P3608 Series - 2 pieces
> >> > > >      b. Fiber Channel 16 Gbit / c - 2 port.
> >> > > >      c. An array (not RAID) of 284 TB of SATA-based drives (36
> >> > > > drives of 8TB each);
> >> > > >      d. 360 GB SSD- analogue Intel SSD DC S3500 1 piece
> >> > > >      e. SATA drive 40 GB for installation of the operating system (or
> >> > > > booting from the network, which is preferable)
> >> > > >      f. RAM 288 GB
> >> > > >      g. 2 x CPU - 9 core 2 Ghz. - E-5-2630v4
> >> > > > 2. MON - 3 pieces. All virtual server:
> >> > > >      a. 1 Gbps Ethernet / c - 1 port.
> >> > > >      b. SATA drive 40 GB for installation of the operating system (or
> >> > > > booting from the network, which is preferable)
> >> > > >      c. SATA drive 40 GB
> >> > > >      d. 6GB RAM
> >> > > >      e. 1 x CPU - 2 cores at 1.9 Ghz
> >> > > > 3. MDS - 2 pcs. All virtual server:
> >> > > >      a. 1 Gbps Ethernet / c - 1 port.
> >> > > >      b. SATA drive 40 GB for installation of the operating system (or
> >> > > > booting from the network, which is preferable)
> >> > > >      c. SATA drive 40 GB
> >> > > >      d. 6GB RAM
> >> > > >      e. 1 x CPU - min. 2 cores at 1.9 Ghz
> >> > > >
> >> > > > I plan to use SSDs for acceleration, as a cache and for the OSD journals.
> >> > > >
> >> > > > --
> >> > > > Alexander Pushkov
> >> > > >
> >> > > > _______________________________________________
> >> > > > ceph-users mailing list
> >> > > >  ceph-users@xxxxxxxxxxxxxx
> >> > > >  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> > > >
> >> > > >
> >> >
> >> >
> >> > --
> >> > Christian Balzer        Network/Systems Engineer
> >> >  chibi@xxxxxxx Global OnLine Japan/Rakuten Communications
> >> >  http://www.gol.com/
> >> >
> >> 
> >> --
> >> Best regards,
> >> Vladimir
> >
> >
> >-- 
> >Christian Balzer        Network/Systems Engineer 
> >chibi@xxxxxxx Global OnLine Japan/Rakuten Communications
> >http://www.gol.com/
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



