That cannot be correct.
Check it on your cluster with dstat as I said...
You will see parallel IO on every OSD and journal at every node...
On 21.07.16 at 15:02, Jake Young wrote:
I think the answer is that with 1 thread you can only
ever write to one journal at a time. Theoretically, you would need
10 threads to be able to write to 10 nodes at the same time.
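A quick way to see this (just a sketch, assuming the default rbd pool on a 10-node cluster; pool name, runtime and thread count are placeholders):
# one writer: throughput is bounded by the latency of a single journal at a time
rados bench -p rbd 60 write -b 4M -t 1
# ten writers: objects are dispatched to ~10 primary OSDs concurrently, so aggregate bandwidth should scale up
rados bench -p rbd 60 write -b 4M -t 10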
Jake
On Thursday, July 21, 2016, wr@xxxxxxxx <wr@xxxxxxxx> wrote:
What I do not really understand is this:
Let's say the Intel P3700 does 200 MByte/s in a single-thread rados bench... see Nick's results below...
Now take multiple OSD nodes, for example 10 nodes, each with exactly one P3700 NVMe built in.
Why is the single-thread performance on the RBD client still exactly 200 MByte/s with a 10-node OSD cluster?
I would expect 10 nodes * 200 MByte/s = 2000 MByte/s.
Everyone can check this on their own cluster:
dstat -D sdb,sdc,sdd,sdX ....
You will see that Ceph stripes the data over all OSDs in the cluster if you test from the client side with rados bench...
rados bench -p rbd 60 write -b 4M -t 1
Is there not a way to enable the Linux page cache, i.e. not use D_SYNC?
That would improve performance dramatically.
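For what it's worth, the effect of the sync flag can be seen directly on a journal device with fio (a minimal sketch; /dev/nvme0n1 is a placeholder for the journal device, and this test overwrites data on it):
# --sync=1 opens the device with O_SYNC, so every write waits for the device, as the Ceph journal does
fio --name=sync-test --filename=/dev/nvme0n1 --direct=1 --sync=1 --rw=write --bs=4M --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting
# same write stream without O_SYNC, so the device can absorb and coalesce the writes
fio --name=nosync-test --filename=/dev/nvme0n1 --direct=1 --sync=0 --rw=write --bs=4M --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting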
On 21.07.16 at 14:33, Nick Fisk wrote:
-----Original Message-----
From: wr@xxxxxxxx [mailto:wr@xxxxxxxx]
Sent: 21 July 2016 13:23
To: nick@xxxxxxxxxx; 'Horace Ng' <horace@xxxxxxxxx>
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re: Ceph + VMware + Single
Thread Performance
Okay, and what is your plan now to speed things up?
Now that I have come up with a lower-latency hardware design, there is not much further improvement possible until persistent RBD caching is implemented, as that will move the SSD/NVMe closer to the client. But I'm happy with what I can achieve at the moment. You could also experiment with bcache on the RBD.
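A rough sketch of the bcache idea, assuming an already-mapped RBD at /dev/rbd0 and a spare local NVMe partition at /dev/nvme0n1p1 (both placeholders; writeback mode trades data safety for latency):
# bind the RBD as the backing device and the local NVMe partition as the cache
make-bcache -C /dev/nvme0n1p1 -B /dev/rbd0
# once the kernel has registered the device (udev usually does this), switch it to writeback
# so small synchronous writes are absorbed locally instead of waiting on the cluster
echo writeback > /sys/block/bcache0/bcache/cache_mode
# then use /dev/bcache0 instead of /dev/rbd0 from here on
mkfs.xfs /dev/bcache0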
Would it help to put multiple P3700s in each OSD node to improve performance for a single thread (for example Storage vMotion)?
Most likely not; it's all the other parts of the puzzle that are causing the latency. ESXi was designed for storage arrays that service IOs in the 100us-1ms range; Ceph is probably about 10x slower than this, hence the problem. Disable the BBWC on a RAID controller or SAN and you will see the same behaviour.
Regards
On 21.07.16 at 14:17, Nick Fisk wrote:
-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx]
On Behalf
Of wr@xxxxxxxx
Sent: 21 July 2016 13:04
To: nick@xxxxxxxxxx; 'Horace Ng'
<horace@xxxxxxxxx>
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re: Ceph + VMware + Single
Thread Performance
Hi,
Hmm, I think 200 MByte/s is really bad. Is your cluster in production right now?
It's just been built, not running yet.
So if you start a storage migration, you get only 200 MByte/s, right?
I wish. My current cluster (not this new one) would storage migrate at ~10-15 MB/s. Serial latency is the problem: without being able to buffer, ESXi waits for an ack for each IO before sending the next. It also submits the migrations in 64 KB chunks unless you get VAAI working; I think ESXi will try to do them in parallel, which will help as well.
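As a rough back-of-the-envelope illustration (the per-IO latency is an assumed figure, not measured here): at around 4 ms per synchronous 64 KB write, a single stream tops out at roughly 64 KB / 0.004 s ≈ 16 MB/s, which is in the same ballpark as the 10-15 MB/s above.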
I think it would be awesome if you got 1000 MByte/s.
Where is the bottleneck?
Latency serialisation: without a buffer, you can't drive the devices to 100%. With buffered IO (or high queue depths) I can max out the journals.
A fio test from Sebastien Han gives us 400 MByte/s raw performance from the P3700:
https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
How could it be that the rbd client performance is
50% slower?
Regards
On 21.07.16 at 12:15, Nick Fisk wrote:
I've had a lot of pain with this; smaller block sizes are even worse. You want to try and minimize latency at every point, as there is no buffering happening in the iSCSI stack. This means:
1. Fast journals (NVMe or NVRAM)
2. 10Gb or better networking
3. Fast CPUs (GHz)
4. Fix CPU C-states to C1
5. Fix CPU frequency to max (see the sketch below for 4 and 5)
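One common way to do items 4 and 5 on an Intel box (a sketch; the exact knobs depend on the distro and CPU):
# kernel boot parameters (e.g. appended to GRUB_CMDLINE_LINUX) to keep cores out of deep C-states
intel_idle.max_cstate=1 processor.max_cstate=1
# pin the frequency governor to performance on all cores (cpupower comes with the linux-tools package)
cpupower frequency-set -g performance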
Also, I can't be sure, but I think there is a metadata update happening with VMFS, particularly if you are using thin VMDKs; this can also be a major bottleneck. For my use case I've switched over to NFS, as it has given much more performance at scale and less headache.
For the RADOS Run, here
you go (400GB P3700):
Total time run: 60.026491
Total writes made: 3104
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 206.842
Stddev Bandwidth: 8.10412
Max bandwidth (MB/sec): 224
Min bandwidth (MB/sec): 180
Average IOPS: 51
Stddev IOPS: 2
Max IOPS: 56
Min IOPS: 45
Average Latency(s): 0.0193366
Stddev Latency(s): 0.00148039
Max latency(s): 0.0377946
Min latency(s): 0.015909
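As a sanity check on those numbers: with a single outstanding 4 MB write and an average latency of ~0.0193 s, the expected throughput is 4 MB / 0.0193 s ≈ 207 MB/s, essentially the reported 206.8 MB/sec. In other words, single-thread bandwidth here is purely latency-bound.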
Nick
-----Original
Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx]
On
Behalf Of Horace
Sent: 21 July 2016 10:26
To: wr@xxxxxxxx
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re: Ceph + VMware +
Single Thread Performance
Hi,
Same here. I've read a blog post saying that VMware will frequently verify the locking on VMFS over iSCSI, hence it has much slower performance than NFS (which uses a different locking mechanism).
Regards,
Horace Ng
----- Original Message -----
From: wr@xxxxxxxx
To: ceph-users@xxxxxxxxxxxxxx
Sent: Thursday, July 21, 2016 5:11:21 PM
Subject: Ceph + VMware + Single
Thread Performance
Hi everyone,
we see relatively slow single-thread performance on the iSCSI nodes of our cluster.
Our setup:
3 racks:
18x data nodes, 3 mon nodes, 3 iSCSI gateway nodes with tgt (rbd cache off).
2x Samsung SM863 enterprise SSDs for journals (3 OSDs per SSD) and 6x WD Red 1TB per data node as OSDs.
Replication = 3
chooseleaf type rack in the CRUSH map (so the replicas are spread over the 3 racks)
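For reference, this is roughly what such a rule looks like in a decompiled CRUSH map (a sketch with placeholder names and numbers, not our exact map):
rule replicated_rack {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        # pick one leaf (OSD) from each of the required number of racks
        step chooseleaf firstn 0 type rack
        step emit
}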
We get only ca. 90 MByte/s on the iSCSI gateway servers with:
rados bench -p rbd 60 write -b 4M -t 1
If we test with:
rados bench -p rbd 60 write -b 4M -t 32
we get ca. 600-700 MByte/s.
We plan to replace the Samsung SSDs with Intel DC P3700 PCIe NVMe drives for the journals to get better single-thread performance.
Is there anyone out there who has an Intel P3700 as a journal and can share test results from:
rados bench -p rbd 60 write -b 4M -t 1
Thank you very much!!
Kind regards!!
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com