Re: 1256 OSD/21 server ceph cluster performance issues.

Awesome! I have yet to hear any ZFS-in-Ceph chat, nor have I seen it come up on the mailing lists that I've caught. I would assume it would function pretty well, considering how long ZFS has been in use alongside some production systems I have seen. I have little to no experience with it personally, though.

I thought the rados issue was weird as well. Even with a degraded cluster I feel like I should be getting better throughput, unless I hit an object with a bunch of bad PGs or something. We are using two dual-port 10G cards in LACP to get over 10G on average, and we have separate gateway nodes (went with the Supermicro kit after all), so CPU on those nodes shouldn't be an issue. CPU usage there is extremely low right now, which is again surprising.
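The bond itself is a pretty stock 802.3ad setup; roughly what the /etc/network/interfaces stanza looks like (interface names and addresses here are placeholders, not our exact config):

    auto bond0
    iface bond0 inet static
        address 10.0.0.10
        netmask 255.255.255.0
        bond-slaves eth2 eth3
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4
        bond-miimon 100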

I honestly think this is some kind of radosgw bug in Giant, as I have another Giant cluster with the exact same config that is performing much better on much less hardware. Hopefully it is indeed a bug of some sort and not yet another screw-up on my end. Better yet, hopefully I find the bug and fix it for others to find and profit from ^_^.

Thanks for all of your help!


On 12/22/2014 05:26 PM, Craig Lewis wrote:


On Mon, Dec 22, 2014 at 2:57 PM, Sean Sullivan <seapasulli@xxxxxxxxxxxx> wrote:
Thanks Craig!

I think that this may very well be my issue with OSDs dropping out, but I am still not certain, as I had the cluster up for a small period while running rados bench for a few days without any status changes.
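For reference, the benchmark runs were along these lines (pool name and duration here are just illustrative, not the exact invocations):

    rados bench -p bench-pool 300 write --no-cleanup
    rados bench -p bench-pool 300 seq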

Mine were fine for a while too, through several benchmarks and a large RadosGW import.  My problems were memory pressure plus an XFS bug, so it took a while to manifest.  When it did, all of the ceph-osd processes on that node would have periods of ~30 seconds at 100% CPU.  Some OSDs would get kicked out.  Once that started, it was a downward spiral: recovery caused more load, which caused more OSDs to get kicked out...

Once I found the memory problem, I cronned a buffer flush, and that usually kept things from getting too bad.
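The cron job was nothing fancy; something along these lines (the interval and the drop_caches level shown are just illustrative):

    # /etc/cron.d entry: sync dirty data and drop the page cache every 15 minutes
    */15 * * * *  root  sync && echo 1 > /proc/sys/vm/drop_caches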

I was able to see on the CPU graphs that CPU usage was climbing before the problems started.  Once it got close to 100% on all cores, the OSDs started dropping out.  It's hard to say whether it was the CPU itself, or whether the CPU was just a symptom of the memory pressure plus the XFS bug.


 
The real big issue I have right now is the radosgw one. Once I figure out the root cause of the slow radosgw performance and correct it, that should hopefully buy me enough time to figure out the slow OSD issue.

It just doesn't make sense that I am getting 8 Mbps per client whether I run 1 client or 60, while rbd and rados shoot well above 600 MB/s (above 1000 as well).

That is strange.  I was able to get >300 Mbps per client on a 3-node cluster with GigE.  I expected each client to saturate the GigE on its own, but 300 Mbps is more than enough for now.

I am using the Ceph Apache and FastCGI modules, but otherwise it's a pretty standard Apache setup.  My RadosGW processes are using a fair amount of CPU, but as long as you have some idle CPU, that shouldn't be the bottleneck.
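Roughly the setup from the docs, i.e. mod_fastcgi pointing at an external radosgw process (the hostname, socket path, and instance name here are placeholders rather than my exact config):

    FastCgiExternalServer /var/www/s3gw.fcgi -socket /var/run/ceph/ceph.radosgw.gateway.fastcgi.sock

    <VirtualHost *:80>
        ServerName rgw.example.com
        DocumentRoot /var/www
        RewriteEngine On
        RewriteRule ^/(.*) /s3gw.fcgi?%{QUERY_STRING} [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]
    </VirtualHost>

with /var/www/s3gw.fcgi being a one-line wrapper:

    #!/bin/sh
    exec /usr/bin/radosgw -c /etc/ceph/ceph.conf -n client.radosgw.gateway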
 

 

May I ask how you are monitoring your cluster's logs? Are you just using rsyslog, or do you have a Logstash-type system set up? Load-wise, I do not see a spike until I pull an OSD out of the cluster, or stop and then start an OSD without setting nodown.

I'm monitoring the cluster with Zabbix, and that gives me pretty much the same info that I'd get from the logs.  I am planning to start pushing the logs to Logstash as soon as my Logstash setup is able to handle the extra load.
 

I do think that CPU is probably the cause of the slow OSD issue, though, as it makes the most logical sense. Did you end up dropping Ceph and moving to ZFS, or did you stick with it and try to mitigate the problem via the file flusher and other tweaks?


I'm still on Ceph.  I worked around the memory pressure by reformatting my XFS filesystems to use regular-sized inodes.  It was a rough couple of months, but everything has been stable for the last two months.
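Concretely, that reformat just meant recreating each OSD filesystem with the default 256-byte inodes instead of the larger inodes (e.g. -i size=2048) that were often recommended for Ceph OSDs (the device name below is a placeholder):

    mkfs.xfs -f -i size=256 /dev/sdX1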

I do still want to use ZFS on my OSDs.  It has all the features of Btrfs, with the extra feature of being production-ready.  It's just not production-ready in Ceph yet.  It's coming along nicely though, and I hope to reformat one node to be all ZFS sometime next year.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
