Re: Managing larger ceph clusters

Steve Anthony <sma310@xxxxxxxxxx> · Fri, 17 Apr 2015 15:07:34 -0400



    For reference, I'm currently running 26 nodes (338 OSDs); will be 35
    nodes (455 OSDs) in the near future.

    
    Node/OSD provisioning and replacements:

    
    Mostly I'm using ceph-deploy, at least to do node/osd adds and
    replacements. Right now the process is:

    
    Use FAI (http://fai-project.org) to set up software RAID1/LVM for the
    OS disks, and do a minimal installation, including the salt-minion.

    
    Accept the new minion on the salt-master node and deploy the
    configuration: LDAP auth, nrpe, diamond collector, udev
    configuration, custom python disk add script, and everything on the
    Ceph preflight page
    (http://ceph.com/docs/firefly/start/quick-start-preflight/).

    
    Insert the journals into the case. Udev triggers my python code,
    which partitions the SSDs and fires a Prowl alert
    (http://www.prowlapp.com/) to my phone when it's finished.

    
    Insert the OSDs into the case. Same thing: udev triggers the python
    code, which selects the next available partition on the journals so
    OSDs go on journal1partA, journal2partA, journal3partA,
    journal1partB,... for the three journals in each node. The code then
    fires a salt event at the master node with the OSD dev path, journal
    /dev/disk/by-id/ path and node hostname. The salt reactor on the
    master node takes this event and runs a script on the admin node
    which passes those parameters to ceph-deploy, which does the OSD
    deployment. A Prowl alert with details is sent on success or
    failure.
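
    As an illustration only (this is not the actual udev-triggered
    script), the round-robin journal selection described above could be
    sketched like this. The partition names follow the
    journal1partA-style labels used above; the function names, the
    in_use set, and the default of five partitions per journal are
    assumptions made for the example.

```python
import string


def next_free_partition(journals, parts_per_journal, in_use):
    """Return the next unused journal partition in round-robin order.

    Partition "A" is handed out on every journal before partition "B"
    is used on any of them, so OSDs are spread evenly across journals.
    Returns None when every partition is already taken.
    """
    for letter in string.ascii_uppercase[:parts_per_journal]:
        for journal in journals:
            name = f"{journal}part{letter}"
            if name not in in_use:
                return name
    return None


def assign_all(n_osds, journals, parts_per_journal=5):
    """Simulate inserting n_osds disks one after another (hypothetical helper)."""
    in_use = set()
    order = []
    for _ in range(n_osds):
        name = next_free_partition(journals, parts_per_journal, in_use)
        if name is None:
            raise RuntimeError("no free journal partition left")
        in_use.add(name)
        order.append(name)
    return order


JOURNALS = ["journal1", "journal2", "journal3"]
print(assign_all(4, JOURNALS))
```

    In a real deployment the in_use set would have to come from
    inspecting the existing OSDs' journal links rather than from an
    in-memory set, and a replacement OSD would naturally fall into the
    slot freed by the failed one.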

    
    Similarly, when an OSD fails, I remove it, and insert the new OSD.
    The same process as above occurs. Logical removal I do manually,
    since I'm not at a scale where it's common yet. Eventually, I
    imagine I'll write code to trigger OSD removal on certain events
    using the same event/reactor Salt framework.

    
    Pool/CRUSH management:

    
    Pool configuration and CRUSH management are mostly one-time
    operations. That is, I'll make a change rarely and when I do it will
    persist in that new state for a long time. Given that and the fact
    that I can make the changes from one node and inject them into the
    cluster, I haven't needed to automate that portion of Ceph as I've
    added more nodes, at least not yet.

    
    Replacing journals:

    
    I haven't had to do this yet; I'd probably remove and re-add all the
    OSDs if it happened today, but I will be reading the post you linked.

    
    Upgrading releases:

    
    Change /etc/apt/sources.list.d/ceph.list to point at the new release
    and push it to all the nodes with Salt. Then run
    salt -N 'ceph' pkg.upgrade to upgrade the packages on all the nodes
    in the ceph nodegroup. Next, use Salt to restart the monitors, then
    the OSDs on each node, one by one. Finally, run the following command
    on all nodes with Salt to verify that all monitors/OSDs are using the
    new version:

    
    for i in /var/run/ceph/ceph-*.asok; do echo "$i"; ceph --admin-daemon "$i" version; done
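
    If eyeballing that output gets tedious, the comparison could be
    scripted. The sketch below assumes the admin socket's version
    command prints a small JSON object with a "version" key (check this
    against your release), and that the per-socket output has already
    been collected, for example via Salt. find_stragglers and the sample
    data are hypothetical.

```python
import json


def find_stragglers(version_output, expected):
    """Return {socket_path: reported_version} for daemons not on `expected`.

    `version_output` maps an admin socket path to the raw text printed
    by `ceph --admin-daemon <socket> version`. Output that cannot be
    parsed is reported with a version of None so it is not missed.
    """
    stragglers = {}
    for sock, raw in version_output.items():
        try:
            parsed = json.loads(raw)
        except ValueError:
            parsed = None
        found = parsed.get("version") if isinstance(parsed, dict) else None
        if found != expected:
            stragglers[sock] = found
    return stragglers


# Made-up sample data for illustration.
sample = {
    "/var/run/ceph/ceph-osd.0.asok": '{"version":"0.94.1"}',
    "/var/run/ceph/ceph-osd.1.asok": '{"version":"0.80.9"}',
    "/var/run/ceph/ceph-mon.a.asok": "not json",
}
print(find_stragglers(sample, "0.94.1"))
```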

    
    Node decommissioning:

    
    I have a script which enumerates all the OSDs on a given host and
    stores that list in a file. Another script (run by cron every 10
    minutes) checks if the cluster health is OK, and if so pops the next
    OSD from that file and executes the steps to remove it from the
    host, trickling the node out of service.
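
    A minimal sketch of that cron-driven loop, assuming the queue file
    holds one OSD id per line. It is not the actual script: trickle_step
    is a hypothetical name, and the health check and removal steps are
    passed in as callables so the example stays self-contained. In
    practice cluster_healthy might wrap a check that "ceph health"
    reports HEALTH_OK, and remove_osd would run the usual out / stop /
    crush remove / auth del / rm sequence.

```python
from pathlib import Path


def trickle_step(queue_file, cluster_healthy, remove_osd):
    """Remove at most one OSD per invocation, and only if the cluster is healthy.

    Returns the OSD id that was removed, or None if nothing was done
    (missing or empty queue, or unhealthy cluster).
    """
    path = Path(queue_file)
    if not path.exists():
        return None
    pending = [line.strip() for line in path.read_text().splitlines() if line.strip()]
    if not pending or not cluster_healthy():
        return None
    osd_id, rest = pending[0], pending[1:]
    # Remove first and rewrite the queue afterwards, so a failed removal
    # (an exception here) leaves the OSD at the head of the queue.
    remove_osd(osd_id)
    path.write_text("".join(f"{osd}\n" for osd in rest))
    return osd_id
```

    Run from cron every few minutes, each call either does nothing or
    takes one more OSD out, which gives the cluster time to rebalance
    and return to a healthy state between removals.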

    
    On 04/17/2015 02:18 PM, Craig Lewis wrote:

    
      I'm running a small cluster, but I'll chime in since nobody else
      has.

      CERN had a presentation a while ago (dumpling time-frame) about
      their deployment.  They go over some of your questions:
      http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern
        

      My philosophy on Config Management is that it should save me
      time.  If it's going to take me longer to write a recipe to do
      something, I'll just do it by hand.  Since my cluster is small,
      there are many things I can do faster by hand.  This may or may
      not work for you, depending on your documentation /
      repeatability requirements.  For things that need to be
      documented, I'll usually write the recipe anyway (I accept Chef
      recipes as documentation).
        
        
      For my clusters, I'm using Chef to set up all nodes and manage
      ceph.conf.  I manually manage my pools, CRUSH map, RadosGW users,
      and disk replacement.  I was using Chef to add new disks, but I
      ran into load problems due to my small cluster size.  I'm
      currently adding disks manually, to manage cluster load better.
      As my cluster gets larger, that'll be less important.
        

      I'm also doing upgrades manually, because it's less work than
      writing the Chef recipe to do a cluster upgrade.  Since Chef isn't
      cluster aware, it would be a pain to make the recipe cluster
      aware enough to handle the upgrade.  And I figure if I stall long
      enough, somebody else will write it :-)  Ansible, with its
      cluster-wide coordination, looks like it would handle that a bit
      better.
        

        On Wed, Apr 15, 2015 at 2:05 PM, Stillwell, Bryan
        <bryan.stillwell@xxxxxxxxxxx> wrote:

            I'm curious what people managing larger ceph clusters are doing
            with configuration management and orchestration to simplify
            their lives?

            We've been using ceph-deploy to manage our ceph clusters so far,
            but feel that moving the management of our clusters to standard
            tools would provide a little more consistency and help prevent
            some mistakes that have happened while using ceph-deploy.

            We're looking at using the same tools we use in our OpenStack
            environment (puppet/ansible), but I'm interested in hearing from
            people using chef/salt/juju as well.

            Some of the cluster operation tasks that I can think of along
            with ideas/concerns I have are:

            
            Keyring management
              Seems like hiera-eyaml is a natural fit for storing the
              keyrings.

            ceph.conf
              I believe the puppet ceph module can be used to manage this
              file, but I'm wondering if using a template (erb?) might be a
              better method of keeping it organized and properly documented.

            
            Pool configuration
              The puppet module seems to be able to handle managing replicas
              and the number of placement groups, but I don't see support
              for erasure coded pools yet.  This is probably something we
              would want the initial configuration to be set up by puppet,
              but not something we would want puppet changing on a
              production cluster.

            CRUSH maps
              Describing the infrastructure in yaml makes sense.  Things
              like which servers are in which rows/racks/chassis.  Also
              describing the type of server (model, number of HDDs, number
              of SSDs) makes sense.

            CRUSH rules
              I could see puppet managing the various rules based on the
              backend storage (HDD, SSD, primary affinity, erasure coding,
              etc).

            
            Replacing a failed HDD disk
              Do you automatically identify the new drive and start using it
              right away?  I've seen people talk about using a combination
              of udev and special GPT partition IDs to automate this.  If
              you have a cluster with thousands of drives I think automating
              the replacement makes sense.  How do you handle the journal
              partition on the SSD?  Does removing the old journal partition
              and creating a new one create a hole in the partition map
              (because the old partition is removed and the new one is
              created at the end of the drive)?

            Replacing a failed SSD journal
              Has anyone automated recreating the journal drive using
              Sebastien Han's instructions, or do you have to rebuild all
              the OSDs as well?

              http://www.sebastien-han.fr/blog/2014/11/27/ceph-recover-osds-after-ssd-journal-failure/

            
            Adding new OSD servers
              How are you adding multiple new OSD servers to the cluster?  I
              could see an ansible playbook which sets the nobackfill,
              noscrub, and nodeep-scrub flags followed by adding all the
              OSDs to the cluster being useful.

            Upgrading releases
              I've found an ansible playbook for doing a rolling upgrade
              which looks like it would work well, but are there other
              methods people are using?

              http://www.sebastien-han.fr/blog/2015/03/30/ceph-rolling-upgrades-with-ansible/

            
            Decommissioning hardware
              Seems like another ansible playbook for reducing the OSDs'
              weights to zero, marking the OSDs out, stopping the service,
              removing the OSD ID, removing the CRUSH entry, unmounting the
              drives, and finally removing the server would be the best
              method here.  Any other ideas on how to approach this?

            
            That's all I can think of right now.  Are there any other tasks
            that people have run into that are missing from this list?

            Thanks,
            Bryan

            

            _______________________________________________
            ceph-users mailing list
            ceph-users@xxxxxxxxxxxxxx
            http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

          

    
    -- 
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma310@xxxxxxxxxx
  
