Re: Safely Upgrading OS on a live Ceph Cluster

Peter Maloney <peter.maloney@xxxxxxxxxxxxxxxxxxxx> · Tue, 28 Feb 2017 10:51:32 +0100



    On 02/27/17 18:01, Heller, Chris wrote:

    
      First I bring down the Ceph FS via `ceph mds
        cluster_down`.
      Second, to prevent OSDs from trying to repair data,
        I run `ceph osd set noout`
      Finally I stop the ceph processes in the following
        order: ceph-mds, ceph-mon, ceph-osd
      

    This is the wrong procedure. Most likely it only costs more CPU and
    memory on startup rather than causing broken behavior (unless you
    run out of RAM). After all, Ceph has to recover from power outages,
    so any order ought to work; some are just better than others.

    
    I am unsure about the cephfs part... but I would think you have it
    right, except I wouldn't run `ceph mds cluster_down` (though I don't
    know whether it's right to)... maybe try without that. I have never
    used it except when I wanted to remove all mds nodes and destroy all
    the cephfs data. And I didn't find any docs on what it really does,
    except that ceph won't let you remove all your mds and destroy the
    cephfs without it.

    
    The correct procedure as far as I know is:

    
    ## 1. cluster must be healthy; then set noout, norecover, norebalance, nobackfill

    ceph -s

    for s in noout norecover norebalance nobackfill; do ceph osd set "$s"; done
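A minimal sketch of step 1 as a guarded helper, in case it is useful: it only sets the flags if `ceph health` reports HEALTH_OK first. The function name `set_maintenance_flags` is made up for illustration. Note that the check has to come before the flags are set, since setting them changes the reported health.

```shell
# Hypothetical helper: refuse to set the flags unless the cluster is healthy.
set_maintenance_flags() {
    # Ask the monitors for the overall health string (e.g. "HEALTH_OK").
    health=$(ceph health) || return 1
    case "$health" in
        HEALTH_OK*) ;;
        *)
            echo "cluster not healthy, not setting flags: $health" >&2
            return 1
            ;;
    esac
    # Same four flags as in the loop above.
    for s in noout norecover norebalance nobackfill; do
        ceph osd set "$s" || return 1
    done
}
```

Call it as `set_maintenance_flags && echo "flags set"` from any node with an admin keyring.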

    
    ## 2. shut down all OSDs and then all the MONs - not MONs before OSDs

    # all nodes

    service ceph stop osd

    
    # see that all osds are down

    ceph osd tree

    
    # all nodes again

    ceph -s

    service ceph stop
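The "see that all osds are down" check can be scripted rather than eyeballed. This is only a sketch: the function name `all_osds_down` is hypothetical, and the pattern assumes each OSD row of `ceph osd tree` shows `osd.<id>` followed by its up/down status, which is how the output looks on the releases I have seen. It has to run while the MONs are still up, otherwise `ceph osd tree` cannot answer.

```shell
# Hypothetical helper: succeeds only if `ceph osd tree` lists no OSD as up.
all_osds_down() {
    # Fail with status 2 if the monitors cannot be reached at all.
    tree=$(ceph osd tree) || return 2
    # Count rows like "osd.3   up ..."; grep -c exits non-zero on 0 matches,
    # hence the "|| true".
    up=$(printf '%s\n' "$tree" \
        | grep -Ec 'osd\.[0-9]+[[:space:]]+up([[:space:]]|$)' || true)
    [ "${up:-0}" -eq 0 ]
}
```

For example: `all_osds_down && service ceph stop` on each node, once the OSDs have been stopped everywhere.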

    
    ## 3. start MONs before OSDs.

    # This already happens on boot per node, but not cluster wide. But
    # with the flags set, it likely doesn't matter. It seems unnecessary
    # on a small cluster.
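If you do want the MON-before-OSD order enforced cluster wide, a rough sketch could look like the following. It assumes passwordless ssh to every node and the same sysvinit-style `service ceph` script used above; the function name `start_cluster` and the host lists passed to it are placeholders, not anything Ceph provides.

```shell
# Hypothetical helper: start all MONs on all nodes, then all OSDs.
# $1 = space-separated MON hosts, $2 = space-separated OSD hosts.
start_cluster() {
    mon_hosts=$1
    osd_hosts=$2
    # First pass: monitors only, on every monitor host.
    for h in $mon_hosts; do
        ssh "$h" service ceph start mon || return 1
    done
    # Second pass: OSDs, only after every monitor start was issued.
    for h in $osd_hosts; do
        ssh "$h" service ceph start osd || return 1
    done
}
```

Usage would be something like `start_cluster "node1" "node1 node2 node3"`, with your own hostnames.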

    
    ## 4. unset the flags

    # see that all osds are up

    ceph -s

    ceph osd tree

    for s in noout norecover norebalance nobackfill; do ceph osd unset "$s"; done
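Step 4 can be guarded the same way, so the flags are not cleared while some OSD is still down. Again only a sketch: `unset_maintenance_flags` is a made-up name, and the pattern assumes `ceph osd tree` prints `osd.<id>` followed by `down` for OSDs that are not running.

```shell
# Hypothetical helper: unset the flags only if no OSD is reported down.
unset_maintenance_flags() {
    tree=$(ceph osd tree) || return 1
    if printf '%s\n' "$tree" | grep -Eq 'osd\.[0-9]+[[:space:]]+down'; then
        echo "some OSDs are still down; leaving flags set" >&2
        return 1
    fi
    # Same four flags that were set before the shutdown.
    for s in noout norecover norebalance nobackfill; do
        ceph osd unset "$s" || return 1
    done
}
```

If it refuses, check `ceph osd tree` by hand and start the missing OSDs before trying again.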

    
      Note my cluster has 1 mds and 1 mon, and 7 osd.
      

      I then install the new OS and then bring the cluster
        back up by walking the steps in reverse:
      

      First I start the ceph processes in the following
        order: ceph-osd, ceph-mon, ceph-mds
      Second I restore OSD functionality with `ceph osd
        unset noout`
      Finally I bring up the Ceph FS via `ceph mds
        cluster_up`
      

    Adjust those steps too... the mons should start first.

    
      Everything works smoothly except the Ceph FS bring
        up.[...snip...]
    
    
      How can I safely stop a Ceph cluster, so that it
        will cleanly start back up again?
      

    Don't know about the cephfs problem... all I can say is try the
    right general procedure and see if the result changes.

    
    (And I'd love to cite a source on why that's the right procedure and
    yours isn't, but I don't know what to cite... for example
    http://docs.ceph.com/docs/jewel/rados/operations/operating/#id8 says
    to use -a in the arguments, but doesn't say whether that applies to
    systemd or not, or what it does exactly. I have only seen it
    discussed in a few places, like the mailing list and IRC.)

    
        -Chris

          
      _______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

    