I'm new to Ceph, and considering using it to store a bunch of static files in the RADOS Gateway. Our files are all versioned, so we never modify them; we only add new files and delete unused ones.
I'm trying to figure out how to back everything up, to protect against administrative and application errors.
I'm thinking about building one Ceph cluster that spans my primary and backup datacenters, with a CRUSH rule that stores two replicas in each datacenter (a rough sketch of the rule is below).
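This assumes a crushmap whose hierarchy defines "datacenter" buckets and pools running with size 4; the names are placeholders, not a tested rule:

    rule rgw_two_dc {
            ruleset 1
            type replicated
            min_size 4
            max_size 4
            # pick 2 datacenters, then 2 OSDs on distinct hosts in each
            step take default
            step choose firstn 2 type datacenter
            step chooseleaf firstn 2 type host
            step emit
    }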
I want to use BtrFS snapshots, along the lines of http://blog.rot13.org/2010/02/using_btrfs_snapshots_for_incremental_backup.html, but automated and with cleanup. I'm doing something similar now on my NFS servers with ZFS and a tool called zfs-snapshot-mgmt; a rough sketch of the cron job I'm picturing follows.
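Something like this, per OSD; the paths, the one-OSD layout, and the 7-day retention are placeholders, and I realize a snapshot of a live OSD is crash-consistent at best:

    #!/bin/bash
    # Hourly cron job: snapshot one OSD's btrfs subvolume, then prune.
    # Assumes /dev/sdb1 is mounted at /srv with subvolumes osd.0 and snap/.
    STAMP=$(date +%Y%m%d-%H%M)

    # Read-only snapshot; snap/ lives on the same btrfs filesystem.
    btrfs subvolume snapshot -r /srv/osd.0 "/srv/snap/osd.0@$STAMP"

    # The YYYYmmdd-HHMM stamp sorts chronologically, so a plain string
    # compare finds snapshots older than the cutoff.
    CUTOFF=$(date -d '7 days ago' +%Y%m%d-%H%M)
    for s in /srv/snap/osd.0@*; do
        [ -d "$s" ] || continue
        [[ "${s##*@}" < "$CUTOFF" ]] && btrfs subvolume delete "$s"
    done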
I read that only XFS is recommended for production clusters, since BtrFS itself is still beta. Any idea how long until BtrFS is usable in production? I'd prefer to run Ceph on ZFS, but I see there are some outstanding issues in the tracker. Is anybody doing Ceph on ZFS in production? ZFS itself seems to be farther along than BtrFS. Are there plans to make ZFS a first-class supported filesystem for Ceph?
In the event of an operator or code error, I would mount the correct BtrFS snapshot on all nodes in the backup datacenter, somewhere like /var/lib/ceph.restore/. Then I'd make a copy of ceph.conf and start building a temporary cluster, made up of only the backup datacenter machines, that runs on a non-standard port. The normal cluster would stay up and running. Once the temporary cluster is up, I'd manually restore the RADOS Gateway objects that needed to be restored. The rough shape of what I mean is below.
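Per node, I'm imagining something like this; every name, path, and port here is a placeholder, and I know the restored mon stores would still hold the original monmap, which I'd presumably have to rewrite with monmaptool to move the monitors to the new port:

    # Clone the chosen read-only snapshot to a writable subvolume and
    # mount it where the temporary cluster expects its data.
    btrfs subvolume snapshot /srv/snap/osd.0@20130611-0300 /srv/osd.0-restore
    mkdir -p /var/lib/ceph.restore/osd/ceph-0
    mount -o subvol=osd.0-restore /dev/sdb1 /var/lib/ceph.restore/osd/ceph-0

    # Start daemons against a copy of ceph.conf that lists only the
    # backup-DC hosts, points at the /var/lib/ceph.restore paths, and
    # puts the monitors on a non-standard port.
    ceph-mon -c /etc/ceph/ceph-restore.conf -i a
    ceph-osd -c /etc/ceph/ceph-restore.conf -i 0
    radosgw -c /etc/ceph/ceph-restore.conf -n client.radosgw.restore

From there I'd pull the needed objects back out through the temporary radosgw, or with rados get against the .rgw.buckets pool, and push them into the live cluster.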
If there was ever a full-cluster problem, say I did something stupid like rados rmpool metadata, I'd shut down the whole cluster, revert all of the BtrFS partitions to the last known good snapshot, and re-format all of the XFS partitions. Then I'd start the cluster up again and let Ceph replicate everything back onto the freshly formatted partitions. I'd lose recent data, but that's better than losing all of the data.
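On each node, I'm picturing roughly this (again, all names are placeholders, and it's untested):

    #!/bin/bash
    # Rough per-node rollback; assumes one btrfs OSD (osd.0 on /dev/sdb1,
    # mounted at /srv) and one xfs OSD (osd.1 on /dev/sdc1).
    service ceph stop

    # btrfs: clone the last good read-only snapshot to a writable
    # subvolume and mount it in place of the live one.
    umount /var/lib/ceph/osd/ceph-0
    btrfs subvolume snapshot /srv/snap/osd.0@lastgood /srv/osd.0-rollback
    mount -o subvol=osd.0-rollback /dev/sdb1 /var/lib/ceph/osd/ceph-0

    # xfs: nothing to revert to, so hand Ceph an empty, re-initialized
    # OSD and let replication backfill it.
    umount /var/lib/ceph/osd/ceph-1
    mkfs.xfs -f /dev/sdc1
    mount /dev/sdc1 /var/lib/ceph/osd/ceph-1
    ceph-osd -i 1 --mkfs --mkkey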
Obviously, both of these scenarios would need a lot of testing and many practice runs before they're viable. Has anybody tried this before? If not, do you see any problems with the theory?