Sepia Long Running Cluster mishap

Around 18JUN2020 0700 UTC, an errant `sudo rm -rf ceph` from the root
directory on a senta unfortunately wiped out almost all data on the Ceph
cluster in our upstream Sepia lab (AKA Long Running Cluster or LRC).
Only teuthology job logs were preserved.

My guess is that because teuthology workers were actively writing job logs
and files, the /teuthology-archive directory didn't get entirely wiped out.

Here is a list of directories we lost:
    bz
    cephdrop (drop.ceph.com)
    cephfs-perf
    chacra (chacra.ceph.com)
    containers (quay.ceph.io)
    dgalloway
    diskprediction_config.txt
    doug-is-great
    el8
    filedump.ceph.com
    firmware
    home.backup01
    home.gitbuilder-archive
    job1.0.0
    jspray.senta02.home.tar.gz
    old.repos
    post (files submitted using ceph-post-file)
    sftp (drop.ceph.com/qa)
    shaman
    signer (signed upstream release packages)
    tmp
    traces

While I /did/ have backups of chacra.ceph.com binaries, the amount of
data (> 1TB) backed up was too much to keep snapshots of.  My daily
backup script performs an `rsync --delete-delay`, so if files are gone on
the source, they get deleted from the backup.  This is fine (and
preferred) for backups we have snapshots of.  However, the backup script
ran *after* the errant `rm -rf`, so unfortunately everything on
chacra.ceph.com is gone.  I have patched the backup script to *not* use
`--delete-delay` for backups that we don't keep snapshots of.
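For the curious, the patched logic looks roughly like this.  This is a
simplified sketch, not the actual script; the paths, hostnames, and
variable names are made up for illustration:

    #!/bin/bash
    # Simplified sketch of the patched backup logic (illustrative paths only).
    SRC="/srv/chacra/"
    DEST="backup01:/backups/chacra/"
    HAVE_SNAPSHOTS="no"   # set per backup; only snapshotted backups mirror deletions

    RSYNC_OPTS=(-a --partial)
    if [ "$HAVE_SNAPSHOTS" = "yes" ]; then
        # Snapshots keep older copies of the backup around, so mirroring
        # deletions from the source is safe here.
        RSYNC_OPTS+=(--delete-delay)
    fi

    # Without --delete-delay, files removed on the source stay in the backup.
    rsync "${RSYNC_OPTS[@]}" "$SRC" "$DEST"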

I restored the vagrant and valgrind chacra.ceph.com repos after seeing
teuthology jobs fail because those repos were missing.  Kefu also
rebuilt and pushed ceph-libboost 1.72.  (THANK YOU, KEFU!)

We started using the quay.ceph.io registry (instead of quay.io) on June
17.  Containers pushed to that registry were stored on the LRC as well,
so I had to delete the repo and start over this morning.  Anything you
see in the web UI should pull without issue:
https://quay.ceph.io/repository/ceph-ci/ceph?tab=tags
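For example, pulling any tag listed in that UI should work with podman or
docker (the <tag> below is a placeholder; substitute a real tag from the UI):

    # <tag> is a placeholder -- use any tag shown in the quay.ceph.io web UI
    podman pull quay.ceph.io/ceph-ci/ceph:<tag>
    # or
    docker pull quay.ceph.io/ceph-ci/ceph:<tag>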

To prevent data loss in the future, Patrick graciously set up new
filesystems and client credentials on the LRC.  Because senta{02..04}
are considered developer playgrounds, all users have sudo access.  The
sentas now mount /teuthology-archive read-only at /teuthology.  If you
need to unzip and inspect log files on a senta, you can do so in
/scratch (another new filesystem on the LRC).
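For example, the workflow on a senta would look something like this.  The
log path is illustrative, the exact layout under /teuthology will vary,
and the per-user directory under /scratch is just a suggestion:

    # /teuthology is read-only, so copy logs into /scratch before unzipping
    mkdir -p /scratch/$USER
    cp /teuthology/<run>/<job-id>/teuthology.log.gz /scratch/$USER/
    gunzip /scratch/$USER/teuthology.log.gz
    less /scratch/$USER/teuthology.log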

It will likely take weeks of "where did X go" e-mails to mailing lists,
job and build failures, bugs filed, IRC pings, etc. for me to find and
restore everything that was used on a regular basis.  I appreciate your
patience and understanding in the meantime.

Take care & be well,
-- 
David Galloway
Systems Administrator, RDU
Ceph Engineering
IRC: dgalloway