On Thursday 02 February 2012, Gregory Farnum wrote:
> On Wed, Feb 1, 2012 at 9:02 AM, Amon Ott <a.ott@xxxxxxxxxxxx> wrote:
> > ceph should have recovered here. Might also be caused by this setting
> > that I tried for a while, it is off now:
> > mds standby replay = true
> > With this setting, if the active mds gets killed, no mds is able to
> > become active, so everything hangs. Had to reboot again.
>
> Hrm. That setting simply tells the non-active MDSes that they should
> follow the journal of the active MDS(es). They should still go active
> if the MDS they're following fails, although it does slightly
> increase the chances of them running into the same bugs in code and
> dying at the same time.

Well, in this test the following MDS did not become active; it stayed
stuck in replay mode. Since takeover without standby replay is also fast
enough for us, I quickly turned the setting off again. I just wanted to
mention that there might be another bug lurking there.

> > Then killed the active mds, another takes over and suddenly the missing
> > file appears:
> > .tmp/tiny14/.config/pcmanfm/LXDE/
> > total 1
> > drwxr-xr-x 1 32252 users 393 Feb  1 15:19 .
> > drwxr-xr-x 1 32252 users   0 Feb  1 17:21 ..
> > -rw-r--r-- 1 root  root  393 Jan 24 15:55 pcmanfm.conf
>
> So were you able to delete that file once it reappeared?

It got deleted correctly by the nightly cleanup cron job.

> > Restarted the original mds, it does not appear in "ceph mds dump",
> > although it is running at 100% cpu. Same happened with other mds
> > processes after killing and starting, now I have only one left that is
> > working correctly.
>
> Remind me, you do only have one active MDS, correct?
> Did you look at the logs and see what the MDS was doing with 100% cpu?

Four MDS are defined, but with max mds = 1 only one of them is active.

> > Will leave the cluster in this state now and have another look tomorrow -
> > maybe the spinning mds processes recover by some miracle.

After yet another complete reboot it seems to be stable again now.

So far I can say that Ceph FS runs pretty stably under the following
conditions:
- 4 nodes with 8 to 12 CPU cores and 12 GB of RAM each (CPUs and RAM
  mostly unused by Ceph)
- Ceph 0.41
- 3 mon, 4 mds (max mds = 1), 4 osd; all 4 nodes are both server and
  client
- Kernel 3.1.10 with some hand-integrated patches for Ceph fixes (3.2.2
  was unstable, need to check)
- OSD storage on a normal hard drive, ext4 without journal (btrfs
  crashes about once per day)
- OSD journals on a separate SSD
- MDS does not get restarted without a reboot (yes, only a reboot helps)

From our experience, it could be a bit faster when writing into many
different files, which is our major workload. However, the last few
versions have already brought significant improvements in stability and
speed. The main problem we see is MDS takeover and recovery, which has
given us a lot of trouble so far.

Thanks once more for your good work!

Amon Ott
-- 
Dr. Amon Ott
m-privacy GmbH           Tel: +49 30 24342334
Am Köllnischen Park 1    Fax: +49 30 24342336
10179 Berlin             http://www.m-privacy.de

Amtsgericht Charlottenburg, HRB 84946
Managing directors: Dipl.-Kfm. Holger Maczkowsky, Roman Maczkowsky

GnuPG-Key-ID: 0x2DD3A649
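
P.S.: In case it helps anyone who wants to reproduce the setup described
above, here is a rough sketch of the relevant ceph.conf pieces. The host
names, addresses and device paths are only placeholders, not our real
values, and the mon/mds/osd sections for the other three nodes look the
same:

    [global]
            # one active MDS, the other three stay on standby
            max mds = 1

    [mon.0]
            host = node1
            mon addr = 10.0.0.1:6789
            mon data = /srv/ceph/mon.$id

    [mds]
            # standby replay is switched off again, see above
            mds standby replay = false

    [mds.0]
            host = node1

    [osd]
            osd data = /srv/ceph/osd.$id    # ext4 without journal, on the hard drive
            osd journal = /dev/sdb1         # journal partition on the separate SSD

    [osd.0]
            host = node1

The ext4 file systems for the OSD data are created without a journal and
mounted with user xattrs enabled, roughly like this (device and mount
point are again placeholders):

    mkfs.ext4 -O ^has_journal /dev/sda3
    mount -o user_xattr,noatime /dev/sda3 /srv/ceph/osd.0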