On Tue, Dec 20, 2011 at 10:07 AM, Karoly Horvath <rhswdev@xxxxxxxxx> wrote:
> Hi,
>
> all test were made with kill -9, killing the active mds (and sometimes
> other processes). I waited a couple of minutes between each test to
> make sure that the cluster reached a stable state. (btw: how can I
> check this programmatically?)

You can run "ceph health", which has only a few different values you
can look for. :)

> #  KILLED                result
> 1. mds @ beta            OK
> 2. mds @ alpha           OK
> 3. mds+osd @ beta        FAILED
>                          switch ok {0=alpha=up:active}, but FS not readable
>                          FS permanently freezed
>    rebooted the whole cluster
> 4. mds+mon @ alpha       OK (32 sec)
>    rebooted the whole cluster
> 5. mds+osd @ beta        OK (25 sec)
>    rebooted the whole cluster
> 6. mds+osd @ beta        OK (24 sec)
> 7. mds+osd @ alpha       OK (30 sec)
> 8. mds+mon+osd @ beta    OK (27 sec)
> 9. power unplug @ alpha  FAILED
>                          stuck in {0=beta=up:replay} for a long time
>                          finally it's switching to {0=alpha=up:active}, but FS not readable
>                          FS permanently freezed, even when bringing up alpha...

Your formatting got pretty mangled here, and I'm still not sure what's
going on. Did you restart all the daemons between each kill attempt?
(For instance, it looks like '1' is to kill mds.beta; '2' is to kill
mds.alpha, and then '3' is to kill mds.beta, but you already did that.)

> I uploaded test results here:
> http://www.4shared.com/file/5nXMw_sM/cephlogs_mds_test.html?
> If you need any other configuration options changed, let me know

Sorry, I should have been clearer when I said turn on mds logging. Add
"debug mds = 20" and "debug ms = 1" lines to your ceph.conf MDS
sections. This will spit out a lot more information about what's going
on internally, which will help us diagnose this. :)
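
For the "how can I check this programmatically?" question, here is a rough sketch of polling "ceph health" between kill tests. It assumes the first word of the output is a status token such as HEALTH_OK, HEALTH_WARN or HEALTH_ERR, and that the "ceph" binary is on the PATH. The function names (parse_health, ceph_health, wait_until_healthy) and the timeout/interval defaults are made up for illustration, not part of Ceph.

```python
import subprocess
import time


def parse_health(output):
    """Return the leading status token of `ceph health` output.

    Assumption: the status (e.g. HEALTH_OK) is the first
    whitespace-separated word; empty output yields "UNKNOWN".
    """
    tokens = output.split()
    return tokens[0] if tokens else "UNKNOWN"


def ceph_health():
    """Run `ceph health` and return its raw stdout (needs a working cluster)."""
    result = subprocess.run(
        ["ceph", "health"], capture_output=True, text=True, check=True
    )
    return result.stdout


def wait_until_healthy(run=ceph_health, timeout=300.0, interval=5.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll until the status is HEALTH_OK; return False once `timeout` expires.

    `run`, `sleep` and `clock` are injectable so the loop can be
    exercised without a cluster.
    """
    deadline = clock() + timeout
    while True:
        if parse_health(run()) == "HEALTH_OK":
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Calling wait_until_healthy() before each kill would replace the fixed "couple of minutes" wait with an explicit check, and a False return tells you the cluster never settled before the next test.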