Sorry about the formatting, here it is again; I hope it's readable now.
For each test it shows which services I killed on which node. After
each test I restored all services.

 1. mds @ beta            OK
 2. mds @ alpha           OK
 3. mds+osd @ beta        FAILED
                          switch ok {0=alpha=up:active}, but FS not readable
                          FS permanently frozen
    rebooted the whole cluster
 4. mds+mon @ alpha       OK (32 sec)
    rebooted the whole cluster
 5. mds+osd @ beta        OK (25 sec)
    rebooted the whole cluster
 6. mds+osd @ beta        OK (24 sec)
 7. mds+osd @ alpha       OK (30 sec)
 8. mds+mon+osd @ beta    OK (27 sec)
 9. power unplug @ alpha  FAILED
                          stuck in {0=beta=up:replay} for a long time
                          finally it's switching to {0=alpha=up:active}, but FS not readable
                          FS permanently frozen, even when bringing up alpha...

I included all the tests to show what worked and what didn't.
Note that the mds+osd kill worked most of the time, but there was also
one problematic run (test 3). Also note that the power unplug test
FAILED every time; I included only one of those runs.

On Tue, Dec 20, 2011 at 10:50 PM, Gregory Farnum
<gregory.farnum@xxxxxxxxxxxxx> wrote:
> On Tue, Dec 20, 2011 at 10:07 AM, Karoly Horvath <rhswdev@xxxxxxxxx> wrote:
>> Hi,
>> all tests were made with kill -9, killing the active mds (and sometimes
>> other processes). I waited a couple of minutes between each test to
>> make sure that the cluster reached a stable state. (btw: how can I
>> check this programmatically?)
> You can run "ceph health", which has only a few different values you
> can look for. :)
>
>> # KILLED result1. mds @ beta OK2. mds @ alpha
>> OK3. mds+osd @ beta FAILED switch ok
>> {0=alpha=up:active}, but FS not readable FS
>> permanently freezed rebooted the whole cluster4.
>> mds+mon @ alpha OK (32 sec) rebooted the whole
>> cluster5. mds+osd @ beta OK (25 sec) rebooted the
>> whole cluster6. mds+osd @ beta OK (24 sec)7. mds+osd @ alpha OK (30
>> sec)8. mds+mon+osd @ beta OK (27 sec)9. power unplug @ alpha FAILED
>> stuck in {0=beta=up:replay} for a long time
>> finally it's switching to {0=alpha=up:active}, but FS not
>> readable FS permanently freezed, even when bringing
>> up alpha...
> Your formatting got pretty mangled here, and I'm still not sure what's
> going on. Did you restart all the daemons between each kill attempt?
> (for instance, it looks like '1' is to kill mds.beta; '2' is to kill
> mds.alpha, and then '3' is to kill mds.beta, but you already did
> that)
>
>> I uploaded test results here:
>> http://www.4shared.com/file/5nXMw_sM/cephlogs_mds_test.html?
>> If you need any other configuration options changed, let me know
> Sorry, I should have been clearer when I said turn on mds logging. Add
> "debug mds = 20" and "debug ms = 1" lines to your ceph.conf MDS
> sections. This will spit out a lot more information about what's going
> on internally, which will help us diagnose this. :)

I had those lines, and the log seemed to be quite verbose... let me know
if it didn't work.

--
Karoly Horvath
rhswdev@xxxxxxxxx
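
For reference, the "ceph health" suggestion quoted above could be
scripted roughly as follows. This is only a minimal sketch: it assumes
the ceph command is on the PATH and that its output begins with
HEALTH_OK once the cluster is stable. The function names, the timeout,
the polling interval, and the `run` hook are illustrative and not part
of Ceph.

```python
import subprocess
import time


def is_healthy(status_line):
    """Return True when a `ceph health` output line reports HEALTH_OK."""
    return status_line.strip().startswith("HEALTH_OK")


def wait_until_healthy(timeout=300.0, interval=5.0, run=None):
    """Poll `ceph health` until it reports HEALTH_OK or the timeout expires.

    `run` is a hypothetical hook that returns the command output as a
    string; by default it shells out to the real `ceph health` command.
    """
    if run is None:
        def run():
            result = subprocess.run(
                ["ceph", "health"], capture_output=True, text=True
            )
            return result.stdout

    deadline = time.monotonic() + timeout
    while True:
        if is_healthy(run()):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling wait_until_healthy() between two kill tests would replace the
fixed "couple of minutes" wait; a False return means the cluster did
not report HEALTH_OK within the timeout.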