On Tue, Dec 20, 2011 at 3:42 PM, Karoly Horvath <rhswdev@xxxxxxxxx> wrote:
> Sorry about the formatting, here it is again, I hope it's readable now
>
> for each test it shows which services I killed on which node. after the test
> I restored all services.
>
> 1. mds @ beta            OK
>
> 2. mds @ alpha           OK
>
> 3. mds+osd @ beta        FAILED
>    switch ok {0=alpha=up:active}, but FS not readable
>    FS permanently freezed
>
> rebooted the whole cluster
>
> 4. mds+mon @ alpha       OK (32 sec)
>
> rebooted the whole cluster
>
> 5. mds+osd @ beta        OK (25 sec)
>
> rebooted the whole cluster
>
> 6. mds+osd @ beta        OK (24 sec)
>
> 7. mds+osd @ alpha       OK (30 sec)
>
> 8. mds+mon+osd @ beta    OK (27 sec)
>
> 9. power unplug @ alpha  FAILED
>    stuck in {0=beta=up:replay} for a long time
>    finally it's switching to {0=alpha=up:active}, but FS not readable
>    FS permanently freezed, even when bringing up alpha...
>
> I included all the tests to show what worked and what didn't.
> note that the mds+osd worked most of the time but there was also a
> problematic test.
> also note that the power unplug test FAILED all the time, I included only
> one test.
>
>
> On Tue, Dec 20, 2011 at 10:50 PM, Gregory Farnum
> <gregory.farnum@xxxxxxxxxxxxx> wrote:
>> On Tue, Dec 20, 2011 at 10:07 AM, Karoly Horvath <rhswdev@xxxxxxxxx>
>> wrote:
>>> Hi,
>>> all test were made with kill -9, killing the active mds (and sometimes
>>> other processes).I waited a couple of minutes between each test to
>>> make sure that the cluster reached a stable state.(btw: how can I
>>> check this programmatically?)
>> You can run "ceph health", which has only a few different values you
>> can look for. :)
>>
>>> # KILLED result1. mds @ beta OK2. mds @ alpha
>>> OK3. mds+osd @ beta FAILED switch ok
>>> {0=alpha=up:active}, but FS not readable FS
>>> permanently freezed rebooted the whole cluster4.
>>> mds+mon @ alpha OK (32 sec) rebooted the whole
>>> cluster5. mds+osd @ beta OK (25 sec) rebooted the
>>> whole cluster6. mds+osd @ beta OK (24 sec)7. mds+osd @ alpha OK (30
>>> sec)8. mds+mon+osd @ beta OK (27 sec)9. power unplug @ alpha FAILED
>>> stuck in {0=beta=up:replay} for a long time
>>> finally it's switching to {0=alpha=up:active}, but FS not
>>> readable FS permanently freezed, even when bringing
>>> up alpha...
>> Your formatting got pretty mangled here, and I'm still not sure what's
>> going on. Did you restart all the daemons between each kill attempt?
>> (for instance, it looks like '1' is to kill mds.beta; '2' is to kill
>> mds.alpha, and then '3' is to kill mds.beta, but you already did
>> that)
>>
>>> I uploaded test results here:
>>> http://www.4shared.com/file/5nXMw_sM/cephlogs_mds_test.html?
>>> If you need any other configuration options changed, let me know
>> Sorry, I should have been clearer when I said turn on mds logging. Add
>> "debug mds = 20" and "debug ms = 1" lines to your ceph.conf MDS
>> sections. This will spit out a lot more information about what's going
>> on internally, which will help us diagnose this. :)
>
> I had those lines, the log seemed to be quite verbose... let me know if it
> didn't work.

It looks like maybe you got it turned on in the mon section rather
than the mds or global sections. :)

However, as I look at these a little more it generally looks good,
even in trial 3. The only alarming thing that's present is a note that
two of your clients failed to reconnect to the MDS in time and were
cut off. Did you try establishing a new connection to the cluster and
seeing if that worked? It's possible there's a client bug, or that
there was some sort of network error that interfered with them.
-Greg
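
On the earlier question of checking for a stable state programmatically:
a minimal sketch of polling "ceph health" between kill tests. It assumes
the "ceph" CLI is on the PATH and that its output starts with a status
token such as HEALTH_OK, HEALTH_WARN or HEALTH_ERR. The function names,
timeout and interval are hypothetical choices for illustration, not part
of Ceph.

```python
import subprocess
import time


def read_health(run=subprocess.run):
    """Run `ceph health` and return its first token (assumed to be the status)."""
    result = run(["ceph", "health"], capture_output=True, text=True, check=True)
    tokens = result.stdout.split()
    return tokens[0] if tokens else ""


def wait_until_healthy(timeout=300.0, interval=5.0, read=read_health,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll until the status is HEALTH_OK; return False if the timeout expires."""
    deadline = clock() + timeout
    while True:
        if read() == "HEALTH_OK":
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    # Example: call this after restoring services and before the next kill.
    if wait_until_healthy():
        print("cluster reports HEALTH_OK")
    else:
        print("cluster did not reach HEALTH_OK in time")
```

Polling with a timeout, rather than waiting a fixed couple of minutes,
also makes it visible when the cluster never recovers (as in trials 3
and 9), though HEALTH_OK alone may not prove that a mounted client can
read the FS.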