Re: how can I achieve HA with ceph?

Karoly Horvath <rhswdev@xxxxxxxxx> · Tue, 20 Dec 2011 23:45:29 +0000

Sorry about the formatting. Here it is again; I hope it's readable now.

For each test, the list shows which services I killed on which node.
After each test I restored all services.

1. mds @ beta       OK

2. mds @ alpha      OK

3. mds+osd @ beta   FAILED
   switch ok {0=alpha=up:active}, but FS not readable
   FS permanently frozen

rebooted the whole cluster

4. mds+mon @ alpha  OK (32 sec)

rebooted the whole cluster

5. mds+osd @ beta   OK (25 sec)

rebooted the whole cluster

6. mds+osd @ beta   OK (24 sec)

7. mds+osd @ alpha  OK (30 sec)

8. mds+mon+osd @ beta  OK (27 sec)

9. power unplug @ alpha FAILED
   stuck in {0=beta=up:replay} for a long time
   it finally switches to {0=alpha=up:active}, but FS not readable
   FS permanently frozen, even after bringing alpha back up...

I included all the tests to show what worked and what didn't.
Note that the mds+osd kill worked most of the time, but one run
(test 3) failed.
Also note that the power unplug test failed every time; I included
only one of those runs.

On Tue, Dec 20, 2011 at 10:50 PM, Gregory Farnum
<gregory.farnum@xxxxxxxxxxxxx> wrote:
> On Tue, Dec 20, 2011 at 10:07 AM, Karoly Horvath <rhswdev@xxxxxxxxx> wrote:
>> Hi,
>> all tests were made with kill -9, killing the active mds (and sometimes
>> other processes). I waited a couple of minutes between each test to
>> make sure that the cluster reached a stable state. (btw: how can I
>> check this programmatically?)
> You can run "ceph health", which has only a few different values you
> can look for. :)
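
To make the "wait until stable" step repeatable between tests, the
"ceph health" check could be polled from a script. This is only a
minimal sketch: it assumes the command's output starts with a status
word such as HEALTH_OK, and the function names, timeout, and polling
interval are my own placeholders, not anything provided by ceph.

```python
import subprocess
import time


def parse_health(output):
    """Return the first word of `ceph health` output (assumed to be the status)."""
    tokens = output.split()
    return tokens[0] if tokens else ""


def run_ceph_health():
    """Run `ceph health` and return whatever it printed on stdout."""
    result = subprocess.run(["ceph", "health"], capture_output=True, text=True)
    return result.stdout


def wait_until_healthy(run=run_ceph_health, timeout=300.0, interval=5.0,
                       sleep=time.sleep):
    """Poll until the status is HEALTH_OK; return False if the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if parse_health(run()) == "HEALTH_OK":
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

Calling wait_until_healthy() before each kill would replace the fixed
"couple of minutes" wait; a False return means the cluster did not
settle within the timeout, and the next test should not be started.
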
>
>> #  KILLED           result1. mds @ beta       OK2. mds @ alpha
>> OK3. mds+osd @ beta   FAILED                    switch ok
>> {0=alpha=up:active}, but FS not readable                    FS
>> permanently freezed                    rebooted the whole cluster4.
>> mds+mon @ alpha  OK (32 sec)                    rebooted the whole
>> cluster5. mds+osd @ beta   OK (25 sec)                    rebooted the
>> whole cluster6. mds+osd @ beta   OK (24 sec)7. mds+osd @ alpha  OK (30
>> sec)8. mds+mon+osd @ beta  OK (27 sec)9. power unplug @ alpha FAILED
>>                  stuck in {0=beta=up:replay} for a long time
>>          finally it's switching to {0=alpha=up:active}, but FS not
>> readable                    FS permanently freezed, even when bringing
>> up alpha...
> Your formatting got pretty mangled here, and I'm still not sure what's
> going on. Did you restart all the daemons between each kill attempt?
> (for instance, it looks like '1' is to kill mds.beta; '2' is to kill
> mds.alpha, and then '3' is to kill mds.beta — but you already did
> that)
>
>> I uploaded test results here:
>> http://www.4shared.com/file/5nXMw_sM/cephlogs_mds_test.html?
>> If you need any other configuration options changed, let me know
> Sorry, I should have been clearer when I said turn on mds logging. Add
> "debug mds = 20" and "debug ms = 1" lines to your ceph.conf MDS
> sections. This will spit out a lot more information about what's going
> on internally, which will help us diagnose this. :)

I had those lines, the log seemed to be quite verbose... let me know
if it didn't work.

-- 
Karoly Horvath
rhswdev@xxxxxxxxx