This is my config:

    ;
    ; Sample ceph ceph.conf file.
    ;
    ; This file defines cluster membership, the various locations
    ; that Ceph stores data, and any other runtime options.

    ; If a 'host' is defined for a daemon, the start/stop script will
    ; verify that it matches the hostname (or else ignore it). If it is
    ; not defined, it is assumed that the daemon is intended to start on
    ; the current host (e.g., in a setup with a startup.conf on each
    ; node).

    ; global
    [global]
        ; enable secure authentication
        auth supported = cephx
        keyring = /etc/ceph/keyring.bin

        ; allow ourselves to open a lot of files
        max open files = 131072

        pid file = /var/run/ceph/$name.pid

        debug ms = 1

    ; monitors
    ;  You need at least one. You need at least three if you want to
    ;  tolerate any node failures. Always create an odd number.
    [mon]
        mon data = /data/mon$id

        ; logging, for debugging monitor crashes, in order of
        ; their likelihood of being helpful :)
        ;debug ms = 1
        ;debug mon = 20
        ;debug paxos = 20
        ;debug auth = 20

    [mon0]
        host = ceph1
        mon addr = 10.0.6.10:6789

    [mon1]
        host = ceph2
        mon addr = 10.0.6.11:6789

    [mon2]
        host = ceph3
        mon addr = 10.0.6.12:6789

    ; mds
    ;  You need at least one. Define two to get a standby.
    [mds]
        ; where the mds keeps its secret encryption keys
        keyring = /etc/ceph/keyring.$name

        ; mds logging to debug issues.
        ;debug ms = 1
        ;debug mds = 20

    [mds0]
        host = ceph1

    [mds1]
        host = ceph2

    [mds2]
        host = ceph3

    ; osd
    ;  You need at least one. Two if you want data to be replicated.
    ;  Define as many as you like.
    [osd]
        sudo = true

        ; This is where the btrfs volume will be mounted.
        osd data = /data/osd$id

        ; where the osd keeps its secret encryption keys
        keyring = /etc/ceph/keyring.$name

        ; Ideally, make this a separate disk or partition. A few
        ; hundred MB should be enough; more if you have fast or many
        ; disks. You can use a file under the osd data dir if need be
        ; (e.g. /data/osd$id/journal), but it will be slower than a
        ; separate disk or partition.

        ; This is an example of a file-based journal.
        ;osd journal = /data/osd$id/journal
        ;osd journal size = 1000 ; journal size, in megabytes

        ; osd logging to debug osd issues, in order of likelihood of
        ; being helpful
        ;debug ms = 1
        ;debug osd = 25
        ;debug monc = 20
        ;debug journal = 20
        ;debug filestore = 10

        ;osd use stale snap = true

    [osd0]
        host = ceph1

        ; if 'btrfs devs' is not specified, you're responsible for
        ; setting up the 'osd data' dir. If it is not btrfs, things
        ; will behave up until you try to recover from a crash (which
        ; is usually fine for basic testing).
        btrfs devs = /dev/sdc
        osd journal = /dev/sda1

    [osd1]
        host = ceph1
        btrfs devs = /dev/sdd
        osd journal = /dev/sda2

    [osd2]
        host = ceph2
        btrfs devs = /dev/sdc
        osd journal = /dev/sda1

    [osd3]
        host = ceph2
        btrfs devs = /dev/sdd
        osd journal = /dev/sda2

    [osd4]
        host = ceph3
        btrfs devs = /dev/sdc
        osd journal = /dev/sda1

    [osd5]
        host = ceph3
        btrfs devs = /dev/sdd
        osd journal = /dev/sda2
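(Not part of the original mail: a minimal sketch of how the per-OSD filesystem numbers below can be collected, assuming the /data/osd* mount layout and the ceph1-3 hostnames from the config above. Plain df over the mount points is enough; the loop and ssh usage are just one convenient way to run it from a single node.)

    # Gather local filesystem usage for every OSD data mount on each host.
    # Hostnames and paths are taken from the ceph.conf above.
    for h in ceph1 ceph2 ceph3; do
        echo "== $h =="
        ssh "$h" 'df /data/osd*'   # quoted so the glob expands remotely
    done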
The statistics of the disks (this is after the crash of osd2 and osd4):

    Filesystem      1K-blocks      Used Available Use% Mounted on
    /dev/sdc        143373312 124954676  18418636  88% /data/osd0
    /dev/sdd        143373312 137639524   5733788  97% /data/osd1
    /dev/sdc        143373312 120350584  23022728  84% /data/osd2
    /dev/sdd        143373312 141986188   1387124 100% /data/osd3
    /dev/sdc        143373312 112025716  31347596  79% /data/osd4
    /dev/sdd        143373312 115163124  28210188  81% /data/osd5

I will send some statistics of the ext3 setup as well.

----- Original message -----
From: "Gregory Farnum" <gregory.farnum@xxxxxxxxxxxxx>
To: "Martin Wilderoth" <martin.wilderoth@xxxxxxxxxx>
Cc: ceph-devel@xxxxxxxxxxxxxxx
Sent: Tuesday, 12 Apr 2011 14:24:14
Subject: Re: osd stops

On Tuesday, April 12, 2011 at 11:05 AM, Martin Wilderoth wrote:
> Thanks for the answer, now I know the reason. Some of my OSDs were at
> 90% usage, and dmesg also shows errors with btrfs on the hosts. I will
> run the test with another file system, ext3 :-) or is any other
> filesystem better? It's a BackupPC filesystem with a lot of hardlinks
> and data that I would like to test running in Ceph.

ext3 or really any other FS will handle it better, although Ceph itself
is also not super-resilient to such situations. Eventually we will have
automatic rebalancing of data, but it's not in there right now.

Could you maybe send along your config file and the local filesystem
statistics on each of your OSDs? CRUSH is pseudo-random and so it's not
going to have perfectly even utilization, but if the variance is too
high we'll want to look into it sooner rather than later.
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html