Re: v0.38 released

Andre Noll <maan@xxxxxxxxxxxxxxx> · Wed, 16 Nov 2011 10:56:14 +0100

On Tue, Nov 15, 11:53, Gregory Farnum wrote:

> > Any plans to address the ENOSPC issue? I gave v0.38 a try and the
> > file system behaves like the older (<= 0.36) versions I've tried
> > before when it fills up: The ceph mounts hang on all clients.
> 
> This is something we hope to address in the future, but we haven't
> come up with a good solution yet. (I haven't seen a good solution in
> other distributed systems either...)

Glad to hear the problem is known and will be addressed. We'd love to
use ceph as a global tmp file system on our cluster, so users *will*
fill it up..

> > But there is progress: Sync is now interuptable (it used to block
> > in D state so that it could not be killed even with SIGKILL), and
> > umount works even if the file system is full. However, subsequent
> > mount attempts then fail with "mount error 5 = Input/output error".
> Yay!
> 
> > Our test setup consists of one mds, one monitor and 8 osds. mds and
> > monitor are on the same node, and this node is not not an osd. All
> > nodes are running Linux-3.0.9 ATM, but I would be willing to upgrade
> > to 3.1.1 if this is expected to make a difference.
> >
> > Here's some output of "ceph -w". Funny enough it reports 770G of free
> > disk space space although the writing process terminated with ENOSPC.
> Right now RADOS (the object store under the Ceph FS) is pretty
> conservative about reporting ENOSPC. Since btrfs is also pretty
> unhappy when its disk fills up, an OSD marks itself as "full" once
> it's reached 95% of its capacity, and once a single OSD goes full then
> RADOS marks itself that way so you don't overfill a disk and have
> really bad things happen. (Hung mounts suck but are a lot better than
> mysterious data loss.)

Six of the eight underlying btrfs for ceph are 500G large, the other
two are 800G. Used disk space varies between 459G and 476G. The peak
476G is on a 500G fs, so this one is 98% full.

The data was written by a single client using stress, which simply
created 5G files in an endless loop. All these files are in the top
level directory.

> Looking at your ceph -s I'm surprised by a few things, though...
> 1) Why do you have so many PGs? 8k/OSD is rather a lot

I can't answer this question, but please have a look at the ceph
config file below. Maybe you can spot something odd in it.

> 2) I wouldn't expect your OSDs to have become so unbalanced that one
> of them hits 95% full when the cluster's only at 84% capacity.

This seems to be due to the fact that roughly the same amount of data
was written to each file system despite of the different file system
sizes. Hence only 60% disk space is used on the two 800G file systems.

> What is this cluster used for? Are you running anything besides the
> Ceph FS on it? (radosgw, maybe?)

Besides the ceph daemons only sshd and sge_execd (for executing
cluster jobs) is running there. Job submission was disabled on these
nodes during the tests, so all systems were completely idle.

Thanks for your help
Andre
---
[global]
	; enable secure authentication
	;auth supported = cephx
	;osd journal size = 100    ; measured in MB 

[client] ; userspace client
	debug ms = 1
	debug client = 10

; You need at least one monitor. You need at least three if you want to
; tolerate any node failures. Always create an odd number.
[mon]
	mon data = /var/ceph/mon$id
	; some minimal logging (just message traffic) to aid debugging
	; debug ms = 1
	; debug auth = 20 ;authentication code

[mon.0]
	host = node334
	mon addr = 192.168.3.34:6789

; You need at least one mds. Define two to get a standby.
[mds]
	; where the mds keeps it's secret encryption keys
	keyring = /var/ceph/keyring.$name
	; debug mds = 20
[mds.0]
	host = node334

; osd
;  You need at least one.  Two if you want data to be replicated.
;  Define as many as you like.
[osd]
	; This is where the btrfs volume will be mounted.
	osd data = /var/ceph/osd$id

	keyring = /etc/ceph/keyring.$name

	; Ideally, make this a separate disk or partition.  A few GB
 	; is usually enough; more if you have fast disks.  You can use
 	; a file under the osd data dir if need be
 	; (e.g. /data/osd$id/journal), but it will be slower than a
 	; separate disk or partition.
	osd journal = /var/ceph/osd$id/journal
	; If the OSD journal is a file, you need to specify the size. This is specified in MB.
        osd journal size = 512

[osd.325]
	host = node325
	btrfs devs = /dev/ceph/data
[osd.326]
	host = node326
	btrfs devs = /dev/ceph/data
[osd.327]
	host = node327
	btrfs devs = /dev/ceph/data
[osd.328]
	host = node328
	btrfs devs = /dev/ceph/data
[osd.329]
	host = node329
	btrfs devs = /dev/ceph/data
[osd.330]
	host = node330
	btrfs devs = /dev/ceph/data
[osd.331]
	host = node331
	btrfs devs = /dev/ceph/data
[osd.333]
	host = node333
	btrfs devs = /dev/ceph/data
-- 
The only person who always got his work done by Friday was Robinson Crusoe
Attachment:
signature.asc

Description: Digital signature