On Tue, Nov 15, 11:53, Gregory Farnum wrote:
> > Any plans to address the ENOSPC issue? I gave v0.38 a try and the
> > file system behaves like the older (<= 0.36) versions I've tried
> > before when it fills up: The ceph mounts hang on all clients.
>
> This is something we hope to address in the future, but we haven't
> come up with a good solution yet. (I haven't seen a good solution in
> other distributed systems either...)

Glad to hear the problem is known and will be addressed. We'd love to
use ceph as a global tmp file system on our cluster, so users *will*
fill it up.

> > But there is progress: Sync is now interruptible (it used to block
> > in D state so that it could not be killed even with SIGKILL), and
> > umount works even if the file system is full. However, subsequent
> > mount attempts then fail with "mount error 5 = Input/output error".
>
> Yay!
>
> > Our test setup consists of one mds, one monitor and 8 osds. The mds
> > and the monitor are on the same node, and this node is not an osd.
> > All nodes are running Linux-3.0.9 ATM, but I would be willing to
> > upgrade to 3.1.1 if this is expected to make a difference.
> >
> > Here's some output of "ceph -w". Funnily enough, it reports 770G of
> > free disk space although the writing process terminated with ENOSPC.
>
> Right now RADOS (the object store under the Ceph FS) is pretty
> conservative about reporting ENOSPC. Since btrfs is also pretty
> unhappy when its disk fills up, an OSD marks itself as "full" once
> it's reached 95% of its capacity, and once a single OSD goes full,
> RADOS marks itself that way so you don't overfill a disk and have
> really bad things happen. (Hung mounts suck but are a lot better than
> mysterious data loss.)

Six of the eight btrfs file systems underneath ceph are 500G large, the
other two are 800G. Used disk space varies between 459G and 476G. The
peak of 476G is on a 500G file system, so this one is 98% full. The
data was written by a single client using stress, which simply created
5G files in an endless loop. All these files are in the top level
directory. (A sketch of the full/nearfull ratio options I found is
further down, just before the attached config.)

> Looking at your ceph -s I'm surprised by a few things, though...
> 1) Why do you have so many PGs? 8k/OSD is rather a lot

I can't answer this question, but please have a look at the ceph config
file below. Maybe you can spot something odd in it. (I tried to work
out what a sane PG count would be; see the second sketch further down.)

> 2) I wouldn't expect your OSDs to have become so unbalanced that one
> of them hits 95% full when the cluster's only at 84% capacity.

This seems to be due to the fact that roughly the same amount of data
was written to each file system despite the different file system
sizes. Hence only 60% of the disk space is used on the two 800G file
systems. (The third sketch further down shows how I think the CRUSH
weights would have to look to account for the different sizes.)

> What is this cluster used for? Are you running anything besides the
> Ceph FS on it? (radosgw, maybe?)

Besides the ceph daemons, only sshd and sge_execd (for executing
cluster jobs) are running there. Job submission was disabled on these
nodes during the tests, so all systems were completely idle.
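A few things I dug up while poking at this -- all untested sketches, so
please correct me where I'm wrong.

First, the 95% threshold Greg mentions appears to be tunable on the
monitor side. I believe the options are called "mon osd full ratio" and
"mon osd nearfull ratio" in current code, but I have not verified that
v0.38 accepts these names, so treat this ceph.conf fragment as an
assumption rather than something I have running:

    [mon]
        ; assumed option names, not verified against v0.38
        ; warn (via "ceph health") before the cluster goes read-only
        mon osd nearfull ratio = .85
        ; mark the whole cluster full once any single OSD reaches this
        mon osd full ratio = .95

If that works, "ceph health" should start complaining about near-full
OSDs well before writes are refused, which would at least give our
users a chance to clean up.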
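Second, the PG count. The rule of thumb I've seen recommended is on the
order of 100 PGs per OSD in total, divided by the replication level and
rounded up to a power of two. Assuming I'm reading that correctly, our
8 OSDs with 2 replicas would want something like:

    total PGs (all pools) ~= (number of OSDs * 100) / replicas
                           = (8 * 100) / 2
                           = 400  -> rounded up to 512

That is a far cry from 8k per OSD, so something in our setup (or in how
mkcephfs created the pools) must be off. Newer versions seem to expose
"osd pool default pg num" for newly created pools; I don't know whether
v0.38 honours that option, and it would not change pools that already
exist.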
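Third, the imbalance. If I understand CRUSH correctly, mixed 500G/800G
OSDs should carry weights proportional to their capacity; with equal
weights every OSD receives roughly the same amount of data regardless
of size, which is exactly what we are seeing. A sketch of what I think
the fix would look like -- the proportional-weight convention and the
exact commands are taken from the wiki, not tested by me, and the osd
numbers below are placeholders since I'd have to check which two OSDs
actually sit on the 800G disks:

    # fetch, decompile, edit, recompile and inject the crushmap:
    #   ceph osd getcrushmap -o /tmp/crush.map
    #   crushtool -d /tmp/crush.map -o /tmp/crush.txt
    #   (edit /tmp/crush.txt)
    #   crushtool -c /tmp/crush.txt -o /tmp/crush.new
    #   ceph osd setcrushmap -i /tmp/crush.new
    #
    # inside the bucket that lists the osds, weight ~ capacity:
    item osd.325 weight 0.500   # 0.5 for each OSD on a 500G file system
    item osd.331 weight 0.800   # 0.8 for the two OSDs on 800G file systems
    # ...and so on for the remaining osds

Even with proportional weights the placement is only statistical, so
some variance between OSDs remains, but presumably nothing like 60%
versus 98%.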
Thanks for your help
Andre

---

[global]
    ; enable secure authentication
    ;auth supported = cephx

    ;osd journal size = 100    ; measured in MB

[client]
    ; userspace client
    debug ms = 1
    debug client = 10

; You need at least one monitor. You need at least three if you want to
; tolerate any node failures. Always create an odd number.
[mon]
    mon data = /var/ceph/mon$id

    ; some minimal logging (just message traffic) to aid debugging
    ; debug ms = 1
    ; debug auth = 20    ; authentication code

[mon.0]
    host = node334
    mon addr = 192.168.3.34:6789

; You need at least one mds. Define two to get a standby.
[mds]
    ; where the mds keeps its secret encryption keys
    keyring = /var/ceph/keyring.$name

    ; debug mds = 20

[mds.0]
    host = node334

; osd
; You need at least one. Two if you want data to be replicated.
; Define as many as you like.
[osd]
    ; This is where the btrfs volume will be mounted.
    osd data = /var/ceph/osd$id

    keyring = /etc/ceph/keyring.$name

    ; Ideally, make this a separate disk or partition. A few GB
    ; is usually enough; more if you have fast disks. You can use
    ; a file under the osd data dir if need be
    ; (e.g. /data/osd$id/journal), but it will be slower than a
    ; separate disk or partition.
    osd journal = /var/ceph/osd$id/journal

    ; If the OSD journal is a file, you need to specify the size.
    ; This is specified in MB.
    osd journal size = 512

[osd.325]
    host = node325
    btrfs devs = /dev/ceph/data

[osd.326]
    host = node326
    btrfs devs = /dev/ceph/data

[osd.327]
    host = node327
    btrfs devs = /dev/ceph/data

[osd.328]
    host = node328
    btrfs devs = /dev/ceph/data

[osd.329]
    host = node329
    btrfs devs = /dev/ceph/data

[osd.330]
    host = node330
    btrfs devs = /dev/ceph/data

[osd.331]
    host = node331
    btrfs devs = /dev/ceph/data

[osd.333]
    host = node333
    btrfs devs = /dev/ceph/data

--
The only person who always got his work done by Friday was Robinson Crusoe