On 18/07/15 12:53, Steve Thompson wrote:
Ceph newbie (three weeks).
Ceph 0.94.2, CentOS 6.6 x86_64, kernel 2.6.32. Twelve identical OSDs
(1 TB each), three MONs, one active MDS and two standby MDSs. 10GbE
cluster network, 1GbE public network. Using CephFS on a single client
via the 4.1.1 kernel from elrepo; using rsync to copy data to the Ceph
file system (mostly small files). Only one client (me). All set up
with ceph-deploy.
A bit off topic but: I wish everyone submitting a thread on ceph-users
started with a rundown like this -- thank you!
For this test setup, the OSDs are present on two quad-core 3.16GHz
hosts with 16GB memory each; six OSDs on each node. Journals are on
the OSD drives for now. The two hosts are not user-accessible, and so
are doing mostly OSD duty only (but they have light-duty iSCSI targets
on them).
First surprise: I have noticed that the OSD drives do not fill at the
same rate. For example, when the Ceph file system was 71% full, I had
one OSD go into a full state at 95%, while another OSD was only 51%
full, and another at 60%.
There have historically been bugs with data distribution on small
clusters (lots of search hits for "ceph uneven distribution"). You may
find that more placement groups (in your CephFS data pool) help, or that
you need to change the CRUSH tunables in use (for compatibility, Ceph
doesn't use the latest-greatest algorithm by default; try "ceph osd crush
tunables hammer").
As a fallback, you can use "ceph osd reweight-by-utilization" to force
the system to examine how much data is really on each OSD, and adjust
the OSD weights to compensate for the imbalance.
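For example (a sketch; "ceph osd df" should be available on hammer for
checking per-OSD utilization, and the 120 argument means only OSDs more
than 20% above the average utilization get reweighted):

    # Per-OSD utilization, before and after
    ceph osd df

    # Reweight OSDs that are more than 20% over the average
    ceph osd reweight-by-utilization 120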
Second surprise: one full OSD results in ENOSPC for *all* writes, even
though there is plenty of space available on other OSDs. I marked the
full OSD as out to attempt a rebalance ("ceph osd out osd.0"). This
appeared to be working, albeit very slowly. I stopped client writes.
The full condition is applied globally because (in theory!) the data
distribution should be uniform, and you wouldn't want to keep writing to
a less-full OSD without leaving space available on other OSDs to
rebalance after a failure. As you've probably guessed, you're not
meant to mind this policy because "first surprise" isn't meant to happen :-)
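The thresholds involved are mon_osd_full_ratio (default 0.95) and
mon_osd_nearfull_ratio (default 0.85). If you need breathing room while a
rebalance runs, you can raise the full threshold slightly and temporarily
(hammer-era syntax; use with care and put it back afterwards):

    # Which OSDs are full or near-full
    ceph health detail

    # Temporarily raise the cluster-wide full threshold
    ceph pg set_full_ratio 0.97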
Third surprise: I restarted client writes after about an hour; data was
still being written to the full OSD, but the full condition was no
longer recognized; it went to 96% before I stopped the client writes
once more. That was yesterday evening; today it is down to 91%. The file
system is not going to be usable until the rebalance completes (it looks
like it will take days).
Odd -- maybe this is the case where writes are ongoing to PGs that have
not yet fully migrated away from the 'out' OSD, and they're still being
replicated to the full OSD for safety. I wonder if we should have a
special case where a full OSD which is 'out' is automatically also
treated as 'down'. But I'm not an expert on the OSD, so someone else
will probably have a more insightful observation here.
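In the meantime, a couple of ways to see what is still mapped to the full
OSD and how the rebalance is going (a sketch, assuming osd.0 is the full
one):

    # PGs whose acting set still includes osd.0
    # (if "ceph pg ls-by-osd" isn't available in your build,
    #  "ceph pg dump" also shows the acting sets)
    ceph pg ls-by-osd osd.0

    # Overall recovery/backfill progress and OSD up/in state
    ceph -s
    ceph osd tree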
Cheers,
John
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com