On 18/07/15 12:53, Steve Thompson wrote:
Ceph newbie (three weeks).
Ceph 0.94.2, CentOS 6.6 x86_64, kernel 2.6.32. Twelve identical OSDs
(1 TB each), three MONs, one active MDS and two standby MDSs. 10GbE
cluster network, 1GbE public network. Using CephFS on a single client
via the 4.1.1 kernel from elrepo; using rsync to copy data to the Ceph
file system (mostly small files). Only one client (me). All set up
with ceph-deploy.
A bit off topic but: I wish everyone submitting a thread on ceph-users
started with a rundown like this -- thank you!
For this test setup, the OSDs are present on two quad-core 3.16GHz
hosts with 16GB memory each; six OSDs on each node. Journals are on
the OSD drives for now. The two hosts are not user-accessible, and so
are doing mostly OSD duty only (but they have light-duty iSCSI targets
on them).
First surprise: I have noticed that the OSD drives do not fill at the
same rate. For example, when the Ceph file system was 71% full, I had
one OSD go into a full state at 95%, while another OSD was only 51%
full, and another at 60%.
There have historically been bugs with data distribution on small
clusters (lots of search hits for "ceph uneven distribution"). You may
find that more placement groups (in your CephFS data pool) help, or that
you need to change the CRUSH tunables in use (for compatibility, Ceph
doesn't use the latest-greatest algorithm by default; try "ceph osd crush
tunables hammer").
As a fallback, you can use "ceph osd reweight-by-utilization" to force
the system to examine how much data is really on each OSD, and adjust
the OSD weights to compensate for the imbalance.
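For example (a sketch; "ceph osd df" should be available on hammer for
checking per-OSD utilization, and the 120 argument means only OSDs more
than 20% above the average utilization get reweighted):

    # Per-OSD utilization, before and after
    ceph osd df

    # Reweight OSDs that are more than 20% over the average
    ceph osd reweight-by-utilization 120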
Second surprise: one full OSD results in ENOSPC for *all* writes, even
though there is plenty of space available on other OSDs. I marked the
full OSD as out to attempt a rebalance ("ceph osd out osd.0"). This
appeared to be working, albeit very slowly. I stopped client writes.
The full condition is applied globally because (in theory!) the data
distribution should be uniform, and you wouldn't want to keep writing to
a less-full OSD without leaving space available on other OSDs to
rebalance after a failure. As you've probably guessed, you're not
meant to mind this policy because "first surprise" isn't meant to happen :-)
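The thresholds involved are mon_osd_full_ratio (default 0.95) and
mon_osd_nearfull_ratio (default 0.85). If you need breathing room while a
rebalance runs, you can raise the full threshold slightly and temporarily
(hammer-era syntax; use with care and put it back afterwards):

    # Which OSDs are full or near-full
    ceph health detail

    # Temporarily raise the cluster-wide full threshold
    ceph pg set_full_ratio 0.97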
Third surprise: I restarted client writes after about an hour; data was
still being written to the full OSD, but the full condition was no
longer recognized; it went to 96% before I stopped the client writes
once more. That was yesterday evening; today it is down to 91%. The file
system is not going to be usable until the rebalance completes (it looks
like it will take days).
Odd -- maybe this is the case where writes are ongoing to PGs that have
not yet fully migrated away from the 'out' OSD, and they're still being
replicated to the full OSD for safety. I wonder if we should have a
special case where a full OSD which is 'out' is automatically also
treated as 'down'. But I'm not an expert on the OSD, so someone else
will probably have a more insightful observation here.
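In the meantime, a couple of ways to see what is still mapped to the full
OSD and how the rebalance is going (a sketch, assuming osd.0 is the full
one):

    # PGs whose acting set still includes osd.0
    # (if "ceph pg ls-by-osd" isn't available in your build,
    #  "ceph pg dump" also shows the acting sets)
    ceph pg ls-by-osd osd.0

    # Overall recovery/backfill progress and OSD up/in state
    ceph -s
    ceph osd tree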
Cheers,
John
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com