Re: Ceph experiences

On 18/07/15 12:53, Steve Thompson wrote:

Ceph newbie (three weeks).

Ceph 0.94.2, CentOS 6.6 x86_64, kernel 2.6.32. Twelve identical OSD's (1 TB each), three MON's, one active MDS and two standby MDS's. 10GbE cluster network, 1GbE public network. Using CephFS on a single client via the 4.1.1 kernel from elrepo; using rsync to copy data to the Ceph file system (mostly small files). Only one client (me). All set up with ceph-deploy.

A bit off topic but: I wish everyone submitting a thread on ceph-users started with a rundown like this -- thank you!


For this test setup, the OSD's are present on two quad-core 3.16GHz hosts with 16GB memory each; six OSD's on each node. Journals are on the OSD drives for now. The two hosts are not user-accessible, and so are doing mostly OSD duty only (but they have light duty iSCSI targets on them).

First surprise: I have noticed that the OSD drives do not fill at the same rate. For example, when the Ceph file system was 71% full, one OSD went into a full state at 95%, while another OSD was only 51% full, and another at 60%.


There have historically been bugs with data distribution on small clusters (lots of search hits for "ceph uneven distribution"). You may find that more placement groups in your CephFS data pool help, or that you need to change the CRUSH tunables in use (for compatibility, Ceph doesn't use the latest-greatest algorithm by default; try "ceph osd crush tunables hammer").
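
For example (the pool name and PG count here are only illustrative -- check what your CephFS data pool is actually called and size pg_num appropriately for twelve OSDs), something along the lines of:

    # see how many PGs the data pool currently has
    ceph osd pool get cephfs_data pg_num

    # raise pg_num and pgp_num (this triggers data movement, so expect churn)
    ceph osd pool set cephfs_data pg_num 512
    ceph osd pool set cephfs_data pgp_num 512

    # opt in to the newer CRUSH tunables (needs reasonably recent clients)
    ceph osd crush tunables hammer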

As a fallback, you can use "ceph osd reweight-by-utilization" to force the system to examine how much data is really on each OSD, and adjust the OSD weights to compensate for the imbalance.
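
Roughly like this (the "110" means only touch OSDs more than 10% above the mean utilization, and is just a starting point -- double-check the argument against the docs for your release):

    # look at per-OSD utilization first
    ceph osd df

    # down-weight OSDs that are more than 10% above the average
    ceph osd reweight-by-utilization 110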




Second surprise: one full OSD results in ENOSPC for *all* writes, even though there is plenty of space available on other OSD's. I marked the full OSD as out to attempt a rebalance ("ceph osd out osd.0"). This appeared to be working, albeit very slowly. I stopped client writes.


The full condition is applied globally because (in theory!) the data distribution should be uniform, and you wouldn't want to keep writing to the less-full OSDs when there isn't enough space left elsewhere to rebalance after a failure. As you've probably guessed, you're not meant to mind this policy, because the "first surprise" isn't meant to happen :-)
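
If you need breathing room while you rebalance, you can also nudge the nearfull/full thresholds up a little (defaults are 0.85 and 0.95). The exact syntax varies between releases, but on hammer it should be roughly:

    # temporary headroom only -- an OSD that hits 100% is genuinely painful to recover
    ceph pg set_nearfull_ratio 0.90
    ceph pg set_full_ratio 0.97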


Third surprise: I restarted client writes after about an hour; data was still being written to the full OSD, but the full condition was no longer recognized; it went to 96% before I stopped the client writes once more. That was yesterday evening; today it is down to 91%. The file system is not going to be usable until the rebalance completes (it looks like that will take days).


Odd, maybe this is the case where writes are ongoing to PGs that are not yet fully migrated away from the 'out' OSD, and they're still being replicated to the full OSD for safety. I wonder if we should have a special case where a full OSD which is 'out' is automatically also treated as 'down'. But I'm not an expert on the OSD so someone else will probably have a more insightful observation here.
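
In the meantime it's worth confirming that the full OSD really is draining, e.g.:

    # overall cluster state and recovery progress
    ceph -s
    ceph -w

    # per-OSD utilization, to check the full OSD's usage is trending down
    ceph osd df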

Cheers,
John
