On Tue, Jun 18, 2013 at 08:13:39PM +0800, Da Chun wrote:
> Hi List,
> My ceph cluster has two OSDs on each node: one has 15GB capacity and the other 10GB.
> Interestingly, after I took the 15GB OSD out of the cluster, the cluster started to rebalance; eventually the 10GB OSD on the same node filled up, was taken offline, and failed to start again with the following error in the OSD log file:
> 2013-06-18 19:51:20.799756 7f6805ee07c0 -1 filestore(/var/lib/ceph/osd/ceph-1) Extended attributes don't appear to work. Got error (28) No space left on device. If you are using ext3 or ext4, be sure to mount the underlying file system with the 'user_xattr' option.
> 2013-06-18 19:51:20.800258 7f6805ee07c0 -1 ** ERROR: error converting store /var/lib/ceph/osd/ceph-1: (95) Operation not supported
>
> I guess the 10GB OSD was chosen by the cluster to hold the extra objects.
> My questions here:
> 1. How are the extra objects spread across the cluster after an OSD is taken out? Are they spread to only one of the OSDs?
> 2. Is there no mechanism to prevent an OSD from being filled completely and taken offline?
>
As far as I understand it:
Each OSD has the same weight by default; you can give an OSD a lower weight so that it is used less, for example because it has less capacity or is slower than the others.
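For example, a rough sketch of reweighting (the OSD id and weight here are just placeholders, not values from your cluster):

    # give osd.1 a smaller CRUSH weight so it receives proportionally less data
    ceph osd crush reweight osd.1 0.5

    # check the resulting weights and hierarchy
    ceph osd tree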
I believe that as of Cuttlefish you won't see recovery overflow an OSD's storage, FYI.
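The knobs involved, as far as I know, are the full/nearfull ratios; a sketch with the usual defaults in ceph.conf (tune to taste):

    [global]
        mon osd nearfull ratio = 0.85    # warn once an OSD goes above 85% used
        mon osd full ratio = 0.95        # block writes once an OSD goes above 95%

    [osd]
        osd backfill full ratio = 0.85   # refuse to backfill into an OSD above 85%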
In general a failed OSD should recover to many other OSDs, but it depends on the shape of your CRUSH map. In this case it sounds like you have two nodes which you've marked as separate failure domains, so when you took out the 15GB drive all re-replication had to go to the 10GB drive. :)
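If you want to verify that, something along these lines will show what CRUSH is doing (the filenames are just examples):

    # decompile the CRUSH map and look at the rules / failure domains
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt

A rule that does "step chooseleaf firstn 0 type host" always puts replicas on different hosts, which is why everything from the removed 15GB drive had to land on the 10GB one.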
-Greg
--
Software Engineer #42 @ http://inktank.com | http://ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com