Re: adding osd node best practice

Hi 

On 08.03.2015 04:32, Anthony D'Atri wrote:

1) That's an awful lot of mons.  Are they VM's or something?  My sense is that mons >5 have diminishing returns at best.  
 
We have an application cluster with Ceph as the storage solution. The cluster consists of six servers, so we installed a monitor on every one of them to keep the Ceph cluster healthy (quorum) even if one or two servers go down. As far as I know there is no hard limit or recommended number; could you please point out the issues with a higher number of mons? We were also going to deploy more mons on the new storage nodes, or is it not necessary/recommended to run a mon on a node with OSDs?
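
For reference, the quorum arithmetic as I understand it: quorum needs a strict majority, floor(N/2)+1, so 5 mons tolerate 2 failures, 6 mons still tolerate only 2, and you need 7 to tolerate 3. A quick way to check who is currently in quorum:

    ceph quorum_status
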
2) Only two OSD nodes?  Assume you aren't running 3 copies of data or racks.  
 
For now, only 2 copies. We are not running VMs, so we could lose a few objects without any real issue (objects are accessed directly with librados). Maybe later, in the final state with 6 storage nodes, we will switch the setting to 3 copies.
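
When we do switch, it should just be the usual pool-level change, something like this (the pool name is a placeholder for ours):

    ceph osd pool set <poolname> size 3
    ceph osd pool set <poolname> min_size 2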
 
3) The new nodes will have fewer OSD's?   Be careful with host / OSD weighting to avoid a gross imbalance in disk utilization.   
 
Yes, fewer, but the OSDs are weighted according to disk size (we use https://github.com/ceph/ceph/blob/master/src/ceph-osd-prestart.sh for the calculation), and the node weight should be the sum of its OSDs' weights, so there should be no problem there. Regarding the number of OSDs per node, though: we have 36, while the big clusters described on the internet seem to have around 24 per node, and I was not able to find any best practice for OSD count per node, so we are going to reduce ours.
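
Roughly the size-based weight we expect per OSD (a back-of-the-envelope sketch, not the exact logic of the prestart script; /dev/sdb is just an example device):

    # weight ~ disk size in TiB
    SIZE_BYTES=$(blockdev --getsize64 /dev/sdb)
    WEIGHT=$(echo "scale=4; $SIZE_BYTES / (2^40)" | bc -l)
    echo "crush weight for /dev/sdb: $WEIGHT"
    ceph osd tree    # the node weight shows up as the sum of its OSDs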
 
There are two options:
 
1) Move a few disks to the new node, so we end up with 3 nodes of 30 OSDs each; this lowers the data density per node.
or
2) Migrate the disks to software RAID0, two disks per array, so the 36 OSDs become 18 OSDs with better IO performance per OSD and the same data density (a rough mdadm sketch is below). RAID under Ceph is generally not recommended as a redundancy solution, but why not for a performance use case?
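
For option 2, the setup per OSD would look something like this (device names are just examples):

    mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
    mkfs.xfs /dev/md0
    # then prepare /dev/md0 as a single OSD the usual way (ceph-disk / ceph-deploy)

The obvious trade-off is that either disk failing takes out the whole OSD, so each failure means twice as much data to recover.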
 
 

4) I've had experience tripling the size of a cluster in one shot, and in backfilling a whole rack of 100+ OSD's in one shot.  Cf.  BÖC's 'Veteran of the Psychic Wars'.  I do not recommend this approach esp.  if you don't have truly embarrassing amounts of RAM.  Suggest disabling scrubs / deep-scrubs, throttling the usual backfill / recovery values, including setting recovery op priority as low as 1 for the duration.  
 
Deep-scrub is disabled (we have a short TTL / sliding window for our data; most objects are deleted within 3 days anyway, so you can picture our Ceph use case as a two-day FIFO queue), and the recovery values are already lowered, because Ceph sometimes throws disks out (and we put them back in), so it is basically recovering most of the time. This issue with unresponsive OSDs is the reason we want to reduce their count; we suspect we have too many OSDs per node.
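
Concretely, the throttling we already have in place looks roughly like this (values are illustrative, applied at runtime):

    ceph osd set noscrub
    ceph osd set nodeep-scrub
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'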
 
 Deploy one OSD at a time.  Yes this will cause data to move more than once.  But it will also minimize your exposure to as-of-yet undiscovered problems with the new hardware, and the magnitude of peering storms.   And thus client impact.   One OSD on each new system, sequentially.   Check the weights in the CRUSH map.  Time backfill to HEALTH_OK.  Let them soak for a few days before serially deploying the rest.  
 
Thanks for that; we will start with one new OSD per night.
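
The nightly procedure we have in mind, roughly (host and device names are examples):

    ceph-deploy osd create newnode1:/dev/sdb     # one new OSD only
    ceph osd tree                                # check its CRUSH weight and host placement
    # let backfill finish before touching the next one:
    until ceph health | grep -q HEALTH_OK; do sleep 300; done
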
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
