Re: Crushmap from Rack aware to Node aware

Deepak Naidu <dnaidu@xxxxxxxxxx> · Thu, 1 Jun 2017 21:59:17 +0000

>> If all 6 racks are tagged for Ceph storage nodes, I'd go ahead and just put the nodes in there now and configure the crush map accordingly
I just have 3 racks. That’s the max I have for now. 10 OSD Nodes.

--
Deepak

From: David Turner [mailto:drakonstein@xxxxxxxxx]

Sent: Thursday, June 01, 2017 2:05 PM

To: Deepak Naidu; ceph-users

Subject: Re: [ceph-users] Crushmap from Rack aware to Node aware

If all 6 racks are tagged for Ceph storage nodes, I'd go ahead and just put the nodes in there now and configure the crush map accordingly. That way you can grow each of the racks while keeping each failure domain closer in size to the rest of the cluster.

On Thu, Jun 1, 2017 at 3:40 PM Deepak Naidu <dnaidu@xxxxxxxxxx> wrote:

Perfect, David. Thanks for the detailed explanation, I appreciate it!

In my case I have 10 OSD servers with 60 disks each (yeah, I know…), i.e. 600 OSDs in total, and I have 3 racks to spare.

--
Deepak

From: David Turner [mailto:drakonstein@xxxxxxxxx]

Sent: Thursday, June 01, 2017 12:23 PM

To: Deepak Naidu; ceph-users

Subject: Re: [ceph-users] Crushmap from Rack aware to Node aware

The way to do this is to download your crush map and then either decompile it to text format and edit it by hand, or modify it using crushtool. Once your crush map has the rules in place that you want, you upload it to the cluster. When you change your failure domain from host to rack, or make any other change to the failure domain, all of your PGs will peer at the same time. You want to make sure that you have enough memory to handle this scenario. After that point, your cluster will just backfill the PGs from where they currently are to their new location and then clean up after itself. It is recommended to monitor your cluster usage and adjust osd_max_backfills during this process so that backfilling finishes as fast as possible while the cluster stays usable by the clients.
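As a rough illustration of the workflow above, here is a minimal Python sketch that only builds the command lines involved; it does not run them. The file names (crush.bin, crush.txt, crush.new) and the helper names are made up for this example. The manual edit happens between the decompile and recompile steps, typically by changing the bucket type in the rule's chooseleaf step (for example from host to rack).

```python
from pathlib import Path


def crush_edit_commands(workdir="."):
    """Return the CLI steps, in order, for a manual crush map edit.

    File names are hypothetical; only the ceph/crushtool subcommands
    themselves come from the standard tooling.
    """
    d = Path(workdir)
    binary, text, new = d / "crush.bin", d / "crush.txt", d / "crush.new"
    return [
        # 1. Download the compiled crush map from the cluster.
        ["ceph", "osd", "getcrushmap", "-o", str(binary)],
        # 2. Decompile it to text. Edit the text file by hand after this step.
        ["crushtool", "-d", str(binary), "-o", str(text)],
        # 3. Recompile the edited text file.
        ["crushtool", "-c", str(text), "-o", str(new)],
        # 4. Upload the new map. Peering and backfill start from here.
        ["ceph", "osd", "setcrushmap", "-i", str(new)],
    ]


def backfill_throttle_command(max_backfills):
    """Build a command that changes osd_max_backfills on all OSDs at runtime."""
    return ["ceph", "tell", "osd.*", "injectargs",
            f"--osd-max-backfills {max_backfills}"]


if __name__ == "__main__":
    for argv in crush_edit_commands("/tmp"):
        print(" ".join(argv))
    print(" ".join(backfill_throttle_command(1)))
```

Printing the commands instead of executing them makes it easy to review each step before touching a live cluster.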

I generally recommend starting a cluster with at least n+2 failure domains (n being your replica size), so I would recommend against going to a rack failure domain with only 3 racks. As an alternative that I've used, I've set up 6 "racks" in the crush map when I only had 3 physical racks, with planned growth to a full 6 racks. When I added servers and expanded to fill more racks, I moved the servers to where they are represented in the crush map. So if a server is physically in rack 1 but is set as rack4 in the crush map, I would move it to the physical rack 4 and start filling out rack 1 and rack 4 to complete their capacity, then do the same for rack 2/5 when I start into the 5th rack.
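A minimal Python sketch of how such a logical rack layout could be expressed as crush commands. The host names (node01 to node06), the rack names, and the root name "default" are assumptions for illustration; the function only builds the command lines and does not run them.

```python
def rack_layout_commands(host_to_rack, root="default"):
    """Build ceph commands that create rack buckets and place hosts in them.

    host_to_rack maps a host bucket name to the logical rack it should
    live under in the crush map (not necessarily its physical rack).
    """
    cmds = []
    for rack in sorted(set(host_to_rack.values())):
        # Create the rack bucket and hang it under the root.
        cmds.append(["ceph", "osd", "crush", "add-bucket", rack, "rack"])
        cmds.append(["ceph", "osd", "crush", "move", rack, f"root={root}"])
    for host, rack in sorted(host_to_rack.items()):
        # Move each host bucket (with its OSDs) under its logical rack.
        cmds.append(["ceph", "osd", "crush", "move", host, f"rack={rack}"])
    return cmds


if __name__ == "__main__":
    # Hypothetical layout: 6 hosts spread over 6 logical racks.
    layout = {f"node{i:02d}": f"rack{i}" for i in range(1, 7)}
    for argv in rack_layout_commands(layout):
        print(" ".join(argv))
```

Moving a host bucket changes data placement, so each move can trigger backfill of its own.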

An alternative to full racks in your crush map is half racks. I've also done this for clusters that wouldn't grow larger than 3 racks: have 6 failure domains at half a rack each. Compared with a host failure domain, it lowers your chance of having random drives fail in different failure domains at the same time and gives you more servers that you can run maintenance on at once. It doesn't resolve the issue of a single cross-link for the entire rack or a full power failure of the rack, but it's closer.

The problem with having 3 failure domains with replica 3 is that if you lose a complete failure domain, you have nowhere for the 3rd replica to go. If you have 4 failure domains with replica 3 and you lose an entire failure domain, you overfill the remaining 3 failure domains unless you only use about 55% of your cluster capacity. If you have 5 failure domains, things start normalizing and losing a failure domain doesn't hit you as severely. The more failure domains you have, the less it affects you when you lose one.
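The arithmetic behind this can be sketched in a few lines of Python. This is an idealized estimate that assumes equally sized failure domains and perfectly even data distribution; the function name and the 0.85 target ratio (how full the surviving OSDs are allowed to get) are assumptions for illustration. A real cluster is never perfectly balanced, which is one reason to stay at a more conservative figure such as the 55% mentioned above.

```python
def max_safe_fill(failure_domains, replicas=3, target_ratio=0.85):
    """Highest cluster fill level that still survives losing one failure domain.

    Returns None when the lost replica cannot be re-created at all,
    i.e. when there are only as many failure domains as replicas.
    Assumes equal-size domains and even data distribution.
    """
    if failure_domains < replicas:
        raise ValueError("need at least as many failure domains as replicas")
    if failure_domains == replicas:
        return None  # nowhere for the missing replica to go
    # After the loss, the same data must fit into (n - 1) of n domains
    # without the survivors exceeding target_ratio.
    return target_ratio * (failure_domains - 1) / failure_domains


if __name__ == "__main__":
    for n in (3, 4, 5, 6, 10):
        limit = max_safe_fill(n)
        if limit is None:
            print(f"{n} domains: cannot recover the lost replica")
        else:
            print(f"{n} domains: stay below {limit:.0%} full")
```

With 4 domains the idealized limit comes out at roughly 64%, and it rises as domains are added, which matches the point that each additional failure domain makes a loss less painful.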

Let's do another scenario with 3 failure domains and replica size 3. Every OSD you lose inside a failure domain gets backfilled directly onto the remaining OSDs in that failure domain. There comes a point where a switch failure in a rack or losing a node in the rack could overfill the remaining OSDs in that rack. If you have enough servers and OSDs in the rack, this becomes moot. But take a smaller cluster with only 3 nodes and 4 or fewer drives in each: if you lose a drive in one of your nodes, all of its data gets redistributed to the remaining drives in that node (3 at most). That means you either have to replace your storage ASAP when it fails or never fill your cluster up more than 55% if you want to be able to automatically recover from a drive failure.
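To put numbers on the small-cluster case, here is a minimal Python sketch that estimates how full the surviving drives in a node become after a drive failure. It assumes equally sized drives, even distribution, and that all of the failed drive's data stays within the same node, as described above; the function name is made up for this example.

```python
def fill_after_drive_loss(current_fill, drives_per_node, failed=1):
    """Estimate the fill level of the surviving drives in one node.

    current_fill is the fill fraction before the failure (0.0 to 1.0).
    Assumes equal-size drives and that the node must keep all of its data.
    """
    survivors = drives_per_node - failed
    if survivors <= 0:
        raise ValueError("no surviving drives left in the node")
    # The same amount of data is now spread over fewer drives.
    return current_fill * drives_per_node / survivors


if __name__ == "__main__":
    for fill in (0.55, 0.65, 0.75):
        after = fill_after_drive_loss(fill, drives_per_node=4)
        print(f"4 drives at {fill:.0%}: survivors end up at {after:.0%}")
```

For a 4-drive node, one failed drive multiplies the fill level of the other three by 4/3, so a node at 75% would end up completely full.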

tl;dr: Make sure you calculate what your failure domain, replica size, drive size, etc. mean for how fast you have to replace storage when it fails and how full you can fill your cluster while still being able to afford a hardware loss.

On Thu, Jun 1, 2017 at 12:40 PM Deepak Naidu <dnaidu@xxxxxxxxxx> wrote:

Greetings Folks.

I wanted to understand how Ceph behaves when we start with a rack-aware crush map (rack-level replicas, for example 3 racks and 3 replicas) and later replace it with a node-aware one (node-level replicas), i.e. 3 replicas spread across nodes.

It could also go the other way around. If this happens, how does Ceph rearrange the “old” data? Do I need to trigger any command to ensure that data placement follows the latest crush map, or does Ceph take care of it automatically?

Thanks for your time.

--
Deepak


_______________________________________________

ceph-users mailing list

ceph-users@xxxxxxxxxxxxxx

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
