Re: The feasibility of mixed SSD and HDD replicated pool

Thanks for digging this out. I seemed to remember exactly this method (I don't know from where), but couldn't find it in the documentation and started doubting it. Yes, this would be very useful information to add to the documentation, and it also confirms that your simpler setup with just a specialized CRUSH rule will work exactly as intended and is stable in the long term.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: 胡 玮文 <huww98@xxxxxxxxxxx>
Sent: 26 October 2020 17:19
To: Frank Schilder
Cc: Anthony D'Atri; ceph-users@xxxxxxx
Subject: Re:  Re: The feasibility of mixed SSD and HDD replicated pool

> On 26 Oct 2020, at 15:43, Frank Schilder <frans@xxxxxx> wrote:
>
> 
>> I’ve never seen anything that implies that lead OSDs within an acting set are a function of CRUSH rule ordering.
>
> This is actually a good question. I believed that I had seen/heard that somewhere, but I might be wrong.
>
> Looking at the definition of a PG, it states that a PG is an ordered set of OSD IDs and that the first up OSD will be the primary. In other words, it seems that the lowest OSD ID is decisive. If the SSDs were deployed before the HDDs, they have the smallest IDs and, hence, will be preferred as primary OSDs.

I don’t think this is correct. In my experiments with the previously mentioned CRUSH rule, the primary OSDs are always SSDs, no matter what the IDs of the SSD OSDs are.

I also had a look at the code. If I understand it correctly:

* If the default primary affinity is not changed, the primary-affinity logic is skipped entirely, and the primary is simply the first OSD returned by the CRUSH algorithm [1].

* The order of OSDs returned by CRUSH still matters if you have changed primary affinities. The affinity is the probability that a test succeeds; the OSDs are tested in order, so earlier OSDs have a higher probability of becoming primary [2] (see the sketch after this list).
  * If an OSD has primary affinity = 1.0, its test always succeeds, and any OSD after it will never be primary.
  * Suppose CRUSH returns 3 OSDs, each with primary affinity 0.5. Then the 2nd OSD becomes primary with probability 0.25 and the 3rd with probability 0.125; otherwise the 1st is primary.
  * If no test succeeds (suppose all OSDs have affinity 0), the 1st OSD becomes primary as a fallback.
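
To make this concrete, here is a minimal Python sketch of the selection logic as I understand it (this is not the actual C++ from OSDMap.cc; the real code uses a deterministic hash rather than a random number, so the choice is stable for a given PG, but the probabilities work out the same; the function name is my own):

import random

def choose_primary(osds, affinity):
    # osds:     OSD ids in the order CRUSH returned them
    # affinity: mapping of OSD id -> primary affinity in [0.0, 1.0]
    # Test each OSD in order; a test succeeds with probability equal
    # to the OSD's affinity. The first success becomes primary.
    for osd in osds:
        if random.random() < affinity.get(osd, 1.0):
            return osd
    # Fallback: no test succeeded, the first OSD becomes primary.
    return osds[0]

# The example from the list above: 3 OSDs, each with affinity 0.5.
# Over many trials the 2nd OSD wins ~25%, the 3rd ~12.5%, and the
# 1st the remaining ~62.5% (50% directly plus 12.5% via fallback).
affinity = {0: 0.5, 1: 0.5, 2: 0.5}
counts = {0: 0, 1: 0, 2: 0}
for _ in range(100000):
    counts[choose_primary([0, 1, 2], affinity)] += 1
print({osd: round(n / 100000, 3) for osd, n in counts.items()})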

[1]: https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2456
[2]: https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2561

So, setting the primary affinity of all SSD OSDs to 1.0 should be sufficient for them to become the primaries in my case.
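
For example, something like this should do it (a rough, untested sketch; it assumes the SSDs are in a device class named "ssd" and that the ceph CLI is available on the host running the script):

import json
import subprocess

def ceph(*args):
    # Run a ceph CLI command and return its stdout.
    return subprocess.run(
        ["ceph", *args], check=True, capture_output=True, text=True
    ).stdout

# List all OSDs in the "ssd" device class and pin their primary
# affinity to 1.0 (which is also the default), so their test always
# succeeds and they become primary whenever CRUSH lists them first.
ssd_osds = json.loads(ceph("osd", "crush", "class", "ls-osd", "ssd", "--format=json"))
for osd_id in ssd_osds:
    ceph("osd", "primary-affinity", f"osd.{osd_id}", "1.0")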

Do you think I should contribute these findings to the documentation?

> This, however, is not a sustainable situation. Any addition of OSDs will mess this up and the distribution scheme will fail in the future. A way out seems to be:
>
> - subdivide your HDD storage using device classes:
> * define a device class for HDDs with primary affinity=0; for example, pick 5 HDDs and change their device class to hdd_np (for "no primary")
> * set the primary affinity of these HDD OSDs to 0
> * modify your crush rule to use "step take default class hdd_np"
> * this will create a pool with primaries on SSD and a balanced storage distribution between SSD and HDD
> * all-HDD pools are deployed as usual on class hdd
> * when increasing capacity, one needs to take care to add the new disks to the hdd_np class and set their primary affinity to 0
> * somewhat increased admin effort, but a fully working solution
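
For reference, the re-classing described above could be scripted roughly like this (an untested sketch; the list of OSD ids is hypothetical, it assumes the ceph CLI is available, and the CRUSH rule itself still has to be changed to "step take default class hdd_np" separately):

import subprocess

def ceph(*args):
    # Run a ceph CLI command.
    subprocess.run(["ceph", *args], check=True)

# Hypothetical HDD OSDs picked for the "no primary" class.
no_primary_osds = [10, 11, 12, 13, 14]

for osd_id in no_primary_osds:
    osd = f"osd.{osd_id}"
    # The existing device class must be removed before a new one can be set.
    ceph("osd", "crush", "rm-device-class", osd)
    ceph("osd", "crush", "set-device-class", "hdd_np", osd)
    # Make sure these OSDs are never chosen as primary.
    ceph("osd", "primary-affinity", osd, "0")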
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



