Re: PGs get placed in the same datacenter (Trying to make a hybrid NVMe/HDD pool with 6 servers, 2 in each datacenter)

David Turner <drakonstein@xxxxxxxxx> · Sat, 07 Oct 2017 20:41:53 +0000

Disclaimer: I have never attempted this configuration, especially with Luminous. I doubt many have, but it's a curious configuration, and I'd love to help find out whether it is possible.
There is one logical problem with your configuration (which you have most likely considered).  If you want all of your PGs to be primary on NVMes across the 3 DCs, then 1/3 of the storage you plan to use for this pool needs to come from NVMes, since each PG keeps 1 of its 3 copies there.  Otherwise the NVMes will fill up long before the HDDs, and your cluster will be "full" while your HDDs are nearly empty.  I say "that you plan to use for this pool" because if you plan to put other data on just the HDDs to make use of that extra space, then it is part of the plan that your NVMes don't total 1/3 of your storage.
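
To put rough numbers on that, here is a minimal sketch of the arithmetic, assuming a size-3 pool with exactly 1 copy on NVMe and 2 on HDD. The function names and the capacities are made up for illustration, and the sketch ignores full ratios and uneven PG distribution.

```python
def usable_capacity_tb(nvme_raw_tb: float, hdd_raw_tb: float) -> float:
    """Data the pool can hold with 1 NVMe copy and 2 HDD copies per object.

    Each TB of data consumes 1 TB of NVMe and 2 TB of HDD, so whichever
    tier runs out first sets the limit.
    """
    return min(nvme_raw_tb, hdd_raw_tb / 2)


def hdd_fill_fraction(nvme_raw_tb: float, hdd_raw_tb: float) -> float:
    """How full the HDD tier is at the moment the pool cannot take more data."""
    return 2 * usable_capacity_tb(nvme_raw_tb, hdd_raw_tb) / hdd_raw_tb


# Hypothetical capacities, not taken from the cluster in this thread.
# Balanced: NVMe is 10 of 30 TB raw (1/3), so both tiers fill together.
print(usable_capacity_tb(10, 20), hdd_fill_fraction(10, 20))

# NVMe-starved: NVMe is 4 of 24 TB raw, so the pool is full at 4 TB of data
# while the HDDs are well under half used.
print(usable_capacity_tb(4, 20), hdd_fill_fraction(4, 20))
```

In the second case the leftover HDD space is only useful if something else, such as an HDD-only pool, is planned for it.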

Second, I'm noticing that if a PG has its primary OSD in any datacenter other than TEG4, then only 1 other datacenter is available to hold its 2 HDD copies.  If the rules were working properly, I would expect the PG to be stuck undersized rather than choosing an OSD from a datacenter it shouldn't be able to use.  You could test setting the size to 2 for this pool (while you're missing the third HDD node) to see whether any PGs still end up on an HDD and an NVMe in the same DC.  I think you will likely find that PGs can still place 2 copies in the same DC with your current configuration.
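
One way to check the result of such a test is to look for PGs with more than one OSD in the same datacenter. The sketch below assumes you have already built two mappings from what the cluster reports (for example from `ceph pg dump` and `ceph osd tree`). The helper name, PG ids, OSD ids and datacenter names are all hypothetical stand-ins.

```python
from collections import Counter


def pgs_sharing_a_datacenter(pg_to_osds, osd_to_dc):
    """Return {pgid: [datacenter, ...]} for PGs with 2+ OSDs in one datacenter."""
    offenders = {}
    for pgid, osds in pg_to_osds.items():
        # Count how many of this PG's OSDs live in each datacenter.
        per_dc = Counter(osd_to_dc[osd] for osd in osds)
        shared = sorted(dc for dc, count in per_dc.items() if count > 1)
        if shared:
            offenders[pgid] = shared
    return offenders


# Hypothetical sample data: two OSDs per datacenter, three PGs.
osd_to_dc = {0: "dc-a", 1: "dc-a", 2: "dc-b", 3: "dc-b", 4: "dc-c", 5: "dc-c"}
pg_to_osds = {
    "1.0": [0, 2, 4],  # one OSD per datacenter
    "1.1": [0, 1, 4],  # OSDs 0 and 1 are both in dc-a
    "1.2": [2, 5],     # a size-2 PG spread over two datacenters
}

print(pgs_sharing_a_datacenter(pg_to_osds, osd_to_dc))
```

An empty result after setting size to 2 would suggest the rule separates the NVMe and HDD copies. Any entries point at the PGs worth inspecting.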

Then, I believe, the next best configuration would be to set size for this pool to 4.  It would choose an NVMe as the primary OSD and then choose an HDD from each DC for the secondary copies.  This guarantees that a copy of the data goes into each DC, and you will have 2 copies in DCs away from the primary NVMe copy.  It wastes a copy of all of the data in the pool, but that copy sits on the much cheaper HDD storage and can probably be considered an acceptable loss for the sake of having the primary OSD on NVMe drives.
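
As a sanity check on that claim, the sketch below models only the intended outcome of a size-4 layout, not CRUSH itself. It assumes 3 datacenters, an NVMe primary in one of them and one HDD copy in each. The datacenter names and the function are hypothetical.

```python
DATACENTERS = ("dc-a", "dc-b", "dc-c")  # hypothetical names


def size4_layout(primary_dc):
    """Copies as (device_class, datacenter): 1 NVMe primary plus 1 HDD per DC."""
    if primary_dc not in DATACENTERS:
        raise ValueError(f"unknown datacenter: {primary_dc}")
    return [("nvme", primary_dc)] + [("hdd", dc) for dc in DATACENTERS]


for primary_dc in DATACENTERS:
    layout = size4_layout(primary_dc)
    covered = {dc for _, dc in layout}
    remote = [copy for copy in layout if copy[1] != primary_dc]
    # Every datacenter holds a copy, and 2 copies sit outside the primary's DC.
    print(primary_dc, sorted(covered), len(remote))
```

Whichever datacenter holds the NVMe primary, all 3 datacenters are covered and 2 HDD copies are remote. The 4th copy is the HDD that shares a datacenter with the primary.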

On Sat, Oct 7, 2017 at 3:36 PM Peter Linder <peter.linder@xxxxxxxxxxxxxx> wrote:

    On 10/7/2017 8:08 PM, David Turner wrote:

      Just to make sure you understand: reads will happen on the
      primary OSD for the PG and not the nearest OSD, meaning that
      reads will go between the datacenters. Also, each write will not
      ack until all 3 writes happen, adding latency to both writes and
      reads.

    Yes, I understand this. It is actually fine; the datacenters have
    been selected so that they are about 10-20 km apart. This yields
    around a 0.1-0.2 ms round trip time due to the speed of light being
    too low. Nevertheless, network latency shouldn't be a problem, and
    it's all a 40G (dedicated) TRILL network for the moment.
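
The 0.1-0.2 ms figure can be sanity-checked with a short calculation. This is a rough sketch that assumes light travels through fiber at about 200,000 km/s (roughly 2/3 of its speed in vacuum) and that the fiber path is as long as the stated distance. It ignores switching and serialization delay, and the constant and function names are made up for illustration.

```python
# Assumption: ~200,000 km/s in fiber, i.e. 200 km per millisecond.
FIBER_KM_PER_MS = 200.0


def round_trip_ms(distance_km: float) -> float:
    """Propagation-only round trip time over a fiber path of the given length."""
    return 2 * distance_km / FIBER_KM_PER_MS


for km in (10, 20):
    print(f"{km} km -> {round_trip_ms(km):.1f} ms round trip")
```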

    I just want to be able to select 1 SSD and 2 HDDs, all spread out. I
    can do that, but one of the HDDs ends up in the same datacenter as
    the SSD, probably because I'm using the "take" command 2 times
    (which resets the selected buckets?).

        On Sat, Oct 7, 2017, 1:48 PM Peter Linder <peter.linder@xxxxxxxxxxxxxx> wrote:

            On 10/7/2017 7:36 PM, Дробышевский, Владимир wrote:

              Hello!

                  2017-10-07 19:12 GMT+05:00 Peter Linder <peter.linder@xxxxxxxxxxxxxx>:

                    The idea is to select an nvme osd, and then select
                    the rest from hdd osds in different datacenters
                    (see crush map below for hierarchy).

                    It's a little bit aside from the question, but why
                    do you want to mix SSDs and HDDs in the same pool?
                    Do you have a read-intensive workload, and are you
                    going to use primary-affinity to get all reads
                    from nvme?

            Yes, this is pretty much the idea: getting the performance
            from NVMe reads, while still maintaining triple redundancy
            and a reasonable cost.

                -- 

                        Regards,
                        Vladimir

          _______________________________________________
          ceph-users mailing list
          ceph-users@xxxxxxxxxxxxxx
          http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
