Ashley Merrick wrote:
Correct, in a large cluster there is no problem.
I was talking about Wladimir's setup, where they are running a single node with
a failure domain of OSD, which would mean a loss of all OSDs and all data.
Sure, I am aware that running with one NVMe is risky, so we have a plan to
add a mirroring NVMe to it at some point in the future. I hope this can be solved
with a simple mdadm+lvm scheme.
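Roughly, the mdadm+lvm scheme I have in mind is something like this (just a
sketch, not tested; the device names, VG name and LV sizes are placeholders
for our hardware):

    # Mirror the two NVMe devices with mdadm.
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1

    # Put LVM on top of the mirror and carve one DB/WAL LV per OSD.
    pvcreate /dev/md0
    vgcreate nvme_mirror /dev/md0
    for i in $(seq 0 7); do lvcreate -L 60G -n osd-db-$i nvme_mirror; done

    # A rebuilt OSD would then point its BlueStore DB at a mirrored LV, e.g.:
    # ceph-volume lvm create --bluestore --data /dev/sdb --block.db nvme_mirror/osd-db-0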
Btw, are there any recommendations on the cheapest Ceph node hardware? Now
I understand that 8x3TB HDDs in a single host is quite a centralized
setup, and I have a feeling that a good Ceph cluster should have at least
as many hosts as OSDs per host: with 8 OSDs per host, at least 8 hosts, or
at least 3 hosts with 3 OSDs each. Right? Then it would be reasonable to
add a single NVMe per host, so that any component of a host can fail within
failure domain = host.
I am still thinking within the cheapest concept of multiple HDDs + a
single NVMe per host.
---- On Sun, 22 Sep 2019 03:42:52 +0800 solarflow99
<solarflow99@xxxxxxxxx> wrote ----
Now my understanding is that an NVMe drive is recommended to help
speed up BlueStore. If it were to fail, then those OSDs would be
lost, but assuming there is 3x replication and enough OSDs, I don't
see the problem here. There are other scenarios where a whole
server might be lost; that doesn't mean the total loss of the cluster.
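For what it's worth, that assumption is easy to verify on a given cluster
(the pool name "rbd" below is just an example):

    ceph osd pool get rbd size        # expect 3 for 3x replication
    ceph osd pool get rbd min_size
    ceph osd tree                     # OSDs grouped under their hosts
    ceph osd crush rule dump          # shows the failure domain of each rule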
On Sat, Sep 21, 2019 at 5:27 AM Ashley Merrick
<singapore@xxxxxxxxxxxxxx> wrote:
Placing it as a journal / BlueStore DB/WAL will mostly help with writes;
by the sounds of it you want to increase read performance?
How important is the data on this Ceph cluster?
If you place it as a journal / DB/WAL, any failure of it will cause
total data loss, so I would very much advise against this unless
this is purely for testing and total data loss is not an issue.
In that case it is worth upgrading to BlueStore by rebuilding each
OSD and placing the DB/WAL on an SSD partition. You can do this one
OSD at a time, but there is no migration path, so you would need
to wait for data to rebuild after each OSD change before moving
onto the next.
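Very roughly, the per-OSD rebuild looks something like the following
(sketch only; osd.0, /dev/sdb and /dev/nvme0n1p1 are placeholders, and you
want the cluster back to HEALTH_OK before moving on):

    # Drain one OSD and wait for recovery to finish (watch "ceph -s").
    ceph osd out osd.0

    # Once recovery is done, stop and remove it.
    systemctl stop ceph-osd@0
    ceph osd purge 0 --yes-i-really-mean-it

    # Wipe the HDD and recreate the OSD with its DB/WAL on the SSD partition.
    ceph-volume lvm zap /dev/sdb --destroy
    ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1

    # Wait for backfill to complete before starting the next OSD.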
If you need to make sure your data is safe then you're really
limited to using it as a read-only cache, but I think even then
most setups would cause all OSDs to go offline until you
manually removed the failed disk from the read-only cache.
bcache/dm-cache may handle this automatically, however it
is still a risk that I personally wouldn't want to take.
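If you do try the cache route, a bcache setup in a read-oriented mode would
look roughly like this (sketch; device names are placeholders, and
writethrough/writearound keep the backing HDD authoritative so losing the
cache device should not lose data):

    # Format the HDD as a backing device and the NVMe partition as a cache device.
    make-bcache -B /dev/sdb
    make-bcache -C /dev/nvme0n1p1

    # Attach the cache set to the backing device
    # (UUID from "bcache-super-show /dev/nvme0n1p1").
    echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach

    # writethrough (or writearound) so the backing HDD always holds the full data.
    echo writethrough > /sys/block/bcache0/bcache/cache_mode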
Also, the best option really depends on your use case for Ceph and the
I/O activity you expect.
---- On Fri, 20 Sep 2019 14:56:12 +0800 Wladimir Mutel
<mwg@xxxxxxxxx> wrote ----
Dear everyone,
Last year I set up an experimental Ceph cluster (still single node,
failure domain = osd, MB Asus P10S-M WS, CPU Xeon E3-1235L, RAM 64 GB,
HDDs WD30EFRX, Ubuntu 18.04, now with kernel 5.3.0 from the Ubuntu
mainline PPA and Ceph 14.2.4 from
download.ceph.com/debian-nautilus/dists/bionic).
I set up a JErasure 2+1 pool, created some RBDs using it as their data
pool, and exported them over iSCSI (using tcmu-runner, gwcli and
associated packages). But with an HDD-only setup their performance was
less than stellar, not even saturating 1Gbit Ethernet on RBD reads.
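For context, the pools and images were created roughly like this (from
memory, so the names and PG counts are only illustrative):

    # 2+1 erasure-coded data pool with failure domain = osd.
    ceph osd erasure-code-profile set ec21 k=2 m=1 crush-failure-domain=osd
    ceph osd pool create rbd-ec-data 64 64 erasure ec21
    ceph osd pool set rbd-ec-data allow_ec_overwrites true

    # Replicated pool for RBD metadata, with image data going to the EC pool.
    ceph osd pool create rbd-meta 64 64 replicated
    rbd pool init rbd-meta
    rbd create rbd-meta/test-image --size 100G --data-pool rbd-ec-data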
This year my experiment was funded with a Gigabyte PCIe NVMe 1TB SSD
(GP-ASACNE2100TTTDR). It is now plugged into the MB and is visible as a
storage device to lsblk. I can also see its 4 interrupt queues in
/proc/interrupts, and its transfer rate measured by hdparm -t is
about 2.3 GB/sec.
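The checks were basically just (assuming the drive shows up as /dev/nvme0n1):

    lsblk /dev/nvme0n1            # visible as a block device
    grep nvme /proc/interrupts    # its interrupt queues
    hdparm -t /dev/nvme0n1        # sequential read, about 2.3 GB/sec here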
And now I want to ask your advice on how to best include it in this
already existing setup. Should I allocate it for OSD journals and
databases? Is there a way to reconfigure an existing OSD in this way
without destroying and recreating it? Or are there plans to ease this
kind of migration? Can I add it as a write-absorbing cache to
individual RBD images? To individual block devices at the level of
bcache/dm-cache? What about speeding up RBD reads?
I would appreciate reading your opinions and recommendations.
(I just want to warn you that in this situation I don't have the
financial option of going full-SSD.)
Thank you all in advance for your responses.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com