Re: running Qemu / Hypervisor AND Ceph on the same nodes

On 03/26/2015 12:13 PM, Quentin Hartman wrote:
> That one big server sounds great, but it also sounds like a single point
> of failure.

Absolutely, but I'm talking about folks who want dozens of these, not one.

> It's also not cheap. I've been able to build this cluster
> for about $1400 per node, including the 10Gb networking gear, which is

Think about how much that is per OSD, and consider power, cooling, and $/sqft of datacenter space. I would *love* it if folks would gravitate toward smaller nodes with fewer drives for Ceph. Dense nodes tend to be much more complicated, and that complexity is a hidden cost folks often gloss over. Having said that, small nodes are absolutely more expensive per OSD as far as raw hardware and power/cooling go.

> less than what I see the _empty case_ you describe going for new. Even
> used, the lowest I've seen (lacking trays at that price) is what I paid
> for one of my nodes including CPU and RAM, and drive trays. So, it's
> been a pretty inexpensive venture considering what we get out of it. I
> have no per-node fault tolerance, but if one of my nodes dies, I just
> restart the VMs that were on it somewhere else and wait for Ceph to
> heal. I also benefit from higher aggregate network bandwidth because I
> have more ports on the wire. And better per-U CPU and RAM density (for
> the money). *shrug* different strokes.

> As for difficulty of management, any screwing around I've done has had
> nothing to do with the converged nature of the setup, aside from
> discovering and changing the one setting I mentioned. So, for me at
> least, it's been pretty much an unqualified net win. I can imagine all
> sorts of scenarios where that wouldn't be, but I think it's probably
> debatable whether or not those constitute a common case. The higher node
> count does add some complexity, but that's easily overcome with some
> simple automation. Again though, that has no bearing on the converged
> setup; it's just a function of how much CPU and RAM we need for our use
> case.
>
> I guess what I'm trying to say is that I don't think the answer is as
> cut and dried as you seem to think.

I don't want you to feel like I'm attacking your setup. I'm sure it works very well! I don't think we'd be able to convince many folks to adopt servers that only provide 3 OSDs per U though, converged or not.
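
(Purely to make the per-OSD and per-U numbers concrete, here is a rough
back-of-the-envelope sketch in Python. Only the ~$1400 / 3 OSDs / 1U
figures and the 36-drive 4U chassis come from this thread; the dense
node's price is an invented placeholder, just for illustration.)

    # Back-of-the-envelope density/cost comparison.
    # Small-node figures are from this thread; the dense-node price is a
    # made-up assumption, not a quote.
    small = {"cost": 1400.0, "osds": 3,  "u": 1}   # converged 1U node
    dense = {"cost": 9000.0, "osds": 36, "u": 4}   # hypothetical 4U chassis

    for name, n in (("small", small), ("dense", dense)):
        print("%s node: $%.0f per OSD, %.1f OSDs per U"
              % (name, n["cost"] / n["osds"], n["osds"] / float(n["u"])))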

Mark


> QH
>
> On Thu, Mar 26, 2015 at 9:36 AM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:

    I suspect a config like this where you only have 3 OSDs per node
    would be more manageable than something denser.

    I.e., theoretically a single E5-2697v3 is enough to run 36 OSDs in a 4U
    super micro chassis for a semi-dense converged solution.  You could
    attempt to restrict the OSDs to one socket and then use a second
    E5-2697v3 for VMs.  Maybe after you've got cgroups set up properly
    and if you've otherwise balanced things it would work out OK.  I
    question though how much you really benefit by doing this rather
    than running a 36-drive storage server with lower-bin CPUs and a 2nd
    1U box for VMs (which you don't need as many of because you can
    dedicate both sockets to VMs).
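
    (For what it's worth, a minimal sketch in Python of the kind of OSD
    pinning meant here -- assuming Linux, Python 3, and that socket 0's 14
    cores (the E5-2697v3 is a 14-core part) are CPUs 0-13; the core
    numbering is an assumption about the box, not a recommendation:)

      import os
      import subprocess

      # Illustrative only: pin every running ceph-osd process to socket 0's
      # cores, leaving socket 1 free for the VMs.
      SOCKET0_CORES = set(range(14))   # assumed: CPUs 0-13 live on socket 0

      for pid in subprocess.check_output(["pgrep", "-x", "ceph-osd"]).split():
          os.sched_setaffinity(int(pid), SOCKET0_CORES)
          print("pinned ceph-osd pid %d to socket 0" % int(pid))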

    It probably depends quite a bit on how memory, network, and disk
    intensive the VMs are, but my take is that it's better to error on
    the side of simplicity rather than making things overly
    complicated.  Every second you are screwing around trying to make
    the setup work right eats into any savings you might gain by going
    with the converged setup.

    Mark

    On 03/26/2015 10:12 AM, Quentin Hartman wrote:

        I run a converged OpenStack / Ceph cluster with 14 1U nodes. Each
        has 1 SSD (OS / journals), 3 1TB spinners (1 OSD each), 16 HT
        cores, 10Gb NICs for the Ceph network, and 72GB of RAM. I
        configure OpenStack to leave 3GB of RAM unused on each node for
        OSD / OS overhead. All the VMs are backed by Ceph volumes and
        things generally work very well. I would prefer a dedicated
        storage layer simply because it seems more "right", but I can't
        say that any of the common concerns about running this kind of
        setup have come up for me. Aside from shaving off that 3GB of
        RAM, my deployment isn't any more complex than a split-stack
        deployment would be. After running like this for the better part
        of a year, I would have a hard time honestly making a real
        business case for the extra hardware a split-stack cluster would
        require.
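
        (For reference, a minimal sketch of how such a reservation is
        commonly expressed -- assuming here that it is done via
        nova-compute's reserved_host_memory_mb option; the mechanism and
        exact value are deployment-specific:)

          # /etc/nova/nova.conf (illustrative snippet)
          [DEFAULT]
          # Hold back 3GB per hypervisor for the OSDs and the OS itself,
          # so the scheduler never hands that memory to guests.
          reserved_host_memory_mb = 3072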

        QH

        On Thu, Mar 26, 2015 at 6:57 AM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:

             It's kind of a philosophical question.  Technically there's
             nothing that prevents you from putting Ceph and the
             hypervisor on the same boxes.  It's a question of whether or
             not potential cost savings are worth increased risk of
             failure and contention.  You can minimize those things
             through various means (cgroups, restricting NUMA nodes,
             etc).  What is more difficult is isolating disk IO
             contention (say if you want local SSDs for VMs), memory bus
             and QPI contention, network contention, etc.  If the VMs are
             working really hard you can restrict them to their own
             socket, and you can even restrict memory usage to the local
             socket, but what about remote-socket network or disk IO?
             (You will almost certainly want these things on the Ceph
             socket.)  I wonder as well about increased risk of hardware
             failure with the increased load, but I don't have any
             statistics.
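
             (A minimal sketch in Python of that kind of socket and
             memory confinement, assuming Linux, Python 3, a cgroup-v1
             cpuset hierarchy, and that socket 1 is CPUs 14-27 / NUMA
             node 1; the group name and the numbers are illustrative
             assumptions, not a recommendation:)

               import os

               # Illustrative only: confine VM processes to socket 1's cores
               # and its local NUMA node, keeping socket 0 and its memory
               # for the OSDs.
               CG = "/sys/fs/cgroup/cpuset/vms"      # hypothetical group
               os.makedirs(CG, exist_ok=True)

               with open(os.path.join(CG, "cpuset.cpus"), "w") as f:
                   f.write("14-27")                  # assumed socket 1 CPUs
               with open(os.path.join(CG, "cpuset.mems"), "w") as f:
                   f.write("1")                      # memory from node 1 only

               # qemu PIDs would then be written into the group's "tasks" file.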

             I'm guessing if you spent enough time at it you could make
             it work relatively well, but at least personally I question
             how beneficial it really is after all of that.  If you are
             going for cost savings, I suspect efficient compute and
             storage node designs will be nearly as good with much less
             complexity.

             Mark


             On 03/26/2015 07:11 AM, Wido den Hollander wrote:

                 On 26-03-15 12:04, Stefan Priebe - Profihost AG wrote:

                     Hi Wido,
                      On 26.03.2015 at 11:59, Wido den Hollander wrote:

                          On 26-03-15 11:52, Stefan Priebe - Profihost AG wrote:

                              Hi,

                              in the past I read pretty often that it's
                              not a good idea to run Ceph and QEMU / the
                              hypervisors on the same nodes.

                              But why is this a bad idea? You save space
                              and can better use the resources you have
                              in the nodes anyway.


                          Memory pressure during recovery *might* become
                          a problem. If you make sure that you don't
                          allocate more than, let's say, 50% of the
                          memory for the guests it could work.


                      Mhm, sure? I've never seen problems like that.
                      Currently I run each Ceph node with 64GB of memory
                      and each hypervisor node with around 512GB to 1TB
                      of RAM while having 48 cores.


                  Yes, it can happen. Your machines have enough memory,
                  but if you overprovision them it can still happen.

                          Using cgroups you could also prevent the OSDs
                          from eating up all the memory or CPU.

                      Never seen an OSD doing such crazy things.


                  Again, it really depends on the available memory and
                  CPU. If you buy big machines for this purpose it
                  probably won't be a problem.

                     Stefan

                          So technically it could work, but memory and
                          CPU pressure is something which might give you
                          problems.

                             Stefan


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



