That one big server sounds great, but it also sounds like a single point of failure. It's also not cheap. I've been able to build this cluster for about $1400 per node, including the 10Gb networking gear, which is less than what I see the _empty case_ you describe going for new. Even used, the lowest I've seen (lacking trays at that price) is what I paid for one of my nodes including CPU and RAM, and drive trays. So, it's been a pretty inexpensive venture considering what we get out of it. I have no per-node fault tolerance, but if one of my nodes dies, I just restart the VMs that were on it somewhere else and wait for ceph to heal. I also benefit from higher aggregate network bandwidth because I have more ports on the wire. And better per-U cpu and RAM density (for the money). *shrug* different strokes.
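As a rough illustration of the "restart the VMs elsewhere and wait for ceph to heal" trade-off, here is a minimal arithmetic sketch. It assumes load was spread evenly across the 14 nodes before the failure and that each node contributes a single 10Gb port to the ceph network; neither assumption is confirmed in the thread, and the function name is made up for this example.

```python
def extra_load_per_survivor(nodes: int, failed: int = 1) -> float:
    """Fraction of additional load each surviving node absorbs,
    assuming load was evenly spread before the failure."""
    survivors = nodes - failed
    if survivors <= 0:
        raise ValueError("no surviving nodes")
    return failed / survivors

NODES = 14          # node count from the thread
PORT_GBPS = 10      # assumption: one 10Gb port per node on the ceph network

print(f"extra load per survivor: {extra_load_per_survivor(NODES):.1%}")
print(f"aggregate ceph bandwidth: {NODES * PORT_GBPS} Gb/s")
```

With many small nodes, a single failure pushes only a small extra share of VMs and recovery traffic onto each survivor, which is the flip side of having no per-node fault tolerance.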
As for difficulty of management, any screwing around I've done has had nothing to do with the converged nature of the setup, aside from discovering and changing the one setting I mentioned. So, for me at least, it's been a pretty well unqualified net win. I can imagine all sorts of scenarios where that wouldn't be, but I think it's probably debatable whether or not those constitute a common case. The higher node count does add some complexity, but that's easily overcome with some simple automation. Again though, that has no bearing on the converged setup, it's just a factor of how much CPU and RAM we need for our use case.
I guess what I'm trying to say is that I don't think the answer is as cut and dried as you seem to think.
QH
On Thu, Mar 26, 2015 at 9:36 AM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
I suspect a config like this where you only have 3 OSDs per node would be more manageable than something denser.
I.e., theoretically a single E5-2697v3 is enough to run 36 OSDs in a 4U Supermicro chassis for a semi-dense converged solution. You could attempt to restrict the OSDs to one socket and then use a second E5-2697v3 for VMs. Maybe after you've got cgroups set up properly and if you've otherwise balanced things it would work out ok. I question though how much you really benefit by doing this rather than running a 36 drive storage server with lower bin CPUs and a 2nd 1U box for VMs (which you don't need as many of because you can dedicate both sockets to VMs).
It probably depends quite a bit on how memory, network, and disk intensive the VMs are, but my take is that it's better to err on the side of simplicity rather than making things overly complicated. Every second you are screwing around trying to make the setup work right eats into any savings you might gain by going with the converged setup.
Mark
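To put rough numbers on the density comparison above, a back-of-the-envelope sketch follows. It assumes one E5-2697 v3 exposes 28 hardware threads (14 cores with Hyper-Threading) and that the OSDs get the whole CPU; the 16 HT cores and 3 OSDs per node come from Quentin's description further down. The helper name is hypothetical.

```python
def threads_per_osd(hw_threads: int, osds: int) -> float:
    """Hardware threads available per OSD if the OSDs had the CPU to themselves."""
    if osds <= 0:
        raise ValueError("osds must be positive")
    return hw_threads / osds

# Assumption: one E5-2697 v3 = 14 cores / 28 threads, all given to 36 OSDs.
dense = threads_per_osd(28, 36)
# From the thread: 16 HT cores and 3 OSDs per 1U node (cores shared with VMs).
light = threads_per_osd(16, 3)

print(f"dense 4U node:     {dense:.2f} threads per OSD")
print(f"1U converged node: {light:.2f} threads per OSD")
```

The 1U figure overstates what the OSDs actually get, since the VMs share those cores, but it suggests why a 3-OSD node leaves far more slack than a 36-OSD one.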
On 03/26/2015 10:12 AM, Quentin Hartman wrote:
I run a converged openstack / ceph cluster with 14 1U nodes. Each has 1
SSD (os / journals), 3 1TB spinners (1 OSD each), 16 HT cores, 10Gb NICs
for ceph network, and 72GB of RAM. I configure openstack to leave 3GB of
RAM unused on each node for OSD / OS overhead. All the VMs are backed by
ceph volumes and things generally work very well. I would prefer a
dedicated storage layer simply because it seems more "right", but I
can't say that any of the common concerns of using this kind of setup
have come up for me. Aside from shaving off that 3GB of RAM, my
deployment isn't any more complex than a split stack deployment would
be. After running like this for the better part of a year, I would have
a hard time honestly making a real business case for the extra hardware
a split stack cluster would require.
QH
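The RAM holdback described above is simple arithmetic, sketched here. The thread does not say which mechanism was used to reserve the 3GB; one plausible option (an assumption, not confirmed by the author) is nova's `reserved_host_memory_mb` setting, which would be 3072 for this reserve. The function name is illustrative.

```python
def schedulable_ram_mb(total_gb: int, reserved_gb: int) -> int:
    """RAM left for guests after holding back a reserve for OSDs and the OS."""
    if reserved_gb > total_gb:
        raise ValueError("reserve exceeds installed RAM")
    return (total_gb - reserved_gb) * 1024

NODES = 14  # node count from the thread
per_node = schedulable_ram_mb(total_gb=72, reserved_gb=3)

print(f"per node: {per_node} MB for guests")
print(f"cluster:  {per_node * NODES} MB for guests")
```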
On Thu, Mar 26, 2015 at 6:57 AM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
It's kind of a philosophical question. Technically there's nothing
that prevents you from putting ceph and the hypervisor on the same
boxes. It's a question of whether or not potential cost savings are
worth increased risk of failure and contention. You can minimize
those things through various means (cgroups, restricting NUMA nodes,
etc). What is more difficult is isolating disk IO contention (say
if you want local SSDs for VMs), memory bus and QPI contention,
network contention, etc. If the VMs are working really hard you can
restrict them to their own socket, and you can even restrict memory
usage to the local socket, but what about remote socket network or
disk IO? (you will almost certainly want these things on the ceph
socket) I wonder as well about increased risk of hardware failure
with the increased load, but I don't have any statistics.
I'm guessing if you spent enough time at it you could make it work
relatively well, but at least personally I question how beneficial
it really is after all of that. If you are going for cost savings,
I suspect efficient compute and storage node designs will be nearly
as good with much less complexity.
Mark
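A minimal sketch of the "restrict them to their own socket" idea, covering CPU affinity only. The 2-socket, 16-CPU topology is hypothetical (real machines often interleave CPU ids across sockets, so the mapping should come from the host itself), and the helper names are made up. `os.sched_setaffinity` is Linux-only. This does nothing for memory locality, QPI, disk, or network contention, which are the harder parts described above.

```python
from __future__ import annotations

import os


def split_by_socket(cpu_to_socket: dict[int, int]) -> dict[int, set[int]]:
    """Group logical CPU ids by the socket they belong to."""
    sockets: dict[int, set[int]] = {}
    for cpu, socket in cpu_to_socket.items():
        sockets.setdefault(socket, set()).add(cpu)
    return sockets


def pin_process(pid: int, cpus: set[int]) -> None:
    """Restrict a running process to the given CPUs (Linux only)."""
    os.sched_setaffinity(pid, cpus)


# Hypothetical topology: CPUs 0-7 on socket 0, CPUs 8-15 on socket 1.
topology = {cpu: (0 if cpu < 8 else 1) for cpu in range(16)}
sockets = split_by_socket(topology)
ceph_cpus, vm_cpus = sockets[0], sockets[1]

print("ceph socket CPUs:", sorted(ceph_cpus))
print("VM socket CPUs:  ", sorted(vm_cpus))
# pin_process(osd_pid, ceph_cpus) would then confine one OSD to socket 0,
# where osd_pid is the pid of an OSD daemon on this host.
```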
On 03/26/2015 07:11 AM, Wido den Hollander wrote:
On 26-03-15 12:04, Stefan Priebe - Profihost AG wrote:
Hi Wido,
On 26.03.2015 at 11:59, Wido den Hollander wrote:
On 26-03-15 11:52, Stefan Priebe - Profihost AG wrote:
Hi,
In the past I have read pretty often that it's not a good idea to run
ceph and qemu / the hypervisors on the same nodes.
But why is this a bad idea? You save space and can better use the
resources you have in the nodes anyway.
Memory pressure during recovery *might* become a problem. If you make
sure that you don't allocate more than, let's say, 50% for the guests,
it could work.
Hmm, sure? I've never seen problems like that. Currently I run each
ceph node with 64GB of memory and each hypervisor node with around
512GB to 1TB of RAM and 48 cores.
Yes, it can happen. You have machines with enough memory, but if you
overprovision the machines it can happen.
Using cgroups you could also prevent the OSDs from eating up all
memory or CPU.
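As a hedged sketch of what such a cap could look like, the snippet below computes the values for the cgroup v2 control files `memory.max` and `cpu.max`. This is an assumption about the interface: at the time of this thread cgroup v1 was the norm, where the equivalent files are `memory.limit_in_bytes` and `cpu.cfs_quota_us` / `cpu.cfs_period_us`. The 3 GiB and 2-core figures are illustrative only, and the function name is made up.

```python
from __future__ import annotations


def cgroup_v2_limits(mem_limit_gib: float, cpu_cores: float,
                     period_us: int = 100_000) -> dict[str, str]:
    """Return cgroup v2 control-file contents capping memory and CPU."""
    mem_bytes = int(mem_limit_gib * 1024**3)
    quota_us = int(cpu_cores * period_us)  # CPU time allowed per period
    return {
        "memory.max": str(mem_bytes),
        "cpu.max": f"{quota_us} {period_us}",
    }


# Illustrative cap for a group holding the OSD processes.
limits = cgroup_v2_limits(mem_limit_gib=3, cpu_cores=2)
for name, value in limits.items():
    print(f"{name}: {value}")
```

Writing these values into the group's directory under the cgroup filesystem, and moving the OSD pids into that group, is left out here because the group path depends on how the host is set up.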
Never seen an OSD doing such crazy things.
Again, it really depends on the available memory and CPU. If you buy
big machines for this purpose it probably won't be a problem.
Stefan
So technically it could work, but memory and CPU pressure is something
which might give you problems.
Stefan
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com