Re: Ceph for "home lab" / hobbyist use?

I run Ceph on both a home server and a personal offsite backup server (both single-host setups). It's definitely feasible and comes with a lot of advantages over traditional RAID, ZFS, and the like. The main disadvantages are performance overhead and resource consumption.

On 07/09/2019 06.16, William Ferrell wrote:
> They're about $50 each, can boot from MicroSD or eMMC flash (basically
> an SSD with a custom connector), and have one SATA port. They have
> 8-core 32-bit CPUs, 2GB of RAM and a gigabit ethernet port. Four of
> them (including disks) can run off a single 12V/8A power adapter
> (basically 100 watts per set of 4). The obvious appeal is price, plus
> they're stackable so they'd be easy to hide away in a closet.
>
> Is it feasible for these to work as OSDs at all? The Ceph hardware
> recommendations page suggests OSDs need 1GB per TB of space, so does
> this mean these wouldn't be suitable with, say, a 4TB or 8TB disk? Or
> would they work, but just more slowly?

2GB seems tight, but *should* work if you're literally running an OSD and only an OSD on the thing.

I use a 32GB RAM server to run 64TB worth of raw storage (8x8TB SATA disks), plus mon/mds, plus a bunch of unrelated applications and servers, routing, and even a desktop UI (I'll soon be splitting off some of these duties, since this box has grown into way too much of a kitchen sink). It used to be 16GB when I first moved to Ceph, and that juuust about worked but it was tight. So the 1GB/1TB recommendation is ample, 1GB/2TB works well, and 1GB/4TB is tight.

I configure my OSDs for a 1.7GB memory target, so that should work on your 2GB RAM board, but it doesn't give you much headroom for increased consumption during recovery. To be safe I'd set them up for 1.2GB or so target on a board like yours.
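
For reference, the knob in question is osd_memory_target, which takes a value in bytes. A minimal sketch, assuming a release with the centralized config database (Mimic or later); the exact number here is just an example:

    # target ~1.2 GiB of memory per OSD (value is in bytes)
    ceph config set osd osd_memory_target 1288490188

    # or equivalently in ceph.conf on the OSD host:
    [osd]
    osd memory target = 1288490188

Keep in mind this is a best-effort target for the cache autotuner, not a hard cap, so the process can still exceed it under recovery load. That is exactly why the headroom matters.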

I would recommend keeping your PG count low and relying on the Ceph balancer to even out your disk usage. Keep in mind that the true number of PGs is your PG count multiplied by your pool width: that's your replica count for replicated pools, or the erasure code k+m width for EC pools. I use 8 PGs for my metadata pool (x3 replication = 24 PGs, 3 per OSD) and 64 PGs for my data pool (x7 for my k=5, m=2 RS profile = 448 PGs, 56 per OSD). All my data is on CephFS on my home setup.

Since your memory usage is tight, if you use EC you might want to drop that a bit, e.g. target something like 24 PGs per OSD. For utilization balance, as long as your overall PG count (multiplied by pool width) is a multiple of your OSD count, you should be able to achieve a "perfect" distribution with the balancer (otherwise you'll be off by +/- 1 PG per OSD, and more PGs make that a smaller relative imbalance).
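
To make that concrete, here's a rough sketch of what the pool setup could look like on a hypothetical 4-OSD / 4-host cluster like the one you're describing. The pool names and the k=2, m=2 profile are placeholders rather than a recommendation; adjust k/m to your actual host count:

    # EC profile spread across hosts (needs at least k+m hosts)
    ceph osd erasure-code-profile set ec22 k=2 m=2 crush-failure-domain=host

    # 16 PGs x (k+m=4) = 64 PG copies / 4 OSDs = 16 per OSD
    ceph osd pool create cephfs_data 16 16 erasure ec22
    ceph osd pool set cephfs_data allow_ec_overwrites true

    # 8 PGs x 3 replicas = 24 PG copies / 4 OSDs = 6 per OSD
    ceph osd pool create cephfs_metadata 8 8 replicated

    # let the balancer even out whatever imbalance remains
    ceph balancer mode upmap
    ceph balancer on

That lands at around 22 PGs per OSD, and both totals divide evenly by the OSD count, so the balancer has a shot at a perfectly even distribution.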

> Pushing my luck further (assuming the HC2 can handle OSD duties at
> all), is that enough muscle to run the monitor and/or metadata
> servers? Should monitors and MDS's be run separately, or can/should
> they piggyback on hosts running OSDs?

The MDS will need RAM to cache your metadata (and you want it to, because it makes performance *way* better). I would definitely keep it well away from such tight OSDs. In fact the MDS in my setup uses more RAM than any given OSD, about 2GB or so (the cache size is also configurable). If you have fewer, larger files, resource consumption will be lower (I have a mixed workload with most of the space taken up by large files, but also a few trees full of tiny ones, totaling several million files). You might get away with dedicating a separate identical board to the MDS. Multi-MDS is worth considering, but I've found that during recovery it adds some phases that eat up even more RAM, so it might not be a good idea here.
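
If you do end up squeezing the MDS onto a small board, the cache size is the main lever. A sketch, assuming the memory-based mds_cache_memory_limit knob (Luminous and later); note that actual MDS resident memory tends to run noticeably higher than the cache limit itself:

    # cap the MDS metadata cache at ~1 GiB (value is in bytes)
    ceph config set mds mds_cache_memory_limit 1073741824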

The mon isn't as bad, but I wouldn't push it and try to co-host it on such limited hosts. Mine's sitting at around 1GB resident.

In either case, keep a larger x86 box around that has the software installed and that you can plug some disks into. In the worst case, if you end up in an out-of-memory loop, you can move e.g. the MDS or some OSDs to that machine and bring the cluster back to health.

> I'd be perfectly happy with a setup like this even if it could only
> achieve speeds in the 20-30MB/sec range.

I can't speak for small ARM boards, but my performance (on a quad-core Haswell i5 hosting everything) is somewhere around half of what I'd get with RAID6 over the same storage, using a roughly equivalent k=5, m=2 RS encoding and dm-crypt (AES-NI) under the OSDs. Since you'd be running a single OSD per host, I imagine you should be able to get reasonable aggregate performance out of the whole thing, but I've never tried a setup like that.
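
If you want to put rough numbers on a setup like that before trusting it with real data, rados bench against a scratch pool is a quick (if crude) way to measure raw throughput. The pool name "bench" here is just a placeholder:

    # 30-second write test, keeping the objects around for the read test
    rados bench -p bench 30 write --no-cleanup

    # sequential read test over the objects written above, then clean up
    rados bench -p bench 30 seq
    rados -p bench cleanup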

I'm actually considering this kind of thing in the future (moving from one monolithic server to a more cluster-like setup) but it's just an idea for now.

--
Hector Martin (hector@xxxxxxxxxxxxxx)
Public Key: https://mrcn.st/pub
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


