Norman,
I'm cc'ing this back to ceph-users so that others can reply to it, or find it in future.
On 21/08/2018 12:01, Norman Gray wrote:
Willem Jan, hello.
Thanks for your detailed notes on my list question.
On 20 Aug 2018, at 21:32, Willem Jan Withagen wrote:
# zpool create -m/var/lib/ceph/osd/osd.0 osd.0 gpt/zd000 gpt/zd001
Over the weekend I updated the FreeBSD section of the Ceph manual with exactly that.
I'm not sure what sort of devices zd000 and zd001 are, but concatenating
devices seriously lowers the MTBF of the vdev. As such it is likely
better to create 2 OSDs on these 2 devices.
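For example, instead of concatenating, you could give each device its own pool
and OSD; a sketch along the lines of your own command (the osd numbers are just
illustrative):

# zpool create -m /var/lib/ceph/osd/osd.0 osd.0 gpt/zd000
# zpool create -m /var/lib/ceph/osd/osd.1 osd.1 gpt/zd001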
My sort-of problem is that the machine I'm doing this on was not specced
with Ceph in mind: it has 16 3.5TB disks. Given that
<http://docs.ceph.com/docs/master/start/hardware-recommendations/>
suggests that 20 is a 'high' number of OSDs on a host, I thought it
might be better to aim for an initial setup of 6 two-disk OSDs rather
than 12 one-disk ones (leaving four disks free).
That said, 12 < 20, so I think that, especially bearing in mind your
advice here, I should probably stick to 1-disk OSDs with one (default)
5GB SSD journal each, and not complicate things.
Only one way to find out: try both...
But I certainly do not advise putting concatenated disks in an OSD, especially
not for production. Break one disk and you break the whole vdev.
And the most important thing for OSDs is RAM: 1 GB per 1 TB of disk.
So with 70 TB of disk you'd need 64 GB of RAM or more, and preferably more than
that since ZFS will want its share as well.
CPU is not going to be that much of an issue, unless you have really tiny CPUs.
What I still have not figured out is what to do with the SSDs.
There are 3 things you can do (or any combination of them); a rough sketch of
the commands follows after this list:
1) Ceph standard: make it a journal. Mount the SSD on a separate dir and
get ceph-disk to use it as the journal.
2) Attach a ZFS cache (L2ARC) device to the pool, which will improve reads.
3) Attach a ZFS log (SLOG) device on SSD to the pool, to improve sync writes.
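As a sketch, using the pool and GPT labels from the listings below (how you
partition and label the SSD is up to you; gpt/osd.0.journal is just the label I
happen to use).

For 1), a small pool on an SSD partition, mounted where the journal will live:

# zpool create -m /usr/jails/ceph_0/var/lib/ceph/osd/osd.0/journal-ssd osd.0.journal gpt/osd.0.journal

For 2) and 3), add cache and log devices to the OSD pool:

# zpool add osd_1 cache gpt/osd.1.cache
# zpool add osd_1 log gpt/osd.1.log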
At the moment I'm doing all three:
[~] wjw@xxxxxxxxxxxxxxxxxxxx> zfs list
NAME           USED   AVAIL  REFER  MOUNTPOINT
osd.0.journal  316K   5.33G    88K  /usr/jails/ceph_0/var/lib/ceph/osd/osd.0/journal-ssd
osd.1.journal  316K   5.33G    88K  /usr/jails/ceph_1/var/lib/ceph/osd/osd.1/journal-ssd
osd.2.journal  316K   5.33G    88K  /usr/jails/ceph_2/var/lib/ceph/osd/osd.2/journal-ssd
osd.3.journal  316K   5.33G    88K  /usr/jails/ceph_3/var/lib/ceph/osd/osd.3/journal-ssd
osd.4.journal  316K   5.33G    88K  /usr/jails/ceph_4/var/lib/ceph/osd/osd.4/journal-ssd
osd.5.journal  316K   5.33G    88K  /usr/jails/ceph_0/var/lib/ceph/osd/osd.5/journal-ssd
osd.6.journal  316K   5.33G    88K  /usr/jails/ceph_1/var/lib/ceph/osd/osd.6/journal-ssd
osd.7.journal  316K   5.33G    88K  /usr/jails/ceph_2/var/lib/ceph/osd/osd.7/journal-ssd
osd_0          5.16G   220G  5.16G  /usr/jails/ceph_0/var/lib/ceph/osd/osd.0
osd_1          5.34G   219G  5.34G  /usr/jails/ceph_1/var/lib/ceph/osd/osd.1
osd_2          5.42G   219G  5.42G  /usr/jails/ceph_2/var/lib/ceph/osd/osd.2
osd_3          6.62G  1.31T  6.62G  /usr/jails/ceph_3/var/lib/ceph/osd/osd.3
osd_4          6.83G  1.75T  6.83G  /usr/jails/ceph_4/var/lib/ceph/osd/osd.4
osd_5          5.92G  1.31T  5.92G  /usr/jails/ceph_0/var/lib/ceph/osd/osd.5
osd_6          6.00G  1.31T  6.00G  /usr/jails/ceph_1/var/lib/ceph/osd/osd.6
osd_7          6.10G  1.31T  6.10G  /usr/jails/ceph_2/var/lib/ceph/osd/osd.7
[~] wjw@xxxxxxxxxxxxxxxxxxxx> zpool list -v osd_1
NAME               SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH  ALTROOT
osd_1              232G  5.34G   227G        -         -    0%   2%  1.00x  ONLINE  -
  gpt/osd_1        232G  5.34G   227G        -         -    0%   2%
log                   -      -      -        -         -     -    -
  gpt/osd.1.log    960M    12K   960M        -         -    0%   0%
cache                 -      -      -        -         -     -    -
  gpt/osd.1.cache 22.0G  1.01G  21.0G        -         -    0%   4%
So each OSD has an SSD journal (a small ZFS dataset), and each OSD pool has a
cache and a log device. At the moment the cluster is idle, hence the log is
"empty".
But I would first work on the architecture of how you want the cluster to be,
and only then start tuning. ZFS log and cache devices are easily added and
removed after the fact.
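For example, to take them out again later (labels as in the listing above):

# zpool remove osd_1 gpt/osd.1.cache
# zpool remove osd_1 gpt/osd.1.log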
I found what appear to be a couple of typos in your script which I can
report back to you. I hope to make significant progress with this work
this week, so should be able to give you more feedback on the script, on
my experiences, and on the FreeBSD page in the manual.
Sure, keep 'em coming.
--WjW
I'll work through your various notes. Below are a couple of specific
points.
When I attempt to start the service, I get:
# service ceph start
=== mon.pochhammer ===
You're sort of free to pick names, but most of the time the tooling
expects these naming conventions:
mon: mon.[a-z]
osd: osd.[0-9]+
mgr: mgr.[x-z]
Using other names should work, but I'm not sure it works for all cases.
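The same names show up as section headers in ceph.conf; a minimal sketch (the
host value here is just taken from your mon name and may well differ on your
setup):

[mon.a]
    host = pochhammer

[osd.0]
    host = pochhammer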
Thanks! I wasn't sure if the restricted naming was just for demo
purposes. It's valuable to know that this is very firm advice.
It could also be a permissions thing. Most daemons used to run as root, but
"recently" they started running as user ceph:ceph.
Yes, I had to change ownership of a couple of files before getting this
far.
My mon.a directory looks like:
Aha!
Yup, it is an overwhelming set of tools, with little beginning or end.
I hadn't planned to be particularly Brave, here. But onward...
Best wishes,
Norman
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com