Hi,
On 12/04/2017 12:12 PM, tim taler wrote:
Hi,
thanks a lot for the quick response
and for laying out some of the issues
I'm also new, but I'll try to help. IMHO most of the pros here would be quite worried about this cluster if it is in production:
thought so ;-/
-A prod ceph cluster should not be run with size=2 min_size=1, because:
--In case of a downed osd / host, the cluster could have problems determining which data is correct when the osd/host comes back up
Uhm I thought at least THAT wouldn't be the case here, since we have
three mons??
don't THEY keep track of which osd has the latest data?
isn't the size set on the pool level, not on the cluster level??
It is not a mon-related issue. I do not really understand the cause
either, but either way, the problem exists. You can read back the thread
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-November/022108.html
--If an osd dies, the others get more io (they have to compensate for the lost io capacity plus the rebuilding), which can instantly kill another close-to-death disc (not with ceph, but I have been there with raid)
--If an osd dies and ANY other osd serving that pool has a well-placed inconsistency, like bitrot, you'll lose data
good point, with scrubbing the checksums of the objects are checked, right?
can I get a report somewhere of how many errors were found by the last
scrub run (like in zpool status),
to estimate how well a disk is doing? (right now the raid controller
won't let me read the smart data from the disks)
The deep scrub does check for these; I do not know how to check the stats.
You can probably access the underlying disc's smart data somehow, for example LSI
HBAs provide this at /dev/sgX
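Something along these lines might work (a sketch only; <pgid>, the megaraid slot number and the device names are placeholders you'd have to adjust to your controller and release):
:~ ceph health detail | grep inconsistent          # pgs where scrub found inconsistencies
:~ rados list-inconsistent-obj <pgid> --format=json-pretty   # which objects in that pg are damaged
:~ smartctl -a -d megaraid,0 /dev/sda              # smart passthrough on an LSI/MegaRAID controller; ',0' is the slot
:~ smartctl -a /dev/sg2                            # or directly, for HBAs exposing the discs as /dev/sgX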
-There are not enough hosts in your setup, or rather the discs are not distributed well:
--If an osd / host dies, the cluster tries to repair itself and relocate the data onto another host. In your config there is no other host to reallocate data to if ANY of the hosts fails (I guess the hdds and ssds are separated)
Yup, HDD and SSD form separate pools.
Good point, not in my list of arguments yet
-The disks should not be placed in raid arrays if it can be avoided, especially raid0:
--You multiply the probability of an unrecoverable disc error, and since the data is striped, the other disk's data becomes unrecoverable too
--When an osd dies, the cluster should relocate the data onto another osd. With a raid0 osd there is now double the data that needs to be moved, which causes two problems: recovery time / io, and free space. The cluster should have enough free space to reallocate data to. In this setup you cannot do that in case a host dies (see above), but in case an osd dies, ceph would try to replicate the data onto other osds in the same machine. So you have to have enough free space on >>the same host<< in this setup to replicate data to.
ON THE SAME MACHINE ?
is that so?
So then there should, at the BARE MINIMUM, always be more free space
on each machine than the biggest OSD it hosts, right?
This applies to your current config. Since by default ceph will not
replicate the data onto the same host twice (which is sensible!) and you
have only 2 hosts for each of your pools, ceph does not have any other
choice but to replicate the data onto another osd on the same host. If
you had more hosts, the additional load would be divided between
them.
There is a max usage limit for the cluster's healing process (maybe 90%?).
The rule of thumb is that you should have enough space in the cluster
to accommodate the replica placement caused by the loss of a host. So
you should have "size"+1 hosts, and enough free space on the cluster that if you
subtract the size of the biggest host from the cluster's total size,
the resulting usage is still < 90%.
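As a rough worked example with the spinning-pool weights from your osd tree below (a sketch only; the 90% figure is the assumed repair limit, and the real numbers to use come from ceph df / ceph osd df):
:~ ceph df              # overall raw usage
:~ ceph osd df tree     # per-osd / per-host usage and weights
# rule-of-thumb check with your spinning weights (~26 TB per host, ~52 TB total):
#   52 TB total  -  26 TB (biggest host)  =  ~26 TB remaining
#   so the raw space already used should stay below ~0.9 * 26 TB ~= 23 TB
#   for the cluster to still be able to re-create the replicas lost with that host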
In your case, I would recommend:
-Introducing (and activating) a fourth osd host
-setting size=3 min_size=2
that will be difficult, can't I run size=3 min_size=2 with three hosts?
Yes you could, but then in case of a failed host ceph cannot repair
itself (there is no host to put the third copy onto).
Furthermore, the more hosts you have, the less impact it makes if you
lose one of them; see the calculation above.
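If you go that way, the change itself is just two pool settings per pool (<poolname> is a placeholder for each of your pools; expect a lot of data movement once size goes to 3):
:~ ceph osd pool set <poolname> size 3
:~ ceph osd pool set <poolname> min_size 2
# verify:
:~ ceph osd pool get <poolname> size
:~ ceph osd pool get <poolname> min_size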
-After data migration is done, one-by-one separating the raid0 arrays: (remove, split) -> (zap, init, add) separately, in such a manner that hdds and ssds end up evenly distributed across the servers (a rough per-osd sketch follows below)
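Per raid0-backed osd, that could look roughly like this (a sketch only; osd.11 stands in for whichever osd you are splitting, /dev/sdX is a placeholder, and whether you use ceph-disk or ceph-volume depends on your release):
:~ ceph osd out osd.11                    # drain it, then wait for HEALTH_OK
:~ systemctl stop ceph-osd@11
:~ ceph osd crush remove osd.11
:~ ceph auth del osd.11
:~ ceph osd rm 11
# break the raid0 in the controller, then wipe and re-add the now separate discs:
:~ ceph-volume lvm zap /dev/sdX
:~ ceph-volume lvm create --data /dev/sdX     # once per disc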
from what I understand the sizes of OSDs can vary,
and the weight setting in our setup seems plausible to me (it's
directly derived from the size of the osd)
why then are they not filled to the same level, nor even tending towards
being filled the same?
does ceph by itself include other measurements, like latency of the
OSD? that would explain why the raid0 OSDs have so much more data
than the single disks, but I haven't seen anything about that in the
docs (so far?)
The osd weight can be set (by hand) according to disc size, IO
capability, ssd DWPD etc., depending on your needs.
The osd weight determines how many placement groups will be placed on
the osd compared to other osds, thus assigning io and data amount to
the osd.
If some objects in your cluster are too big, if the pg count of a pool is
not a power of 2, or if you do not have enough placement groups in your
cluster (there is a calculator), the data may not be
distributed evenly across pgs, and therefore not distributed evenly
across osds (relative to the weights).
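Things you could look at (again a sketch; <poolname> is a placeholder, osd.9 is just one of your fuller osds and 3.5 an arbitrary example weight):
:~ ceph osd df                              # usage and pg count per osd
:~ ceph osd pool get <poolname> pg_num      # compare with what the online pg calculator suggests
:~ ceph osd crush reweight osd.9 3.5        # nudge an individual osd's crush weight by hand
:~ ceph osd test-reweight-by-utilization    # dry-run of a utilization-based reweight
:~ ceph osd reweight-by-utilization         # actually apply it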
-Always keeping enough free space that the cluster could lose a host and still have space to repair (calculating with the repair max usage % setting).
thanks again!
yup, that was helpful
I hope this helps, and please keep in mind that I'm a noob too :)
Denes.
On 12/04/2017 10:07 AM, tim taler wrote:
Hi
I'm new to ceph but have the honor of looking after a cluster that I didn't set up myself.
Rushing through the ceph docs and having a first glimpse at our cluster, I'm starting to worry about our setup,
so I need some advice and guidance here.
The set up is:
3 machines, each running a ceph-monitor.
all of them are also hosting OSDs
machine A:
2 OSDs, each 3.6 TB - consisting of 1 disk each (spinning disk)
3 OSDs, each 3.6 TB - consisting each of a 2 disk hardware-raid 0 (spinning disk)
3 OSDs, each 1.8 TB - consisting each of a 2 disk hardware-raid 0 (spinning disk)
machine B:
3 OSDs, each 3.6 TB - consisting of 1 disk each (spinning disk)
3 OSDs, each 3.6 TB - consisting each of a 2 disk hardware-raid 0 (spinning disk)
1 OSD, 1.8 TB - consisting of a 2 disk hardware-raid 0 (spinning disk)
3 OSDs, each 0.7 TB - consisting of 1 disk each (SSD)
machine C:
3 OSDs, each 0.7 TB - consisting of 1 disk each (SSD)
the spinning disks and the SSD disks form two separate pools.
Now what worries me is that I read "don't use raid together with ceph",
in combination with our pool size:
:~ ceph osd pool get <poolname> size
size: 2
From what I understand from the ceph docs, the size tells me "how many disks may fail" without losing the data of the whole pool.
Is that right? Or can HALF the OSDs fail (since all objects are duplicated)?
Unfortunately I'm not very good at stochastics, but given a probability of 1% disk failure per year
I'm not feeling very secure with this setup (how do I calculate the probability that two disks fail "at the same time"? - or has anybody a rough number for that?)
although looking at our OSD tree it seems we always try to spread the objects between two peers:
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-19 4.76700 root here_ssd
-15 2.38350 room 2_ssd
-14 2.38350 rack 2_ssd
-4 2.38350 host B_ssd
4 hdd 0.79449 osd.4 up 1.00000 1.00000
5 hdd 0.79449 osd.5 up 1.00000 1.00000
13 hdd 0.79449 osd.13 up 1.00000 1.00000
-18 2.38350 room 1_ssd
-17 2.38350 rack 1_ssd
-5 2.38350 host C_ssd
0 hdd 0.79449 osd.0 up 1.00000 1.00000
1 hdd 0.79449 osd.1 up 1.00000 1.00000
2 hdd 0.79449 osd.2 up 1.00000 1.00000
-1 51.96059 root here_spinning
-12 25.98090 room 2_spinning
-11 25.98090 rack 2_spinning
-2 25.98090 host B_spinning
3 hdd 3.99959 osd.3 up 1.00000 1.00000
8 hdd 3.99429 osd.8 up 1.00000 1.00000
9 hdd 3.99429 osd.9 up 1.00000 1.00000
10 hdd 3.99429 osd.10 up 1.00000 1.00000
11 hdd 1.99919 osd.11 up 1.00000 1.00000
12 hdd 3.99959 osd.12 up 1.00000 1.00000
20 hdd 3.99959 osd.20 up 1.00000 1.00000
-10 25.97969 room 1_spinning
-8 25.97969 rack l1_spinning
-3 25.97969 host A_spinning
6 hdd 3.99959 osd.6 up 1.00000 1.00000
7 hdd 3.99959 osd.7 up 1.00000 1.00000
14 hdd 3.99429 osd.14 up 1.00000 1.00000
15 hdd 3.99429 osd.15 up 1.00000 1.00000
16 hdd 3.99429 osd.16 up 1.00000 1.00000
17 hdd 1.99919 osd.17 up 1.00000 1.00000
18 hdd 1.99919 osd.18 up 1.00000 1.00000
19 hdd 1.99919 osd.19 up 1.00000 1.00000
And the second question:
I tracked the disk usage of our OSDs over the last two weeks and it looks somewhat strange too:
While osd.14 and osd.20 are filled to well below 60%,
osds 9, 16 and 18 are well above 80%.
graphing that shows pretty stable parallel lines, with no hint of convergence
That's true for both the HDD and the SSD pool.
How and why is that, and is it normal and okay, or is there a(nother) glitch in our config?
any hints and comments are welcome
TIA
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com