Re: HELP with some basics please

tim taler <robur314@xxxxxxxxx> · Mon, 4 Dec 2017 18:27:04 +0100

thnx a lot again,
makes sense to me.

We have all journals of the HDD-OSDs on partitions on an extra
SSD-raid1 (each OSD got its own journal partition on that raid1),
but as I understand it they could be moved back to the OSDs, at least for
the duration of the restructuring.

What makes my tummy turn, though, is the thought of ripping out a raid0
pair and plugging it into another machine (it's hwraid, not zfs!)
in the hope of keeping the data on it, even if I can get the same sort
of controller (which might be possible, although the machines are a
couple of years old and
machine C is not the same as A and B).

And I'm still puzzled about what the pool size implies for the
number of OSD failures we can survive.
With size=2 min_size=1 one host could die and (if by chance there is
NO read error on any bit on the surviving host) I could (theoretically)
recover, is that right?
OR is it that if any two disks in the cluster fail at the same time
(or while one is still being rebuilt) all my data would be gone?
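
To put a rough number on the second scenario, here is a minimal
back-of-the-envelope sketch in Python. It assumes independent disk
failures at a constant rate and takes the 1% per year figure from the
original question; the rebuild window (24 hours) and the count of
disks on the other host (7) are made-up illustration values, not
measurements from this cluster. Note that a RAID0 OSD counts as two
disks, since losing either disk loses the OSD.

```python
import math


def p_fail_within(afr: float, hours: float) -> float:
    """Probability that one disk fails within `hours`.

    `afr` is the annual failure rate (e.g. 0.01 for 1%). A constant
    failure rate (exponential lifetime) is assumed, which is a
    simplification: real disks fail more often when old or stressed.
    """
    rate_per_hour = -math.log(1.0 - afr) / (365 * 24)
    return 1.0 - math.exp(-rate_per_hour * hours)


def p_second_failure(afr: float, other_disks: int, rebuild_hours: float) -> float:
    """Probability that at least one of `other_disks` fails while a
    rebuild of `rebuild_hours` is still running (independence assumed)."""
    p_one = p_fail_within(afr, rebuild_hours)
    return 1.0 - (1.0 - p_one) ** other_disks


if __name__ == "__main__":
    # Illustration values only: 1% AFR, 7 disks on the other host,
    # 24 hours until the failed OSD is fully backfilled.
    print(p_second_failure(0.01, 7, 24))
```

The model ignores the extra load that recovery puts on the surviving
disks and any unreadable sectors on them, so the real risk with size=2
is likely higher than this estimate suggests.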

On Mon, Dec 4, 2017 at 4:42 PM, David Turner <drakonstein@xxxxxxxxx> wrote:
> Your current node configuration cannot do size=3 for any pools.  You only
> have 2 hosts with HDDs and 2 hosts with SSDs in each root.  You cannot put 3
> copies of data for an HDD pool on 3 separate nodes when you only have 2
> nodes with HDDs...  In this configuration, size=2 is putting a copy of the
> data on every available node.  That is why you need to have the space
> available on the host with the failed OSD to be able to recover; there is no
> other way for the cluster to keep 2 copies of the data on different nodes.
> The same will be true if you only have 3 available nodes and size=3; any
> failed disk can only backfill onto the same node.
>
> I would start by recommending that you restructure your nodes quite heavily.
> You want as close to the same number of disks in each node as you can get.
> A balanced setup might look like...  This is of course assuming that the
> CPU, RAM, and disk controllers are similar between the 3 nodes.
>
> machine A:
> 2x 3.6TB
> 2x 3.6TB RAID0
> 1x 1.8TB
> 2x .7TB SSD (1 each from machines B & C)
>
> machine B:
> 2x 3.6TB
> 2x 3.6TB RAID0
> 1x 1.8TB
> 2x .7TB SSD
>
> machine C:
> 1x 3.6TB (from machine B)
> 2x 3.6TB RAID0 (1 each from machines A & B)
> 2x 1.8TB  (from machine A)
> 2x .7TB SSD
>
> After all of that is configured and backfilled (a lot of backfilling), the
> next step is to remove the RAID0 OSDs and add them back in as individual
> 1.8TB OSDs.  You can also consider size=3 min_size=2 for some of your pools
> in this configuration.  Both rebuilding the RAID0 OSDs and increasing the
> size of a pool will require that you have enough space in your
> cluster/nodes.  Depending on how you have your journals configured, moving
> the OSDs between hosts is usually fairly trivial (except for the
> backfilling).
>
> Your % used is going to be a problem throughout this.  It is an inherent
> issue: Ceph is not perfect at balancing data, which is a trade-off for data
> integrity in the CRUSH algorithm.  There are ways to change the weights of
> the OSDs to help fix the balance issue, but it is not indicative of a
> problem in your configuration... just something that you need to be aware of
> to be able to prevent it from being a major problem.
>
> There is a lot of material on why size=2 min_size=1 is bad.  Read back
> through the ML archives to find some.  My biggest take-away is... if you
> lose all but 1 copy of your data... do you really want to make changes to
> it?  I've also noticed that the majority of clusters on the ML that have
> irreparably lost data were running with size=2 min_size=1.
>
> On Mon, Dec 4, 2017 at 6:12 AM tim taler <robur314@xxxxxxxxx> wrote:
>>
>> Hi,
>>
>> thnx a lot for the quick response
>> and for laying out some of the issues
>>
>> > I'm also new, but I'll try to help. IMHO most of the pros here would be
>> > quite worried about this cluster if it is production:
>>
>> thought so ;-/
>>
>> > -A prod ceph cluster should not be run with size=2 min_size=1, because:
>> > --In case of a down'ed osd / host the cluster could have problems
>> > determining which data is correct when the osd/host comes back up
>>
>> Uhm, I thought at least THAT wouldn't be the case here since we have
>> three mons??
>> Don't THEY keep track of which osd has the latest data?
>> Isn't the size set on the pool level, not on the cluster level??
>>
>> > --If an osd dies, the others get more io (they have to compensate for the
>> > lost io capacity and do the rebuilding too), which can instantly kill
>> > another close to death disc (not with ceph, but with raid I have been there)
>> > --If an osd dies and ANY other osd serving that pool has a well-placed
>> > inconsistency, like bitrot, you'll lose data
>>
>> good point, with scrubbing the checksums of the objects are checked,
>> right?
>> Can I get a report somewhere of how many errors were found by the last
>> scrub run (like in zpool status)
>> to estimate how well a disk is doing? (Right now the raid controller
>> won't let me read the smart data from the disks.)
>>
>>
>> > -There are not enough hosts in your setup, or rather the discs are not
>> > distributed well:
>> > --If an osd / host dies, the cluster tries to repair itself and relocate
>> > the data onto another host. In your config there is no other host to
>> > reallocate data to if ANY of the hosts fail (I guess that hdds and ssds are
>> > separated)
>> Yupp, HDD and SSD form separate pools.
>> Good point, not in my list of arguments yet
>>
>> > -The disks should not be placed in raid arrays if it can be avoided,
>> > especially raid0:
>> > --You multiply the possibility of an un-recoverable disc error (and
>> > since the data is striped, the other disk's data is unrecoverable too)
>> > --When an osd dies, the cluster should relocate the data onto another
>> > osd. When this happens now there is double the data that need to be moved,
>> > this causes 2 problems: Recovery time / io, and free space. The cluster
>> > should have enough free space to reallocate data to, in this setup you
>> > cannot do that in case of a host dies (see above), but in case an osd dies,
>> > ceph would try to replicate the data onto other osds in the machine. So you
>> > have to have enough free space on >>the same host<< in this setup to
>> > replicate data to.
>>
>> ON THE SAME MACHINE ?
>> is that so?
>> So then there should at the BARE MINIMUM always be more free space
>> on each machine than the biggest OSD it hosts, right?
>>
>> > In your case, I would recommend:
>> > -Introducing (and activating) a fourth osd host
>> > -setting size=3 min_size=2
>>
>> that will be difficult, can't I run size=3 min_size=2 with three hosts?
>>
>> > -After data migration is done, one-by-one separating the raid0 arrays:
>> > (remove, split) -> (zap, init, add) separately, in such a manner that hdds
>> > and ssds are evenly distributed across the servers
>>
>> from what I understand the sizes of OSDs can vary
>> and the weight setting in our setup seems plausible to me (it's
>> directly derived from the size of the osd)
>> why then are they not filled to the same level, nor even tending towards
>> being filled the same?
>> does ceph by itself include other measurements like latency of the
>> OSD? that would explain why the raid0 OSDs have so much more data
>> than the single disks, but I haven't seen anything about that in the
>> docs (so far?)
>>
>> > -Always keeping that much free space, so the cluster could lose a host
>> > and still has space to repair (calculating with the repair max usage %
>> > setting).
>>
>> thnx again!
>> yupp that was helpful
>>
>> > I hope this helps, and please keep in mind that I'm a noob too :)
>> >
>> > Denes.
>> >
>> >
>> > On 12/04/2017 10:07 AM, tim taler wrote:
>> >
>> > Hi
>> > I'm new to ceph but have the honor of looking after a cluster that I haven't
>> > set up myself.
>> > Rushing to the ceph docs and having a first glimpse at our cluster I
>> > started worrying about our setup,
>> > so I need some advice and guidance here.
>> >
>> > The set up is:
>> > 3 machines, each running a ceph-monitor.
>> > all of them are also hosting OSDs
>> >
>> > machine A:
>> > 2 OSDs, each 3.6 TB - consisting of 1 disk each (spinning disk)
>> > 3 OSDs, each 3.6 TB - consisting each of a 2 disk hardware-raid 0
>> > (spinning disk)
>> > 3 OSDs, each 1.8 TB - consisting each of a 2 disk hardware-raid 0
>> > (spinning disk)
>> >
>> > machine B:
>> > 3 OSDs, each 3.6 TB - consisting of 1 disk each (spinning disk)
>> > 3 OSDs, each 3.6 TB - consisting each of a 2 disk hardware-raid 0
>> > (spinning disk)
>> > 1 OSD, 1.8 TB - consisting of a 2 disk hardware-raid 0
>> > (spinning disk)
>> >
>> > 3 OSDs, each 0.7 TB - consisting of 1 disk each (SSD)
>> >
>> > machine C:
>> > 3 OSDs, each 0.7 TB - consisting of 1 disk each (SSD)
>> >
>> > the spinning disks and the SSD disks are forming two separate pools.
>> >
>> > Now what I'm worrying about is that I read "don't use raid together with
>> > ceph"
>> > in combination with our poolsize
>> > :~ ceph osd pool get <poolname> size
>> > size: 2
>> >
>> > From what I understand from the ceph docs the size tells me "how many
>> > disks may fail" without losing the data of the whole pool.
>> > Is that right? or can HALF the OSDs fail (since all objects are
>> > duplicated)?
>> >
>> > Unfortunately I'm not very good at stochastics, but given a probability of
>> > 1% disk failure per year
>> > I'm not feeling very secure with this set up (How do I calculate the
>> > probability that two disks fail "at the same time"? - or has anybody a rough
>> > number for that?)
>> > although looking at our OSD tree it seems we try to spread the objects
>> > always between two peers:
>> >
>> > ID  CLASS WEIGHT   TYPE NAME                      STATUS REWEIGHT PRI-AFF
>> > -19        4.76700 root here_ssd
>> > -15        2.38350     room 2_ssd
>> > -14        2.38350         rack 2_ssd
>> >  -4        2.38350             host B_ssd
>> >   4   hdd  0.79449                 osd.4              up  1.00000 1.00000
>> >   5   hdd  0.79449                 osd.5              up  1.00000 1.00000
>> >  13   hdd  0.79449                 osd.13             up  1.00000 1.00000
>> > -18        2.38350     room 1_ssd
>> > -17        2.38350         rack 1_ssd
>> >  -5        2.38350             host C_ssd
>> >   0   hdd  0.79449                 osd.0              up  1.00000 1.00000
>> >   1   hdd  0.79449                 osd.1              up  1.00000 1.00000
>> >   2   hdd  0.79449                 osd.2              up  1.00000 1.00000
>> >  -1       51.96059 root here_spinning
>> > -12       25.98090     room 2_spinning
>> > -11       25.98090         rack 2_spinning
>> >  -2       25.98090             host B_spinning
>> >   3   hdd  3.99959                 osd.3              up  1.00000 1.00000
>> >   8   hdd  3.99429                 osd.8              up  1.00000 1.00000
>> >   9   hdd  3.99429                 osd.9              up  1.00000 1.00000
>> >  10   hdd  3.99429                 osd.10             up  1.00000 1.00000
>> >  11   hdd  1.99919                 osd.11             up  1.00000 1.00000
>> >  12   hdd  3.99959                 osd.12             up  1.00000 1.00000
>> >  20   hdd  3.99959                 osd.20             up  1.00000 1.00000
>> > -10       25.97969     room 1_spinning
>> >  -8       25.97969         rack l1_spinning
>> >  -3       25.97969             host A_spinning
>> >   6   hdd  3.99959                 osd.6              up  1.00000 1.00000
>> >   7   hdd  3.99959                 osd.7              up  1.00000 1.00000
>> >  14   hdd  3.99429                 osd.14             up  1.00000 1.00000
>> >  15   hdd  3.99429                 osd.15             up  1.00000 1.00000
>> >  16   hdd  3.99429                 osd.16             up  1.00000 1.00000
>> >  17   hdd  1.99919                 osd.17             up  1.00000 1.00000
>> >  18   hdd  1.99919                 osd.18             up  1.00000 1.00000
>> >  19   hdd  1.99919                 osd.19             up  1.00000 1.00000
>> >
>> >
>> >
>> > And the second question
>> > I tracked the disk usage of our OSDs over the last two weeks and it
>> > looks somehow strange too:
>> > While osd.14 and osd.20 are filled only well below 60%,
>> > osd.9, osd.16 and osd.18 are well above 80%.
>> > Graphing that shows pretty stable parallel lines, with no hint of
>> > convergence.
>> > That's true for both the HDD and the SSD pool.
>> > How is that and why, and is that normal and okay, or is there a(nother)
>> > glitch in our config?
>> >
>> > any hints and comments are welcome
>> >
>> > TIA
>> >
>> >
>> >
>> >
>> >
>> > _______________________________________________
>> > ceph-users mailing list
>> > ceph-users@xxxxxxxxxxxxxx
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com