Hello!
I can only answer some of your questions:
- The backfill process obeys a "nearfull_ratio" limit (I think it defaults
to 85%); above that the system stops repairing itself, so it won't fill up
to 100%.
- Normal write ops obey a "full_ratio" too, I think 95% by default; above
that no write IO will be accepted to the pool (see below for how to check
the current ratios on your cluster).
- You have min_size=1 (as far as I recall), so if you lose a disc the
other OSDs on the same host would fill up to 85%, then the cluster would
stop repairing and remain in a degraded state (some PGs undersized) until
you solve the problem, or reach 95%, at which point the cluster would
stop accepting write IO.
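On luminous you can check (and, if needed, adjust) these ratios on the
live cluster; a rough sketch of the commands I mean (on pre-luminous
releases these were mon config options instead, so the exact commands may
differ):

  ceph osd dump | grep ratio
  # should print something like:
  #   full_ratio 0.95
  #   backfillfull_ratio 0.9
  #   nearfull_ratio 0.85
  ceph osd set-nearfull-ratio 0.85    # example value only, change with care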
Calculations:
Sum of pool USED: 995 + 14986 + 1318 = 17299 ... 17299 * 2 (size) = 34598
(+ journals?) ~ 35349 (global raw used)
Size: 52806G = 35349 (raw used) + 17457 (raw avail) => 66.94%, OK.
The documentation says that the per-pool MAX AVAIL is an estimate and is
calculated against the OSD which will run out of space first, so in your
case this is the relevant info.
I think you can access the per-OSD statistics with the ceph pg dump command.
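(If I remember correctly, "ceph osd df" also gives a more readable
per-OSD summary; roughly:

  ceph osd df         # one line per OSD: SIZE, USE, AVAIL, %USE, PGS
  ceph osd df tree    # the same, grouped by the CRUSH hierarchy

so you can quickly spot the OSD that is closest to full.)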
However, I think you are quite correct:
Spinning usage: 14986+995 = 15981G
Sum spinning capacity: 15981+3232 = 19213G -> 83% full
(I used the values calculated by your ceph df; since it uses the most
full OSD, it is a good estimate for the worst case.)
Since the cluster stops self-healing at 85% full, you cannot lose any
spinning disc in a way that lets the cluster auto-recover to a healthy
state (no undersized PGs). I would consider adding at least 2 new discs
to the host which only has SSDs in your setup, of course considering
slots, memory, etc. This would also give you some breathing space to
restructure your cluster.
Denes.
On 12/05/2017 03:07 PM, tim taler wrote:
Okay, another day, another nightmare ;-)
So far we discussed pools as bundles of:
- pool 1) 15 HDD-OSDs (made up of 25 actual HDDs: 5 single HDDs and ten
raid0 pairs, as mentioned before)
- pool 2) 6 SSD-OSDs
Unfortunately, on the "physical" pool 1 there are two "logical" pools
(my wording here is maybe not cephish?).
Now I wonder about the real free space on "the pool"...
ceph df tells me:
GLOBAL:
    SIZE       AVAIL      RAW USED     %RAW USED
    52806G     17457G     35349G           66.94
POOLS:
    NAME           ID     USED       %USED     MAX AVAIL     OBJECTS
    pool-1-HDD     9        995G     13.34         3232G      262134
    pool-2-HDD     10     14986G     69.86         3232G     3892481
    pool-3-SDD     12      1318G     55.94          519G      372618
Now how do I read this?
The sum of "MAX AVAIL" in the "POOLS" section is 7387.
Okay, 7387 * 2 (since all three pools have a size of 2) is 14774.
The GLOBAL section on the other hand tells me I still have 17457G available.
17457 - 14774 = 2683
Where are the missing 2683 GB?
Or am I missing something (other than space and a sane setup, I mean :-)
AND (!)
If the reported two times 3232G of available space in the "physical" HDD
pool is true, then in this setup (two hosts) there would be only 3232G
free on each host.
Given that the HDD-OSDs are 4TB in size, if one dies and the host tries
to restore the data (as I learned yesterday, the data in this setup will
ONLY be restored on the host on which the OSD died), then ...
it doesn't work, right?
Except I could hope that, due to too few placement groups and the
resulting imbalance of space usage on the OSDs, the dead OSD was only
filled to 60% and not 85%, and only the real data will be rewritten
(restored).
But even that seems not possible: given the imbalanced OSDs, the fuller
ones will hit total saturation, and, at least as I understand it now,
after that (again, once the first OSD is 100% full) I can't use the
remaining space on the other OSDs.
Right?
If all that is true (and PLEASE point out any mistake in my thinking),
then what I have here at the moment is 25 hard disks of which NONE must
fail, or the pool will at least stop accepting writes.
Am I right? (Feels like reciprocal Russian roulette ... ONE chamber
WITHOUT a bullet ;-)
Now - sorry, we are not finished yet (and yes, this is true, I'm not
trying to make fun of you).
On top of all this I see a rapid decrease in the available space which
is consistent neither with growing data inside the rbds living in this
cluster nor with growing numbers of rbds (we ONLY use rbds).
BUT someone is running snapshots.
How do I sum up the amount of space each snapshot is using?
Is it the sum of the USED column in the output of "rbd du --snap"?
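(What I mean is something like the following, where the image name is
just a placeholder and I may be misreading what the command reports:

  rbd du pool-2-HDD/<image-name>
  # should list the image head plus each of its snapshots,
  # with PROVISIONED and USED columns

and then adding up the USED lines of the snapshots per image?)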
And what is the philosophy of snapshots in ceph?
An object is 4MB in size; if a bit in that object changes, is the whole
object replicated?
(The cluster is luminous upgraded from jewel, so we use filestore on
xfs, not bluestore.)
TIA
On Tue, Dec 5, 2017 at 11:10 AM, Stefan Kooman <stefan@xxxxxx> wrote:
Quoting tim taler (robur314@xxxxxxxxx):
And I'm still puzzled about the implication of the cluster size on the
number of OSD failures the cluster can survive.
With size=2 min_size=1 one host could die and (if by chance there is
NO read error on any bit on the living host) I could (theoretically)
recover, is that right?
True.
OR is it that if any two disks in the cluster fail at the same time
(or while one is still being rebuilt) all my data would be gone?
Only the objects that are located on those disks. So if, for example,
obj1 is on disk1,host1 and obj1 on disk2,host2 ... you will lose data, yes.
Gr. Stefan
--
| BIT BV http://www.bit.nl/ Kamer van Koophandel 09090351
| GPG: 0xD14839C6 +31 318 648 688 / info@xxxxxx