Hello!
I can only answer some of your questions:
- The backfill process obeys a "nearfull_ratio" limit (I think it defaults
to 85%); above that the system stops repairing itself, so it won't fill up
to 100%.
- Normal write ops obey a "full_ratio" too, I think 95% by default; above
that no write IO will be accepted to the pool (see below for how to check
the current ratios on your cluster).
- You have min_size=1 (as far as I recall), so if you lose a disc the
other OSDs on the same host would fill up to 85%, then the cluster would
stop repairing and remain in a degraded state (some PGs undersized) until
you solve the problem, or reach 95%, at which point the cluster would
stop accepting write IO.
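On luminous you can check (and, if needed, adjust) these ratios on the
live cluster; a rough sketch of the commands I mean (on pre-luminous
releases these were mon config options instead, so the exact commands may
differ):

  ceph osd dump | grep ratio
  # should print something like:
  #   full_ratio 0.95
  #   backfillfull_ratio 0.9
  #   nearfull_ratio 0.85
  ceph osd set-nearfull-ratio 0.85    # example value only, change with care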
Calculations:
Sum of pool USED: 995 + 14986 + 1318 = 17299 ... 17299 * 2 (size) = 34598
(+ journals?) ~ 35349 (global raw used)
Size: 52806G = 35349 (raw used) + 17457 (raw avail) => 66.94%, OK.
The documentation says that the per-pool MAX AVAIL is an estimate and is
calculated against the OSD which will run out of space first, so in your
case this is the relevant info.
I think you can access the per-OSD statistics with the ceph pg dump command.
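(If I remember correctly, "ceph osd df" also gives a more readable
per-OSD summary; roughly:

  ceph osd df         # one line per OSD: SIZE, USE, AVAIL, %USE, PGS
  ceph osd df tree    # the same, grouped by the CRUSH hierarchy

so you can quickly spot the OSD that is closest to full.)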
However, I think you are quite correct:
Spinning usage: 14986+995 = 15981G
Sum spinning capacity: 15981+3232 = 19213G -> 83% full
(I used the values calculated by your ceph df; since it uses the most
full OSD, it is a good estimate for the worst case.)
Since the cluster stops self-healing at 85% full, you cannot lose any
spinning disc in a way that lets the cluster auto-recover to a healthy
state (no undersized PGs). I would consider adding at least 2 new discs
to the host which only has SSDs in your setup, of course considering
slots, memory, etc. This would also give you some breathing space to
restructure your cluster.
Denes.
On 12/05/2017 03:07 PM, tim taler wrote:
Okay, another day, another nightmare ;-)
So far we discussed pools as bundles of:
- pool 1) 15 HDD-OSDs (made up of 25 actual HDDs: 5 single HDDs and ten
raid0 pairs, as mentioned before)
- pool 2) 6 SSD-OSDs
Unfortunately, on the "physical" pool 1 there are two "logical" pools
(my wording here is maybe not cephish?).
Now I wonder about the real free space on "the pool"...
ceph df tells me:
GLOBAL:
    SIZE       AVAIL      RAW USED     %RAW USED
    52806G     17457G     35349G           66.94
POOLS:
    NAME           ID     USED       %USED     MAX AVAIL     OBJECTS
    pool-1-HDD     9        995G     13.34         3232G      262134
    pool-2-HDD     10     14986G     69.86         3232G     3892481
    pool-3-SDD     12      1318G     55.94          519G      372618
Now how do I read this?
The sum of "MAX AVAIL" in the "POOLS" section is 7387.
Okay, 7387 * 2 (since all three pools have a size of 2) is 14774.
The GLOBAL section on the other hand tells me I still have 17457G available.
17457 - 14774 = 2683
Where are the missing 2683 GB?
Or am I missing something (other than space and a sane setup, I mean :-)
AND (!)
If the reported two times 3232G of available space in the "physical" HDD
pool is true, then in this setup (two hosts) there would be only 3232G
free on each host.
Given that the HDD-OSDs are 4TB in size, if one dies and the host tries
to restore the data (as I learned yesterday, the data in this setup will
ONLY be restored on the host on which the OSD died), then ...
it doesn't work, right?
Except I could hope that, due to too few placement groups and the
resulting imbalance of space usage on the OSDs, the dead OSD was only
filled to 60% and not 85%, and only the real data will be rewritten
(restored).
But even that seems not possible: given the imbalanced OSDs, the fuller
ones will hit total saturation, and, at least as I understand it now,
after that (again, once the first OSD is 100% full) I can't use the
remaining space on the other OSDs.
Right?
If all that is true (and PLEASE point out any mistake in my thinking),
then what I have here at the moment is 25 hard disks of which NONE must
fail, or the pool will at least stop accepting writes.
Am I right? (Feels like reciprocal Russian roulette ... ONE chamber
WITHOUT a bullet ;-)
Now - sorry, we are not finished yet (and yes, this is true, I'm not
trying to make fun of you).
On top of all this I see a rapid decrease in the available space which
is consistent neither with growing data inside the rbds living in this
cluster nor with growing numbers of rbds (we ONLY use rbds).
BUT someone is running snapshots.
How do I sum up the amount of space each snapshot is using?
Is it the sum of the USED column in the output of "rbd du --snap"?
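(What I mean is something like the following, where the image name is
just a placeholder and I may be misreading what the command reports:

  rbd du pool-2-HDD/<image-name>
  # should list the image head plus each of its snapshots,
  # with PROVISIONED and USED columns

and then adding up the USED lines of the snapshots per image?)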
And what is the philosophy of snapshots in ceph?
An object is 4MB in size; if a bit in that object changes, is the whole
object replicated?
(The cluster is luminous upgraded from jewel, so we use filestore on
xfs, not bluestore.)
TIA
On Tue, Dec 5, 2017 at 11:10 AM, Stefan Kooman <stefan@xxxxxx> wrote:
Quoting tim taler (robur314@xxxxxxxxx):
And I'm still puzzled about the implication of the cluster size on the
number of OSD failures the cluster can survive.
With size=2 min_size=1 one host could die and (if by chance there is
NO read error on any bit on the living host) I could (theoretically)
recover, is that right?
True.
OR is it that if any two disks in the cluster fail at the same time
(or while one is still being rebuilt) all my data would be gone?
Only the objects that are located on those disks. So if, for example,
obj1 is on disk1,host1 and obj1 on disk2,host2 ... you will lose data, yes.
Gr. Stefan
--
| BIT BV http://www.bit.nl/ Kamer van Koophandel 09090351
| GPG: 0xD14839C6 +31 318 648 688 / info@xxxxxx