Zdenek Kabelac wrote on 14-09-2017 21:05:
Basically the user-land tool takes a runtime snapshot of the kernel metadata
(so it gives you information from some frozen point in time), then it
processes the input data (up to 16GiB!) and outputs some number - such as
the number of unique blocks really allocated in a thinLV.
That is immensely expensive indeed.
Typically a snapshot may share some blocks - or it could already have
provisioned all of its blocks, in case the shared blocks were already
modified.
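(As an aside: I gather this is roughly the kind of scan one can run by hand
with thin_ls from thin-provisioning-tools. A hedged sketch follows; the
device names are examples for a pool "vg/pool", and the exact options and
field names may differ between versions.)

# Take a metadata snapshot so the live metadata can be read consistently.
dmsetup message vg-pool-tpool 0 reserve_metadata_snap

# Let thin_ls walk the (potentially multi-GiB) metadata and report
# per-thin mapped/exclusive/shared block counts.
thin_ls --metadata-snap \
        --format "DEV,MAPPED_BLOCKS,EXCLUSIVE_BLOCKS,SHARED_BLOCKS" \
        /dev/mapper/vg-pool_tmeta

# Release the metadata snapshot again.
dmsetup message vg-pool-tpool 0 release_metadata_snap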
I understand, and it's good technology.
Yes, on my own 'system' I of course generally know how much data is on it,
and there is no automatic data generation.
However, lvm2 is not a 'Xen oriented' tool only.
We need to provide a universal tool - everyone can adapt it to their needs.
I said that to indicate that prediction problems are currently not as
important for me, but they definitely would be important in other
scenarios or for other people.
You twist my words around to imply that I am trying to make myself
special, while I was making myself unspecial: I was just being modest
there.
Since your needs are different from others' needs.
Yes and we were talking about the problems of prediction, thank you.
But when I do create snapshots (which I do every day), the root and boot
snapshots get dropped when they fill up (they are on regular LVM), which
is nice.
Old snapshots are a different technology for a different purpose.
Again, what I was saying was to support the notion that having snapshots
that may grow a lot can be a problem.
I am not sure the purpose of non-thin vs. thin snapshots is all that
different though.
They are both copy-on-write in a certain sense.
I think it is the same tool with different characteristics.
With 'plain' lvs the output is just an orientational number.
Basically the highest referenced chunk for a given thin volume.
This is a good approximation of the size of a single thinLV,
but somewhat 'misleading' for thin devices created as snapshots...
(having shared blocks)
I understand. The above numbers for "snapshots" were just the extents left
unaccounted for after summing up the volumes.
So I had no way to know snapshot usage.
I just calculated all used extents per volume; the missing extents I
attributed to snapshots.
So I think it is a very good approximation.
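For what it's worth, a hedged sketch of that kind of estimate from plain lvs
output (the VG name is an example; for snapshots this counts shared blocks
more than once, as discussed):

# Rough per-volume usage from plain lvs output.
lvs --noheadings --nosuffix --units m --separator : \
    -o lv_name,lv_size,data_percent vg |
while IFS=: read -r name size pct; do
    name=$(echo $name)            # trim the padding lvs adds
    [ -n "$pct" ] || continue     # skip non-thin volumes
    used=$(echo "$size $pct" | awk '{printf "%.0f", $1 * $2 / 100}')
    echo "$name: ~${used} MiB mapped of ${size} MiB"
done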
So you have no precise idea how many blocks are shared or uniquely
owned by a device.
Okay. But all the numbers were probably attributed to the correct volume.
I did not count the usage of the snapshot volumes.
Whether they are shared or unique is irrelevant from the point of view
of wanting to know the total consumption of the "base" volume.
In the above, 6 extents were not accounted for (24 MB), so I just assumed
those would be sitting in snapshots ;-).
Removal of a snapshot might mean you release NOTHING from your
thin-pool, if all the snapshot's blocks were shared with some other thin
volumes....
Yes, but that was not indicated in the above figure either. It was just 24
MB that would be freed ;-).
Snapshots can only become a culprit if you start overwriting a lot of
data, I guess.
If you say that any additional allocation checks would be infeasible
because they would take too much time per request (which still seems odd,
because the checks wouldn't be that computationally intensive, and even for
100 gigabytes you'd only have about 25,000 checks at the default extent
size) -- of course you would collect the data asynchronously.
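(For concreteness: at the default 4 MiB extent size that is 100 GiB / 4 MiB
= 25,600 extents, hence the roughly 25,000 checks; as comes up further down,
the thin-pool itself actually allocates in much smaller chunks, 64 kB by
default.)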
Processing the mapping of up to 16GiB of metadata will not happen in
milliseconds.... and it consumes memory and CPU...
I get that. If that is the case.
That's just the sort of thing that in the past I have been keeping track
of continuously (in unrelated stuff) such that every mutation also
updated the metadata without having to recalculate it...
I am meaning to say that if indeed this is the case and indeed it is
this expensive, then clearly what I want is not possible with that
scheme.
I mean to say that I cannot argue about this design. You are the
experts.
I would have to go in learning first to be able to say anything about it
;-).
So I can only defer to your expertise. Of course.
But the upshot of what you're saying is that the number of blocks uniquely
owned by any snapshot is not known at any one point in time,
and needs to be derived from the entire map. Okay.
Thus reducing allocation would hardly be possible, you say,
because the information is not known anyway.
Well pardon me for digging this deeply. It just seemed so alien that
this thing wouldn't be possible.
I mean it seems so alien that you cannot keep track of those numbers at
runtime without having to calculate them using aggregate measures.
It seems information you want the system to have at all times.
I am just still incredulous that this isn't being done...
But I am not well versed in kernel concurrency measures so I am hardly
qualified to comment on any of that.
In any case, thank you for your time in explaining. Of course this is
what you said in the beginning as well, I am just still flabbergasted
that there is no accounting being done...
Regards.
I think they are some of the most pleasant command line tools
anyway...
We try really hard....
You're welcome.
On the other hand, if all you can do is intervene in userland, then all the
LVM team can do is provide a basic skeleton for the execution of some
standard scripts.
Yes - we give the user all the power to adapt thin-p to individual needs.
Which is of course pleasant.
So all you need to do is to use the tool in user-space for this task.
So maybe we can have an assortment of some 5 interventionist
policies, like:
a) Govern a max snapshot size and drop snapshots when they exceed it
b) Freeze non-critical volumes when free thin space drops below the
aggregate value appropriate for the critical volumes
c) Drop snapshots when free thin space is <5%, starting with the biggest one
d) Also freeze the relevant snapshots in case (b)
e) Drop snapshots exceeding their max configured size once a threshold
is reached.
But you are aware you can run such a task even with a cronjob.
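Indeed. As a hedged sketch, policy (a) above run from cron might look
something like this; the VG name, the size limit and the 'autosnap' tag
convention are all invented for the example:

#!/bin/sh
# Hedged sketch of policy (a): drop thin snapshots whose allocated data
# exceeds a configured limit.
VG=vg
LIMIT_MIB=500

lvs --noheadings --nosuffix --units m --separator : \
    -o lv_name,lv_size,data_percent,lv_tags "$VG" |
while IFS=: read -r name size pct tags; do
    name=$(echo $name)                                  # trim padding
    case "$tags" in *autosnap*) ;; *) continue ;; esac  # only our snapshots
    [ -n "$pct" ] || continue                           # skip non-thin LVs
    used=$(echo "$size $pct" | awk '{printf "%d", $1 * $2 / 100}')
    if [ "$used" -gt "$LIMIT_MIB" ]; then
        logger "snapshot $VG/$name uses ${used}MiB > ${LIMIT_MIB}MiB, removing"
        lvremove -f "$VG/$name"
    fi
done

Dropped into /etc/cron.d with a line like
"*/5 * * * * root /usr/local/sbin/snap-limit.sh" (path invented), that is
the whole mechanism.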
Sure, the point is not that it can't be done, but that it seems an unfair
burden on the system maintainer to do this in isolation from all the other
system maintainers who might be doing the exact same thing.
There is some power in numbers, and it is just rather helpful if a
common scenario is catered for by a central party.
I understand that every professional outlet dealing in terabytes upon
terabytes of data will have the manpower to do all of this and do it
well.
But for everyone else, it is a landscape you cannot navigate because you
first have to deploy that manpower before you can start using the
system!!!
It becomes a rather big enterprise for anyone to install thinp!!!
Because getting it running takes no time at all!!! But getting it running
well implies a huge investment.
I just wouldn't mind if this gap was smaller.
Many of the things you'd need to do are pretty standard. Running more
and more cronjobs... well I am already doing that. But it is not just
the maintenance of the cron job (installation etc.) but also the script
itself that you have to first write.
That means for me, and for others who may not be doing this professionally
or in a larger organisation, that the benefit of spending all that time may
not outweigh its cost, and the result is that you stay stuck with a deeply
suboptimal situation in which there is little or no reporting or fixing,
all because the initial investment is too high.
Commonly provided scripts just hugely reduce that initial investment.
For example the bigger vs. smaller system I imagined. Yes I am eager to
make it. But I got other stuff to do as well :p.
And then, when I've made it, chances are high no one will ever use it
for years to come.
No one else I mean.
So for example you configure a max size for a snapshot. When the snapshot
exceeds that size it gets flagged for removal. But removal only happens when
another condition is met (a threshold is reached).
We are blamed already for having way too many configurable knobs....
Yes but I think it is better to script these things anyway.
Any official mechanism is only going to be inflexible when it goes that
far.
Like I personally don't like systemd services compared to cronjobs.
systemd services take longer to set up, you have to conform to a descriptive
language, and so on.
Then you need to find out exactly what the possibilities of that descriptive
language are; maybe there is a feature you do not know about yet, but you
can probably also code it using knowledge you already have, for which you do
not need to read any man pages.
So I do create those services.... for the boot sequence... but anything
I want to run regularly I still do with a cron job...
It's a bit archaic to install but... it's simple, clean, and you have
everything in one screen.
So you would have 5 different interventions you could use that could
be considered somewhat standard, and the admin can just pick and choose
or customize.
And we have a way longer list of actions we want to do ;) We have not
yet come to any single conclusion on how to make such a thing manageable
for a user...
Hmm.. Well I cannot... claim to have the superior idea here.
But I don't know... I think you can focus on the model, right.
Maintaining max snapshot consumption is one model.
Freezing bigger volumes to protect space for smaller volumes is another
model.
Doing so based on a "critical" flag is another model... (I'm not such a fan
of that myself)... (more to configure).
Reserving max, set or configured space for a specific volume is another
model.
(That would actually be equivalent to a 'critical' flag, since only those
volumes that have reserved space would become 'critical', and their space
reservation is going to be the threshold that decides when to deny other
volumes more space).
So you can simply call the 'critical flag' idea the same as the 'space
reservation' idea.
The basic idea is that all space reservations get added together and
become a threshold.
So that's just one model and I think it is the most important one.
"Reserve space for certain volumes" (but not all of them or it won't
work). ;-).
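A hedged sketch of how that reservation model could be checked from user
space; the 'reserve=<MiB>' tag convention and the VG/pool names are invented
for illustration:

#!/bin/sh
# Sum up per-volume reservations (encoded here as an invented lv_tag such
# as "reserve=10240", in MiB) and compare them with the free space left
# in the thin-pool.
VG=vg
POOL=pool

# Free space in the pool, in MiB: pool size minus mapped data.
read -r size pct <<EOF
$(lvs --noheadings --nosuffix --units m -o lv_size,data_percent "$VG/$POOL")
EOF
free_mib=$(echo "$size $pct" | awk '{printf "%d", $1 * (100 - $2) / 100}')

# Sum all "reserve=<MiB>" tags in the VG.
reserved_mib=$(lvs --noheadings -o lv_tags "$VG" |
               tr ',' '\n' | sed -n 's/.*reserve=\([0-9]*\).*/\1/p' |
               awk '{ s += $1 } END { print s + 0 }')

if [ "$free_mib" -lt "$reserved_mib" ]; then
    echo "pool $VG/$POOL: ${free_mib}MiB free < ${reserved_mib}MiB reserved" >&2
    # here one would freeze or drop the non-reserved volumes (see further down)
fi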
This is what Gionatan referred to with the ZFS ehm... shit :p.
And it is the topic of this email thread.
So you might as well focus on that one alone, as per Mr. Jonathan's
reply.
(Pardon for my language there).
While personally I also like the bigger versus smaller idea because you
don't have to configure it.
The only configuration you need to do is to ensure that the more
important volumes are a bit smaller.
Which I like.
Because the free space required for the bigger volumes is always going to be
bigger than that of the smaller volumes, you then get automatic space
reservation, using fsfreezing.
But how expensive is it to do it say every 5 seconds?
If you have big metadata - you would keep your Intel Core busy all the
time ;)
That's why we have those thresholds.
The script is called at 50% fullness, then when it crosses 55%, 60%, ...
95%, 100%. When it drops below a threshold, you are called again once
the boundary is crossed...
How do you know when it is at 50% fullness?
If you are a proud sponsor of your electricity provider and you like the
extra heating in your house - you can run this in a loop, of course...
Thresholds are based on the mapped size of the whole thin-pool.
The thin-pool surely knows at all times how many blocks are allocated and
free for its data and metadata devices.
But didn't you just say you needed to process up to 16GiB to know this
information?
I am confused?
This means the in-kernel policy can easily be implemented.
You may not know the size and attribution of each device but you do know
the overall size and availability?
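(Indeed, the pool-wide numbers are cheap to get at from user space; a hedged
sketch, with example names, and the device-mapper -tpool suffix may differ
per setup:)

# Pool-wide fullness, as lvs reports it for the thin-pool LV:
lvs -o lv_name,data_percent,metadata_percent vg/pool

# The same counters appear in the thin-pool target status line
# (used/total metadata blocks and used/total data blocks):
dmsetup status vg-pool-tpool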
In any case, the only policy you could have in-kernel would be either
what Gionatan proposed (fixed reserved space for certain volumes)
(an easy calculation, right), or potentially an allocation freeze at a
threshold for non-critical volumes.
In the single thin-pool all thins ARE equal.
But you could make them unequal ;-).
A low number of 'data' blocks may cause a tremendous amount of provisioning.
With a specifically crafted data pattern you can (in 1 second!) cause
provisioning of a large portion of your thin-pool (if not the whole one,
in case you have a small one in the range of gigabytes....)
Because you only have to write a byte to every extent, yes.
And that's the main issue - and what we solve in lvm2/dm: we want to be
sure that when the thin-pool is FULL, written & committed data are
secure and safe.
A reboot is mostly unavoidable if you RUN from a device which is
out-of-space -
we cannot continue to use such a device - unless you add MORE space to
it within a 60-second window.
That last part is utterly acceptable.
All other proposals only solve very localized problems, which are
different for every user.
E.g. you could have a misbehaving daemon filling your system device
very fast with logs...
In practice you would need some system analysis to detect which
application causes the highest pressure on provisioning - but that's well
beyond the range of what the lvm2 team can provide ATM with the number of
developers it has....
And any space reservation would probably not do much there; if it is not
filled 100% now, it will be in a few seconds, in that sense.
The goal was more to protect the other volumes: supposing the log
writing happened on a separate volume, the point is for that log volume
not to impact the main volumes.
So you have a global thin reservation of, say, 10GB.
Your log volume is overprovisioned and starts eating up the 20GB you
have available and then runs into the condition that only 10GB remains.
The 10GB is a reservation maybe for your root volume. The system
(scripts) (or whatever) recognises that less than 10GB remains, that you
have claimed it for the root volume, and that the log volume is
intruding upon that.
It then decides to freeze the log volume.
But it is hard to decide what volume to freeze because it would need
that run-time analysis of what's going on. So instead you just freeze
all non-reserved volumes.
So all non-critical volumes in Gionatan and Brassow's parlance.
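A hedged sketch of that freezing step (the mount-point lookup uses findmnt,
fsfreeze is from util-linux, and the 'reserve=' tag convention is again
invented):

#!/bin/sh
# Freeze the filesystems on non-reserved ("non-critical") thin volumes.
# Unfreezing (fsfreeze -u) is deliberately left to the admin.
VG=vg

lvs --noheadings -o lv_name,lv_tags "$VG" |
while read -r name tags; do
    case "$tags" in *reserve=*) continue ;; esac              # skip reserved
    mnt=$(findmnt -n -o TARGET "/dev/$VG/$name") || continue  # not mounted
    logger "freezing $mnt ($VG/$name) to protect reserved volumes"
    fsfreeze -f "$mnt"
done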
I just still don't see how one check per 4MB would be that expensive,
provided you do the data collection in the background.
You say size can be as low as 64kB... well.... in that case...
The default chunk size is 64k, for the best 'snapshot' sharing - the bigger
the pool chunk is, the less likely you are to be able to 'share' it between
snapshots...
Okay.. I understand. I guess I was deluded a bit by non-thin snapshot
behaviour (filled up really fast without me understanding why, and
concluding that it was doing 4MB copies).
As well as, of course, by the fact that extents were calculated in whole
numbers in the overviews... apologies.
But attribution of an extent to a snapshot will still be done in
extent sizes, right?
So I was just talking about allocation, nothing else.
BUT if the allocator operates on 64kB requests, then yes...
(As pointed out in the other thread - the ideal chunk for the best snapshot
sharing would be 4K - but that's not affordable for other reasons....)
Okay.
2) I would freeze non-critical volumes (I do not write to snapshots, so
that is no issue) when the critical volumes reached a safety threshold in
free space (I would do this in-kernel if I could) (but freezing in
user-space is almost the same).
There is lots of trouble when you have frozen filesystems present
in your machine's fs tree... - if you know all the connections and
restrictions it can 'possibly' be useful - but I can't imagine this
being useful in the generic case...
Well, yeah. Linux.
(I mean, just a single broken NFS or CIFS connection can break so
much....).
And more for your thinking -
if you have pressure on provisioning caused by disk load on one of
your 'critical' volumes, this FS 'freezing' scripting will 'buy' you
only a couple of seconds
Oh yeah of course, this is correct.
(depending on how fast your drives are and how big the thresholds you use
are) and you are in the 'exact' same situation -
except now you have a system in bigger trouble - and you might already
have frozen other system apps by having them access your
'low-prio' volumes....
Well, I guess you would reduce non-critical volumes to single-purpose
things, i.e. only used by one application.
And how you will solve 'unfreezing' in case thin-pool usage drops down
again is also a pretty interesting topic on its own...
I guess that would be manual?
I need to wish you good luck when you are testing and developing all
this machinery.
Well as you say it has to be an anomaly in the first place -- an error
or problem situation.
It is not standard operation.
So I don't think the problems of freezing are bigger than the problems
of rebooting.
The whole idea is that you dedicate non-critical volumes to single apps
or single purposes, so that when they run amok, or in any case, if
anything runs amok on them...
Yes it won't protect the critical volumes from being written to.
But that's okay.
You don't need to automatically unfreeze.
You need to send an email and say stuff has happened ;-).
"System is still running but some applications may have crashed. You
will need to unfreeze and restart in order to solve it, or reboot if
necessary. But you can still log into SSH, so maybe you can do it
remotely without a console ;-)".
I don't see any issues with this.
One could say: use filesystem quotas.
Then that involves setting up users etc.
Setting up a quota for a specific user on a specific volume...
All more configuration.
And you're talking mostly about services of course.
The benefit (and danger) of LVM is that it is so easy to create more
volumes.
(The danger being that you now also need to back up all these volumes).
(Independently).
The default is to auto-extend thin-data & thin-metadata when needed, if
you set the threshold below 100%.
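For completeness, these are the lvm.conf knobs involved; the values here
are just an example:

# lvm.conf, activation section: auto-extend the thin-pool by 20%
# once it becomes more than 70% full.
activation {
    thin_pool_autoextend_threshold = 70
    thin_pool_autoextend_percent = 20
}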
Q: In a 100% filled-up pool, are snapshots still going to be valid?
Could it be useful to have a default policy of dropping snapshots at
high consumption (i.e. 99%)? But it doesn't have to be the default if you
can easily configure it and the scripts are available.
All snapshots/thins with 'fsynced' data are always secure.
The thin-pool protects all user-data on disk.
The only lost data is the data still flying around in your memory
(unwritten to disk).
And it depends on your 'page-cache' setup how much that can be...
That seems pretty secure. Thank you.
So there is no issue with snapshots behaving differently. It's all the
same and all committed data will be safe prior to the fillup and not
change afterward.
I guess.