Re: Reserve space for specific thin logical volumes

Zdenek Kabelac wrote on 15-09-2017 11:22:

lvm2 makes them look the same - but underneath it's very different
(and it's not just by age - they also target different purposes).

- old-snaps are good for short-lived, small snapshots - when a low
number of changes is expected and it's not a big issue if the
snapshot is 'lost'.

- thin-snaps are ideal for long-lived objects, with the possibility
to take snaps of snaps of snaps, and you are guaranteed the snapshot
will not 'just disappear' while you modify your origin volume...

Both have very different resource requirements and performance...

The point being that short-lived, small snapshots are also perfectly served by thin...

So I don't really think there are many instances where "old" trumps "thin".

Except, of course, if the added constraint is a plus (knowing in advance how much it is going to cost).

But that's the only thing: predictability.

I use my regular and thin snapshots for the same purpose. Of course you can do more with Thin.

There are cases where it's quite a valid option to take an old-snap of
a thinLV and it will pay off...

Exactly in the case where you use thin and you want to make sure your
temporary snapshot will not 'eat' all your thin-pool space, and you
want to let the snapshot die.

Right.

That sounds pretty sweet actually. But it will be a lot slower, right?

I currently just make new snapshots each day. They live for an entire day. If the system wants to make a backup of the snapshot it has to do it within the day ;-).

My root volume is not on thin and thus has an "old-snap" snapshot. If the snapshot is dropped it is because of lots of upgrades, but this is no biggie; next week the backup will succeed. Normally the root volume barely changes.

So it would be possible to reserve regular LVM space for thin volumes as well, right - for snapshots, as you say below. But will this not slow down all writes considerably more than a thin snapshot would?

So while my snapshots are short-lived, they are always there.

The current snapshot is always of 0:00.
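
For what it's worth, the rotation itself is just a cron job at 0:00 along these lines (a rough sketch, not my actual script; the VG/LV names "linux/root"/"root-snap" and the 5G snapshot size are only examples):

#!/usr/bin/env python3
# Recreate the daily backup snapshot at 0:00 (run from cron).
import subprocess

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Drop yesterday's snapshot; ignore failure if dmeventd already invalidated it.
subprocess.run(["lvremove", "-f", "linux/root-snap"])

# Classic (old-snap) snapshot of the non-thin root LV: needs an explicit size.
run("lvcreate", "-s", "-L", "5G", "-n", "root-snap", "linux/root")

# For a thin LV the snapshot needs no size - it shares the pool:
# run("lvcreate", "-s", "-n", "data-snap", "vg/thindata")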

Thin-pool still does not support shrinking - so if the thin-pool
auto-grows to a big size - there is no way for lvm2 to reduce the
thin-pool size...

Ah ;-). A detriment of auto-extend :p.

That's just the sort of thing that I have kept track of continuously in the past (in unrelated projects), such that every mutation also updated the metadata without having to recalculate it...

Would you prefer to spend all your RAM keeping all the mapping
information for all the volumes, and put very complex code into the
kernel to parse information which is technically already out of date
by the moment you get the result??

No - if you only kept some statistics, that would not amount to all the mapping data, only to a summary of it.

Say you write a bot that plays a board game. While searching for moves the bot has to constantly perform moves on the board. It can either create new board instances for every move, or just mutate the existing board and be a lot faster.

In mutating the board it will each time want the same information as before: how many pieces the white player has, how many the black player has, and so on.

A lot of this information is easier to update than to recalculate, that is, the moves themselves can modify this summary information, rather than derive it again from the board positions.

This is what I mean by "updating the metadata without having to recalculate it".

You wouldn't have to keep the mapping information in RAM, just the number of blocks allocated and so on. A single number. A few single numbers for each volume and each pool.

No more than maybe 32 bytes, I don't know.

It would probably need to be concurrently updated, but that's what it is.

You just maintain summary information that you do not recalculate, but just modify each time an action is performed.
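
In code terms the "summary" I have in mind is nothing more than this (a toy illustration of the bookkeeping idea, obviously not how dm-thin is structured internally):

# Toy illustration: mutate the summary on every allocate/release instead of
# recalculating it from the full mapping tree.
class PoolSummary:
    def __init__(self, total_chunks):
        self.free_chunks = total_chunks   # one number for the whole pool
        self.allocated = {}               # one number per volume

    def allocate(self, volume, chunks=1):
        if self.free_chunks < chunks:
            raise RuntimeError("pool is full")
        self.free_chunks -= chunks
        self.allocated[volume] = self.allocated.get(volume, 0) + chunks

    def release(self, volume, chunks=1):
        self.allocated[volume] -= chunks
        self.free_chunks += chunks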

But the point of what you're saying is that the number of blocks uniquely owned by any snapshot is not known at any one point in time.

As long as a 'thinLV' (i.e. your snapshot thinLV) is NOT active - there
is nothing in the kernel maintaining its dataset.  You can have lots of
thinLVs active and lots of others inactive.

But if it's not active, can it still 'trace' another volume? I.e. it has to get updated if it is really a snapshot of something, right?

If it doesn't get updated (and not written to) then it also does not allocate new extents.

So then it never needs to play a role in any mechanism needed to prevent allocation.

However, volumes that see new allocation happening for them would then always reside in kernel memory, right?

You said somewhere else that overall data (for pool) IS available. But not for volumes themselves?

I.e. you don't have a figure on uniquely owned vs. shared blocks.

I get that it is not unambiguous to interpret these numbers.

Regardless, with one volume as "master", I think an unambiguous interpretation arises?

So is or is not the number of uniquely owned/shared blocks known for each volume at any one point in time?

Well pardon me for digging this deeply. It just seemed so alien that this thing wouldn't be possible.

I'd say it's very smart ;)

You mean not keeping everything in memory.

You can use only a very small subset of the 'metadata' information for
individual volumes.

But I'm still talking about only summary information...


It becomes a rather big enterprise to install thinp for anyone!!!

It's enterprise level software ;)

Well I get that you WANT that ;-).

However, with the appropriate amount of user-friendliness, what was at first only for experts can simply become available to more ordinary people ;-).

I mean, ahem, if I want some SSD caching in Microsoft Windows, ahem, I right-click on a volume in Windows Explorer, select Properties, select the ReadyBoost tab, click "Reserve complete volume for ReadyBoost", click OK, and I'm done.

It literally takes some 10 seconds to configure SSD caching on such a machine.

It would probably take me some 2 hours in Linux, not just to enter the commands but also to think about how to do it.

Provided I don't end up with the SSD kernel issues with IO queue bottlenecking I had before...

Which, I can tell you, took a multiple of those 2 hours, with the conclusion that the small mSATA SSD I had was just not suitable, much like some USB device.


For example, OpenVPN clients on Linux are by default not configured to automatically reconnect when there is some authentication issue (which could be anything, including a dead link I guess) and will thus simply quit at the smallest issue. They then need the "auth-retry nointeract" directive to keep reconnecting automatically.

But on any Linux machine the command-line version of OpenVPN is probably going to be used as an unattended client.

So it made no sense to have to "figure this out" on your own. An enterprise will be able to do so, yes.

But why not make it easier...

And even if I were an enterprise, I would still want:

- ease of mind
- sane defaults
- if I make a mistake the earth doesn't explode
- If I forget to configure something it will have a good default
- System is self-contained and doesn't need N amount of monitoring systems before it starts working

In most common scenarios - the user knows when he runs out of space - it
will not be a 'pleasant' experience - but the user's data should be safe.

Yes again, apologies, but I was basing myself on kernel 4.4 in Debian 8 with LVM 2.02.111, which, by now, is three years old hahaha.

Hehe, this is my self-made reporting tool:

Subject: Snapshot linux/root-snap has been unmounted

Snapshot linux/root-snap has been unmounted from /srv/root because it filled up to 100%.

Log message:

Sep 16 22:37:58 debian lvm[16194]: Unmounting invalid snapshot linux-root--snap from /srv/root.

Earlier messages:

Sep 16 22:37:52 debian lvm[16194]: Snapshot linux-root--snap is now 97% full.
Sep 16 22:37:42 debian lvm[16194]: Snapshot linux-root--snap is now 93% full.
Sep 16 22:37:32 debian lvm[16194]: Snapshot linux-root--snap is now 86% full.
Sep 16 22:37:22 debian lvm[16194]: Snapshot linux-root--snap is now 82% full.
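
The tool itself is nothing special - roughly something like this, scanning syslog for those dmeventd messages (a simplified sketch, not the real script; the log path, the mail addresses and the exact regexes are placeholders):

#!/usr/bin/env python3
# Mail a summary when dmeventd unmounts a full snapshot, including the
# earlier "is now NN% full" warnings.
import re
import smtplib
from email.message import EmailMessage

SYSLOG = "/var/log/syslog"
WARN = re.compile(r"lvm\[\d+\]: Snapshot (\S+) is now (\d+)% full")
UMOUNT = re.compile(r"lvm\[\d+\]: Unmounting invalid snapshot (\S+) from (\S+?)\.?$")

warnings = []
with open(SYSLOG) as log:
    for line in log:
        if WARN.search(line):
            warnings.append(line.rstrip())
            continue
        m = UMOUNT.search(line)
        if not m:
            continue
        snap, mountpoint = m.groups()
        msg = EmailMessage()
        msg["Subject"] = f"Snapshot {snap} has been unmounted"
        msg["From"] = "lvm-report@localhost"
        msg["To"] = "root@localhost"
        msg.set_content(
            f"Snapshot {snap} has been unmounted from {mountpoint} "
            "because it filled up to 100%.\n\nEarlier messages:\n"
            + "\n".join(warnings[-5:]))
        with smtplib.SMTP("localhost") as s:
            s.send_message(msg)
        warnings.clear()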

Now do we or do we not upgrade to Debian Stretch lol.

And then it depends on how much energy/time/money the user wants to put
into the monitoring effort to minimize downtime.

Well yes, but this is exacerbated by, say, this example of OpenVPN having bad defaults. If you can't figure out why your connection is not maintained, now you need a monitoring script to automatically restart it.

If something is hard to recover from, now you need a monitoring script to warn you plenty ahead of time so you can prevent it, etc.

If the monitoring script can fail, now you need a monitoring script to monitor the monitoring script ;-).

System admins keep busy ;-).

As has been said - disk space is quite cheap.
So if you monitor and insert your new disk space in time
(enterprise...) you have a smaller set of problems - than if you
constantly try to fight with a 100% full thin-pool...

In that case it's more of a safety measure. But a bit pointless if you don't intend to keep growing your data collection.

I.e. you could keep an extra disk in your system for this purpose, but then, as you said, you can't shrink the thing once it gets used ;-).

That makes it rather pointless to have it as a safety net for a system that is not meant to expand ;-).


You can always use a normal device - it's really about choice and purpose...

Well the point is that I never liked BTRFS.

BTRFS has its own set of complexities and people running around and tumbling over each other in figuring out how to use the darn thing. Particularly with regards to the how-to of using subvolumes, of which there seem to be many different strategies.

And then Red Hat officially deprecates it for the next release. Hmmmmm.

And ZFS has a very Linux-unlike command set.

Its own universe.

LVM in general is reasonably customer-friendly or user-friendly. Configuring cache volumes etc. is not that easy but also not that complicated. Configuring RAID is not very hard compared to mdadm, although it remains a bit annoying to have to remember pretty explicit commands to manage it.

But rebuilding e.g. RAID 1 sets is pretty easy and automatic.

Sometimes there is annoying stuff like not being able to change a volume group (name) when a PV is missing, but if you remove the PV how do you put it back in? And maybe you don't want to... well whatever.

I guess certain things are difficult enough that you would really want a book about it, and having to figure it out is fun the first time but after that a chore.

So I am interested in developing "the future" of computing you could call it.

I believe that using multiple volumes is "more natural" than a single big partition.

But traditionally the "single big partition" is the only way to get a flexible arrangement of free space.

So when you move towards multiple (logical) volumes, you lose that flexibility that you had before.

The only way to solve that is by making those volumes somewhat virtual.

And to have them draw space from the same big pool.

So you end up with thin provisioning. That's all there is to it.

While personally I also like the bigger versus smaller idea because you don't have to configure it.

I'm still proposing to use different pools for different purposes...

You mean use a different pool for that one critical volume that can't run out of space.


This goes against the idea of thin in the first place. Now you have to give up the flexibility you sought in order to get some safety, because you cannot define any constraints within the existing system without separating things physically.

Sometimes spreading the solution across existing logic is way easier
than trying to achieve some super-intelligent universal one...

I get that... building a wall between two houses is easier than having to learn to live together.

But in the end the walls may also kill you ;-).

Now you can't share a washing machine, you can't share a vacuum cleaner; you have to have your own copy of everything, including bathrooms, toilet, etc.

Even though 90% of the time these things go unused.

So resource sharing is severely limited by walls.

Total cost of services goes up.


But didn't you just say you needed to process up to 16GiB to know this information?

Of course the thin-pool has to be aware how much free space it has.
And this you can somehow imagine as a 'hidden' volume with FREE space...

So to give you this 'info' about free blocks in the pool - you maintain
a very small metadata subset - you don't need to know about all the
other volumes...

Right, just a list of blocks that are free.

If another volume is releasing or allocating chunks - your 'FREE space'
gets updated....

That's what I meant by mutating the data (summary).

It's complex underneath and locking is very performance sensitive -
but for easy understanding you can possibly get the picture out of
this...

I understand, but does this mean that the NUMBER of free blocks is also always known?

So isn't the NUMBER of used/shared blocks in each DATA volume also known?

You may not know the size and attribution of each device but you do know the overall size and availability?

The kernel supports one threshold setting - when usage passes it, the
user-space daemon (dmeventd) is woken up.

The value maps to the lvm.conf autoextend threshold.

As a 'secondary' source - dmeventd checks pool fullness every 10
seconds with a single ioctl() call, compares how the fullness has
changed, and provides you with callbacks for those 50, 55, ... jumps
(as can be found in 'man dmeventd')

So for autoextend threshold passing you get an instant call.
For all others there is up to a 10-second delay for discovery.

But that's about the 'free space'.

What about the 'used space'. Could you, potentially, theoretically, set a threshold for that? Or poll for that?

I mean the used space of each volume.
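
From user space I can of course already poll the per-volume figures lvm2 reports, something like this (a sketch; "vg0" is an example VG name, and as far as I can tell the only kernel-side threshold is the pool-level thin_pool_autoextend_threshold in lvm.conf, so per-volume usage has to be polled):

#!/usr/bin/env python3
# Poll per-volume 'used space' via the lvs reporting fields.
import subprocess

def thin_usage(vg):
    out = subprocess.run(
        ["lvs", "--noheadings", "--separator", "|",
         "-o", "lv_name,lv_size,data_percent", vg],
        check=True, capture_output=True, text=True).stdout
    for line in out.splitlines():
        name, size, pct = (field.strip() for field in line.split("|"))
        if pct:  # empty for LVs that have no data percentage
            print(f"{name}: {pct}% of {size} allocated")

thin_usage("vg0")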

But you could make them unequal ;-).

I cannot ;) - I'm an lvm2 coder - dm thin-pool is Joe's/Mike's toy :)

In general - you can come up with many different kernel modules which
take different approaches to the problem.

Worth noting - RH now has Permabit in its portfolio - so there can be
more than one type of thin provisioning supported in lvm2...

The Permabit solution has deduplication, compression, 4K blocks - but
no snapshots....

Hmm, sounds too 'enterprise' for me ;-).

In principle it comes down to the same thing... one big pool of storage and many views onto it.

Deduplication is natural part of that...

Also for backup purposes mostly.

You can have 100 TB worth of backups only using 5 TB.

Without having to primitively hardlink everything.

And without maintaining complete trees of every backup open on your filesystem... no use of archive formats...

If the system can hardlink blocks instead of files, that is very interesting.

Of course snapshots (thin) are also views onto the dataset.

That's the point of sharing.

But sometimes you live in the same house and you want a little room for yourself ;-).

But in any case...

Of course if you can only change lvm2, maybe nothing of what I said was ever possible.

But I thought you also spoke of possibilities including changing the device mapper, saying that what I want is impossible :p.

IF you could change the device mapper, THEN could it be possible to reserve allocation space for a single volume???

All you have to do is lie to the other volumes when they want to know how much space is available ;-).

Or something of the kind.

Logically there are only two conditions:

- virtual free space for critical volume is smaller than its reserved space
- virtual free space for critical volume is bigger than its reserved space

If bigger, then all the reserved space needs to stay free.
If smaller, then we don't need as much.

But it probably also doesn't hurt.

So a 40GB virtual volume has 5GB free but its reserved space is 10GB.

Now the real reserved space also becomes 5GB.

So for this system to work you need only very limited data points:

- unallocated extents of virtual 'critical' volumes (1 number for each 'critical' volume)
- total amount of free extents in pool

And you're done.

+ the reserved space for each 'critical volume'.

So say you have 2 critical volumes:

virtual size      reserved space
    10GB                500MB
    40GB                 10GB

Total reserved space is 10.5GB

If the second one has allocated 35GB, it can only possibly need 5GB more, so the figure changes to

  5.5GB reserved space

Now other volumes can't touch that space: when the available free space in the entire pool becomes <= 5.5GB, allocation fails for non-critical volumes.

It really requires very limited information.

- free extents for all critical volumes (unallocated as per the virtual size)
- total amount free extents in pool
- max space reservation for each critical volume

And you're done. You now have a working system. This is the only information the allocator needs to employ this strategy.

No full maps required.

If you have 2 critical volumes, this is a total of 5 numbers.

This is 40 bytes of data at most.
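
To make it concrete, the whole proposal fits in a few lines (a sketch of the arithmetic only, not anything dm-thin or lvm2 actually implements; the numbers are the example above, and the 9.5GB free figure for the first volume is assumed since only its 500MB reserve matters):

def effective_reservation(virtual_free, reserved):
    # A critical volume can never need more than its own unallocated
    # (virtual) space, so cap the reservation at that.
    return min(virtual_free, reserved)

def allow_allocation(pool_free, critical_volumes, requested, critical=False):
    # critical_volumes: list of (virtual_free, reserved) pairs.
    if critical:
        return pool_free >= requested      # critical LVs may use everything
    held = sum(effective_reservation(vf, r) for vf, r in critical_volumes)
    return pool_free - requested >= held   # others must leave 'held' free

# 10GB LV with a 500MB reserve, and the 40GB LV with a 10GB reserve of
# which 35GB is already allocated (so it can only ever need 5GB more).
critical = [(9.5, 0.5), (5.0, 10.0)]       # (virtual_free, reserved) in GB
print(allow_allocation(pool_free=6.0, critical_volumes=critical,
                       requested=1.0))     # False: would dip below the 5.5GB held
print(allow_allocation(pool_free=8.0, critical_volumes=critical,
                       requested=1.0))     # True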


The goal was more to protect the other volumes: supposing that log writing happened on a separate volume, the point is for that log volume not to impact the main volumes.

IMHO the best protection is a different pool for different thins...
You can more easily decide which pool can 'grow up'
and which one should rather be taken offline.

Yeah yeah.

But that is like avoiding the problem, so there doesn't need to be a solution.

Motto: keep it simple ;)

The entire idea of thin provisioning is to not keep it simple ;-).

Same goes for LVM.

Otherwise we'd be still using physical partitions.


So you have a global thin reservation of, say, 10GB.

Your log volume is overprovisioned and starts eating up the 20GB you have available and then runs into the condition that only 10GB remains.

The 10GB is a reservation maybe for your root volume. The system (scripts) (or whatever) recognises that less than 10GB remains, that you have claimed it for the root volume, and that the log volume is intruding upon that.

It then decides to freeze the log volume.
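
As a skeleton, that reacting script could be as small as this (a sketch meant to hang off a dmeventd/cron hook; the pool name, the 10GB figure and the mount point are examples, and fsfreeze is just one possible intervention - stopping the service or dropping snapshots would fit the same skeleton):

#!/usr/bin/env python3
# If the pool's free space drops below what the critical volume has
# reserved, freeze the non-critical log volume.
import subprocess

POOL = "vg0/pool"                  # thin pool (example name)
RESERVED_GIB = 10.0                # space claimed for the critical root volume
FREEZE_MOUNTPOINT = "/srv/weblog"  # where the non-critical log LV is mounted

def pool_free_gib(pool):
    out = subprocess.run(
        ["lvs", "--noheadings", "--units", "g", "--nosuffix",
         "-o", "lv_size,data_percent", pool],
        check=True, capture_output=True, text=True).stdout
    size, used_pct = (float(x) for x in out.split())
    return size * (100.0 - used_pct) / 100.0

if pool_free_gib(POOL) < RESERVED_GIB:
    # Freeze the intruding volume; undo later with 'fsfreeze -u'.
    subprocess.run(["fsfreeze", "-f", FREEZE_MOUNTPOINT], check=True)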

Of course you can play with 'fsfreeze' and other things - but all
these things are very special to individual users with their
individual preferences.

Effectively, if you freeze your 'data' LV - as a reaction you may
paralyze the rest of your system - unless you know the 'extra'
information about the user's usage pattern.

Many things only work if the user follows a certain model of behaviour.

The whole idea of having a "critical" versus a "non-critical" volume is that you are going to separate the dependencies such that a failure of the "non-critical" volume will not be "critical" ;-).

So the words themselves predict that anyone employing this strategy will ensure that the non-critical volumes are not critically depended upon ;-).

But do not take this as something to discourage you from trying it -
you may come up with a perfect solution for your particular system -
and some other user may find it useful in some similar pattern...

It's just something that lvm2 can't support globally.

I think the model is clean enough that you can provide at least a skeleton script for it...

But that was already suggested you know, so...


If people want different intervention than "fsfreeze" that is perfectly fine.

Most of the work goes not into deciding the intervention (that is usually simple) but into writing the logic.

(Where to store the values, etc.).

(Do you use LVM tags, how to use that, do we read some config file somewhere else, etc.).

The only reason to provide a skeleton script with LVM is to lessen the burden on all those who would like to follow that separation of critical vs. non-critical.

The big vs. small idea is an extension of that.

Of course you don't have to support it in that sense personally.

But logical separation of more critical vs. less critical of course would require you to also organize your services that way.

If you have e.g. three levels of critical services (A B C) and three levels of critical volumes (X Y Z) then:

A (most critical)   B (intermediate)   C (least critical)
        |               ___/|     _______/  ___/|
        |           ___/   _|____/      ___/    |
        |       ___/  ____/ |       ___/        |
        |   ___/_____/      |   ___/            |
        |  /                |  /                |
X (most critical)   Y (intermediate)   Z (least critical)

Service A can only use volume X
Service B can use both X and Y
Service C can use X Y and Z.

This is the logical separation you must make if "critical" is going to have any value.

But lvm2 will give you enough bricks for writing 'smart' scripts...

I hope so.

It is just convenient if certain models are more mainstream or more easy to implement.

Instead of each person having to reinvent the wheel...

But anyway.

I am just saying that the simple thing Sir Jonathan offered would basically implement the above.

It's not very difficult, just a bit of level-based separation of orders of importance.

Of course the user (admin) is responsible for ensuring that programs actually agree with it.

So I don't think the problems of freezing are bigger than the problems of rebooting.

With 'reboot' you know where you are - it's IMHO a fair condition for this.

With a frozen FS and a paralyzed system, your 'fsfreeze' operation on
unimportant volumes has actually even eaten space from the thin-pool
which might have been better used to store data for the important
volumes....

Fsfreeze would not eat more space than was already eaten.

A reboot doesn't change anything about that either.

If you don't freeze it (and don't reboot either), the whole idea is that more space would be eaten than already has been.

So not doing anything is not a solution (and without any measures in place like this, the pool would be full).

So we know why we want reserved space; it was already rapidly being depleted.

and there is even a big danger that you will 'freeze' yourself already
during the call to fsfreeze (unless, of course, you put BIG margins around it)

Well I didn't say fsfreeze was the best high level solution anyone could ever think of.

But I think freezing a less important volume should ... according to the design principles laid out above... not undermine the rest of the 'critical' system.

That's the whole idea right.

Again not suggesting everyone has to follow that paradigm.

But if you're gonna talk about critical vs. non-critical, the admin has to pursue that idea throughout the entire system.

If I freeze a volume only used by a webserver... I will only freeze the webserver... not anything else?


"System is still running but some applications may have crashed. You will need to unfreeze and restart in order to solve it, or reboot if necessary. But you can still log into SSH, so maybe you can do it remotely without a console ;-)".

Compare with the email:

Your system has run out of space, all actions to gain some more space
have failed - going to reboot into some 'recovery' mode

Actions to gain more space in this case only amount to dropping snapshots; otherwise we are talking about a much more aggressive policy.

So now your system has rebooted and is in recovery mode. Your system ran 3 different services: SSH/shell/email/domain etc., a webserver, and NFS mounts.

Very simple example right.

Your webserver had a dedicated 'less critical' volume.

Some web application overflowed, user submitted lots of data, etc.

Web application volume is frozen.

(Or web server has been shut down, same thing here).

- Now you can still SSH, and the system still receives and sends email
- You can still access filesystems using NFS

Compare to recovery console:

- SSH doesn't work, you need the console
- email is neither received nor sent
- NFS is unavailable
- pings to the domain don't work
- other containers go offline too
- the entire system is basically offline.

Now for whatever reason you don't have time to solve the problem.

The system is offline for a week. Emails are thrown away, not received; you can't SSH in and do other tasks; you may be able to clean up the mess, but you can't put the (web)server back online in case it happens again.

You need time to deal with it, but in the meantime the entire system was offline. You have to manually reboot and shut down the web application.

But in our proposed solution, the script already did that for you.

So same outcome. Less intervention from you required.

Better to keep the system running partially than not at all?

SSH access is absolute premium in many cases.

So there is no issue with snapshots behaving differently. It's all the same and all committed data will be safe prior to the fillup and not change afterward.

Yes - 'snapshot' is user-land language - in the kernel, all thins map chunks...

If you can't map a new chunk - things are going to stop - and start to
error out shortly...

I get it.

We're going to prevent them from mapping new chunks ;-).

Well.

:p.

_______________________________________________
linux-lvm mailing list
linux-lvm@redhat.com
https://www.redhat.com/mailman/listinfo/linux-lvm
read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/


