Re: lvm2 deadlock

Hi,

On 2024/06/07 00:17, Zdenek Kabelac wrote:
On 07. 06. 24 at 0:14, Zdenek Kabelac wrote:
On 05. 06. 24 at 10:59, Jaco Kroon wrote:
Hi,

On 2024/06/04 18:07, Zdenek Kabelac wrote:
On 04. 06. 24 at 13:52, Jaco Kroon wrote:
Last but not least - disk scheduling policies also have an impact, e.g. ensuring better fairness at the price of lower throughput...
We normally use mq-deadline; in this setup I notice it has been changed to "none". That was done following a discussion with Bart van Assche, and the plan was to revert it - happy to do so, to be honest. https://lore.kernel.org/all/07d8b189-9379-560b-3291-3feb66d98e5c@xxxxxxx/ relates.

Hi

So I guess we can tell the story like this -

When you create a 'snapshot' of a thin volume, this enforces a full flush (and fsfreeze) of the thin volume - so any dirty pages need to be written to the thin pool before the snapshot can be taken (and the thin pool must not run out of space). This CAN potentially hold up your system for a long time (depending on the performance of your storage), and may cause various lock-up states if you are using this 'snapshotted' volume for anything else - while suspended, the volume blocks further operations on the device, eventually leading to a full circular system deadlock (catch 22). This is hard to analyze without the whole picture of the system.
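To illustrate, while 'lvcreate -s' is busy the origin's device-mapper device reports as suspended, and any I/O issued against it blocks until the flush completes and the device is resumed - roughly observable like this (VG/LV names are just examples):

  # check the device-mapper state of the origin while the snapshot is being taken
  dmsetup info vg0-data | grep '^State'
  # State:              SUSPENDED   <- I/O to /dev/vg0/data blocks here
  # State:              ACTIVE      <- normal again once the snapshot exists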

We may eventually look at whether we can somehow minimize the time spent holding the VG lock and suspending with flush & fsfreeze - but that is a possible future enhancement; for now, flush the disks upfront to minimize the amount of dirty data.

I forgot to mention that the simplest way is just to run 'sync' before running the 'lvcreate -s ...' command...
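For example, something along these lines (VG/LV names are purely illustrative):

  sync                               # flush all dirty pages system-wide first
  lvcreate -s -n data_snap vg0/data  # then take the thin snapshot while little is left to flush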

Thanks.  I think all in all everything mentioned here makes a lot of sense, and (in my opinion at least) explains the symptoms we've been seeing.

Overall the system does "feel" more responsive with the lower dirty buffer limits, and most likely this also helps with data persistence (as has been mentioned) in the case of a system crash and/or loss of power.

Tasks during peak usage also seem to run faster on average; I suspect this is because of the use case for this host:

1.  Data is seldom overwritten (this was touched on).  Pretty much everything is WORM-type access (Write-Once, Read-Many).
2.  Caches are mostly needed to prevent read bandwidth from consuming capacity needed for writing.
3.  It's thus beneficial to get writes out of the way as soon as possible, rather than having to block at a later stage to get many writes done for a flush(), sync() or lvcreate (snapshot).

Is 500MB needlessly low?  Probably.  But given the above I think this is acceptable.  Rather keep the disk writing *now* in order to free up *future* capacity.
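For the record, assuming the limit is applied via the vm.dirty_bytes sysctls (the exact knob isn't shown above, and the values below are only examples, not a recommendation):

  sysctl -w vm.dirty_bytes=524288000              # ~500MB hard ceiling on dirty pages
  sysctl -w vm.dirty_background_bytes=134217728   # start background writeback well before that (example value)

Note that setting vm.dirty_bytes takes precedence over vm.dirty_ratio (the ratio is then ignored).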

I'm guessing your "simple way" is workable for the generic case as well. Towards that end, would a relatively simple change to the lvm2 tools not perhaps be to add a syncfs() call to lvcreate *just prior* to freezing? The hard part is probably figuring out whether the LV is mounted somewhere, and if it is, open()ing that path in order to have a file descriptor to pass to syncfs(). Obviously if the LV isn't mounted none of this is a concern and we can just proceed.
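As a stop-gap that could even be scripted around lvcreate today, findmnt can answer the "is it mounted" question, and coreutils' 'sync -f' calls syncfs(2) on the filesystem containing the given path - a rough sketch, with device and snapshot names purely illustrative:

  MNT=$(findmnt -n -o TARGET /dev/vg0/data | head -n 1)   # empty if the LV isn't mounted
  if [ -n "$MNT" ]; then
      sync -f "$MNT"                                      # syncfs(2) on just that filesystem
  fi
  lvcreate -s -n data_snap vg0/data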

What would be more interesting is the case where cluster-lvm is in play and the origin LV is active/open on another node - but that's well beyond the scope of our requirements (for now).

Kind regards,
Jaco




