Re: [PATCH V4 07/13] fs: Add locking for a dynamic address space operations state

Jonathan Halliday <jonathan.halliday@xxxxxxxxxx> · Wed, 26 Feb 2020 11:56:29 +0000

On 26/02/2020 11:31, Jan Kara wrote:
Hello,

On Wed 26-02-20 09:28:57, Jonathan Halliday wrote:
I'm a middleware developer, focused on how Java (JVM) workloads can benefit
from app-direct mode pmem. Initially the target is apps that need a fast
binary log for fault tolerance: the classic database WAL use case;
transaction coordination systems; enterprise message bus persistence and
suchlike. Critically, there are cases where we use log based storage, i.e.
it's not the strict 'read rarely, only on recovery' model that a classic db
may have, but more of a 'append only, read many times' event stream model.

Think of the log oriented data storage as having logical segments (let's
implement them as files), of which the most recent is being appended to
(read_write) and the remaining N-1 older segments are full and sealed, so
effectively immutable (read_only) until discarded. The tail segment needs to
be in DAX mode for optimal write performance, as the size of the append may
be sub-block and we don't want the overhead of the kernel call anyhow. So
that's clearly a good fit for putting on a DAX fs mount and using mmap with
MAP_SYNC.
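
For illustration, the tail-segment append path described above might look
roughly like the following C sketch. The function names (map_tail_segment,
persist_range, append_record) and the 64-byte cache line size are assumptions
made up for this example, not part of any existing library. clflush is used
because it is available on every x86-64 CPU; clwb or clflushopt would
normally be preferred where present, and on other architectures the sketch
falls back to msync(2), which does involve a kernel call.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#if defined(__x86_64__)
#include <emmintrin.h>
#endif

/* Fallbacks for older libc headers; values taken from the Linux UAPI headers. */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x080000
#endif

#define CACHE_LINE 64 /* assumed cache line size */

/* Map `len` bytes of an already-sized file with MAP_SYNC.
 * Returns MAP_FAILED with errno set if the file cannot be mapped that way
 * (e.g. it is not on a DAX-capable filesystem). */
void *map_tail_segment(int fd, size_t len)
{
    return mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
}

/* Push the bytes in [addr, addr + len) out of the CPU caches. */
int persist_range(void *addr, size_t len)
{
#if defined(__x86_64__)
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)addr + len;
    for (; p < end; p += CACHE_LINE)
        _mm_clflush((const void *)p);
    _mm_sfence();
    return 0;
#else
    /* Portable fallback: msync(2) needs a page-aligned start address. */
    long page = sysconf(_SC_PAGESIZE);
    uintptr_t start = (uintptr_t)addr & ~((uintptr_t)page - 1);
    return msync((void *)start, (uintptr_t)addr + len - start, MS_SYNC);
#endif
}

/* Append one record at offset `tail` of a segment of capacity `cap`.
 * Returns the new tail offset, or (size_t)-1 if the record does not fit
 * or the flush failed. */
size_t append_record(uint8_t *base, size_t cap, size_t tail,
                     const void *rec, size_t len)
{
    if (tail > cap || len > cap - tail)
        return (size_t)-1;
    memcpy(base + tail, rec, len);
    if (persist_range(base + tail, len) != 0)
        return (size_t)-1;
    return tail + len;
}
```

The point of the sketch is that, once the mapping exists, a sub-block append
is a memcpy plus a cache flush, with no system call on the x86-64 path.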

However, we want fast read access into the segments, to retrieve stored
records. The small access index can be built in volatile RAM (assuming we're
willing to take the startup overhead of a full file scan at recovery time)
but the data itself is big and we don't want to move it all off pmem. Which
means the requirements are now different: we want the O/S cache to pull hot
data into fast volatile RAM for us, which DAX explicitly won't do.
Effectively a poor man's 'memory mode' pmem, rather than app-direct mode,
except here we're using the O/S rather than the hardware memory controller
to do the cache management for us.

Currently this requires closing the full (read_write) file, then copying it
to a non-DAX device and reopening it (read_only) there. Clearly that's
expensive and rather tedious. Instead, I'd like to close the MAP_SYNC mmap,
then, leaving the file where it is, reopen it in a mode that will instead go
via the O/S cache in the traditional manner. Bonus points if I can do it
over non-overlapping ranges in a file without closing the DAX mode mmap,
since then the segments are entirely logical instead of needing separate
physical files.

I note a comment below regarding a per-directory setting, but don't have the
background to fully understand what's being suggested. However, I'll note
here that I can live with a per-directory granularity, as relinking a file
into a new dir is a constant time operation, whilst the move described above
isn't. So if a per-directory granularity is easier than a per-file one
that's fine, though as a person with only passing knowledge of filesystem
design I don't see how having multiple links to a file can work cleanly in
that case.

Well, with a per-directory setting, relinking the file will not magically
make it stop using DAX. So your situation would be very similar to the
current one, except "copy to non-DAX device" can be replaced by "copy to
non-DAX directory". Maybe the "copy" part could actually be a reflink, which
would make it faster.

Indeed. The requirement is for 'change mode in constant time' rather 
than the current 'change mode in time proportional to file size'. That 
seems to imply requiring the approach to just change fs metadata, 
without relocating the file data bytes. Beyond that I'm largely 
indifferent to the implementation details.

P.S. I'll cheekily take the opportunity of having your attention to tack on
one minor gripe about the current system: The only way to know if a mmap
with MAP_SYNC will work is to try it and catch the error. Which would be
reasonable if it were free of side effects.  However, the process requires
first expanding the file to at least the size of the desired map, which is
done non-atomically i.e. is user visible. There are thus nasty race
conditions in the cleanup, where after a failed mmap attempt (e.g. the device
doesn't support DAX), we try to shrink the file back to its original size,
but something else has already opened it at its new, larger size. This is
not theoretical: I got caught by it whilst adapting some of our middleware
to use pmem.  Therefore, some way to probe the file path for its capability
would be nice, much the same as I can e.g. inspect file permissions to (more
or less) evaluate if I can write it without actually mutating it.  Thanks!

Well, reporting an error on mmap(2) is the only way to avoid
time-of-check-to-time-of-use races. And these are very important when we are
speaking about data integrity guarantees. So that's not going to change.
But with Ira's patches you could use statx(2) to check whether the file at
least supports DAX and so avoid doing the mmap check with the side effects in
the common case where it's hopeless... I'd also think that you could
currently do the mmap check with the current file size and, if it succeeds,
expand the file to the desired size and mmap again. It's not ideal but it
should work.
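
A rough C sketch of the two checks suggested above, under some assumptions:
that the statx(2) attribute from Ira's patches is exposed as STATX_ATTR_DAX
(the name and value used in later mainline kernel headers), and that glibc
provides the statx() wrapper. The helper names file_reports_dax and
probe_map_sync are invented for this example. The probe maps at most one page
of the file at its current size, so it never resizes the file; a zero-length
file cannot be probed this way.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Fallbacks for older headers; values taken from the Linux UAPI headers. */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x080000
#endif
#ifndef STATX_ATTR_DAX
#define STATX_ATTR_DAX 0x00200000
#endif

/* Cheap, side-effect-free hint: returns 1 if the kernel reports the file as
 * being in DAX state, 0 if it does not or if statx(2) fails. */
int file_reports_dax(const char *path)
{
    struct statx stx;
    if (statx(AT_FDCWD, path, 0, STATX_BASIC_STATS, &stx) != 0)
        return 0;
    return (stx.stx_attributes & STATX_ATTR_DAX) ? 1 : 0;
}

/* Try a MAP_SYNC mapping at the file's current size, without resizing it.
 * Returns 1 if MAP_SYNC works, 0 if the kernel answers EOPNOTSUPP, and
 * -1 (errno set) on any other failure, including an empty file or a kernel
 * too old to know MAP_SHARED_VALIDATE. */
int probe_map_sync(int fd)
{
    struct stat st;
    if (fstat(fd, &st) != 0)
        return -1;
    if (st.st_size == 0) {
        errno = EINVAL; /* nothing to map at the current size */
        return -1;
    }
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = (size_t)st.st_size < page ? (size_t)st.st_size : page;
    void *p = mmap(NULL, len, PROT_READ,
                   MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (p == MAP_FAILED)
        return errno == EOPNOTSUPP ? 0 : -1;
    munmap(p, len);
    return 1;
}
```

A caller would expand the file and map it for real only after
probe_map_sync() returns 1. The final mmap can still fail, so its error must
still be handled; the probe only avoids the resize in the hopeless case.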

Sure. Best effort is fine here, just as in the example of looking at the
permission bits on a file: even in the absence of racing permission changes
it's not definitive because of additional quota or SELinux checks, but it's
a reasonable approximation. That's a sufficiently useful improvement for my
purposes, given the impractical nature of a 100% solution.

Jonathan
