Re: [PATCH 2/2] block: create ioctl to discard-or-zeroout a range of blocks

Thomas Schoebel-Theuer <tst@xxxxxxxxxxxxxxxxxx> · Sat, 12 Mar 2016 11:11:38 +0100

On 03/12/2016 08:19 AM, Theodore Ts'o wrote:
> On Fri, Mar 11, 2016 at 04:44:16PM -0800, Linus Torvalds wrote:
>
>> There's a big difference between "give the user rope", and "tie the
>> rope in a noose and put a banana peel so that the user might stumble
>> into the rope and hang himself", though.
> [...]  And then the application has to run
> setgid with that group's privileges.

Your concept of hierarchically nesting containers via filesystem 
instances looks nice to me.

A potential concern could be whether gids are the right implementation 
for expressing hierarchically nested access permissions in a persistent way.

The permissions you attach to gids are nested (because inside of your
containers you may have another instance of a completely different gid
namespace). They are also persistent, provided your mount flags etc. are
restored properly after a crash (by some scripts). But using gids for
this might look like a kind of "misuse" of the original gid concept
from the 1970s.

Maybe you currently don't have a better /persistent/ concept for
expressing your needs, so your solution may be just fine under the
current circumstances.

Introduction of a new concept for overcoming the current limitations 
must be done very carefully.

The concerns about information leaks caused by bad discard semantics
could be /hypothetically/ solved at /concept level/ in the following
way. Please note that by "concept level" I don't want to imply any
particular implementation; this is just a thought experiment for
discussing the problems, just a "model of thinking":

a) Use a hierarchical namespace for naming subjects, e.g. 
hypervisorA.containerB.subcontainerC.user9 instead of gid=9

b) Attach actual permissions to each block of the underlying block 
device (fine-grained object model).

c) Correctly maintain access rights at each hierarchical layer, and for 
all operations (including discard with whatever semantics). In case some 
inner instance is untrusted and may do evil things, this will be 
intercepted / corrected at outer layers (which are more trusted). In 
essence, the nesting hierarchy is also a hierarchy of trust.
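
To make the thought experiment in a) to c) a bit more tangible, here is a
minimal Python sketch. Everything in it is an assumption made up for
illustration (the class PermissionedDevice, the helper may_access, the
rule that a free block reads as empty); it is a toy model, not a
proposal for any kernel interface.

```python
def may_access(subject: str, owner: str) -> bool:
    """A subject may touch a block it owns, or one owned by something nested
    inside it (outer layers are more trusted than inner ones)."""
    return subject == owner or owner.startswith(subject + ".")


class PermissionedDevice:
    """Toy block device that records an owning subject for every block."""

    def __init__(self, nblocks: int):
        self._data = [b""] * nblocks
        self._owner = [None] * nblocks  # None means the block is free

    def _check(self, subject: str, blk: int) -> None:
        if not may_access(subject, self._owner[blk]):
            raise PermissionError(f"{subject} may not access block {blk}")

    def write(self, subject: str, blk: int, payload: bytes) -> None:
        if self._owner[blk] is None:
            self._owner[blk] = subject  # assumption: free blocks can be claimed
        self._check(subject, blk)
        self._data[blk] = payload

    def read(self, subject: str, blk: int) -> bytes:
        if self._owner[blk] is None:
            return b""  # a free block never exposes stale content
        self._check(subject, blk)
        return self._data[blk]

    def discard(self, subject: str, blk: int) -> None:
        if self._owner[blk] is None:
            return
        self._check(subject, blk)
        # The stale payload is deliberately left in place, mimicking a device
        # whose discard does not zero. Only the ownership record is dropped.
        self._owner[blk] = None


dev = PermissionedDevice(4)
dev.write("hypervisorA.containerB.user9", 0, b"secret")
```

In this model a read by hypervisorA.containerB (an outer layer) succeeds,
a read by hypervisorA.containerC.user1 raises PermissionError, and after
a discard the unrelated user sees an empty block even though the old
payload is still physically there.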

With this model, information leaks caused by bad discard semantics etc.
would be prevented at every level, even between completely unrelated
containers or users, as long as no physical access to the disk is
possible. Encryption could additionally be used to cover even that case.

Of course, a direct implementation of such extremely fine-grained access 
permissions would carry way too much overhead. Both the number of 
subjects as well as the number of objects must be reduced to some 
reasonable order of magnitude, at least at outer levels.

Thus the question is: how can we achieve almost the same effect with 
much less overhead?

Hmm, in my old Athomux research prototype, I proposed some solutions for
this, on an academic greenfield. But I am unsure what is transferable
to a system with standard POSIX semantics, and what is not. Rethinking
these concepts as well as checking them may take some time....

Here is a first alpha-stage attempt:

1) Give up the hierarchical subject namespace a), but maybe not fully. 
Access checking will continue /locally/ at each layer, by treating each 
subsystem as a (grey) blackbox. This is already the default 
implementation strategy. The total system may be less secure than in an 
idealized fine-grained system, because outer levels can no longer detect 
bad guys inside of their subsystem instances. The question is: how can
we get a system that is "more secure" than today's with reasonable effort?

2) Some /coarse/ access permission checks at the block layer b), but
finer than today. Currently there is almost no checking at all (except
during open(), when a huge block device is accessed as a whole; at 1&1
we have very large ones, and they may continue running for years). I am
unsure how to achieve this in detail.
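
As a rough illustration of the purely /local/ checking in 1), here is a
small Python sketch of nested layers. The Layer class and its resolve()
method are hypothetical names invented for this example. Each layer only
knows the extent it was given inside its parent and checks requests
against that, without knowing who inside the subsystem issued them.

```python
class Layer:
    """One nesting level: a window of `size` units at offset `base` in its parent."""

    def __init__(self, name, size, parent=None, base=0):
        if parent is not None and (base < 0 or base + size > parent.size):
            raise ValueError(f"{name} does not fit inside {parent.name}")
        self.name = name
        self.size = size
        self.parent = parent
        self.base = base

    def resolve(self, offset, length):
        """Translate a range to an offset on the outermost device.

        Every layer on the way out repeats its own bounds check, so an
        inner layer that misbehaves is stopped by the next outer one.
        """
        if offset < 0 or length <= 0 or offset + length > self.size:
            raise PermissionError(f"{self.name}: range outside this layer")
        if self.parent is None:
            return offset
        return self.parent.resolve(self.base + offset, length)


disk = Layer("disk", 1000)
container = Layer("containerB", 400, parent=disk, base=100)
sub = Layer("subcontainerC", 100, parent=container, base=50)
```

What this sketch cannot do is exactly the weakness described in 1): the
outer layer sees only ranges, not the subjects behind them.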

An idea for a long-term solution would be to offload "allocation
groups" to the block layer (provided their size is only coarsely
dynamic in general, e.g. in steps of gigabytes), and to implement some
coarse permission checks there. These could then be related to
"containers" or "container groups". One of the problems is that some
widespread network protocols like iSCSI have no clue about this, so
this can only be an optional new feature.
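
A coarse check of this kind could look roughly like the following Python
sketch. The names GroupTable, assign() and may_access(), and the fixed
1 GiB group size, are assumptions for illustration only; a real design
would need dynamic group sizes and a persistent table.

```python
GROUP_SIZE = 1 << 30  # assumed granularity: 1 GiB per allocation group


class GroupTable:
    """Maps each allocation group of a device to the container that owns it."""

    def __init__(self, device_bytes, group_size=GROUP_SIZE):
        self.group_size = group_size
        ngroups = -(-device_bytes // group_size)  # ceiling division
        self.owner = [None] * ngroups  # None means unassigned

    def assign(self, group, container):
        self.owner[group] = container

    def may_access(self, container, offset, length):
        """Allow a request only if every group it touches belongs to `container`."""
        if offset < 0 or length <= 0:
            return False
        first = offset // self.group_size
        last = (offset + length - 1) // self.group_size
        if last >= len(self.owner):
            return False
        return all(self.owner[g] == container for g in range(first, last + 1))


table = GroupTable(4 << 30)  # a 4 GiB device, i.e. four groups
table.assign(0, "containerB")
table.assign(1, "containerB")
table.assign(2, "containerC")
```

The table stays tiny (one entry per gigabyte), which is the point: the
cost is per group, not per block, at the price of coarser isolation.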

Further ideas sought.

Cheers, Thomas

P.S. The concept of a "nest" in Athomux was already some kind of 
"recursively nested block device".
