On Fri, Jan 24, 2025 at 03:28:02PM +0300, Mikhail Ivanov wrote: > On 1/14/2025 9:31 PM, Mickaël Salaün wrote: > > Happy new year! > > > > On Fri, Jan 10, 2025 at 07:55:10PM +0300, Mikhail Ivanov wrote: > > > On 1/10/2025 7:27 PM, Günther Noack wrote: > > > > On Fri, Jan 10, 2025 at 04:02:42PM +0300, Mikhail Ivanov wrote: > > > > > Correct, but we also agreed to use bitmasks for "family" field as well: > > > > > > > > > > https://lore.kernel.org/all/af72be74-50c7-d251-5df3-a2c63c73296a@xxxxxxxxxxxxxxxxxxx/ > > > > > > > > OK > > > > > > > > > > > > > On 1/10/2025 2:12 PM, Günther Noack wrote: > > > > > > I do not understand why this convenience feature in the UAPI layer > > > > > > requires a change to the data structures that Landlock uses > > > > > > internally? As far as I can tell, struct landlock_socket_attr is only > > > > > > used in syscalls.c and converted to other data structures already. I > > > > > > would have imagined that we'd "unroll" the specified bitmasks into the > > > > > > possible combinations in the add_rule_socket() function and then call > > > > > > landlock_append_socket_rule() multiple times with each of these? > > > > I agree that UAPI should not necessarily dictate the kernel > > implementation. > > > > > > > > > > > > I thought about unrolling bitmask into multiple entries in rbtree, and > > > > > came up with following possible issue: > > > > > > > > > > Imagine that a user creates a rule that allows to create sockets of all > > > > > possible families and types (with protocol=0 for example): > > > > > { > > > > > .allowed_access = LANDLOCK_ACCESS_SOCKET_CREATE, > > > > > .families = INT64_MAX, /* 64 set bits */ > > > > > .types = INT16_MAX, /* 16 set bits */ > > > > > .protocol = 0, > > > > > }, > > > > > > > > > > This will add 64 x 16 = 1024 entries to the rbtree. Currently, the > > > > > struct landlock_rule, which is used to store rules, weighs 40B. So, it > > > > > will be 40kB by a single rule. Even if we allow rules with only > > > > > "existing" families and types, it will be 46 x 7 = 322 entries and ~12kB > > > > > by a single rule. > > > > > > > > > > I understand that this may be degenerate case and most common rule will > > > > > result in less then 8 (or 4) entries in rbtree, but I think API should > > > > > be as intuitive as possible. User can expect to see the same > > > > > memory usage regardless of the content of the rule. > > > > > > > > > > I'm not really sure if this case is really an issue, so I'd be glad > > > > > to hear your opinion on it. > > > > > > > > I think there are two separate questions here: > > > > > > > > (a) I think it is OK that it is *possible* to allocate 40kB of kernel > > > > memory. At least, this is already possible today by calling > > > > landlock_add_rule() repeatedly. > > > > > > > > I assume that the GFP_KERNEL_ACCOUNT flag is limiting the > > > > potential damage to the caller here? That flag was added in the > > > > Landlock v19 patch set [1] ("Account objects to kmemcg."). > > > > > > > > (b) I agree it might be counterintuitive when a single > > > > landlock_add_rule() call allocates more space than expected. > > > > > > > > Mickaël, since it is already possible today (but harder), I assume > > > > that you have thought about this problem before -- is it a problem > > > > when users allocate more kernel memory through Landlock than they > > > > expected? > > > > The imbalance between the user request and the required kernel memory is > > interesting. It would not be a security issue thanks to the > > GFP_KERNEL_ACCOUNT, but it can be surprising for users that for some > > requests they can receive -ENOMEM but not for quite-similar ones (e.g. > > with only some bits different). We should keep the principle of least > > astonishment in mind, but also the fact that the kernel implementation > > and the related required memory may change between two kernel versions > > for the exact same user request (because of Landlock implementation > > differences or other parts of the kernel). In summary, we should be > > careful to prevent users from being able to use a large amount of memory > > with only one syscall. This which would be an usability issue. > > > > > > > > > > > > > > Naive proposal: > > > > > > > > If this is an issue, how about we set a low limit to the number of > > > > families and the number of types that can be used in a single > > > > landlock_add_rule() invocation? (It tends to be easier to start with > > > > a restrictive API and open it up later than the other way around.) > > > > > > Looks like a good approach! Better to return with an error (which almost > > > always won't happen) and let the user to refactor the code than to > > > allow ruleset to eat an unexpected amount of memory. > > > > > > > > > > > For instance, > > > > > > > > * In the families field, at most 2 bits may be set. > > > > * In the types field, at most 2 bits may be set. > > > > > > Or we can say that rule can contain not more than 4 combinations of > > > family and type. > > > > > > BTW, 4 seems to be sufficient, at least for IP protocols. > > > > > > > > > > > In my understanding, the main use case of the bit vectors is that > > > > there is a short way to say "all IPv4+v6 stream+dgram sockets", but we > > > > do not know scenarios where much more than that is needed? With that, > > > > we would still keep people from accidentally allocating larger amounts > > > > of memory, while permitting the main use case. > > > > > > Agreed > > > > > > > > > > > Having independent limits for the family and type fields is a bit > > > > easier to understand and document than imposing a limit on the > > > > multiplication result. > > > > This is a safer approach that can indeed be documented, but it looks > > unintuitive and an arbitrary limitation. Here is another proposal: > > > > Let's ignore my previous suggestion to use bitfields for families and > > protocols. To keep it simple, we can get back to the initial struct but > > specifically handle the (i64)-1 value (which cannot be a family, > > protocol, nor a type): > > { > > .allowed_access = LANDLOCK_ACCESS_SOCKET_CREATE, > > .family = AF_INET, > > .type = SOCK_STREAM, > > .protocol = -1, > > }, > > > > This would read: deny socket creation except for AF_INET with > > SOCK_STREAM (and any protocol). > > > > Users might need to add several rules (e.g. one for AF_INET and another > > for AF_INET6) but that's OK. Unfortunately we cannot identify a TCP > > socket with only protocol = IPPROTO_TCP because protocol definitions > > are relative to network families. Specifying the protocol without the > > family should then return an error. > > > > Before rule could be loaded, users define how much they want to match a > > socket: at least the family, optionally the type, and if the type is > > also set then the protocol can also be set. These dependencies are > > required to transform this triplet to a key number, see below. > > > > A landlock_ruleset_attr.handled_socket_layers field would define how > > much we want to match a socket: > > - 1: family only > > - 2: family and type > > - 3: family, type, and protocol > > > > According to this ruleset's property, users will be allowed to fill the > > family, type, or protocol fields in landlock_socket_attr rules. If a > > socket layer is not handled, it should contain (i64)-1 for the kernel to > > detect misuse of the API. > > > > This enables us to get a key from this triplet: > > > > family_bits = 6; // 45 values for now > > type_bits = 3; // 7 values for now > > protocol_bits = 5; // 28 values for now > > > > // attr.* are the sanitized UAPI values, including -1 replaced with 0. > > // In this example, landlock_ruleset_attr.handled_socket_layers is 3, so > > // the key is composed of all the 3 properties. > > landlock_key.data = attr.family << (type_bits + protocol_bits) | > > attr.type << protocol_bits | attr.protocol; > > > > For each layer of restriction in a domain, we know how precise they > > define a socket (i.e. with how many "socket layers"). We can then look > > for at most 3 entries in the red-black tree: one with only the family, > > another with the family and the type, and potentially a third also > > including the protocol. Each key would have the same significant bits > > but with the lower bits masked according to each > > landlock_ruleset_attr.handled_socket_layers . Composing the related > > access masks according to the defined socket layers, we can create an > > array of struct access_masks for the request and then check if such > > request is allowed by the current domain. As for the currently stored > > data, we can also identify the domain layer that blocked the request > > (required for audit). > > I do not quite understand why we need socket_layers. Without it, > user can set (i64)(-1) to type or protocol whenever he wants. While > transforming triplet to a key we can replace (i64)(-1) with some > constant (e.g. (1 << type_bits - 1) for the type if type_bits = 8 and > (1 << protocol_bits - 1) for the protocol if protocol_bits = 16). We can indead add a virtual any/wildcard type and protocol, translated from user space's (i64)-1 . The downside is that for the same domain layer we would need to check 4 different keys for the same triplet (i.e. famility is always set, but type and protocol are optionals). With the socket_layers number, we only have to check for one key per domain layer. However, in the worse case with at least 3 domain layers having different socket_layers value, we would still need to check for 3 different keys. So, I think it's reasonable to get rid of the socket_layers as you suggested (while requiring the family to always be set). > > > > > With this design, each sandbox can define a socket as much as it wants. > > > > The downside is that we lost the bitfields and we need several calls to > > filter more complex sockets (e.g. 4 for UDP and TCP with IPv4 and IPv6), > > which looks OK compared to the required calls for filesystem access > > control. > > > > > > > > > > > > That being said, I am not a big fan of red-black trees for such simple > > > > > > integer lookups either, and I also think there should be something > > > > > > better if we make more use of the properties of the input ranges. The > > > > > > question is though whether you want to couple that to this socket type > > > > > > patch set, or rather do it in a follow up? (So far we have been doing > > > > > > fine with the red black trees, and we are already contemplating the > > > > > > possibility of changing these internal structures in [2]. We have > > > > > > also used RB trees for the "port" rules with a similar reasoning, > > > > > > IIRC.) > > > > > > > > > > I think it'll be better to have a separate series for [2] if the socket > > > > > restriction can be implemented without rbtree refactoring. > > > > > > > > Sounds good to me. 👍 > > > > > > > > [1] https://lore.kernel.org/all/20200707180955.53024-2-mic@xxxxxxxxxxx/ > > > > red-black trees are a good generic data structure for the current main > > use case (for dynamic rulesets and static domains), but we'll need to > > use more appropriate data structures. I think this should not be a > > blocker for this patch series. It will be required to match (port) > > ranges though (even if the use case seems limited), and in general for > > better performances. >