Re: [PATCH v11 23/23] x86/resctrl: Introduce interface to modify assignment states of the groups

"Moger, Babu" <babu.moger@xxxxxxx> · Thu, 20 Feb 2025 14:57:31 -0600

Hi Dave,

On 2/20/25 09:21, Dave Martin wrote:
> Hi,
> 
> On Wed, Feb 19, 2025 at 06:34:42PM -0600, Moger, Babu wrote:
>> Hi Dave,
>>
>> On 2/19/2025 10:07 AM, Dave Martin wrote:
>>> Hi,
>>>
>>> On Wed, Jan 22, 2025 at 02:20:31PM -0600, Babu Moger wrote:
> 
>>> [...]
>>>
>>>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>> index 6e29827239e0..299839bcf23f 100644
>>>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>> @@ -1050,6 +1050,244 @@ static int resctrl_mbm_assign_control_show(struct kernfs_open_file *of,
>>>
>>> [...]
>>>
>>>> +static ssize_t resctrl_mbm_assign_control_write(struct kernfs_open_file *of,
>>>> +						char *buf, size_t nbytes, loff_t off)
>>>> +{
> 
> [...]
> 
>>>> +	while ((token = strsep(&buf, "\n")) != NULL) {
>>>> +		/*
>>>> +		 * The write command follows the following format:
>>>> +		 * “<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>”
>>>> +		 * Extract the CTRL_MON group.
>>>> +		 */
>>>> +		cmon_grp = strsep(&token, "/");
>>>> +
>>>
>>> As when reading this file, I think that the data can grow larger than a
>>> page and get split into multiple write() calls.
>>>
>>> I don't currently think the file needs to be redesigned, but there are
>>> some concerns about how userspace will work with it that need to be
>>> sorted out.
>>>
>>> Every monitoring group can contribute a line to this file:
>>>
>>> 	CTRL_GROUP / MON_GROUP / DOMAIN = [t][l] [ ; DOMAIN = [t][l] ]* LF
>>>
>>> so, 2 * (NAME_MAX + 1) + NUM_DOMAINS * 5 - 1 + 1
>>>
>>> NAME_MAX on Linux is 255, so with, say, up to 16 domains, that's about
>>> 600 bytes per monitoring group in the worst case.
>>>
>>> We don't need to have many control and monitoring groups for this to
>>> grow potentially over 4K.
>>>
>>>
>>> We could simply place a limit on how much userspace is allowed to write
>>> to this file in one go, although this restriction feels difficult for
>>> userspace to follow -- but maybe this is workable in the short term, on
>>> current systems (?)
>>>
>>> Otherwise, since we expect this interface to be written using scripting
>>> languages, I think we need to be prepared to accept fully-buffered
>>> I/O.  That means that the data may be cut at random places, not
>>> necessarily at newlines.  (For smaller files such as schemata this is
>>> not such an issue, since the whole file is likely to be small enough to
>>> fit into the default stdio buffers -- this is how sysfs gets away with
>>> it IIUC.)
>>>
>>> For fully-buffered I/O, we may have to cache an incomplete line in
>>> between write() calls.  If there is a dangling incomplete line when the
>>> file is closed then it is hard to tell userspace, because people often
>>> don't bother to check the return value of close(), fclose() etc.
>>> However, since it's an ABI violation for userspace to end this file
>>> with a partial line, I think it's sufficient to report that via
>>> last_cmd_status.  (Making close() return -EIO still seems a good idea
>>> though, just in case userspace is listening.)
>>
>> Seems like we can add a check in resctrl_mbm_assign_control_write() to
>> compare nbytes > PAGE_SIZE.
> 
> This might be a reasonable stopgap approach, if we are confident that the
> number of RMIDs and monitoring domains is small enough on known
> platforms that the problem is unlikely to be hit.  I can't really judge
> on this.
> 
>> But do we really need this? I have no way of testing this. Help me
>> understand.
> 
> It's easy to demonatrate this using the schemata file (which works in a
> similar way).  Open f in /sys/fs/resctrl/schemata, then:
> 
> 	int n = 0;
> 
> 	for (n = 0; n < 1000; n++)
> 		if (fputs("MB:0=100;1=100\n", f) == EOF)
> 			fprintf(stderr, "Failed on interation %d\n", n);
> 
> This will succeed a certain number of times (272, for me) and then fail
> when the stdio buffer for f overflows, triggering a write().
> 
> Putting an explicit fflush() after every fputs() call (or doing a
> setlinebuf(f) before the loop) makes it work.  But this is awkward and
> unexpected for the user, and doing the right thing from a scripting
> language may be tricky.
> 
> In this example I am doing something a bit artificial -- we don't
> officially say what happens when a pre-opened schemata file handle is
> reused in this way, AFAICT.  But for mbm_assign_control it is
> legitimate to write many lines, and we can hit this kind of problem.
> 
> 
> I'll leave it to others to judge whether we _need_ to fix this, but it
> feels like a problem waiting to happen.

Created the problem using this code using a "test" group.

include <stdio.h>
#include <errno.h>
#include <string.h>

int main()
{
        FILE *file;
        int n;

        file = fopen("/sys/fs/resctrl/info/L3_MON/mbm_assign_control", "w");

        if (file == NULL) {
                printf("Error opening file!\n");
                return 1;
        }

        printf("File opened successfully.\n");

        for (n = 0; n < 100; n++)
                if
(fputs("test//0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;9=tl;10=tl;11=tl\n", file) == EOF)
                        fprintf(stderr, "Failed on interation %d error
%s\n ", n, strerror(errno));

        if (fclose(file) == 0) {
                printf("File closed successfully.\n");
        } else {
                printf("Error closing file!\n");
        }
}


When the buffer overflow happens the newline will not be there. I have
added this error via rdt_last_cmd_puts. At least user knows there is an error.

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 484d6009869f..70a96976e3ab 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1250,8 +1252,10 @@ static ssize_t
resctrl_mbm_assign_control_write(struct kernfs_open_file *of,
        int ret;

        /* Valid input requires a trailing newline */
-       if (nbytes == 0 || buf[nbytes - 1] != '\n')
+       if (nbytes == 0 || buf[nbytes - 1] != '\n') {
+               rdt_last_cmd_puts("mbm_cntr_assign: buffer invalid\n");
                return -EINVAL;
+       }

        buf[nbytes - 1] = '\0';



I am open to other ideas to handle this case.


> 
> 
>> All these file operations go thru generic call kernfs_fop_write_iter().
>> Doesn't it take care of buffer check and overflow?
> 
> No, this is called for each iovec segment (where userspace used one of
> the iovec based I/O syscalls).  But there is no buffering or
> concatenation of the data read in: each segment gets passed down to the
> individual kernfs_file_operations write method for the file:
> 
> 	len = ops->write(of, buf, len, iocb->ki_pos)
> 
> calls down to
> 
> 	resctrl_mbm_assign_control_write(of, buf, len, iocb->ki_pos).
> 
> 
> I'll try to port my buffering hack on top of the series -- that should
> help to illustrate what I mean.
> 
> Cheers
> ---Dave
> 

-- 
Thanks
Babu Moger