Hi Dave, On 2/20/25 09:21, Dave Martin wrote: > Hi, > > On Wed, Feb 19, 2025 at 06:34:42PM -0600, Moger, Babu wrote: >> Hi Dave, >> >> On 2/19/2025 10:07 AM, Dave Martin wrote: >>> Hi, >>> >>> On Wed, Jan 22, 2025 at 02:20:31PM -0600, Babu Moger wrote: > >>> [...] >>> >>>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c >>>> index 6e29827239e0..299839bcf23f 100644 >>>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c >>>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c >>>> @@ -1050,6 +1050,244 @@ static int resctrl_mbm_assign_control_show(struct kernfs_open_file *of, >>> >>> [...] >>> >>>> +static ssize_t resctrl_mbm_assign_control_write(struct kernfs_open_file *of, >>>> + char *buf, size_t nbytes, loff_t off) >>>> +{ > > [...] > >>>> + while ((token = strsep(&buf, "\n")) != NULL) { >>>> + /* >>>> + * The write command follows the following format: >>>> + * “<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>” >>>> + * Extract the CTRL_MON group. >>>> + */ >>>> + cmon_grp = strsep(&token, "/"); >>>> + >>> >>> As when reading this file, I think that the data can grow larger than a >>> page and get split into multiple write() calls. >>> >>> I don't currently think the file needs to be redesigned, but there are >>> some concerns about how userspace will work with it that need to be >>> sorted out. >>> >>> Every monitoring group can contribute a line to this file: >>> >>> CTRL_GROUP / MON_GROUP / DOMAIN = [t][l] [ ; DOMAIN = [t][l] ]* LF >>> >>> so, 2 * (NAME_MAX + 1) + NUM_DOMAINS * 5 - 1 + 1 >>> >>> NAME_MAX on Linux is 255, so with, say, up to 16 domains, that's about >>> 600 bytes per monitoring group in the worst case. >>> >>> We don't need to have many control and monitoring groups for this to >>> grow potentially over 4K. >>> >>> >>> We could simply place a limit on how much userspace is allowed to write >>> to this file in one go, although this restriction feels difficult for >>> userspace to follow -- but maybe this is workable in the short term, on >>> current systems (?) >>> >>> Otherwise, since we expect this interface to be written using scripting >>> languages, I think we need to be prepared to accept fully-buffered >>> I/O. That means that the data may be cut at random places, not >>> necessarily at newlines. (For smaller files such as schemata this is >>> not such an issue, since the whole file is likely to be small enough to >>> fit into the default stdio buffers -- this is how sysfs gets away with >>> it IIUC.) >>> >>> For fully-buffered I/O, we may have to cache an incomplete line in >>> between write() calls. If there is a dangling incomplete line when the >>> file is closed then it is hard to tell userspace, because people often >>> don't bother to check the return value of close(), fclose() etc. >>> However, since it's an ABI violation for userspace to end this file >>> with a partial line, I think it's sufficient to report that via >>> last_cmd_status. (Making close() return -EIO still seems a good idea >>> though, just in case userspace is listening.) >> >> Seems like we can add a check in resctrl_mbm_assign_control_write() to >> compare nbytes > PAGE_SIZE. > > This might be a reasonable stopgap approach, if we are confident that the > number of RMIDs and monitoring domains is small enough on known > platforms that the problem is unlikely to be hit. I can't really judge > on this. > >> But do we really need this? I have no way of testing this. Help me >> understand. > > It's easy to demonatrate this using the schemata file (which works in a > similar way). Open f in /sys/fs/resctrl/schemata, then: > > int n = 0; > > for (n = 0; n < 1000; n++) > if (fputs("MB:0=100;1=100\n", f) == EOF) > fprintf(stderr, "Failed on interation %d\n", n); > > This will succeed a certain number of times (272, for me) and then fail > when the stdio buffer for f overflows, triggering a write(). > > Putting an explicit fflush() after every fputs() call (or doing a > setlinebuf(f) before the loop) makes it work. But this is awkward and > unexpected for the user, and doing the right thing from a scripting > language may be tricky. > > In this example I am doing something a bit artificial -- we don't > officially say what happens when a pre-opened schemata file handle is > reused in this way, AFAICT. But for mbm_assign_control it is > legitimate to write many lines, and we can hit this kind of problem. > > > I'll leave it to others to judge whether we _need_ to fix this, but it > feels like a problem waiting to happen. Created the problem using this code using a "test" group. include <stdio.h> #include <errno.h> #include <string.h> int main() { FILE *file; int n; file = fopen("/sys/fs/resctrl/info/L3_MON/mbm_assign_control", "w"); if (file == NULL) { printf("Error opening file!\n"); return 1; } printf("File opened successfully.\n"); for (n = 0; n < 100; n++) if (fputs("test//0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;9=tl;10=tl;11=tl\n", file) == EOF) fprintf(stderr, "Failed on interation %d error %s\n ", n, strerror(errno)); if (fclose(file) == 0) { printf("File closed successfully.\n"); } else { printf("Error closing file!\n"); } } When the buffer overflow happens the newline will not be there. I have added this error via rdt_last_cmd_puts. At least user knows there is an error. diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 484d6009869f..70a96976e3ab 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -1250,8 +1252,10 @@ static ssize_t resctrl_mbm_assign_control_write(struct kernfs_open_file *of, int ret; /* Valid input requires a trailing newline */ - if (nbytes == 0 || buf[nbytes - 1] != '\n') + if (nbytes == 0 || buf[nbytes - 1] != '\n') { + rdt_last_cmd_puts("mbm_cntr_assign: buffer invalid\n"); return -EINVAL; + } buf[nbytes - 1] = '\0'; I am open to other ideas to handle this case. > > >> All these file operations go thru generic call kernfs_fop_write_iter(). >> Doesn't it take care of buffer check and overflow? > > No, this is called for each iovec segment (where userspace used one of > the iovec based I/O syscalls). But there is no buffering or > concatenation of the data read in: each segment gets passed down to the > individual kernfs_file_operations write method for the file: > > len = ops->write(of, buf, len, iocb->ki_pos) > > calls down to > > resctrl_mbm_assign_control_write(of, buf, len, iocb->ki_pos). > > > I'll try to port my buffering hack on top of the series -- that should > help to illustrate what I mean. > > Cheers > ---Dave > -- Thanks Babu Moger