On Wed, 13 Nov 2024 00:13:13 +0000 Michael Pratt <mcpratt@xxxxx> wrote: > > > > Why is stable Cc'd? > > > > I believe this should be backported, if accepted, > so that the behavior between kernel versions is matching. That's not the purpose of stable. In fact, I would argue that it's the opposite of what stable is for. A stable kernel should *not* change behavior as that can cause regressions. If you want the newest behavior, then you should use the newest kernels. > I can do: > > $ cat /proc/$$/sched > > and see the 120 without needing interpretation > due to it being represented in a different way. True it is exposed via files, but wouldn't this be the first change to make it visible via a system call? > > > > That said, you are worried about the race of spawning a new task and > > setting its nice value because the new task may have exited. What about > > using pidfd? Create a task returning the pidfd and use that to set its nice > > value. > > I read a little about pidfd, but I'm not seeing the exact connection here, > perhaps it will reduce the race condition but it cannot eliminate it as far as I see. > For example, I am not finding a function that uses it to adjust niceness. We can always add a system call do to that ;-) In fact, there's a lot of system calls that need to be converted to use pidfd over pid. > > It's not that the "exit before modify" race condition is the only concern, > it's just one of the less obvious factors making up my rationale for this change. > I'm also concerned with efficiency. Why do we need to call another syscall > if the syscall we are already in can handle it? > > Personally, I find it strange that in sched_setscheduler() > the policy can be changed but not the priority, > when there is a standardized function dedicated to just that. My concern is the man page that has (in Debian): $ man sched_setscheduler [..] Currently, Linux supports the following "normal" (i.e., non-real-time) scheduling policies as values that may be specified in policy: SCHED_OTHER the standard round-robin time-sharing policy; SCHED_BATCH for "batch" style execution of processes; and SCHED_IDLE for running very low priority background jobs. For each of the above policies, param->sched_priority must be 0. Where we already document that the sched_priority "must be 0". > > The difference between RT and normal processes > is simply that for normal processes, we use "niceness" instead, > so this patch simply translates the priority to "niceness", > which cannot be expressed separately with the relevant POSIX functions. I agree that POSIX has never been that great, but instead of modifying an existing documented system call to do something that is documented not to do, I believe we should either use other existing system calls or make a new one. -- Steve