of course failed to attach, remedied in this email :)

On 8/3/23, Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
> On 8/3/23, Christian Brauner <brauner@xxxxxxxxxx> wrote:
>> On Thu, Aug 03, 2023 at 11:53:11AM +0200, Mateusz Guzik wrote:
>>> On Mon, Jul 24, 2023 at 09:59:15AM -0700, Linus Torvalds wrote:
>>> > I really hate making the traditional unix single-threaded file
>>> > descriptor case take that lock.
>>> >
>>> > Maybe it doesn't matter. Obviously it can't have contention, and your
>>> > patch in that sense is pretty benign.
>>> >
>>> > But locking is just fundamentally expensive in the first place, and it
>>> > annoys me that I never realized that pidfd_getfd() did that thing that
>>> > I knew was broken for /proc.
>>> >
>>>
>>> So I got curious what the impact is and checked on quite a modern CPU
>>> (Sapphire Rapids), so nobody can claim it's some old yeller and that
>>> atomics are nowhere near as expensive on modern uarchs.
>>>
>>> I used read1_processes from will-it-scale -- it does 4KB reads at a
>>> time over a 1MB file and dodges refing the file, but it does not dodge
>>> the lock with the patch at hand.
>>>
>>> In short, I got a drop of about 5% (~5778843 -> ~5500871 ops/s).
>>>
>>> The kernel was patched with a toggle to force or elide the proposed
>>> mandatory locking, like so:
>>> @@ -1042,8 +1044,10 @@ unsigned long __fdget_pos(unsigned int fd)
>>>  	struct file *file = (struct file *)(v & ~3);
>>>
>>>  	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
>>> -		v |= FDPUT_POS_UNLOCK;
>>> -		mutex_lock(&file->f_pos_lock);
>>> +		if (file_count(file) > 1 || fdget_pos_mutex) {
>>> +			v |= FDPUT_POS_UNLOCK;
>>> +			mutex_lock(&file->f_pos_lock);
>>> +		}
>>>  	}
>>>  	return v;
>>>  }
>>>
>>> I got rather unstable single-threaded perf, going up and down several %
>>> between runs; I don't know yet what that's about. But toggling back and
>>
>> We've had the whole lkp intel performance testsuite run on this for a
>> long time to see whether there are any performance regressions that aren't
>> in the noise. This includes all of will-it-scale. Also with a focus on
>> single-thread performance. So I'm a little skeptical about the
>> reliability of manual performance runs in comparison.
>>
>
> I did not merely post numbers claiming a difference, I justified it
> with profile output, prominently showing the mutex lock/unlock pair. I
> genuinely think this puts the ball in your court.
>
> One potential factor is simply the CPU -- maybe whatever was used in your
> bench is older than the CPU I used and, for example, is more negatively
> affected by mitigations like retpoline et al. I intentionally used
> something very high end, specifically: Intel(R) Xeon(R) Platinum 8470N
>
> Another potential factor is discrepancies in kernel config.
>
> I found one here:
> https://lore.kernel.org/oe-lkp/20230802103152.ae3s43z6yjkhnkee@quack3/T/#t
> I'm guessing it is the same as the one used in Intel tests. In there
> I see:
> CONFIG_RANDOMIZE_KSTACK_OFFSET=y
>
> This is implemented with rdtsc, which comes with a massive premium to
> syscall cost. Also note it does not *fix* CPU bugs or whatever, it is
> merely a hardening measure that shuffles the stack pointer (way
> overpriced btw). Which is a long way of saying that I sanity-checked
> syscall perf on my box with getppid and had to disable this option (AFAIR
> it bumped the rate from about 11 mln to 15 mln ops/s), and that it is
> fine to do so.
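
(Side note: the getppid sanity check was nothing fancy -- a tight loop over
the raw syscall, roughly along the lines below. Treat it as an illustrative
sketch rather than the exact harness; the iteration count is arbitrary.)

#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
	/* hammer a trivial syscall to estimate per-syscall overhead */
	const long iters = 20 * 1000 * 1000;
	struct timespec a, b;

	clock_gettime(CLOCK_MONOTONIC, &a);
	for (long i = 0; i < iters; i++)
		syscall(SYS_getppid);
	clock_gettime(CLOCK_MONOTONIC, &b);

	double secs = (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
	printf("%.0f getppid ops/s\n", iters / secs);
	return 0;
}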
>
> I could easily see how more things like this combined with older CPUs
> would make the mutex trip small enough not to worry about, in that
> setting. I don't think this is the same as the cost of the change
> being negligible in general. I can't stress enough that my config retains
> retpoline and other mitigations for actual bugs, so it's not like I
> cheated the system.
>
> I'm attaching my config for reference. It is what's present in Debian
> + a bunch of stuff disabled.
>
>> If Linus thinks it matters then I think we should only take the
>> f_pos_lock unconditionally on directory fds as this is where it really
>> matters. For read/write we're only doing it because of posix anyway and
>> using pidfd_getfd() is squarely out of posix anyway. So we can document
>> that using pidfd_getfd() on a regular file descriptor outside of a
>> seccomp-stopped task is not recommended.
>>
>
> In the stock kernel the code is:
>
> 	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
> 		if (file_count(file) > 1) {
> 			v |= FDPUT_POS_UNLOCK;
> 			mutex_lock(&file->f_pos_lock);
> 		}
> 	}
>
> After pidfd_getfd succeeds the count is > 1, so the thing to worry
> about is racing with the thread calling fdget_pos, or making sure it
> has stopped using the fd, but only the first time around. Perhaps there is
> a way to wait it out (as in, guarantee it is not using the fd *or*
> has already seen the bumped refcount) which would be perfectly
> acceptable for pidfd consumers? Note the thread may be stuck in an
> uninterruptible, indefinite sleep somewhere while using it without the
> lock. On the other hand, if such a thing can happen with the f_pos_lock
> mutex held, the pidfd consumer running into the same codepath with the
> lock would be in the same spot, so perhaps this is not so bad?
>
> Note this is definitely hackable just for fun without pessimizing
> fdget_pos much, but off hand I don't have anything reasonable for
> real-world use. I'm going to sleep on it.
>
> --
> Mateusz Guzik <mjguzik gmail.com>
>

--
Mateusz Guzik <mjguzik gmail.com>
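
PS: for anyone who wants to poke at the refcount angle from userspace, the
sequence boils down to pidfd_open() + pidfd_getfd(), roughly as in the sketch
below -- untested, error handling omitted, the fallback syscall numbers are
the x86_64 ones, and the caller needs ptrace-attach level permissions over
the target:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/syscall.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434	/* x86_64 */
#endif
#ifndef SYS_pidfd_getfd
#define SYS_pidfd_getfd 438	/* x86_64 */
#endif

int main(int argc, char **argv)
{
	if (argc < 3)
		return 1;

	pid_t pid = atoi(argv[1]);	/* target process */
	int targetfd = atoi(argv[2]);	/* fd number inside the target */

	/* grab a handle on the target process */
	int pidfd = syscall(SYS_pidfd_open, pid, 0);

	/*
	 * Dup the target's fd into our table; from here on file_count()
	 * on the underlying struct file is > 1, which is what makes the
	 * stock __fdget_pos take f_pos_lock for FMODE_ATOMIC_POS files.
	 */
	int fd = syscall(SYS_pidfd_getfd, pidfd, targetfd, 0);

	printf("got fd %d via pidfd %d\n", fd, pidfd);
	return 0;
}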
Attachment:
config
Description: Binary data