On Wed, Dec 06, 2023 at 08:31:54AM -0700, Keith Busch wrote: > On Wed, Dec 06, 2023 at 11:08:17AM +0800, Ming Lei wrote: > > On Tue, Dec 05, 2023 at 08:45:10AM -0700, Keith Busch wrote: > > > > > > It's not necessarily about the read/write passthrough commands. It's for > > > commands we don't know about today. Do we want to revisit this problem > > > every time spec provides another operation? Are vendor unique solutions > > > not allowed to get high IOPs access? > > > > Except for read/write, what other commands are performance sensitive? > > It varies by command set, but this question is irrelevant. I'm not > interested in gatekeeping the fast path. IMO, it doesn't make sense to run such optimization for commands which aren't performance sensitive. > > > > Secondly, some people have rediscovered you can abuse this interface to > > > corrupt kernel memory, so there are considerations to restricting this > > > > Just wondering why ADMIN won't corrupt kernel memory, and only normal > > user can, looks it is kernel bug instead of permission related issue. > > Admin can corrupt memory as easily as a normal user through this > interface. We just don't want such capabilities to be available to > regular users. > > And it's a user bug: user told the kernel to map buffer of size X, but > the device transfers size Y into it. Kernel can't do anything about that > (other than remove the interface, but such an action will break many > existing users) because we fundamentally do not know the true transfer > size of a random command. Many NVMe commands don't explicitly encode > transfer lengths, so disagreement between host and device on implicit > lengths risk corruption. It's a protocol "feature". Got it, thanks for the explanation, and looks one big defect of NVMe protocol or the device implementation. > > > > to CAP_SYS_ADMIN anyway, so there's no cheap check available today if we > > > have to go that route. > > > > If capable(CAP_SYS_ADMIN) is really slow, I am wondering why not > > optimize it in task_struct? > > That's an interesting point to look into. I was hoping to not touch such > a common struct, but I'm open to all options. capability is per-thread, and it is updated in current process/pthread, so the correct place to cache this info is 'task_struct'. Thanks, Ming