On Wed, 15 Jun 2022 at 19:43, T.J. Mercier <tjmercier@xxxxxxxxxx> wrote: > > On Wed, Jun 1, 2022 at 5:40 AM Daniel Vetter <daniel@xxxxxxxx> wrote: > > > > On Mon, May 30, 2022 at 08:12:16AM +0200, Christian König wrote: > > > Am 25.05.22 um 23:05 schrieb T.J. Mercier: > > > > On Wed, May 25, 2022 at 7:38 AM Daniel Vetter <daniel@xxxxxxxx> wrote: > > > > > On Tue, May 17, 2022 at 08:13:24AM +0200, Greg Kroah-Hartman wrote: > > > > > > On Mon, May 16, 2022 at 05:08:05PM -0700, T.J. Mercier wrote: > > > > > > > On Mon, May 16, 2022 at 12:21 PM Christian König > > > > > > > <christian.koenig@xxxxxxx> wrote: > > > > > > > > Am 16.05.22 um 20:08 schrieb T.J. Mercier: > > > > > > > > > On Mon, May 16, 2022 at 10:20 AM Christian König > > > > > > > > > <christian.koenig@xxxxxxx> wrote: > > > > > > > > > > Am 16.05.22 um 19:13 schrieb T.J. Mercier: > > > > > > > > > > > Recently, we noticed an issue where a process went into direct reclaim > > > > > > > > > > > while holding the kernfs rw semaphore for sysfs in write (exclusive) > > > > > > > > > > > mode. This caused processes who were doing DMA-BUF exports and releases > > > > > > > > > > > to go into uninterruptible sleep since they needed to acquire the same > > > > > > > > > > > semaphore for the DMA-BUF sysfs entry creation/deletion. In order to avoid > > > > > > > > > > > blocking DMA-BUF export for an indeterminate amount of time while > > > > > > > > > > > another process is holding the sysfs rw semaphore in exclusive mode, > > > > > > > > > > > this patch moves the per-buffer sysfs file creation to the default work > > > > > > > > > > > queue. Note that this can lead to a short-term inaccuracy in the dmabuf > > > > > > > > > > > sysfs statistics, but this is a tradeoff to prevent the hot path from > > > > > > > > > > > being blocked. A work_struct is added to dma_buf to achieve this, but as > > > > > > > > > > > it is unioned with the kobject in the sysfs_entry, dma_buf does not > > > > > > > > > > > increase in size. > > > > > > > > > > I'm still not very keen of this approach as it strongly feels like we > > > > > > > > > > are working around shortcoming somewhere else. > > > > > > > > > > > > > > > > > > > My read of the thread for the last version is that we're running into > > > > > > > > > a situation where sysfs is getting used for something it wasn't > > > > > > > > > originally intended for, but we're also stuck with this sysfs > > > > > > > > > functionality for dmabufs. > > > > > > > > > > > > > > > > > > > > Fixes: bdb8d06dfefd ("dmabuf: Add the capability to expose DMA-BUF stats in sysfs") > > > > > > > > > > > Originally-by: Hridya Valsaraju <hridya@xxxxxxxxxx> > > > > > > > > > > > Signed-off-by: T.J. Mercier <tjmercier@xxxxxxxxxx> > > > > > > > > > > > > > > > > > > > > > > --- > > > > > > > > > > > See the originally submitted patch by Hridya Valsaraju here: > > > > > > > > > > > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.org%2Flkml%2F2022%2F1%2F4%2F1066&data=05%7C01%7Cchristian.koenig%40amd.com%7C8f00afd44b9744c45f5708da3e926503%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637891095771223650%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=pubWqUyqhCWpXHhJHsoqarc3GLtB6IFB1rhgfsL4a1M%3D&reserved=0 > > > > > > > > > > > > > > > > > > > > > > v2 changes: > > > > > > > > > > > - Defer only sysfs creation instead of creation and teardown per > > > > > > > > > > > Christian König > > > > > > > > > > > > > > > > > > > > > > - Use a work queue instead of a kthread for deferred work per > > > > > > > > > > > Christian König > > > > > > > > > > > --- > > > > > > > > > > > drivers/dma-buf/dma-buf-sysfs-stats.c | 56 ++++++++++++++++++++------- > > > > > > > > > > > include/linux/dma-buf.h | 14 ++++++- > > > > > > > > > > > 2 files changed, 54 insertions(+), 16 deletions(-) > > > > > > > > > > > > > > > > > > > > > > diff --git a/drivers/dma-buf/dma-buf-sysfs-stats.c b/drivers/dma-buf/dma-buf-sysfs-stats.c > > > > > > > > > > > index 2bba0babcb62..67b0a298291c 100644 > > > > > > > > > > > --- a/drivers/dma-buf/dma-buf-sysfs-stats.c > > > > > > > > > > > +++ b/drivers/dma-buf/dma-buf-sysfs-stats.c > > > > > > > > > > > @@ -11,6 +11,7 @@ > > > > > > > > > > > #include <linux/printk.h> > > > > > > > > > > > #include <linux/slab.h> > > > > > > > > > > > #include <linux/sysfs.h> > > > > > > > > > > > +#include <linux/workqueue.h> > > > > > > > > > > > > > > > > > > > > > > #include "dma-buf-sysfs-stats.h" > > > > > > > > > > > > > > > > > > > > > > @@ -168,10 +169,46 @@ void dma_buf_uninit_sysfs_statistics(void) > > > > > > > > > > > kset_unregister(dma_buf_stats_kset); > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > +static void sysfs_add_workfn(struct work_struct *work) > > > > > > > > > > > +{ > > > > > > > > > > > + struct dma_buf_sysfs_entry *sysfs_entry = > > > > > > > > > > > + container_of(work, struct dma_buf_sysfs_entry, sysfs_add_work); > > > > > > > > > > > + struct dma_buf *dmabuf = sysfs_entry->dmabuf; > > > > > > > > > > > + > > > > > > > > > > > + /* > > > > > > > > > > > + * A dmabuf is ref-counted via its file member. If this handler holds the only > > > > > > > > > > > + * reference to the dmabuf, there is no need for sysfs kobject creation. This is an > > > > > > > > > > > + * optimization and a race; when the reference count drops to 1 immediately after > > > > > > > > > > > + * this check it is not harmful as the sysfs entry will still get cleaned up in > > > > > > > > > > > + * dma_buf_stats_teardown, which won't get called until the final dmabuf reference > > > > > > > > > > > + * is released, and that can't happen until the end of this function. > > > > > > > > > > > + */ > > > > > > > > > > > + if (file_count(dmabuf->file) > 1) { > > > > > > > > > > Please completely drop that. I see absolutely no justification for this > > > > > > > > > > additional complexity. > > > > > > > > > > > > > > > > > > > This case gets hit around 5% of the time in my testing so the else is > > > > > > > > > not a completely unused branch. > > > > > > > > Well I can only repeat myself: This means that your userspace is > > > > > > > > severely broken! > > > > > > > > > > > > > > > > DMA-buf are meant to be long living objects > > > > > > > This patch addresses export *latency* regardless of how long-lived the > > > > > > > object is. Even a single, long-lived export will benefit from this > > > > > > > change if it would otherwise be blocked on adding an object to sysfs. > > > > > > > I think attempting to improve this latency still has merit. > > > > > > Fixing the latency is nice, but as it's just pushing the needed work off > > > > > > to another code path, it will take longer overall for the sysfs stuff to > > > > > > be ready for userspace to see. > > > > > > > > > > > > Perhaps we need to step back and understand what this code is supposed > > > > > > to be doing. As I recall, it was created because some systems do not > > > > > > allow debugfs anymore, and they wanted the debugging information that > > > > > > the dmabuf code was exposing to debugfs on a "normal" system. Moving > > > > > > that logic to sysfs made sense, but now I am wondering why we didn't see > > > > > > these issues in the debugfs code previously? > > > > > > > > > > > > Perhaps we should go just one step further and make a misc device node > > > > > > for dmabug debugging information to be in and just have userspace > > > > > > poll/read on the device node and we spit the info that used to be in > > > > > > debugfs out through that? That way this only affects systems when they > > > > > > want to read the information and not normal code paths? Yeah that's a > > > > > > hack, but this whole thing feels overly complex now. > > > > > A bit late on this discussion, but just wanted to add my +1 that we should > > > > > either redesign the uapi, or fix the underlying latency issue in sysfs, or > > > > > whatever else is deemed the proper fix. > > > > > > > > > > Making uapi interfaces async in ways that userspace can't discover is a > > > > > hack that we really shouldn't consider, at least for upstream. All kinds > > > > > of hilarious things might start to happen when an object exists, but not > > > > > consistently in all the places where it should be visible. There's a > > > > > reason sysfs has all these neat property groups so that absolutely > > > > > everything is added atomically. Doing stuff later on just because usually > > > > > no one notices that the illusion falls apart isn't great. > > > > > > > > > > Unfortunately I don't have a clear idea here what would be the right > > > > > solution :-/ One idea perhaps: Should we dynamically enumerate the objects > > > > > when userspace does a readdir()? That's absolutely not how sysfs works, > > > > > but procfs works like that and there's discussions going around about > > > > > moving these optimizations to other kernfs implementations. At least there > > > > > was a recent lwn article on this: > > > > > > > > > > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flwn.net%2FArticles%2F895111%2F&data=05%7C01%7Cchristian.koenig%40amd.com%7C8f00afd44b9744c45f5708da3e926503%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637891095771223650%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=Q58OZi79vmKMCZLL0pY7NniIW6hmSqyWjlEaZgqzYtM%3D&reserved=0 > > > > > > > > > > But that would be serious amounts of work I guess. > > > > > -Daniel > > > > > -- > > > > > Daniel Vetter" > > > > > Software Engineer, Intel Corporation > > > > > https://nam11.safelinks.protection.outlook.com/?url=http%3A%2F%2Fblog.ffwll.ch%2F&data=05%7C01%7Cchristian.koenig%40amd.com%7C8f00afd44b9744c45f5708da3e926503%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637891095771223650%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=pOIl5yszzak4TPqjBYyL0mHjj%2F1nYRfNJbNPQTXBhbA%3D&reserved=0 > > > > Hi Daniel, > > > > > > > > My team has been discussing this, and I think we're approaching a > > > > consensus on a way forward that involves deprecating the existing > > > > uapi. > > > > > > > > I actually proposed a similar (but less elegant) idea to the readdir() > > > > one. A new "dump_dmabuf_data" sysfs file that a user would write to, > > > > which would cause a one-time creation of the per-buffer files. These > > > > could be left around to become stale, or get cleaned up after first > > > > read. However to me it seems impossible to correctly deal with > > > > multiple simultaneous users with this technique. We're not currently > > > > planning to pursue this. > > > > > > > > Thanks for the link to the article. That on-demand creation sounds > > > > like it would allow us to keep the existing structure and files for > > > > DMA-buf, assuming there is not a similar lock contention issue when > > > > adding a new node to the virtual tree. :) > > > > > > I think that this on demand creation is even worse than the existing ideas, > > > but if you can get Greg to accept the required sysfs changes than that's at > > > least outside of my maintenance domain any more :) > > > > I think doing it cleanly in sysfs without changing the current uapi sounds > > pretty good. The hand-rolled "touch a magic file to force update all the > > files into existence" sounds like a horror show to me :-) Plus I don't see > > how you can actually avoid the locking pain with that since once the files > > are created, you have to remove them synchronously again, plus you get to > > deal with races on top (and likely some locking inversion fun on top). > > -Daniel > > Yes, lots of reasons not to pursue that angle. :) > > So I asked Greg about modifying sysfs for this purpose, and he's quite > convincing that it's not the right way to approach this problem. So > that leaves deprecating the per-buffer statistics. It looks like we > can maintain the userspace functionality that depended on this by > replacing it with a single sysfs node for "dmabuf_total_size" along > with adding exporter information to procfs (via Kalesh's path patch > [1]). However there is a separate effort to account dmabufs from heaps > with cgroups [2], so if I'm able to make that work then we would not > need the new "dmabuf_total_size" file since this same information > could be obtained from the root cgroup instead. So I'd like to try > that first before falling back to adding a new dmabuf_total_size file. Sounds like a plan. -Daniel > > [1] https://lore.kernel.org/lkml/875yll1fp1.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxx/T/#m43a3d345f821a02babd4ebb1f4257982d027c9e4 > [2] https://lore.kernel.org/lkml/CABdmKX1xvm87WMEDkMc9Aye46E4zv1-scenwgaRxHesrOCsaYg@xxxxxxxxxxxxxx/T/#mb82eca5438a4ea7ab157ab9cd7f044cbcfeb5509 > > > > > > -- > > Daniel Vetter > > Software Engineer, Intel Corporation > > http://blog.ffwll.ch -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch