> -----Original Message----- > From: Dave Hansen [mailto:dave.hansen@xxxxxxxxx] > Sent: Friday, April 30, 2021 10:39 AM > To: Song Bao Hua (Barry Song) <song.bao.hua@xxxxxxxxxxxxx>; tiantao (H) > <tiantao6@xxxxxxxxxxxxx>; corbet@xxxxxxx; gregkh@xxxxxxxxxxxxxxxxxxx > Cc: linux-doc@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; Rafael J. > Wysocki <rafael@xxxxxxxxxx>; Peter Zijlstra <peterz@xxxxxxxxxxxxx>; Valentin > Schneider <valentin.schneider@xxxxxxx>; Dave Hansen > <dave.hansen@xxxxxxxxxxxxxxx>; Daniel Bristot de Oliveira <bristot@xxxxxxxxxx> > Subject: Re: [PATCH 1/2] CPU, NUMA topology ABIs: clarify the overflow issue > of sysfs pagebuf > > On 4/29/21 3:32 PM, Song Bao Hua (Barry Song) wrote: > > $ strace numactl --hardware 2>&1 | grep cpu > > openat(AT_FDCWD, "/sys/devices/system/cpu", > > O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3 > > openat(AT_FDCWD, "/sys/devices/system/node/node0/cpumap", O_RDONLY) = 3 > > openat(AT_FDCWD, "/sys/devices/system/node/node1/cpumap", O_RDONLY) = 3 > > openat(AT_FDCWD, "/sys/devices/system/node/node2/cpumap", O_RDONLY) = 3 > > openat(AT_FDCWD, "/sys/devices/system/node/node3/cpumap", O_RDONLY) = 3 > > > > If we move to binary, it means we have to change those applications. > > I thought Greg was saying to using a sysfs binary attribute using > something like like sysfs_create_bin_file(). Those don't have the > PAGE_SIZE limitation. But, there's also nothing to keep us from spewing > nice human-readable text via the "binary" file. > > We don't need to change the file format, just the internal kernel API > that we produce the files with. Sorry for waking-up the old thread. I am not sure how common a regular device_attribute will be larger than 4KB and have to work around by bin_attribute. But I wrote a prototype which can initially support large regular sysfs entry and be able to fill the entire buffer by only one show() entry. The other words to say, we don't need to call read() of bin_attribute multiple times for a large regular file. Right now, only read-only attribute is supported. Subject: [RFC PATCH] sysfs: support regular device attr which can be larger than PAGE_SIZE A regular sysfs ABI could be more than 4KB, right now, we are using bin_attribute to workaround and break this limit. A better solution would be supporting long device attribute. In this case, we will still be able to enjoy the advantages of buffering and seeking of seq file and only need to fill the entire buffer of sysfs entry once. Signed-off-by: Barry Song <song.bao.hua@xxxxxxxxxxxxx> --- drivers/base/core.c | 2 +- fs/seq_file.c | 22 ++++++++++++++++++++++ fs/sysfs/file.c | 40 +++++++++++++++++++++++++++++++++++++++- fs/sysfs/group.c | 4 ++-- include/linux/device.h | 20 ++++++++++++++++++++ include/linux/seq_file.h | 1 + include/linux/sysfs.h | 1 + 7 files changed, 86 insertions(+), 4 deletions(-) diff --git a/drivers/base/core.c b/drivers/base/core.c index cadcade65825..0cd4ed165154 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -2053,7 +2053,7 @@ static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr, if (dev_attr->show) ret = dev_attr->show(dev, dev_attr, buf); - if (ret >= (ssize_t)PAGE_SIZE) { + if (ret >= (ssize_t)PAGE_SIZE && !(attr->mode & SYSFS_LONGATTR)) { printk("dev_attr_show: %pS returned bad count\n", dev_attr->show); } diff --git a/fs/seq_file.c b/fs/seq_file.c index b117b212ef28..9054615f8f19 100644 --- a/fs/seq_file.c +++ b/fs/seq_file.c @@ -84,6 +84,28 @@ int seq_open(struct file *file, const struct seq_operations *op) } EXPORT_SYMBOL(seq_open); +/** + * seq_open_prealloc_buf - allow seq_file users to preallocate buf + * @file: file we initialize + * @size: the maximum size of the file + * + * apply to those scenerios single_open_size() doesn't apply. for + * example, while a regular sysfs entry is more than PAGE_SIZE; + * this will permit users to fill the entire buffer by calling + * device_attr show() only once + */ +int seq_open_prealloc_buf(struct file *file, unsigned long size) +{ + void *buf = seq_buf_alloc(size); + if (buf) + return -ENOMEM; + + ((struct seq_file *)file->private_data)->buf = buf; + ((struct seq_file *)file->private_data)->size = size; + + return 0; +} + static int traverse(struct seq_file *m, loff_t offset) { loff_t pos = 0; diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c index 9aefa7779b29..09ae12c2326c 100644 --- a/fs/sysfs/file.c +++ b/fs/sysfs/file.c @@ -16,6 +16,7 @@ #include <linux/mutex.h> #include <linux/seq_file.h> #include <linux/mm.h> +#include <linux/device.h> /* for device's longattr support */ #include "sysfs.h" @@ -202,6 +203,32 @@ void sysfs_notify(struct kobject *kobj, const char *dir, const char *attr) } EXPORT_SYMBOL_GPL(sysfs_notify); +static int sysfs_kf_longattr_seq_show(struct seq_file *sf, void *v) +{ + struct kernfs_open_file *of = sf->private; + struct kobject *kobj = of->kn->parent->priv; + const struct sysfs_ops *ops = sysfs_file_ops(of->kn); + ssize_t count; + char *buf; + + count = seq_get_buf(sf, &buf); + + if (ops->show) { + count = ops->show(kobj, of->kn->priv, buf); + if (count < 0) + return count; + } + + seq_commit(sf, count); + return 0; +} + +static int sysfs_longattr_open(struct kernfs_open_file *of) +{ + struct device_long_attribute *lattr = (struct device_long_attribute *)of->kn->priv; + return seq_open_prealloc_buf(of->file, lattr->size); +} + static const struct kernfs_ops sysfs_file_kfops_empty = { }; @@ -223,6 +250,11 @@ static const struct kernfs_ops sysfs_prealloc_kfops_ro = { .prealloc = true, }; +static struct kernfs_ops sysfs_longattr_kfops_ro = { + .open = sysfs_longattr_open, + .seq_show = sysfs_kf_longattr_seq_show, +}; + static const struct kernfs_ops sysfs_prealloc_kfops_wo = { .write = sysfs_kf_write, .prealloc = true, @@ -276,6 +308,8 @@ int sysfs_add_file_mode_ns(struct kernfs_node *parent, if (sysfs_ops->show && sysfs_ops->store) { if (mode & SYSFS_PREALLOC) ops = &sysfs_prealloc_kfops_rw; + else if (mode & SYSFS_LONGATTR) + ops = &sysfs_longattr_kfops_ro; else ops = &sysfs_file_kfops_rw; } else if (sysfs_ops->show) { @@ -291,7 +325,11 @@ int sysfs_add_file_mode_ns(struct kernfs_node *parent, } else ops = &sysfs_file_kfops_empty; - size = PAGE_SIZE; + if (mode & SYSFS_LONGATTR) { + size = ((struct device_long_attribute *)attr)->size; + } else { + size = PAGE_SIZE; + } } else { struct bin_attribute *battr = (void *)attr; diff --git a/fs/sysfs/group.c b/fs/sysfs/group.c index 64e6a6698935..2d80b05550d1 100644 --- a/fs/sysfs/group.c +++ b/fs/sysfs/group.c @@ -56,11 +56,11 @@ static int create_files(struct kernfs_node *parent, struct kobject *kobj, continue; } - WARN(mode & ~(SYSFS_PREALLOC | 0664), + WARN(mode & ~(SYSFS_LONGATTR | SYSFS_PREALLOC | 0664), "Attribute %s: Invalid permissions 0%o\n", (*attr)->name, mode); - mode &= SYSFS_PREALLOC | 0664; + mode &= SYSFS_LONGATTR | SYSFS_PREALLOC | 0664; error = sysfs_add_file_mode_ns(parent, *attr, false, mode, uid, gid, NULL); if (unlikely(error)) diff --git a/include/linux/device.h b/include/linux/device.h index 59940f1744c1..791a3d25f0bb 100644 --- a/include/linux/device.h +++ b/include/linux/device.h @@ -150,6 +150,26 @@ ssize_t device_store_bool(struct device *dev, struct device_attribute *attr, struct device_attribute dev_attr_##_name = \ __ATTR_IGNORE_LOCKDEP(_name, _mode, _show, _store) +/* + * for those devices whose attributes are larger than 4KB but still want + * to take the advantages of regular sysfs, like show() method is able to + * fill the entire buffer by one read operation + */ +struct device_long_attribute { + struct device_attribute attr; + size_t size; +}; + +#define __LONG_ATTR_RO(_name, _size) { \ + .attr.attr = {.name = __stringify(_name), \ + .mode = SYSFS_LONGATTR | 0444 }, \ + .attr.show = _name##_show, \ + .size = _size, \ +} + +#define LONG_ATTR_RO(_name, _size) \ +struct device_long_attribute dev_attr_##_name = __LONG_ATTR_RO(_name, _size) + int device_create_file(struct device *device, const struct device_attribute *entry); void device_remove_file(struct device *dev, diff --git a/include/linux/seq_file.h b/include/linux/seq_file.h index dd99569595fd..e7357fc91c1c 100644 --- a/include/linux/seq_file.h +++ b/include/linux/seq_file.h @@ -106,6 +106,7 @@ void seq_pad(struct seq_file *m, char c); char *mangle_path(char *s, const char *p, const char *esc); int seq_open(struct file *, const struct seq_operations *); +int seq_open_prealloc_buf(struct file *, unsigned long); ssize_t seq_read(struct file *, char __user *, size_t, loff_t *); ssize_t seq_read_iter(struct kiocb *iocb, struct iov_iter *iter); loff_t seq_lseek(struct file *, loff_t, int); diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h index a12556a4b93a..9198afd46fb0 100644 --- a/include/linux/sysfs.h +++ b/include/linux/sysfs.h @@ -97,6 +97,7 @@ struct attribute_group { */ #define SYSFS_PREALLOC 010000 +#define SYSFS_LONGATTR 020000 #define __ATTR(_name, _mode, _show, _store) { \ .attr = {.name = __stringify(_name), \ -- 2.25.1 very simple way to use it: Add some long attribute by: LONG_ATTR_RO(xxx, 2 * PAGE_SIZE); LONG_ATTR_RO(yyy, 2 * PAGE_SIZE); .... Then add xxx and yyy to attribute list as we usually do for normal device_attribute: struct attribute *attrs[] = { &dev_attr_xxx.attr.attr, &dev_attr_yyy.attr.attr, } Not quite sure if it is valuable. If it is yes, I will split the code and send a RFC patchset. Thanks Barry