On Wed, Oct 03, 2012 at 03:15:26PM -0400, Jeff Moyer wrote:
> Kent Overstreet <koverstreet@xxxxxxxxxx> writes:
>
> > On Tue, Oct 02, 2012 at 01:41:17PM -0400, Jeff Moyer wrote:
> >> Kent Overstreet <koverstreet@xxxxxxxxxx> writes:
> >>
> >> > So, I and other people keep running into things where we really need to
> >> > add an interface to pass some auxiliary... stuff along with a pread() or
> >> > pwrite().
> >> >
> >> > A few examples:
> >> >
> >> > * IO scheduler hints. Some userspace program wants to, per IO, specify
> >> > either priorities or a cgroup - by specifying a cgroup you can have a
> >> > fileserver in userspace that makes use of cfq's per cgroup bandwidth
> >> > quotas.
> >>
> >> You can do this today by splitting I/O between processes and placing
> >> those processes in different cgroups. For io priority, there is
> >> ioprio_set, which incurs an extra system call, but can be used. Not
> >> elegant, but possible.
> >
> > Yes - those are things I'm trying to replace. Doing it that way is a
> > real pain, both as it's a lousy interface for this and it does impact
> > performance (ioprio_set doesn't really work too well with aio, too).
>
> ioprio_set works fine with aio, since the I/O is issued in the caller's
> context. Perhaps you're thinking of writeback I/O?

Until you want to issue different IOs with different priorities...

> >> > * Cache hints. For bcache and other things, userspace may want to specify
> >> > "this data should be cached", "this data should bypass the cache", etc.
> >>
> >> Please explain how you will differentiate this from posix_fadvise.
> >
> > Oh sorry, I think about SSD caching so much I forget to say that's what
> > I'm talking about. posix_fadvise is for the page cache, we want
> > something different for an SSD cache (IMO it'd be really ugly to use it
> > for both, and posix_fadvise() can't really specify everything we'd
> > want to for an SSD cache).
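For reference, a minimal sketch of what the existing page-cache hint looks like from userspace. This is an illustration only, not code from this thread: pread_drop_cache() is a hypothetical helper name, and using POSIX_FADV_DONTNEED as a rough "bypass the cache" hint is an assumption made for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read a range once, then tell the kernel the
 * cached pages for that range are no longer needed.
 *
 * Note what posix_fadvise() takes: an fd, a byte range and a single
 * advice value. Nothing in the call names which cache the hint is
 * aimed at.
 */
ssize_t pread_drop_cache(int fd, void *buf, size_t len, off_t off)
{
    assert(buf != NULL);

    ssize_t n = pread(fd, buf, len, off);
    if (n < 0)
        return -1;

    /*
     * Skip the hint for empty reads: a length of 0 would mean "up to
     * the end of the file". posix_fadvise() returns 0 or an error
     * number; the hint is advisory, so a failure is ignored here.
     */
    if (n > 0)
        (void)posix_fadvise(fd, off, (off_t)n, POSIX_FADV_DONTNEED);

    return n;
}
```

A caller would use pread_drop_cache(fd, buf, len, off) in place of pread() for data it does not expect to touch again. Whether an SSD cache below the page cache should see the same hint is exactly the open question here.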
>
> DESCRIPTION
>     Programs can use posix_fadvise() to announce an intention to
>     access file data in a specific pattern in the future, thus
>     allowing the kernel to perform appropriate optimizations.
>
> That description seems broad enough to include disk caches as well. You
> haven't exactly stated what's missing.

It _could_ work for SSD caches, but that doesn't mean you'd want it to -
it doesn't have any way of specifying which cache you want the hint to
apply to, and there are certainly circumstances under which you
_wouldn't_ want it to apply to both. Making it apply to SSD caches would
also silently change existing behavior and, as I mentioned, it isn't
sufficient for SSD caches anyway.

> >> > Hence, AIO attributes.
> >>
> >> *No.* Start with the non-AIO case first.
> >
> > Why? It is orthogonal to AIO (and I should make that clearer), but to do
> > it for sync IO we'd need new syscalls that take an extra argument so IMO
> > it's a bit easier to start with AIO.
> >
> > Might be worth implementing the sync interface sooner rather than later
> > just to discover any potential issues, I suppose.
>
> Looking back to preadv and pwritev, it was wrong to only add them to
> libaio (and that later got corrected). I'd just like to see things
> start out with the sync interfaces, since you'll get more eyes on the
> code (not everyone cares about aio) and that helps to work out any
> interface issues.

I agree we don't want to leave out sync versions, but honestly this
stuff is more useful with AIO and that's the easier place to start.

> > It's not possible in general - consider stacking block devices, and
> > attrs that are supported only by specific block drivers. I.e. if you've
> > got lvm on top of bcache or bcache on top of md, we can pass the attr
> > down with the IO but we can't determine ahead of time, in general, where
> > the IO is going to go.
>
> If the io stack is static (meaning you set up a device once, then open it
> and do io to it, and it doesn't change while you're doing io), how would
> you not know where the IO is going to go?

With something like dm, md or bcache you've got multiple underlying
devices, and which of them a given IO ends up on is not something you
can, in general, predict ahead of time.

> > But that probably isn't true for most attrs so it probably would be a
> > good idea to have an interface for querying what's supported, and even
> > for device specific ones you could query what a device supports.
>
> OK.
>
> >> > One could imagine sticking the return in the attribute itself, but I
> >> > don't want to do this. For some things (checksums), the attribute will
> >> > contain a pointer to a buffer - that's fine. But I don't want the
> >> > attributes themselves to be writeable.
> >>
> >> One could imagine that attributes don't return anything, because, well,
> >> they're properties of something else, and properties don't return
> >> anything.
> >
> > With a strict definition of attribute, yeah. One of the real use cases
> > we have for this is per IO timings, for aio - right now we've got an
> > interface for the kernel to tell userspace how long a syscall took
> > (don't think it's upstream yet - Paul's been behind that stuff), but it
> > only really makes sense with synchronous syscalls.
>
> Something beyond recording the time spent in the kernel? Paul who? I
> agree the per io timing for aio may be coarse-grained today (you can
> time the difference between io_submit returning and the event being
> returned by io_getevents, but that says nothing of when the io was
> issued to the block layer). I'm curious to know exactly what
> granularity you want here, and what an application would do with that
> information. You can currently access a whole lot of detail of the io
> path through blktrace, but that is not easily done from within an
> application.
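The userspace-only measurement described above (the interval between io_submit() returning and io_getevents() handing back the event) can be sketched roughly as follows. This is an illustration under stated assumptions, not code from this thread: timed_aio_pread() and now_ns() are hypothetical names, and the sketch calls the raw Linux AIO syscalls through syscall(2) rather than through libaio.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/aio_abi.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock reading in nanoseconds. */
static long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Hypothetical helper: submit one read through the raw AIO syscalls and
 * time it from userspace. Returns the completion result (bytes read, or
 * a negative errno), or -1 if submission or event collection failed.
 */
long long timed_aio_pread(int fd, void *buf, size_t len, long long off,
                          long long *elapsed_ns)
{
    aio_context_t ctx = 0;          /* must be zero before io_setup() */
    struct iocb cb;
    struct iocb *list[1] = { &cb };
    struct io_event ev;
    long long result = -1;

    assert(buf != NULL && elapsed_ns != NULL);

    if (syscall(SYS_io_setup, 1L, &ctx) < 0)
        return -1;

    memset(&cb, 0, sizeof(cb));
    cb.aio_lio_opcode = IOCB_CMD_PREAD;
    cb.aio_fildes = (uint32_t)fd;
    cb.aio_buf = (uint64_t)(uintptr_t)buf;
    cb.aio_nbytes = len;
    cb.aio_offset = off;

    if (syscall(SYS_io_submit, ctx, 1L, list) == 1) {
        long long t0 = now_ns();    /* io_submit() has returned */
        if (syscall(SYS_io_getevents, ctx, 1L, 1L, &ev, NULL) == 1) {
            *elapsed_ns = now_ns() - t0;    /* event delivered */
            result = ev.res;
        }
    }

    syscall(SYS_io_destroy, ctx);
    return result;
}
```

The number this produces says nothing about when the IO actually reached the block layer. For buffered files the read will typically have completed inside io_submit() already, so the interval is only really informative with O_DIRECT.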

Paul Turner, scheduler guy. Believe it's both syscall time and IO time
(i.e. what you'd get from blktrace). It's basically used for visibility
in filesystem type stuff, for monitoring latency - rpc latency isn't
enough, you really need to know why things are slow and that could be as
simple as a disk going bad.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html