> +/*
> + * You can control how the buffer in userspace is filled with this mode
> + * parameters:

I agree that we don't have any good mechanisms for looking at the page
cache from userspace.  I've hacked some things up using mincore() and
they weren't pretty, so I welcome _something_ like this.

But, is this trying to do too many things at once?  Do we have solid use
cases spelled out for each of these modes?  Have we thought out how they
will be used in practice?

The biggest question for me, though, is whether we want to start
designing these per-page interfaces to consider different page sizes, or
whether we're going to just continue to pretend that the entire world is
4k pages.  Using FINCORE_BMAP on 1GB hugetlbfs files would be a bit
silly, for instance.

> + * - FINCORE_BMAP:
> + *     the page status is returned in a vector of bytes.
> + *     The least significant bit of each byte is 1 if the referenced page
> + *     is in memory, otherwise it is zero.

I know this is consistent with mincore(), but it did always bother me
that mincore() was so sparse.  Seems like it is wasting 7/8 of its bits.

> + * - FINCORE_PGOFF:
> + *     if this flag is set, fincore() doesn't store any information about
> + *     holes. Instead each records per page has the entry of page offset,
> + *     using 8 bytes. This mode is useful if we handle a large file and
> + *     only few pages are on memory.

This bothers me a bit.  How would someone know how sparse a file was
before calling this?  If it's not sparse, and they use this, they'll end
up using 8x the memory they would have using FINCORE_BMAP.  If it *is*
sparse, and they use FINCORE_BMAP, they will either waste tons of memory
on buffers, or have to make a ton of calls.

I guess this could also be used to do *searches*, which would let you
search out holes.  Let's say you have a 2TB file.  You could call this
with a buffer size of 1 entry and do searches, say 0->1TB.  If you get
your one entry back, you know it's not completely sparse.
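To put rough numbers on the FINCORE_BMAP vs. FINCORE_PGOFF trade-off
above, here is a small sketch.  It assumes 4k pages and the entry sizes
quoted from the patch (1 byte per page for BMAP, 8 bytes per resident
page for PGOFF).  The helper names are made up for illustration, and the
packed one-bit-per-page layout is hypothetical; it is not something the
patch offers.

```c
#include <assert.h>
#include <stdint.h>

/* Number of pages covering file_bytes, rounding up. */
uint64_t pages_in(uint64_t file_bytes, uint64_t page_size)
{
	assert(page_size > 0);
	return (file_bytes + page_size - 1) / page_size;
}

/* FINCORE_BMAP as quoted: one status byte per page in the range. */
uint64_t bmap_bytes(uint64_t npages)
{
	return npages;
}

/* Hypothetical packed layout: one bit per page (not in the patch). */
uint64_t packed_bytes(uint64_t npages)
{
	return (npages + 7) / 8;
}

/* FINCORE_PGOFF as quoted: one 8-byte offset per resident page. */
uint64_t pgoff_bytes(uint64_t npages, uint64_t resident)
{
	assert(resident <= npages);
	return resident * 8;
}

/* PGOFF only needs less buffer than BMAP when under 1/8 of pages are resident. */
int pgoff_is_smaller(uint64_t npages, uint64_t resident)
{
	return pgoff_bytes(npages, resident) < bmap_bytes(npages);
}
```

For the 2TB example, pages_in(1ULL << 41, 4096) is 2^29 pages, so BMAP
needs a 512MB buffer, a packed bitmap would need 64MB, and PGOFF needs
4GB if everything is resident.  The caller has to guess which side of
the 1/8 break-even point the file is on before making the call.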
But that kind of search wouldn't work with the interface as designed.
The length of the buffer and the range of the file being checked are
coupled together, so you can't say:

	vec = malloc(sizeof(long));
	fincore(fd, 0, 1TB, FINCORE_PGOFF, vec, extra);

without overflowing vec.

Is it really right to say this is going to be 8 bytes?  Would we want it
to share a type with something else, such as loff_t?

> + * - FINCORE_PFN:
> + *     stores pfn, using 8 bytes.

These are all unprivileged operations from what I can tell.  I know
we're going to a lot of trouble to hide kernel addresses from being seen
in userspace.  This seems like it would be undesirable for the folks
that care about not leaking kernel addresses, especially for
unprivileged users.  This would essentially tell userspace where in the
kernel's address space some user-controlled data will be.

> + * We can use multiple flags among the flags in FINCORE_LONGENTRY_MASK.
> + * For example, when the mode is FINCORE_PFN|FINCORE_PAGEFLAGS, the per-page
> + * information is stored like this:

Instead of specifying the ordering in the manpages alone, would it be
smarter to just say that the ordering of the items is dependent on the
ordering of the flags?  In other words, if FINCORE_PFN <
FINCORE_PAGEFLAGS, then its field comes first?
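As a sketch of what "field order follows flag order" could look like
for a caller: the flag values below are made up (the patch's real
values aren't quoted here), every long entry is assumed to be 8 bytes,
and record_size()/field_offset() are hypothetical helpers, not
something the patch provides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical values; only their relative order matters for this sketch. */
#define FINCORE_PGOFF		0x02u
#define FINCORE_PFN		0x04u
#define FINCORE_PAGEFLAGS	0x08u
#define FINCORE_LONGENTRY_MASK \
	(FINCORE_PGOFF | FINCORE_PFN | FINCORE_PAGEFLAGS)

#define FIELD_BYTES 8u	/* assumed size of every long entry */

/* Size of one per-page record: 8 bytes for each long-entry flag in mode. */
size_t record_size(unsigned int mode)
{
	size_t n = 0;
	unsigned int m;

	/* m &= m - 1 clears the lowest set bit, so this loops once per flag. */
	for (m = mode & FINCORE_LONGENTRY_MASK; m; m &= m - 1)
		n += FIELD_BYTES;
	return n;
}

/* Byte offset of flag's field: the fields of all lower-valued flags precede it. */
size_t field_offset(unsigned int mode, unsigned int flag)
{
	unsigned int set = mode & FINCORE_LONGENTRY_MASK;

	assert(flag && !(flag & (flag - 1)));	/* exactly one bit set */
	assert(set & flag);			/* flag must be part of mode */
	return record_size(set & (flag - 1));
}
```

With these made-up values, FINCORE_PFN|FINCORE_PAGEFLAGS gives a
16-byte record with the pfn at offset 0 and the page flags at offset 8,
and userspace can derive that from the flag values alone instead of
from a table in the manpage.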