On Tue, 19 Jan 2010, Kay Sievers wrote:
>
> Most of the RAID metadata formats can probably get a minimal size
> requirement. It is probably not really necessary to support a 1.4
> MB RAID auto-assembly :)

Setting the debug mask to all ones, I get some crazy sh*t:

	[19] swsuspend:
	sb-buffer read() off=1024 len=3072
	sb-buffer read() off=4096 len=4096
	sb-buffer read() off=8192 len=8192
	sb-buffer read() off=16384 len=16384
	sb-buffer read() off=32768 len=32768

and it turns out that the fundamental problem is that the sb-buffer
size is simply much, much too big. We don't actually want to read 32kB
at all - that swsuspend thing is looking for a small signature at
various offsets.

So why do we do those reads of increasing size? The sb-buffer code
makes the first 68kB special, and uses the logic that even if you want
to read just 1kB, we'll read the whole beginning of the disk up to the
point you needed (see the sketch after the NOTE below). So no wonder we
always end up reading lots of data off those disks - the whole logic
seems to assume that it's totally free to read a single big area. Which
is simply not worth it to begin with.

If it's a fast disk, it doesn't matter one whit if we do it as 5 reads
or as one read. Sure, the five reads are going to be slower, but since
blkid does a total of ten reads or so, performance does not _matter_
for the fast-disk case! It's not like we're reading gigabytes of data,
where it matters a lot whether we do the reads in 4kB chunks or 256kB
chunks!

So the only case where performance matters is for the really slow
disks, and there we're much better off transferring less data. So the
whole buffering is actually detrimental to performance.

So with this trivial patch, things seem to improve. The swsuspend case,
for example, becomes

	[19] swsuspend:
	sb-buffer read() off=1024 len=3072
	extra-buffer read: off=7168 len=1024
	extra-buffer read: off=15360 len=1024
	extra-buffer read: off=31744 len=1024
	extra-buffer read: off=64512 len=1024

instead. Leave the caching to the kernel. The kernel will cache things
properly if there are buffers that get read over and over again (unless
you close and re-open the device, which blkid doesn't do).

With this trivial one-liner change, the timings change from

 - before (with the previous patch to not do read-ahead):

	[root@EeePC util-linux-ng]# time /sbin/blkid -p /dev/sdc
	real    0m4.914s
	user    0m0.001s
	sys     0m0.005s

 - after (with this additional one-liner):

	[root@EeePC util-linux-ng]# time misc-utils/blkid -p /dev/sdc
	real    0m2.887s
	user    0m0.011s
	sys     0m0.032s

so it still reads way too much, but it's gone from five seconds to
three, with a simple one-liner change to not buffer so much in user
space.

It's still too slow, I think, but considering that it used to be almost
15s, things have certainly improved a lot. And getting rid of the crazy
things (like probing for swsuspend at all on a 505kB disk!) would
improve it further.

NOTE! This patch does mean that blkid will do more 'read()' system
calls, but they will be (a) smaller and (b) many of them will overlap,
and the kernel caching kicks in.
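To make that sb-buffer behaviour concrete, here's a minimal sketch of
the kind of grow-to-cover logic described above. This is illustrative
only, not the actual blkid source - the struct layout and the name
probe_get_buffer are made up:

	#include <unistd.h>

	#define BLKID_SB_BUFSIZ	0x11000	/* 68kB - the value before the patch */

	struct probe {
		int fd;
		unsigned char sb_buf[BLKID_SB_BUFSIZ];	/* start-of-disk cache */
		size_t sb_len;				/* bytes valid so far */
	};

	/*
	 * Hypothetical helper: any request inside the first 68kB is
	 * satisfied by extending the cache from its current end all the
	 * way to the end of the requested range. That is exactly why
	 * probing for a tiny signature near the 32kB mark ends up
	 * pulling in the entire first 32kB of the disk.
	 */
	static unsigned char *probe_get_buffer(struct probe *pr,
					       size_t off, size_t len)
	{
		if (off + len > BLKID_SB_BUFSIZ)
			return NULL;		/* outside the "special" area */
		if (off + len > pr->sb_len) {
			ssize_t n = pread(pr->fd, pr->sb_buf + pr->sb_len,
					  off + len - pr->sb_len, pr->sb_len);
			if (n < 0)
				return NULL;
			pr->sb_len += n;
		}
		return pr->sb_buf + off;
	}

With BLKID_SB_BUFSIZ dropped to 4096, the same logic degenerates to at
most one small read, and everything past the first page goes through
the uncached "extra-buffer" path instead - so the overlapping probes
now rely entirely on the kernel's page cache.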
For example, we have hot offsets like the 7kB and 8kB marks:

	[19] swsuspend:
	extra-buffer read: off=7168 len=1024
	[20] swap:
	extra-buffer read: off=7168 len=1024

and

	[27] reiserfs:
	extra-buffer read: off=8192 len=1024
	[32] zfs:
	extra-buffer read: off=8192 len=1024
	[35] ufs:
	extra-buffer read: off=8192 len=1377
	[36] hpfs:
	extra-buffer read: off=8192 len=1024
	[45] ocfs:
	extra-buffer read: off=8192 len=1024
	[46] ocfs2:
	extra-buffer read: off=8192 len=1024

but from a performance standpoint, none of that matters, because the
kernel will read those locations just once, and then just copy from the
cache for the subsequent ones.

Before, none of those cases caused duplicate reads, because blkid
cached the whole first 68kB in that BLKID_SB_BUFSIZ buffer. But it did
so by reading much more than necessary.

You can see the system call timings changing radically too (again,
filtering only system calls that take more than a tenth of a second):

 - before:

	read(3, "\0\0\0\0\0\0\0\0"..., 1024) = 1024 <0.256844>
	read(3, "\0\0\0\0\0\0\0\0"..., 4096) = 4096 <0.256971>
	read(3, "\0\0\0\0\0\0\0\0"..., 8192) = 8192 <0.514547>
	read(3, "\353<\220){'`j"..., 16384) = 16384 <1.033542>
	read(3, "\0\0\0\0\0\0\0\0"..., 32768) = 32768 <2.064483>
	read(3, "\0\0\0\0\0\0\0\0"..., 1024) = 1024 <0.257442>
	read(3, "\0\0\0\0\0\0\0\0"..., 1024) = 1024 <0.257447>
	read(3, "\0\0\0\0\0\0\0\0"..., 1377) = 1377 <0.256989>

 - after:

	read(3, "\0\0\0\0\0\0\0\0"..., 1024) = 1024 <0.257850>
	read(3, "\0\0\0\0\0\0\0\0"..., 1024) = 1024 <0.256907>
	read(3, "\0\0\0\0\0\0\0\0"..., 1024) = 1024 <0.257585>
	read(3, "\0\0\0\0\0\0\0\0"..., 1024) = 1024 <0.257583>
	read(3, "\0\0\0\0\0\0\0\0"..., 1024) = 1024 <0.258587>
	read(3, "\0\0\0\0\0\0\0\0"..., 1024) = 1024 <0.257433>
	read(3, "\0\0\0\0\0\0\0\0"..., 1024) = 1024 <0.256586>
	read(3, "\0\0\0\0\0\0\0\0"..., 1024) = 1024 <0.257574>
	read(3, "\0\0\0\0\0\0\0\0"..., 1024) = 1024 <0.258016>
	read(3, "\0\0\0\0\0\0\0\0"..., 1377) = 1377 <0.256531>
	read(3, "\0\0\0\0\0\0\0\0"..., 512) = 512 <0.257059>

ie we do more reads (the above ignores all the ones that end up being
duplicates and come from the cache, and thus take no time at all), but
because they are much smaller, they are much faster.

That quarter second is now always the time to fault in one 4kB block
(which is the kernel's caching granularity), which is why the time is
the same for 512 bytes as for 1377 bytes (and consistent with that
15kB/s "performance" of the drive). A quick back-of-the-envelope check
of that follows the patch below.

(Btw, you could just remove the whole BLKID_SB_BUFSIZ logic in the
first place, but I didn't do that. I intentionally went for "smallest
possible patch" rather than "get rid of the user-level caching
entirely".)

		Linus

---
 shlibs/blkid/src/blkidP.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/shlibs/blkid/src/blkidP.h b/shlibs/blkid/src/blkidP.h
index 4a3ae20..1e3d5dc 100644
--- a/shlibs/blkid/src/blkidP.h
+++ b/shlibs/blkid/src/blkidP.h
@@ -133,7 +133,7 @@ struct blkid_prval
 	struct blkid_chain *chain;	/* owner */
 };

-#define BLKID_SB_BUFSIZ	0x11000
+#define BLKID_SB_BUFSIZ	4096

 /*
  * Filesystem / Raid magic strings
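And here is the promised back-of-the-envelope check on the granularity
claim. The 1377-byte read is the ufs probe at off=8192 from the list
above; the 512-byte read's offset isn't in the trace, so it's assumed
here to fall entirely inside one page:

	#include <stdio.h>

	#define PAGE 4096UL

	/* how many 4kB pages does a read of (off, len) fault in? */
	static unsigned long pages_touched(unsigned long off, unsigned long len)
	{
		return (off + len - 1) / PAGE - off / PAGE + 1;
	}

	int main(void)
	{
		printf("%lu\n", pages_touched(8192, 1377));	/* -> 1 page */
		printf("%lu\n", pages_touched(0, 512));		/* -> 1 page */
		return 0;
	}

Each uncached probe therefore costs exactly one 4kB page fetch, and
4096 bytes / 0.257s is about 15.9kB/s - which is why 512 bytes and
1377 bytes take the same quarter of a second on this drive.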