On Tue, 19 Jan 2010, Karel Zak wrote: > > This is correct. The last read() (16384 bytes) is a read from FAT > root directory where we look for LABEL. The size of this segment is > not fixed and it depends on the directory size. > > Anyway, it seems that we don't have to read all FAT root directory > entries, it's better to call read() for each directory entry (32 > bytes) and stop at end of the directory or when we found the label. > After this change: > > $ LD_LIBRARY_PATH=shlibs/blkid/src/.libs strace -s -T8 -e read \ > ./misc-utils/.libs/blkid -O 16384 -p ~/fat12-linus.img > read(3, ""..., 832) = 832 > read(3, ""..., 832) = 832 > read(3, ""..., 832) = 832 > read(3, ""..., 1024) = 1024 > read(3, ""..., 32) = 32 > /home/kzak/fat12-linus.img: SEC_TYPE="msdos" VERSION="FAT12" TYPE="vfat" USAGE="filesystem" > > It means 1056 bytes for your disk image! Looks like a nice > improvement ;-) The change is pushed to the repository, so "git pull" > is your friend. Ok, so the above is fine, but since we do end up doing all reads as whole blocks anyway (and we always round up to a page size in order to do caching well), it doesn't really matter _that_ much once you get down to a single page (4kB) reads. Of course, depending on alignment etc, smaller reads can still be better (ie if that avoids crossing a page boundary), but realistically there are very much diminishing returns. A 4kB-aligned 4kB block is going to do as much IO as a 32-byte block does (assuming the 32-byte one doesn't cross a page). Also, I suspect that it's not a great idea to optimize for a totally empty medium. Sure, it's not a totally uncommon case, but I suspect that it's less common than having a couple of root directory entries (camera devices will almost always have exactly one subdirectory entry etc). But what you quote above will certainly work well for my case. Realistically, I'd suggest going for a 4kB blocking factor, though. That's likely a good mid-way point between "read all of it whether it's useful or not" and "assume it's almost empty". > > and I have _no_ idea why that close() takes so long, but I'm guessing that > > it is flushing the page cache and has to wait for any read-ahead to > > finish. > > hmm.. I think application should not be affected by read-ahead. Well, it doesn't affect the end result, but it obviously does affect timing. MOST of the time read-ahead is a good idea, but there's a reason we have that posix_fadvise() (and madvise) system call. Sometimes read-ahead doesn't capture the real pattern, and then it just sucks. > > That is arguably a kernel problem, and I'll take a look at it. Although > > disabling read-ahead in blkid is probably a good thing regardless. > > Applied, thanks for the patch. Side note: I did take a look at the kernel, and sent off an email to the person who has done most of the read-ahead code lately. However, the "wait for all read-ahead to finish" part seems to be pretty unvoidable, because we may end up doing something final after the close (like mounting it, which will need a clean slate), so the only option on a kernel level is to not read ahead at all. And sadly, the kernel looks at the "oh, we did two reads at the beginning of the disk/partition" as a hint that it is worth reading agead. Which is probably true in many cases, but not for something like blkid. So the patch I sent out is likely the best option right now. blkid knows that it's going to seek around and read from different places, so using POSIX_FADV_RANDOM is likely really the right thing to do. But I do think the kernel read-ahead could be improved, so I'll follow up on that with Wu Fengguang. > > However, the real problem is that you didn't fix the whole-disk case: > > Yes, ignore this problem for now. The ideal solution is not to call > blkid for the whole disk. We have to resolve this problem. So in the meantime, what's a good rule to disable it on a per-machine basis? Linus -- To unsubscribe from this list: send the line "unsubscribe util-linux-ng" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html