On Tue, 31 Jul 2012, Jim Meyering wrote: > Hugh Dickins wrote: > > On Thu, 12 Jul 2012, Jeff Liu wrote: > > > On 07/12/2012 07:01 AM, Dave Chinner wrote: > > > > On Wed, Jul 11, 2012 at 11:55:34AM -0700, Hugh Dickins wrote: > > > >> > > > >> But your vote would count for a lot more if you know of some app which > > > >> would really benefit from this functionality in tmpfs: I've heard > > > >> of none. > ... > [Jeff mentioned "cp"] > > grep is another tool that would benefit. > I often put very large files (often sparse, too) on tmpfs file systems > and would like "grep -r PAT /tmp" to work well in spite of those files. > > Please consider restoring SEEK_HOLE/SEEK_DATA support for tmpfs. Many thanks, Jim, for taking the time to explain your case and provide the links below. I am sorry to have given you guys trouble by adding and then reverting the tmpfs SEEK_HOLE before 3.5 final. I really wanted to respond to you positively. But now that (I think) I understand the grep case better, I believe the best I can do is to keep an eye on ext4 development, and reintroduce tmpfs SEEK_HOLE/SEEK_DATA in the same release that proper support comes to ext4, whenever that is (I notice that Zheng Liu has posted patches for it, but I have no idea how close to ready they are). > > The lack of cross-FS support in SEEK_HOLE/SEEK_DATA support is a bit of a > thorn in our sides. FIEMAP is not a viable option, and SEEK_HOLE support > works only if you happen to be using btrfs, xfs, ocfs2 or 3.5.0-rcN tmpfs. > Not something we can rely on for a feature whose lack can convert grep -r > into a memory-hogging apparently-hung job or OOM-killer-target. > > What would you like to happen when you run > (deliberately or inadvertently) grep on a large sparse file? > I want it to search only the non-HOLE sections of that file, That will be a nice optimization once SEEK_DATA is more widely supported, yes. > especially when examining a hole involves accumulating a > "line" that may be so long that it exhausts virtual memory. I think you're conflating two different things there. As I understand it, there's a bug (or silliness) in grep such that it ends up exhausting memory when it cannot divide the file up into sensibly lengthed lines. I agree that you don't want grep to collapse with "memory exhausted" in such a case; and you'll know better than me what a sensible maximum length for a grep line should be. But what has that got to do with holes? I have not tried grepping a larger-than-memory+swap file containing neither NULs nor newlines, so please correct me if I'm wrong: but I expect it to fail grep's binary file test (with or without proper SEEK_HOLE support), but then to hit the memory exhausted error. What it's got to do with holes is that simple text files don't have NULs in them, and there tends to be a correlation between files containing NULs and files not divided into lines of text (though I'm blissfully ignorant of even the most common document formats). There's also a correlation between files containing NULs and files containing NULs in their first 32k. grep's file_is_binary() looks for a NUL in that first 32k (?); and the heuristic has been "enhanced" to check for sparse files too (whether by st_blocks or now SEEK_HOLE) - though no attempt to check for isolated NULs beyond the first 32k. And there's even a "big-hole" test for this behaviour in the tree. But wouldn't the developer's common case (object files amidst source in the tree) usually be handled by that check on the first 32k? And all this stuff about holes: shouldn't grep instead just be checking for over-long lines instead of running out of memory? I apologize, this is not the right forum to discuss grep code: let's take it away from linux-fsdevel if followup is needed. I'll happily reintroduce tmpfs SEEK_DATA and SEEK_HOLE when ext4 joins btrfs and xfs in supporting it. Thanks, Hugh > We're not quite there, but for now can at least avoid the > VM-abusing behavior with --binary-file=without-match option, > which says to treat "binary" (sparse) files as if they contain no match. > Sometimes. > > With working SEEK_HOLE support, grep does the right thing here: > > (${AWK-awk} 'BEGIN{ for (i=0;i<1000;i++) printf "%080d\n", 0 }' < /dev/null > echo x | dd bs=1024k seek=8000000 > ) >8T-or-so > > $ env time --format=%e grep x 8T-or-so > 0.00 > > But without SEEK_HOLE support, and with a lot of memory, grep takes a > long time to allocate all of that space before it finally chokes or is killed. > Here, it takes 46 seconds before running out of memory: > > $ env time grep --binary-file=without-match x 8T-or-so > grep: memory exhausted > 3.15user 25.48system 0:46.46elapsed 61%CPU\ > (0avgtext+0avgdata 12583712maxresident)k > 0inputs+8outputs (0major+2733623minor)pagefaults 0swaps > [Exit 2] > > Until very recently, grep was trying to guess whether an input > has a hole using st_blocks and st_size, but with file systems now > using compression, that method it too subject to false-positives. > > Ideally we would use SEEK_HOLE/SEEK_DATA, but until that is useful on > more linux file systems, I suspect we'll have to choose our method based > on the file system type (at the cost of a statvfs call for each st_dev), > possibly in combination with the linux kernel version. > > Here's some background/discussion on the topic, including the > original report about the st_blocks-based heuristic not working: > > http://thread.gmane.org/gmane.comp.gnu.grep.bugs/4604/focus=4610 > > In case you want to see the SEEK_HOLE-using code, grep's file_is_binary > function is here: > > http://git.savannah.gnu.org/cgit/grep.git/tree/src/main.c#n439 -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html