--- On Sun, 7/4/13, Vyacheslav Dubeyko <slava@xxxxxxxxxxx> wrote:

> Hi Hin-Tak,
>
> On Apr 5, 2013, at 12:57 AM, Hin-Tak Leung wrote:
>
> > Hi Michael,
> >
> > Argh, that looks suspiciously like the recurring problem I have been
> > trying to pin down for much of the last year. My current thinking is
> > that one of the patches posted a couple of weeks ago might help.
>
> As I remember, you can easily reproduce the issue that you are
> investigating. Is the issue reproducible with debug output enabled? Can
> you reproduce it with debug output fully enabled (I mean with all debug
> flags on)? If you can reproduce the issue with debug output enabled,
> could you share that debug output with me?

That's correct - I can trigger the error condition with debug enabled
fairly reliably. I remember having done that once, I think with catalog
and extent debugging on. The problem was that it generated too much
information: to trigger the condition I needed to run "du" on a large
directory (~a million files), the catalog debugging emits a few lines
per file, and "du" stats every one of those ~million files, so we are
talking about dumping a few hundred MBs into /var/log/messages :-(.
Hence another reason for switching to dynamic debugging - so that one
can switch individual debug lines on and off. Even that is not ideal.

> Thanks,
> Vyacheslav Dubeyko.
>
> > That patch addresses out-of-memory conditions in caching of metadata,
> > in a nutshell. I think if (1) the system is under memory stress, (2)
> > one is doing something which traverses the file system very quickly,
> > (3) on a multi-CPU/core system, it is possible to run some mutexed
> > non-re-entrant code in the hfsplus driver simultaneously without the
> > mutex lock held, and therefore get it a bit confused.
> > This idea at least explains why (1) adding an inner mutex lock can
> > delay the problem, although supposedly the outer mutex should have
> > prevented more than one copy of the non-re-entrant code from running
> > and the inner mutex lock should have no effect at all, and (2) the
> > on-disk data always fsck's okay - it is just the driver itself
> > getting confused.
> >
> > So I have a few questions for you:
> >
> > 1. You are on a quad-core system, correct? This is according to your
> > /proc/cpuinfo below.
> >
> > 2. You are certainly doing fast file system traversal (updatedb), but
> > are you actually doing it *on top of the hfsplus* file system? I am
> > asking this because updatedb is usually configured not to index
> > removable media under /mnt or /media. But you mentioned you have the
> > hfsplus file system mounted under /home - please confirm that and
> > include some more details if you can.
> >
> > 3. How full and populous is that hfs+ file system? I.e. the output of
> > both "df" and "df -i" while it is mounted. Is this your Mac OS X
> > system (root /) disk?
> >
> > 4. Is your system under memory stress at the moment the problem
> > happens - e.g. do you have a web browser with a few hundred tabs
> > open?
> >
> > Hin-Tak
> >
> > --- On Thu, 4/4/13, Vyacheslav Dubeyko <slava@xxxxxxxxxxx> wrote:

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
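[Editor's note: the per-line switching that dynamic debugging would allow
looks roughly like the sketch below, on a kernel built with
CONFIG_DYNAMIC_DEBUG=y and with the driver's messages routed through
pr_debug() (the stock hfsplus dprint macro would first need converting,
which is presumably the "switching to dynamic debugging" mentioned above).
The file names match the mainline hfsplus sources; the line number is only
a placeholder.]

```shell
# Make sure debugfs is available (usually already mounted).
mount -t debugfs none /sys/kernel/debug 2>/dev/null || true

# Enable every pr_debug() call site in the catalog code ...
echo 'file fs/hfsplus/catalog.c +p' > /sys/kernel/debug/dynamic_debug/control

# ... but switch one noisy statement back off by line number
# (line 100 is a placeholder, not a real hfsplus call site):
echo 'file fs/hfsplus/catalog.c line 100 -p' > /sys/kernel/debug/dynamic_debug/control

# Disable extent debugging entirely:
echo 'file fs/hfsplus/extents.c -p' > /sys/kernel/debug/dynamic_debug/control

# Inspect which hfsplus call sites are currently enabled:
grep hfsplus /sys/kernel/debug/dynamic_debug/control
```

This keeps the "du over ~a million files" run from flooding
/var/log/messages, since only the call sites under suspicion are live.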
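[Editor's note: the failure mode hypothesized above - non-re-entrant code
running on several CPUs at once because the expected mutex is not held -
can be illustrated with a small userspace pthread sketch. This is not the
hfsplus code; the shared counter merely stands in for shared driver state.
With the lock taken the read-modify-write is serialized and the result is
exact; remove the lock/unlock pair and the interleaving "gets it a bit
confused", i.e. the final count comes up short.]

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITER    100000

static long counter;                    /* stands in for shared driver state */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < NITER; i++) {
        pthread_mutex_lock(&lock);      /* remove this pair to see corruption */
        counter++;                      /* non-re-entrant read-modify-write  */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);

    /* Exactly NTHREADS * NITER when properly serialized. */
    printf("%ld\n", counter);
    return 0;
}
```

An "inner" lock added around a smaller region inside worker() would only
shrink the unprotected window, which matches the observation that it
delays the problem rather than fixing it.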