--- On Fri, 19/10/12, Vyacheslav Dubeyko <slava@xxxxxxxxxxx> wrote:

> Hi Hin-Tak,
>
> On Thu, 2012-10-18 at 17:55 +0100, Hin-Tak Leung wrote:
> > Hi,
> >
> > While looking at a few of the older BUG() traces I get consistently
> > when running du on a somewhat large directory with lots of small
> > files and small directories, I noticed that they tend to have two
> > sleeping "? hfs_bnode_read()" entries towards the top. As that is a
> > very small and simple function which just reads a b-tree node
> > record - sometimes only a few bytes between a kmap()/kunmap()
> > pair - I suspect the problem might just be the number of
> > simultaneous kmap()'s being run. So I put a mutex around it just to
> > make sure only one copy of hfs_bnode_read() is run at a time.
>
> Yeah, you touch on a very important problem. The hfsplus driver needs
> to be reworked away from using kmap()/kunmap(), because kmap() is
> slow, theoretically deadlocky, and deprecated. The alternative is
> kmap_atomic()/kunmap_atomic(), but that requires diving more deeply
> into every case of kmap() use in the hfsplus driver.
>
> The mutex is useless. It simply hides the issue.

Yes, I am aware of that - putting mutexes in just results in fewer
concurrent kmap() calls, but the limit of simultaneous kmap()'s can
still be reached - and reasonably easily: just run 'du' a few more
times, as I wrote below.

I tried swapping those for kmap_atomic()/kunmap_atomic() (beware: the
arguments to the unmap differ from kunmap()'s) - but the kernel
immediately warned that *_atomic() code was being used where code can
sleep.

> > This seems to make it much harder to get a BUG() - I needed to run
> > du a few times over and over to get it again. Of course it might
> > just be the mutex slowing the driver down enough to make it less
> > likely to get confused, but as I read that the number of
> > simultaneous kmap()'s in the kernel is limited, I think I might be
> > on to something. Also this shifts the problem onto multiple copies
> > of "? hfsplus_bmap()"
> > (which also kmap()/kunmap()'s, but is much more complicated).
>
> Namely, the mutex hides the issue.
>
> > I thought of doing hfsplus_kmap() etc. (which seems to have existed
> > a long time ago but was removed!), but this might cause deadlocks,
> > since some of the hfsplus code is kmapping/kunmapping all the time,
> > and recursively. So a better way might be just to make sure that
> > only one instance of some of the routines is run at a time, i.e.
> > multiple mutexes. This is both ugly and sounds like voodoo though.
> > Also I am not sure why the existing mutexes, which protect some of
> > the internal structures, don't protect against too many kmap()'s.
> > (Maybe they protect "writes", but not against too many simultaneous
> > reads.) So does anybody have an idea how many kmap()'s are allowed,
> > and how to tell that I am close to my machine's limit?
>
> As I can understand, the hfsplus_kmap() doesn't do anything useful.
> The kmap()/kunmap() usage really needs to be reworked, instead of
> using a mutex.
>
> Could you try to fix this issue? :-)

Am *trying* :-). Hence this request for discussion & help.

I do think that the hfsplus driver is kmapping/kunmapping too often -
and doing so on very small pieces of data, which does not map well
onto page-sized operations. I think one possibility for improvement is
to reorganize the internal representation of the b-tree - translate it
into a more page-filling structure, if that makes sense, rather than
mapping/unmapping pages all the time to read very small pieces off
each page.

I still cannot quite get my head around how (1) essentially read-only
operations can get worse and worse if you run them a few more times;
(2) it seems that it is just the kernel's internal representation of
the filesystem getting more and more confused - there does not seem to
be any write on unmount, and if you unmount and run fsck it says "no
need to do anything", and you can re-mount and play with 'du' again.
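For concreteness, the serialization experiment I described looks
roughly like the sketch below. This is a simplified, single-page-only
illustration modelled loosely on hfs_bnode_read() in
fs/hfsplus/bnode.c, not the verbatim mainline code (the real function
also handles reads spanning two pages, and it does not compile outside
a kernel tree); the mutex name is mine and is not in the mainline
driver.

```c
/* Sketch only: modelled loosely on hfs_bnode_read() in
 * fs/hfsplus/bnode.c.  The mutex is the experiment described above
 * (not mainline); only the single-page case is shown. */
static DEFINE_MUTEX(hfs_bnode_read_mutex);

void hfs_bnode_read(struct hfs_bnode *node, void *buf, int off, int len)
{
	struct page *page;

	mutex_lock(&hfs_bnode_read_mutex);  /* at most one kmap() at a time */

	off += node->page_offset;
	page = node->page[off >> PAGE_CACHE_SHIFT];

	memcpy(buf, kmap(page) + (off & ~PAGE_CACHE_MASK), len);
	kunmap(page);

	mutex_unlock(&hfs_bnode_read_mutex);
}

/* The kmap_atomic() variant I tried - note that kunmap_atomic() takes
 * the mapped address, not the page, unlike kunmap().  kmap_atomic()
 * disables preemption, which is why the kernel warns if anything
 * between the map and the unmap can sleep. */
static void bnode_read_atomic(struct page *page, void *buf, int off, int len)
{
	void *vaddr = kmap_atomic(page);

	memcpy(buf, vaddr + off, len);
	kunmap_atomic(vaddr);               /* address, not page */
}
```

This is only meant to illustrate why the mutex reduces, but does not
bound, the number of concurrent kmap()'s: it serializes this one
function, while other callers elsewhere in the driver still kmap()
freely.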
> > Also a side note on the Netgear journalling code: I see that it
> > journals the volume header and some of the special files (the
> > catalog, allocation bitmap, etc.), but (1) it has some code to
> > journal the attributes file, which was actually non-functional,
> > since without Vyacheslav's recent patches the linux kernel doesn't
> > even read/write that correctly, let alone do *journalled*
> > read/write correctly; (2) there is a part which tries to do
> > data-page journalling, but it seems to be wrong - or at least not
> > quite working. (This I found while I was looking at some curious
> > warning messages and how they come about.) Luckily that code just
> > bails out when it gets confused - i.e. it does non-journalled
> > writes, rather than writing a wrong journal to disk. So it doesn't
> > harm data under normal routine use (i.e. mount/unmount cleanly).
> > But that got me worrying a bit about inter-operability: it is
> > probably unsafe to use Linux to replay the journal written by
> > Mac OS X, and vice versa. I.e. if you have a dual-boot machine, or
> > a portable disk that you use between two OSes, and it
> > disconnects/unplugs/crashes under one OS, it is better to plug it
> > right back in and let the same OS replay the journal, then unmount
> > cleanly, before using it under the other OS.
>
> The journal should be replayed during every mount when valid
> transactions are present. An HFS+ volume shouldn't be mounted without
> replaying the journal; otherwise it is possible to end up with a
> corrupted partition. Just imagine: you mount an HFS+ partition with a
> non-empty journal and then add some data to the volume. That means
> you modify metadata. If you then mount such an HFS+ volume under
> Mac OS X, the journal will be replayed and the metadata will be
> corrupted.

Both OSes try to replay on first mount - but I doubt that they
create/use the journal in the same way, so inter-operability is not
guaranteed - i.e.
it is not recommended to reboot into Mac OS X from an unclean shutdown
of Linux, or vice versa. The Netgear code does seem to be
self-consistent, though - i.e. it replays journals created by itself
okay, I should hope. What is clearly a problem is that the Netgear
code bails out too often and does *not* write a journal (and therefore
does not clear the transaction after the data write) for some data
writes - it basically just writes data without an accompanying journal
entry & its finishing transaction.

Hin-Tak

> With the best regards,
> Vyacheslav Dubeyko.

> > I'll be interested in hearing any tips on finding out kmap's limit
> > at run time, if anybody has any idea...
> >
> > Hin-Tak

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel"
in the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html