On Sat, 2012-10-20 at 07:24 +0100, Hin-Tak Leung wrote:
> --- On Fri, 19/10/12, Vyacheslav Dubeyko <slava@xxxxxxxxxxx> wrote:
>
> > Hi Hin-Tak,
> >
> > On Thu, 2012-10-18 at 17:55 +0100, Hin-Tak Leung wrote:
> > > Hi,
> > >
> > > While looking at a few of the older BUG() traces I have been getting while consistently running du on a somewhat large directory with lots of small files and small directories, I noticed that there tend to be two sleeping "? hfs_bnode_read()" entries towards the top. As it is a very small and simple function which just reads a b-tree node record - sometimes only a few bytes between a kmap()/kunmap() - I suspect it might just be the number of simultaneous kmap()s being run. So I put a mutex around it just to make sure only one copy of hfs_bnode_read() runs at a time.
> >
> > Yeah, you touch on a very important problem. The hfsplus driver needs to be reworked away from kmap()/kunmap(), because kmap() is slow, theoretically deadlocky and deprecated. The alternative is kmap_atomic()/kunmap_atomic(), but that needs a deeper look at every place kmap() is used in the hfsplus driver.
> >
> > The mutex is useless. It simply hides the issue.
>
> Yes, I am aware of that - putting mutexes in just means fewer simultaneous kmap() calls, but the limit of simultaneous kmap()s can still be reached - and reasonably easily - just run 'du' a few more times, as I wrote below.

Usually, hfs_bnode_read() is called after searching for some object in the b-tree. It is necessary to initialize struct hfsplus_find_data by means of hfs_find_init() before any search or operation inside a b-tree node, and to call hfs_find_exit() afterwards. The hfsplus_find_data structure contains a mutex that locks the b-tree during the hfs_find_init() call and unlocks it during hfs_find_exit(). And, usually, hfs_bnode_read() is placed between hfs_find_init()/hfs_find_exit() calls. So, as I understand it, your mutex inside hfs_bnode_read() is useless. But maybe not all hfs_bnode_read() calls are placed inside hfs_find_init()/hfs_find_exit() pairs - that needs to be checked.

I can't clearly understand what simultaneous kmap()s you are talking about. As far as I can see, usually (especially in the case of hfs_bnode_read()) the kmap()/kunmap() pairs are localized to small sections of code, and I expect them to execute quickly. Do I misunderstand something?

> I tried swapping those for kmap_atomic()/kunmap_atomic() (beware, the arguments are different for unmap) - but the kernel immediately warned that *_atomic() code is used where code can sleep.

Could you describe the warning in more detail? I thought that in hfs_bnode_read() the kmap()/kunmap() pairs are localized to small sections of code and can be changed to kmap_atomic()/kunmap_atomic().
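Something like the untested sketch below is what I mean. It follows the shape of the per-page copy loop in fs/hfsplus/bnode.c (field names and loop structure are from memory, so treat it as a sketch rather than a patch); only a memcpy() sits between the map and the unmap, so nothing should be able to sleep while the atomic mapping is held:

/*
 * Untested sketch only: hfs_bnode_read() using kmap_atomic() instead of
 * kmap().  Nothing between the map and the unmap can sleep (it is only a
 * memcpy()), so the atomic variant should be legal here.
 */
#include <linux/highmem.h>	/* kmap_atomic(), kunmap_atomic() */
#include <linux/pagemap.h>	/* PAGE_CACHE_SHIFT, PAGE_CACHE_SIZE */
#include <linux/string.h>	/* memcpy() */
#include "hfsplus_fs.h"		/* struct hfs_bnode */

void hfs_bnode_read(struct hfs_bnode *node, void *buf, int off, int len)
{
	struct page **pagep;
	void *kaddr;
	int l;

	off += node->page_offset;
	pagep = node->page + (off >> PAGE_CACHE_SHIFT);
	off &= ~PAGE_CACHE_MASK;

	/* copy the part that lives in the first page */
	l = min_t(int, len, PAGE_CACHE_SIZE - off);
	kaddr = kmap_atomic(*pagep);
	memcpy(buf, kaddr + off, l);
	kunmap_atomic(kaddr);

	/* any remaining pages are copied from offset 0 */
	while ((len -= l) != 0) {
		buf += l;
		l = min_t(int, len, PAGE_CACHE_SIZE);
		kaddr = kmap_atomic(*++pagep);
		memcpy(buf, kaddr, l);
		kunmap_atomic(kaddr);
	}
}

If the warning you saw points at hfs_bnode_read() itself, then something between the map and the unmap must be sleeping after all; if it points at one of the callers, it is probably a place that keeps a whole node mapped across other operations, and those are the spots that need the deeper rework.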
> > > This seems to make it much harder to get a BUG() - I needed to run du a few times over and over to get it again. Of course it might just be that the mutex slows the driver down enough to make it less likely to get confused, but as I read that the number of simultaneous kmap()s in the kernel is limited, I think I might be on to something. Also, this shifts the problem onto multiple copies of "? hfsplus_bmap()" (which also kmap()/kunmap()s, but is much more complicated).
> >
> > Namely, the mutex hides the issue.
> >
> > > I thought of doing hfsplus_kmap()/etc. (which seems to have existed a long time ago but was removed!), but this might cause deadlocks since some of the hfsplus code is kmapping/kunmapping all the time, and recursively. So a better way might be just to make sure that only one instance of some of the routines runs at a time, i.e. multiple mutexes. This is both ugly and sounds like voodoo though. Also, I am not sure why the existing mutexes, which protect some of the internal structures, don't protect against too many kmap()s (maybe they protect "writes", but not against too many simultaneous reads). So does anybody have an idea how many kmaps are allowed, and how to tell that I am close to my machine's limit?
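On the limit question: with CONFIG_HIGHMEM on 32-bit x86, the persistent kmap() pool has only LAST_PKMAP slots (1024, or 512 with PAE), and kmap() on a highmem page sleeps in kmap_high() when no slot is free. As far as I know there is no counter exported for it; tasks that are stuck simply show up in stack traces sleeping inside kmap_high(). A single map/copy/unmap per page, as hfs_bnode_read() does, pins at most one slot per task, so the pattern to look for is code that holds one mapping while asking for another - roughly like this illustration (not actual hfsplus code):

#include <linux/highmem.h>
#include <linux/string.h>

/*
 * Illustration only: each kmap() pins one of the LAST_PKMAP slots until
 * the matching kunmap().  If many tasks hold one mapping while sleeping
 * inside a second kmap(), waiting for a free slot, they can exhaust the
 * pool or even deadlock against each other.
 */
static void copy_pair(struct page *a, struct page *b, void *dst,
		      int off_a, int off_b, int len)
{
	void *va = kmap(a);	/* may sleep waiting for a free pkmap slot */
	void *vb = kmap(b);	/* may sleep again while still holding 'va' */

	memcpy(dst, va + off_a, len);
	memcpy(dst + len, vb + off_b, len);

	kunmap(b);
	kunmap(a);
}

That kind of nesting - especially the recursive mapping you mention - would also fit your observation that the problem then shifts to the more complicated bmap code.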
> > As I understand it, hfsplus_kmap() didn't do anything useful. What really needs to happen is to rework the kmap()/kunmap() usage, instead of adding mutexes.
> >
> > Could you try to fix this issue? :-)
>
> Am *trying* :-). Hence this request for discussion & help. I do think that the hfsplus driver is kmapping/unmapping too often - and doing so on very small pieces of data, which do not map well onto whole pages. I think one possibility for improvement is to reorganize the internal representation of the b-tree - translate it into a more page-filling structure, if that makes sense - rather than mapping/unmapping pages all the time to read very small pieces off each page.

Sorry, I do not fully understand your description. Currently, I think that the existing implementation of the b-tree functionality looks good, and I can't see any real necessity to change it. Could you describe what you are trying in more detail?

> I still cannot quite get my head around how (1) essentially read-only operations can get worse and worse if you run them a few more times, and (2) it seems to be just the kernel's internal representation of the filesystem getting more and more confused - there does not seem to be any write on unmount, and if you unmount and run fsck it says "no need to do anything", and you can re-mount and play with 'du' again.
>
> > > Also a side note on the Netgear journalling code: I see that it journals the volume header and some of the special files (the catalog, allocation bitmap, etc.), but (1) it has some code to journal the attributes file which is actually non-functional, since without Vyacheslav's recent patches the Linux kernel doesn't even read/write that correctly, let alone do *journalled* reads/writes correctly, and (2) there is a part which tries to do data-page journalling, but it seems to be wrong - or at least not quite working (this I found while looking at some curious warning messages and how they come about). Luckily that code just bails out when it gets confused - i.e. it does non-journalled writes rather than writing a wrong journal to disk - so it doesn't harm data under routine normal use (i.e. mounting/unmounting cleanly).
> > >
> > > But that got me worrying a bit about inter-operability: it is probably unsafe to use Linux to replay a journal written by Mac OS X, and vice versa. I.e. if you have a dual-boot machine, or a portable disk that you use between the two OSes, and it disconnects/unplugs/crashes under one OS, it is better to plug it right back in, let the same OS replay the journal and unmount cleanly before using it under the other OS.
> >
> > The journal should be replayed during every mount if valid transactions are present. An HFS+ volume shouldn't be mounted without replaying the journal; otherwise it is possible to end up with a corrupted partition. Just imagine: you mount an HFS+ partition with a non-empty journal and then add some data to the volume, which means you modify metadata. If you then mount such an HFS+ volume under Mac OS X, the journal will be replayed and the metadata will be corrupted.
>
> Both OSes try to replay on first mount - but I doubt that they create/use the journal in exactly the same way, so inter-operability is not guaranteed - i.e. it is not recommended to reboot into Mac OS X after an unclean shutdown of Linux, or vice versa. The Netgear code does seem to be self-consistent, though - i.e. it replays journals created by itself okay, I should hope.
>
> What is clearly a problem is that the Netgear code bails out too often and does *not* write a journal entry (and therefore does not clear the transaction after the data write) for some data writes - it basically just writes data without an accompanying journal transaction and its completion.

I think the situation is simple. :-) We can claim to have support for HFS+ journaling under Linux when the Linux code can correctly replay the journal (whether its transactions were filled in under Mac OS X or under Linux) and can operate on a journaled HFS+ partition without any complaints from fsck, and with the volume still mounting successfully under Mac OS X. As I understand it (maybe I am wrong), the currently existing Linux implementation of HFS+ journaling only tries to replay the journal during HFS+ volume mount and does not touch the journal after that. It is possible to work with a journaled HFS+ volume in that way: the journal after umount under Linux can be empty, and Mac OS X can mount such a volume without any trouble. But proper support for HFS+ journaling should keep working with the journal while the volume is mounted under Linux.

With the best regards,
Vyacheslav Dubeyko.

> Hin-Tak
>
> > With the best regards,
> > Vyacheslav Dubeyko.
> >
> > > I'll be interested to hear any tips on finding out kmap's limit at run time, if anybody has any ideas...
> > >
> > > Hin-Tak

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html