On Sat, 2012-10-20 at 07:24 +0100, Hin-Tak Leung wrote:
> --- On Fri, 19/10/12, Vyacheslav Dubeyko <slava@xxxxxxxxxxx> wrote:
>
> > Hi Hin-Tak,
> >
> > On Thu, 2012-10-18 at 17:55 +0100, Hin-Tak Leung wrote:
> > > Hi,
> > >
> > > While looking at a few of the older BUG() traces I have been getting while consistently running du on a somewhat large directory with lots of small files and small directories, I noticed that there tend to be two sleeping "? hfs_bnode_read()" entries towards the top. As it is a very small and simple function which just reads a b-tree node record - sometimes only a few bytes between a kmap()/kunmap() - I suspect it might just be the number of simultaneous kmap()s being run. So I put a mutex around it just to make sure only one copy of hfs_bnode_read() runs at a time.
> >
> > Yeah, you touch on a very important problem. The hfsplus driver needs to be reworked away from kmap()/kunmap(), because kmap() is slow, theoretically deadlocky and deprecated. The alternative is kmap_atomic()/kunmap_atomic(), but that needs a deeper look at every place kmap() is used in the hfsplus driver.
> >
> > The mutex is useless. It simply hides the issue.
>
> Yes, I am aware of that - putting mutexes in just means fewer simultaneous kmap() calls, but the limit of simultaneous kmap()s can still be reached - and reasonably easily - just run 'du' a few more times, as I wrote below.

Usually, hfs_bnode_read() is called after searching for some object in the b-tree. It is necessary to initialize struct hfsplus_find_data by means of hfs_find_init() before any search or operation inside a b-tree node, and to call hfs_find_exit() afterwards. The hfsplus_find_data structure contains a mutex that locks the b-tree during the hfs_find_init() call and unlocks it during hfs_find_exit(). And, usually, hfs_bnode_read() is placed between hfs_find_init()/hfs_find_exit() calls. So, as I understand it, your mutex inside hfs_bnode_read() is useless. But maybe not all hfs_bnode_read() calls are placed inside hfs_find_init()/hfs_find_exit() pairs - that needs to be checked.

I can't clearly understand what simultaneous kmap()s you are talking about. As far as I can see, usually (especially in the case of hfs_bnode_read()) the kmap()/kunmap() pairs are localized to small sections of code, and I expect them to execute quickly. Do I misunderstand something?

> I tried swapping those for kmap_atomic()/kunmap_atomic() (beware, the arguments are different for unmap) - but the kernel immediately warned that *_atomic() code is used where code can sleep.

Could you describe the warning in more detail? I thought that in hfs_bnode_read() the kmap()/kunmap() pairs are localized to small sections of code and can be changed to kmap_atomic()/kunmap_atomic().
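Something like the untested sketch below is what I mean. It follows the shape of the per-page copy loop in fs/hfsplus/bnode.c (field names and loop structure are from memory, so treat it as a sketch rather than a patch); only a memcpy() sits between the map and the unmap, so nothing should be able to sleep while the atomic mapping is held:

/*
 * Untested sketch only: hfs_bnode_read() using kmap_atomic() instead of
 * kmap().  Nothing between the map and the unmap can sleep (it is only a
 * memcpy()), so the atomic variant should be legal here.
 */
#include <linux/highmem.h>	/* kmap_atomic(), kunmap_atomic() */
#include <linux/pagemap.h>	/* PAGE_CACHE_SHIFT, PAGE_CACHE_SIZE */
#include <linux/string.h>	/* memcpy() */
#include "hfsplus_fs.h"		/* struct hfs_bnode */

void hfs_bnode_read(struct hfs_bnode *node, void *buf, int off, int len)
{
	struct page **pagep;
	void *kaddr;
	int l;

	off += node->page_offset;
	pagep = node->page + (off >> PAGE_CACHE_SHIFT);
	off &= ~PAGE_CACHE_MASK;

	/* copy the part that lives in the first page */
	l = min_t(int, len, PAGE_CACHE_SIZE - off);
	kaddr = kmap_atomic(*pagep);
	memcpy(buf, kaddr + off, l);
	kunmap_atomic(kaddr);

	/* any remaining pages are copied from offset 0 */
	while ((len -= l) != 0) {
		buf += l;
		l = min_t(int, len, PAGE_CACHE_SIZE);
		kaddr = kmap_atomic(*++pagep);
		memcpy(buf, kaddr, l);
		kunmap_atomic(kaddr);
	}
}

If the warning you saw points at hfs_bnode_read() itself, then something between the map and the unmap must be sleeping after all; if it points at one of the callers, it is probably a place that keeps a whole node mapped across other operations, and those are the spots that need the deeper rework.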
> > > This seems to make it much harder to get a BUG() - I needed to run du a few times over and over to get it again. Of course it might just be that the mutex slows the driver down enough to make it less likely to get confused, but as I read that the number of simultaneous kmap()s in the kernel is limited, I think I might be on to something. Also, this shifts the problem onto multiple copies of "? hfsplus_bmap()" (which also kmap()/kunmap()s, but is much more complicated).
> >
> > Namely, the mutex hides the issue.
> >
> > > I thought of doing hfsplus_kmap()/etc. (which seems to have existed a long time ago but was removed!), but this might cause deadlocks since some of the hfsplus code is kmapping/kunmapping all the time, and recursively. So a better way might be just to make sure that only one instance of some of the routines runs at a time, i.e. multiple mutexes. This is both ugly and sounds like voodoo though. Also, I am not sure why the existing mutexes, which protect some of the internal structures, don't protect against too many kmap()s (maybe they protect "writes", but not against too many simultaneous reads). So does anybody have an idea how many kmaps are allowed, and how to tell that I am close to my machine's limit?
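On the limit question: with CONFIG_HIGHMEM on 32-bit x86, the persistent kmap() pool has only LAST_PKMAP slots (1024, or 512 with PAE), and kmap() on a highmem page sleeps in kmap_high() when no slot is free. As far as I know there is no counter exported for it; tasks that are stuck simply show up in stack traces sleeping inside kmap_high(). A single map/copy/unmap per page, as hfs_bnode_read() does, pins at most one slot per task, so the pattern to look for is code that holds one mapping while asking for another - roughly like this illustration (not actual hfsplus code):

#include <linux/highmem.h>
#include <linux/string.h>

/*
 * Illustration only: each kmap() pins one of the LAST_PKMAP slots until
 * the matching kunmap().  If many tasks hold one mapping while sleeping
 * inside a second kmap(), waiting for a free slot, they can exhaust the
 * pool or even deadlock against each other.
 */
static void copy_pair(struct page *a, struct page *b, void *dst,
		      int off_a, int off_b, int len)
{
	void *va = kmap(a);	/* may sleep waiting for a free pkmap slot */
	void *vb = kmap(b);	/* may sleep again while still holding 'va' */

	memcpy(dst, va + off_a, len);
	memcpy(dst + len, vb + off_b, len);

	kunmap(b);
	kunmap(a);
}

That kind of nesting - especially the recursive mapping you mention - would also fit your observation that the problem then shifts to the more complicated bmap code.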
> > As I understand it, hfsplus_kmap() didn't do anything useful. What really needs to happen is to rework the kmap()/kunmap() usage, instead of adding mutexes.
> >
> > Could you try to fix this issue? :-)
>
> Am *trying* :-). Hence this request for discussion & help. I do think that the hfsplus driver is kmapping/unmapping too often - and doing so on very small pieces of data, which do not map well onto whole pages. I think one possibility for improvement is to reorganize the internal representation of the b-tree - translate it into a more page-filling structure, if that makes sense - rather than mapping/unmapping pages all the time to read very small pieces off each page.

Sorry, I do not fully understand your description. Currently, I think that the existing implementation of the b-tree functionality looks good, and I can't see any real necessity to change it. Could you describe what you are trying in more detail?

> I still cannot quite get my head around how (1) essentially read-only operations can get worse and worse if you run them a few more times, and (2) it seems to be just the kernel's internal representation of the filesystem getting more and more confused - there does not seem to be any write on unmount, and if you unmount and run fsck it says "no need to do anything", and you can re-mount and play with 'du' again.
>
> > > Also a side note on the Netgear journalling code: I see that it journals the volume header and some of the special files (the catalog, allocation bitmap, etc.), but (1) it has some code to journal the attributes file which is actually non-functional, since without Vyacheslav's recent patches the Linux kernel doesn't even read/write that correctly, let alone do *journalled* reads/writes correctly, and (2) there is a part which tries to do data-page journalling, but it seems to be wrong - or at least not quite working (this I found while looking at some curious warning messages and how they come about). Luckily that code just bails out when it gets confused - i.e. it does non-journalled writes rather than writing a wrong journal to disk - so it doesn't harm data under routine normal use (i.e. mounting/unmounting cleanly).
> > >
> > > But that got me worrying a bit about inter-operability: it is probably unsafe to use Linux to replay a journal written by Mac OS X, and vice versa. I.e. if you have a dual-boot machine, or a portable disk that you use between the two OSes, and it disconnects/unplugs/crashes under one OS, it is better to plug it right back in, let the same OS replay the journal and unmount cleanly before using it under the other OS.
> >
> > The journal should be replayed during every mount if valid transactions are present. An HFS+ volume shouldn't be mounted without replaying the journal; otherwise it is possible to end up with a corrupted partition. Just imagine: you mount an HFS+ partition with a non-empty journal and then add some data to the volume, which means you modify metadata. If you then mount such an HFS+ volume under Mac OS X, the journal will be replayed and the metadata will be corrupted.
>
> Both OSes try to replay on first mount - but I doubt that they create/use the journal in exactly the same way, so inter-operability is not guaranteed - i.e. it is not recommended to reboot into Mac OS X after an unclean shutdown of Linux, or vice versa. The Netgear code does seem to be self-consistent, though - i.e. it replays journals created by itself okay, I should hope.
>
> What is clearly a problem is that the Netgear code bails out too often and does *not* write a journal entry (and therefore does not clear the transaction after the data write) for some data writes - it basically just writes data without an accompanying journal transaction and its completion.

I think the situation is simple. :-) We can claim to have support for HFS+ journaling under Linux when the Linux code can correctly replay the journal (whether its transactions were filled in under Mac OS X or under Linux) and can operate on a journaled HFS+ partition without any complaints from fsck, and with the volume still mounting successfully under Mac OS X. As I understand it (maybe I am wrong), the currently existing Linux implementation of HFS+ journaling only tries to replay the journal during HFS+ volume mount and does not touch the journal after that. It is possible to work with a journaled HFS+ volume in that way: the journal after umount under Linux can be empty, and Mac OS X can mount such a volume without any trouble. But proper support for HFS+ journaling should keep working with the journal while the volume is mounted under Linux.

With the best regards,
Vyacheslav Dubeyko.

> Hin-Tak
>
> > With the best regards,
> > Vyacheslav Dubeyko.
> >
> > > I'll be interested to hear any tips on finding out kmap's limit at run time, if anybody has any ideas...
> > >
> > > Hin-Tak

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html