--- On Fri, 19/10/12, Vyacheslav Dubeyko <slava@xxxxxxxxxxx> wrote:

> Hi Hin-Tak,
>
> On Thu, 2012-10-18 at 17:55 +0100, Hin-Tak Leung wrote:
> > Hi,
> >
> > While looking at a few of the older BUG() traces I get consistently
> > when running du on a somewhat large directory with lots of small
> > files and small directories, I noticed that they tend to have two
> > sleeping "? hfs_bnode_read()" entries towards the top. As that is a
> > very small and simple function which just reads a b-tree node
> > record - sometimes only a few bytes between a kmap()/kunmap()
> > pair - I suspect the problem might just be the number of
> > simultaneous kmap()'s being run. So I put a mutex around it just to
> > make sure only one copy of hfs_bnode_read() is run at a time.
>
> Yeah, you touch on a very important problem. The hfsplus driver needs
> to be reworked away from using kmap()/kunmap(), because kmap() is
> slow, theoretically deadlocky, and deprecated. The alternative is
> kmap_atomic()/kunmap_atomic(), but that requires diving more deeply
> into every case of kmap() use in the hfsplus driver.
>
> The mutex is useless. It simply hides the issue.

Yes, I am aware of that - putting mutexes in just results in fewer
concurrent kmap() calls, but the limit of simultaneous kmap()'s can
still be reached - and reasonably easily: just run 'du' a few more
times, as I wrote below.

I tried swapping those for kmap_atomic()/kunmap_atomic() (beware: the
arguments to the unmap differ from kunmap()'s) - but the kernel
immediately warned that *_atomic() code was being used where code can
sleep.

> > This seems to make it much harder to get a BUG() - I needed to run
> > du a few times over and over to get it again. Of course it might
> > just be the mutex slowing the driver down enough to make it less
> > likely to get confused, but as I read that the number of
> > simultaneous kmap()'s in the kernel is limited, I think I might be
> > on to something. Also this shifts the problem onto multiple copies
> > of "? hfsplus_bmap()"
> > (which also kmap()/kunmap()'s, but is much more complicated).
>
> Namely, the mutex hides the issue.
>
> > I thought of doing hfsplus_kmap() etc. (which seems to have existed
> > a long time ago but was removed!), but this might cause deadlocks,
> > since some of the hfsplus code is kmapping/kunmapping all the time,
> > and recursively. So a better way might be just to make sure that
> > only one instance of some of the routines is run at a time, i.e.
> > multiple mutexes. This is both ugly and sounds like voodoo though.
> > Also I am not sure why the existing mutexes, which protect some of
> > the internal structures, don't protect against too many kmap()'s.
> > (Maybe they protect "writes", but not against too many simultaneous
> > reads.) So does anybody have an idea how many kmap()'s are allowed,
> > and how to tell that I am close to my machine's limit?
>
> As I can understand, the hfsplus_kmap() doesn't do anything useful.
> The kmap()/kunmap() usage really needs to be reworked, instead of
> using a mutex.
>
> Could you try to fix this issue? :-)

Am *trying* :-). Hence this request for discussion & help.

I do think that the hfsplus driver is kmapping/kunmapping too often -
and doing so on very small pieces of data, which does not map well
onto page-sized operations. I think one possibility for improvement is
to reorganize the internal representation of the b-tree - translate it
into a more page-filling structure, if that makes sense, rather than
mapping/unmapping pages all the time to read very small pieces off
each page.

I still cannot quite get my head around how (1) essentially read-only
operations can get worse and worse if you run them a few more times;
(2) it seems that it is just the kernel's internal representation of
the filesystem getting more and more confused - there does not seem to
be any write on unmount, and if you unmount and run fsck it says "no
need to do anything", and you can re-mount and play with 'du' again.
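For concreteness, the serialization experiment I described looks
roughly like the sketch below. This is a simplified, single-page-only
illustration modelled loosely on hfs_bnode_read() in
fs/hfsplus/bnode.c, not the verbatim mainline code (the real function
also handles reads spanning two pages, and it does not compile outside
a kernel tree); the mutex name is mine and is not in the mainline
driver.

```c
/* Sketch only: modelled loosely on hfs_bnode_read() in
 * fs/hfsplus/bnode.c.  The mutex is the experiment described above
 * (not mainline); only the single-page case is shown. */
static DEFINE_MUTEX(hfs_bnode_read_mutex);

void hfs_bnode_read(struct hfs_bnode *node, void *buf, int off, int len)
{
	struct page *page;

	mutex_lock(&hfs_bnode_read_mutex);  /* at most one kmap() at a time */

	off += node->page_offset;
	page = node->page[off >> PAGE_CACHE_SHIFT];

	memcpy(buf, kmap(page) + (off & ~PAGE_CACHE_MASK), len);
	kunmap(page);

	mutex_unlock(&hfs_bnode_read_mutex);
}

/* The kmap_atomic() variant I tried - note that kunmap_atomic() takes
 * the mapped address, not the page, unlike kunmap().  kmap_atomic()
 * disables preemption, which is why the kernel warns if anything
 * between the map and the unmap can sleep. */
static void bnode_read_atomic(struct page *page, void *buf, int off, int len)
{
	void *vaddr = kmap_atomic(page);

	memcpy(buf, vaddr + off, len);
	kunmap_atomic(vaddr);               /* address, not page */
}
```

This is only meant to illustrate why the mutex reduces, but does not
bound, the number of concurrent kmap()'s: it serializes this one
function, while other callers elsewhere in the driver still kmap()
freely.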
> > Also a side note on the Netgear journalling code: I see that it
> > journals the volume header and some of the special files (the
> > catalog, allocation bitmap, etc.), but (1) it has some code to
> > journal the attributes file, which was actually non-functional,
> > since without Vyacheslav's recent patches the linux kernel doesn't
> > even read/write that correctly, let alone do *journalled*
> > read/write correctly; (2) there is a part which tries to do
> > data-page journalling, but it seems to be wrong - or at least not
> > quite working. (This I found while I was looking at some curious
> > warning messages and how they come about.) Luckily that code just
> > bails out when it gets confused - i.e. it does non-journalled
> > writes, rather than writing a wrong journal to disk. So it doesn't
> > harm data under normal routine use (i.e. mount/unmount cleanly).
> > But that got me worrying a bit about inter-operability: it is
> > probably unsafe to use Linux to replay the journal written by
> > Mac OS X, and vice versa. I.e. if you have a dual-boot machine, or
> > a portable disk that you use between two OSes, and it
> > disconnects/unplugs/crashes under one OS, it is better to plug it
> > right back in and let the same OS replay the journal, then unmount
> > cleanly, before using it under the other OS.
>
> The journal should be replayed during every mount when valid
> transactions are present. An HFS+ volume shouldn't be mounted without
> replaying the journal; otherwise it is possible to end up with a
> corrupted partition. Just imagine: you mount an HFS+ partition with a
> non-empty journal and then add some data to the volume. That means
> you modify metadata. If you then mount such an HFS+ volume under
> Mac OS X, the journal will be replayed and the metadata will be
> corrupted.

Both OSes try to replay on first mount - but I doubt that they
create/use the journal in the same way, so inter-operability is not
guaranteed - i.e.
it is not recommended to reboot into Mac OS X from an unclean shutdown
of Linux, or vice versa. The Netgear code does seem to be
self-consistent, though - i.e. it replays journals created by itself
okay, I should hope. What is clearly a problem is that the Netgear
code bails out too often and does *not* write a journal (and therefore
does not clear the transaction after the data write) for some data
writes - it basically just writes data without an accompanying journal
entry & its finishing transaction.

Hin-Tak

> With the best regards,
> Vyacheslav Dubeyko.

> > I'll be interested in hearing any tips on finding out kmap's limit
> > at run time, if anybody has any idea...
> >
> > Hin-Tak

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel"
in the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html