Hi Russell/Nick,

Thanks for the replies. I would like to reply to the mails ASAP (out of interest, too), but have not been able to so far. I do not yet understand everything that has been discussed, and will spend more time on it before I respond to your suggestions.

We are working on ARM11-based embedded systems and were debugging an issue with the mkfs.xfs utility (which uses O_DIRECT). After quite a bit of debugging, we found a pattern in the addresses of the user-space buffers allocated with memalign: for some buffer addresses, the format would result in a corrupt disk. By applying this patch, I was able to allocate memory that was page color aligned (as required by the ARM technical reference manual). We have observed the same problem on systems with VIVT caches (ARM9 processor). We also found that we could make the problem go away by disabling the cache.

Apart from that, the problem was observed not during the O_DIRECT write but at the time of the read. When the same buffer was reused and the user virtual address (UVA) and kernel virtual address (KVA) did not meet the alignment requirements, the d_cache_invalidate operation on the KVA (in the consistent_sync function in consistent.c) did not have the desired effect. When the same memory was then accessed through the UVA (which should have been invalidated along with the KVA), it corrupted the buffer we had just read. In the mkfs.xfs utility, this corrupted the disk format.

In layman's terms, it looked as if the KVA and UVA occupied separate locations in the cache when they were used one after the other (specifically, KVA after UVA).

I hope this is of immediate help. I will comment on the strategies proposed by Nick in my next mail. Further, as I have said, our kernel version is linux-2.6.18.5 with uclibc-0.9.28; the mkfs version is irrelevant.
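To make the alignment requirement concrete, here is a minimal sketch of how the cache color of an address can be computed. It assumes 4 KiB pages and a 16 KiB aliasing span (the value Linux uses for SHMLBA on ARMv6); the names cache_color, same_color and alloc_color_aligned are hypothetical, and this is not the patch mentioned above. Note that aligning the start of a user buffer only fixes the color of the user address; whether it matches the kernel address of the same page depends on which physical page backs it.

```c
#define _POSIX_C_SOURCE 200112L   /* for posix_memalign */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed values, not taken from the mail: 4 KiB pages and a 16 KiB
 * aliasing span, i.e. four possible page colors. */
#define PAGE_SZ     4096u
#define ALIAS_SPAN  (4u * PAGE_SZ)

/* Color of an address: which page-sized slot of the aliasing span it
 * falls into (0..3 with the values above). */
static unsigned cache_color(uintptr_t addr)
{
    return (unsigned)((addr % ALIAS_SPAN) / PAGE_SZ);
}

/* Two virtual mappings of one physical page index the same cache
 * lines only if their colors match. */
static int same_color(uintptr_t a, uintptr_t b)
{
    return cache_color(a) == cache_color(b);
}

/* Allocate a buffer whose first page has color 0.
 * Returns NULL on failure; release with free(). */
static void *alloc_color_aligned(size_t len)
{
    void *p = NULL;

    if (posix_memalign(&p, ALIAS_SPAN, len) != 0)
        return NULL;
    assert(cache_color((uintptr_t)p) == 0);
    return p;
}
```

With these assumptions, a UVA of 0x40001000 and a KVA of 0xc0005000 share color 1, while 0x40001000 and 0xc0002000 do not, which is the kind of mismatch described above.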
Thanks,
Naval

On 11/20/08, Russell King - ARM Linux <linux@xxxxxxxxxxxxxxxx> wrote:
> On Thu, Nov 20, 2008 at 05:59:00PM +1100, Nick Piggin wrote:
> > Basically, an O_DIRECT write involves:
> >
> > - The program storing into some virtual address, then passing that virtual
> >   address as the buffer to write(2).
> >
> > - The kernel will get_user_pages() to get the struct page * of that user
> >   virtual address. At this point, get_user_pages does flush_dcache_page.
> >   (Which should write back the user caches?)
> >
> > - Then the struct page is sent to the block layer (it won't tend to be
> >   touched by the kernel via the kernel linear map, unless we have like an
> >   "emulated" block device like 'brd').
> >
> > - Even if it is read via the kernel linear map, AFAIKS, we should be OK
> >   due to the flush_dcache_page().
>
> That seems sane, and yes, flush_dcache_page() will write back and
> invalidate dirty cache lines in both the kernel and user mappings.
>
> > An O_DIRECT read involves:
> >
> > - Same first 2 steps as O_DIRECT write, including flush_dcache_page. So the
> >   user mapping should not have any previously dirtied lines around.
> >
> > - The page is sent to the block layer, which stores into the page. Some
> >   block devices like 'brd' will potentially store via the kernel linear map
> >   here, and they probably don't do enough cache flushing. But a regular
> >   block device should go via DMA, which AFAIK should be OK? (the user address
> >   should remain invalidated because it would be a bug to read from the buffer
> >   before the read has completed)
>
> This is where things get icky with lots of drivers - DMA is fine, but
> many PIO based drivers don't handle the implications of writing to the
> kernel page cache page when there may be CPU cache side effects.
>
> If the cache is in read allocate mode, then in this case there shouldn't
> be any dirty cache lines. (That's not always the case though, esp. via
> conventional IO.) If the cache is in write allocate mode, PIO data will
> sit in the kernel mapping and won't be visible to userspace.
>
> That is a years-old bug, one that I've been unable to run tests for here
> (because my platforms don't have the right combinations of CPUs supporting
> write alloc and/or a problem block driver.) I've even been accused of
> being uncooperative over testing possible bug fixes by various people
> (if I don't have hardware which can show the problem, how can I test
> possible fixes?) So I've given up with that issue - as far as I'm
> concerned, it's a problem for others to sort out.
>
> Do we know what hardware, which IO drivers are being used, and any
> relevant configuration of the drivers?
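For reference, the user-space side of the O_DIRECT write and read steps quoted above can be sketched roughly as follows. This is only an illustration, not what mkfs.xfs actually does: the function name direct_roundtrip, the 4096-byte alignment and transfer size, and the fallback to buffered I/O when the filesystem rejects O_DIRECT are all assumptions made for the example.

```c
#define _GNU_SOURCE   /* for O_DIRECT */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Assumed alignment and transfer size; real requirements depend on
 * the device and filesystem. */
#define DIO_ALIGN 4096

/* Write one aligned block with O_DIRECT, read it back into a second
 * aligned buffer and compare.
 * Returns 0 if the data matches, 1 on mismatch, -1 on error. */
static int direct_roundtrip(const char *path)
{
    void *wbuf = NULL, *rbuf = NULL;
    int rc = -1;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC | O_DIRECT, 0600);

    /* Some filesystems refuse O_DIRECT; fall back to buffered I/O
     * so the sketch still runs there. */
    if (fd < 0 && errno == EINVAL)
        fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    if (posix_memalign(&wbuf, DIO_ALIGN, DIO_ALIGN) != 0) {
        wbuf = NULL;
        goto out;
    }
    if (posix_memalign(&rbuf, DIO_ALIGN, DIO_ALIGN) != 0) {
        rbuf = NULL;
        goto out;
    }

    memset(wbuf, 0xA5, DIO_ALIGN);   /* the program stores into the buffer */
    memset(rbuf, 0, DIO_ALIGN);

    if (pwrite(fd, wbuf, DIO_ALIGN, 0) != DIO_ALIGN)   /* O_DIRECT write */
        goto out;
    if (pread(fd, rbuf, DIO_ALIGN, 0) != DIO_ALIGN)    /* O_DIRECT read */
        goto out;

    rc = (memcmp(wbuf, rbuf, DIO_ALIGN) == 0) ? 0 : 1;
out:
    free(wbuf);
    free(rbuf);
    close(fd);
    return rc;
}
```

A return value of 1 from a call such as direct_roundtrip("/tmp/dio_test.bin") would indicate that the data read back differs from what was written.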