I had some time this weekend to revisit an old problem: http://www.linux-mips.org/archives/linux-mips/2010-05/msg00009.html PROBLEM #1: When testing Dezhong's patch, I found that un-kmapped pages could get passed into dma_map_sg(). The original patch did not handle this case. This resulted in serious DMA coherency issues on my system, along the lines of "/sbin/init segfaults on boot." I found that this calling sequence, seen during SATA disk I/O, was the culprit: [<8002b3d4>] dma_map_sg+0x2ac/0x2b8 [<802cdb70>] ata_qc_issue+0x274/0x37c [<802d383c>] ata_scsi_translate+0xb0/0x1e8 [<802d7274>] ata_scsi_queuecmd+0xc8/0x2b8 [<802b0630>] scsi_dispatch_cmd+0x118/0x2b8 [<802b6c64>] scsi_request_fn+0x420/0x4d8 [<80239550>] __blk_run_queue+0x84/0x19c [<802364a0>] elv_insert+0x158/0x2e4 [<8023a398>] __make_request+0x11c/0x4d0 [<80238144>] generic_make_request+0x348/0x4a0 [<80238330>] submit_bio+0x94/0x13c [<80114024>] mpage_bio_submit+0x30/0x40 [<80115a2c>] mpage_readpages+0x134/0x170 [<800a4fd4>] __do_page_cache_readahead+0x214/0x314 __do_page_cache_readahead() allocates a bunch of pages for the I/O requests, but I do not see them getting mapped to kernel addresses via kmap() / kmap_atomic(). This explains why (PageHighMem(page) && !kmap_high_get(page)) can be true. I do not think there is anything wrong with this behavior, so it seems like dma_map_sg() will need to handle it somehow. Looking at other architectures: The PPC approach (__dma_sync_page_highmem()) is to disable IRQs and call kmap_atomic() to create a temporary mapping for the page. Disabling IRQs is necessary because it is possible (but not required) for the dma_* functions to be invoked from interrupt context; it also disables preemption. kmap_atomic() guarantees a unique mapping per CPU. Interestingly, PPC does not use kmap_high_get() at all. The ARM approach (kmap_high_l1_vipt()) recognizes that it is not always desirable to keep IRQs disabled during DMA flushes, so the ARM maintainers implemented a sort of "reentrant kmap_atomic()" that allows multiple contexts to share the same pte by saving and restoring whatever was there prior to the DMA sync. Due to its use of kmap_high_get(), the ARM approach suffers from problem #2, below. It is also more complex and harder to test. I'm not sure how to recreate some of the worst corner cases, e.g. hardirq cacheflush interrupts a softirq cacheflush which interrupted a user cacheflush. Dezhong's patch was largely based on the ARM scheme, minus the kmap_high_l1_vipt() logic. I am sending an update to this patch which attempts to imitate the PPC approach instead. There is a considerable amount of reuse since both strategies require similar modifications to dma-default.c, in order to get that code to pass around "struct page" pointers rather than kseg0 addresses. PROBLEM #2: Consider this sequence: mm/highmem.c:kmap_high() mm/highmem.c:map_new_virtual() mm/highmem.c:flush_all_zero_pkmaps() arch/mips/kernel/smp.c:flush_tlb_kernel_range() kernel/softirq.c:on_each_cpu() kernel/smp.c:smp_call_function() The first thing kmap_high() does is lock_kmap(), which disables interrupts on kmap_high_get() architectures: #ifdef ARCH_NEEDS_KMAP_HIGH_GET #define lock_kmap() spin_lock_irq(&kmap_lock) ... #else #define lock_kmap() spin_lock(&kmap_lock) #endif smp_call_function() may not be called with interrupts disabled: * You must not call this function with disabled interrupts or from a * hardware interrupt handler or from a bottom half handler. So, on SMP, we get warnings like: ------------[ cut here ]------------ WARNING: at kernel/smp.c:293 smp_call_function_single+0x17c/0x260() Modules linked in: Call Trace: [<800160f0>] dump_stack+0x8/0x34 [<8004a544>] warn_slowpath_common+0x78/0xa4 [<8004a588>] warn_slowpath_null+0x18/0x24 [<800876e4>] smp_call_function_single+0x17c/0x260 [<80087ca8>] smp_call_function+0x28/0x38 [<800531b4>] on_each_cpu+0x1c/0x80 [<80026d20>] flush_tlb_kernel_range+0x28/0x34 [<800b95f4>] kmap_high+0x1dc/0x270 [<8002a570>] __kmap+0x60/0x7c [<801f15b4>] do_readpage+0x68/0x540 [<801f218c>] ubifs_write_begin+0xbc/0x53c [<8009a558>] generic_perform_write+0xd4/0x1f0 [<8009a6e4>] generic_file_buffered_write+0x70/0xbc [<8009dd3c>] __generic_file_aio_write+0x2fc/0x600 [<8009e0b0>] generic_file_aio_write+0x70/0xf4 [<800da864>] do_sync_write+0xc4/0x13c [<800db50c>] vfs_write+0xc0/0x168 [<800db6ac>] sys_write+0x4c/0xa4 [<80003d9c>] stack_done+0x20/0x3c ---[ end trace 8abde6adefbcc81f ]--- I did some digging and found that ARM runs into the same problem. For processors that cannot "broadcast" TLB operations, SMP + HIGHMEM are deemed incompatible: http://www.spinics.net/lists/arm-kernel/msg74563.html So, I would opt for the PPC approach in order to avoid this conflict. Disabling SMP will result in a much nastier performance problem than blocking interrupts during flushes. PROBLEM #3: Regarding David's flush_data_cache_page() concern: http://www.linux-mips.org/archives/linux-mips/2008-03/msg00011.html There are at least 4 reasons to flush the cache: 1) Boot time (cache contents are undefined at reset) 2) DMA coherence 3) I$/D$ coherence (self-modifying code) 4) Zap cache aliases My interpretation of the code is that flush_dcache_page() is only called for #3 and #4: #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1 static inline void flush_dcache_page(struct page *page) { if (cpu_has_dc_aliases || !cpu_has_ic_fills_f_dc) __flush_dcache_page(page); } The fact that the L2 / board cache is not flushed in c-r4k.c leads me to believe that #2 is not an intended use of this function: static inline void local_r4k_flush_data_cache_page(void * addr) { r4k_blast_dcache_page((unsigned long) addr); } Besides, we already have a standard, documented DMA API that should be used instead. Since this is not boot time, and HIGHMEM is incompatible with cache aliases, that leaves #3 as a possible issue. Is this function ever used for I$/D$ coherence? I should note that ARM has special handling in __flush_dcache_page() for high pages. IIRC, at one point they tried to support HIGHMEM on systems with cache aliases, before deciding it was not feasible. Maybe this code is just a relic - or maybe it's really needed for something.