Re: Corruption with O_DIRECT and unaligned user buffers

Hello!

On Fri, Dec 19, 2008 at 03:34:20PM +0900, KOSAKI Motohiro wrote:
> I think gup_pte_range() doesn't change pte attribute.
> Could you explain why get_user_pages_fast() is evil?

It's evil because it assumed that just relying on
local_irq_disable() to prevent the smp tlb flush IPI from running would
be enough to simulate a 'current' pagetable walk, allowing the current
task to run entirely lockless.

The problem is that being totally lockless prevents us from knowing
whether a page is under direct-io or not. And if a page is under direct
IO that writes to memory (direct IO that only reads from memory we
couldn't care less about, it's always ok), we can't merge pages in ksm
and we can't mark the pte readonly in fork etc... If we do, things
break. The entirely lockless (but atomic) pagetable walk done by the
cpu is different from gup_fast because the one done by the cpu will
never end up writing to the page through the pci bus in DMA, so the
moment the IPI runs, whatever I/O it was doing is interrupted. That is
not the case for gup_fast: when gup_fast returns and the IPI runs, the
page becomes available for sharing in ksm or the pte gets marked
readonly, but the direct DMA is still in flight. That's why gup_fast
*can't* be 100% lockless as it is today, otherwise it's unfixable and
broken, and it's not just ksm. This very O_DIRECT bug in fork is 100%
unfixable without adding some serialization to gup_fast. So my patch
fixes it fully only for kernels before the introduction of gup_fast...
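
To make the ordering above concrete, here is a deterministic user-space
toy model of the race. It is only a sketch: every toy_* name is
hypothetical, the "page" and "pte" are plain structs rather than the
kernel's, and the DMA is a memset. It assumes the failure mode is that
fork write-protects a page pinned for direct IO, the parent then takes
a COW copy, and the device finishes writing into the old page.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Toy stand-ins for a page and a pte; not the kernel's types. */
struct toy_page {
    unsigned char data[TOY_PAGE_SIZE];
    int refcount;                 /* mappings plus gup pins */
};

struct toy_pte {
    struct toy_page *page;
    int writable;
};

/* Models gup_fast: pins the page without any lock fork could notice. */
struct toy_page *toy_gup_fast(struct toy_pte *pte)
{
    pte->page->refcount++;
    return pte->page;
}

/* Models fork: share the page with the child, write-protect both ptes. */
void toy_fork(struct toy_pte *parent, struct toy_pte *child)
{
    child->page = parent->page;
    child->page->refcount++;
    parent->writable = 0;
    child->writable = 0;
}

/* Models a COW write fault: copy the page if it is still shared. */
void toy_write_fault(struct toy_pte *pte)
{
    if (!pte->writable && pte->page->refcount > 1) {
        struct toy_page *copy = malloc(sizeof(*copy));
        assert(copy != NULL);
        memcpy(copy->data, pte->page->data, TOY_PAGE_SIZE);
        copy->refcount = 1;
        pte->page->refcount--;
        pte->page = copy;
    }
    pte->writable = 1;
}

/* Models the device finishing its DMA into the page that was pinned. */
void toy_dma_complete(struct toy_page *pinned, size_t off,
                      unsigned char byte, size_t len)
{
    assert(off + len <= TOY_PAGE_SIZE);
    memset(pinned->data + off, byte, len);
    pinned->refcount--;           /* drop the gup pin */
}

/* Returns 1 if the parent sees the DMA data afterwards, 0 if it is lost. */
int toy_parent_sees_dma(int fork_during_io)
{
    struct toy_page *orig = calloc(1, sizeof(*orig));
    assert(orig != NULL);
    orig->refcount = 1;
    struct toy_pte parent = { orig, 1 }, child = { NULL, 0 };

    /* O_DIRECT read starts: 512 bytes at offset 512, not page aligned. */
    struct toy_page *pinned = toy_gup_fast(&parent);
    if (fork_during_io) {
        toy_fork(&parent, &child);
        toy_write_fault(&parent); /* parent writes elsewhere in the page */
        parent.page->data[0] = 1;
    }
    toy_dma_complete(pinned, 512, 0xAB, 512);

    int seen = parent.page->data[512] == 0xAB;
    if (parent.page != orig)
        free(parent.page);
    free(orig);
    return seen;
}
```

Calling toy_parent_sees_dma(0) models the read with no fork in between,
and toy_parent_sees_dma(1) models a fork plus a parent write while the
DMA is in flight. In the second case the data lands in the page the
parent no longer maps.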

My suggestion is to reintroduce the big reader lock (br_lock) of
2.4 and have gup_fast take the read side of it, and fork/ksm take the
write side. It must not be a write-starving lock like the 2.4 one
though, or fork would hang forever on large smp. It should still be
faster than get_user_pages.
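
As a rough illustration of the kind of lock meant here, below is a
user-space sketch with C11 atomics. It is not the 2.4 br_lock code and
not a proposed kernel patch: the toy_* names, the fixed TOY_NR_CPUS,
and the single "writer" flag are all assumptions made for the example.
Readers (the gup_fast side) only touch their own per-cpu counter, and
they back off as soon as a writer (the fork/ksm side) has announced
itself, so the writer cannot be starved by a stream of new readers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_NR_CPUS 4             /* hypothetical fixed cpu count */

struct toy_brlock {
    atomic_int readers[TOY_NR_CPUS];  /* would be one cache line each */
    atomic_bool writer;               /* set while a writer waits or holds */
};

/* Read side: cheap, touches only this cpu's counter plus the flag. */
bool toy_br_read_trylock(struct toy_brlock *l, int cpu)
{
    assert(cpu >= 0 && cpu < TOY_NR_CPUS);
    if (atomic_load(&l->writer))
        return false;                 /* writer pending: do not enter */
    atomic_fetch_add(&l->readers[cpu], 1);
    if (atomic_load(&l->writer)) {    /* writer raced in: back off */
        atomic_fetch_sub(&l->readers[cpu], 1);
        return false;
    }
    return true;
}

void toy_br_read_lock(struct toy_brlock *l, int cpu)
{
    while (!toy_br_read_trylock(l, cpu))
        ;                             /* spin until the writer is done */
}

void toy_br_read_unlock(struct toy_brlock *l, int cpu)
{
    assert(cpu >= 0 && cpu < TOY_NR_CPUS);
    atomic_fetch_sub(&l->readers[cpu], 1);
}

/* Write side, step 1: announce, which stops new readers from entering. */
bool toy_br_write_announce(struct toy_brlock *l)
{
    bool expected = false;
    return atomic_compare_exchange_strong(&l->writer, &expected, true);
}

/* Write side, step 2: true once every per-cpu reader count is zero. */
bool toy_br_readers_drained(struct toy_brlock *l)
{
    for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
        if (atomic_load(&l->readers[cpu]) != 0)
            return false;
    return true;
}

void toy_br_write_lock(struct toy_brlock *l)
{
    while (!toy_br_write_announce(l))
        ;                             /* another writer holds the lock */
    while (!toy_br_readers_drained(l))
        ;                             /* wait for readers already inside */
}

void toy_br_write_unlock(struct toy_brlock *l)
{
    atomic_store(&l->writer, false);
}
```

In this model gup_fast would bracket its pagetable walk with
toy_br_read_lock()/toy_br_read_unlock() on its own cpu slot, and
fork/ksm would bracket the pte write-protect with
toy_br_write_lock()/toy_br_write_unlock(). The write side is expensive
because it scans every slot, which is the usual trade-off for keeping
the read side cheap.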

> Why rhel can't use memory barrier?

Oh it can, I just didn't implement it immediately as I wanted to ship a
simpler patch first, but given the 27% slowdown measured in a later
email, I'll definitely have to replace the TestSetPageLocked with
smp_rmb and see if the introduced overhead goes away.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
