Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 05/04/2011 10:19 PM, Jan Kara wrote:
On Wed 04-05-11 15:09:37, Surbhi Palande wrote:
On 05/03/2011 06:19 PM, Jan Kara wrote:
On Tue 03-05-11 14:01:50, Surbhi Palande wrote:
On 04/18/2011 12:05 PM, Toshiyuki Okajima wrote:
(2011/04/16 2:13), Jan Kara wrote:
Hello,

On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
For ext3 or ext4 without delayed allocation we block inside writepage()
function. But as I wrote to Dave Chinner, ->page_mkwrite() should
probably
get modified to block while minor-faulting the page on frozen fs
because
when blocks are already allocated we may skip starting a transaction
and so
we could possibly modify the filesystem.
OK. I think ->page_mkwrite() should also block writing the
minor-faulting pages.

(minor-pagefault)
->   do_wp_page()
->   page_mkwrite(= ext4_mkwrite())
=>   BLOCK!

(major-pagefault)
->   do_liner_fault()
->   page_mkwrite(= ext4_mkwrite())
=>   BLOCK!


Mizuma-san's reproducer also writes the data which maps to the
file (mmap).
The original problem happens after the fsfreeze operation is done.
I understand the normal write operation (not mmap) can be blocked
while
fsfreezing. So, I guess we don't always block all the write
operation
while fsfreezing.
Technically speaking, we block all the transaction starts which
means we
end up blocking all the writes from going to disk. But that does
not mean
we block all the writes from going to in-memory cache - as you
properly
note the mmap case is one of such exceptions.
Hm, I also think we can allow the writes to in-memory cache but we
can't allow
the writes to disk while fsfreezing. I am considering that mmap
path can
write to disk while fsfreezing because this deadlock problem
happens after
fsfreeze operation is done...
I'm sorry I don't understand now - are you speaking about the case
above
when writepage() does not wait for filesystem being frozen or something
else?
Sorry, I didn't understand around the page fault path.
So, I had read the kernel source code around it, then I maybe
understand...

I worry whether we can update the file data in mmap case while
fsfreezing.
Of course, I understand that we can write to in-memory cache, and it
is not a
problem. However, if we can write to disk while fsfreezing, it is a
problem.
So, I summarize the cases whether we can write to disk or not.

--------------------------------------------------------------------------

Cases (Whether we can write the data mmapped to the file on the disk
while fsfreezing)

[1] One of the page which has been mmapped is not bound. And
the page is not allocated yet. (major fault?)

(1) user dirtys a page
(2) a page fault occurs (do_page_fault)
(3) __do_falut is called.
(4) ext4_page_mkwrite is called
(5) ext4_write_begin is called
(6) ext4_journal_start_sb =>   We can STOP!

[2] One of the page which has been mmapped is not bound. But
the page is already allocated, and the buffer_heads of the page
are not mapped (BH_Mapped). (minor fault?)

(1) user dirtys a page
(2) a page fault occurs (do_page_fault)
(3) do_wp_page is called.
(4) ext4_page_mkwrite is called
(5) ext4_write_begin is called
(6) ext4_journal_start_sb =>   We can STOP!

What happens in the case as follows:

Task 1: Mmapped writes
t1)ext4_page_mkwrite()
   t2) ext4_write_begin() (FS is thawed so we proceed)
   t3) ext4_write_end() (journal is stopped now)
-----Pre-empted-----


Task 2: Freeze Task
t4) freezes the super block...
...(continues)....
tn) the page cache is clean and the F.S is frozen. Freeze has
completed execution.

Task 1: Mmapped writes
tn+1) ext4_page_mkwrite() returns 0.
tn+2) __do_fault() gets control, code gets executed.
tn+3) _do_fault() marks the page dirty if the intent is to write to
a file based page which faulted.

So you end up dirtying the page cache when the F.S is frozen? No?
   You are right ext4_page_mkrite() as currently implemented has problems.
You have to return the page locked (and check for frozen fs with page lock
held) to avoid races.

If you check for frozen fs with page lock held, you are guaranteed that
freezing code must wait for the page to get unlocked before proceeding. And
before the page is unlocked, it is marked dirty by the pagefault code which
makes freezing code write the page and writeprotect it again. So everything
will be safe.
For the locked page to be a part of the freeze initiated sync,
should its owner inode not be dirtied? The page fault handler
dirties the page, but who ensures that the inode is dirtied at this
point?
Well, I mean it as follows:

Doesn't the writeback code (invoked via sync_filesystem(sb)) write all the dirty pages of all the _dirty_ inodes of a superblock?

So in the window from the point where ext4_page_mkwrite returns to __do_fault() _till_ you mark the inode dirty (in __mark_inode_dirty()), you can have a race with freeze i.e if freeze happens meanwhile, then the sync initiated by freeze will not consider this locked page as the owner inode is _clean_ (or not dirtied yet) at that point?

Key: tx: time at unit x

P1: mmapped writes
t1) __do_page_fault()
   t2) ext4_page_mkwrite()
      // owner inode of the page is in _clean_ state - not yet dirtied
   --- pre-empted---

P2: Freeze_super
tn) freeze_super gets control
freezes the F.S, skips the owner inode as it is in the clean state. syncs all the other dirty inodes. page cache is now clean.


P1: mmapped writes (resume)
tn+x)__do_page_fault() gets control back:
   tn+x+1) set_page_dirty()
     tn+x+2) __set_page_dirty_buffers()
        tn+x+3) __set_page_dirty()
 	   tn+x+4) radix_tree_tag_set(page, PAGECACHE_TAG_DIRTY)

So don't we end up dirtying the page cache when the F.S is frozen?

Again, apologies if I understood the writeback code or something else wrong!

Warm Regards,
Surbhi.

   Follow the path from set_page_dirty() ->  __set_page_dirty_buffers()
->  __set_page_dirty() ->  __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);



   More code reading would save you (and me) some typing ;).
P/S: Sorry about that!


								Honza

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [Samba]     [Device Mapper]     [CEPH Development]
  Powered by Linux