[NILFS2][BUG FIX][STATUS - February 7, 2013] flush kernel thread issue is under fix

[REPORTED BUGS]
1. Flush kernel thread issue - [under investigation and fix].
2. Many "NILFS: bad btree node" messages (read-only fs) - [under investigation].
3. Odd problem starting nilfs_cleanerd due to eMMC misbehaviour - [TODO: investigation].
4. Kernel panic in nilfs - [TODO: investigation].
5. Uncleanable file system because of a time issue - [TODO: fix].

[UNDER FIX]
Flush kernel thread issue

SYMPTOMS:
The flush kernel thread uses 60 - 100% of CPU time for 5 - 40 minutes.

REPRODUCING PATH:
1. Create a file of about 100 - 500 GB (for example, with "dd if=/dev/zero of=<file_name> bs=1048576 count=102400"). The size of the file determines how long the issue persists.
2. Run an "rm", "truncate" or "dd" command on it. "dd" has to be run without the notrunc option (for example, "dd if=/dev/urandom of=<file_name> bs=1024 seek=51200 count=10").
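
For reference, the example commands map onto plain system calls. The C sketch below is an illustration only (the macro and function names are made up for this example): it spells out the sizes the two dd commands imply (100 GiB written, file cut at 50 MiB) and shows that "dd ... seek=N" without conv=notrunc amounts to an ftruncate() of the output file to seek * bs, i.e. a truncate of the big file.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Sizes implied by the example commands above (names are illustrative). */
#define GEN_BS     1048576ULL /* dd bs=1048576 */
#define GEN_COUNT  102400ULL  /* dd count=102400: 100 GiB in total */
#define TRUNC_BS   1024ULL    /* dd bs=1024 */
#define TRUNC_SEEK 51200ULL   /* dd seek=51200: file is cut at 50 MiB */

/* Truncating part of "dd of=<file> bs=BS seek=N" without conv=notrunc:
 * dd sets the output file's size to N * BS before writing.
 * Returns 0 on success, -1 on error with errno set. */
static int truncate_like_dd(const char *path, uint64_t seek, uint64_t bs)
{
	int fd, ret;

	assert(path != NULL);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	ret = ftruncate(fd, (off_t)(seek * bs));
	close(fd);
	return ret;
}
```

Calling truncate_like_dd(<file_name>, TRUNC_SEEK, TRUNC_BS) on the big file should exercise the same truncate path as the dd command in step 2.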

INVESTIGATION SUMMARY:
The flush kernel thread behaves this way because of a stream of dirty
pages that the flusher passes to the nilfs_mdt_write_page() and
nilfs_writepage() methods with the "for_kupdate" or "for_background"
flag set (in struct writeback_control). For such requests these methods
only call redirty_page_for_writepage() and unlock_page(); the page is
not written. As a result, during a delete or truncate operation the
same dirty pages come back to nilfs_mdt_write_page() and
nilfs_writepage() again and again.
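
The loop can be illustrated with a small user-space model. This is a hypothetical sketch, not kernel code: wbc_model and page_model stand in for struct writeback_control and struct page, only the two flags mentioned above are modelled, and the rule that pages become clean only once the segment constructor runs (which the long transaction described in the next paragraph prevents) is a simplifying assumption. It counts the flusher passes that find dirty pages without writing anything.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct writeback_control / struct page. */
struct wbc_model {
	bool for_kupdate;    /* periodic writeback of old dirty data */
	bool for_background; /* writeback because dirty limits were exceeded */
};

struct page_model {
	bool dirty;
};

/* For kupdate/background requests the page is only redirtied
 * (redirty_page_for_writepage()) and unlocked (unlock_page());
 * no I/O is issued. Other writeback modes are not modelled. */
static void writepage_model(struct page_model *page,
			    const struct wbc_model *wbc)
{
	if (wbc->for_kupdate || wbc->for_background)
		page->dirty = true;
}

/* Flusher model: keeps handing dirty pages to writepage until none is
 * dirty or max_passes is reached. Pages are cleaned only by the segment
 * constructor, which can run only after the transaction ends (after
 * txn_passes passes). Returns the passes spent without writing data. */
static size_t busy_flusher_passes(struct page_model *pages, size_t n,
				  const struct wbc_model *wbc,
				  size_t txn_passes, size_t max_passes)
{
	size_t busy = 0;

	assert(pages != NULL || n == 0);
	for (size_t pass = 0; pass < max_passes; pass++) {
		bool found_dirty = false;

		/* Transaction over: segctor writes and cleans every page. */
		if (pass >= txn_passes)
			for (size_t i = 0; i < n; i++)
				pages[i].dirty = false;

		for (size_t i = 0; i < n; i++) {
			if (pages[i].dirty) {
				found_dirty = true;
				writepage_model(&pages[i], wbc);
			}
		}
		if (!found_dirty)
			break;
		busy++;
	}
	return busy;
}
```

In this model the number of busy passes equals the length of the transaction, which mirrors the report that the flush thread stays busy for as long as the delete/truncate runs.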

But the root cause of the issue is the nature of the delete and
truncate operations in NILFS2. Metadata operations are protected by a
pair of methods, nilfs_transaction_begin()/nilfs_transaction_commit(),
which use a semaphore to make a file operation indivisible. Deleting a
big file is a long operation by nature, because the content of the DAT
file has to be read and modified. As a result, we have to walk through
two b-trees: the DAT file's b-tree and the deleted file's b-tree (which
describes the file's blocks). For a big file, both b-trees occupy a
significant number of blocks, so nilfs_bmap_truncate() has to read and
write many blocks to complete. Other threads (segctor and
nilfs_cleanerd) are blocked during this long operation (10 - 40
minutes).
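
To give a feeling for "a significant number of blocks", here is a rough, leaf-level-only estimate. It rests on assumptions that are not stated in the text above: every data block has its own 32-byte DAT entry that must be updated, and a file b-tree node holds up to 255 key/pointer pairs. Upper b-tree levels, DAT bitmap blocks and the DAT's own b-tree are ignored, so real numbers would be higher. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Leaf-level metadata touched when deleting a fully allocated file. */
struct delete_cost {
	uint64_t data_blocks;     /* mappings that must be removed */
	uint64_t dat_blocks;      /* DAT blocks holding their entries */
	uint64_t file_leaf_nodes; /* leaf nodes of the file's b-tree */
};

static uint64_t div_round_up(uint64_t a, uint64_t b)
{
	return (a + b - 1) / b;
}

/* dat_entry_size and fanout are assumed values passed by the caller. */
static struct delete_cost estimate_delete_cost(uint64_t file_bytes,
					       uint64_t block_size,
					       uint64_t dat_entry_size,
					       uint64_t fanout)
{
	struct delete_cost c;

	assert(dat_entry_size > 0 && block_size >= dat_entry_size);
	assert(fanout > 0);
	c.data_blocks = div_round_up(file_bytes, block_size);
	c.dat_blocks = div_round_up(c.data_blocks,
				    block_size / dat_entry_size);
	c.file_leaf_nodes = div_round_up(c.data_blocks, fanout);
	return c;
}
```

Under these assumptions, a 100 GiB file with 4 KiB blocks has 26,214,400 data blocks, whose DAT entries fill 204,800 DAT blocks (800 MiB), plus about 102,802 leaf nodes (roughly 400 MiB) in the file's own b-tree, all read and rewritten inside a single transaction.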

I think that under heavy I/O load this situation can lead to serious
problems.

DISCUSSION:
Unfortunately, this issue can't be resolved by a simple fix. But I
think that in this case we can use the nature of the issue to add a new
file system feature. Of course, there is room for optimizing the
delete/truncate operation but, anyway, it is a long operation by
nature: its duration can grow linearly with the file's size. And if we
treat delete/truncate as an indivisible transaction, the segctor and
nilfs_cleanerd threads will be blocked for a significant time (10 - 60
minutes) in the case of big files.

I can see only one way to make the delete/truncate operation faster and
safer for big/huge files. The fundamental goal of NILFS2 is to be a
reliable file system with the opportunity to roll back to some old
state (snapshot). One possible technique is not to perform the real
delete/truncate immediately, because it can be done by mistake. The
"rm" operation could simply move the file being deleted into a special
folder that keeps deleted files for some period of time. Finally, such
deleted files would be really deleted by GC (or by a special thread) in
the background. While a deleted file is kept in the special folder, it
can be easily restored from there.
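
The policy of the proposed "virtual" delete can be sketched as a small in-memory model. Everything here is hypothetical (struct trash, the slot count, the retention field and the function names are illustrative, and no on-disk format is implied): "rm" only records the file and its deletion time, the file can be restored until the retention period expires, and a background pass performs the real deletion afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

#define TRASH_SLOTS    16
#define TRASH_NAME_LEN 64

/* Hypothetical in-memory model of the proposed special folder. */
struct trash_entry {
	char name[TRASH_NAME_LEN];
	time_t deleted_at;  /* when "rm" was issued */
	bool used;
};

struct trash {
	struct trash_entry slot[TRASH_SLOTS];
	time_t retention;   /* seconds a deleted file stays restorable */
};

/* "rm": only records the file; no b-tree or DAT block is touched, so
 * the cost does not depend on file size. False if the model is full. */
static bool trash_delete(struct trash *t, const char *name, time_t now)
{
	assert(t != NULL && name != NULL);
	for (size_t i = 0; i < TRASH_SLOTS; i++) {
		if (!t->slot[i].used) {
			strncpy(t->slot[i].name, name, TRASH_NAME_LEN - 1);
			t->slot[i].name[TRASH_NAME_LEN - 1] = '\0';
			t->slot[i].deleted_at = now;
			t->slot[i].used = true;
			return true;
		}
	}
	return false;
}

/* Undo a delete: the entry leaves the folder (moving the file back to
 * its original place is not modelled). */
static bool trash_restore(struct trash *t, const char *name)
{
	for (size_t i = 0; i < TRASH_SLOTS; i++) {
		if (t->slot[i].used && strcmp(t->slot[i].name, name) == 0) {
			t->slot[i].used = false;
			return true;
		}
	}
	return false;
}

/* Background pass (GC or a dedicated thread): really deletes files
 * whose retention has expired; the long b-tree/DAT work would run here,
 * outside the user's "rm". Returns the number of files reaped. */
static size_t trash_gc(struct trash *t, time_t now)
{
	size_t reaped = 0;

	for (size_t i = 0; i < TRASH_SLOTS; i++) {
		if (t->slot[i].used &&
		    now - t->slot[i].deleted_at >= t->retention) {
			t->slot[i].used = false;
			reaped++;
		}
	}
	return reaped;
}
```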

In the case of the truncate operation, we can save the truncation
details only in the on-disk inode and temporarily leave the b-tree
untouched. Blocks that were virtually freed by the truncation can then
be really freed by GC in the background.
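
A minimal model of the "virtual" truncate follows, assuming hypothetical inode fields (they are not the nilfs_inode on-disk layout): truncate only updates the logical size and remembers where the pending release starts, reads beyond the new size see a hole, and each background step releases at most a fixed budget of blocks, so no single step holds a transaction for the whole file. Growing the file again while a release is pending is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical inode fields for illustration only. */
struct vinode {
	uint64_t size_blocks;   /* logical size visible to users */
	uint64_t mapped_blocks; /* blocks still mapped in the file's b-tree */
	uint64_t trunc_from;    /* first block whose release is pending */
	bool trunc_pending;
};

/* "truncate": O(1) update of the inode; the b-tree is left untouched. */
static void vtruncate(struct vinode *ino, uint64_t new_size_blocks)
{
	assert(new_size_blocks <= ino->size_blocks);
	ino->size_blocks = new_size_blocks;
	ino->trunc_from = new_size_blocks;
	ino->trunc_pending = ino->mapped_blocks > new_size_blocks;
}

/* Blocks beyond the logical size read as a hole even if still mapped. */
static bool block_visible(const struct vinode *ino, uint64_t blk)
{
	return blk < ino->size_blocks && blk < ino->mapped_blocks;
}

/* One background GC step: release at most `budget` tail blocks. Real
 * code would end DAT entries and delete b-tree keys here. Returns the
 * number of blocks released. */
static uint64_t vtrunc_gc_step(struct vinode *ino, uint64_t budget)
{
	uint64_t remaining, n;

	if (!ino->trunc_pending)
		return 0;
	remaining = ino->mapped_blocks - ino->trunc_from;
	n = remaining < budget ? remaining : budget;
	ino->mapped_blocks -= n;
	if (ino->mapped_blocks == ino->trunc_from)
		ino->trunc_pending = false;
	return n;
}
```

For example, truncating a 1000-block file to 100 blocks returns at once, and three background steps with a budget of 300 blocks each then release the remaining 900 blocks.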

PLANNED:
1. To implement partial prefetch of the DAT file's b-tree during mount.
2. To improve the read-ahead technique for b-tree nodes in the case of the delete operation.
3. To implement a technique of "virtual" delete and truncate.

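The text does not say how the b-tree read-ahead will be improved. One common approach, shown here purely as an illustrative sketch with hypothetical names, is to collect the block numbers of the nodes a delete will visit next, sort them, and merge them into contiguous ranges, so they can be submitted as a few larger read requests instead of one small synchronous read per node.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct blk_range {
	uint64_t start; /* first block number */
	uint64_t count; /* number of contiguous blocks */
};

static int cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

	return (x > y) - (x < y);
}

/* Sorts node block numbers (in place) and merges them into contiguous
 * ranges; duplicates are dropped. `out` must have room for n entries.
 * Returns the number of ranges. */
static size_t coalesce_readahead(uint64_t *blocks, size_t n,
				 struct blk_range *out)
{
	size_t nr = 0;

	if (n == 0)
		return 0;
	assert(blocks != NULL && out != NULL);
	qsort(blocks, n, sizeof(*blocks), cmp_u64);
	for (size_t i = 0; i < n; i++) {
		if (nr > 0) {
			struct blk_range *last = &out[nr - 1];
			uint64_t end = last->start + last->count;

			if (blocks[i] < end)
				continue;       /* duplicate block */
			if (blocks[i] == end) {
				last->count++;  /* extends current range */
				continue;
			}
		}
		out[nr].start = blocks[i];
		out[nr].count = 1;
		nr++;
	}
	return nr;
}
```

How much this helps depends on how the b-tree nodes are laid out on disk, which is beyond what this sketch can show.
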
With the best regards,
Vyacheslav Dubeyko.