On Fri, Oct 23, 2015 at 01:04:54PM +0900, Masanari Iida wrote:
> Hello Developer,
> I have a question about ext4's internal.
>
> OS: RHEL6.2
> Filesystem EXT4
> mount option = ordered
>
> My understanding on ext4 with ordered mode,
> When a file is created, data is written to FS block,
> At the same time, metadata is stored into journal,
> and then meta data on journal is written to the inode block.
> What is the next?

Well, that's not quite a complete picture. Ext4 has an advanced
feature called delayed allocation, which means that the FS block is
not allocated until writeback occurs.

So there is some file system metadata which is modified as soon as the
file is created (i.e., the directory, inode allocation bitmap, the
inode table block itself), and this is held in memory until the
journal commit is triggered (either every five seconds, or if the size
of the transaction grows beyond a certain size, or on an fsync), at
which point the metadata blocks that have been modified since the last
commit are written into the journal, and once the commit block is
written, the modified metadata blocks are _allowed_ to be written back
to disk by the normal writeback mechanisms.

When the data writeback timer expires (30 seconds by default), then
writeback happens. It's only then that the location on disk is
determined, and when the block is allocated, this will result in more
metadata blocks getting modified, which are handled as described
above. In general, once we've allocated the block, the write to disk
is immediately scheduled, and the commit that covers the resulting
metadata changes will happen shortly after.

> My question is
> Does the kernel remove the meta data on journal after each successful
> transaction?

The journal is a circular buffer. Once all of the blocks that
participated in a jbd2 transaction have been written back to their
final location on disk, the transaction gets retired.
However, we don't necessarily update the jbd2 superblock's tail
pointer each time a transaction can be retired, because doing this to
"remove" one or more transactions requires a write to the jbd2
superblock, and we want to minimize unnecessary writes. This might
mean that when we recover after a crash, we might end up replaying
some transactions that don't need to be replayed, but that should be
an uncommon case that we shouldn't be optimizing for.

> As I see the contents of journal entries in EXT4 using debugfs(8),
> the journal entries are growing when creating or deleting the files.
> I am curious to know what make the system to remove journal entries
> while mounted the fs.
>
> Background of the question.
> I have encountered a case that when I delete and create some files,
> journal entry for deleting the file exist
> But journal entry for creating the file was not exist.
> FYI, the file itself exist when I see it by using debugfs.
>
> I created snapshot of the filesystem and run fsck on copy image.
> Then the file was _removed_ by fsck operation.
> This is why I want to know how journal on EXT4 were controlled.

This is too vague for me to comment. If you give a very detailed
description of what file system operations you might have been trying
to do, whether you called fsync(2) or not, and how long you waited
before taking the snapshot, that would be helpful.

I will observe that because of delayed allocation, if you don't wait
for the writeback timer to expire, and you take a snapshot or there is
a crash immediately after writing the file, what you might find after
the recovery process is a zero-length file. If you want to make sure
a file and its contents will be there after a crash, make sure you
call the fsync() system call.

					- Ted
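
For illustration, here is a minimal sketch of the "call fsync()" advice
above, written in Python for brevity. The helper name write_durably is
hypothetical and not part of any library. The extra fsync on the parent
directory is an assumption that goes beyond what the message says; it is
the conventional precaution on Linux for making sure the new directory
entry has also reached disk.

```python
import os


def write_durably(path: str, data: bytes) -> None:
    """Write data to path, returning only after fsync() has succeeded.

    Hypothetical helper for illustration; not from the original message.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        # os.write() may write fewer bytes than requested, so loop
        # until the whole buffer has been handed to the kernel.
        view = memoryview(data)
        while view:
            written = os.write(fd, view)
            view = view[written:]
        # Force the file's data and metadata out now, rather than
        # waiting for the writeback timer or the next periodic commit.
        os.fsync(fd)
    finally:
        os.close(fd)

    # Assumption: also fsync the containing directory so that the
    # directory entry for a newly created file is on disk as well.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

A call such as write_durably("/tmp/example.txt", b"hello\n") returns
only once both fsync() calls have completed. A snapshot or crash after
that point should find the file with its contents, not a zero-length
file.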