On Tue 05-11-13 17:28:35, Andiry Xu wrote: > Hi, > > On Tue, Nov 5, 2013 at 6:32 AM, Jan Kara <jack@xxxxxxx> wrote: > > Hello, > > > > On Mon 04-11-13 18:37:40, Andiry Xu wrote: > >> On Mon, Nov 4, 2013 at 4:37 PM, Jan Kara <jack@xxxxxxx> wrote: > >> > Hello, > >> > > >> > On Mon 04-11-13 14:31:34, Andiry Xu wrote: > >> >> When I'm trying XIP on ext2, I find that xip does not work on ext2 > >> >> with latest kernel. > >> >> > >> >> Reproduce steps: > >> >> Compile kernel with following configs: > >> >> CONFIG_BLK_DEV_XIP=y > >> >> CONFIG_EXT2_FS_XIP=y > >> >> > >> >> And run following commands: > >> >> # mke2fs -b 4096 /dev/ram0 > >> >> # mount -t ext2 -o xip /dev/ram0 /mnt/ramdisk/ > >> >> # dd if=/dev/zero of=/mnt/ramdisk/test1 bs=1M count=16 > >> >> > >> >> And it shows: > >> >> dd: writing `/mnt/ramdisk/test1': No space left on device > >> >> > >> >> df also shows /mnt/ramdisk is 100% full. Its default size is 64MB so a > >> >> 16MB write should only occupy 1/4 capacity. > >> >> > >> >> Criminal commit: > >> >> After git bisect, it points to the following commit: > >> >> 8e3dffc651cb668e1ff4d8b89cc1c3dde7540d3b > >> >> Ext2: mark inode dirty after the function dquot_free_block_nodirty is called > >> > Thanks for report and the bisection! > >> > > >> >> Particularly, the following code: > >> >> @@ -1412,9 +1415,11 @@ allocated: > >> >> *errp = 0; > >> >> brelse(bitmap_bh); > >> >> - dquot_free_block_nodirty(inode, *count-num); > >> >> - mark_inode_dirty(inode); > >> >> - *count = num; > >> >> + if (num < *count) { > >> >> + dquot_free_block_nodirty(inode, *count-num); > >> >> + mark_inode_dirty(inode); > >> >> + *count = num; > >> >> + } > >> >> return ret_block; > >> >> > >> >> Not mark_inode_dirty() is called only when num is less than *count. > >> >> However, I've seen > >> >> with the dd command, there is case where num >= *count. > >> >> > >> >> Fix: > >> >> I've verified that the following patch fixes the issue: > >> >> diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c > >> >> index 9f9992b..5446a52 100644 > >> >> --- a/fs/ext2/balloc.c > >> >> +++ b/fs/ext2/balloc.c > >> >> @@ -1406,11 +1406,10 @@ allocated: > >> >> > >> >> *errp = 0; > >> >> brelse(bitmap_bh); > >> >> - if (num < *count) { > >> >> + if (num <= *count) > >> >> dquot_free_block_nodirty(inode, *count-num); > >> >> - mark_inode_dirty(inode); > >> >> - *count = num; > >> >> - } > >> >> + mark_inode_dirty(inode); > >> >> + *count = num; > >> >> return ret_block; > >> >> > >> >> io_error: > >> >> > >> >> However, I'm not familiar with ext2 source code and cannot tell if > >> >> this is the correct fix. At least it fixes my issue. > >> > With this, you have essentially reverted a hunk from commit > >> > 8e3dffc651cb668e1ff4d8b89cc1c3dde7540d3b. But I don't see a reason why it > >> > should be reverted. num should never ever be greater than *count and when > >> > num == count, we the code inside if doesn't do anything useful. > >> > > >> > I've looked into the code and I think I see the problem. It is a long > >> > standing bug in __ext2_get_block() in fs/ext2/xip.c. It calls > >> > ext2_get_block() asking for 0 blocks to map (while we really want 1 block). > >> > ext2_get_block() just passes that request and ext2_get_blocks() actually > >> > allocates 1 block. And that's were the commit you have identified makes a > >> > difference because previously we returned that 1 block was allocated while > >> > now we return that 0 blocks were allocated and thus allocation is repeated > >> > until all free blocks are exhaused. > >> > > >> > Attached patch should fix the problem. > >> > > >> > >> Thanks for the reply. I've verified that your patch fixes my issue. > >> And it's absolutely better than my solution. > >> > >> Tested-by: Andiry Xu <andiry.xu@xxxxxxxxx> > > Thanks for testing! > > > >> I have another question about ext2 XIP performance, although it's not > >> quite related to this thread. > >> > >> I'm testing xip with ext2 on a ram disk drive, the driver is brd.c. > >> The RAM disk size is 2GB and I pre-fill it to guarantee that all pages > >> reside in main memory. > >> Then I use two different applications to write to the ram disk. One is > >> open() with O_DIRECT flag, and writing with Posix write(). Another is > >> open() with O_DIRECT, mmap() it to user space, then use memcpy() to > >> write data. I use different request size to write data, from 512 bytes > >> to 64MB. > >> > >> In my understanding, the mmap version bypasses the file system and > >> does not go to kernel space, hence it should have better performance > >> than the Posix-write version. However, my test result shows it's not > >> always true: when the request size is between 8KB and 1MB, the > >> Posix-write() version has bandwidth about 7GB/s and mmap version only > >> has 5GB/s. The test is performed on a i7-3770K machine with 8GB > >> memory, kernel 3.12. Also I have tested on kernel 3.2, in which mmap > >> has really bad performance, only 2GB/s for all request sizes. > >> > >> Do you know the reason why write() outperforms mmap() in some cases? I > >> know it's not related the thread but I really appreciate if you can > >> answer my question. > > Well, I'm not completely sure. mmap()ed memory always works on page-by-page > > basis - you first access the page, it gets faulted in and you can further > > access it. So for small (sub page size) accesses this is a win because you > > don't have an overhead of syscall and fs write path. For accesses larger > > than page size the overhead of syscall and some initial checks is well > > hidden by other things. I guess write() ends up being more efficient > > because write path taken for each page is somewhat lighter than full page > > fault. But you'd need to look into perf data to get some hard numbers on > > where the time is spent. > > > > Thanks for the reply. However I have filled up the whole RAM disk > before doing the test, i.e. asked the brd driver to allocate all the > pages initially. Well, pages in ramdisk are always present, that's not an issue. But you will get a page fault to map a particular physical page in process' virtual address space when you first access that virtual address in the mapping from the process. The cost of setting up this virtual->physical mapping is what I'm talking about. If you had a process which first mmaps the file and writes to all pages in the mapping and *then* measure the cost of another round of writing to the mapping, I would expect you should see speeds close to those of memory bus. > Moreover I have done the test on PMFS, a file system > for persistent memory, and the result is the same. PMFS reserves some > physical memory during system boot and then use it to emulate the > persistent memory device, so there shouldn't be any page fault. Again I don't think page faults would be avoided here. Honza -- Jan Kara <jack@xxxxxxx> SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html