Sorry for the late reply. I reviewed the code again and found some problems.

I created a soft RAID whose size is larger than 16T. The OS is Ubuntu 12.04, 32-bit x86. udev creates the block device node in /dev (a tmpfs), and I read the tmpfs code; in mm/shmem.c:shmem_fill_super():

>	sb->s_maxbytes = MAX_LFS_FILESIZE;

On my machine, MAX_LFS_FILESIZE equals 8T - 1. But on the read path, generic_file_aio_read --> do_generic_file_read (no O_DIRECT flag), in do_generic_file_read():

>	index = *ppos >> PAGE_CACHE_SHIFT;

index has type pgoff_t, so if *ppos is larger than 16T, index overflows. As you said, it reads data from a low offset instead.

But I also tested the write operation: blkdev_aio_write --> __generic_file_aio_write, which checks the position via generic_write_checks(). In that function:

>	if (likely(!isblk)) {
>		if (unlikely(*pos >= inode->i_sb->s_maxbytes)) {
>			if (*count || *pos > inode->i_sb->s_maxbytes)
>				return -EFBIG;
>			/* zero-length writes at ->s_maxbytes are OK */
>		}
>		if (unlikely(*pos + *count > inode->i_sb->s_maxbytes))
>			*count = inode->i_sb->s_maxbytes - *pos;
>	} else {
>#ifdef CONFIG_BLOCK
>		loff_t isize;
>		if (bdev_read_only(I_BDEV(inode)))
>			return -EPERM;
>		isize = i_size_read(inode);
>		if (*pos >= isize) {
>			if (*count || *pos > isize)
>				return -ENOSPC;
>		}
>		if (*pos + *count > isize)
>			*count = isize - *pos;
>#else
>		return -EPERM;
>#endif

So it checks s_maxbytes (MAX_LFS_FILESIZE) for regular files, but if the file is a block device it does not: it only checks the device's real size. That is also a bug, because if the block device is larger than 16T, no error is returned and execution continues. Then generic_file_buffered_write() (again no O_DIRECT) --> generic_perform_write --> write_begin [blkdev_write_begin] --> block_write_begin runs, and in block_write_begin():

>	pgoff_t index = pos >> PAGE_CACHE_SHIFT;

index overflows in the same way.

I once thought about patching these bugs (I might become well known, haha). But I don't know how, because of this comment in generic_write_checks():

>/*
> * Are we about to exceed the fs block limit ?
> *
> * If we have written data it becomes a short write.  If we have
> * exceeded without writing data we send a signal and return EFBIG.
> * Linus frestrict idea will clean these up nicely..
> */
>	if (likely(!isblk)) {

How should a block device be dealt with here? As a regular file, or not?
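To make the truncation concrete, here is a minimal userspace sketch (illustration only, not kernel code; it assumes 4KiB pages, so PAGE_CACHE_SHIFT == 12, and models pgoff_t as a 32-bit unsigned type):

#include <stdio.h>
#include <stdint.h>

#define PAGE_CACHE_SHIFT 12	/* assumes 4KiB pages */

int main(void)
{
	long long pos = 17LL << 40;	/* 17TiB, past the 16TiB wrap point */
	uint32_t index = pos >> PAGE_CACHE_SHIFT;	/* models pgoff_t on a 32-bit kernel */

	/* the high bits of the page number are lost: 17TiB maps to the
	 * same page index as 1TiB, so I/O silently hits a low offset */
	printf("pos = %lld TiB, index = 0x%08x, maps to byte %lld\n",
	       pos >> 40, (unsigned)index,
	       (long long)index << PAGE_CACHE_SHIFT);
	return 0;
}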
------------------
majianpeng
2012-05-28
-------------------------------------------------------------
From: Hugh Dickins
Date: 2012-05-27 05:24:13
To: majianpeng
Cc: Al Viro; Andrew Morton; linux-mm; linux-fsdevel
Subject: Re: the max size of block device on 32bit os, when using do_generic_file_read()

On Thu, 24 May 2012, majianpeng wrote:

> Hi all:
> 	I read a raid5 of size 30T. The OS is RHEL6 32-bit.
> 	I read from the raid5 (as a whole device, not partitioned) and got data from an address I did not want.
> 	So I tested the newest kernel code, and the problem is still there.
> 	Reviewing the code, in function do_generic_file_read():
>
> 	index = *ppos >> PAGE_CACHE_SHIFT;
>
> 	index is u32 and *ppos is long long.
> 	So when *ppos is larger than 0xFFFFFFFF << PAGE_CACHE_SHIFT (16T bytes), index is wrong.
>
> 	I wonder about this: on a 32-bit OS, must block devices be no larger than 16T? In other words, must a block device larger than 16T be partitioned?

I am not surprised that the page cache limitation prevents you from
reading the whole device with a 32-bit kernel. See MAX_LFS_FILESIZE
in include/linux/fs.h. Our answer to that is just to use a 64-bit kernel.

#if BITS_PER_LONG==32
#define MAX_LFS_FILESIZE	(((u64)PAGE_CACHE_SIZE << (BITS_PER_LONG-1))-1)
#elif BITS_PER_LONG==64
#define MAX_LFS_FILESIZE	0x7fffffffffffffffUL
#endif

But I am a little surprised that you get as far as 16TiB (with 4k page):
I would have expected you to be stopped just before 8TiB (although I
suspect that the limitation to 8TiB rather than 16TiB is unnecessary).

And if I understand you correctly, read() or pread() gave you no error
at those large offsets, but supplied data from the low offset instead?
That does surprise me - have we missed a check there?

Hugh
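For reference, the two limits contrasted in this thread can be checked numerically with a small userspace program (illustration only; it assumes 4KiB pages and the BITS_PER_LONG == 32 case, mirroring the macro from include/linux/fs.h above):

#include <stdio.h>

#define PAGE_CACHE_SIZE		4096ULL	/* assumes 4KiB pages */
#define PAGE_CACHE_SHIFT	12
#define BITS_PER_LONG		32	/* the 32-bit case */

int main(void)
{
	/* MAX_LFS_FILESIZE on 32-bit: (4096 << 31) - 1 = 8TiB - 1 */
	unsigned long long max_lfs =
		(PAGE_CACHE_SIZE << (BITS_PER_LONG - 1)) - 1;

	/* pgoff_t wrap point: 2^32 pages of 4KiB = 16TiB */
	unsigned long long pgoff_wrap =
		(1ULL << 32) << PAGE_CACHE_SHIFT;

	printf("MAX_LFS_FILESIZE = %llu bytes (8TiB - 1)\n", max_lfs);
	printf("pgoff_t wraps at %llu bytes (16TiB)\n", pgoff_wrap);
	return 0;
}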