On Fri, Dec 14, 2018 at 7:50 AM Bryan Henderson <bryanh@xxxxxxxxxxxxxxxx> wrote:
>
> I've searched the ceph-users archives and found no discussion to speak of of
> Cephfs block sizes, and I wonder how much people have thought about it.
>
> The POSIX 'stat' system call reports for each file a block size, which is
> usually defined vaguely as the smallest read or write size that is efficient.
> It usually takes into account that small writes may require a
> read-modify-write and there may be a minimum size on reads from backing
> storage.
>
> One thing that uses this information is the stream I/O implementation
> (fopen/fclose/fread/fwrite) in GNU libc. It always reads and usually writes
> full blocks, buffering as necessary.
>

I tested fread on Fedora 28. fread does 8k reads even when the block size is 4M.

> Most filesystems report this number as 4K.
>

NFS reports a 1M block size.

> Ceph reports the stripe unit (stripe column size), which is the maximum size
> of the RADOS objects that back the file. This is 4M by default.
>
> One result of this is that a program uses a thousand times more buffer space
> when running against a Ceph file as against a traditional filesystem.
>
> And a really pernicious result occurs when you have a special file in Cephfs.
> Block size doesn't make any sense at all for special files, and it's probably
> a bad idea to use stream I/O to read one, but I've seen it done. The Chrony
> clock synchronizer programs use fread to read random numbers from
> /dev/urandom. Should /dev/urandom be in a Cephfs filesystem, with defaults,
> it's going to generate 4M of random bits to satisfy a 4-byte request. On one
> of my computers, that takes 7 seconds - and wipes out the entropy pool.
>

This patch should address this issue.

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index c50501c6005a..7f82ceff510a 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -907,6 +907,7 @@ static int fill_inode(struct inode *inode, struct page *locked_page,
 	case S_IFBLK:
 	case S_IFCHR:
 	case S_IFSOCK:
+		inode->i_blkbits = PAGE_SHIFT;
 		init_special_inode(inode, inode->i_mode, inode->i_rdev);
 		inode->i_op = &ceph_file_iops;
 		break;

>
> Has stat block size been discussed much? Is there a good reason that it's
> the RADOS object size?
>
> I'm thinking of modifying the cephfs filesystem driver to add a mount option
> to specify a fixed block size to be reported for all files, and using 4K or
> 64K. Would that break something?

A mount option should work.

Regards
Yan, Zheng

>
> --
> Bryan Henderson                                   San Jose, California
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com