Re: cephfs file block size: must it be so big?

On Fri, Dec 14, 2018 at 7:50 AM Bryan Henderson <bryanh@xxxxxxxxxxxxxxxx> wrote:
>
> I've searched the ceph-users archives and found no discussion to speak of of
> Cephfs block sizes, and I wonder how much people have thought about it.
>
> The POSIX 'stat' system call reports for each file a block size, which is
> usually defined vaguely as the smallest read or write size that is efficient.
> It usually takes into account that small writes may require a
> read-modify-write and there may be a minimum size on reads from backing
> storage.
>
> One thing that uses this information is the stream I/O implementation
> (fopen/fclose/fread/fwrite) in GNU libc.  It always reads and usually writes
> full blocks, buffering as necessary.
>

I tested fread on Fedora 28. fread does 8k reads even when the block size is 4M.
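
For anyone who wants to check what their own mount reports, here is a minimal sketch that returns the st_blksize value from stat(). The function name preferred_io_size is made up for illustration and is not part of any library.

```c
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Return the block size that stat() reports for path (st_blksize),
 * or -1 if stat() fails. The name is hypothetical, for illustration only.
 */
long preferred_io_size(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    return (long)st.st_blksize;
}
```

Going by the numbers in this thread, a CephFS file with the default layout should give 4194304 and most local filesystems 4096, but that is something to verify on the system in question rather than assume.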

> Most filesystems report this number as 4K.
>

NFS reports a 1M block size.

> Ceph reports the stripe unit (stripe column size), which is the maximum size
> of the RADOS objects that back the file.  This is 4M by default.
>
> One result of this is that a program uses a thousand times more buffer space
> when running against a Ceph file as against a traditional filesystem.
>
> And a really pernicious result occurs when you have a special file in Cephfs.
> Block size doesn't make any sense at all for special files, and it's probably
> a bad idea to use stream I/O to read one, but I've seen it done.  The Chrony
> clock synchronizer programs use fread to read random numbers from
> /dev/urandom.  Should /dev/urandom be in a Cephfs filesystem, with defaults,
> it's going to generate 4M of random bits to satisfy a 4-byte request.  On one
> of my computers, that takes 7 seconds - and wipes out the entropy pool.
>

This patch should address this issue.

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index c50501c6005a..7f82ceff510a 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -907,6 +907,7 @@ static int fill_inode(struct inode *inode, struct page *locked_page,
        case S_IFBLK:
        case S_IFCHR:
        case S_IFSOCK:
+               inode->i_blkbits = PAGE_SHIFT;
                init_special_inode(inode, inode->i_mode, inode->i_rdev);
                inode->i_op = &ceph_file_iops;
                break;
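
Independently of the kernel side, a program that has to read a special file through stdio can turn buffering off, so the C library has no reason to read a whole block. The following is only a sketch under that assumption: read_exact_unbuffered is a hypothetical helper, not something Chrony or glibc provides, and whether it avoids the large read should be confirmed with strace.

```c
#include <stdio.h>

/*
 * Read exactly n bytes from path through stdio with buffering disabled.
 * Returns 0 on success, -1 on failure. Hypothetical helper for illustration.
 */
int read_exact_unbuffered(const char *path, void *out, size_t n)
{
    FILE *fp = fopen(path, "rb");
    size_t got;

    if (fp == NULL)
        return -1;

    /* setvbuf() must be called before any other operation on the stream. */
    if (setvbuf(fp, NULL, _IONBF, 0) != 0) {
        fclose(fp);
        return -1;
    }

    got = fread(out, 1, n, fp);
    fclose(fp);
    return got == n ? 0 : -1;
}
```

Calling read_exact_unbuffered("/dev/urandom", buf, 4) with a 4-byte buf would be the equivalent of the 4-byte request described above.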


>
> Has stat block size been discussed much?  Is there a good reason that it's
> the RADOS object size?
>
> I'm thinking of modifying the cephfs filesystem driver to add a mount option
> to specify a fixed block size to be reported for all files, and using 4K or
> 64K.  Would that break something?

A mount option should work.

Regards
Yan, Zheng

>
> --
> Bryan Henderson                                   San Jose, California
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


