Re: cephfs file block size: must it be so big?

On Thu, Dec 13, 2018 at 3:31 PM Bryan Henderson <bryanh@xxxxxxxxxxxxxxxx> wrote:
I've searched the ceph-users archives and found little discussion of
Cephfs block sizes, and I wonder how much people have thought about it.

The POSIX 'stat' system call reports for each file a block size, which is
usually defined vaguely as the smallest read or write size that is efficient.
It usually takes into account that small writes may require a
read-modify-write and there may be a minimum size on reads from backing
storage.
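
For concreteness, a minimal sketch of where this number surfaces: the
st_blksize field of struct stat.  On most local filesystems it prints
4096; on a default Cephfs mount it should print 4194304, per the stripe
unit described below.

    /* Print the I/O block size stat() reports for a file. */
    #include <stdio.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
        struct stat st;
        if (argc < 2 || stat(argv[1], &st) != 0) {
            perror("stat");
            return 1;
        }
        printf("%s: st_blksize = %ld\n", argv[1], (long)st.st_blksize);
        return 0;
    }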

One thing that uses this information is the stream I/O implementation
(fopen/fclose/fread/fwrite) in GNU libc.  It always reads and usually writes
full blocks, buffering as necessary.
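
To see the effect, glibc exposes the stream's buffer size through
__fbufsize() in <stdio_ext.h> (glibc-specific; the buffer is only
allocated on first use).  A sketch:

    /* Show the stdio buffer glibc sizes from st_blksize. */
    #include <stdio.h>
    #include <stdio_ext.h>

    int main(int argc, char **argv)
    {
        char tmp[4];
        FILE *f = fopen(argc > 1 ? argv[1] : "/etc/hostname", "r");
        if (!f) {
            perror("fopen");
            return 1;
        }
        fread(tmp, 1, sizeof tmp, f);  /* first read allocates the buffer */
        printf("stdio buffer: %zu bytes\n", __fbufsize(f));
        fclose(f);
        return 0;
    }

Run against a file on a default Cephfs mount, this should report the 4M
figure discussed below.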

Most filesystems report this number as 4K.

Ceph reports the stripe unit (stripe column size), which is the maximum size
of the RADOS objects that back the file.  This is 4M by default.

One result of this is that a program uses a thousand times more buffer space
when running against a Ceph file than against a traditional filesystem.

And a really pernicious result occurs when you have a special file in Cephfs.
Block size doesn't make any sense at all for special files, and it's probably
a bad idea to use stream I/O to read one, but I've seen it done.  The Chrony
clock synchronizer programs use fread to read random numbers from
/dev/urandom.  Should /dev/urandom be in a Cephfs filesystem, with defaults,
it's going to generate 4M of random bits to satisfy a 4-byte request.  On one
of my computers, that takes 7 seconds - and wipes out the entropy pool.
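
One user-space workaround (a sketch, not what Chrony actually does):
turn off stdio buffering before the first read, so fread asks the
kernel for only the bytes actually requested.

    /* Read a 4-byte seed without pulling in st_blksize bytes. */
    #include <stdio.h>

    int main(void)
    {
        unsigned int seed;
        FILE *f = fopen("/dev/urandom", "rb");
        if (!f) {
            perror("fopen");
            return 1;
        }
        setvbuf(f, NULL, _IONBF, 0);  /* unbuffered; must precede any I/O */
        if (fread(&seed, sizeof seed, 1, f) != 1) {
            fclose(f);
            return 1;
        }
        fclose(f);
        printf("seed: %u\n", seed);
        return 0;
    }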


Has stat block size been discussed much?  Is there a good reason that it's
the RADOS object size?

I'm thinking of modifying the cephfs filesystem driver to add a mount option
to specify a fixed block size to be reported for all files, and using 4K or
64K.  Would that break something?

I remember this being a huge pain in the butt for a variety of reasons. Going back through the logs, though, it looks like the main reason we use a 4MiB block size is so that we have a chance of reporting actual cluster sizes to 32-bit systems, so a mount option to change it should work fine as long as there aren't any shortcuts in the code. (Given that we've previously switched from 4KiB to 4MiB, I wouldn't expect that to be a problem.) My main worry is that we'd definitely want to make sure the block size is appropriate for anybody using EC data pools, which may be a little more complicated than a simple 4KiB or 64KiB setting.
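
Back-of-the-envelope on that 32-bit rationale (my reading of the above,
not lifted from the Ceph source): statfs() on a 32-bit system reports
block counts in 32-bit fields, so the largest cluster it can describe
is 2^32 blocks times the block size.

    /* Largest cluster size expressible in a 32-bit block count. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long long blocks = 1ULL << 32;
        printf("4 KiB blocks: %llu TiB max\n",
               (blocks * (4ULL << 10)) >> 40);  /* 16 TiB */
        printf("4 MiB blocks: %llu PiB max\n",
               (blocks * (4ULL << 20)) >> 50);  /* 16 PiB */
        return 0;
    }

With 4KiB blocks the ceiling is 16 TiB, which real clusters blow past;
4MiB raises it to 16 PiB.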

It was kind of fun switching, though, since it revealed a lot of ecosystem tools that assumed the FS's block size was the same as the page size. :D
-Greg

 

--
Bryan Henderson                                   San Jose, California
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
