Re: cephfs file block size: must it be so big?

On Thu, Dec 13, 2018 at 3:31 PM Bryan Henderson <bryanh@xxxxxxxxxxxxxxxx> wrote:
I've searched the ceph-users archives and found little discussion of
Cephfs block sizes, and I wonder how much people have thought about it.

The POSIX 'stat' system call reports for each file a block size, which is
usually defined vaguely as the smallest read or write size that is efficient.
It usually takes into account that small writes may require a
read-modify-write and there may be a minimum size on reads from backing
storage.
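
For concreteness, a minimal sketch of where this number surfaces: the
st_blksize field of struct stat.  On most local filesystems it prints
4096; on a default Cephfs mount it should print 4194304, per the stripe
unit described below.

    /* Print the I/O block size stat() reports for a file. */
    #include <stdio.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
        struct stat st;
        if (argc < 2 || stat(argv[1], &st) != 0) {
            perror("stat");
            return 1;
        }
        printf("%s: st_blksize = %ld\n", argv[1], (long)st.st_blksize);
        return 0;
    }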

One thing that uses this information is the stream I/O implementation
(fopen/fclose/fread/fwrite) in GNU libc.  It always reads and usually writes
full blocks, buffering as necessary.
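
To see the effect, glibc exposes the stream's buffer size through
__fbufsize() in <stdio_ext.h> (glibc-specific; the buffer is only
allocated on first use).  A sketch:

    /* Show the stdio buffer glibc sizes from st_blksize. */
    #include <stdio.h>
    #include <stdio_ext.h>

    int main(int argc, char **argv)
    {
        char tmp[4];
        FILE *f = fopen(argc > 1 ? argv[1] : "/etc/hostname", "r");
        if (!f) {
            perror("fopen");
            return 1;
        }
        fread(tmp, 1, sizeof tmp, f);  /* first read allocates the buffer */
        printf("stdio buffer: %zu bytes\n", __fbufsize(f));
        fclose(f);
        return 0;
    }

Run against a file on a default Cephfs mount, this should report the 4M
figure discussed below.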

Most filesystems report this number as 4K.

Ceph reports the stripe unit (stripe column size), which is the maximum size
of the RADOS objects that back the file.  This is 4M by default.

One result of this is that a program uses a thousand times more buffer space
when running against a Ceph file than against a traditional filesystem.

And a really pernicious result occurs when you have a special file in Cephfs.
Block size doesn't make any sense at all for special files, and it's probably
a bad idea to use stream I/O to read one, but I've seen it done.  The Chrony
clock synchronizer programs use fread to read random numbers from
/dev/urandom.  Should /dev/urandom be in a Cephfs filesystem, with defaults,
it's going to generate 4M of random bits to satisfy a 4-byte request.  On one
of my computers, that takes 7 seconds - and wipes out the entropy pool.
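
One user-space workaround (a sketch, not what Chrony actually does):
turn off stdio buffering before the first read, so fread asks the
kernel for only the bytes actually requested.

    /* Read a 4-byte seed without pulling in st_blksize bytes. */
    #include <stdio.h>

    int main(void)
    {
        unsigned int seed;
        FILE *f = fopen("/dev/urandom", "rb");
        if (!f) {
            perror("fopen");
            return 1;
        }
        setvbuf(f, NULL, _IONBF, 0);  /* unbuffered; must precede any I/O */
        if (fread(&seed, sizeof seed, 1, f) != 1) {
            fclose(f);
            return 1;
        }
        fclose(f);
        printf("seed: %u\n", seed);
        return 0;
    }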


Has stat block size been discussed much?  Is there a good reason that it's
the RADOS object size?

I'm thinking of modifying the cephfs filesystem driver to add a mount option
to specify a fixed block size to be reported for all files, and using 4K or
64K.  Would that break something?

I remember this being a huge pain in the butt for a variety of reasons. Going back through the logs, though, it looks like the main reason we use a 4MiB block size is so that we have a chance of reporting actual cluster sizes to 32-bit systems, so a mount option to change it should work fine as long as there aren't any shortcuts in the code. (Given that we've previously switched from 4KiB to 4MiB, I wouldn't expect that to be a problem.) My main worry is that we'd definitely want to make sure the block size is appropriate for anybody using EC data pools, which may be a little more complicated than a simple 4KiB or 64KiB setting.
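
Back-of-the-envelope on that 32-bit rationale (my reading of the above,
not lifted from the Ceph source): statfs() on a 32-bit system reports
block counts in 32-bit fields, so the largest cluster it can describe
is 2^32 blocks times the block size.

    /* Largest cluster size expressible in a 32-bit block count. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long long blocks = 1ULL << 32;
        printf("4 KiB blocks: %llu TiB max\n",
               (blocks * (4ULL << 10)) >> 40);  /* 16 TiB */
        printf("4 MiB blocks: %llu PiB max\n",
               (blocks * (4ULL << 20)) >> 50);  /* 16 PiB */
        return 0;
    }

With 4KiB blocks the ceiling is 16 TiB, which real clusters blow past;
4MiB raises it to 16 PiB.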

It was kind of fun switching, though, since it revealed a lot of ecosystem tools that assumed the FS's block size was the same as the page size. :D
-Greg

 

--
Bryan Henderson                                   San Jose, California
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
