I've searched the ceph-users archives and found essentially no discussion of CephFS block sizes, and I wonder how much people have thought about it.

The POSIX 'stat' system call reports a block size for each file, usually defined vaguely as the smallest read or write size that is efficient. It generally accounts for the fact that small writes may require a read-modify-write and that reads from backing storage may have a minimum size. One consumer of this information is the stream I/O implementation (fopen/fclose/fread/fwrite) in GNU libc, which always reads, and usually writes, full blocks, buffering as necessary.

Most filesystems report this number as 4K. CephFS reports the stripe unit (stripe column size), which is the maximum size of the RADOS objects that back the file; that is 4M by default. One result is that a program uses a thousand times more buffer space when running against a CephFS file than against a traditional filesystem.

A really pernicious result occurs when you have a special file in CephFS. Block size makes no sense at all for special files, and it's probably a bad idea to use stream I/O to read one, but I've seen it done: the Chrony clock-synchronization programs use fread to read random numbers from /dev/urandom. Should /dev/urandom live in a CephFS filesystem with defaults, it will generate 4M of random bits to satisfy a 4-byte request. On one of my computers, that takes 7 seconds - and wipes out the entropy pool.

Has the stat block size been discussed much? Is there a good reason for it to be the RADOS object size? I'm thinking of modifying the CephFS filesystem driver to add a mount option that specifies a fixed block size to be reported for all files, using 4K or 64K. Would that break something?

--
Bryan Henderson                                   San Jose, California

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com