I've searched the ceph-users archives and found essentially no discussion of CephFS block sizes, and I wonder how much people have thought about it.

The POSIX 'stat' system call reports a block size for each file, usually defined vaguely as the smallest read or write size that is efficient. It generally accounts for the fact that small writes may require a read-modify-write and that reads from backing storage may have a minimum size. One consumer of this information is the stream I/O implementation (fopen/fclose/fread/fwrite) in GNU libc, which always reads, and usually writes, full blocks, buffering as necessary.

Most filesystems report this number as 4K. CephFS reports the stripe unit (stripe column size), which is the maximum size of the RADOS objects that back the file; that is 4M by default. One result is that a program uses a thousand times more buffer space when running against a CephFS file than against a traditional filesystem.

A really pernicious result occurs when you have a special file in CephFS. Block size makes no sense at all for special files, and it's probably a bad idea to use stream I/O to read one, but I've seen it done: the Chrony clock-synchronization programs use fread to read random numbers from /dev/urandom. Should /dev/urandom live in a CephFS filesystem with defaults, it will generate 4M of random bits to satisfy a 4-byte request. On one of my computers, that takes 7 seconds - and wipes out the entropy pool.

Has the stat block size been discussed much? Is there a good reason for it to be the RADOS object size? I'm thinking of modifying the CephFS filesystem driver to add a mount option that specifies a fixed block size to be reported for all files, using 4K or 64K. Would that break something?

--
Bryan Henderson                                   San Jose, California

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com