Hello,

first nugget from my new staging/test cluster. As mentioned yesterday, it is now running the latest Hammer under Debian Jessie (with sysvinit) and with manually created OSDs: 2 nodes with 32GB RAM each, a fast enough CPU (E5-2620 v3), 2x 200GB DC S3610s for OS and journals, and 4x 1TB 2.5" SATAs for OSDs. For my amusement and edification, the OSDs of one node are formatted with XFS, those of the other with EXT4 (as on all my production clusters). There are 2 more nodes, all SSDs, for cache tiering and SSD pool tests, but they're not in the picture yet.

So of course the first thing one does with new gear is put it through its paces, summoning up the ole rados bench and friends. Since the compute nodes of this staging environment are not online yet, I used fio with the rbd engine for the first time ever. And incidentally, the first invocation was this:
---
fio --size=4G --ioengine=rbd --rbdname=goat --pool=rbd --clientname=admin --invalidate=1 --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=32
---
I did various other runs with 4MB blocks, sequential writes, etc. At some point slow requests crept up and the OSDs were extremely busy and lagging behind, something I would not expect on an idle cluster of this caliber.

As it turned out, the above command completely shredded (as in fragmented the hell out of) the OSDs. At this point the "goat" image was the only data in the cluster, yet the XFS OSDs were at 99% fragmentation and the EXT4 ones had a score of 83, something I've never seen before. To put this in perspective, the busy cache pool SSDs on my main production cluster have a score of 4.

Now, I'm not sure how badly this would affect other data (images), but clearly something is not quite right here. For one thing, how does it even manage to get things that fragmented down at the OSD filesystem level? Something the devs might want to take a look at. Defragging the OSDs or simply removing the image of course cleaned things up.

In the quest to reproduce this, I also created another image, formatted it and then ran fio inside it with the libaio engine. Of course no fragmentation to speak of happened. Running fio with the rbd engine and doing sequential 4MB block writes first also worked fine; it's the initial randwrite of 4KB blocks that triggers this behavior.

The only thing I could think of that would come close to this in terms of small RADOS ops would be "rados bench" with 4K blocks, but while it does create lots of individual objects, it obviously doesn't fragment things. This is before such a rados run:
---
Total/best extents              729/729
Average size per extent         2959 KB
Fragmentation score             0
---
And this after:
---
Total/best extents              36276/36271
Average size per extent         63 KB
Fragmentation score             0
---
So yeah, lots of small files, but no fragmentation, and after a cleanup it is of course back to normal anyway.

For reference, my real production OSDs with lots of data on them tend to trend close to the 4MB object size:
---
Total/best extents              228261/227885
Average size per extent         3966 KB
Fragmentation score             0
---

Regards,

Christian
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
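For anyone who wants to check their own OSDs the same way: the scores quoted above come from the stock filesystem tools. A minimal sketch, assuming an OSD data partition of /dev/sdc1 and a mount point of /var/lib/ceph/osd/ceph-4 (both just example names, adjust to your layout):
---
# XFS: read-only fragmentation report, prints the "fragmentation factor" percentage
xfs_db -c frag -r /dev/sdc1

# EXT4: with -c e4defrag only analyzes and prints the "Total/best extents" /
# "Fragmentation score" lines shown above, it does not move any data
e4defrag -c /var/lib/ceph/osd/ceph-4
---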
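Likewise, "defragging the OSDs" is nothing fancier than the standard online defragmenters; again a sketch with example paths and a time limit picked for illustration, not exact invocations from my cluster:
---
# XFS online defrag of an OSD mount point, verbose, give up after 2 hours
xfs_fsr -v -t 7200 /var/lib/ceph/osd/ceph-0

# EXT4 online defrag of an OSD mount point
e4defrag /var/lib/ceph/osd/ceph-4
---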
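The libaio reproduction attempt mentioned above boils down to putting a filesystem on an RBD image and letting fio loose inside it. A rough sketch, assuming the kernel RBD client and made-up image/device names:
---
# create and attach a test image (size in MB here), then put a filesystem on it
rbd create --size 10240 fiotest
rbd map fiotest
mkfs.xfs /dev/rbd0
mount /dev/rbd0 /mnt

# same I/O pattern as the rbd-engine run, but now going through a filesystem
fio --name=fiojob --directory=/mnt --size=4G --ioengine=libaio --direct=1 --rw=randwrite --blocksize=4k --iodepth=32

umount /mnt
rbd unmap /dev/rbd0
---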
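And for completeness, a 4K "rados bench" run of the kind that produced the before/after extent counts looks roughly like this; pool name, runtime and thread count are just examples:
---
# write lots of small (4KB) objects and keep them around for the fragmentation check
rados -p rbd bench 60 write -b 4096 -t 32 --no-cleanup

# remove the benchmark objects again afterwards
rados -p rbd cleanup
---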