Hello,

first nugget from my new staging/test cluster. As mentioned yesterday, it is now running the latest Hammer under Debian Jessie (with sysvinit) and with manually created OSDs: 2 nodes with 32GB RAM each, a fast enough CPU (E5-2620 v3), 2x 200GB DC S3610s for OS and journals, and 4x 1TB 2.5" SATAs for OSDs. For my amusement and edification, the OSDs of one node are formatted with XFS, those of the other with EXT4 (as on all my production clusters). There are 2 more nodes, all SSDs, for cache tiering and SSD pool tests, but they're not in the picture yet.

So of course the first thing one does with new gear is put it through its paces, summoning up the ole rados bench and friends. Since the compute nodes of this staging environment are not online yet, I used fio with the rbd engine for the first time ever. And incidentally, the first invocation was this:
---
fio --size=4G --ioengine=rbd --rbdname=goat --pool=rbd --clientname=admin --invalidate=1 --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=32
---
I did various other runs with 4MB blocks, sequential writes, etc. At some point slow requests crept up and the OSDs were extremely busy and lagging behind, something I would not expect on an idle cluster of this caliber.

As it turned out, the above command completely shredded (as in fragmented the hell out of) the OSDs. At this point the "goat" image was the only data in the cluster, yet the XFS OSDs were at 99% fragmentation and the EXT4 ones had a score of 83, something I've never seen before. To put this in perspective, the busy cache pool SSDs on my main production cluster have a score of 4.

Now, I'm not sure how badly this would affect other data (images), but clearly something is not quite right here. For one thing, how does it even manage to get things that fragmented down at the OSD filesystem level? Something the devs might want to take a look at. Defragging the OSDs or simply removing the image of course cleaned things up.

In the quest to reproduce this, I also created another image, formatted it and then ran fio inside it with the libaio engine. Of course no fragmentation to speak of happened. Running fio with the rbd engine and doing sequential 4MB block writes first also worked fine; it's the initial randwrite of 4KB blocks that triggers this behavior.

The only thing I could think of that would come close to this in terms of small RADOS ops would be "rados bench" with 4K blocks, but while it does create lots of individual objects, it obviously doesn't fragment things. This is before such a rados run:
---
Total/best extents              729/729
Average size per extent         2959 KB
Fragmentation score             0
---
And this after:
---
Total/best extents              36276/36271
Average size per extent         63 KB
Fragmentation score             0
---
So yeah, lots of small files, but no fragmentation, and after a cleanup it is of course back to normal anyway.

For reference, my real production OSDs with lots of data on them tend to trend close to the 4MB object size:
---
Total/best extents              228261/227885
Average size per extent         3966 KB
Fragmentation score             0
---

Regards,

Christian
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
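For anyone who wants to check their own OSDs the same way: the scores quoted above come from the stock filesystem tools. A minimal sketch, assuming an OSD data partition of /dev/sdc1 and a mount point of /var/lib/ceph/osd/ceph-4 (both just example names, adjust to your layout):
---
# XFS: read-only fragmentation report, prints the "fragmentation factor" percentage
xfs_db -c frag -r /dev/sdc1

# EXT4: with -c e4defrag only analyzes and prints the "Total/best extents" /
# "Fragmentation score" lines shown above, it does not move any data
e4defrag -c /var/lib/ceph/osd/ceph-4
---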
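Likewise, "defragging the OSDs" is nothing fancier than the standard online defragmenters; again a sketch with example paths and a time limit picked for illustration, not exact invocations from my cluster:
---
# XFS online defrag of an OSD mount point, verbose, give up after 2 hours
xfs_fsr -v -t 7200 /var/lib/ceph/osd/ceph-0

# EXT4 online defrag of an OSD mount point
e4defrag /var/lib/ceph/osd/ceph-4
---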
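The libaio reproduction attempt mentioned above boils down to putting a filesystem on an RBD image and letting fio loose inside it. A rough sketch, assuming the kernel RBD client and made-up image/device names:
---
# create and attach a test image (size in MB here), then put a filesystem on it
rbd create --size 10240 fiotest
rbd map fiotest
mkfs.xfs /dev/rbd0
mount /dev/rbd0 /mnt

# same I/O pattern as the rbd-engine run, but now going through a filesystem
fio --name=fiojob --directory=/mnt --size=4G --ioengine=libaio --direct=1 --rw=randwrite --blocksize=4k --iodepth=32

umount /mnt
rbd unmap /dev/rbd0
---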
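And for completeness, a 4K "rados bench" run of the kind that produced the before/after extent counts looks roughly like this; pool name, runtime and thread count are just examples:
---
# write lots of small (4KB) objects and keep them around for the fragmentation check
rados -p rbd bench 60 write -b 4096 -t 32 --no-cleanup

# remove the benchmark objects again afterwards
rados -p rbd cleanup
---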