RE: [ceph-users] Ceph Crash at sync_thread_timeout after heavy random writes.


 



Rephrasing it to make it clearer:

From: ceph-users-bounces@xxxxxxxxxxxxxx [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Chen, Xiaoxi
Sent: March 25, 2013 17:02
To: 'ceph-users@xxxxxxxxxxxxxx' (ceph-users@xxxxxxxxxxxxxx)
Cc: ceph-devel@xxxxxxxxxxxxxxx
Subject: [ceph-users] Ceph Crash at sync_thread_timeout after heavy random writes.

Hi list,
         We have hit and reproduced this issue several times: ceph-osd commits suicide because FileStore::sync_entry() times out after very heavy random IO on top of RBD.
         My test environment is:
                            A 4-node Ceph cluster with 20 HDDs for OSDs and 4 Intel DC S3700 SSDs for journals per node, i.e. 80 spindles in total
                            48 VMs spread across 12 physical nodes, with 48 RBDs attached to the VMs 1:1 via QEMU, QEMU cache disabled
                            Ceph @ 0.58
                            XFS as the OSD filesystem
         I am running aio-stress (something like FIO) inside the VMs to produce random write requests on top of each RBD.

         From ceph -w, Ceph reports a very high op rate (10,000+ op/s), but physically 80 spindles can provide at most 150 * 80 / 2 (replicas) = 6,000 IOPS for 4K random writes.
         When digging into the code (FileStore.cc::_write()), it is clear that the OSD opens object files without O_DIRECT, which means data writes are buffered by the page cache and the write then returns. ::sync_file_range() is called afterwards, but with the flag SYNC_FILE_RANGE_WRITE, so the system call does not actually sync data to disk before it returns; it only initiates the write-out I/O.
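
         To illustrate (a minimal standalone sketch, not Ceph code; the file name and sizes are made up), a buffered write followed by sync_file_range(SYNC_FILE_RANGE_WRITE) returns without waiting for the data to reach disk:

    // Sketch only: shows that SYNC_FILE_RANGE_WRITE merely *initiates* writeback.
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE            // for sync_file_range()
    #endif
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <vector>

    int main() {
        // Opened without O_DIRECT, like FileStore's object files,
        // so the pwrite() below lands in the page cache.
        int fd = ::open("/tmp/osd-object-sketch", O_CREAT | O_WRONLY, 0644);
        if (fd < 0) { perror("open"); return 1; }

        std::vector<char> buf(4096, 'x');          // one 4K "random write"
        if (::pwrite(fd, buf.data(), buf.size(), 0) < 0) { perror("pwrite"); return 1; }

        // Queues writeback for the range and returns immediately; it does NOT
        // wait for the data to hit disk (that needs the WAIT_* flags or fsync()).
        if (::sync_file_range(fd, 0, buf.size(), SYNC_FILE_RANGE_WRITE) < 0)
            perror("sync_file_range");

        ::close(fd);
        return 0;
    }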
         So the situation is: since all writes just go to the page cache, the backend OSD data disks *seem* extremely fast for random writes, which is why ceph -w shows such a high op rate. However, when the OSD sync thread tries to sync the filesystem, it uses ::syncfs(); before ::syncfs() returns, the OS has to ensure that all dirty pages in the page cache (belonging to that particular filesystem) have been written to disk. This obviously takes a long time, since you can only expect around 100 IOPS per disk on a non-btrfs filesystem. The performance gap is real: an SSD journal can do 4K random writes at 1K+ IOPS, but the 4 HDDs journaled by that same SSD can only provide about 400 IOPS.
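
         A minimal sketch of why the sync thread stalls (again not Ceph code; the path is hypothetical): ::syncfs() does not return until every dirty page of the filesystem behind the given fd has been written out, so its latency grows with the amount of dirty data accumulated since the last sync:

    // Sketch only: time how long a whole-filesystem sync takes.
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE            // for syncfs()
    #endif
    #include <fcntl.h>
    #include <unistd.h>
    #include <chrono>
    #include <cstdio>

    int main() {
        // Any fd that lives on the OSD data filesystem will do; the path is hypothetical.
        int fd = ::open("/var/lib/ceph/osd", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        auto t0 = std::chrono::steady_clock::now();
        if (::syncfs(fd) < 0)                      // blocks until the fs is clean
            perror("syncfs");
        auto t1 = std::chrono::steady_clock::now();

        std::printf("syncfs took %.1f s\n",
                    std::chrono::duration<double>(t1 - t0).count());
        ::close(fd);
        return 0;
    }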
With the random write pressure continuing, the amount of dirty pages in the page cache keeps increasing; sooner or later ::syncfs() cannot return within 600 s (the default value of filestore_commit_timeout), the assert fires, and the ceph-osd process commits suicide.

   I have tried to reproduce this with rados bench but failed, because rados bench *creates* objects rather than modifying them, so a batch of creates can be merged into a single big write. So I assume that anyone who would like to reproduce this issue has to use QEMU or the kernel client; using a fast journal (say, tmpfs), a slow data disk, and a small filestore_commit_timeout may help to reproduce it in a small-scale environment, as sketched below.
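
   For what it's worth, this is the kind of [osd] section I have in mind for such a small-scale repro; the journal path and the values are purely illustrative, not a recommendation:

    [osd]
        ; put the journal on something very fast (tmpfs) so it never throttles incoming writes
        osd journal = /dev/shm/osd.$id.journal
        osd journal size = 1024              ; MB
        ; shrink the commit timeout so the suicide assert fires sooner than 600 s
        filestore commit timeout = 60        ; seconds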

         Could you please let me know if you need any more information, or if you have any solutions? Thanks.
                                                                                                    Xiaoxi




