RE: Ceph write path optimization

Thanks Lukas for the response.
I didn't try lazy-count, but I did try agcount. I saw a post suggesting that *reducing* agcount and directory size may alleviate the xfsaild effect. I have a ~7.5 TB drive, so the minimum agcount is 7; I moved to a 4 TB partition and made agcount 4, but it didn't help much.
I also tried putting the XFS journal log on a different device, and that didn't help either (maybe because it is all about syncing metadata to the same device anyway).
But I will try lazy-count=1 with an increased agcount and keep you posted.
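
For reference, a rough sketch of what such a re-test could look like (xfs_info to confirm the current layout, then mkfs with an external log; all device names, sizes and mount points below are placeholders, not recommendations):

  # confirm current agcount / log layout of the OSD filesystem
  xfs_info /var/lib/ceph/osd/ceph-0

  # hypothetical re-mkfs: more AGs, lazy-count on, log on a separate device
  # (destroys data on the target partition)
  mkfs.xfs -f -d agcount=16 -l logdev=/dev/nvme0n1p1,size=512m,lazy-count=1 /dev/sdb1
  mount -o noatime,inode64,logdev=/dev/nvme0n1p1 /dev/sdb1 /var/lib/ceph/osd/ceph-0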

Regards
Somnath


-----Original Message-----
From: mr.erdk@xxxxxxxxx [mailto:mr.erdk@xxxxxxxxx] On Behalf Of Lukasz Redynk
Sent: Tuesday, July 28, 2015 2:46 PM
To: Somnath Roy
Cc: ceph-devel@xxxxxxxxxxxxxxx
Subject: Re: Ceph write path optimization

Hi,

Have you tried tuning the XFS mkfs options? From mkfs.xfs(8):
a) (log section, -l)
lazy-count=value // by default is 0

This changes the method of logging various persistent counters in the superblock. Under metadata intensive workloads, these counters are updated and logged frequently enough that the superblock updates become a serialisation point in the filesystem. The value can be either 0 or 1.

and b) (data section, -d)

agcount=value // by default is 2 (?)

This is used to specify the number of allocation groups. The data section of the filesystem is divided into allocation groups to improve the performance of XFS. More allocation groups imply that more parallelism can be achieved when allocating blocks and inodes. The minimum allocation group size is 16 MiB; the maximum size is just under 1 TiB. The data section of the filesystem is divided into value allocation groups (default value is scaled automatically based on the underlying device size).

Lately I have been experimenting with these two, and it appears that setting lazy-count to 1 and increasing agcount has a positive impact on IOPS, but unfortunately I don't have any performance numbers for this.
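
For illustration, something along these lines (the device, mount point, agcount value and mount options are only an example):

  # lazy-count=1 avoids logging the superblock counters on every transaction;
  # a higher agcount gives more allocation-group parallelism
  mkfs.xfs -f -l lazy-count=1 -d agcount=32 /dev/sdX
  mount -o noatime,inode64 /dev/sdX /var/lib/ceph/osd/ceph-N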

-Lukas


2015-07-28 23:08 GMT+02:00 Somnath Roy <Somnath.Roy@xxxxxxxxxxx>:
> Hi,
> I finally have a working prototype and was able to gather some performance comparison data with the changes I was talking about in the last performance meeting. Mark's suggestion of a write-up has been pending for a while, so here is a summary of what I am trying to do.
>
> Objectives:
> -----------
>
> 1. Saturate SSD write bandwidth with Ceph + FileStore.
>      Most all-flash Ceph deployments so far (as far as I know) have both data and journal on the same SSD. The SSDs are far from saturated, and the write performance of Ceph is dismal compared to what the hardware can do. Can we improve that?
>
> 2. Ceph write performance is not stable in most cases; can we get stable performance most of the time?
>
>
> Findings/optimizations so far
> ------------------------------------
>
> 1. I saw that in a flash environment you need to reduce filestore_max_sync_interval a lot (from the default of 5 min), and thus the benefit of coalescing writes under syncfs goes away.
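
(For reference, that knob lives in the [osd] section of ceph.conf and can also be injected at runtime on a test cluster; the value below is purely illustrative.)

  # ceph.conf:  filestore max sync interval = 1
  ceph tell osd.* injectargs '--filestore_max_sync_interval 1'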
>
> 2. We have some logic to determine the max sequence number we can commit, and that adds some latency (>1 ms or so).
>
> 3. This delay fills up the journal quickly if I remove all throttles from the filestore/journal.
>
> 4. The existing throttle scheme is very difficult to tune.
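
(For anyone following along, an easy way to see the current throttle-related values on a running OSD is the admin socket; the option names below are the hammer-era ones and may differ between releases.)

  ceph daemon osd.0 config show | grep -E 'filestore_queue|journal_queue|journal_max_write'
  # e.g. filestore_queue_max_ops/_max_bytes, journal_queue_max_ops/_max_bytes,
  #      journal_max_write_bytes/_entries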
>
> 5. In the case of write-ahead journaling, the commit file is probably redundant, since we can get the last committed seq number from the journal headers during the next OSD start. Given that we need to reduce the sync interval, this extra write only adds to write amplification (plus one extra fsync).
>
> The existing scheme is well suited to an HDD environment, but probably not to flash. So, I made the following changes.
>
> 1. First, I removed the extra commit seq file write and changed the journal replay stuff accordingly.
>
> 2. Each filestore op thread now does an O_DSYNC write followed by
> posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
>
> 3. I derived an algorithm that each worker thread executes to determine the max seq it can trim the journal to.
>
> 4. Introduced a new throttle scheme that throttles journal writes based on the % of journal space left.
>
> 5. I saw that this scheme definitely empties the journal faster and is able to saturate the SSD more.
>
> 6. But even when we are not saturating any resource, if both data and journal are on the same drive, both kinds of writes suffer latency.
>
> 7. With the journal separated out to a different disk, the same code (and also stock) runs faster. Not sure of the exact reason, but it is something to do with the underlying layer. Still investigating.
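
(A crude, Ceph-independent way to compare the raw synchronous-write behaviour of the two devices; device names are placeholders and the writes are destructive, so only run this against spare partitions.)

  # 4 KiB O_DIRECT + O_DSYNC writes: data disk vs. journal candidate
  dd if=/dev/zero of=/dev/sdb2      bs=4k count=10000 oflag=direct,dsync
  dd if=/dev/zero of=/dev/nvme0n1p2 bs=4k count=10000 oflag=direct,dsync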
>
> 8. Now, if we want to separate out the journal, an SSD is *not an option*. The reason is that after some point we will be limited by the SSD's bandwidth, and all writes for N OSDs going to that one SSD will wear it out very fast. It would also be a very expensive solution, considering the cost of a high-end journal SSD.
>
> 9. So, I started experimenting with a small PCIe NVRAM partition (128 MB). If we have ~4 GB of NVRAM we can put ~32 OSD journals on it (considering that NVRAM durability is much higher). With the stock code as is (without the throttle), performance becomes very spiky, for obvious reasons.
>
> 10. But with the above-mentioned changes, I am able to get consistently high performance most of the time.
>
> 11. I am also trying the existing syncfs codebase (without the op_seq file) plus the throttle scheme I mentioned, in this setup, to see whether we can get a stable performance improvement out of it. This is still under investigation.
>
> 12. The initial benchmark with a single OSD (no replication) looks promising; you can find the draft here:
>
> https://docs.google.com/document/d/1QYAWkBNVfSXhWbLW6nTLx81a9LpWACsGjWG8bkNwKY8/edit?usp=sharing
>
> 13. I still need to try this out with a larger number of OSDs.
>
> 14. I also need to see how much this scheme helps when data and journal are on the same SSD.
>
> 15. The main challenge I am facing with both schemes is that the XFS metadata flush process (xfsaild) chokes all the processes accessing the disk when it wakes up. I can delay it to at most 30 sec, and if there is a lot of dirty metadata there is a brief performance dip. Even if we acknowledge writes from, say, the NVRAM journal write, the op threads are still doing getattrs on XFS and those threads get blocked. I tried ext4 and this problem is not there, since it writes metadata synchronously by default, but the overall performance of ext4 is much lower. I am not a filesystem expert, so any help on this is much appreciated.
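
(Assuming the 30-second delay mentioned above is driven by the XFS periodic writeback sysctls, these are the knobs to look at; values are in centiseconds, so 3000 = 30 s.)

  # current metadata/buffer writeback intervals
  sysctl fs.xfs.xfssyncd_centisecs fs.xfs.age_buffer_centisecs
  # example: keep periodic writeback at 30 s (larger flush each time it fires)
  sysctl -w fs.xfs.xfssyncd_centisecs=3000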
>
> Mark,
> If we have time, we can discuss this result in tomorrow's performance meeting.
>
> Thanks & Regards
> Somnath
>



