Re: About _setattr() optimization and recovery acceleration

On Mon, Mar 9, 2015 at 1:26 PM, Nicheal <zay11022@xxxxxxxxx> wrote:
> 2015-03-07 16:43 GMT+08:00 Haomai Wang <haomaiwang@xxxxxxxxx>:
>> On Sat, Mar 7, 2015 at 12:03 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>>> Hi!
>>>
>>> [copying ceph-devel]
>>>
>>> On Fri, 6 Mar 2015, Nicheal wrote:
>>>> Hi Sage,
>>>>
>>>> Cool for issue #3878, the duplicated pg_log write, which was posted
>>>> earlier in my issue #3244. A single omap_setkeys transaction improves
>>>> FileStore performance in my previous testing (most of the time spent
>>>> in FileStore is in the omap_setkeys transaction).
>>>
>>> I can't find #3244?
>>
>> I think it's https://github.com/ceph/ceph/pull/3244
>>
> Yeah, exactly it is.
>>>
>>>> Well, I think another performance issue is the setattrs strategy.
>>>> Here is some kernel log captured from xfs behaviour:
>>>> Mar  6 17:19:37 ceph2 kernel: start_xfs_attr_set_int: name =
>>>> ceph._(6), value =.259)
>>>> Mar  6 17:19:37 ceph2 kernel: format of di_c data: 2, format of attr
>>>> forks data: 1
>>>> Mar  6 17:19:37 ceph2 kernel: di_extsize=0, di_nextents=0,
>>>> di_anextents=0, di_forkoff=239
>>>>
>>>> Mar  6 17:19:37 ceph2 kernel: start_xfs_attr_set_int: name =
>>>> ceph._(6), value =.259)
>>>> Mar  6 17:19:37 ceph2 kernel: format of di_c data: 2, format of attr
>>>> forks data: 2
>>>> Mar  6 17:19:37 ceph2 kernel: di_extsize=0, di_nextents=1,
>>>> di_anextents=1, di_forkoff=239
>>>>
>>>> Mar  6 17:19:37 ceph2 kernel: start_xfs_attr_set_int: name =
>>>> ceph._(6), value =.259)
>>>> Mar  6 17:19:37 ceph2 kernel: format of di_c data: 2, format of attr
>>>> forks data: 2
>>>> Mar  6 17:19:37 ceph2 kernel: di_extsize=0, di_nextents=0,
>>>> di_anextents=1, di_forkoff=239
>>>>
>>>> typedef enum xfs_dinode_fmt {
>>>>         XFS_DINODE_FMT_DEV,     /* xfs_dev_t */
>>>>         XFS_DINODE_FMT_LOCAL,   /* bulk data */
>>>>         XFS_DINODE_FMT_EXTENTS, /* struct xfs_bmbt_rec */
>>>>         XFS_DINODE_FMT_BTREE,   /* struct xfs_bmdr_block */
>>>>         XFS_DINODE_FMT_UUID     /* uuid_t */
>>>> } xfs_dinode_fmt_t;
>>>>
>>>> Here, attr forks data = 2 means XFS_DINODE_FMT_EXTENTS (the xattr is
>>>> stored in extent format), while attr forks data = 1 means
>>>> XFS_DINODE_FMT_LOCAL (the xattr is stored as an inline attribute).
>>>>
>>>> However, in most cases the xattr is stored in an extent, not
>>>> inline. Please note that I have already formatted the partition with
>>>> -i size=2048. When the number of xattrs is larger than 10, XFS uses
>>>> XFS_DINODE_FMT_BTREE to accelerate key searching.
>>>
>>> Did you by chance look at what size the typical xattrs are?  I expected
>>> that the usual _ and snapset attrs would be small enough to fit inline..
>>> but if they're not then we should at a minimum adjust our recommendation
>>> on xfs inode size.
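
A minimal sketch of one way to measure that, assuming nothing about
FileStore internals and using only the plain Linux listxattr/getxattr
syscalls against an object file (the program and its output format are
illustrative, not part of Ceph):

// Dump each xattr name and value size of a file so the total can be
// compared against the inline space left by "mkfs.xfs -i size=2048".
#include <sys/xattr.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>
#include <vector>

int main(int argc, char **argv) {
  if (argc < 2) { fprintf(stderr, "usage: %s <object file>\n", argv[0]); return 1; }
  int fd = open(argv[1], O_RDONLY);
  if (fd < 0) { perror("open"); return 1; }

  // First call with a zero-sized buffer to learn how big the name list is.
  ssize_t len = flistxattr(fd, nullptr, 0);
  if (len < 0) { perror("flistxattr"); return 1; }
  std::vector<char> names(len);
  len = flistxattr(fd, names.data(), names.size());

  size_t total = 0;
  for (ssize_t off = 0; off < len; off += strlen(&names[off]) + 1) {
    const char *name = &names[off];
    ssize_t vlen = fgetxattr(fd, name, nullptr, 0);  // size only, no value copy
    printf("%-32s %zd bytes\n", name, vlen);
    if (vlen > 0) total += strlen(name) + vlen;
  }
  printf("total name+value bytes: %zu\n", total);
  close(fd);
  return 0;
}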
>>>
>>>> So, in _setattr(), we could get just the xattr keys by using
>>>> chain_flistxattr instead of _fgetattrs, which retrieves (key, value)
>>>> pairs, since the values are of no use here. Furthermore, we may want
>>>> to reconsider the strategy of moving spilled-out xattrs to omap, as
>>>> xfs only restricts each xattr value to < 64K and each xattr key to
>>>> < 255 bytes. A duplicated read for XATTR_SPILL_OUT_NAME also occurs in:
>>>> r = chain_fgetxattr(**fd, XATTR_SPILL_OUT_NAME, buf, sizeof(buf));
>>>> r = _fgetattrs(**fd, inline_set);
>>>> When I skip the _fgetattrs() logic and just update the xattr in
>>>> _setattr(), my SSD cluster gains about 2% - 3% in performance.
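
For reference, a rough sketch of that keys-only lookup; it uses the
plain flistxattr syscall rather than Ceph's chain_flistxattr wrapper,
and the helper name is made up for illustration:

// Collect only the names of the existing xattrs: enough to decide
// whether a key is already present or spilled out, without paying for
// reading every value the way a full _fgetattrs pass does.
#include <sys/xattr.h>
#include <cstring>
#include <set>
#include <string>

static int list_xattr_keys(int fd, std::set<std::string> *keys) {
  ssize_t len = flistxattr(fd, nullptr, 0);   // names only, no values
  if (len <= 0)
    return (int)len;                          // 0 = no xattrs, <0 = error
  std::string buf(len, '\0');
  len = flistxattr(fd, &buf[0], buf.size());
  if (len < 0)
    return -1;
  for (ssize_t off = 0; off < len; off += strlen(buf.c_str() + off) + 1)
    keys->insert(buf.c_str() + off);          // each name is NUL-terminated
  return 0;
}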
>>>
>>> I'm not quite following... do you have a patch we can look at?
>>
>> I think he means that we can use a minimal set of xattr attrs and avoid
>> xattr chains by using omap instead.
>>
> Yes. Make a basic assumption: for example, we only allow user.ceph._
> and user.ceph.snapset as xattr attrs. Then we can simplify the logic a
> lot. Actually, the purpose of the automatic decision to redirect
> xattrs into omap is to serve cephfs, which may store user-defined
> xattrs. For the rbd case there is no such problem, since there are
> just two xattr attrs (user.ceph._ and user.ceph.snapset), and for
> ecpool one more for the hash, which is predictable. Furthermore, I
> prefer to stop recording user.ceph.snapset when there is too much
> fragmentation; there is a huge performance penalty when
> user.ceph.snapset is large.
> Since both the extent and BTREE layouts are remote xattrs, not inline
> xattrs, in xfs, I think using omap will not cause much performance
> penalty, especially for an HDD-based FileStore.
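
As a rough illustration of that whitelist idea (the key names, the EC
hash attr, and the size threshold below are assumptions for the sketch,
not FileStore's actual policy):

#include <cstddef>
#include <string>

// Rough inline budget for a 2048-byte XFS inode after the inode core and
// data fork; the real number depends on di_forkoff, so treat it as a knob.
static const size_t kInlineXattrBudget = 1024;

// Keep only well-known, small keys as filesystem xattrs; everything else
// (e.g. cephfs user-defined xattrs, or oversized values) goes to omap.
static bool store_as_fs_xattr(const std::string &key, size_t value_len) {
  static const char *whitelist[] = { "user.ceph._", "user.ceph.snapset",
                                     "user.ceph.hinfo_key" /* ecpool hash */ };
  bool known = false;
  for (const char *w : whitelist)
    known |= (key == w);
  return known && value_len <= kInlineXattrBudget;
}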

Yeah, I agree with this. So this calls for a little dive into XFS
internals if we want to handle these xattrs better. It would be great if
xfs could export the xattr layout type (btree, inline, or list) or the
boundary between them.

So do you have any details about typical xattr sizes in client usage and
how to optimize FileStore's xattr decision? In other words, does it make
sense for FileStore to be aware of the XFS xattr layout online, or at
least when initing FileStore, so that we can decide the right way to
store each xattr?
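
One way FileStore could probe this when initing, sketched under the
assumption that the xfsprogs headers and the XFS_IOC_FSGEOMETRY ioctl
are available on the build host; the 512-byte overhead estimate and the
function name are illustrative guesses, not an existing interface:

#include <xfs/xfs.h>      // XFS_IOC_FSGEOMETRY, struct xfs_fsop_geom
#include <sys/ioctl.h>
#include <fcntl.h>
#include <unistd.h>

// Ask XFS for its inode size and derive a conservative budget for how
// many xattr bytes can stay inline before they spill to the attr fork.
static long probe_inline_xattr_budget(const char *mountpoint) {
  int fd = open(mountpoint, O_RDONLY | O_DIRECTORY);
  if (fd < 0)
    return -1;
  struct xfs_fsop_geom geom;
  int r = ioctl(fd, XFS_IOC_FSGEOMETRY, &geom);
  close(fd);
  if (r < 0)
    return -1;
  // Assume the inode core plus data fork take roughly 512 bytes.
  long budget = (long)geom.inodesize - 512;
  return budget > 0 ? budget : 0;
}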
>
>>>
>>>> Another issue, about an idea for recovery, is shown in
>>>> https://github.com/ceph/ceph/pull/3837
>>>> Can you give some suggestions about that?
>>>
>>> I think this direction has a lot of potential, although it will add a fair
>>> bit of complexity.
>>>
>>> I think you can avoid the truncate field and infer that from the dirtied
>>> interval and the new object size.  Need to look at the patch more closely
>>> still, though...
> Uh, yeah, the purpose of the truncate field is to deal with the
> situation below:
>     1) A is down.
>     2) B truncates the entire 4M object to 3M, then does some writes
> that may extend the object to 4M again, but with lots of holes between
> [3M, 4M].
>     3) Recovery will find that both objects are 4M, so it concludes
> there is no need to truncate, but a partial data recovery between
> [3M, 4M] may cause data inconsistency, so we also need to truncate the
> object on A.
>     We could still infer those cases if we mark [3M, 4M] dirty when the
> truncate occurs on B, but that is incompatible with sparse read/write,
> since we would need to read out all the holes between [3M, 4M] and
> write them to A. What do you think? @Haomai Wang.
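
To make that case concrete, here is a tiny sketch of the decision only
(not the code in PR #3837); it assumes the log tracks, alongside the
dirtied interval, the lowest size the object reached while the peer was
down:

#include <cstdint>

struct RecoveryPlan {
  bool     need_truncate;  // peer must truncate before dirty extents arrive
  uint64_t truncate_to;    // low-water mark reached while the peer was down
};

// peer_size: current size of the object on A (the OSD that was down).
// min_size_seen: smallest size the object reached on B while A was down
// (3M in the example above).
static RecoveryPlan plan_recovery(uint64_t peer_size, uint64_t min_size_seen) {
  RecoveryPlan p{false, peer_size};
  if (min_size_seen < peer_size) {
    // B shrank the object below A's copy at some point (4M -> 3M), so even
    // if the final sizes match, A may hold stale bytes inside the sparse
    // region and must be truncated to the low-water mark before recovery
    // writes the dirty extents.
    p.need_truncate = true;
    p.truncate_to = min_size_seen;
  }
  return p;
}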
>
>>
>> For xattr and omap optimization, I am mostly counting on this PR:
>> https://github.com/ceph/ceph/pull/2972
>>
>>
> This is a patch submitted by my teammate, but we have done more; this
> is one of the patches we have confirmed works well under any
> circumstance (either HDD or SSD). We have also tried other things, like
> moving pg_info and the epoch to per-PG xattr attrs. That works better
> in an SSD environment but not on HDD. Writing data to omap is costly
> because of the huge write amplification; it exhausts the bandwidth of
> the SSD, especially under 4k-write workloads (every write operation
> needs to update pg_info and the epoch in omap). Using file-system xattr
> attrs, on the other hand, benefits from the system page cache and
> updates pg_info once when we call sync().
> An alternative is to keep the newest pg_info on-disk content in a cache
> and flush it to disk just before we call sync (sync_entry in
> FileStore), while still using omap to store pg_info.
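
A small sketch of that alternative, with placeholder names (PGInfoCache
and the omap writer hook below are not existing FileStore interfaces):
keep the newest encoded pg_info per PG in memory and write it out once
per sync interval instead of once per client write.

#include <map>
#include <mutex>
#include <string>

struct PGInfoCache {
  std::mutex lock;
  std::map<std::string, std::string> dirty;   // pgid -> encoded pg_info

  // Called on every write: cheap, memory only; later updates overwrite
  // earlier ones, so only the newest pg_info survives.
  void note_update(const std::string &pgid, const std::string &encoded) {
    std::lock_guard<std::mutex> l(lock);
    dirty[pgid] = encoded;
  }

  // Called just before the filestore sync (sync_entry): one omap write
  // per PG per sync interval.
  template <typename OmapWriter>
  void flush(OmapWriter &&write_omap) {
    std::map<std::string, std::string> batch;
    {
      std::lock_guard<std::mutex> l(lock);
      batch.swap(dirty);
    }
    for (auto &kv : batch)
      write_omap(kv.first, kv.second);
  }
};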
>
> We have actually done a lot of experiments and testing on how to deal
> with xattrs. On an SSD cluster it is much better to store the
> information as extended attributes of the file system, while on an HDD
> cluster omap always performs better, because a remote extended
> attribute has to read or store the xattr contents in disk blocks
> outside the inode. So when we call open("file_name") only the inode is
> loaded; the xattr value is not loaded into memory, and we need a second
> disk read when we call getattr(), which becomes a random read/write
> issue on HDD. There are also double write operations when calling
> sync(), whereas omap uses a log to avoid these effects.
>
>>>
>>> sage
>>>
>>
>>
>>
>> --
>> Best Regards,
>>
>> Wheat



-- 
Best Regards,

Wheat