> Yeah, I agree with this. So this is a little dive into XFS internals if we
> want to do better with these xattrs. If XFS could export this, or the
> boundary of the xattr type (btree, inline or list), that would be great.

Yeah, but that needs a modified kernel and a self-defined system call; I would
prefer not to go that way.

> So do you have any details about xfs xattr sizes in client usage and how to
> optimize the FileStore xattr decision? In other words, does it make sense
> for FileStore to be aware of the XFS xattr layout online, or at least when
> initializing FileStore, so that we can decide the right way to store it?

Based on my analysis, an attribute only stays inline if its value is < 254
bytes, even when we make the inode size the maximum (say -i size=2048 at mkfs
time). Before version 0.80.5 the value of ceph._ was less than 250 bytes,
which is definitely inline, while since version 0.87 the value of ceph._ is
259 bytes, so it falls into the extent format. The reason is that local_mtime
and filestore_hobject_key_t were added. One solution is to compress the
encoded object_info_t value, or to clear the unused elements in object_info_t.
Any other suggestions?

Another thing is that if we take snapshots, we also need to keep the value of
ceph.snapset < 254 bytes, which means that if there are too many fragments we
should not record the overlaps.
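A quick check like the one below shows which values would spill out of the
inline area (just a rough sketch, not part of Ceph; the 254-byte limit is the
boundary I described above for -i size=2048). It only asks the kernel for the
value sizes, never for the values themselves:

/* xattr_size_check.cc -- rough sketch only, not part of Ceph.
 * Lists the xattrs of one FileStore object file and flags values that would
 * not fit inline, assuming the ~254-byte boundary for -i size=2048. */
#include <sys/xattr.h>
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <vector>

static const ssize_t INLINE_LIMIT = 254;   /* assumed boundary, see above */

int main(int argc, char **argv)
{
  if (argc != 2) {
    fprintf(stderr, "usage: %s <object file>\n", argv[0]);
    return 1;
  }
  const char *path = argv[1];

  /* first call with a NULL buffer just to learn the size of the name list */
  ssize_t len = listxattr(path, NULL, 0);
  if (len <= 0) {
    fprintf(stderr, "listxattr: %s\n", len < 0 ? strerror(errno) : "no xattrs");
    return 1;
  }
  std::vector<char> names(len);
  len = listxattr(path, names.data(), names.size());
  if (len < 0) {
    fprintf(stderr, "listxattr: %s\n", strerror(errno));
    return 1;
  }

  /* the buffer holds a sequence of NUL-terminated names */
  for (ssize_t off = 0; off < len; off += strlen(&names[off]) + 1) {
    const char *name = &names[off];
    /* size-only query: NULL value buffer, so the value is never copied */
    ssize_t vlen = getxattr(path, name, NULL, 0);
    if (vlen < 0)
      continue;
    printf("%-32s %6zd bytes  %s\n", name, vlen,
           vlen < INLINE_LIMIT ? "can stay inline" : "goes to extent/btree");
  }
  return 0;
}

This only checks the per-value limit, of course; the total attribute fork
space (di_forkoff) matters as well, but it is enough to see ceph._ crossing
the 254-byte line between releases.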
Regards,
Ning Yao

2015-03-09 13:35 GMT+08:00 Haomai Wang <haomaiwang@xxxxxxxxx>:
> On Mon, Mar 9, 2015 at 1:26 PM, Nicheal <zay11022@xxxxxxxxx> wrote:
>> 2015-03-07 16:43 GMT+08:00 Haomai Wang <haomaiwang@xxxxxxxxx>:
>>> On Sat, Mar 7, 2015 at 12:03 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>>>> Hi!
>>>>
>>>> [copying ceph-devel]
>>>>
>>>> On Fri, 6 Mar 2015, Nicheal wrote:
>>>>> Hi Sage,
>>>>>
>>>>> Cool for issue #3878, the duplicated pg_log write, which I posted earlier
>>>>> in my issue #3244; a single omap_setkeys transaction improves FileStore
>>>>> performance, as in my previous testing (most of the time spent in
>>>>> FileStore is in the omap_setkeys transaction).
>>>>
>>>> I can't find #3244?
>>>
>>> I think it's https://github.com/ceph/ceph/pull/3244
>>
>> Yeah, exactly it is.
>>>>
>>>>> Well, I think another performance issue is the strategy of setattrs.
>>>>> Here is some kernel log captured from XFS's behaviour:
>>>>>
>>>>> Mar 6 17:19:37 ceph2 kernel: start_xfs_attr_set_int: name = ceph._(6), value =.259)
>>>>> Mar 6 17:19:37 ceph2 kernel: format of di_c data: 2, format of attr forks data: 1
>>>>> Mar 6 17:19:37 ceph2 kernel: di_extsize=0, di_nextents=0, di_anextents=0, di_forkoff=239
>>>>>
>>>>> Mar 6 17:19:37 ceph2 kernel: start_xfs_attr_set_int: name = ceph._(6), value =.259)
>>>>> Mar 6 17:19:37 ceph2 kernel: format of di_c data: 2, format of attr forks data: 2
>>>>> Mar 6 17:19:37 ceph2 kernel: di_extsize=0, di_nextents=1, di_anextents=1, di_forkoff=239
>>>>>
>>>>> Mar 6 17:19:37 ceph2 kernel: start_xfs_attr_set_int: name = ceph._(6), value =.259)
>>>>> Mar 6 17:19:37 ceph2 kernel: format of di_c data: 2, format of attr forks data: 2
>>>>> Mar 6 17:19:37 ceph2 kernel: di_extsize=0, di_nextents=0, di_anextents=1, di_forkoff=239
>>>>>
>>>>> typedef enum xfs_dinode_fmt {
>>>>>         XFS_DINODE_FMT_DEV,     /* xfs_dev_t */
>>>>>         XFS_DINODE_FMT_LOCAL,   /* bulk data */
>>>>>         XFS_DINODE_FMT_EXTENTS, /* struct xfs_bmbt_rec */
>>>>>         XFS_DINODE_FMT_BTREE,   /* struct xfs_bmdr_block */
>>>>>         XFS_DINODE_FMT_UUID     /* uuid_t */
>>>>> } xfs_dinode_fmt_t;
>>>>>
>>>>> "format of attr forks data: 2" means XFS_DINODE_FMT_EXTENTS (the xattr is
>>>>> stored in extent format), while "format of attr forks data: 1" means
>>>>> XFS_DINODE_FMT_LOCAL (the xattr is stored as an inline attribute).
>>>>>
>>>>> However, in most cases the xattr is stored in extents, not inline, even
>>>>> though I have already formatted the partition with -i size=2048. When the
>>>>> number of xattrs grows beyond 10, XFS uses XFS_DINODE_FMT_BTREE to
>>>>> accelerate key searching.
>>>>
>>>> Did you by chance look at what size the typical xattrs are?  I expected
>>>> that the usual _ and snapset attrs would be small enough to fit inline...
>>>> but if they're not then we should at a minimum adjust our recommendation
>>>> on xfs inode size.
>>>>
>>>>> So, in _setattr(), we could get just the xattr keys with chain_flistxattr
>>>>> instead of _fgetattrs, which retrieves (key, value) pairs, since the
>>>>> values are of no use here. Furthermore, we may want to reconsider the
>>>>> strategy of moving spilled-out xattrs to omap, since XFS only restricts
>>>>> each xattr value to < 64K and each xattr key to < 255 bytes. There is
>>>>> also a duplicated read of XATTR_SPILL_OUT_NAME in:
>>>>> r = chain_fgetxattr(**fd, XATTR_SPILL_OUT_NAME, buf, sizeof(buf));
>>>>> r = _fgetattrs(**fd, inline_set);
>>>>> I tried skipping the _fgetattrs() logic and just updating the xattrs in
>>>>> _setattr(), and my SSD cluster improved by about 2-3% in performance.
>>>>
>>>> I'm not quite following... do you have a patch we can look at?
>>>
>>> I think his meaning is that we can use a minimal set of xattrs and avoid
>>> xattr chains by using omap instead.
>>>
>> Yes. Make a basic assumption: for example, we only allow user.ceph._ and
>> user.ceph.snapset as xattrs. Then we can simplify the logic a lot. Actually,
>> the purpose of the automatic decision to redirect xattrs into omap is to
>> serve CephFS, which may store user-defined xattrs. For the rbd case there is
>> no such problem, since there are just two xattrs (user.ceph._ and
>> user.ceph.snapset), plus one more for the hash in the EC-pool case, which is
>> predictable. Furthermore, I would prefer to stop recording user.ceph.snapset
>> when there are too many fragments. There is a huge performance penalty when
>> user.ceph.snapset is large.
>> Since both the extent and btree layouts are remote xattrs in XFS, not
>> inline, I think using omap will not cause much of a performance penalty,
>> especially for an HDD-based FileStore.
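Coming back to the chain_flistxattr point quoted above: the key-only scan is
really just one buffer walk, roughly like the sketch below (a hypothetical
helper working on a plain fd, not the actual FileStore chain_* code):

/* Sketch only: collect just the xattr *names* of an open fd, so the
 * spill-out decision in _setattr() does not need to fetch any old values. */
#include <sys/xattr.h>
#include <cerrno>
#include <cstring>
#include <set>
#include <string>
#include <vector>

static int list_xattr_keys(int fd, std::set<std::string> *keys)
{
  ssize_t len = flistxattr(fd, NULL, 0);        /* size of the name list */
  if (len < 0)
    return -errno;
  std::vector<char> buf(len);
  len = flistxattr(fd, buf.data(), buf.size());
  if (len < 0)
    return -errno;
  /* names are packed back to back, each NUL-terminated */
  for (ssize_t off = 0; off < len; off += strlen(&buf[off]) + 1)
    keys->insert(std::string(&buf[off]));
  return 0;
}

With only the keys in hand, _setattr() just needs the sizes of the incoming
values to decide what spills to omap, which is presumably where the 2-3%
improvement on SSD comes from.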
> Yeah, I agree with this. So this is a little dive into XFS internals if we
> want to do better with these xattrs. If XFS could export this, or the
> boundary of the xattr type (btree, inline or list), that would be great.
>
> So do you have any details about xfs xattr sizes in client usage and how to
> optimize the FileStore xattr decision? In other words, does it make sense
> for FileStore to be aware of the XFS xattr layout online, or at least when
> initializing FileStore, so that we can decide the right way to store it?
>>
>>>>
>>>>> Another issue, about an idea for recovery, is shown in
>>>>> https://github.com/ceph/ceph/pull/3837
>>>>> Can you give some suggestions about that?
>>>>
>>>> I think this direction has a lot of potential, although it will add a fair
>>>> bit of complexity.
>>>>
>>>> I think you can avoid the truncate field and infer that from the dirtied
>>>> interval and the new object size.  Need to look at the patch more closely
>>>> still, though...
>> Uh, yeah, the purpose of the truncate field is to deal with the situation
>> below:
>>     1) A is down.
>>     2) B truncates the entire 4M object down to 3M, then some writes may
>>        extend the object to 4M again, but with lots of holes in [3M, 4M].
>>     3) Recovery then finds that both objects are 4M, so there is no need to
>>        truncate; but a partial data recovery of [3M, 4M] may cause data
>>        inconsistency, so we also need a truncate operation for the object
>>        on A.
>> We can still infer those cases if we mark [3M, 4M] dirty when the truncate
>> occurs on B, but that is incompatible with sparse read/write, since we would
>> need to read out all the holes in [3M, 4M] and write them to A. What do you
>> think about that? @Haomai Wang
>>
>>>
>>> For xattr and omap optimization I mostly expect this PR:
>>> https://github.com/ceph/ceph/pull/2972
>>>
>> This is a patch submitted by my teammate, but we did more; this is one of
>> the patches we have confirmed to work well under any circumstances (either
>> HDD or SSD). We also made some other efforts, like moving pg_info and the
>> epoch into per-PG xattrs. That works better in an SSD environment, but not
>> on HDD. Writing data into omap is costly because of the huge write
>> amplification; it exhausts the bandwidth of the SSD, especially in the
>> 4K-write case (every write operation needs to update pg_info and the epoch
>> in omap). But using filesystem xattrs can benefit from the system page
>> cache, updating pg_info only once when we call sync().
>> An alternative is to keep the newest on-disk pg_info content in cache, flush
>> it to disk just before we call sync (sync_entry in FileStore), and still use
>> omap to store pg_info.
>>
>> We actually did a lot of work and testing on how to deal with xattrs. On an
>> SSD cluster it is much better to store the information as extended
>> attributes of the filesystem, while on an HDD cluster omap always performs
>> better, because a remote extended attribute has to read or store the xattr
>> contents in other disk blocks, not contained in the inode. So when we call
>> open("file_name"), only the inode is loaded; the xattr value is not loaded
>> into memory, and we need a second disk read when we call getattr(), which
>> becomes a random read/write issue for HDD. There are also double write
>> operations when calling sync(); omap uses a log to bypass those effects.
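To make that last alternative concrete, this is roughly the shape I have in
mind (all names are invented for illustration, nothing like the real
FileStore/PG code): every pg_info/epoch update only overwrites an in-memory
copy, and the sync path writes each dirty entry to omap once.

/* Sketch of the "cache the newest pg_info, flush on sync" alternative. */
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <utility>

struct PGInfoCache {
  std::mutex lock;
  /* pgid -> latest encoded pg_info + epoch; only the newest copy is kept,
   * so N updates between two syncs collapse into a single omap write. */
  std::map<std::string, std::string> dirty;

  void note_update(const std::string &pgid, std::string encoded) {
    std::lock_guard<std::mutex> l(lock);
    dirty[pgid] = std::move(encoded);           /* overwrite the older version */
  }

  /* called from the sync path, e.g. right before sync_entry() commits */
  void flush(const std::function<void(const std::string &,
                                      const std::string &)> &omap_set) {
    std::map<std::string, std::string> batch;
    {
      std::lock_guard<std::mutex> l(lock);
      batch.swap(dirty);
    }
    for (const auto &p : batch)
      omap_set(p.first, p.second);              /* one omap write per dirty PG */
  }
};

That would keep the per-op pg_info/epoch updates off the 4K write path while
still storing pg_info in omap.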
>>>>
>>>> sage
>>>
>>> --
>>> Best Regards,
>>>
>>> Wheat
>
> --
> Best Regards,
>
> Wheat