Re: About _setattr() optimization and recovery acceleration

2015-03-07 16:43 GMT+08:00 Haomai Wang <haomaiwang@xxxxxxxxx>:
> On Sat, Mar 7, 2015 at 12:03 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>> Hi!
>>
>> [copying ceph-devel]
>>
>> On Fri, 6 Mar 2015, Nicheal wrote:
>>> Hi Sage,
>>>
>>> Cool for issue #3878, the duplicated pg_log write, which I raised earlier
>>> in my issue #3244. A single omap_setkeys transaction improves the
>>> performance of FileStore, as shown in my previous testing (most of the
>>> time spent in FileStore is in the omap_setkeys transaction).
>>
>> I can't find #3244?
>
> I think it's https://github.com/ceph/ceph/pull/3244
>
Yeah, that's exactly it.
>>
>>> Well, I think another performance issue lies in the strategy used for setattrs.
>>> Here is some kernel log output captured from xfs's behaviour.
>>> Mar  6 17:19:37 ceph2 kernel: start_xfs_attr_set_int: name =
>>> ceph._(6), value =.259)
>>> Mar  6 17:19:37 ceph2 kernel: format of di_c data: 2, format of attr
>>> forks data: 1
>>> Mar  6 17:19:37 ceph2 kernel: di_extsize=0, di_nextents=0,
>>> di_anextents=0, di_forkoff=239
>>>
>>> Mar  6 17:19:37 ceph2 kernel: start_xfs_attr_set_int: name =
>>> ceph._(6), value =.259)
>>> Mar  6 17:19:37 ceph2 kernel: format of di_c data: 2, format of attr
>>> forks data: 2
>>> Mar  6 17:19:37 ceph2 kernel: di_extsize=0, di_nextents=1,
>>> di_anextents=1, di_forkoff=239
>>>
>>> Mar  6 17:19:37 ceph2 kernel: start_xfs_attr_set_int: name =
>>> ceph._(6), value =.259)
>>> Mar  6 17:19:37 ceph2 kernel: format of di_c data: 2, format of attr
>>> forks data: 2
>>> Mar  6 17:19:37 ceph2 kernel: di_extsize=0, di_nextents=0,
>>> di_anextents=1, di_forkoff=239
>>>
>>> typedef enum xfs_dinode_fmt {
>>>     XFS_DINODE_FMT_DEV,     /* xfs_dev_t */
>>>     XFS_DINODE_FMT_LOCAL,   /* bulk data */
>>>     XFS_DINODE_FMT_EXTENTS, /* struct xfs_bmbt_rec */
>>>     XFS_DINODE_FMT_BTREE,   /* struct xfs_bmdr_block */
>>>     XFS_DINODE_FMT_UUID     /* uuid_t */
>>> } xfs_dinode_fmt_t;
>>>
>>> Here, attr fork format 2 means XFS_DINODE_FMT_EXTENTS (the xattr is
>>> stored in extent format), while attr fork format 1 means
>>> XFS_DINODE_FMT_LOCAL (the xattr is stored as an inline attribute).
>>>
>>> However, in most cases the xattr is stored in an extent, not
>>> inline. Please note that I have already formatted the partition with
>>> -i size=2048. When the number of xattrs is larger than 10, xfs uses
>>> XFS_DINODE_FMT_BTREE to accelerate key searching.
>>
>> Did you by chance look at what size the typical xattrs are?  I expected
>> that the usual _ and snapset attrs would be small enough to fit inline..
>> but if they're not then we should at a minimum adjust our recommendation
>> on xfs inode size.
>>
>>> So, in _setattr(), we could get just the xattr keys by using chain_flistxattr
>>> instead of _fgetattrs, which retrieves (key, value) pairs, since the values
>>> are of no use here. Furthermore, we may want to reconsider the strategy of
>>> moving spilled-out xattrs to omap, given that xfs only restricts each xattr
>>> value to < 64K and each xattr key to < 255 bytes. A duplicated read of
>>> XATTR_SPILL_OUT_NAME also occurs in:
>>> r = chain_fgetxattr(**fd, XATTR_SPILL_OUT_NAME, buf, sizeof(buf));
>>> r = _fgetattrs(**fd, inline_set);
>>> When I skip the _fgetattrs() logic and just update the xattrs directly
>>> in _setattr(), my SSD cluster improves by about 2% - 3% in
>>> performance.
>>
>> I'm not quite following... do you have a patch we can look at?
>
> I think he means that we can keep a minimal set of xattrs and avoid
> xattr chains by using omap instead.
>
Yes. Make a basic assumption: for example, we only allow user.ceph._
and user.ceph.snapset as xattrs. Then we can simplify the logic a
lot. Actually, the automatic decision to redirect xattrs into omap
exists to serve cephfs, which may store user-defined xattrs. The rbd
case does not have this problem, since there are just two xattrs
(user.ceph._ and user.ceph.snapset), and an EC pool adds one more for
the hash, which is predictable. Furthermore, I would prefer to stop
recording user.ceph.snapset when there is too much fragmentation,
because there is a huge performance penalty when user.ceph.snapset is
large. Since both the extent and btree layouts are remote xattrs, not
inline xattrs, in xfs, I think using omap will not cause much of a
performance penalty, especially for an HDD-based FileStore.
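
Just to illustrate the keys-only retrieval I mentioned above, something
roughly like the sketch below (plain syscalls, not the chained FileStore
code, and ignoring the user.ceph. prefix handling) is enough to learn
which keys exist without ever reading their values:

#include <sys/xattr.h>
#include <cstring>
#include <string>
#include <vector>

// List only the xattr *names* of an open file; no values are read.
static std::vector<std::string> list_xattr_keys(int fd)
{
  std::vector<std::string> keys;
  ssize_t len = flistxattr(fd, nullptr, 0);      // first call: required buffer size
  if (len <= 0)
    return keys;
  std::vector<char> buf(len);
  len = flistxattr(fd, buf.data(), buf.size());  // names come back NUL-separated
  for (char *p = buf.data(); len > 0 && p < buf.data() + len; p += strlen(p) + 1)
    keys.emplace_back(p);
  return keys;
}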

>>
>>> Another issue, about an idea for recovery, is shown in
>>> https://github.com/ceph/ceph/pull/3837
>>> Can you give some suggestions about that?
>>
>> I think this direction has a lot of potential, although it will add a fair
>> bit of complexity.
>>
>> I think you can avoid the truncate field and infer that from the dirtied
>> interval and the new object size.  Need to look at the patch more closely
>> still, though...
Uh, yeah. The purpose of the truncate field is to deal with the situation below:
    1) A is down.
    2) B truncates the entire 4M object to 3M, then some writes extend
the object to 4M again, but with lots of holes between [3M - 4M].
    3) Recovery will then find that both objects are 4M, so it decides
there is no need to truncate; but a partial data recovery between
[3M - 4M] may cause data inconsistency, so we also need to truncate
the object on A.
    We can still infer these cases if we mark [3M - 4M] dirty when the
truncate occurs on B, but that is incompatible with sparse read/write,
since we would need to read out all the holes between [3M - 4M] and
write them to A (see the sketch below). What do you think about that? @Haomai Wang.
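
To make case 3) concrete, a hypothetical check (made-up names, not the
code in the pull request) might look like this:

#include <cstdint>

// B records the lowest offset it truncated to while A was down; if that
// offset is below A's current object size, A must be truncated before the
// dirty interval is recovered, even if both objects end up 4M, because
// sparse recovery would skip the holes in [3M - 4M] and leave A's stale
// bytes in place.
struct recovery_hint {
  uint64_t new_size;       // object size on the authoritative copy (B)
  uint64_t truncate_size;  // lowest truncation point seen, or UINT64_MAX if none
};

static bool peer_needs_truncate(const recovery_hint &h, uint64_t peer_size)
{
  return h.truncate_size < peer_size;
}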

>
> For the xattr and omap optimization, I am mostly counting on this PR:
> https://github.com/ceph/ceph/pull/2972
>
>
This is a patch submitted by my teammate. But we have done more; this
is one of the patches we have confirmed works well under any
circumstances (either HDD or SSD). We have also made other efforts,
like moving pg_info and the epoch into per-PG xattrs. That works better
in an SSD environment but not on HDD. Writing the data into omap is
costly because of the huge write amplification; it exhausts the
bandwidth of the SSD, especially in 4k-write cases (every write
operation needs to update pg_info and the epoch in omap). But using
file-system xattrs can benefit from the system page cache and update
pg_info only once when we call sync().
An alternative is to keep the newest pg_info on-disk content in a
cache and flush it to disk just before we call sync (sync_entry in
FileStore), while still using omap to store pg_info; a rough sketch follows.
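
A rough sketch of that alternative (hypothetical names, not actual Ceph
code) could be as simple as:

#include <map>
#include <mutex>
#include <string>

// Keep only the newest serialized pg_info/epoch per PG in memory; persist it
// to omap once per sync instead of on every write.
class DirtyPGInfoCache {
  std::mutex lock;
  std::map<int, std::string> dirty;   // pg id -> newest serialized pg_info
public:
  void note_update(int pgid, const std::string &info) {
    std::lock_guard<std::mutex> l(lock);
    dirty[pgid] = info;               // older in-memory versions are overwritten
  }
  // Called from the sync path (e.g. sync_entry) right before committing.
  template <typename WriteOmapFn>
  void flush(WriteOmapFn write_omap) {
    std::lock_guard<std::mutex> l(lock);
    for (const auto &p : dirty)
      write_omap(p.first, p.second);  // one omap write per PG per sync
    dirty.clear();
  }
};

The point is just that pg_info leaves the per-write path but still
reaches omap before the commit point.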

We have actually put a lot of effort into testing the performance of
different ways of dealing with xattrs. On an SSD cluster it is much
better to store the information as extended attributes of the
file-system, while on an HDD cluster omap always performs better,
because a remote extent attribute has to read or store the contents of
the xattr in other disk blocks, not contained in the inode. So when we
call open("file_name"), only the inode is loaded; the value of the
xattr is not loaded into memory, and we need a second disk read when
we call getattr(), which becomes a random read/write issue for HDD.
There are also double write operations when calling sync(); omap uses
a log to bypass those effects.
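
For reference, the extra I/O I am describing shows up even with plain
syscalls (this is just an illustration, not FileStore code):

#include <fcntl.h>
#include <sys/xattr.h>
#include <unistd.h>
#include <vector>

// open() only brings the inode into memory; if the attribute is stored in a
// remote extent/btree block, the fgetxattr() calls below cost an additional
// (often random) disk read on a cold cache.
static ssize_t read_xattr(const char *path, const char *name,
                          std::vector<char> &out)
{
  int fd = open(path, O_RDONLY);
  if (fd < 0)
    return -1;
  ssize_t len = fgetxattr(fd, name, nullptr, 0);       // size query
  if (len > 0) {
    out.resize(len);
    len = fgetxattr(fd, name, out.data(), out.size()); // reads the remote block
  }
  close(fd);
  return len;
}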

>>
>> sage
>>
>
>
>
> --
> Best Regards,
>
> Wheat