Re: Write performance issue under rocksdb kvstore

Got your point. It is not only about the object data itself, but also about Ceph's internal metadata.

The best option seems to be your PR and the wip-newstore-frags branch. :-)


Thanks.
Zhi Zhang (David)


> Date: Tue, 20 Oct 2015 06:25:43 -0700
> From: sage@xxxxxxxxxxxx
> To: zhangz.david@xxxxxxxxxxx
> CC: ceph-users@xxxxxxxxxxxxxx; ceph-devel@xxxxxxxxxxxxxxx
> Subject: Re: [ceph-users] Write performance issue under rocksdb kvstore
>
> On Tue, 20 Oct 2015, Z Zhang wrote:
> > Thanks, Sage, for pointing out the PR and ceph branch. I will take a
> > closer look.
> >
> > Yes, I am trying the KVStore backend. The reason we are trying it is that
> > a few of our users don't have strict requirements about occasional data
> > loss. It seems the KVStore backend without a synchronized WAL can achieve
> > better performance than filestore, and if we keep the WAL but skip the
> > sync, only data still in the page cache would be lost on a machine crash,
> > not on a process crash. What do you think?
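
(To make my comparison above concrete: a minimal sketch of the two write modes
at the rocksdb level, assuming the stock rocksdb C++ API and a throwaway DB
path, not the actual KVStore code.)

#include <cassert>
#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
  rocksdb::DB* db;
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/kvstore-test", &db);
  assert(s.ok());

  // WAL written but not fsynced: a process crash is safe (the WAL pages are
  // already with the OS), but a power loss can drop whatever was still only
  // in the page cache.
  rocksdb::WriteOptions unsynced;
  unsynced.sync = false;               // rocksdb's default
  db->Put(unsynced, "key-a", "value-a");

  // WAL fsynced on every write: durable across power loss, but each Put pays
  // a device flush, which is where the latency shows up on spinning disks.
  rocksdb::WriteOptions synced;
  synced.sync = true;
  db->Put(synced, "key-b", "value-b");

  delete db;
  return 0;
}
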
>
> That sounds dangerous. The OSDs are recording internal metadata about the
> cluster (peering, replication, etc.)... even if you don't care so much
> about recent user data writes, you probably don't want to risk breaking
> RADOS itself. If the kv backend is giving you a stale point-in-time
> consistent copy it's not so bad, but in a power-loss event it could give
> you problems...
>
> sage
>
> >
> > Thanks. Zhi Zhang (David)
> >
> > Date: Tue, 20 Oct 2015 05:47:44 -0700
> > From: sage@xxxxxxxxxxxx
> > To: zhangz.david@xxxxxxxxxxx
> > CC: ceph-users@xxxxxxxxxxxxxx; ceph-devel@xxxxxxxxxxxxxxx
> > Subject: Re: [ceph-users] Write performance issue under rocksdb kvstore
> >
> > On Tue, 20 Oct 2015, Z Zhang wrote:
> > > Hi Guys,
> > >
> > > I am trying the latest ceph-9.1.0 with rocksdb 4.1 and ceph-9.0.3 with
> > > rocksdb 3.11 as the OSD backend. I use rbd to test performance, and the
> > > following is my cluster info.
> > >
> > > [ceph@xxx ~]$ ceph -s
> > >     cluster b74f3944-d77f-4401-a531-fa5282995808
> > >      health HEALTH_OK
> > >      monmap e1: 1 mons at {xxx=xxx.xxx.xxx.xxx:6789/0}
> > >             election epoch 1, quorum 0 xxx
> > >      osdmap e338: 44 osds: 44 up, 44 in
> > >             flags sortbitwise
> > >       pgmap v1476: 2048 pgs, 1 pools, 158 MB data, 59 objects
> > >             1940 MB used, 81930 GB / 81932 GB avail
> > >                 2048 active+clean
> > >
> > > All the disks are spinning ones with the write cache turned on. Rocksdb's
> > > WAL and sst files are on the same disk as each OSD's data.
> >
> > Are you using the KeyValueStore backend?
> >
> > > Using fio to generate following write load:
> > > fio -direct=1 -rw=randwrite -ioengine=sync -size=10M -bs=4K -group_reporting -directory /mnt/rbd_test/ -name xxx.1 -numjobs=1
> > >
> > > Test result:
> > > WAL enabled + sync: false + disk write cache: on will get ~700 IOPS.
> > > WAL enabled + sync: true (default) + disk write cache: on|off will get only ~25 IOPS.
> > >
> > > I tuned some other rocksdb options, but with no luck.
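
(For context on "some other rocksdb options": the tuning I tried was along
these lines. The values below are only an illustration, not the exact settings
I used, and none of them move the needle because the cost is the per-write WAL
fsync rather than memtable or compaction behaviour.)

#include <rocksdb/options.h>

// Illustrative tuning knobs of the sort I experimented with.
rocksdb::Options tuned_options() {
  rocksdb::Options opts;
  opts.create_if_missing = true;
  opts.write_buffer_size = 64 * 1024 * 1024;   // larger memtable
  opts.max_write_buffer_number = 4;
  opts.min_write_buffer_number_to_merge = 2;
  opts.max_background_compactions = 4;
  opts.max_background_flushes = 2;
  opts.bytes_per_sync = 1 * 1024 * 1024;       // smooth background file syncs
  return opts;
}
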
> >
> > The wip-newstore-frags branch sets some defaults for rocksdb that I think
> > look pretty reasonable (at least given how newstore is using rocksdb).
> >
> > > I tracked down the rocksdb code and found that each writer's Sync
> > > operation takes ~30ms to finish. And as shown above, it is strange that
> > > performance shows little difference no matter whether the disk write
> > > cache is on or off.
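
(That ~30ms per sync also lines up with the ~25 IOPS above: 1 / 0.030s is
roughly 33 writes/s as a ceiling before any other OSD overhead. It can be
reproduced outside of Ceph by timing synchronous Puts directly against a
rocksdb instance on the same disk; a rough sketch, with the DB path and keys
being placeholders:)

#include <chrono>
#include <cstdio>
#include <string>
#include <rocksdb/db.h>

int main() {
  rocksdb::DB* db;
  rocksdb::Options opts;
  opts.create_if_missing = true;
  if (!rocksdb::DB::Open(opts, "/var/lib/kvtest", &db).ok()) return 1;

  rocksdb::WriteOptions wo;
  wo.sync = true;  // fsync the WAL on every write, as in the slow case above

  for (int i = 0; i < 100; ++i) {
    auto start = std::chrono::steady_clock::now();
    db->Put(wo, "bench-key-" + std::to_string(i), std::string(4096, 'x'));
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                  std::chrono::steady_clock::now() - start).count();
    std::printf("put %d: %lld us\n", i, static_cast<long long>(us));
  }
  delete db;
  return 0;
}
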
> > >
> > > Did you guys encounter a similar issue? Or am I missing something that
> > > causes rocksdb's poor write performance?
> >
> > Yes, I saw the same thing. This PR addresses the problem and is nearing
> > merge upstream:
> >
> > https://github.com/facebook/rocksdb/pull/746
> >
> > There is also an XFS performance bug that is contributing to the problem,
> > but it looks like Dave Chinner just put together a fix for that.
> >
> > But... we likely won't be using KeyValueStore in its current form over
> > rocksdb (or any other kv backend). It stripes object data over key/value
> > pairs, which IMO is not the best approach.
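
(For anyone else reading: my understanding of the striping Sage describes, as
a rough illustration only; the key layout below is hypothetical and not the
actual KeyValueStore schema.)

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical illustration: each fixed-size stripe of an object becomes its
// own key/value pair, so even a small overwrite turns into a read-modify-write
// of a whole stripe plus a WAL commit in the kv backend.
static const uint64_t kStripeSize = 64 * 1024;  // made-up stripe size

std::vector<std::string> keys_for_write(const std::string& object,
                                        uint64_t offset, uint64_t length) {
  std::vector<std::string> keys;
  uint64_t first = offset / kStripeSize;
  uint64_t last = (offset + length - 1) / kStripeSize;  // assumes length > 0
  for (uint64_t s = first; s <= last; ++s)
    keys.push_back(object + "." + std::to_string(s));   // one k/v per stripe
  return keys;
}
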
> >
> > sage
> >
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
