RE: K/V interface buffer transaction

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Thanks Sage !
Let me understand the K/V code base more in-depth and will come back to you on this.

Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sage@xxxxxxxxxxxx] 
Sent: Thursday, February 19, 2015 11:46 AM
To: Somnath Roy
Cc: Haomai Wang; sjust@xxxxxxxxxx; Gregory Farnum; Ceph Development
Subject: RE: K/V interface buffer transaction

On Thu, 19 Feb 2015, Somnath Roy wrote:
> Sage/Haomai,
> Some more questions.
> 
> 1. I am not able to figure out why the KeyValueDB interface is so 
> dependent on iter based approach ? If a db supports range queries, 
> can't we get rid of these iterator interfaces ?
> 
> 2. Also, the function like ::_generic_read() is calling 
> StripObjectMap::get_values_with_header -> GenericObjectMap::scan(). 
> Scan is just looping over the keys and still calling 
> iter->lower_bound() , why not calling direct get call ? In case, the 
> db supports range queries , we can handover the db these keys and it 
> will return array of key/value pair itself. Why to bother about that 
> from generic keyvaluestore interface ? If dbs are not supporting range 
> queries, we can implement similar logic in the shim layer like 
> leveldbstore/rocksdbstore, isn't it ?

The KeyValueDB is the interface that seemed necessary when Sam was implementing the original DBObjectMap a couple years ago.  It's based on what leveldb was providing and what was needed at the time.  We are more than happy to change it!

A few things:

1. Adding a call that returns multiple k/v pairs sounds fine as long as there is a limmit so we don't get an unbounded result size.

2. I'm concerned (in general) about the efficiency of this interface.  
Right now pretty much everything is fetched and returned in the form of an STL structure and I'm worried that there will be a bunch of data copying on the implementation to conform to that.  On the flip side, lots of callers are currently rejiggering their requests into those maps too.  I'd be very interested in hearing about how you think we can make this fit more efficiently to whatever backend you're currently working with.  
Leveldb and rocksdb will I think be the most common backends, but we want to perform well with others too.

3. One simple example of this is there are several places where we have an encoded bufferlist of map<string,bufferlist> that we are doing a set on (or are pulling out).  Currently we end up decoding into an STL map and feeding to the interface, but I suspect lots of callers could benefit from a set of calls that go direct to/from such a buffer and skip the map<>.

4. There a trivial patch in my newstore wip branch that adds a get(prefix, key, *value) so that you don't get to pass in a set<string> for a single fetch.  It's somewhere in the pile at 

	https://github.com/liewegas/ceph/commits/wip-newstore

sage

> 
> Let me know if I am missing anything here.
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Haomai Wang [mailto:haomaiwang@xxxxxxxxx]
> Sent: Wednesday, February 11, 2015 11:35 PM
> To: Somnath Roy
> Cc: sjust@xxxxxxxxxx; Sage Weil; Gregory Farnum; Ceph Development
> Subject: Re: K/V interface buffer transaction
> 
> On Thu, Feb 12, 2015 at 3:26 PM, Somnath Roy <Somnath.Roy@xxxxxxxxxxx> wrote:
> > Haomai,
> >
> > << KeyValueStore will only write one for duplicate entry in ordering
> >
> > I saw K/v store (keyvaluestore.cc) itself is not removing the duplicates , are you saying the shim layer like leveldbstore/rocksdbstore is removing the duplicates or the leveldb/rocksdb ?
> 
> Oh no, sorry. That's just I want to do in mind. I forget I haven't impl it.
> 
> Each ObjectStore::Transaction in KeyValueStore has corresponding BufferTransaction will store all kvs needed to store. We could let submit_transaction do it at last instead of calling backend each op.
> 
> Yeah, we could resolve it in KeyValueStore clearly.
> >
> > Thanks & Regards
> > Somnath
> >
> > -----Original Message-----
> > From: Haomai Wang [mailto:haomaiwang@xxxxxxxxx]
> > Sent: Wednesday, February 11, 2015 7:36 PM
> > To: Somnath Roy
> > Cc: sjust@xxxxxxxxxx; Sage Weil; Gregory Farnum; Ceph Development
> > Subject: Re: K/V interface buffer transaction
> >
> > On Thu, Feb 12, 2015 at 6:53 AM, Somnath Roy <Somnath.Roy@xxxxxxxxxxx> wrote:
> >> Yeah, thanks!
> >> Not sure if level-db is handling duplicate entries within a transaction properly or not, if not, in case of filestore (and also for K/V stores) we are having an extra (redundant) OMAP write in the Write-Path.
> >
> > KeyValueStore will only write one for duplicate entry in ordering.
> >
> > But FileStore will write redundant omap.
> >
> > And from dump log, the duplicate entry looks like from pglog
> >
> >>
> >> Regards
> >> Somnath
> >>
> >> -----Original Message-----
> >> From: Samuel Just [mailto:sam.just@xxxxxxxxxxx]
> >> Sent: Wednesday, February 11, 2015 2:36 PM
> >> To: Somnath Roy
> >> Cc: Sage Weil; Gregory Farnum; Haomai Wang (haomaiwang@xxxxxxxxx); 
> >> Ceph Development
> >> Subject: Re: K/V interface buffer transaction
> >>
> >> Well, the transaction is atomic, so if the key is set twice, you can certainly ignore the first one.
> >> -Sam
> >>
> >> On Wed, Feb 11, 2015 at 2:20 PM, Somnath Roy <Somnath.Roy@xxxxxxxxxxx> wrote:
> >>> Hi,
> >>> My code had a bug during printing log. I was using map to store 
> >>> the attribute keys in sorted order and that was discarding the 
> >>> duplicates
> >>> :-)
> >>>
> >>> This is what I found out coming during transaction.
> >>>
> >>> 2015-02-05 15:58:12.311738 7f27b5429700  0 queue_transactions ::
> >>> before _do_transactions
> >>> 2015-02-05 15:58:12.311754 7f27b5429700  0 
> >>> _do_transactions::before _do_transaction
> >>> 2015-02-05 15:58:12.311770 7f27b5429700  0 
> >>> Transaction::OP_WRITE::cid = 1.a3_head oid =
> >>> 680256a3/rbd_data.100974b0dc51.0000000000000631/head//1 offset =
> >>> 3997696 len = 65536
> >>> 2015-02-05 15:58:12.311800 7f27b5429700  0 
> >>> Transaction::OP_SETATTR::cid = 1.a3_head oid =
> >>> 680256a3/rbd_data.100974b0dc51.0000000000000631/head//1 attr_name 
> >>> = _ attr_value_len = 273
> >>> 2015-02-05 15:58:12.311822 7f27b5429700  0 
> >>> Transaction::OP_SETATTR::cid = 1.a3_head oid =
> >>> 680256a3/rbd_data.100974b0dc51.0000000000000631/head//1 attr_name 
> >>> = snapset attr_value_len = 31
> >>> 2015-02-05 15:58:12.311840 7f27b5429700  0 
> >>> Transaction::OP_OMAP_SETKEYS::cid = 1.a3_head oid = a3//head//1
> >>> 2015-02-05 15:58:12.311845 7f27b5429700  0 OMAP_KEY = 0000000102.00000000000000001592 Value = buffer::list(len=178,
> >>>         buffer::ptr(0~4 0x3efc21000 in raw 0x3efc21000 len 4096 nref 6),
> >>>         buffer::ptr(0~170 0x3d74840 in raw 0x3d74840 len 688 nref 3),
> >>>         buffer::ptr(4~4 0x3efc21004 in raw 0x3efc21000 len 4096 
> >>> nref
> >>> 6)
> >>> )
> >>> 2015-02-05 15:58:12.311931 7f27b5429700  0 
> >>> Transaction::OP_OMAP_SETKEYS::cid = 1.a3_head oid = a3//head//1
> >>> 2015-02-05 15:58:12.311938 7f27b5429700  0 OMAP_KEY = _epoch Value = buffer::list(len=4,
> >>>         buffer::ptr(0~4 0x3efc1f000 in raw 0x3efc1f000 len 4096 
> >>> nref
> >>> 3)
> >>> )
> >>> 2015-02-05 15:58:12.311943 7f27b5429700  0 OMAP_KEY = _info Value = buffer::list(len=713,
> >>>         buffer::ptr(0~713 0x3efc1e000 in raw 0x3efc1e000 len 4096 
> >>> nref
> >>> 3)
> >>> )
> >>> 2015-02-05 15:58:12.311965 7f27b5429700  0 
> >>> Transaction::OP_OMAP_SETKEYS::cid = 1.a3_head oid = a3//head//1
> >>> 2015-02-05 15:58:12.311969 7f27b5429700  0 OMAP_KEY = 0000000102.00000000000000001592 Value = buffer::list(len=178,
> >>>         buffer::ptr(0~4 0x3d75e40 in raw 0x3d75e40 len 688 nref 6),
> >>>         buffer::ptr(0~170 0x3d75b80 in raw 0x3d75b80 len 688 nref 3),
> >>>         buffer::ptr(4~4 0x3d75e44 in raw 0x3d75e40 len 688 nref 6)
> >>> )
> >>> 2015-02-05 15:58:12.311980 7f27b5429700  0 OMAP_KEY = can_rollback_to Value = buffer::list(len=12,
> >>>         buffer::ptr(0~12 0x3efc25000 in raw 0x3efc25000 len 4096 
> >>> nref
> >>> 3)
> >>> )
> >>> 2015-02-05 15:58:12.311985 7f27b5429700  0 OMAP_KEY = rollback_info_trimmed_to Value = buffer::list(len=12,
> >>>         buffer::ptr(0~12 0x3efc24000 in raw 0x3efc24000 len 4096 
> >>> nref
> >>> 3)
> >>> )
> >>>
> >>>
> >>>
> >>> So, the OMAP_KEY = 0000000102.00000000000000001592 is coming twice !
> >>>
> >>> Is there any reason, why ? What is this attribute by the way ?
> >>> Can we safely discard the first OP_OMAP_SETKEYS call for the same key ?
> >>>
> >>> Thanks & Regards
> >>> Somnath
> >>>
> >>> -----Original Message-----
> >>> From: Somnath Roy
> >>> Sent: Tuesday, February 10, 2015 4:36 PM
> >>> To: 'Sage Weil'; Gregory Farnum
> >>> Cc: sjust@xxxxxxxxxx; Haomai Wang (haomaiwang@xxxxxxxxx); Ceph 
> >>> Development
> >>> Subject: RE: K/V interface buffer transaction
> >>>
> >>> Thanks Greg/Sam/Sage !
> >>> For now, we will be doing our testing by sorting the keys and will keep an eye on the duplicates.
> >>> Another point, why do we need the K/V store thread pool for processing transactions anymore ?
> >>> I got rid of that and calling _do_transaction() directly from the ::queue_trasaction , this is giving me ~3X performance improvement.
> >>>
> >>> Regards
> >>> Somnath
> >>>
> >>> -----Original Message-----
> >>> From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> >>> Sent: Tuesday, February 10, 2015 10:44 AM
> >>> To: Gregory Farnum
> >>> Cc: Somnath Roy; sjust@xxxxxxxxxx; Haomai Wang 
> >>> (haomaiwang@xxxxxxxxx); Ceph Development
> >>> Subject: Re: K/V interface buffer transaction
> >>>
> >>> On Tue, 10 Feb 2015, Gregory Farnum wrote:
> >>>> On Tue, Feb 10, 2015 at 10:26 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> >>>> > On Tue, 10 Feb 2015, Somnath Roy wrote:
> >>>> >> Thanks Sam !
> >>>> >> So, is it safe to do ordering if in a transaction *no* remove/truncate/create/add call ?
> >>>> >> For example, do we need to preserve ordering in case of the below transaction ?
> >>>> >> It will be helpful if you can give some insight in what scenario preserving order is *must*.
> >>>> >
> >>>> > If I'm not mistaken teh only time ordering would matter at all 
> >>>> > in an transaction is when the same key is updated twice, right?  
> >>>> > The whole thing is committed atomically.  If there *are* dups, 
> >>>> > then the order there obviously should be preserved.
> >>>> >
> >>>> > Maybe a first pass would be add an assert or something that 
> >>>> > there are no dup keys and see if anything every falls out of that...
> >>>> > hopefully there are none!
> >>>>
> >>>> I'm pretty sure some of the transaction analysis discussions 
> >>>> people have had say that we do double-updates at times. IIRC it 
> >>>> might have been the pglog head getting set twice in most transactions?
> >>>
> >>> Oh yeah, could be.  There was the snapset xattr update, but that was resetting it to an existing value (not the same value inside the same txn).  I forget if there were others.
> >>>
> >>> sage
> >>>
> >>> ________________________________
> >>>
> >>> PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
> >>>
> >
> >
> >
> > --
> > Best Regards,
> >
> > Wheat
> 
> 
> 
> --
> Best Regards,
> 
> Wheat
> N?????r??y??????X???v???)?{.n?????z?]z????ay?????j ??f???h??????w???
???j:+v???w???????? ????zZ+???????j"????i
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [CEPH Users]     [Ceph Large]     [Information on CEPH]     [Linux BTRFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]
  Powered by Linux