I see. Besides keeping a record of performed operations, is there any
other reason to remember the order of the operations? For recovery?

On Fri, Jul 22, 2016 at 1:35 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
> Well, multiple writers to the same PG do *work* -- they get completed
> in the order in which they arrive at the primary (and can be pipelined
> so the IO overlaps in the backend). The problem isn't the PG lock --
> that's merely an implementation detail. The problem is that the
> protocols used to ensure consistency depend on a PG-wide ordered log
> of writes which all replicas agree on (up to a possibly divergent,
> logically un-committed head). The problem with your proposed
> modification is that you can no longer control the ordering. The
> problem isn't performance, it's correctness. Even if you ensure a
> single writer at a time, you still have a problem ensuring that a
> write makes it to all of the replicas in the event of client death.
> This is solvable, but how you do it will depend on what consistency
> properties you are trying to create and how you plan to deal with
> failure scenarios.
> -Sam
>
> On Fri, Jul 22, 2016 at 10:07 AM, Sugang Li <sugangli@xxxxxxxxxxxxxxxxxx> wrote:
>> I have read that paper. I see. Even with the current design, this PG
>> lock is there, so multiple client writes to the same PG in parallel
>> will not work, right?
>> If I allow only one client at a time to write to the OSDs in parallel,
>> will that be a problem?
>>
>> On Fri, Jul 22, 2016 at 11:36 AM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>>> There is a per-pg log of recent operations (see PGLog.h/cc). It has
>>> an order. If you allow multiple clients to submit operations to
>>> replicas in parallel, different replicas may have different log
>>> orderings (worse, in the general case, you have no guarantee that
>>> every log entry -- and the write which it represents -- actually makes
>>> it to every replica). That would pretty much completely break the
>>> peering process.
>>> You might want to read the rados paper
>>> (http://ceph.com/papers/weil-rados-pdsw07.pdf).
>>> -Sam
>>>
>>> On Fri, Jul 22, 2016 at 8:30 AM, Sugang Li <sugangli@xxxxxxxxxxxxxxxxxx> wrote:
>>>> I am confused. Could you describe a little bit more about that?
>>>>
>>>> Sugang
>>>>
>>>> On Fri, Jul 22, 2016 at 11:27 AM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>>>>> Not if you want the PG log to have consistent ordering.
>>>>> -Sam
>>>>>
>>>>> On Fri, Jul 22, 2016 at 7:00 AM, Sugang Li <sugangli@xxxxxxxxxxxxxxxxxx> wrote:
>>>>>> Actually, the write lock is on the object only. Is that going to work?
>>>>>>
>>>>>> Sugang
>>>>>>
>>>>>> On Thu, Jul 21, 2016 at 5:59 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>>>>>>> Write lock on the whole pg? How do parallel clients work?
>>>>>>> -Sam
>>>>>>>
>>>>>>> On Thu, Jul 21, 2016 at 12:36 PM, Sugang Li <sugangli@xxxxxxxxxxxxxxxxxx> wrote:
>>>>>>>> The error above occurs when I am sending an MOSDOp to the replicas,
>>>>>>>> and I have to fix that first.
>>>>>>>>
>>>>>>>> For consistency, we are still using the primary OSD as a control
>>>>>>>> center. That is, the client always goes to the primary OSD to ask for
>>>>>>>> a write lock, then writes to the replicas.
>>>>>>>>
>>>>>>>> Sugang
>>>>>>>>
>>>>>>>> On Thu, Jul 21, 2016 at 3:28 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>>>>>>>>> Well, they are actually different types with different encodings and
>>>>>>>>> different contents. The client doesn't really have the information
>>>>>>>>> needed to build a MSG_OSD_REPOP. Your best bet will be to send an
>>>>>>>>> MOSDOp to the replicas and hack up a write path that makes that work.
>>>>>>>>>
>>>>>>>>> How do you plan to address the consistency problems?
>>>>>>>>> -Sam
>>>>>>>>>
>>>>>>>>> On Thu, Jul 21, 2016 at 11:11 AM, Sugang Li <sugangli@xxxxxxxxxxxxxxxxxx> wrote:
>>>>>>>>>> So, to start with, I think one naive way is to make the replica think
>>>>>>>>>> it receives an op from the primary OSD, which actually comes from the
>>>>>>>>>> client.
>>>>>>>>>> And the branching point appears to start at
>>>>>>>>>> OSD::dispatch_op_fast, where handle_op or handle_replica_op is called
>>>>>>>>>> based on the type of the request. So my question is, at the client
>>>>>>>>>> side, is there a way that I could set the corresponding variables
>>>>>>>>>> referred to by "op->get_req()->get_type()" to MSG_OSD_SUBOP or
>>>>>>>>>> MSG_OSD_REPOP?
>>>>>>>>>>
>>>>>>>>>> Sugang
>>>>>>>>>>
>>>>>>>>>> On Thu, Jul 21, 2016 at 12:03 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>>>>>>>>>>> Parallel read will be a *lot* easier since read-from-replica already
>>>>>>>>>>> works. Write to replica, however, is tough. The write path uses a
>>>>>>>>>>> lot of structures which are only populated on the primary. You're
>>>>>>>>>>> going to have to hack up most of the write path to bypass the existing
>>>>>>>>>>> replication machinery. Beyond that, maintaining consistency will
>>>>>>>>>>> obviously be a challenge.
>>>>>>>>>>> -Sam
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jul 21, 2016 at 8:49 AM, Sugang Li <sugangli@xxxxxxxxxxxxxxxxxx> wrote:
>>>>>>>>>>>> My goal is to achieve parallel write/read from the client instead of
>>>>>>>>>>>> the primary OSD.
>>>>>>>>>>>>
>>>>>>>>>>>> Sugang
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jul 21, 2016 at 11:47 AM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>>>>>>>>>>>>> I may be misunderstanding your goal. What are you trying to achieve?
>>>>>>>>>>>>> -Sam
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Jul 21, 2016 at 8:43 AM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>>>>>>>>>>>>>> Well, that assert is asserting that the object is in the pool that the
>>>>>>>>>>>>>> pg operating on it belongs to. Something very wrong must have
>>>>>>>>>>>>>> happened for it to be not true. Also, replicas have basically none of
>>>>>>>>>>>>>> the code required to handle a write, so I'm kind of surprised it got
>>>>>>>>>>>>>> that far. I suggest that you read the debug logging and read the OSD
>>>>>>>>>>>>>> op handling path.
>>>>>>>>>>>>>> -Sam
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Jul 21, 2016 at 8:34 AM, Sugang Li <sugangli@xxxxxxxxxxxxxxxxxx> wrote:
>>>>>>>>>>>>>>> Yes, I understand that. I was introduced to Ceph only 1 month ago, but
>>>>>>>>>>>>>>> I have a basic idea of Ceph's communication pattern now. I have not
>>>>>>>>>>>>>>> made any changes to the OSD yet. So I was wondering what the purpose of
>>>>>>>>>>>>>>> this "assert(oid.pool == static_cast<int64_t>(info.pgid.pool()))" is, and,
>>>>>>>>>>>>>>> to change the code in the OSD, what are the main aspects I should pay
>>>>>>>>>>>>>>> attention to?
>>>>>>>>>>>>>>> Since this is only a research project, the implementation does not
>>>>>>>>>>>>>>> have to be very sophisticated.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I know my question is rather broad; any hints or suggestions will
>>>>>>>>>>>>>>> be highly appreciated.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Sugang
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Jul 21, 2016 at 11:22 AM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>>>>>>>>>>>>>>>> Oh, that's a much more complicated change. You are going to need to
>>>>>>>>>>>>>>>> make extensive changes to the OSD to make that work.
>>>>>>>>>>>>>>>> -Sam
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Jul 21, 2016 at 8:21 AM, Sugang Li <sugangli@xxxxxxxxxxxxxxxxxx> wrote:
>>>>>>>>>>>>>>>>> Hi Sam,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks for the quick reply. The main modification I made is to call
>>>>>>>>>>>>>>>>> calc_target within librados::IoCtxImpl::aio_operate before op_submit,
>>>>>>>>>>>>>>>>> so that I can get the IDs of all the replica OSDs and send a write op
>>>>>>>>>>>>>>>>> to each of them. I can also attach the modified code if necessary.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I just reproduced this error with the conf you provided, please see below:
>>>>>>>>>>>>>>>>> osd/ReplicatedPG.cc: In function 'int
>>>>>>>>>>>>>>>>> ReplicatedPG::find_object_context(const hobject_t&, ObjectContextRef*,
>>>>>>>>>>>>>>>>> bool, bool, hobject_t*)' thread 7fd6aba59700 time 2016-07-21
>>>>>>>>>>>>>>>>> 15:09:26.431436
>>>>>>>>>>>>>>>>> osd/ReplicatedPG.cc: 9042: FAILED assert(oid.pool ==
>>>>>>>>>>>>>>>>> static_cast<int64_t>(info.pgid.pool()))
>>>>>>>>>>>>>>>>> ceph version 10.2.0-2562-g0793a28 (0793a2844baa38f6bcc5c1724a1ceb9f8f1bbd9c)
>>>>>>>>>>>>>>>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>>>>>>>>>>>>>>>> const*)+0x8b) [0x7fd6c5733e8b]
>>>>>>>>>>>>>>>>> 2: (ReplicatedPG::find_object_context(hobject_t const&,
>>>>>>>>>>>>>>>>> std::shared_ptr<ObjectContext>*, bool, bool, hobject_t*)+0x1e54)
>>>>>>>>>>>>>>>>> [0x7fd6c51ef7c4]
>>>>>>>>>>>>>>>>> 3: (ReplicatedPG::do_op(std::shared_ptr<OpRequest>&)+0x186e) [0x7fd6c521fe9e]
>>>>>>>>>>>>>>>>> 4: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
>>>>>>>>>>>>>>>>> ThreadPool::TPHandle&)+0x73c) [0x7fd6c51dca3c]
>>>>>>>>>>>>>>>>> 5: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
>>>>>>>>>>>>>>>>> std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f5)
>>>>>>>>>>>>>>>>> [0x7fd6c5094d65]
>>>>>>>>>>>>>>>>> 6: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>
>>>>>>>>>>>>>>>>> const&)+0x5d) [0x7fd6c5094f8d]
>>>>>>>>>>>>>>>>> 7: (OSD::ShardedOpWQ::_process(unsigned int,
>>>>>>>>>>>>>>>>> ceph::heartbeat_handle_d*)+0x86c) [0x7fd6c50b603c]
>>>>>>>>>>>>>>>>> 8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x947)
>>>>>>>>>>>>>>>>> [0x7fd6c5724117]
>>>>>>>>>>>>>>>>> 9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7fd6c5726270]
>>>>>>>>>>>>>>>>> 10: (()+0x8184) [0x7fd6c3b98184]
>>>>>>>>>>>>>>>>> 11: (clone()+0x6d) [0x7fd6c1aa937d]
>>>>>>>>>>>>>>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>>>>>>>>>>>>>>>>> needed to interpret this.
>>>>>>>>>>>>>>>>> 2016-07-21 15:09:26.454854 7fd6aba59700 -1 osd/ReplicatedPG.cc: In
>>>>>>>>>>>>>>>>> function 'int ReplicatedPG::find_object_context(const hobject_t&,
>>>>>>>>>>>>>>>>> ObjectContextRef*, bool, bool, hobject_t*)' thread 7fd6aba59700 time
>>>>>>>>>>>>>>>>> 2016-07-21 15:09:26.431436
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This error occurs three times since I wrote to three OSDs.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Sugang
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Jul 21, 2016 at 10:54 AM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>>>>>>>>>>>>>>>>>> Hmm. Can you provide more information about the poison op? If you
>>>>>>>>>>>>>>>>>> can reproduce with
>>>>>>>>>>>>>>>>>> debug osd = 20
>>>>>>>>>>>>>>>>>> debug filestore = 20
>>>>>>>>>>>>>>>>>> debug ms = 1
>>>>>>>>>>>>>>>>>> it should be easier to work out what is going on.
>>>>>>>>>>>>>>>>>> -Sam
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, Jul 21, 2016 at 7:13 AM, Sugang Li <sugangli@xxxxxxxxxxxxxxxxxx> wrote:
>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I am working on a research project which requires multiple write
>>>>>>>>>>>>>>>>>>> operations for the same object at the same time from the client.
>>>>>>>>>>>>>>>>>>> At the OSD side, I got this error:
>>>>>>>>>>>>>>>>>>> osd/ReplicatedPG.cc: In function 'int
>>>>>>>>>>>>>>>>>>> ReplicatedPG::find_object_context(const hobject_t&, ObjectContextRef*,
>>>>>>>>>>>>>>>>>>> bool, bool, hobject_t*)' thread 7f0586193700 time 2016-07-21
>>>>>>>>>>>>>>>>>>> 14:02:04.218448
>>>>>>>>>>>>>>>>>>> osd/ReplicatedPG.cc: 9041: FAILED assert(oid.pool ==
>>>>>>>>>>>>>>>>>>> static_cast<int64_t>(info.pgid.pool()))
>>>>>>>>>>>>>>>>>>> ceph version 10.2.0-2562-g0793a28 (0793a2844baa38f6bcc5c1724a1ceb9f8f1bbd9c)
>>>>>>>>>>>>>>>>>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>>>>>>>>>>>>>>>>>> const*)+0x8b) [0x7f059fe6dd7b]
>>>>>>>>>>>>>>>>>>> 2: (ReplicatedPG::find_object_context(hobject_t const&,
>>>>>>>>>>>>>>>>>>> std::shared_ptr<ObjectContext>*, bool, bool, hobject_t*)+0x1dbb)
>>>>>>>>>>>>>>>>>>> [0x7f059f9296fb]
>>>>>>>>>>>>>>>>>>> 3: (ReplicatedPG::do_op(std::shared_ptr<OpRequest>&)+0x186e) [0x7f059f959d7e]
>>>>>>>>>>>>>>>>>>> 4: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
>>>>>>>>>>>>>>>>>>> ThreadPool::TPHandle&)+0x73c) [0x7f059f916a0c]
>>>>>>>>>>>>>>>>>>> 5: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
>>>>>>>>>>>>>>>>>>> std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f5)
>>>>>>>>>>>>>>>>>>> [0x7f059f7ced65]
>>>>>>>>>>>>>>>>>>> 6: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>
>>>>>>>>>>>>>>>>>>> const&)+0x5d) [0x7f059f7cef8d]
>>>>>>>>>>>>>>>>>>> 7: (OSD::ShardedOpWQ::_process(unsigned int,
>>>>>>>>>>>>>>>>>>> ceph::heartbeat_handle_d*)+0x86c) [0x7f059f7f003c]
>>>>>>>>>>>>>>>>>>> 8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x947)
>>>>>>>>>>>>>>>>>>> [0x7f059fe5e007]
>>>>>>>>>>>>>>>>>>> 9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f059fe60160]
>>>>>>>>>>>>>>>>>>> 10: (()+0x8184) [0x7f059e2d2184]
>>>>>>>>>>>>>>>>>>> 11: (clone()+0x6d) [0x7f059c1e337d]
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> And at the client side, I got a segmentation fault.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I am wondering what could cause this assert to fail?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Sugang
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>>>>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>>>>>>>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
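
A note on the ordering argument in this thread: Sam's point is that replicas must agree on one PG-wide ordered log, and that clients writing to replicas directly can leave each replica with a different order, or with missing entries if a client dies mid-write. The toy model below is only a sketch of that argument. It is not Ceph code and does not use PGLog or any Ceph API; the class Replica, the helper functions, the OSD names and the op strings are all hypothetical, and the per-replica arrival orders are hard-coded assumptions standing in for network reordering and client death.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Toy stand-in for one OSD's copy of a PG log (hypothetical, not PGLog)."""
    name: str
    log: list = field(default_factory=list)

    def append(self, op: str) -> None:
        # A replica records ops strictly in the order they reach it.
        self.log.append(op)


def primary_ordered(ops_at_primary, replicas):
    """The primary fixes a single order and forwards every op in that order."""
    for op in ops_at_primary:
        for replica in replicas:
            replica.append(op)


def client_direct(arrival_orders, replicas):
    """Clients write to each replica themselves, so each replica sees its own
    arrival order (given here as an assumed per-replica list)."""
    for replica in replicas:
        for op in arrival_orders[replica.name]:
            replica.append(op)


def logs_agree(replicas):
    """True when every replica holds the same ops in the same order."""
    return all(replica.log == replicas[0].log for replica in replicas)


def run_primary_ordered():
    replicas = [Replica("osd.0"), Replica("osd.1"), Replica("osd.2")]
    primary_ordered(["A: write x=1", "B: write x=2"], replicas)
    return replicas


def run_client_direct():
    replicas = [Replica("osd.0"), Replica("osd.1"), Replica("osd.2")]
    arrival_orders = {
        "osd.0": ["A: write x=1", "B: write x=2"],
        "osd.1": ["B: write x=2", "A: write x=1"],  # assumed reordering
        "osd.2": ["A: write x=1"],  # assumed: client B died before this send
    }
    client_direct(arrival_orders, replicas)
    return replicas


if __name__ == "__main__":
    print("primary-ordered logs agree:", logs_agree(run_primary_ordered()))
    print("client-direct logs agree:", logs_agree(run_client_direct()))
```

In the client-direct run, osd.0 ends with x=2, osd.1 ends with x=1, and osd.2 never sees B's write at all, so there is no single log for the replicas to reconcile against. A per-object write lock held at the primary would remove the reordering case but not the missing-entry case, which matches Sam's remark about client death.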