On Wed, Jun 3, 2015 at 9:13 AM, Jason Dillaman <dillaman@xxxxxxxxxx> wrote:
>> > In contrast to the current journal code used by CephFS, the new journal
>> > code will use sequence numbers to identify journal entries, instead of
>> > offsets within the journal.
>>
>> Am I misremembering what actually got done with our journal v2 format?
>> I think this is done -- or at least we made a move in this direction.
>
> Assuming journal v2 is the code in osdc/Journaler.cc, there is a new "resilient" format that helps in detecting corruption, but it appears to be still largely based upon offsets and using the Filer/Striper for I/O. This does remind me that I probably want to include a magic preamble value at the start of each journal entry to facilitate recovery.

Ah yeah, I was confusing the changes we did there and in our MDLog wrapper bits. Ignore me on this bit.

>
>> > A new journal object class method will be used to submit journal entry
>> > append requests. This will act as a gatekeeper for the concurrent client
>> > case.
>>
>> The object class is going to be a big barrier to using EC pools;
>> unless you want to block the use of EC pools on EC pools supporting
>> object classes. :(
>
> Josh mentioned (via Sam) that reads were not currently supported by object classes on EC pools. Are appends not supported either?

We discussed this briefly and certain object class functions might work "by mistake" on EC pools, but you should assume nothing does (is my recollection of the conclusions). For instance, even if it's technically possible, the append thing is really hard for this sort of write; I think I mentioned in Josh's thread about needing to have an entire stripe at a time (and the smallest you could even think about doing reasonably is 4KB * N, and really that's not big enough given metadata overheads).

>
>> > A successful append will indicate whether or not the journal is now full
>> > (larger than the max object size), indicating to the client that a new
>> > journal object should be used. If the journal is too large, an error code
>> > response would alert the client that it needs to write to the current
>> > active journal object. In practice, the only time the journaler should
>> > expect to see such a response would be in the case where multiple clients
>> > are using the same journal and the active object update notification has
>> > yet to be received.
>>
>> I'm confused. How does this work with the splay count thing you
>> mentioned above? Can you define <splay count>?
>
> Similar to the stripe width.

Okay, that sort of makes sense but I don't see how you could legally be writing to different "sets" so why not just make it an explicit striping thing and move all journal entries for that "set" at once?

...Actually, doesn't *not* forcing a coordinated move from one object set to another mean that you don't actually have an ordering guarantee across tags if you replay the journal objects in order?

>
>> What happens if users submit sequenced entries substantially out of
>> order? It sounds like if you have multiple writers (or even just a
>> misbehaving client) it would not be hard for one of them to grab
>> sequence value N, for another to fill up one of the journal entry
>> objects with sequences in the range [N+1]...[N+x] and then for the
>> user of N to get an error response.
>
> I was thinking that when a client submits their journal entry payload, the journaler will allocate the next available sequence number, compute which active journal object that sequence should be submitted to, and start an AIO append op to write the journal entry. The next journal entry to be appended to the same journal object would be <splay count/width> entries later.
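
For illustration only, here is a minimal sketch of the allocation step described in the quoted paragraph above. It assumes the active object is picked as the sequence number modulo the splay count, which the thread implies but does not spell out. The class name `SequenceAllocator`, the modulo rule, and the notion of a numbered "slot" are hypothetical, and the sketch ignores tags and the actual AIO append.

```python
import itertools


class SequenceAllocator:
    """Hypothetical model: hand out sequence numbers and pick an object slot."""

    def __init__(self, splay_count):
        if splay_count < 1:
            raise ValueError("splay_count must be at least 1")
        self.splay_count = splay_count
        self._next_seq = itertools.count()

    def allocate(self):
        """Return (sequence, slot); slot indexes into the active object set."""
        seq = next(self._next_seq)
        # Assumption: entries round-robin across the active journal objects,
        # so the same object is hit again splay_count entries later.
        return seq, seq % self.splay_count


allocator = SequenceAllocator(splay_count=4)
assignments = [allocator.allocate() for _ in range(8)]
```

With a splay count of 4, sequence 0 and sequence 4 land on the same slot, which matches the "<splay count/width> entries later" wording.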
> This does bring up a good point that if you are generating journal entries fast enough, the delayed response saying the object is full could cause multiple later journal entry ops to need to be resent to the new (non-full) object. Given that, it might be best to scrap the hard error when the journal object gets full and just let the journaler eventually switch to a new object when it receives a response saying the object is now full.

I was misunderstanding where the seqs came from and that they were associated with the tag, not the journal. So this shouldn't be such a problem.

>
>> >
>> > Since the journal is designed to be append-only, there needs to be support
>> > for cases where journal entry needs to be updated out-of-band (e.g. fixing
>> > a corrupt entry similar to CephFS's current journal recovery tools). The
>> > proposed solution is to just append a new journal entry with the same
>> > sequence number as the record to be replaced to the end of the journal
>> > (i.e. last entry for a given sequence number wins). This also protects
>> > against accidental replays of the original append operation. An
>> > alternative suggestion would be to use a compare-and-swap mechanism to
>> > update the full journal object with the updated contents.
>>
>> I'm confused by this bit. It seems to imply that fetching a single
>> entry requires checking the entire object to make sure there's no
>> replacement. Certainly if we were doing replay we couldn't just apply
>> each entry sequentially any more because an overwritten entry might
>> have its value replaced by a later (by sequence number) entry that
>> occurs earlier (by offset) in the journal.
>
> The goal would be to use prefetching on the replay. Since the whole object is already in-memory, scanning for duplicates would be fairly trivial. If there is a way to prevent the OSDs from potentially replaying a duplicate append journal entry message, the CAS update technique could be used.
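
To make the "last entry for a given sequence number wins" rule concrete, here is a rough sketch of the duplicate scan over one prefetched object. It assumes the object has already been decoded into (sequence, payload) pairs in append order; the function name `resolve_entries` and that in-memory representation are made up for the example and are not part of the proposal.

```python
def resolve_entries(entries):
    """Collapse duplicates so the last-appended payload per sequence wins.

    entries: iterable of (sequence, payload) pairs in append (offset) order.
    Returns a list of (sequence, payload) pairs sorted by sequence number.
    """
    latest = {}
    for seq, payload in entries:
        # A later append with the same sequence number replaces the earlier one.
        latest[seq] = payload
    return sorted(latest.items())


# Hypothetical object contents: sequence 1 was rewritten out-of-band.
raw = [(0, "a"), (1, "corrupt"), (2, "c"), (1, "b-fixed")]
replay = resolve_entries(raw)
```

This only covers a single object; it says nothing about entries spread over several active objects.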

Actually, don't you need to keep <splay count> objects prefetched in memory, because the ops round-robin across them?

>
>> I'd also like it if we could organize a single Journal implementation
>> within the Ceph project, or at least have a blessed one going forward
>> that we use for new stuff and might plausibly migrate existing users
>> to. The big things I see different from osdc/Journaler are:
>
> Agreed. While librbd will be the first user of this, I wasn't planning to locate it within the librbd library.
>
>> 1) (design) class-based
>> 2) (design) uses librados instead of Objecter (hurray)
>> 3) (need) should allow multiple writers
>> 4) (fallout of other choices?) does not stripe entries across multiple
>> objects
>
> For striping, I assume this is a function of how large MDS journal entries are expected to be. The largest RBD journal entries would be block write operations, so in the low kilobytes. It would be possible to add a higher layer to this design that could break up large client journal entries into multiple, smaller entries.

Really we just picked up the striping for free by making use of the Filer to handle our data layout. ;) We don't ever enable it ourselves and I don't think it matters.
-Greg