We mostly assumed that sort-of transactional file systems, perhaps hosted
in user space, were the most tractable trajectory.  I have seen newstore
and keyvalue store as essentially congruent approaches using database
primitives (and I am interested in what you make of Russell Sears's work).
I'm skeptical of any hope of keeping things "simple."  As Martin noted
downthread, most systems I have seen (filers, ZFS) make use of a fast,
durable commit log and then flex out... something else.

--
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-707-0660
fax.  734-769-8938
cel.  734-216-5309

----- Original Message -----
> From: "Sage Weil" <sweil@xxxxxxxxxx>
> To: "John Spray" <jspray@xxxxxxxxxx>
> Cc: "Ceph Development" <ceph-devel@xxxxxxxxxxxxxxx>
> Sent: Tuesday, October 20, 2015 4:00:23 PM
> Subject: Re: newstore direction
>
> On Tue, 20 Oct 2015, John Spray wrote:
> > On Mon, Oct 19, 2015 at 8:49 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> > > - We have to size the kv backend storage (probably still an XFS
> > > partition) vs the block storage.  Maybe we do this anyway (put
> > > metadata on SSD!) so it won't matter.  But what happens when we are
> > > storing gobs of rgw index data or cephfs metadata?  Suddenly we are
> > > pulling storage out of a different pool and those aren't currently
> > > fungible.
> >
> > This is the concerning bit for me -- the other parts one "just" has to
> > get the code right, but this problem could linger and be something we
> > have to keep explaining to users indefinitely.  It reminds me of cases
> > in other systems where users had to make an educated guess about inode
> > size up front, depending on whether you're expecting to efficiently
> > store a lot of xattrs.
> >
> > In practice it's rare for users to make these kinds of decisions well
> > up-front: it really needs to be adjustable later, ideally
> > automatically.  That could be pretty straightforward if the KV part
> > was stored directly on block storage, instead of having XFS in the
> > mix.  I'm not quite up with the state of the art in this area: are
> > there any reasonable alternatives for the KV part that would consume
> > some defined range of a block device from userspace, instead of
> > sitting on top of a filesystem?
>
> I agree: this is my primary concern with the raw block approach.
>
> There are some KV alternatives that could consume block, but the problem
> would be similar: we need to dynamically size up or down the kv portion
> of the device.
>
> I see two basic options:
>
> 1) Wire into the Env abstraction in rocksdb to provide something just
> smart enough to let rocksdb work.  It isn't much: named files (not that
> many--we could easily keep the file table in ram), always written
> sequentially, to be read later with random access.  All of the code is
> written around abstractions of SequentialFileWriter so that everything
> posix is neatly hidden in env_posix (and there are various other env
> implementations for in-memory mock tests etc.).
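
For concreteness, the guts of 1) really are small: a file table kept in
RAM plus sequential writes and random reads against the device.  Below is
a rough sketch of just that core in plain POSIX terms, leaving out the
actual rocksdb::Env/EnvWrapper virtuals it would plug into; the class
name, the bump allocator, and the one-extent-per-file assumption are made
up for illustration and untested.

// Sketch only: in-RAM file table mapping rocksdb's named files onto
// extents of a raw block device.  Everything here (BlockDevFiles, the
// bump allocator, one extent per file) is hypothetical.
#include <fcntl.h>
#include <unistd.h>
#include <algorithm>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>

struct Extent {
  uint64_t offset;   // where the file starts on the device
  uint64_t length;   // bytes written so far
  uint64_t reserved; // bytes reserved for it
};

class BlockDevFiles {
 public:
  explicit BlockDevFiles(const std::string& dev) {
    // a real implementation would want O_DIRECT plus aligned buffers
    fd_ = ::open(dev.c_str(), O_RDWR);
    if (fd_ < 0)
      throw std::runtime_error("open " + dev + " failed");
  }
  ~BlockDevFiles() { ::close(fd_); }

  // "Create" a named file: reserve the next free region of the device.
  void create(const std::string& name, uint64_t reserve_bytes) {
    files_[name] = Extent{next_free_, 0, reserve_bytes};
    next_free_ += reserve_bytes;
  }

  // rocksdb only ever appends; this is what WritableFile::Append maps to.
  void append(const std::string& name, const char* data, uint64_t len) {
    Extent& e = files_.at(name);
    if (e.length + len > e.reserved)
      throw std::runtime_error("out of reserved space");
    if (::pwrite(fd_, data, len, e.offset + e.length) != (ssize_t)len)
      throw std::runtime_error("pwrite failed");
    e.length += len;
  }

  // ...and this is roughly what RandomAccessFile::Read maps to.
  int64_t read(const std::string& name, uint64_t off,
               char* buf, uint64_t len) const {
    const Extent& e = files_.at(name);
    if (off >= e.length)
      return 0;
    uint64_t n = std::min(len, e.length - off);
    return ::pread(fd_, buf, n, e.offset + off);
  }

  void sync() { ::fsync(fd_); }  // data only; the file table itself
                                 // would also need persisting somewhere

 private:
  int fd_ = -1;
  uint64_t next_free_ = 0;              // trivial bump allocator
  std::map<std::string, Extent> files_; // the whole "file system"
};

An actual shim would subclass rocksdb::EnvWrapper, hand handles like these
back from NewSequentialFile()/NewRandomAccessFile()/NewWritableFile(), and
persist the file table itself (say, in a small superblock on the device)
so it survives a restart.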

> 2) Use something like dm-thin to sit between the raw block device and
> XFS (for rocksdb) and the block device consumed by newstore.  As long
> as XFS doesn't fragment horrifically (it shouldn't, given we *always*
> write ~4MB files in their entirety) we can fstrim and size down the fs
> portion.  If we similarly make newstore's allocator stick to large
> blocks only we would be able to size down the block portion as well.
> Typical dm-thin block sizes seem to range from 64KB to 512KB, which
> seems reasonable enough to me.  In fact, we could likely just size the
> fs volume at something conservatively large (like 90%) and rely on
> -o discard or periodic fstrim to keep its actual utilization in check.
>
> sage
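
On the periodic fstrim in 2): if we wanted the OSD to drive it itself
rather than leaning on cron and fstrim(8), it boils down to the FITRIM
ioctl.  Rough sketch below; the mount point is a placeholder and the
error handling is minimal.

// Sketch: an in-process equivalent of "fstrim <mountpoint>" so space
// XFS frees can be handed back to dm-thin (and thus to the newstore
// block portion).  The path in main() is hypothetical.
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   // FITRIM, struct fstrim_range
#include <cstdint>
#include <cstdio>

static int trim_fs(const char* mountpoint) {
  int fd = ::open(mountpoint, O_RDONLY | O_DIRECTORY);
  if (fd < 0) {
    perror("open");
    return -1;
  }
  struct fstrim_range range;
  range.start = 0;
  range.len = UINT64_MAX;  // trim the whole filesystem
  range.minlen = 0;        // let the fs choose a minimum extent size
  int r = ::ioctl(fd, FITRIM, &range);
  if (r < 0)
    perror("FITRIM");
  else
    // on success the kernel updates range.len to the bytes it discarded
    printf("trimmed %llu bytes\n", (unsigned long long)range.len);
  ::close(fd);
  return r;
}

int main() {
  // hypothetical mount point for the rocksdb volume
  return trim_fs("/var/lib/ceph/osd/ceph-0/db") == 0 ? 0 : 1;
}

The -o discard route gets the same effect without a scheduled pass, at
the cost of issuing discards in the write path as blocks are freed.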