On 04/05/2013 09:06 AM, Sage Weil wrote:
> On Fri, 5 Apr 2013, Jim Schutt wrote:
>> On 04/04/2013 10:44 PM, Sage Weil wrote:
>>> Fantastic work tracking this down, Jim!
>>>
>>> Looking at the Riak docs on tuning leveldb, it looks like a large write
>>> buffer size is definitely a good idea. The block size of 4MB is
>>> significantly larger than what they recommend, though.. if we go this big
>>> we also need to make the cache size larger (it defaults to 8MB?).
>>
>> I wondered about that, but the change in performance with
>> just these simple changes was so drastic that I thought
>> it was a good immediate change, with further tuning to
>> follow.
>>
>> Although, I just tested startup, not actually running such
>> a filesystem. Perhaps I missed the effects of insufficient
>> caching.
>>
>> I ended up picking 4 MB because the 55K PG case has PGStat
>> values that are 20 MB in length, so it still takes 5 blocks
>> to store one. Plus, I'd really like to be running at 256K
>> or 512K PGs to get the data uniformity across OSDs that I'm
>> after.
>>
>> Perhaps these need to be config options - I just hated to add
>> more, when it would be hard for a user to discover that's the
>> thing that needed tuning to solve some particular performance
>> issue. I'm hoping we can find one set of values that work well
>> for everyone.
>
> I think the best option is to figure out the defaults that will work for
> everyone, but still make config options. I suspect there will be users
> that need to tune for less memory.
>
> leveldb_block_size, leveldb_write_buffer_size, etc.
>
>>> Did you
>>> try with a large write buffer but a smaller block size (like 256K or
>>> 512K)?
>>
>> I did try a 256K block size with an 8 MB write buffer, also with
>> no compression. That caused the 55K PGs case to make much more
>> progress towards starting, but it still failed to come up - I
>> forget the details of what went awry.
>>
>>>
>>> I think either a larger cache or a smaller block size is okay, but 4MB
>>> with an 8MB cache means only 2 blocks cached, which sounds non-ideal.
>>
>> I can try fixing up a larger cache - I'll need to dig in
>> a little to figure out how to do it, so it might take me
>> a little while. How big do you think it should be, given
>> that the large PG count cases I'm after might have PGStat
>> data lengths that are many tens of MB?
>>
>> 64 MB? 256 MB? Tunable?
>
> The default is only 8MB, but this can safely go up to several gigs. I
> think 256 MB sounds like a reasonable default...
>
> Sam, want to weigh in?
>
>> Also, I wondered about whether the writes need to be
>> sync'd? Do these tunings change your mind about whether
>> that's needed?
>
> The sync parameters shouldn't need to be changed. os/FileStore.cc is
> calling the sync when we do an overall filestore sync prior to a commit
> point or btrfs snapshot.

OK. Now that I've had a look, setting up the cache should be
easy - I'm working on it.

I'll fix up v2 of the patch with that, and options for the tunables.

Thanks -- Jim

>
> sage
>
>
>>
>>>
>>> Thanks!
>>> sage
>>>
>>>
>>> On Thu, 4 Apr 2013, Jim Schutt wrote:
>>>
>>>> As reported in this thread
>>>> http://www.spinics.net/lists/ceph-devel/msg13777.html
>>>> starting in v0.59 a new filesystem with ~55,000 PGs would not start after
>>>> a period of ~30 minutes. By comparison, the same filesystem configuration
>>>> would start in ~1 minute for v0.58.
>>>>
>>>> The issue is that starting in v0.59, LevelDB is used for the monitor
>>>> data store. For moderate to large numbers of PGs, the length of a PGStat
>>>> value stored via LevelDB is best measured in megabytes. The default
>>>> tunings for LevelDB data blocking seem tuned for values with lengths
>>>> measured in tens or hundreds of bytes.
>>>>
>>>> With the data blocking tuning provided by this patch, here's a comparison
>>>> of filesystem startup times for v0.57, v0.58, and v0.59:
>>>>
>>>>            55,392 PGs   221,568 PGs
>>>>   v0.57    1m 07s        9m 42s
>>>>   v0.58    1m 04s       11m 44s
>>>>   v0.59       45s        4m 17s
>>>>
>>>> Note that this patch turns off LevelDB's compression. The block
>>>> tuning from this patch with compression enabled made no improvement
>>>> in the new filesystem startup time for v0.59, for either PG count
>>>> tested. I'll note that at 55,392 PGs the PGStat length is ~20 MB;
>>>> perhaps that value length interacts pooly with LevelDB's compression
>>
>> s/pooly/poorly/
>>
>> Thanks -- Jim
>>
>>>> at this block size.
>>>>
>>>> Signed-off-by: Jim Schutt <jaschut@xxxxxxxxxx>
>>>> ---
>>>>  src/os/LevelDBStore.cc |    3 +++
>>>>  1 files changed, 3 insertions(+), 0 deletions(-)
>>>>
>>>> diff --git a/src/os/LevelDBStore.cc b/src/os/LevelDBStore.cc
>>>> index 3d94096..1b6ae7d 100644
>>>> --- a/src/os/LevelDBStore.cc
>>>> +++ b/src/os/LevelDBStore.cc
>>>> @@ -16,6 +16,9 @@ int LevelDBStore::init(ostream &out, bool create_if_missing)
>>>>  {
>>>>    leveldb::Options options;
>>>>    options.create_if_missing = create_if_missing;
>>>> +  options.write_buffer_size = 32 * 1024 * 1024;
>>>> +  options.block_size = 4 * 1024 * 1024;
>>>> +  options.compression = leveldb::kNoCompression;
>>>>    leveldb::DB *_db;
>>>>    leveldb::Status status = leveldb::DB::Open(options, path, &_db);
>>>>    db.reset(_db);
>>>> --
>>>> 1.7.8.2
>>>>
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
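For readers following the thread: a rough sketch of what the v2 cache change discussed above might look like. This is an illustration, not the actual v2 patch; it assumes LevelDB's NewLRUCache() from leveldb/cache.h and the 256 MB default Sage suggests.

```diff
--- a/src/os/LevelDBStore.cc
+++ b/src/os/LevelDBStore.cc
@@
   leveldb::Options options;
   options.create_if_missing = create_if_missing;
   options.write_buffer_size = 32 * 1024 * 1024;
   options.block_size = 4 * 1024 * 1024;
   options.compression = leveldb::kNoCompression;
+  // Hypothetical v2 addition: ~64 blocks cached at the 4 MB block size.
+  options.block_cache = leveldb::NewLRUCache(256 * 1024 * 1024);
   leveldb::DB *_db;
```

The later config-option work Sage proposes (leveldb_block_size, leveldb_write_buffer_size, etc.) would presumably feed these same Options fields from Ceph configuration rather than hard-coding them.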