Thanks, I'd be happy to test the patch...

2017-08-30 10:07 GMT+08:00 Sage Weil <sweil@xxxxxxxxxx>:
> On Wed, 30 Aug 2017, zengran zhang wrote:
>> Ohh, I found some log entries, as follows:
>>
>> 2017-08-30 07:14:16.641631 7fd23d346700  4 rocksdb: EVENT_LOG_v1
>> {"time_micros": 1504048456641609, "cf_name": "default", "job": 56,
>> "event": "table_file_creation", "file_number": 237, "file_size":
>> 88854089, "table_properties": {"data_size": 87468782, "index_size":
>> 1384424, "filter_size": 0, "raw_key_size": 22244688,
>> "raw_average_key_size": 44, "raw_value_size": 77465275,
>> "raw_average_value_size": 154, "num_data_blocks": 21885,
>> "num_entries": 502756, "filter_policy_name": "", "kDeletedKeys":
>> "318362", "kMergeOperands": "49535"}}
>> 2017-08-30 07:14:16.641697 7fd23d346700  4 rocksdb:
>> [/tmp/release/Ubuntu/WORKDIR/ceph-12.0.2-29-g37268ad/src/rocksdb/db/flush_job.cc:317]
>> [default] [JOB 56] Level-0 flush table #237: 88854089 bytes OK
>> 2017-08-30 07:14:16.641704 7fd23d346700 10 bluefs sync_metadata - no
>> pending log events
>>
>> but these log entries seem to come too late; the log has already grown too big...
>
> Oh, I think I see the problem.  Does
>
> https://github.com/ceph/ceph/pull/17354
>
> make sense?
>
> Thanks!
> sage
>
>>
>> 2017-08-30 9:14 GMT+08:00 zengran zhang <z13121369189@xxxxxxxxx>:
>> > hi Sage,
>> >   I wrote 100% of a 4T rbd, and the log never seems to get
>> > compacted... the perfcounter output is as follows:
>> >     "bluefs": {
>> >         "gift_bytes": 0,
>> >         "reclaim_bytes": 0,
>> >         "db_total_bytes": 16106119168,
>> >         "db_used_bytes": 16106119168,
>> >         "wal_total_bytes": 5368705024,
>> >         "wal_used_bytes": 5368705024,
>> >         "slow_total_bytes": 79153848320,
>> >         "slow_used_bytes": 2502942720,
>> >         "num_files": 33,
>> >         "log_bytes": 4380053504,
>> >         "log_compactions": 0,
>> >         "logged_bytes": 4378685440,
>> >         "files_written_wal": 26,
>> >         "files_written_sst": 206,
>> >         "bytes_written_wal": 8219878782,
>> >         "bytes_written_sst": 12998417672
>> >     },
>> > my ceph version is 12.0.2
>> > bluestore_rocksdb_options = compression=kNoCompression,
>> > max_write_buffer_number=2, min_write_buffer_number_to_merge=1,
>> > write_buffer_size=268435456, writable_file_max_buffer_size=0
>> >
>> > Could a missing rocksdb option be causing the problem?
>> >
>> > Thanks & Regards!
>> >
>> > 2017-08-29 21:45 GMT+08:00 Sage Weil <sweil@xxxxxxxxxx>:
>> >> On Tue, 29 Aug 2017, zengran zhang wrote:
>> >>> Thanks! I set debug_bluefs to 10, but did not see the
>> >>> "_should_compact_log" log line while fio was writing to the rbd... so how
>> >>> often will BlueFS::sync_metadata() be called?
>> >>
>> >> Not that often.  We only need to write to the bluefs log when new rocksdb
>> >> files are created... and they are pretty big.  So it will need to age for
>> >> quite a while before anything happens.
>> >>
>> >> You can get some sense of it by watching the 'ceph daemonperf osd.0'
>> >> output on a running OSD and watching the bluefs.wal column.  There is also
>> >> a bluefs.log_bytes counter in the full perf dump, although you won't see
>> >> the estimated size to compare it against.
>> >>
>> >> sage
>> >>
>> >>>
>> >>> 2017-08-29 21:19 GMT+08:00 Sage Weil <sweil@xxxxxxxxxx>:
>> >>> > On Tue, 29 Aug 2017, zengran zhang wrote:
>> >>> >> hi Sage,
>> >>> >>   I want to ask: when will the bluefs log be compacted? I'm only
>> >>> >> sure that it is compacted when the bluefs is unmounted...
>> >>> >> I see the log get compacted when rocksdb calls Dir.Fsync(), but I
>> >>> >> want to know what triggers this...
>> >>> >
>> >>> > There is a heuristic for when it gets "too big":
>> >>> >
>> >>> > bool BlueFS::_should_compact_log()
>> >>> > {
>> >>> >   uint64_t current = log_writer->file->fnode.size;
>> >>> >   uint64_t expected = _estimate_log_size();
>> >>> >   float ratio = (float)current / (float)expected;
>> >>> >   dout(10) << __func__ << " current 0x" << std::hex << current
>> >>> >            << " expected " << expected << std::dec
>> >>> >            << " ratio " << ratio
>> >>> >            << (new_log ? " (async compaction in progress)" : "")
>> >>> >            << dendl;
>> >>> >   if (new_log ||
>> >>> >       current < cct->_conf->bluefs_log_compact_min_size ||
>> >>> >       ratio < cct->_conf->bluefs_log_compact_min_ratio) {
>> >>> >     return false;
>> >>> >   }
>> >>> >   return true;
>> >>> > }
>> >>> >
>> >>> > and the estimate for the (compacted) size is
>> >>> >
>> >>> > uint64_t BlueFS::_estimate_log_size()
>> >>> > {
>> >>> >   int avg_dir_size = 40;  // fixme
>> >>> >   int avg_file_size = 12;
>> >>> >   uint64_t size = 4096 * 2;
>> >>> >   size += file_map.size() * (1 + sizeof(bluefs_fnode_t));
>> >>> >   for (auto& p : block_all)
>> >>> >     size += p.num_intervals() * (1 + 1 + sizeof(uint64_t) * 2);
>> >>> >   size += dir_map.size() + (1 + avg_dir_size);
>> >>> >   size += file_map.size() * (1 + avg_dir_size + avg_file_size);
>> >>> >   return ROUND_UP_TO(size, super.block_size);
>> >>> > }
>> >>> >
>> >>> > The default min_ratio is 5... so we compact when it's ~5x bigger than it
>> >>> > needs to be.
>> >>> >
>> >>> > sage
>> >>> >
>> >>>
>> >>>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html