On 03/28/2018 08:05 AM, Varada Kari wrote:
On Wed, Mar 28, 2018 at 11:41 AM, xiaoyan li <wisher2003@xxxxxxxxx> wrote:
On Wed, Mar 28, 2018 at 9:43 AM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
Hi Lisa, your presentation last week at Cephalocon was quite convincing.
Recordings aren't available yet, so perhaps you can share your slides.
Here are the slides:
https://drive.google.com/file/d/1WC0id77KWLNVllsEcJCgRgEQ-Xzvzqx8/view?usp=sharing
For those who weren't there, Lisa tested many configurations of rocksdb
with bluestore to attempt to keep the pg log out of level 0 in rocksdb,
and thus avoid a large source of write amplification.
None of these tunings were successful, so the conclusion was that the pg
log ought to be stored outside of rocksdb.
Lisa, what are your thoughts on how to store the pg log?
For historical reference, it was moved into leveldb originally to make
it easier to program against correctly [0], but the current PGLog code
has grown too complex despite that.
I once wondered whether we could just put the pg log in standalone log
files. Read performance is not critical, since they are only read when an
OSD node recovers. That is, store the other metadata in RocksDB and
store the pg log in standalone journal files (with no transaction covering
both the other metadata and the pg log). But then I noticed that we can't
tell which OSD has the latest data if the 3 OSD nodes that contain the same
PGs fail during a write request. Some OSDs may have the updated data and
other OSDs may have the un-updated data, while none of them has the pg log
entry appended. In this case, we would need to compare the full objects.
We need an ordered set of pg log entries for recovery and peering.
If we store them as files, we need to remember in
some way which entries this OSD contains and where to read them.
And if we keep them in BlueFS, as you mentioned in the other
discussion, there will be more entries in the BlueFS log and the problem of
compaction shifts from rocksdb to BlueFS.
Otherwise we have to store them on the BlueStore block device, which would
consume more onodes and also lead to
more fragmented space usage on the block device.
There are two schemes that sort of seemed worth pursuing if we were to
seriously consider moving this data outside of rocksdb:
1) dedicated block space for per-pg ring buffers
2) something vaguely like the rocksdb WAL but we don't keep entries in
memory, we don't cap the number of buffers, and we keep in-memory
references to each one and only delete old buffers once no PG references
them anymore.
It seems to me that both schemes would avoid the write amp penalty we
suffer from now. Scheme 1 should be more space efficient, but would more or
less turn this all into random IO. Scheme 2 would mean bouncing between the
WAL and this external log, which might be nearly as bad as 1. It also could
mean extremely high space amp in unusual scenarios.
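
To make the space amp concern with scheme 2 concrete, here is a toy model in
Python. It is only a sketch under assumptions: the class and method names are
made up for illustration, segments are counted in entries rather than bytes,
and nothing here is Ceph or rocksdb code. All PGs append to shared segments,
and a segment is deleted only once no PG references it.

```python
from collections import defaultdict


class SharedPgLog:
    """Toy model of scheme 2 (hypothetical names, not Ceph code)."""

    def __init__(self, entries_per_segment):
        self.entries_per_segment = entries_per_segment
        self.segments = {0: []}          # segment id -> [(pg, version)]
        self.current = 0                 # segment currently open for appends
        self.refs = defaultdict(int)     # segment id -> live entry count
        self.live = defaultdict(list)    # pg -> [(version, segment id)], oldest first

    def append(self, pg, version):
        # Start a new segment when the open one is full.
        if len(self.segments[self.current]) >= self.entries_per_segment:
            self.current += 1
            self.segments[self.current] = []
        self.segments[self.current].append((pg, version))
        self.refs[self.current] += 1
        self.live[pg].append((version, self.current))

    def trim(self, pg, keep):
        """Keep only the newest `keep` entries of `pg`, then free segments."""
        entries = self.live[pg]
        drop = max(0, len(entries) - keep)
        for _, seg in entries[:drop]:
            self.refs[seg] -= 1
        self.live[pg] = entries[drop:]
        # A closed segment can go away once nothing references it.
        for seg in list(self.segments):
            if seg != self.current and self.refs[seg] == 0:
                del self.segments[seg]
                self.refs.pop(seg, None)


log = SharedPgLog(entries_per_segment=4)
log.append("1.a", 1)              # an idle PG writes a single entry
for v in range(1, 12):            # a busy PG keeps appending and trimming
    log.append("1.b", v)
    log.trim("1.b", keep=2)
print(sorted(log.segments))       # [0, 2]: segment 0 is pinned by PG 1.a
log.trim("1.a", keep=0)
print(sorted(log.segments))       # [2]
```

The single entry from the idle PG keeps the whole of segment 0 alive, which is
the kind of unusual scenario where space amp could get very high.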
I think Josh's proposal from last time is worth thinking about too: Make
per-pg ring buffers in rocksdb itself so that you are updating existing
keys.
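
A minimal sketch of what per-pg ring buffers over existing keys could look
like, in Python. The dict stands in for rocksdb, and the key layout, class
name, and capacity are assumptions made for illustration, not the actual Ceph
schema. Whether overwriting keys like this really reduces write amp inside
rocksdb is the open question; the sketch only shows the key layout.

```python
class PgLogRing:
    """Per-PG ring buffer over a key-value store (illustrative only)."""

    def __init__(self, store, pg, capacity):
        self.store = store        # plain dict standing in for rocksdb
        self.pg = pg
        self.capacity = capacity

    def _key(self, version):
        # Versions map onto a fixed set of `capacity` keys, so a new entry
        # overwrites the oldest one instead of creating a new key.
        return f"pglog/{self.pg}/{version % self.capacity:08d}"

    def append(self, version, entry):
        self.store[self._key(version)] = (version, entry)

    def entries(self):
        """Return this PG's surviving entries in version order."""
        prefix = f"pglog/{self.pg}/"
        found = [v for k, v in self.store.items() if k.startswith(prefix)]
        return sorted(found)


store = {}
ring = PgLogRing(store, pg="1.a", capacity=4)
for version in range(1, 11):
    ring.append(version, f"op-{version}")
print(len(store))                         # 4: the key count never grows
print([v for v, _ in ring.entries()])     # [7, 8, 9, 10]
```

The version is stored in the value because the slot key alone no longer tells
you the ordering once the ring has wrapped.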
Mark
Another method I am investigating is whether in RocksDB we can
use FIFO compaction just for the pg log. That means we need to handle it
for each pg. This requires changes in RocksDB, and every pg log entry will
be written at most twice (once to the RocksDB log, and once to level 0).
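
The "written twice at most" figure can be checked with simple arithmetic. The
helper below is a hypothetical sketch in Python, and the byte counts are
made-up inputs rather than measurements: with FIFO an entry goes once to the
RocksDB log and once to a level 0 file and is never rewritten by compaction,
whereas each extra compaction rewrite adds one to the factor.

```python
def write_amplification(user_bytes, wal_bytes, flush_bytes, compaction_bytes):
    """Bytes written to the device per byte of pg log submitted."""
    return (wal_bytes + flush_bytes + compaction_bytes) / user_bytes


n = 1024  # bytes of pg log entries submitted (arbitrary example value)

# FIFO: one write to the log, one flush to level 0, no compaction rewrite.
print(write_amplification(n, wal_bytes=n, flush_bytes=n, compaction_bytes=0))   # 2.0

# For comparison, assume the same data is rewritten once by a compaction.
print(write_amplification(n, wal_bytes=n, flush_bytes=n, compaction_bytes=n))   # 3.0
```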
Ideally, if we could have one instance of rocksdb per PG and one column
family for the pg log entries,
then all the entries would never leave rocksdb's L0, but that
comes with its own baggage of managing multiple
instances and many more resources. Practically, it might be difficult to handle.
FIFO might solve the problem, but write amplification comes to 2
again, as you mentioned. Sorry, I don't have any bright ideas to offer.
We need to explore separating all the pg log
info into column families of their own and always pinning them
to L0: either by passing some extra info so that they are always compacted
within L0 (never promoted), or by keeping the trimming intervals tight
enough that they stay in L0. The latter can have more backfilling impact,
but given the improvements in recovery, should we calibrate that and find
out its cost compared to log-based recovery?
Varada
Any suggestions?
Josh
[0]
https://github.com/ceph/ceph/commit/1ef94200e9bce5e0f0ac5d1e563421a9d036c203
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Best wishes
Lisa