On Tue, 24 Mar 2020, Igor Fedotov wrote:
> Hi Sage,
>
> We've got another occurrence of the issue in this ticket:
> https://tracker.ceph.com/issues/40300
>
> Now I'm trying to understand what's happening in BlueFS when it occurs.
> Unfortunately the customer applied the suggested workaround and hence
> likely destroyed the original sst layout.
>
> So I'm wondering if the bluefs part of the issue (which prevents the OSD
> from restarting) is caused by a single very long read (>4 GB) from BlueFS.
>
> If so, the BlueRocksSequentialFile::Read implementation seems to be broken
> due to its use of int:
>
> rocksdb::Status Read(size_t n, rocksdb::Slice* result, char* scratch) override {
>   int r = fs->read(h, &h->buf, h->buf.pos, n, NULL, scratch);
>   ceph_assert(r >= 0);
>   *result = rocksdb::Slice(scratch, r);
>   ...
>
> Please note that sizeof(int) is 4!
size_t is 64-bit here, so we could change the int here (and in _read, and
so on down the stack) to ssize_t...
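To illustrate, here's a minimal standalone sketch of the truncation (values
are hypothetical; assumes a 64-bit target where sizeof(int) == 4 and that
the narrowing wraps, as it does on mainstream compilers):

#include <cassert>
#include <cstdio>

int main() {
  // a read that actually returned ~4.5 GiB: the low 32 bits happen to be
  // a plausible positive value, so the short Slice goes unnoticed
  long long full = 0x120000000LL;   // 4.5 GiB
  int r = (int)full;                // narrowed to 32 bits
  std::printf("%d\n", r);           // 536870912, i.e. 512 MiB
  assert(r >= 0);                   // passes; no assert, just bad data

  // ~6 GiB wraps negative instead, so ceph_assert(r >= 0) would abort
  long long full2 = 0x180000000LL;  // 6 GiB
  int r2 = (int)full2;
  std::printf("%d\n", r2);          // -2147483648
  return 0;
}

So depending on the length, the truncated value either sneaks past the
assert as a bogus short length or goes negative and aborts, which looks
consistent with the failed restarts above.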
> Also I'm wondering whether we're obliged to return exactly the requested
> amount of data from Read to RocksDB. Couldn't simply capping the read in
> this function be the solution?
I think the easiest way to answer that is to look at the PosixSequentialFile
implementation in the rocksdb tree and see if it ever returns short, or
whether it wraps read(2) in a loop.
Unfortunately that's not entirely clear...
if (r < n) {
  if (feof(file_)) {
    // We leave status as ok if we hit the end of the file
    // We also clear the error so that the reads can continue
    // if a new data is written to the file
    clearerr(file_);
  } else {
    // A partial read with an error: return a non-ok status
    s = IOError("While reading file sequentially", filename_, errno);
  }
}
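So the posix implementation can return short: a partial read at EOF is
passed through as ok, and there is no retry loop. If it turned out that
rocksdb callers did require full reads, BlueRocks itself would have to
loop; a rough sketch (hypothetical, reusing the names from the snippet
Igor quoted, and assuming fs->read advances h->buf.pos the way the
sequential wrapper relies on):

rocksdb::Status Read(size_t n, rocksdb::Slice* result, char* scratch) override {
  size_t got = 0;
  while (got < n) {
    // never ask for more than fits in the int return path (INT_MAX from <climits>)
    size_t want = n - got > (size_t)INT_MAX ? (size_t)INT_MAX : n - got;
    int r = fs->read(h, &h->buf, h->buf.pos, want, NULL, scratch + got);
    ceph_assert(r >= 0);
    if (r == 0)  // EOF: return what we have, like the posix code above
      break;
    got += r;
  }
  *result = rocksdb::Slice(scratch, got);
  return rocksdb::Status::OK();
}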
But, I think the only real reason we'd want to do a short read is if
the read is long due to readahead, in which case the problem is
more that readahead was kludged into the stack at the wrong
point.
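For reference, the cap Igor suggests would just be a clamp on the request;
a sketch (again hypothetical, and only valid if the short-read semantics
above really hold all the way up the rocksdb stack):

rocksdb::Status Read(size_t n, rocksdb::Slice* result, char* scratch) override {
  // clamp so the result always fits in the 32-bit return path; rocksdb
  // would be expected to come back for the rest
  size_t want = n > (size_t)INT_MAX ? (size_t)INT_MAX : n;
  int r = fs->read(h, &h->buf, h->buf.pos, want, NULL, scratch);
  ceph_assert(r >= 0);
  *result = rocksdb::Slice(scratch, r);
  return rocksdb::Status::OK();
}

The ssize_t change above avoids relying on that contract at all.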
sage