On Tue, 24 Mar 2020, Igor Fedotov wrote:
> Hi Sage,
>
> We've got another occurrence of the issue in this ticket:
> https://tracker.ceph.com/issues/40300
>
> Now I'm trying to understand what's happening in BlueFS when it occurs.
> Unfortunately the customer applied the suggested workaround and hence
> likely destroyed the original sst layout.
>
> So I'm wondering if the bluefs part of the issue (which prevents the OSD
> from restarting) is caused by a single very long read (>4 GB) from BlueFS.
>
> If so, the BlueRocksSequentialFile::Read implementation seems to be broken
> due to its use of int:
>
> rocksdb::Status Read(size_t n, rocksdb::Slice* result, char* scratch) override {
>   int r = fs->read(h, &h->buf, h->buf.pos, n, NULL, scratch);
>   ceph_assert(r >= 0);
>   *result = rocksdb::Slice(scratch, r);
>   ...
>
> Please note that sizeof(int) is 4!
size_t is 64-bit here, so we could change the int here (and in _read, and
so on down the stack) to ssize_t...
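To illustrate, here's a minimal standalone sketch of the truncation (values
are hypothetical; assumes a 64-bit target where sizeof(int) == 4 and that
the narrowing wraps, as it does on mainstream compilers):

#include <cassert>
#include <cstdio>

int main() {
  // a read that actually returned ~4.5 GiB: the low 32 bits happen to be
  // a plausible positive value, so the short Slice goes unnoticed
  long long full = 0x120000000LL;   // 4.5 GiB
  int r = (int)full;                // narrowed to 32 bits
  std::printf("%d\n", r);           // 536870912, i.e. 512 MiB
  assert(r >= 0);                   // passes; no assert, just bad data

  // ~6 GiB wraps negative instead, so ceph_assert(r >= 0) would abort
  long long full2 = 0x180000000LL;  // 6 GiB
  int r2 = (int)full2;
  std::printf("%d\n", r2);          // -2147483648
  return 0;
}

So depending on the length, the truncated value either sneaks past the
assert as a bogus short length or goes negative and aborts, which looks
consistent with the failed restarts above.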
> Also I'm wondering whether we're obliged to return exactly the requested
> amount of data from Read to RocksDB. Couldn't simply capping the read in
> this function be the solution?
I think the easiest way to answer that is to look at the PosixSequentialFile
implementation in the rocksdb tree and see if it ever returns short, or
whether it wraps read(2) in a loop.
Unfortunately that's not entirely clear...
if (r < n) {
  if (feof(file_)) {
    // We leave status as ok if we hit the end of the file
    // We also clear the error so that the reads can continue
    // if a new data is written to the file
    clearerr(file_);
  } else {
    // A partial read with an error: return a non-ok status
    s = IOError("While reading file sequentially", filename_, errno);
  }
}
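So the posix implementation can return short: a partial read at EOF is
passed through as ok, and there is no retry loop. If it turned out that
rocksdb callers did require full reads, BlueRocks itself would have to
loop; a rough sketch (hypothetical, reusing the names from the snippet
Igor quoted, and assuming fs->read advances h->buf.pos the way the
sequential wrapper relies on):

rocksdb::Status Read(size_t n, rocksdb::Slice* result, char* scratch) override {
  size_t got = 0;
  while (got < n) {
    // never ask for more than fits in the int return path (INT_MAX from <climits>)
    size_t want = n - got > (size_t)INT_MAX ? (size_t)INT_MAX : n - got;
    int r = fs->read(h, &h->buf, h->buf.pos, want, NULL, scratch + got);
    ceph_assert(r >= 0);
    if (r == 0)  // EOF: return what we have, like the posix code above
      break;
    got += r;
  }
  *result = rocksdb::Slice(scratch, got);
  return rocksdb::Status::OK();
}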
But, I think the only real reason we'd want to do a short read is if
the read is long due to readahead, in which case the problem is
more that readahead was kludged into the stack at the wrong
point.
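For reference, the cap Igor suggests would just be a clamp on the request;
a sketch (again hypothetical, and only valid if the short-read semantics
above really hold all the way up the rocksdb stack):

rocksdb::Status Read(size_t n, rocksdb::Slice* result, char* scratch) override {
  // clamp so the result always fits in the 32-bit return path; rocksdb
  // would be expected to come back for the rest
  size_t want = n > (size_t)INT_MAX ? (size_t)INT_MAX : n;
  int r = fs->read(h, &h->buf, h->buf.pos, want, NULL, scratch);
  ceph_assert(r >= 0);
  *result = rocksdb::Slice(scratch, r);
  return rocksdb::Status::OK();
}

The ssize_t change above avoids relying on that contract at all.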
sage