On 1/18/2019 6:33 PM, KEVIN MICHAEL HRPCEK wrote:
On 1/18/19 7:26 AM, Igor Fedotov wrote:
Hi Kevin,
On 1/17/2019 10:50 PM, KEVIN MICHAEL HRPCEK wrote:
Hey,
I recall reading about this somewhere but I can't find it in
the docs or list archive and confirmation from a dev or
someone who knows for sure would be nice. What I recall is
that bluestore has a max 4GB file size limit based on the
design of bluestore not the osd_max_object_size setting. The
bluestore source seems to suggest that by setting the
OBJECT_MAX_SIZE to a 32bit max, giving an error if
osd_max_object_size is > OBJECT_MAX_SIZE, and not writing
the data if offset+length >= OBJECT_MAX_SIZE. So it seems
like the object size integer inside the OSD can't exceed 32 bits,
which is 4GB, like FAT32. Am I correct, or am I reading all this
wrong?
You're correct, BlueStore doesn't support
objects larger than OBJECT_MAX_SIZE (i.e. 4 GB).
Thanks for confirming that!
If bluestore has a hard 4GB object limit using radosstriper
to break up an object would work, but does using an EC pool
that breaks up the object to shards smaller than
OBJECT_MAX_SIZE have the same effect as radosstriper to get
around a 4GB limit? We use rados directly and would like to
move to bluestore but we have some large objects <= 13G
that may need attention if this 4GB limit does exist and an
ec pool doesn't get around it.
Theoretically, splitting the object using EC might help.
But I'm not sure whether one needs to raise
osd_max_object_size above 4 GB to permit a 13 GB object
in an EC pool. If that is needed, then the
osd_max_object_size <= OBJECT_MAX_SIZE constraint is violated
and BlueStore won't start.
In my experience I had to increase
osd_max_object_size from the 128M default it changed to a couple
versions ago to ~20G to be able to write our largest objects
with some margin. Do you think there is another way to handle
osd_max_object_size > OBJECT_MAX_SIZE so that bluestore will
start and EC pools or striping can be used to write objects that
are greater than OBJECT_MAX_SIZE but each stripe/shard ends up
smaller than OBJECT_MAX_SIZE after striping or being in an ec
pool?
I'm not very familiar with the logic that
osd_max_object_size controls at the OSD level. But IMO there
might be two logically valid options:
1) This is the maximum user (RADOS?) object size. In
this case the verification in BlueStore is a bit incorrect, as EC
might be in the path and hence one can still have a 4+ GB object
stored. If that's the case, then it's enough to remove the
corresponding assertion in BlueStore.
2) This is the maximum object size passed to the object
store. Then one should be able to upload an object larger than
this threshold using EC.
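
To make the two readings above concrete, here is a rough C++ sketch of the arithmetic for a 13 GiB object. It assumes an EC pool splits an object roughly evenly across its k data chunks (stripe padding is ignored), and the k values and the helper names shard_size and fits_in_bluestore are hypothetical, not Ceph APIs. Only the OBJECT_MAX_SIZE value is taken from BlueStore.cc.

```cpp
#include <cstdint>

// Value quoted from BlueStore.cc; everything else here is illustrative.
constexpr std::uint64_t OBJECT_MAX_SIZE = 0xffffffffULL;  // 32 bits
constexpr std::uint64_t GiB = 1ULL << 30;

// Assumption: each shard holds about object_size / k bytes (rounded up),
// where k is the number of EC data chunks.
constexpr std::uint64_t shard_size(std::uint64_t object_size, std::uint64_t k) {
  return (object_size + k - 1) / k;
}

// Same comparison as the BlueStore write path: sizes that reach
// OBJECT_MAX_SIZE are rejected.
constexpr bool fits_in_bluestore(std::uint64_t size) {
  return size < OBJECT_MAX_SIZE;
}

// A 13 GiB object stored whole exceeds the per-object limit.
static_assert(!fits_in_bluestore(13 * GiB), "whole object is too large");
// With k=2 each shard is about 6.5 GiB, still too large.
static_assert(!fits_in_bluestore(shard_size(13 * GiB, 2)), "k=2 shard too large");
// With k=4 each shard is about 3.25 GiB, below the limit.
static_assert(fits_in_bluestore(shard_size(13 * GiB, 4)), "k=4 shard fits");
```

Under option 1 the shards for k=4 would fit in BlueStore even though the user-visible object does not, which is why the startup check on osd_max_object_size becomes the blocker rather than the per-write check.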
I'm going to verify this behavior and come up
with corresponding fixes, if any, shortly.
Unfortunately, in the short term I don't see any workaround
for your case other than a custom build with the
assertion in BlueStore removed.
https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L88
#define OBJECT_MAX_SIZE 0xffffffff // 32 bits
https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L4395
  // sanity check(s)
  auto osd_max_object_size =
    cct->_conf.get_val<Option::size_t>("osd_max_object_size");
  if (osd_max_object_size >= (size_t)OBJECT_MAX_SIZE) {
    derr << __func__ << " osd_max_object_size >= 0x" << std::hex << OBJECT_MAX_SIZE
         << "; BlueStore has hard limit of 0x" << OBJECT_MAX_SIZE << "." << std::dec << dendl;
    return -EINVAL;
  }
https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L12331
  if (offset + length >= OBJECT_MAX_SIZE) {
    r = -E2BIG;
  } else {
    _assign_nid(txc, o);
    r = _do_write(txc, c, o, offset, length, bl, fadvise_flags);
    txc->write_onode(o);
  }
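
For anyone who wants to see how the two checks quoted above interact, here is a minimal standalone C++ restatement. It is not the BlueStore code itself: check_mount_config and check_write_extent are hypothetical names, and the 128 MiB and 20 GiB inputs are only stand-ins for the default and the raised osd_max_object_size discussed in this thread.

```cpp
#include <cerrno>
#include <cstdint>

// Value quoted from BlueStore.cc.
constexpr std::uint64_t OBJECT_MAX_SIZE = 0xffffffffULL;  // 32 bits

// Hypothetical stand-in for the startup sanity check:
// a configured osd_max_object_size at or above the limit is refused.
constexpr int check_mount_config(std::uint64_t osd_max_object_size) {
  return osd_max_object_size >= OBJECT_MAX_SIZE ? -EINVAL : 0;
}

// Hypothetical stand-in for the write-path check:
// a write whose end offset reaches the limit is refused.
constexpr int check_write_extent(std::uint64_t offset, std::uint64_t length) {
  return offset + length >= OBJECT_MAX_SIZE ? -E2BIG : 0;
}

// A 128 MiB setting passes the startup check.
static_assert(check_mount_config(128ULL << 20) == 0, "128 MiB is accepted");
// A 20 GiB setting is rejected before any write is attempted.
static_assert(check_mount_config(20ULL << 30) == -EINVAL, "20 GiB is rejected");
// A single 4 GiB write is rejected on the write path.
static_assert(check_write_extent(0, 1ULL << 32) == -E2BIG, "4 GiB write is rejected");
```

The point of separating the two is that the startup check looks only at the configured value, so it fails regardless of how small the individual shards or stripes would end up being.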
Thanks!
Kevin
--
Kevin Hrpcek
NASA SNPP Atmosphere SIPS
Space Science & Engineering Center
University of Wisconsin-Madison
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Thanks,
Igor