Re: Rados maximum object size issue since Luminous?

Martin Emrich <martin.emrich@xxxxxxxxxxx> · Tue, 4 Jul 2017 12:10:21 +0000

Hi!

I dug deeper, and apparently striping is not backwards-compatible with objects written without striping:

* "rados ls --striper" lists only objects that were written with striping in the first place.
* If I enable striping in Bareos (tried different values for stripe_unit and stripe_count), it crashes here:

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.0/rpm/el7/BUILD/ceph-12.1.0/src/osdc/Striper.cc: In function 'static void Striper::file_to_extents(CephContext*, const char*, const file_layout_t*, uint64_t, uint64_t, uint64_t, std::map<object_t, std::vector<ObjectExtent> >&, uint64_t)' thread 7f32d14da700 time 2017-07-04 13:23:26.097884
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.0/rpm/el7/BUILD/ceph-12.1.0/src/osdc/Striper.cc: 64: FAILED assert(object_size >= su)
 ceph version 12.1.0 (262617c9f16c55e863693258061c5b25dea5b086) luminous (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7f32db699120]
 2: (Striper::file_to_extents(CephContext*, char const*, file_layout_t const*, unsigned long, unsigned long, unsigned long, std::map<object_t, std::vector<ObjectExtent, std::allocator<ObjectExtent> >, std::less<object_t>, std::allocator<std::pair<object_t const, std::vector<ObjectExtent, std::allocator<ObjectExtent> > > > >&, unsigned long)+0x1826) [0x7f32e5969c16]
 3: (Striper::file_to_extents(CephContext*, char const*, file_layout_t const*, unsigned long, unsigned long, unsigned long, std::vector<ObjectExtent, std::allocator<ObjectExtent> >&, unsigned long)+0x5b) [0x7f32e596b65b]
 4: (libradosstriper::RadosStriperImpl::aio_read(std::string const&, librados::AioCompletionImpl*, ceph::buffer::list*, unsigned long, unsigned long)+0x584) [0x7f32e58f2054]
 5: (libradosstriper::RadosStriperImpl::read(std::string const&, ceph::buffer::list*, unsigned long, unsigned long)+0x55) [0x7f32e58f2315]
 6: (rados_striper_read()+0x112) [0x7f32e58eada2]
 7: (rados_device::read_object_data(long, char*, unsigned long)+0x3c) [0x7f32e6f0a08c]
 8: (rados_device::d_read(int, void*, unsigned long)+0x1a) [0x7f32e6f0a0ba]
 9: (DEVICE::read(void*, unsigned long)+0x27) [0x7f32e6ef1187]
 10: (DCR::read_block_from_dev(bool)+0xca) [0x7f32e6ee99aa]
 11: (read_dev_volume_label(DCR*)+0x2d8) [0x7f32e6ef4988]
 12: (DCR::check_volume_label(bool&, bool&)+0x10d) [0x7f32e6ef72dd]
 13: (DCR::mount_next_write_volume()+0x5c0) [0x7f32e6ef80e0]
 14: (acquire_device_for_append(DCR*)+0xdb) [0x7f32e6ee179b]
 15: /sbin/bareos-sd() [0x408031]
 16: /sbin/bareos-sd() [0x40f5c4]
 17: /sbin/bareos-sd() [0x40f9d9]
 18: /sbin/bareos-sd() [0x40fbd2]
 19: /sbin/bareos-sd() [0x41070b]
 20: /sbin/bareos-sd() [0x40ee02]
 21: /sbin/bareos-sd() [0x414c18]
 22: (workq_server()+0x1f5) [0x7f32e6a9ca85]
 23: (lmgr_thread_launcher()+0x55) [0x7f32e6a84fb5]
 24: (()+0x7dc5) [0x7f32e5da9dc5]
 25: (clone()+0x6d) [0x7f32e4a5876d]

I guess this is because it tries to read older Volumes (==Objects) which were not written with striping on?
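
For reference, the failing assert guards the striper's offset arithmetic. Below is a minimal Python sketch of that arithmetic, assuming the usual Ceph layout scheme with stripe_unit, stripe_count and object_size. The function name locate is hypothetical and not part of libradosstriper. It only illustrates why a layout with object_size smaller than stripe_unit cannot be mapped; it does not explain how such a layout ends up being read for an old, unstriped Volume.

```python
def locate(offset, stripe_unit, stripe_count, object_size):
    """Map a byte offset in a striped volume to (object number, offset in object).

    Illustrative sketch only; names and error handling are assumptions.
    """
    if stripe_unit <= 0 or stripe_count <= 0:
        raise ValueError("stripe_unit and stripe_count must be positive")
    # Same precondition as the assert in the trace: object_size >= su
    if object_size < stripe_unit:
        raise ValueError("object_size must be >= stripe_unit")
    if object_size % stripe_unit != 0:
        raise ValueError("object_size must be a multiple of stripe_unit")

    stripes_per_object = object_size // stripe_unit
    blockno = offset // stripe_unit            # which stripe unit overall
    stripeno = blockno // stripe_count         # which horizontal stripe
    stripepos = blockno % stripe_count         # which object within the set
    objectsetno = stripeno // stripes_per_object
    objectno = objectsetno * stripe_count + stripepos

    block_start = (stripeno % stripes_per_object) * stripe_unit
    return objectno, block_start + offset % stripe_unit


MiB = 2**20
print(locate(3 * MiB, stripe_unit=MiB, stripe_count=2, object_size=4 * MiB))
```

With a 1 MiB stripe unit, two objects per set and 4 MiB objects, byte 3 MiB lands in the second object (number 1) at offset 1 MiB.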

So, since striping is not backwards-compatible (and this pool is indeed for backup/archival purposes, where large objects are no problem):

How can I restore the behaviour of Jewel (allowing 50GB objects)?

The only option I found was "osd max write size" but that seems not to be the right one, as its default of 90MB is lower than my observed 128MB.

Cheers,

Martin

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Martin Emrich
Sent: Tuesday, 4 July 2017 09:46
To: Gregory Farnum <gfarnum@xxxxxxxxxx>; ceph-users@xxxxxxxxxxxxxx
Subject: Re:  Rados maximum object size issue since Luminous?

Hi,

thanks for the explanation! I am just now diving into the C code of Bareos. It seems there is already code in there to use libradosstriper; I would just have to turn it on ;)

However, there are two parameters (stripe_unit and stripe_count) for which no default values are given.

What would be sane default values for these parameters (expecting objects of 5-50GB)? Can I retain backwards compatibility with existing larger objects written without striping?
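
To get a feeling for what different values would mean, here is a small Python sketch that counts how many RADOS objects a Volume of a given size would be split into. It assumes the usual Ceph striping layout (data written round-robin in stripe_unit pieces across stripe_count objects, each object holding at most object_size bytes). The function name object_count and the example values are hypothetical, not recommendations or library defaults.

```python
def object_count(volume_bytes, stripe_unit, stripe_count, object_size):
    """Number of RADOS objects needed for a striped volume (sketch)."""
    if stripe_unit <= 0 or stripe_count <= 0:
        raise ValueError("stripe_unit and stripe_count must be positive")
    if object_size < stripe_unit or object_size % stripe_unit != 0:
        raise ValueError("object_size must be a multiple of stripe_unit")

    # One object set is stripe_count objects filled completely.
    object_set_bytes = object_size * stripe_count
    full_sets, rest = divmod(volume_bytes, object_set_bytes)

    # In a partial set, each started stripe unit touches one more object,
    # up to stripe_count objects.
    units_in_rest = -(-rest // stripe_unit)  # ceiling division
    partial = min(stripe_count, units_in_rest)
    return full_sets * stripe_count + partial


MiB, GiB = 2**20, 2**30
for size in (5 * GiB, 50 * GiB):
    n = object_count(size, stripe_unit=4 * MiB, stripe_count=4,
                     object_size=64 * MiB)
    print(size // GiB, "GiB ->", n, "objects")
```

With these example values a 50 GiB Volume becomes 800 objects of 64 MiB each, all well below the 128MB limit.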

Thanks so much,

Martin

-----Original Message-----
From: Gregory Farnum [mailto:gfarnum@xxxxxxxxxx]
Sent: Monday, 3 July 2017 19:59
To: Martin Emrich <martin.emrich@xxxxxxxxxxx>
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re:  Rados maximum object size issue since Luminous?

On Mon, Jul 3, 2017 at 10:17 AM, Martin Emrich <martin.emrich@xxxxxxxxxxx> wrote:
> Hi!
>
>
>
> Having to interrupt my bluestore test, I have another issue since 
> upgrading from Jewel to Luminous: My backup system (Bareos with 
> RadosFile backend) can no longer write Volumes (objects) larger than around 128MB.
>
> (Of course, I did not test that on my test cluster prior to upgrading 
> the production one :/ )
>
>
>
> At first, I suspected an incompatibility between the Bareos storage 
> daemon and the newer Ceph version, but I could replicate it with the rados tool:
>
>
>
> Create a large file (1GB)
>
>
>
> Put it with rados
>
>
>
> rados --pool backup put rados-testfile rados-testfile-1G
>
> error putting backup-fra1/rados-testfile: (27) File too large
>
>
>
> Read it back:
>
>
>
> rados  --pool backup get rados-testfile rados-testfile-readback
>
>
>
> Indeed, it wrote just about 128MB
>
>
>
> Adding the "--striper" option to both get and put command lines, it works:
>
>
>
> -rw-r--r-- 1 root root 1073741824  3. Jul 18:47 rados-testfile-1G
>
> -rw-r--r-- 1 root root  134217728  3. Jul 19:12 rados-testfile-readback
>
>
>
> The error message I get from the backup system looks similar:
>
> block.c:659-29028 === Write error. fd=0 size=64512 rtn=-1 dev_blk=134185235 blk_blk=10401 errno=28: ERR=Auf dem Gerät ist kein Speicherplatz mehr verfügbar
>
>
>
> (German for "No space left on device")
>
>
>
> The service worked fine with Ceph jewel, nicely writing 50GB objects. 
> Did the API change somehow?

We set a default maximum object size (of 128MB, probably?) in order to prevent people from creating individual objects which are too large for the system to behave well with. It is configurable (I don't remember how, you'll need to look it up in hopefully-the-docs but probably-the-source), but there's generally not a good reason to create single huge objects instead of sharding them. 50GB objects probably work fine for archival, but if, e.g., you have an OSD failure, you won't be able to do any IO on objects which are being backfilled or recovered, and for a 50GB object that will take a while.
-Greg
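
As an illustration of the sharding idea mentioned above, here is a minimal Python sketch that plans how one large Volume could be cut into fixed-size pieces, each stored as its own object. The function name shard_plan, the naming pattern and the 128 MiB shard size are assumptions for the example; nothing here is Bareos or librados code.

```python
def shard_plan(volume_name, volume_bytes, shard_bytes=128 * 2**20):
    """Return a list of (object_name, offset, length) tuples (sketch).

    Each tuple describes one shard of the volume; the naming scheme
    '<volume>.<index>' is purely hypothetical.
    """
    if shard_bytes <= 0:
        raise ValueError("shard_bytes must be positive")
    plan = []
    offset = 0
    index = 0
    while offset < volume_bytes:
        length = min(shard_bytes, volume_bytes - offset)
        plan.append((f"{volume_name}.{index:08d}", offset, length))
        offset += length
        index += 1
    return plan


MiB = 2**20
for name, offset, length in shard_plan("Full-0001", 300 * MiB):
    print(name, offset // MiB, length // MiB)
```

A 300 MiB Volume yields two full 128 MiB shards and a final 44 MiB shard, so recovery after an OSD failure only ever blocks IO on one small piece at a time.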
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com