Re: Fwd: large concurrent rbd operations block for over 15 mins!

Mark Nelson <mnelson@xxxxxxxxxx> · Wed, 23 Oct 2019 10:04:16 -0500

Hi Frank,

Excellent, thanks for the feedback.  One other area that we've seen come
up recently is folks using EC with RGW and small (< 64K) objects.
Depending on the min_alloc_size and the EC chunking, that could result
in worse space amplification than plain 3x replication (and it will be
slower too, given the small object sizes).  We are looking at possibly
lowering the min_alloc_size to 4K, which would help reduce the space
amplification but may not totally alleviate the problem.
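
To make the space-amplification point concrete, here is a rough sketch of the arithmetic. It is a simplified model, not BlueStore's actual accounting: it assumes each object is split evenly across k data chunks, every shard is rounded up to a multiple of min_alloc_size, and metadata is ignored. The 4+2 profile and the 64K/4K allocation sizes are hypothetical example values.

```python
def round_up(n, unit):
    """Round n up to the next multiple of unit (integer arithmetic)."""
    return (n + unit - 1) // unit * unit

def ec_allocated(obj_bytes, k, m, min_alloc):
    """Bytes allocated for one object on a k+m EC pool (simplified model)."""
    chunk = (obj_bytes + k - 1) // k          # data bytes per shard
    return (k + m) * round_up(chunk, min_alloc)

def replicated_allocated(obj_bytes, copies, min_alloc):
    """Bytes allocated for one object on a replicated pool (simplified model)."""
    return copies * round_up(obj_bytes, min_alloc)

KIB = 1024
# (object size, min_alloc_size) pairs; all values are illustrative assumptions.
for obj, alloc in [(16 * KIB, 64 * KIB), (16 * KIB, 4 * KIB), (4 * KIB, 4 * KIB)]:
    ec = ec_allocated(obj, 4, 2, alloc) / obj
    rep = replicated_allocated(obj, 3, alloc) / obj
    print(f"object={obj // KIB}K min_alloc={alloc // KIB}K "
          f"EC 4+2: {ec:.1f}x  3x replication: {rep:.1f}x")
```

Under these assumptions a 16K object costs 24x on EC 4+2 versus 12x on 3x replication with a 64K allocation unit. With a 4K unit that drops to 1.5x versus 3x, but a 4K object still costs 6x on EC versus 3x replicated, which is why a smaller allocation unit helps without fully removing the problem.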

Mark

On 10/23/19 9:04 AM, Frank Schilder wrote:
Hi Mark,

For us it was mainly that we do not have the budget for replicated data pools for generic storage. In addition, we expect that SSDs will soon be very competitive in price with HDDs, offering the IOPS advantage necessary to run RBD on EC pools. To give you an idea, we use Micron PRO SSDs, which in our setup provide storage at ca. 4 times the price per TB compared with spinning disks. This is because our Ceph setup has quite a large infrastructure overhead (servers, network, etc.) that comes on top of the pure disk price. A factor of 4 at today's prices is already quite good; we could not run an equal-sized replication 3(2) pool with the same performance for this money.

Looking at recent price developments, we expect that this factor will go down to 2 within the next year, or at most the next two years.

The IOPS requirements for our VMs are not extreme. They are happy with 50 IOPS per machine (and were used to much worse in the past). If you need single-machine IOPS > 150, EC will not deliver due to latency. For such use cases, one needs a replicated pool, and we would be willing to set one up if users pay for it.

In short, our design philosophy is "get sufficient performance for small bucks" and EC does it for us.

Best regards,

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Mark Nelson <mnelson@xxxxxxxxxx>
Sent: 22 October 2019 15:59:21
To: ceph-users@xxxxxxx
Subject:  Re: Fwd: large concurrent rbd operations block for over 15 mins!

Out of curiosity, when you chose EC over replication how did you weigh
IOPS vs space amplification in your decision making process?  I'm
wondering if we should prioritize EC latency vs other tasks in future
tuning efforts (it's always a tradeoff deciding what to focus on).

Thanks,

Mark

On 10/22/19 2:35 AM, Frank Schilder wrote:
Getting decent RBD performance is not a trivial exercise. While at first glance 61 SSDs for 245 clients sounds more or less OK, it does come down to a bit more than that.

The first thing is how to get SSD performance out of SSDs with Ceph. This post provides very good clues and might already point out the bottleneck: https://yourcmc.ru/wiki/index.php?title=Ceph_performance . Do you have good enterprise SSDs?

The next thing to look at is the kind of data pool: replicated or erasure coded? If erasure coded, has the profile been benchmarked? Some profiles are very poor choices. Good ones are 4+m and 8+m with m>=2; 4+m gives better IOPS, 8+m better throughput.
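
As a rough illustration of the space side of the profile choice, the sketch below tabulates raw-to-usable overhead and the number of shards touched per full-stripe write for a few k+m profiles. It says nothing about measured IOPS or throughput, which still have to be benchmarked; the function name and the chosen values of m are assumptions for the example.

```python
def ec_profile_summary(k, m):
    """Basic properties of a k+m erasure-code profile."""
    return {
        "profile": f"{k}+{m}",
        "raw_per_usable": (k + m) / k,   # raw bytes stored per usable byte
        "shards_per_stripe": k + m,      # OSDs involved in a full-stripe write
        "failures_tolerated": m,         # shards that can be lost without data loss
    }

for k in (4, 8):
    for m in (2, 3):
        s = ec_profile_summary(k, m)
        print(f"{s['profile']}: {s['raw_per_usable']:.3f}x raw, "
              f"{s['shards_per_stripe']} shards, "
              f"tolerates {s['failures_tolerated']} failures")
```

For comparison, a 3x replicated pool stores 3.0 raw bytes per usable byte, while 4+2 stores 1.5 and 8+2 stores 1.25.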

More complications: do you need to deploy more than one OSD per SSD to boost performance? This is indicated by the iodepth required in an fio benchmark to reach full IOPS. Good SSDs already deliver spec performance with 1 OSD. More common ones require 2-4 OSDs per disk. Are you already using ceph-volume? Its default is 2 OSDs per SSD (batch mode).
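
One way to find the iodepth at which a drive reaches its full IOPS is to sweep fio over several queue depths and compare the results. The sketch below only builds and prints the command lines; it does not run them. The device path /dev/sdX, the 4k random-write workload, the 60-second runtime, and the list of depths are all assumptions for illustration. Note that a random-write test against a raw device destroys the data on it.

```python
import shlex

def fio_iodepth_sweep(device, depths=(1, 4, 16, 32, 64)):
    """Build one fio command line per queue depth (nothing is executed)."""
    cmds = []
    for depth in depths:
        cmds.append([
            "fio",
            f"--name=randwrite-qd{depth}",
            f"--filename={device}",   # hypothetical device; will be overwritten
            "--ioengine=libaio",
            "--direct=1",             # bypass the page cache
            "--rw=randwrite",
            "--bs=4k",
            f"--iodepth={depth}",
            "--runtime=60",
            "--time_based",
        ])
    return cmds

for argv in fio_iodepth_sweep("/dev/sdX"):
    print(shlex.join(argv))
```

If the reported IOPS keep climbing well beyond low queue depths, that is the situation described above where more than one OSD per disk may be worth testing.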

To give a baseline: after extensive testing and working through all the required tuning steps, I could run about 250 VMs on a 6+2 EC data pool on 33 enterprise SAS SSDs with 1 OSD per disk, each VM getting 50 IOPS write performance. This is probably what you would like to see as well.
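
A back-of-the-envelope view of what that baseline means per OSD is sketched below. It assumes every client write turns into one sub-write on each of the k+m shards and ignores read-modify-write, WAL, and metadata traffic, so the real load per OSD will be higher.

```python
def per_osd_shard_writes(vms, iops_per_vm, k, m, osds):
    """Return (client IOPS, shard writes/s, shard writes/s per OSD)."""
    client_iops = vms * iops_per_vm
    shard_writes = client_iops * (k + m)   # one sub-write per shard (assumption)
    return client_iops, shard_writes, shard_writes / osds

client, shards, per_osd = per_osd_shard_writes(
    vms=250, iops_per_vm=50, k=6, m=2, osds=33)
print(f"client write IOPS:    {client}")        # 12500
print(f"shard writes per sec: {shards}")        # 100000
print(f"per OSD:              {per_osd:.0f}")   # about 3030
```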

If you use a replicated data pool, this should be relatively easy. With an EC data pool, this is a bit of a battle.

Good luck,

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of Void Star Nill <void.star.nill@xxxxxxxxx>
Sent: 22 October 2019 03:00
To: ceph-users
Subject:  Fwd: large concurrent rbd operations block for over 15 mins!

Apparently the graph is too big, so my last post is stuck. Resending without the graph.

Thanks

---------- Forwarded message ---------
From: Void Star Nill <void.star.nill@xxxxxxxxx>
Date: Mon, Oct 21, 2019 at 4:41 PM
Subject: large concurrent rbd operations block for over 15 mins!
To: ceph-users <ceph-users@xxxxxxxxxxxxxx>

Hello,

I have been running some benchmark tests with a mid-size cluster and I am seeing some issues. I wanted to know if this is a bug or something that can be tuned. I would appreciate any help on this.

- I have a 15-node Ceph cluster, with 3 monitors and 12 data nodes with a total of 61 OSDs on SSDs, running version 14.2.4 nautilus (stable). Each node has a 100G link.
- I have 245 client machines from which I am triggering RBD operations. Each client has a 25G link.
- The RBD operations include: creating a 50G RBD image with the layering feature, mapping the image to the client machine, formatting the device as ext4, mounting it, running dd to write to the full disk, and cleaning up (unmount, unmap and remove).
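
For readers who want to reproduce the per-client workflow listed above, here is a minimal sketch that builds the command sequence and times each step. The pool name, image name, device path, mount point, and dd options are hypothetical; the post only says that dd fills the disk. The demo at the bottom is a dry run that prints the commands instead of executing them.

```python
import time

def build_steps(pool, image, device, mountpoint):
    """Return the per-client workflow as (label, argv) pairs."""
    spec = f"{pool}/{image}"
    return [
        ("create", ["rbd", "create", "--size", "50G",
                    "--image-feature", "layering", spec]),
        ("map",    ["rbd", "map", spec]),
        ("format", ["mkfs.ext4", device]),
        ("mount",  ["mount", device, mountpoint]),
        # dd options are an assumption, not taken from the original post.
        ("write",  ["dd", "if=/dev/zero", f"of={mountpoint}/fill", "bs=4M"]),
        ("umount", ["umount", mountpoint]),
        ("unmap",  ["rbd", "unmap", device]),
        ("remove", ["rbd", "rm", spec]),
    ]

def time_steps(steps, run):
    """Run each step through run(argv) and return {label: seconds}."""
    timings = {}
    for label, argv in steps:
        start = time.perf_counter()
        run(argv)
        timings[label] = time.perf_counter() - start
    return timings

# Dry run with hypothetical names: print each command rather than executing it.
steps = build_steps("rbd", "bench-img", "/dev/rbd0", "/mnt/bench")
timings = time_steps(steps, run=lambda argv: print(" ".join(argv)))
for label, seconds in timings.items():
    print(f"{label}: {seconds:.6f}s")
```

To execute the steps for real, pass a runner built on subprocess.run and take the device path from the output of rbd map rather than hard-coding it. Collecting the per-step timings from all clients makes it easy to see which step (format, umount, and so on) accounts for the long tail.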

If I run these RBD operations concurrently on a small number of machines (say 16-20), they run very well and I see good throughput. All image operations (except for dd) take less than 2 seconds.

However, when I scale it up to 245 clients, each running these operations concurrently, I see a lot of operations hanging for a long time, and the overall throughput drops drastically.

For example, some of the format operations take 10-15 minutes or more!

Note that all operations do complete, so it's most likely not a deadlock.

I don't see any errors in ceph.log on the monitor nodes. However, the clients do report "hung_task_timeout" in their dmesg logs.

In the graph (omitted from this resend), half the format operations complete in less than a second, while the other half take over 10 minutes (the y axis is in seconds).

[11117.113618] INFO: task umount:9902 blocked for more than 120 seconds.
[11117.113677]       Tainted: G           OE    4.15.0-51-generic #55~16.04.1-Ubuntu
[11117.113731] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11117.113787] umount          D    0  9902   9901 0x00000000
[11117.113793] Call Trace:
[11117.113804]  __schedule+0x3d6/0x8b0
[11117.113810]  ? _raw_spin_unlock_bh+0x1e/0x20
[11117.113814]  schedule+0x36/0x80
[11117.113821]  wb_wait_for_completion+0x64/0x90
[11117.113828]  ? wait_woken+0x80/0x80
[11117.113831]  __writeback_inodes_sb_nr+0x8e/0xb0
[11117.113835]  writeback_inodes_sb+0x27/0x30
[11117.113840]  __sync_filesystem+0x51/0x60
[11117.113844]  sync_filesystem+0x26/0x40
[11117.113850]  generic_shutdown_super+0x27/0x120
[11117.113854]  kill_block_super+0x2c/0x80
[11117.113858]  deactivate_locked_super+0x48/0x80
[11117.113862]  deactivate_super+0x5a/0x60
[11117.113866]  cleanup_mnt+0x3f/0x80
[11117.113868]  __cleanup_mnt+0x12/0x20
[11117.113874]  task_work_run+0x8a/0xb0
[11117.113881]  exit_to_usermode_loop+0xc4/0xd0
[11117.113885]  do_syscall_64+0x100/0x130
[11117.113887]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[11117.113891] RIP: 0033:0x7f0094384487
[11117.113893] RSP: 002b:00007fff4199efc8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[11117.113897] RAX: 0000000000000000 RBX: 0000000000944030 RCX: 00007f0094384487
[11117.113899] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000944210
[11117.113900] RBP: 0000000000944210 R08: 0000000000000000 R09: 0000000000000014
[11117.113902] R10: 00000000000006b2 R11: 0000000000000246 R12: 00007f009488d83c
[11117.113903] R13: 0000000000000000 R14: 0000000000000000 R15: 00007fff4199f250
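
The trace above shows umount waiting in writeback (wb_wait_for_completion under sync_filesystem). To count such reports across many clients, a small parser along these lines could be used. This is a sketch that assumes the kernel message format shown in the trace; the function name is made up for the example.

```python
import re

# Matches lines like:
# [11117.113618] INFO: task umount:9902 blocked for more than 120 seconds.
HUNG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: task (?P<comm>.+?):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)

def find_hung_tasks(dmesg_text):
    """Return (timestamp, command, pid, threshold_seconds) per hung-task report."""
    return [
        (float(m["ts"]), m["comm"], int(m["pid"]), int(m["secs"]))
        for m in HUNG_RE.finditer(dmesg_text)
    ]

sample = "[11117.113618] INFO: task umount:9902 blocked for more than 120 seconds."
for ts, comm, pid, secs in find_hung_tasks(sample):
    print(f"t={ts} task={comm} pid={pid} blocked>{secs}s")
```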
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx