On 2/8/19 6:13 PM, Song Liu wrote:
On Wed, Feb 6, 2019 at 9:01 AM Michal Soltys <soltys@xxxxxxxx> wrote:
On 1/30/19 5:11 PM, Michal Soltys wrote:
On 19/01/28 19:44, Michal Soltys wrote:
On 1/28/19 5:57 PM, Song Liu wrote:
<cut>
I looked a bit deeper at raid10 and raid5 (4x32g) logs, and the behavior
is just really weird:
1) r10, blkdiscard
blkdiscard itself submits a device-long discard via ioctl, which is then
split into 8388607-sector-long parts. Further down, these are split into:
- 8191 x 1024s, 1023s
- 8192 x 1s every 1024s, then going backwards from 8g to 4g: 1022s, 8191
x 1023s
- 8192 x 2s, then backwards: 1021s, 8191 x 1022s
....
- remainder of the device: 8065 x 15s, then backwards: 8064 x 1009s
Anything but the first 4g is completely unmergeable. Afterwards, why is it
sending single-sector discards (then 2s, then 3s) every 1024s, and then
filling up the rest of those 1024s going backwards?
For the record, if I force blkdiscard to use a power-of-2 aligned step, it
works w/o the weird small/backwards approach.
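For reference, a rough back-of-the-envelope on that split size - assuming
the 8388607s limit is the generic UINT_MAX>>9 cap on a single discard and
the default 512 KiB (1024s) chunk:

  echo $(( (2**32 - 1) >> 9 ))   # 8388607 sectors per piece
  echo $(( 8388607 * 512 ))      # 4294966784 bytes = 4 GiB - 512 B
  echo $(( 8388607 % 1024 ))     # 1023 -> each piece ends one sector short
                                 #   of a chunk boundary, so every later
                                 #   piece starts chunk-misaligned

which would at least be consistent with only the first 4g staying mergeable.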
2) r10, fstrim
While it works notably faster on an empty fs (still very long - nearly
1 minute), the splits are weirdly sized: 648s + 376s, 952s + 72s (the
smaller ones going backwards as well), so those are not mergeable either.
Lots of full 1024s ones, though.
In comparison, fstrim on a single partition of the same size takes ~1.6s,
with large discards going through pretty much 1:1.
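For completeness, both timings above came from something along the lines
of (mount points are placeholders):

  time fstrim -v /mnt/md-test      # fs on the raid10 array
  time fstrim -v /mnt/single-test  # fs on a plain partition of the same size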
3) r5, blkdiscard
Now the raid5 case - while the behavior seems cleaner there (no unusual
splits), the unusually precise 10ms delays between each discard completion
are the main culprit as far as I can see. While the 4k splits (which then
get merged back into chunk-sized pieces) take their toll, that's a small
footprint in comparison.
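If it helps to reproduce this, the split sizes and the suspiciously even
spacing of the completions can be watched with blktrace while the
blkdiscard is running, something like:

  # trace only discard events on the array; the timestamps of the C
  # (complete) lines show the ~10ms gaps
  blktrace -a discard -d /dev/md/test -o - | blkparse -i -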
Anyway,
Song, do you have any suggestions or comments about these results (or do
you need more specific tests run, while I can still do them)?
Hi Michal,
I haven't had much time to look into this. It is probably not easy to fix
in the md layer. How about a workaround like:
1. trim each device;
2. create RAID volume;
3. skip trim at mkfs time (mkfs.xfs -K or equivalent)
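A rough sketch of that sequence, with placeholder device names and an
example 4-disk raid10 (--assume-clean to skip the initial sync, as in the
reply below):

  # 1. trim each member device first
  for dev in /dev/sd[abcd]1; do blkdiscard "$dev"; done
  # 2. create the array on the pre-trimmed members
  mdadm --create /dev/md/test --level=10 --raid-devices=4 \
        --assume-clean /dev/sd[abcd]1
  # 3. make the fs without issuing another full-device discard
  mkfs.xfs -K /dev/md/test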
Well, of course that's generally what I will be doing - pre-trim before
use, overprovision, create the raid with --assume-clean, mkfs, then forget
that any kind of discard exists.
Still - raid456's own issues aside - the raid10 behavior (like sending
single-sector requests, one per chunk, over a 4gb piece) is very puzzling.
Worth noting is that explicitly specified multiple-of-chunk-size steps
work ok over raid10, e.g.:
# time blkdiscard -p $((2*1048576*1024)) /dev/md/test
real 0m5.288s
# time blkdiscard -p $((2*1048576*1024-524288)) /dev/md/test
real 0m5.359s
But something else:
# time blkdiscard -p $((2*1048576*1024-262144)) /dev/md/test
real 10m57.435s
# time blkdiscard -p $((2*1048576*1024-512)) /dev/md/test
real 11m41.233s
# equivalent of 8388607 sectors
# time blkdiscard /dev/md/test
real 11m12.215s
(along with 5 dmesg complaints about hung tasks in the bad cases)
So there is something really weird in how md treats some of the discard
sizes.
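Put next to an assumed default 512 KiB (1024s) chunk, the fast steps above
are exact chunk multiples and the slow ones are not:

  chunk=524288                                    # assumed 512 KiB chunk
  echo $(( (2*1048576*1024)          % chunk ))   # 0      -> ~5s
  echo $(( (2*1048576*1024 - 524288) % chunk ))   # 0      -> ~5s
  echo $(( (2*1048576*1024 - 262144) % chunk ))   # 262144 -> ~11 min
  echo $(( (2*1048576*1024 - 512)    % chunk ))   # 523776 -> ~11.5 min
  echo $(( (8388607 * 512)           % chunk ))   # 523776 -> ~11 min (default)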