On 2/8/19 6:13 PM, Song Liu wrote:
On Wed, Feb 6, 2019 at 9:01 AM Michal Soltys <soltys@xxxxxxxx> wrote:
On 1/30/19 5:11 PM, Michal Soltys wrote:
On 19/01/28 19:44, Michal Soltys wrote:
On 1/28/19 5:57 PM, Song Liu wrote:
<cut>
I looked a bit deeper at raid10 and raid5 (4x32g) logs, and the behavior
is just really weird:
1) r10, blkdiscard
blkdiscard itself submits a device-long discard via ioctl, which is then
split into 8388607-sector-long parts. Further down, these are split into:
- 8191 x 1024s, 1023s
- 8192 x 1s every 1024s, then going backwards from 8g to 4g: 1022s, 8191
x 1023s
- 8192 x 2s, then backwards: 1021s, 8191 x 1022s
....
- remainder of the device: 8065 x 15s, then backwards: 8064 x 1009s
Anything but the first 4g is completely unmergeable. Afterwards, why is it
sending single-sector discards (then 2s, then 3s) every 1024s, and then
filling up the rest of those 1024s going backwards?
For the record, if I force blkdiscard to use a power-of-2 aligned step, it
works w/o the weird small/backwards approach.
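For reference, a rough back-of-the-envelope on that split size - assuming
the 8388607s limit is the generic UINT_MAX>>9 cap on a single discard and
the default 512 KiB (1024s) chunk:

  echo $(( (2**32 - 1) >> 9 ))   # 8388607 sectors per piece
  echo $(( 8388607 * 512 ))      # 4294966784 bytes = 4 GiB - 512 B
  echo $(( 8388607 % 1024 ))     # 1023 -> each piece ends one sector short
                                 #   of a chunk boundary, so every later
                                 #   piece starts chunk-misaligned

which would at least be consistent with only the first 4g staying mergeable.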
2) r10, fstrim
While it works notably faster on an empty fs (still very long - nearly
1 minute), the splits are weirdly sized: 648s + 376s, 952s + 72s (the
smaller ones going backwards as well), so those are not mergeable either.
Lots of full 1024s ones, though.
In comparison, fstrim on a single partition of the same size takes ~1.6s,
with large discards going through pretty much 1:1.
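For completeness, both timings above came from something along the lines
of (mount points are placeholders):

  time fstrim -v /mnt/md-test      # fs on the raid10 array
  time fstrim -v /mnt/single-test  # fs on a plain partition of the same size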
3) r5, blkdiscard
Now the raid5 case - while the behavior seems cleaner there (no unusual
splits), the unusually precise 10ms delays between each discard completion
are the main culprit as far as I can see. While the 4k splits (which then
get merged back into chunk-sized pieces) take their toll, that's a small
footprint in comparison.
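If it helps to reproduce this, the split sizes and the suspiciously even
spacing of the completions can be watched with blktrace while the
blkdiscard is running, something like:

  # trace only discard events on the array; the timestamps of the C
  # (complete) lines show the ~10ms gaps
  blktrace -a discard -d /dev/md/test -o - | blkparse -i -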
Anyway,
Song, do you have any suggestions or comments about these results (or do
you need more specific tests run, while I can still do them)?
Hi Michal,
I haven't had much time to look into this. It is probably not easy to fix
in the md layer. How about a workaround like:
1. trim each device;
2. create RAID volume;
3. skip trim at mkfs time (mkfs.xfs -K or equivalent)
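A rough sketch of that sequence, with placeholder device names and an
example 4-disk raid10 (--assume-clean to skip the initial sync, as in the
reply below):

  # 1. trim each member device first
  for dev in /dev/sd[abcd]1; do blkdiscard "$dev"; done
  # 2. create the array on the pre-trimmed members
  mdadm --create /dev/md/test --level=10 --raid-devices=4 \
        --assume-clean /dev/sd[abcd]1
  # 3. make the fs without issuing another full-device discard
  mkfs.xfs -K /dev/md/test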
Well, of course that's generally what I will be doing - pre-trim before
use, overprovision, create the raid with --assume-clean, mkfs, then forget
that any kind of discard exists.
Still - raid456's own issues aside - the raid10 behavior (like sending
single-sector requests, one per chunk, over a 4gb piece) is very puzzling.
Worth noting is that explicitly specified multiple-of-chunk-size steps
work ok over raid10, e.g.:
# time blkdiscard -p $((2*1048576*1024)) /dev/md/test
real 0m5.288s
# time blkdiscard -p $((2*1048576*1024-524288)) /dev/md/test
real 0m5.359s
But something else:
# time blkdiscard -p $((2*1048576*1024-262144)) /dev/md/test
real 10m57.435s
# time blkdiscard -p $((2*1048576*1024-512)) /dev/md/test
real 11m41.233s
# equivalent of 8388607 sectors
# time blkdiscard /dev/md/test
real 11m12.215s
(along with 5 dmesg complaints about hung tasks in the bad cases)
So there is something really weird in how md treats some of the discard
sizes.
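Put next to an assumed default 512 KiB (1024s) chunk, the fast steps above
are exact chunk multiples and the slow ones are not:

  chunk=524288                                    # assumed 512 KiB chunk
  echo $(( (2*1048576*1024)          % chunk ))   # 0      -> ~5s
  echo $(( (2*1048576*1024 - 524288) % chunk ))   # 0      -> ~5s
  echo $(( (2*1048576*1024 - 262144) % chunk ))   # 262144 -> ~11 min
  echo $(( (2*1048576*1024 - 512)    % chunk ))   # 523776 -> ~11.5 min
  echo $(( (8388607 * 512)           % chunk ))   # 523776 -> ~11 min (default)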