Re: raid456's devices_handle_discard_safely is unusably slow

On Wed, Feb 6, 2019 at 9:01 AM Michal Soltys <soltys@xxxxxxxx> wrote:
>
> On 1/30/19 5:11 PM, Michal Soltys wrote:
> > On 19/01/28 19:44, Michal Soltys wrote:
> >> On 1/28/19 5:57 PM, Song Liu wrote:
> >>
> >> <cut>
> >>
> >
> > I looked a bit deeper at raid10 and raid5 (4x32g) logs, and the behavior
> > is just really weird:
> >
> > 1) r10, blkdiscard
> >
> > blkdiscard itself submits a device-long discard via ioctl, which is then
> > split into 8388607-sector parts. Further down these are split into:
> >
> > - 8191 x 1024s, 1023s
> > - 8192 x 1s every 1024s, then going backwards from 8g to 4g: 1022s, 8191
> > x 1023s
> > - 8192 x 2s, then backwards: 1021s, 8191 x 1022s
> > ....
> > - remainder of the device: 8065 x 15s, then backwards: 8064 x 1009s
> >
> > Anything but the first 4g is completely unmergeable. Afterwards, why is it
> > sending single-sector discards (then 2, then 3) every 1024s, then filling
> > up the rest of those 1024s going backwards?
> >
> > For the record, if I force blkdiscard to use a power-of-2 aligned step, it
> > works without the weird small/backwards approach.
> >
> > 2) r10, fstrim
> >
> > While it works notably faster on an empty fs (still very long - nearly
> > 1 minute), the splits are weirdly sized: 648s + 376s, 952s + 72s
> > (smaller ones going backwards as well), so those are not mergeable
> > either. Lots of full 1024s ones, though.
> >
> > In comparison, fstrim on a single partition of the same size takes ~1.6s,
> > with large discards passing through pretty much 1:1.
> >
> >
> > 3) r5, blkdiscard
> >
> > Now for the raid5 case - while the behavior seems cleaner there (no
> > unusual splits), the unusually precise 10ms delays between each discard
> > completion are the main culprit as far as I can see. While the 4k splits
> > (which then get merged back into chunk-sized pieces) take their toll,
> > that's a small footprint in comparison.
> >
>
> Anyway,
>
> Song, do you have any suggestions or comments about these results (or do
> you need more specific tests, while I can still run them)?

Hi Michal,

I haven't had much time to look into this. It is probably not easy to fix in
the md layer. How about a workaround like the following:

1. trim each device;
2. create RAID volume;
3. skip trim at mkfs time (mkfs.xfs -K or equivalent)
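A rough sketch of those steps (device names and the 4-disk raid5 layout below
are placeholders based on your test setup; xfs is just one example):

    # 1. trim each member device before assembly
    blkdiscard /dev/sda
    blkdiscard /dev/sdb
    blkdiscard /dev/sdc
    blkdiscard /dev/sdd

    # 2. create the RAID volume on the pre-trimmed devices
    mdadm --create /dev/md0 --level=5 --raid-devices=4 \
          /dev/sda /dev/sdb /dev/sdc /dev/sdd

    # 3. skip the discard pass at mkfs time
    mkfs.xfs -K /dev/md0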

Thanks,
Song


