Trying to cc the GNU parted and linux-block mailing lists.

On 9 February 2016 at 13:02, Jens Rosenboom <j.rosenboom@xxxxxxxx> wrote:
> While trying to reproduce some performance issues I have been seeing
> with Ceph, I have come across a strange behaviour which is seemingly
> affected only by the end point (and thereby the size) of a partition
> being an odd number of sectors. Since all documentation about
> alignment only refers to the starting point of the partition, this was
> pretty surprising and I would like to know whether this is expected
> behaviour or maybe a kernel issue.
>
> The command I am using is pretty simple:
>
> fio --rw=randwrite --size=1G --fdatasync=1 --bs=4k --filename=/dev/sdb2 --runtime=10 --name=test
>
> The difference shows itself when the partition is created either by
> sgdisk or by parted:
>
> sgdisk --new=2:6000M: /dev/sdb
>
> parted -s /dev/sdb mkpart osd-device-1-block 6291456000B 100%
>
> The difference in the partition table looks like this:
>
> < 2 6291456000B 1600320962559B 1594029506560B osd-device-1-block
> ---
> > 2 6291456000B 1600321297919B 1594029841920B osd-device-1-block

Looks like parted took you at your word when you asked for your
partition at 100%. Just out of curiosity, if you try to make the same
partition interactively with parted, do you get any warnings after
making it and after running align-check?

> So this is really only the end of the partition that is different.
> However, in the first case, the 4k writes all get broken up into 512b
> writes somewhere in the kernel, as can be seen with btrace:
>
> 8,16 3 36 0.000102666 8184 A WS 12353985 + 1 <- (8,18) 65985
> 8,16 3 37 0.000102739 8184 Q WS 12353985 + 1 [fio]
> 8,16 3 38 0.000102875 8184 M WS 12353985 + 1 [fio]
> 8,16 3 39 0.000103038 8184 A WS 12353986 + 1 <- (8,18) 65986
> 8,16 3 40 0.000103109 8184 Q WS 12353986 + 1 [fio]
> 8,16 3 41 0.000103196 8184 M WS 12353986 + 1 [fio]
> 8,16 3 42 0.000103335 8184 A WS 12353987 + 1 <- (8,18) 65987
> 8,16 3 43 0.000103403 8184 Q WS 12353987 + 1 [fio]
> 8,16 3 44 0.000103489 8184 M WS 12353987 + 1 [fio]
> 8,16 3 45 0.000103609 8184 A WS 12353988 + 1 <- (8,18) 65988
> 8,16 3 46 0.000103678 8184 Q WS 12353988 + 1 [fio]
> 8,16 3 47 0.000103767 8184 M WS 12353988 + 1 [fio]
> 8,16 3 48 0.000103879 8184 A WS 12353989 + 1 <- (8,18) 65989
> 8,16 3 49 0.000103947 8184 Q WS 12353989 + 1 [fio]
> 8,16 3 50 0.000104035 8184 M WS 12353989 + 1 [fio]
> 8,16 3 51 0.000104150 8184 A WS 12353990 + 1 <- (8,18) 65990
> 8,16 3 52 0.000104219 8184 Q WS 12353990 + 1 [fio]
> 8,16 3 53 0.000104307 8184 M WS 12353990 + 1 [fio]
> 8,16 3 54 0.000104452 8184 A WS 12353991 + 1 <- (8,18) 65991
> 8,16 3 55 0.000104520 8184 Q WS 12353991 + 1 [fio]
> 8,16 3 56 0.000104609 8184 M WS 12353991 + 1 [fio]
> 8,16 3 57 0.000104885 8184 I WS 12353984 + 8 [fio]
>
> whereas in the second case, I'm getting the expected 4k writes:
>
> 8,16 6 42 1266874889.659842036 8409 A WS 12340232 + 8 <- (8,18) 52232
> 8,16 6 43 1266874889.659842167 8409 Q WS 12340232 + 8 [fio]
> 8,16 6 44 1266874889.659842393 8409 G WS 12340232 + 8 [fio]

This is weird because --size=1G should mean that fio is "seeing" an
aligned end. Does direct=1 with a sequential job of iodepth=1 show the
problem too?

> The above examples are from running with an SSD, where the small
> writes get merged together again before hitting the block device,
> which is still pretty o.k. performance wise.
> But when I run the same
> test on some NVMe device, the writes do not get merged, instead the
> performance drops to less than 10% of what I get in the second case.

Perhaps the I/O scheduler doesn't get the opportunity to merge them
with the NVMe device...

> If this is indeed expected behaviour from the kernel pov, it might
> need some better documentation and probably sgdisk should also be
> enhanced to align the end of the partition as well. FWIW, this happens
> on a stock 4.4.0 kernel as well as recent Ubuntu and CentOS kernels.

Do you mean parted?

--
Sitsofe | http://sucs.org/~sits/
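
For reference, here is a quick sanity check of the numbers in the quoted partition table diff. It is only a sketch: the 512-byte sector size is an assumption (based on the "+ 1" requests in the trace), 4096 comes from --bs=4k, and the helper name describe is made up for illustration. It just does the arithmetic on the start and end offsets; it does not query any device.

```python
SECTOR = 512   # assumed logical sector size (the trace shows "+ 1" sector writes)
BLOCK = 4096   # fio block size from --bs=4k

# (label, start byte, inclusive end byte), copied from the partition table diff
PARTITIONS = [
    ("end 1600320962559B", 6291456000, 1600320962559),
    ("end 1600321297919B", 6291456000, 1600321297919),
]


def describe(start, end):
    """Return size and alignment facts for a partition given in bytes."""
    size = end - start + 1          # the end offset is inclusive
    sectors = size // SECTOR
    return {
        "size_bytes": size,
        "sectors": sectors,
        "odd_sector_count": sectors % 2 == 1,
        "start_4k_aligned": start % BLOCK == 0,
        "size_4k_multiple": size % BLOCK == 0,
    }


if __name__ == "__main__":
    for label, start, end in PARTITIONS:
        print(label, describe(start, end))
```

Both entries start on a 4k boundary. Only the one ending at 1600321297919B has an odd sector count (3113339535), so its size is not a multiple of 4k, which matches the "odd number of sectors" description in the original report.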