Neil Brown <neilb@xxxxxxx> writes:

> On Tuesday May 26, goswin-v-b@xxxxxx wrote:
>> Neil Brown <neilb@xxxxxxx> writes:
>>
>> > On Monday May 25, goswin-v-b@xxxxxx wrote:
>> >> That really seems to scream for LVM to support more raid levels. It
>> >> already has linear, raid0 and raid1 support (although I have no idea
>> >> how device mapper raid1 compares to md raid1).
>> >
>> > Note that LVM (a suite of user-space tools) could conceivably use
>> > md/raid1, md/raid5 etc. The functionality doesn't have to go in dm.
>> >
>> > Neil
>>
>> How would you do this? Worst case you can have an LV made up of
>> totally non-linear PEs, meaning lots of 4MB chunks (the default PE
>> size) in random order on random disks.
>>
>> Do you create a raid1/5 for each stripe? You'd surely run out of md
>> devices.
>
> We have 2^21 md devices easily (I think that is the number) and it
> wouldn't be hard to have more if that were an issue.
>
>> Create dm mappings for all stripe 0s, stripe 1s, stripe 2s, ... and
>> then a raid1/5 over those stripe devices?
>
> That might be an option.
>
>> What if the LV has segments with different raid configurations
>> (number of disks in a stripe or even different levels)? Create a raid
>> for each segment and then a dm mapping for a linear raid?
>
> Yes.
>
>> You can get a flood of intermediate devices there. A /proc/mdstat
>> with 200 entries would be horrible. iostat output would be totally
>> useless. ...
>
> Yep, these would be interesting problems to solve. /proc/mdstat is a
> bit of a wart on the design - making the entry in /proc/mdstat
> optional might be a good idea.

Resyncing in a way that uses parallelism without using a physical
device twice would also be difficult without merging all those layers
into one or peeking through them. The raid code doesn't see what
physical devices are inside a device-mapper device, and so on. Plus I
do want ONE entry in /proc/mdstat (or equivalent) to see how a resync
is going, just not 200. So it is not just about hiding but also about
showing something sensible.

> As for iostat - where does it get info from? /proc/partitions?
> /proc/diskinfo? Maybe /sys/block?
> Either way, we could probably find a way to say "this block device is
> 'hidden'".

One of those places.

> If you want to be able to slice and dice lots of mini-raid arrays into
> an LVM system, then whatever way you implement it you will need to be
> keeping track of all those bits. I think it makes most sense to use
> the "block device" as the common abstraction, then if we start finding
> issues: solve them. That way the solutions become available for
> others to use in ways we hadn't expected.

I think the device-mapper tables should suffice. They are perfect for
slice-and-dice operations. This should really sidestep the block device
overhead (allocating a major/minor, sending events; not runtime
overhead) and combine the status of many slices into one combined
status.
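To make that concrete (a rough sketch only, all device and file names
below are made up, lengths are in 512-byte sectors so one 4MB PE is
8192 sectors): with the block device as the common abstraction, every
slice becomes its own little md array and the LV is just a dm "linear"
table stacked over them, something like

  # one tiny md array per slice (hypothetical devices)
  mdadm --create /dev/md10 --level=1 --raid-devices=2 /dev/sdb1 /dev/sdc1
  mdadm --create /dev/md11 --level=1 --raid-devices=2 /dev/sdd1 /dev/sde1

  # dm table lines are "start length target args"
  echo "0 8192 linear /dev/md10 0"    >  lv_test.table
  echo "8192 8192 linear /dev/md11 0" >> lv_test.table
  dmsetup create lv_test lv_test.table

The table already does all the slicing and dicing; what I want to avoid
is the extra per-slice block device (and /proc/mdstat entry) that each
of those md arrays drags along.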
I see one problem though for converting md code to dm code: the
metadata. In LVM every PE is basically independent and can be moved
around at will, so the raid code must be able to split and merge raid
devices at PE granularity at least. Specifically, the dirty/clean
information and the serial counts are tricky. There would be two
options:

1) Put a little bit of metadata at the start of every PE. The first
block of each PE could also hold an internal bitmap for that PE, not
just a few meta infos and the clean/dirty byte. For internal bitmaps
this might be optimal, as it would guarantee short seeks to reach the
bits.

2) Have detached metadata. Md already has detached bitmaps. Think of it
as a raid without metadata but with an external bitmap.

>> MfG
>>         Goswin

MfG
        Goswin
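PS: For option 2, something close to this already works for a whole
array today, if I remember the mdadm options correctly (names made up,
just a sketch assuming file-based write-intent bitmaps):

  # external bitmap in an ordinary file, kept off the array itself
  mdadm --create /dev/md20 --level=1 --raid-devices=2 \
        --bitmap=/var/lib/raid/md20-bitmap /dev/sdf1 /dev/sdg1

Build mode (mdadm --build) would even drop the superblock, which is
about as close to "no metadata, external bitmap only" as md gets right
now. The open question is doing the same per PE instead of per array.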