On 07/26/2012 05:43 AM, Peng Tao wrote:
> On Thu, Jul 26, 2012 at 4:29 AM, Boaz Harrosh <bharrosh@xxxxxxxxxxx> wrote:
>> On 07/25/2012 05:43 PM, Peng Tao wrote:
>>
>>> On Wed, Jul 25, 2012 at 6:28 PM, Boaz Harrosh <bharrosh@xxxxxxxxxxx> wrote:
>>>> On 07/25/2012 10:31 AM, Peng Tao wrote:
>>>>
>>>>> Hi Boaz,
>>>>>
>>>>> Sorry about the long delay. I had some internal interrupt. Now I'm
>>>>> looking at the partial LD write problem again. Instead of trying to
>>>>> bail out on unaligned writes blindly, this time I want to fix the
>>>>> write code to handle partial writes as you suggested before.
>>>>> However, it seems to be more problematic than I used to think.
>>>>>
>>>>> The dirty range of a page passed to LD->write_pagelist may be
>>>>> unaligned to sector size, in which case the block layer cannot
>>>>> handle it correctly. Even worse, I cannot do a read-modify-write
>>>>> cycle within the same page, because the bio would read in the
>>>>> entire sector and thus ruin user data within the same sector.
>>>>> Currently I'm thinking of creating shadow pages for partial sector
>>>>> writes and using them to read in the sector and copy the necessary
>>>>> data into user pages. But it is way too tricky and I don't feel
>>>>> like it at all. So I want to ask how you solved the partial sector
>>>>> write problem in the object layout driver.
>>>>>
>>>>> I looked at the ore code and found that you are using bio to deal
>>>>> with partial page reads/writes as well. But in places like
>>>>> _add_to_r4w(), I don't see how partial sectors are handled. Maybe
>>>>> I was misreading the code. Would you please shed some light? More
>>>>> specifically, how does the object layout driver handle partial
>>>>> sector writes like in the simple test case below? Thanks in
>>>>> advance.
>>>>>
>>>>
>>>> The objlayout does not have this problem. OSD-SCSI is a byte-aligned
>>>> protocol, unlike DISK-SCSI.
>>>>
>>> Aha, I see. So this is a blocklayout-only problem.
>>>
>>>> The code you are looking for is at _add_to_r4w_first_page() &&
>>>> _add_to_r4w_last_page(). But as I said, I just submit a read of:
>>>> 0 => offset within the page
>>>> whatever that might be.
>>>>
>>>> In your case: why? All you have to do is allocate 2 sectors (1k) at
>>>> most: one for the partial sector at the end and one for the partial
>>>> sector at the beginning. Then use chained BIOs and memcpy at most
>>>> [1k - 2] bytes.
>>>>
>>>> What you do is chain a single-sector BIO to an all-aligned BIO.
>>>>
>>> Yeah, it is exactly what I mean by "shadow pages", except for the
>>> chained BIO part. I said "shadow pages" because I need to create one
>>> or two pages to construct a bio_vec to do the full-sector sync read,
>>> and the pages cannot be attached to the inode address space (that's
>>> why "shadow" :-).
>>>
>>> I asked because I don't like the solution and thought maybe there is
>>> a better method in the object layout, and I didn't find it in the
>>> object code. Now that it is a blocklayout-only problem, I guess I'll
>>> have to do the full-sector sync read tricks.
>>>
>>>> You do the following:
>>>>
>>>> - You will need to perform two reads, right? One for the unaligned
>>>>   BLOCK at the beginning and one for the BLOCK at the end, since in
>>>>   blocklayout all IO is BLOCK aligned.
>>>>
>>>> At the beginning of the IO:
>>>> - Jump over the first unaligned SECTOR. Prepare a BIO from the
>>>>   first full sector to the end of the BLOCK.
>>>> - Prepare a 1-biovec BIO from the above allocated sector, which
>>>>   reads the full first sector.
>>>> - Prepend the 1-vec BIO to the big one.
>>>> - Perform the read.
>>>> - memcpy the 0 => offset part from the above allocated sector into
>>>>   the original NFS page.
>>>>
>>>> Do the same for the end of the IO, but for the very last unaligned
>>>> sector. Chain the 1-vec BIO to the end this time, and memcpy the
>>>> last_byte => end-of-sector part.
>>>>
>>>> So you see: no shadow pages, and not so complicated. In the
>>>> unaligned case you need to allocate at most 1k and chain BIOs at
>>>> the beginning and/or at the end.
>>>>
>>>> Tell me if you need help with the BIO chaining. For the 1-vec BIO,
>>>> just use bio_kmalloc().
>>>>
>>> Yeah, I do have a question on the BIO chaining thing. IMO, I need to
>>> do one or two sync full-sector reads, and memcpy the data from those
>>> pages to fill the original NFS pages out to sector alignment. Only
>>> then can I issue the sector-aligned writes to write out all the NFS
>>> pages. So I don't quite get it when you say "prepend the 1-vec BIO
>>> to the big one", because the sector-aligned writes (the big one)
>>> must be submitted _after_ the full-sector sync reads and the memcpy.
>>> Would you explain it a bit?
>>>
>>
>> I'm not sure if that is what you meant, but I thought you need to
>> write, as part of the original IO, also the remainder of the last and
>> first BLOCKs.
>>
>> BLOCK means the unit set by the MDS as the atomic IO operation of any
>> IO. If less than a full BLOCK is written, then the read-layout needs
>> to be used to copy the unwritten parts of the BLOCK into the write
>> layout.
>>
> Not sure about objectlayout, but for block layout we really don't
> have to always read/write in BLOCK size. BLOCK is just the minimal
> trackable extent, and it is the extent state that we care about. If
> it is a read-write extent (which is the common case for rewrite), the
> blocklayout client can do IO of whatever size the underlying hardware
> supports (in the DISK-SCSI case, SECTOR size).
>
>> And that BLOCK can be bigger than a page (a multiple of pages) and
>> therefore also bigger than a sector (512 bytes).
>>
>> [In the objects layout RFC the stripe_unit is not mandated to be a
>> multiple of PAGE_SIZE, but if it is not, we return an error at
>> alloc_lseg and use the MDS. I hope it is the same for blocklayout: a
>> BLOCK bigger than PAGE_SIZE should be a multiple of it; if not,
>> revert to MDS IO.]
>>
>> So this is what I see. Say BLOCK is two pages.
>>
>> The complete IO will look like:
>>
>> .....|   block 0   ||   block 1   ||   block 2   ||   block 3   |......
>> .....|page 0|page 1||page 2|page 3||page 4|page 5||page 6|page 7|......
>>      ^         ^                                     ^          ^
>>      |         |<----------------------------------->|          |
>>      |    NFS-IO-start                          NFS-IO-end      |
>>      |         |                                     |          |
>>      |<-read I->|                                    |<-read II->|
>>      |<-------------------------------------------------------->|
>>   Written-IO-start                                 Written-IO-end
>>
>> Note that it is best if all these pages above are in the page cache
>> before the write operation; if not, it is asking for trouble.
>>
>> Let's zoom into the first block. (The same applies at the last block,
>> but mirrored.)
>>
>> .....|                        block 0                        |......
>> .....|          page 0           |          page 1           |......
>> .....| sec0 | sec1 | sec2 | sec3 | sec4 | sec5 | sec6 | sec7 |......
>>                                                   ^
>>                                                   |---------......
>>                                              NFS-IO-start
>>      |<--------------------read I-------------------->|
>>      |<-----------------BIO_A----------------->|
>>                                                |<->|   <---- memcpy-part
>>                                      BIO_B---> |<--->|
>>
>> (Sorry, I put 4 sectors per page; it is 8, but the principle is the
>> same.)
> Thanks a lot for the graph, it is very impressive and helps me a lot
> in understanding your idea.
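
[Aside: a minimal, untested sketch of the boundary arithmetic behind
the diagrams above. io_start, io_end and block_size are illustrative
names, not blocklayout code; round_up()/round_down() are the
power-of-two helpers from <linux/kernel.h>, and SECTOR size is taken
as 512.]

	u64 written_start = round_down(io_start, block_size); /* Written-IO-start */
	u64 written_end   = round_up(io_end, block_size);     /* Written-IO-end   */

	/* read I covers [written_start, first sector boundary at/after
	 * io_start); read II covers [last sector boundary at/before
	 * io_end, written_end). */
	u64 read1_end   = round_up(io_start, 512);
	u64 read2_start = round_down(io_end, 512);

	/* bytes to memcpy out of the two partial side-buffer sectors
	 * after the reads complete */
	u32 head_copy = io_start - round_down(io_start, 512); /* 0 => offset part */
	u32 tail_copy = round_up(io_end, 512) - io_end;       /* last_byte => end */
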
>
>> You cannot submit the read as one BIO into the original cache pages,
>> because sec6 above would need to be read in full, and this would
>> overwrite the good part of sec6 which has valid data.
>>
>> So you make one BIO_A with sec0-5, which points to the original
>> page-cache pages. You make a second BIO_B, which points to a side
>> buffer of the full sec6, and you chain them, i.e.:
>> BIO_A->bi_next = BIO_B (this is what I mean by post-pend)
>>
> As I explained above, the block layout client doesn't have to read
> sec0-5 if the extent is already read-write. Only when the extent is
> invalid and there is a copy-on-write extent does the client need to
> read in data from the cow extent. And the BIO chaining thing is
> really unnecessary IMHO. In cases where the client needs to read in
> from a cow extent, I can just use a single BIO to read in sec0-6 and
> memcpy sec4-5 and part of sec6 into the original NFS page.
>
> It's not complicated. I have already cooked the patch and will send
> it out later today after more testing. It's just that I don't like
> the solution, because I'll have to allocate more pages to construct
> the bio_vec to do the read. It is an extra effort, especially in the
> memory-reclaim writeback case. Maybe I should make sure single-page
> writeback doesn't go through the block layout LD.
>
>> - Now submit the one read, two BIOs chained.
>> - Do the same for the NFS-IO-end above, also one read, 2 BIOs
>>   chained.
>>
>> - Wait for both reads to return.
>>
>> - Then you memcpy sec6's 0 => offset%sector_size part into the
>>   original cache page.
>> - Same for the end part: last_byte to sector_end.
>>
>> - Now you can submit the full write.
>>
>> Both page 0 and page 1 can be marked as uptodate. But most
>> important, if page 0 was not in the cache before the preparation of
>> the read, it must be marked as PageUptodate().
>>
> Another thing is, this further complicates direct writes, where I
> cannot use the page cache to ensure proper locking for concurrent
> writers in the same BLOCK, and sector-aligned partial-BLOCK DIO
> writes need to be serialized internally. IOW, the same code cannot be
> reused by DIO writes. Sigh...
>

Crap, you did not understand my idea. In my plan all IO is done on
page-cache pages and/or NFS pages, *ALL*. Even in the sec6 case above,
page 1 is directly IOed and locked normally; only the single sec6 is
not.

You go ahead and say, "Yes, I have a solution just like yours that
allocates multiple pages and IOs and copies", "but I don't like the
allocations ...". But this is exactly the opposite of my plan. In my
plan you only allocate *at most* 2 sectors. If you are concerned about
memory pressure, just make a mempool of 512-byte units with 2 spare,
and you are done. (That's how SCSI works.)

My demonstration above was for the worst case, where "the extent is
invalid and there is a copy-on-write extent". Of course, when you
don't need that, all you need is the single-sector read of the above
BIO_B, and the copy.

All during the IO, the pages are locked as before, specifically
page 1 above, which holds sec6.

I will not continue with these explanations, because clearly you are
not listening to me, and you have your own code in mind, so what is
the use?

Good luck
Boaz
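
[Aside: a rough, untested sketch of the side-buffer single-sector read
plus the mempool suggested above. All names here (bl_sector_pool,
bl_fill_partial_sector, ...) are illustrative, not real blocklayout
code, and the 3.x-era block APIs (bio_kmalloc(), bi_sector,
BIO_UPTODATE) are assumed. For simplicity the one-sector BIO is
submitted on its own and waited for, rather than chained to the big
BIO via bi_next.]

	#include <linux/bio.h>
	#include <linux/blkdev.h>
	#include <linux/fs.h>
	#include <linux/mempool.h>
	#include <linux/completion.h>
	#include <linux/mm.h>

	#define BL_SECTOR_SIZE 512

	/* the "mempool of 512-byte units, 2 spare" from above */
	static mempool_t *bl_sector_pool;

	static int bl_init_sector_pool(void)
	{
		bl_sector_pool = mempool_create_kmalloc_pool(2, BL_SECTOR_SIZE);
		return bl_sector_pool ? 0 : -ENOMEM;
	}

	static void bl_partial_read_end_io(struct bio *bio, int err)
	{
		complete(bio->bi_private);
	}

	/*
	 * Read the full sector containing @offset into a side buffer,
	 * then copy the 0 => offset%512 head of that sector into the
	 * original page-cache @page (the "memcpy-part" of sec6 above),
	 * so the following sector-aligned write does not lose it.
	 * Assumes a lowmem page; a highmem setup would need kmap().
	 */
	static int bl_fill_partial_sector(struct block_device *bdev,
					  struct page *page, loff_t offset)
	{
		DECLARE_COMPLETION_ONSTACK(done);
		unsigned int in_sector = offset & (BL_SECTOR_SIZE - 1);
		unsigned int pg_off = offset & (PAGE_SIZE - 1);
		struct bio *bio;
		void *buf;
		int err = 0;

		if (!in_sector)
			return 0;	/* offset already sector-aligned */

		/* a __GFP_WAIT allocation from a mempool cannot fail */
		buf = mempool_alloc(bl_sector_pool, GFP_NOFS);

		bio = bio_kmalloc(GFP_NOFS, 1);	/* the 1-vec BIO_B */
		if (!bio) {
			err = -ENOMEM;
			goto out_free;
		}
		bio->bi_bdev = bdev;
		bio->bi_sector = offset >> 9;	/* sector holding @offset */
		bio->bi_end_io = bl_partial_read_end_io;
		bio->bi_private = &done;
		bio_add_page(bio, virt_to_page(buf), BL_SECTOR_SIZE,
			     offset_in_page(buf));

		submit_bio(READ, bio);
		wait_for_completion(&done);

		if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
			err = -EIO;
		else	/* splice the valid head of the sector back in */
			memcpy(page_address(page) + pg_off - in_sector,
			       buf, in_sector);
		bio_put(bio);
	out_free:
		mempool_free(buf, bl_sector_pool);
		return err;
	}

The mirrored tail case (the last_byte => end-of-sector part) would do
the same with the last partial sector, copying the tail of the side
buffer instead of the head.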