Disabling Command Completion Coalescing (CCC) in SATA AHCI

Pallav Bose <pallavbose@xxxxxxxxx> · Thu, 19 May 2011 16:32:47 -0400

Hello,

I'm working on version 2.6.35.9 of the Linux kernel and am trying to
disable Command Completion Coalescing (CCC). I have Native Command
Queuing (NCQ) enabled by activating the RAID mode through the BIOS.

I was looking at the Serial ATA AHCI 1.3 Specification, which states on
page 115:

The CCC feature is only in use when CCC_CTL.EN is set to ‘1’. If
CCC_CTL.EN is set to ‘0’, no CCC interrupts shall be generated.

Next, I had a look at the relevant code (namely, the AHCI files) for
this version of the kernel but wasn't able to make any progress. I
found the enum constant HOST_CAP_CCC = (1 << 7) in drivers/ata/ahci.h,
but I'm not sure how it should be modified to disable command
coalescing. I tried setting HOST_CAP_CCC to 0, but the experiments
described below show that responses are still being batched.
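
To make the distinction I am unsure about concrete: as far as I can
tell from the AHCI 1.3 specification, HOST_CAP_CCC only names a
read-only capability bit (CAP.CCCS, bit 7), whereas the enable bit
quoted above is CCC_CTL.EN (bit 0 of the CCC_CTL register at offset
14h). The following is only a user-space sketch that models the two
registers as plain 32-bit values. The names CCC_CTL_EN, ccc_supported()
and ccc_disable() are hypothetical and are not taken from the 2.6.35
driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit positions as I read them in the AHCI 1.3 specification. */
#define HOST_CAP_CCC (1u << 7) /* CAP.CCCS: CCC is supported (read-only) */
#define CCC_CTL_EN   (1u << 0) /* CCC_CTL.EN: CCC is in use (hypothetical name) */

/* Returns nonzero if the capability register value advertises CCC. */
int ccc_supported(uint32_t cap)
{
    return (cap & HOST_CAP_CCC) != 0;
}

/*
 * Clears only the EN bit of a modelled CCC_CTL register and leaves the
 * other fields untouched. In a real driver this would be a readl()/writel()
 * pair on the HBA's memory-mapped registers, not a pointer dereference.
 */
void ccc_disable(volatile uint32_t *ccc_ctl)
{
    assert(ccc_ctl != NULL);
    *ccc_ctl &= ~CCC_CTL_EN;
}
```

If this reading is right, changing the value of the HOST_CAP_CCC
constant only changes which capability bit the driver tests and never
touches CCC_CTL.EN.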

I conducted an experiment wherein I issued requests of size 64KB from
my driver code. 64KB corresponds to 128 sectors (each sector = 512
bytes).
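
The arithmetic behind the sector numbers used in this experiment can be
sketched as follows. I am assuming here that the requests are laid out
back to back starting at sector 0; the function names are made up for
illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512u /* bytes per sector, as stated above */

/* Converts a request size in bytes to a whole number of sectors. */
uint32_t bytes_to_sectors(uint32_t bytes)
{
    assert(bytes % SECTOR_SIZE == 0);
    return bytes / SECTOR_SIZE;
}

/*
 * Last sector touched by the n-th request (n starts at 0), assuming the
 * requests are contiguous and the first one starts at sector 0.
 */
uint64_t last_sector(uint32_t n, uint32_t sectors_per_request)
{
    assert(sectors_per_request > 0);
    return ((uint64_t)n + 1) * sectors_per_request - 1;
}
```

With 64KB requests, bytes_to_sectors(64 * 1024) gives 128, and
last_sector() yields 127, 255, 383 and so on for successive requests.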

When I look at the "response timestamp differences", here is what I find:

Completion timestamp of | minus that of | Difference (microseconds)
------------------------+---------------+--------------------------
Sector 255              | Sector 127    |   510
Sector 383              | Sector 255    |  3068
Sector 511              | Sector 383    |    22
Sector 639              | Sector 511    |    22
Sector 767              | Sector 639    |    12
Sector 895              | Sector 767    |    19
Sector 1023             | Sector 895    |    13
Sector 1151             | Sector 1023   |   402

As you can see, the _response timestamp_ differences suggest that
several write completions are being coalesced and reported by a
single interrupt, which would explain the very small gaps (tens of
microseconds) between some consecutive completions.
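
One way to make "really low" precise is to count the gaps that fall
below a cutoff. The sketch below does that in plain user-space C. The
100 microsecond cutoff in the usage note is an arbitrary value I picked
for illustration, not something derived from the drive or the
specification.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Counts how many inter-completion gaps are shorter than threshold_us.
 * Gaps that short are unlikely to be separate media writes, so they are
 * candidates for completions that were reported together.
 */
size_t count_short_gaps(const unsigned long *gaps_us, size_t n,
                        unsigned long threshold_us)
{
    size_t i, count = 0;

    assert(gaps_us != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (gaps_us[i] < threshold_us)
            count++;
    }
    return count;
}
```

Applied to the eight differences listed above with a 100 microsecond
cutoff, this counts 5 short gaps (22, 22, 12, 19 and 13).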

Clearly, there is some interrupt batching involved here which I need
to disable so that an interrupt is raised for each and every write
request. Will disabling CCC do the trick, or is there some more
complexity involved?

And yes, I did disable the write cache and a few other caches as well
using the following commands:

hdparm -a0 -W0 /dev/sdd;
hdparm -m0 --yes-i-know-what-i-am-doing /dev/sdd;
hdparm -A0 /dev/sdd;

Here is another experiment that I tried.

I create a bio structure in my driver and call the __make_request()
function of the lower level driver. Only one 2560-byte write request
is sent from my driver.

Once this write is serviced, an interrupt is generated which is
handled via do_IRQ(). Eventually, the function blk_complete_request()
is called. Keep in mind that we are still in the top half of the
interrupt handler (i.e., interrupt context, not process context). Now,
we compose another struct bio in blk_complete_request() and call the
__make_request() function of the lower level driver. We record a
timestamp at this point (say T_0). When the request completion
callback is invoked, we record another timestamp (call it T_1). The
difference T_1 - T_0 is always above 1 millisecond. This experiment
was repeated numerous times, and each time the difference T_1 - T_0
depended on the destination sector. When the destination sectors of
consecutive requests are separated by approximately 350 sectors, the
difference is about 1.2 milliseconds for requests of size 2560 bytes.

Every time, the next write request is sent only when the previous
request has been serviced. So, all these requests are chained and the
disk has to service only one request at a time.

My understanding is that since the destination sectors of consecutive
requests are separated by a fairly large amount, by the time the next
request is issued, the requested sector should be almost under the
disk head. The write should therefore happen immediately, and
T_1 - T_0 should be small (under 1 millisecond at the very least).
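
To put rough numbers on this reasoning, the sketch below estimates how
long a run of sectors takes to pass under the head. It rests on two
assumptions of mine: that the drive spins at 7200 RPM (which is my
understanding for the WD1001FALS) and a sectors-per-track figure, which
I do not know and which varies across zones of the platter. The value
1000 used in the usage note is purely hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Duration of one full platter revolution, truncated to whole microseconds. */
uint64_t revolution_us(uint32_t rpm)
{
    assert(rpm > 0);
    return 60000000ull / rpm;
}

/*
 * Time for `sectors` consecutive sectors of a single track to pass under
 * the head, truncated to whole microseconds. Ignores seeks, head switches
 * and command overhead.
 */
uint64_t sweep_us(uint32_t rpm, uint32_t sectors, uint32_t sectors_per_track)
{
    assert(rpm > 0 && sectors_per_track > 0);
    return (uint64_t)sectors * 60000000ull /
           ((uint64_t)rpm * sectors_per_track);
}
```

At 7200 RPM one revolution takes about 8.3 milliseconds, so a request
that just misses its target sector would wait far longer than 1
millisecond. With the hypothetical 1000 sectors per track,
sweep_us(7200, 350, 1000) comes to roughly 2.9 milliseconds for a gap
of 350 sectors.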

The following lines of code were inserted into block/blk-softirq.c
starting at line number 112:

        do_gettimeofday(&tv);
        /* Despite its name, time_ms holds microseconds. */
        time_ms = (tv.tv_sec * 1000000) + (tv.tv_usec);
        if(req && req->rq_disk && req->rq_disk->disk_name)
        {
           if(!strncmp(req->rq_disk->disk_name, "sdd", 3))
           {
              /* The experiment involves a total of 10 requests: 1 sent
               * from my driver, and the remaining 9 from here. */
              if(count < 10)
              {
                 /* bi_rw == 1 matches a plain WRITE with no other flags. */
                 if(req->bio && (req->bio->bi_rw == 1) &&
                    req->bio->bi_bdev && req->bio->bi_bdev->bd_disk &&
                    req->bio->bi_bdev->bd_disk->queue)
                 {
                    tracing_on();
                    trace_printk("Count = %d: Receive Timestamp for "
                                 "sector #%llu = %lu microsecs; bi_size = %u\n",
                                 count,
                                 (unsigned long long)req->bio->bi_sector,
                                 time_ms, req->bio->bi_size);

                    /* This function (defined below) populates a bio
                     * structure; skip this round if it fails. */
                    biop = NULL;
                    if (compose_bio_rw(&biop, req->bio->bi_bdev, NULL,
                                       NULL, 2560, 1) == 0 && biop)
                    {
                       biop->bi_sector = req->bio->bi_sector + 350;
                       subq = req->bio->bi_bdev->bd_disk->queue;
                       if (subq && subq->make_request_fn) {
                          do_gettimeofday(&tv);
                          time_ms = (tv.tv_sec * 1000000) + (tv.tv_usec);
                          trace_printk("Send Timestamp for sector #%llu = "
                                       "%lu microsecs\n",
                                       (unsigned long long)biop->bi_sector,
                                       time_ms);
                          count++;
                          subq->make_request_fn(subq, biop);
                       }
                    }
                 }
              }
              else
              {
                 count = 0;
                 tracing_off();
              }
           }
        }

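/*
 * Populates a write bio backed by one freshly allocated page.
 * Returns 0 on success or -ENOMEM on allocation failure, in which case
 * *biop is set to NULL and the caller must not use it.
 */
static int compose_bio_rw(struct bio **biop, struct block_device *bdev,
                   bio_end_io_t * bi_end_io, void *bi_private, int bi_size,
                   int bi_vec_size)
   {
      struct page *bio_page;
      struct bio *bio;
      int order = 0, i = 0;

      *biop = NULL;

      /*
       * This runs in interrupt context, so the allocations must not
       * sleep: use GFP_ATOMIC and fail rather than retry via schedule().
       */

      /* Grab a free page and free bio to hold the log record header */
      bio_page = alloc_pages(GFP_ATOMIC, order);
      if (!bio_page) {
         printk("allocate header_page fails in compose_bio\n");
         return -ENOMEM;
      }

      bio = bio_alloc(GFP_ATOMIC, bi_vec_size /*MAX_BIO_VEC_NUM */ );
      if (!bio) {
         printk("Allocate header_bio fails in compose_bio\n");
         __free_pages(bio_page, order);
         return -ENOMEM;
      }

      /* An order-0 allocation yields one page, so this is only valid
       * for bi_vec_size == 1, which is what the experiment uses. */
      for (i = 0; i < bi_vec_size; i++) {
         bio->bi_io_vec[i].bv_page = &bio_page[i];
         bio->bi_io_vec[i].bv_offset = 0;
         bio->bi_io_vec[i].bv_len = 2560;   /* matches bi_size in this experiment */
      }

      bio->bi_sector = -1;              /* we do not know the dest_LBA yet */
      bio->bi_bdev = bdev;           /* set header_bio with same value as bio */
      bio->bi_vcnt = bi_vec_size;
      bio->bi_idx = 0;
      bio->bi_rw = 1;                /* WRITE */
      bio->bi_size = bi_size;
      bio->bi_end_io = bi_end_io;
      bio->bi_private = bi_private;
      *biop = bio;
      return 0;
   }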

The mass storage controller in the system is: Promise Technology, Inc.
PDC20268 (Ultra100 TX2) (rev 02), and the HDD being used is: WD Caviar
Black (Model number - WD1001FALS).

Thank you for reading this really really long mail and assisting me in
resolving this issue!

Regards,
Pallav