Indexing (work in progress)

Rob Landley <rob@xxxxxxxxxxx> · Mon, 6 Aug 2007 19:54:57 -0400

The "work in progress" parts are hardest to write about, because I'm 
summarizing multiple conflicting points of view I haven't resolved yet, and 
easily get distracted trying to resolve them.  Here's a stab at the really 
big one:

-----------  Scope of the problem

Keeping up with incoming data (just glancing at and bookmarking it so I'm 
aware of it) is a more than full-time job.  At OLS Greg KH said nobody can 
read all of the linux-kernel mailing list anymore, and that's just one list.

But indexing involves going through years of accumulated existing data, 
reading it, and attempting to put it in a coherent order.  And there are many 
existing sources of Linux kernel documentation beyond linux-kernel.  Some of 
the biggest ones are:

- Documentation/ (in the kernel tarball).
- make htmldocs (in the kernel tarball).
- menuconfig help (in the kernel tarball).
- Commit descriptions in the source control system (git).
- Linux-kernel mailing list messages (such as the [0/xx] linux-patch series
  descriptions)
- Linux Weekly News kernel section articles (http://lwn.net/Kernel/Index/)
- Linux Journal articles
  (http://www.linuxjournal.com/xstatic/magazine/archives)
- The old kernel-traffic website (which I've sicced Mark on resurrecting).
- The man-pages package (http://kernel.org/pub/linux/docs/manpages)
- Kerneltrap articles.
- Ottawa Linux Symposium papers.
- Selected Wikipedia articles.
- Various Kernelnewbies wiki pages.
- The Linux Documentation Project (ala http://tldp.org/LDP/lki/lki.pdf and
  http://tldp.org/LDP/tlk/tlk.html and
  http://tldp.org/LDP/lkmpg/2.6/html/index.html)
- Online books (Linux Device Drivers, Mel Gorman's MM book...)
- Videos from youtube, google, OLS, Linuxconf AU...
- Audio recordings of technical presentations.
- Developer blogs.  (See http://kernelplanet.org for a taste)
- Various standards bodies (t10, Open Group, ELF, lanana/fsg/lsb...)
- The gentoo and Linux From Scratch websites.  (And to a lesser extent Ubuntu
  and Fedora.)
- Random web stuff (discussions on other development lists, articles in other
  online magazines, HOWTOs written by individual developers, IBM
  developerworks, ars technica, universities, conferences...)

The hard part of documenting the kernel isn't accumulating piles of 
information, it's collating it together into one big index.  (Writing new 
documentation before indexing the available documentation has some obvious 
downsides: how do you know what _isn't_ documented until you know what is?)

Right now, the way to find information is to either Google for it (which 
assumes you know what you're looking for) or read the kernel source code.  
The purpose of an index is to list available topics, allow progressive 
discovery, and let people find things without having to know exactly what 
they're looking for first.  Proper categorization also allows people to 
ignore irrelevant information.

There are several existing indexing attempts, of varying scope.  Some 
individual data repositories (like lwn.net or Documentation/00-INDEX) index 
themselves, with mixed success.  Others try to index or reproduce external 
resources, but these efforts are often abandoned (ala 
http://jungla.dit.upm.es/~jmseyas/linux/kernel/hackers-docs.html or 
http://www.nongnu.org/lkdp/ ) because it's a _big_ job.  Some people try to 
do this via wikis (http://kernelnewbies.org/Documents or 
http://tree.celinuxforum.org/CelfPubWiki) which has a set of downsides big 
enough to be its own topic.

----- Why not just turn Documentation/ in the kernel tarball into an index?

The assumption among kernel developers is that the Documentation directory in 
the kernel tarball is the master index of all kernel documentation.  This is 
wrong on several levels:

1) It's organized based on where passing strangers put things down last, which 
is correctible but symptomatic.  The output of many different people is not 
self-organizing; organizing it is an editorial job (done by a maintainer) and 
a lot of work.  As for being kept up to date, if being in the kernel tarball 
magically kept Documentation/ up to date, the Documentation/scsi directory 
wouldn't consistently refer to the 2.4 kernel series as current.

2) It seldom links outside itself.  It doesn't comprehensively index even 
the two other major sources of documentation in the linux kernel tarball (the 
menuconfig and htmldocs information), and each of those only cross-references 
Documentation/ occasionally.  The assumption is that all documentation worth 
reading is added to Documentation, which simply doesn't match reality.

3) Extensive mirroring of existing web content would bloat the kernel tarball 
tremendously (Google video, all the OLS PDFs, years of LWN articles sometimes 
with graphics), and in some cases is conceptually difficult: how do you add 
wikipedia to the kernel tarball?  In addition, the various licenses web 
content is distributed under often allow free redistribution, but are seldom 
compatible with GPLv2.  A web-based mirror can add several gigabytes of data 
without seriously inconveniencing anyone, and merely needs the ability to 
redistribute data without worrying specifically about GPLv2 compatibility.

4) Documentation is primarily a repository of text files, while the purpose of 
an index is to _link_ to lots of information rather than to contain it.  
Linking to lots of information requires something like html.  The master 
index has to be html so it can link out, and Documentation isn't html.  
Converting the existing Documentation to html would be very intrusive, and 
would require changing the design assumption that it's self-contained to 
assuming that the majority of its content is out on the web.

----------- Plan of attack.

Probably the best format for an index is to have a hierarchical topic list, 
perhaps something like:

Building from source
  User interface
    Configuring
    Building
      Building out of tree
    Installing
    Running
    Debugging
      QEMU
    Cross compiling
      User Mode Linux
  Infrastructure
    kconfig
    kbuild
    build and link (tmppiggy)

Each topic link goes to a brief summary, and each summary contains (or is 
followed by) multiple links to existing sources for information on that 
topic.  The summary could be local, could be an authoritative external site 
for that information if such exists, or a small wrapper around an 
authoritative external site.  (If the hierarchy gets big and complicated 
enough, it can nest.  The top-level index can just have "building from 
source" which links to a separate page with the more detailed index, to keep 
the clutter down but still let people find stuff.  The "big monster index" 
should still be an option for people just browsing, though.)
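
A minimal sketch of how such a topic tree might be represented, assuming each
topic carries a title, an optional summary, a list of links to existing
sources, and child topics.  The Topic class, its field names, and
render_outline() are all hypothetical names made up for illustration; nothing
here reflects an existing tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Topic:
    """One node in the hierarchical index (all field names are hypothetical)."""
    title: str
    summary: str = ""                                   # brief summary, may be empty
    links: List[str] = field(default_factory=list)      # existing sources on this topic
    children: List["Topic"] = field(default_factory=list)


def render_outline(topic, depth=0):
    """Return the indented topic list, one line per topic, two spaces per level."""
    lines = ["  " * depth + topic.title]
    for child in topic.children:
        lines.extend(render_outline(child, depth + 1))
    return lines


# A small slice of the example hierarchy above.
root = Topic("Building from source", children=[
    Topic("User interface", children=[
        Topic("Configuring"),
        Topic("Building", children=[Topic("Building out of tree")]),
    ]),
    Topic("Infrastructure", children=[Topic("kconfig"), Topic("kbuild")]),
])

print("\n".join(render_outline(root)))
```

Splitting a large subtree out into its own page would then amount to
rendering that subtree separately and leaving only its title in the top-level
outline.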

Generating/maintaining such summaries is easy to delegate, once the overall 
structure is determined.

The easiest way of keeping the summaries and the topic index in sync is 
probably to generate one from the other, since the index is a structured subset 
of the information in the topic summary list.  Potential ways to do this 
include:

  - JavaScript.
  - CSS.
  - Use a fancy editor (local wiki engine?)
  - Generating the HTML from DocBook.
  - Preprocessing the summary HTML with a script (sed, python, etc) to
    generate the nested links.

My preferred approach is the last one, with a dash of css to make things look 
better.  If the desired output is html the simplest thing to do is have the 
source material be HTML as well, with the lightest additional markup 
available.  I'd prefer to avoid placing additional demands on the browser, 
and the kernel.org web server prefers to serve static pages for both load and 
security reasons.  Plus lots of people understand basic HTML; I want to 
minimize demands on contributors.
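
A rough sketch of what such a preprocessing script might look like, using
only the Python standard library.  It assumes (and this is purely an
assumption for illustration) that each topic summary in the source HTML
starts with an <h1>..<h6> heading carrying an id attribute, and that the
heading level gives the nesting depth.  HeadingCollector and build_index()
are hypothetical names.

```python
from html import escape
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (level, id, title) for every <h1>..<h6> that carries an id."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._open = None  # (level, id, text pieces) while inside a heading

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            anchor = dict(attrs).get("id")
            if anchor:
                self._open = (int(tag[1]), anchor, [])

    def handle_data(self, data):
        if self._open is not None:
            self._open[2].append(data)

    def handle_endtag(self, tag):
        if self._open is not None and tag == "h%d" % self._open[0]:
            level, anchor, parts = self._open
            self.headings.append((level, anchor, "".join(parts).strip()))
            self._open = None


def build_index(summary_html):
    """Return nested <ul> markup linking to each heading in summary_html."""
    collector = HeadingCollector()
    collector.feed(summary_html)
    collector.close()
    out, stack = [], []  # stack holds the heading level of each open <ul>
    for level, anchor, title in collector.headings:
        while stack and level < stack[-1]:
            out.append("</li></ul>")   # leave the deeper list
            stack.pop()
        if stack and level == stack[-1]:
            out.append("</li>")        # close the previous sibling
        else:
            out.append("<ul>")         # first entry, or one level deeper
            stack.append(level)
        out.append('<li><a href="#%s">%s</a>' % (escape(anchor), escape(title)))
    out.extend("</li></ul>" for _ in stack)
    return "\n".join(out)


page = ('<h1 id="build">Building from source</h1><p>Summary.</p>'
        '<h2 id="kconfig">kconfig</h2><p>More.</p>'
        '<h2 id="kbuild">kbuild</h2>')
print(build_index(page))
```

The output is static HTML, so it could be generated once per update and
served as a plain file.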

I'm starting with one big source html, which I expect I'll break up when it 
gets big enough to warrant it.

The first step in populating the index is to come up with a skeleton list of 
topics, by looking at various kernel documentation resources with existing 
indexes (The "Linux Kernel Internals" document on tldp, Linux Device Drivers 
third edition, etc), and botching up something temporary.  I've sort of done 
this, although it needs more work and needs a format conversion.

The second step in populating the index is to go through some existing 
resources that change fairly slowly and predictably (such as the OLS papers, 
the lwn.net kernel articles, and a release version of the Documentation 
directory), and link each relevant entry into the index.  This involves 
reading through all this stuff, which I've been doing.

Note that a given resource can be linked to from multiple places.  For 
example, the "make htmldocs" file I posted to the list yesterday A) needs to 
be brought to the attention of people building their own linux-from-scratch 
style systems (the "what's in a linux root filesystem" section of the index 
I'm building), B) should be linked from a "building the kernel" page (what 
kinds of output can the build produce), C) should be linked from a "make 
help" page with hotlinks to the individual commands (like htmldocs, 
allnoconfig, menuconfig) that have more extensive documentation, etc.  The 
point isn't to have a "one true location" for a piece of information, but to 
link it so people browsing a relevant topic can find it.
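
One way to picture this, as a sketch only: record for each resource the
topics it belongs under, then invert that into a per-topic list of links.
The resource label and topic names below are placeholders taken from the
example above, and links_by_topic() is a hypothetical helper.

```python
from collections import defaultdict


def links_by_topic(resource_topics):
    """Invert {resource: [topics]} into {topic: [resources]}."""
    by_topic = defaultdict(list)
    for resource, topics in resource_topics.items():
        for topic in topics:
            by_topic[topic].append(resource)
    return dict(by_topic)


# The same resource is deliberately listed under several topics.
resource_topics = {
    "make htmldocs write-up": [
        "What's in a linux root filesystem",
        "Building the kernel",
        "make help",
    ],
}

for topic, resources in links_by_topic(resource_topics).items():
    print(topic, "->", resources)
```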

Once each "slow" type of documentation has been fully integrated into the 
index, maintenance of that source is a question of recording when the index 
was last brought up to date and periodically going from that date to the 
current date.  For resources updated chronologically (lwn.net, OLS papers) 
this is fairly easy.  Some others have a source control log to go through.  
For others, I can mirror the version of the resource we have indexed and 
run "diff" when it's time to update to a new version.
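
A small sketch of both update styles, under the assumption that
chronological sources can be reduced to (date, title) pairs and that
mirrored sources are plain text.  The function names and the sample data are
made up for illustration; difflib from the Python standard library stands in
for running "diff" by hand.

```python
import difflib
from datetime import date


def entries_since(entries, last_indexed):
    """Return (published, title) pairs newer than last_indexed, oldest first."""
    return sorted(e for e in entries if e[0] > last_indexed)


def added_lines(indexed_text, new_text):
    """Return lines present in new_text but not in the mirrored indexed_text."""
    diff = difflib.ndiff(indexed_text.splitlines(), new_text.splitlines())
    return [line[2:] for line in diff if line.startswith("+ ")]


# Chronological source: only look at what appeared after the last pass.
articles = [
    (date(2007, 8, 1), "hypothetical newer article"),
    (date(2007, 7, 1), "hypothetical older article"),
]
print(entries_since(articles, date(2007, 7, 15)))

# Mirrored source: compare the indexed snapshot against the new version.
print(added_lines("intro\nusage", "intro\nusage\nnew section"))
```

Either way, the only state to keep per source is the date (or the mirrored
copy) from the last time the index was brought up to date.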

As for the "fast" types (like linux-kernel or the developer blogs, which can 
generate documentation faster than it's possible to read it)...  There's a 
reason I refer to this job as "drinking from the firehose".  Some kind of 
filtering and summary step will be involved between the source of data and the 
index.  In the case of linux-kernel, there's a reason I hired a research 
assistant to bring kernel-traffic.org up to date.  That is an extremely 
valuable resource, and no, the lwn.net kernel page (valuable as it is in its 
own way) is not a substitute for this.

As for the developer blogs, maybe the developers themselves can send me 
patches someday.  The mercurial tree was public long before it had anything 
useful in it. :)

There's a lot more on this topic, but that's probably enough for one email.  I 
should do a follow-up email about what can and can't be delegated, and how.

Rob
-- 
"One of my most productive days was throwing away 1000 lines of code."
  - Ken Thompson.