Thinking out loud about the master index.

Rob Landley <rob@xxxxxxxxxxx> · Thu, 6 Sep 2007 14:39:12 -0500

So, the Documentation master index.  I'm trying to get it online this week.  
I've got a ton of data to put into it, in the form of notes.txt files, but 
the problem all along has been what should the source format be?

I'm only interested in generating an HTML index (not PDF).  It's a website, 
indexing both local and web resources.  I'm also very interested in external 
contributions from people who know the code but not necessarily any specific 
documentation system.  In 2007 I can probably assume most vaguely technical 
people know enough HTML to put up their own home page, even if it's 
just "the paragraph tag", so if people "view source" on the html they see on 
the website, and send me patches to that, I should be able to cope.  This 
argues for a source format as close to the generated HTML as possible.

What pure HTML doesn't give me is progressive disclosure.  I want a topic 
index, a one line per topic summary outline hotlinking down into the text.  I 
don't want to maintain it separately or it'll get out of sync.  For long-term 
maintainability, I want to generate it from the HTML.  How do I do that?

The existing tools to do this are basically "docbook", but in that case the 
source format is a bit farther from the generated HTML than I'd like.  
Docbook has its own set of tags, and the number of people who know those tags 
is an order of magnitude smaller than the number who understand simple HTML.  Last 
I checked none of the word processors out there produce docbook either; 
wysiwyg editing of docbook turns out to be a hard problem since it's so 
rigidly non-presentational.  (There have been attempts, of course: 
http://wiki.docbook.org/topic/DocBookAuthoringTools .)  The tech writing 
community hasn't particularly embraced docbook that I've seen; it's a tool 
programmers use, not a tool tech writers use.  This doesn't completely rule 
it out, especially since the kernel already uses it in a few places (mostly 
as an intermediate format generated by running perl regexes against C source 
code), but it doesn't seem to me like a _good_ solution if producing PDF 
output of this data isn't useful.

A friend of mine sent me some javascript that generates an index from a page 
by parsing nested <h1><h2><h3> tags.  This solution doesn't scale.  For one 
thing, it requires all the data being indexed to be in a single HTML file, which 
could get unwieldy fast.  Also, below <h3> these tags display smaller than 
normal text, so it needs a custom stylesheet to not look horrible.  Debugging 
javascript in IE/Firefox/Safari/Konqueror doesn't exactly fill me with joy, 
either.

Another friend proposed a solution done entirely as a stylesheet.  While XSLT 
is turing complete (if only by accident), the number of people on the planet 
who really understand it is way below the number who understand docbook.  I 
so do not want to go there.  Canned stylesheet for display, ok.  For 
navigation, less so.  And this also has the problem that it only works if the 
entire index is a single page, and I expect this index to get really big and 
naturally break down into sections.

Another suggestion is a wiki.  That's yet more non-html markup: if somebody 
sends me an existing HTML page, getting it into a wiki can be a bit of a 
pain.  Plus, I've got a distributed source control system already.  And I 
really want to generate static pages because running arbitrary code on 
kernel.org is not something that's acceptable from a security perspective.  
(They explicitly disallow php, for example.)

My best guess after thinking about this a longish time is to add whatever 
turns out to be the minimal extra markup in the HTML needed to hold index 
information (possibly "span" tags), and run a Python preprocessor to 
generate the index.  This has the advantage that no code has to run on 
kernel.org to generate a page, so there are no security implications.  Also I 
can emit "span" tags into the resulting html and they should be ignored if 
the current stylesheet doesn't give them behavior.  (And adds the possibility 
of somebody who does do stylesheets using these tags to make things prettier, 
without said stylesheet actually being required to access the data.)
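
A minimal sketch of what such a preprocessor might look like, using only
Python's standard html.parser module.  Keeping the one-line summary in a
"title" attribute, the SpanIndexer and build_index names, and the inline
margin style used for indentation are all assumptions made for illustration;
none of them come from an existing tool.

```python
from html import escape
from html.parser import HTMLParser


class SpanIndexer(HTMLParser):
    """Collect (depth, id, summary) for every <span id="..."> in one page."""

    def __init__(self):
        super().__init__()
        self.entries = []  # index entries in document order
        self.stack = []    # one bool per open <span>: does it carry an id?

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        span_id = attrs.get("id")
        if span_id:
            depth = sum(self.stack)  # number of enclosing id-bearing spans
            # Assumption: the summary lives in a title attribute;
            # fall back to the id itself when there is none.
            self.entries.append((depth, span_id, attrs.get("title") or span_id))
        self.stack.append(bool(span_id))

    def handle_endtag(self, tag):
        if tag == "span" and self.stack:
            self.stack.pop()


def build_index(pages):
    """pages: iterable of (filename, html_text).  Returns the index as HTML."""
    lines = []
    for filename, html_text in pages:
        parser = SpanIndexer()
        parser.feed(html_text)
        parser.close()
        for depth, span_id, summary in parser.entries:
            href = escape("%s#%s" % (filename, span_id), quote=True)
            lines.append('<div style="margin-left: %dem"><a href="%s">%s</a></div>'
                         % (2 * depth, href, escape(summary)))
    return "\n".join(lines)
```

Since build_index takes any number of (filename, text) pairs, the index is
not tied to a single HTML file.  Spans without an id are tracked only so
that their end tags do not disturb the nesting depth.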

Something I do want is the ability to reorganize sections via cut and paste, 
without extensively redoing the tags inside the sections.  That would rule 
out "span id=1.3.7.2" style sections; any numbering would have to be 
generated by the preprocessor.  What I think it needs is named sections 
(<span id="walrus">) and matching end tags (</span id="walrus"> perhaps?).
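
One way the preprocessor could generate the numbering, sketched under the
assumption that it has already reduced the source to a list of (nesting
depth, section name) pairs in document order.  The number_sections name and
the input shape are hypothetical.

```python
def number_sections(entries):
    """Assign dotted section numbers from nesting depth alone.

    entries: list of (depth, section_id) in document order, depth 0 being
    the top level.  Returns a list of (number_string, section_id).
    """
    counters = []  # counters[d] = number of the current section at depth d
    numbered = []
    for depth, section_id in entries:
        # Leaving a deeper level: forget the counters below this depth.
        del counters[depth + 1:]
        # Entering a deeper level: start its counter at zero.
        # (A jump of more than one level leaves a 0 in the number.)
        while len(counters) <= depth:
            counters.append(0)
        counters[depth] += 1
        numbered.append((".".join(str(n) for n in counters), section_id))
    return numbered
```

Nothing in the source carries a number here, so moving a named section
elsewhere in the file only changes the order of the pairs, and the numbers
are recomputed on the next run.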

Then there's the question of how much the preprocessor should insert.  (When 
it sees a start-of-span tag, does it do a title for the section, or should 
that already be there in the html?  Probably the latter, but section numbers 
would have to be inserted...)

I'm working on it, but meanwhile: anybody else have an opinion?

Rob
-- 
"One of my most productive days was throwing away 1000 lines of code."
  - Ken Thompson.