[wikka-community] Announcing the release of WikkaWiki 1.3.6

David Lee dlee at calldei.com
Fri Dec 26 13:07:31 UTC 2014


>>
>>    So let's start with a wish list of sorts.  We can do this here, or I can create an issue on the github tracker.  
>>   Maybe we can hash out the ugly details here and then move it to the tracker once we have some solid ideas to work with.

Awesome - yes, a few list discussions first, then if you (and anyone else) are still interested we should move off the list onto the tracker or a smaller CC list.

>> I believe the easiest way to approach this would be to simply use the existing markup formatter as the basis to generate whatever output we want.
>> I'm not too sure there's a need to create a new markup parsing engine to accomplish this.  Maybe that's what your volunteers got stuck on?

That's my first thought too ... but neither I nor the people who tried understood the codebase or the language well, so we went at it the wrong way.
A) My first few attempts were HTML scraping - not a great way to go, but at least HTML can be converted to well-formed XML, and from there
  XML tools have a *chance*.
B) The most success came from the friend who wrote the Python app (I don't know Python much either).
  I did a DB dump of all the raw pages from MySQL and gave him a zip of a directory tree with one text file per DB record,
  and he tried to reverse-engineer the markup engine. He did a pretty good job but never finished - for good reasons -
  there isn't quite enough semantic information to go straight from wiki markup to a domain-specific format ... so the goal was not right. (see below)

I agree that ideally this should be done *in* the markup engine, where the parsing already has all the context that's possible
and doesn't need to be re-invented or kept in sync with a second implementation.
The 'last 10%' of his problems (of course resulting in 99% of the work)
ended up being very tiny discrepancies between his parser and the wiki parser that would mess up entire pages ...

---
>> So, got some ideas for me to work with?  Examples perhaps?  It's been years since I've looked at DocBook, so I imagine I'll have to ramp up on that.
---

I have lots of ideas and examples; probably should start that with a smaller group.
The project I'm working on is http://www.xmlsh.org/ - a 10-year ongoing open source project.

But the concept can (and should) be much more general.  In my case I have a combination of general documentation (how-to, installation, etc.)
and a lot of 'command description' pages.  They are quite different things, with different formatting/presentation/output needs,
and ultimately I would like to move some of it into the codebase itself (comments, Javadoc, something ...) and auto-generate the wiki contents,
in-program help, offline HTML, PDF, etc.  -- but let's put that aside for a bit; it mainly helps to put the simpler goals into perspective.

What I found, though, is that even if the markup parsing were 100% correct, the result wasn't quite structured enough for my needs.
That's OK ... there is simply not enough, or not the same kind of, metadata available in the markup for everything (intentionally).
But I learned from this that it was a mistake to try to go straight to DocBook ...
So the *goal* of magically inferring markup that didn't exist in the first place is unrealistic.

DocBook is a very good, nearly domain-free schema for documents, but not quite general enough ... it has built into it things like Chapters,
Sections, Figures, Footnotes, Command Descriptions, etc.  You don't have to use all of those, but if you don't you end up
with something not much better than a bunch of header, bold, and paragraph tags ... not much of an improvement
over HTML scraping.

So my lesson from this is that a better first step would be to model the existing wiki tag structure in its own XML schema with as high fidelity as possible.
That is the least invasive to the existing codebase, and would produce the highest quality output from the existing markup (i.e. no loss of information and no
adding of 'inferred' structure that is specific to a particular person's page or site).

For example: 

In the Wiki you might have
   ===Notes===
or 
 =====Command base64=====

These are presentation semantics ... H1, H2, etc.  Some structural semantics, but not much.  I.e. from this it's difficult to
infer that "=====Command base64=====" is really a 'Level 2 section in the Commands Chapter' while "===Notes===" is only
a hint to add a bold title ...
In actuality they may be different things at different times depending on what you're producing (a quick 2-line help text might
want to index into 'base64' knowing it's the command, then find a specially named subsection and only print out the first paragraph).
One could conceive of a configurable mapping that says when to convert headings to chapters and when to convert them to sections, etc.
But that gets complex fast - and the best tools for that are not in the wiki parser itself.
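
Just to make the 'faithful mapping' idea concrete, here's a rough sketch of what a presentation-level mapping could look like for
those two headings (all element and attribute names are invented for illustration - not a proposal for the final schema):

    <page name="CommandBase64">
      <heading level="5">Command base64</heading>
      <para>... body of the command description ...</para>
      <heading level="3">Notes</heading>
      <para>... notes text, with <link page="SomeOtherPage">links</link> kept as elements ...</para>
    </page>

Note the 'level' attribute just records the number of '=' signs - no decision about chapter vs. section vs. bold title is baked in;
that decision stays downstream where it belongs.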

So instead, I suggest that a simpler and more feasible goal - and a more useful one - is as direct a mapping as possible from the markup to a single well-defined schema.
Both ways, ideally (export AND import).

Since XML tools are the most capable in this area I suggest an XML format - but that doesn't imply at all that it can't be other formats (like JSON, YAML, etc.)
or that it "IS XML Specific" (the "X Word" has been getting some flak lately ... :)
Just that the XML toolsets are still the most powerful, expressive and generally available of any markup ecosystem for handling document representations,
transformations, conversions, etc.
If you get it into a decent XML format you can convert it to anything and vice versa ... there should be no problem with writing converters to/from other formats
(HTML, PDF, JSON, YAML, text, CSV, ...).  It makes a good intermediate format, and the most versatile one.
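
To show how little code it takes to consume such a format once it exists, here's a throwaway Python sketch (using only the made-up
<page>/<heading>/<para> names from the sketch above) that flattens one exported page into bare-bones HTML:

    # Throwaway sketch - assumes the hypothetical <page>/<heading>/<para> format sketched earlier.
    import xml.etree.ElementTree as ET

    def page_to_html(xml_path):
        root = ET.parse(xml_path).getroot()      # the <page> element
        out = ["<html><body>"]
        for el in root:
            if el.tag == "heading":
                # 'level' is just the count of '=' signs; pick one possible h1..h6 mapping
                h = max(1, min(6, 7 - int(el.get("level", "3"))))
                out.append("<h%d>%s</h%d>" % (h, el.text or "", h))
            elif el.tag == "para":
                # links etc. flattened to plain text, just to keep the sketch short
                out.append("<p>%s</p>" % "".join(el.itertext()))
        out.append("</body></html>")
        return "\n".join(out)

    print(page_to_html("CommandBase64.xml"))

In practice this is exactly where XSLT shines, but the point stands: once the markup is in a well-formed, predictable format,
the conversion side is the easy part.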

The power of wiki markup is also its weakness ... it's easy to write but quite hard to reliably parse, extract, import and convert, especially when it's embedded in a specialized system
(let alone the issue of every wiki having its own flavor of syntax).

So to summarize, I think a very powerful and probably easy first step is to define an XML format (a 'schema' of sorts) that maps to the elements of the wiki markup format.
Then implement (probably as you suggest) a rendering variant in the current markup parser that can produce the XML as some kind of export.
+10 points if it can be scripted somehow so you don't have to manually go click on every page to get it all ...
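
For instance, *if* there were an export handler that served a page's XML (say a hypothetical 'PageName/xml' handler - the name is
made up, nothing like it exists today), the scripting side is a few lines:

    # Sketch only: 'PageName/xml' is a hypothetical export handler, and pages.txt
    # is a hand-made list of page names - neither exists today.
    import urllib.request

    BASE = "http://example.org/wikka.php?wakka="      # placeholder wiki URL

    with open("pages.txt") as f:
        for name in (line.strip() for line in f if line.strip()):
            with urllib.request.urlopen(BASE + name + "/xml") as resp:
                data = resp.read()
            with open(name + ".xml", "wb") as out:
                out.write(data)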

That itself would be a great deliverable.  I would be happy to help on the XML side of things.

A next 'obvious' step is the reverse ... importing from the same format.   
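
The mechanics of that direction are just as small - a sketch going from the same hypothetical XML back to wiki markup
(the part that actually needs work is on the Wikka side, i.e. some supported way to save the result as a page):

    # Sketch: hypothetical <page>/<heading>/<para> XML back to Wikka-style markup.
    import xml.etree.ElementTree as ET

    def page_to_wiki(xml_path):
        root = ET.parse(xml_path).getroot()
        lines = []
        for el in root:
            if el.tag == "heading":
                bars = "=" * int(el.get("level", "3"))
                lines.append(bars + (el.text or "") + bars)
            elif el.tag == "para":
                lines.append("".join(el.itertext()))
            lines.append("")                          # blank line between blocks
        return "\n".join(lines)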

Those 2 pieces dramatically remove the barrier to interoperability and open the Wiki to a whole new world of possibilities.

From there, one might for example decide on 'conventions', not hard-coded into the parser, for distinguishing domain-specific markup
like 'chapter' or 'command description' or 'footnote' ...
Generic conversions could be written to take the XML format and produce generic HTML or PDF or JSON or whatever,
leaving room for custom conversions for individual needs (like, say, DocBook or MS Windows Help format ...).

---- HOLES ---
Inline raw HTML is simply going to be annoying.  The best I can suggest is to pass it through untouched as well-formed XHTML.
It would be a mistake to try to convert it in the markup parser into something else.
OTOH, it's also a great way to *pass through* data (both ways) through the system untouched.  This can be very useful not only for
presenting the HTML in a browser, but also for adding metadata that gets a free ride through the parser, tagging along, as it were, with the associated content.
That back door may well end up being the key to many issues.
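
To illustrate the free ride: something like the fragment below, embedded in a page as raw XHTML, would come out the other side
of an export intact, and a downstream conversion could key off the class or data-* attribute to recover structure the markup
itself doesn't carry (the attribute names here are invented - whatever convention we agreed on would do):

    <div class="command-description" data-command="base64">
      ... the associated page content ...
    </div>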

----------------------------------------
David A. Lee
dlee at calldei.com
http://www.xmlsh.org







