Criteria, perspectives and limitations of Open Source Systems
Open Applications – A Model for Technical Documentation?
by Ferdinand Soethe
Talk to technical editors about Single Source Publishing (SSP) and you sometimes get the impression that you are discussing Utopia: nearly everyone knows of it, nearly everyone could make good use of it in their work, but hardly anyone actually uses it.
The need to prepare content for different target media cannot really be met by classic word processors or DTP systems, nor by web design tools. This is why the XML family of languages, born in the 1990s, found wide acceptance in technical documentation. Later, when large software companies such as Lotus, Sun and IBM released efficient programs for reading and transforming XML as open source software, a broad, vendor-independent base for Single Source Publishing systems formed within a short time. Java also played an important role in this development: it made it possible to write applications with comparatively little effort that run on all commonly used operating systems.
Today there is a choice between a variety of systems under open source licences and proprietary solutions. Comparing them is difficult, especially since SSP solutions can be built in very different ways.
Media-independent formatting
Functional markup, i.e. marking up portions of text according to their meaning, forms the core of the SSP concept. Instead of fixing the formatting for a particular medium, such as setting a quote in italics, for example
He spoke of very personal experiences …
the author simply marks the purpose of a text portion in the text itself, as follows
He spoke of <quote>very personal experiences</quote> …
and leaves it to later processing to decide how this is turned into a format appropriate to the medium. The quote can then be set in italics for the print version, or be read by a different voice in an audio publication.
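The idea of deciding the presentation only at publication time can be illustrated with a small Python sketch. It is not taken from any of the systems discussed here: the wrapping <para> element, the function names and the <voice> element for the audio version are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical source fragment; only <quote> comes from the example above.
SOURCE = "<para>He spoke of <quote>very personal experiences</quote> ...</para>"


def render(elem, wrap_quote):
    """Flatten elem to a string, passing the content of each <quote> through wrap_quote."""
    parts = [escape(elem.text or "")]
    for child in elem:
        inner = render(child, wrap_quote)
        parts.append(wrap_quote(inner) if child.tag == "quote" else inner)
        parts.append(escape(child.tail or ""))
    return "".join(parts)


def to_print(xml_text):
    """Print-style rendering: quotes are set in italics."""
    return render(ET.fromstring(xml_text), lambda s: f"<i>{s}</i>")


def to_audio(xml_text):
    """Audio-style rendering: quotes are handed to a second voice (hypothetical tag)."""
    return render(ET.fromstring(xml_text), lambda s: f'<voice name="second">{s}</voice>')


print(to_print(SOURCE))
print(to_audio(SOURCE))
```

The source text is the same in both cases; only the function applied to <quote> differs, which is the point of functional markup.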
What sounds simple in theory is, alas, difficult to uphold in practice. Thinking in terms of a particular medium is still too deeply ingrained in most minds. This is why SSP systems should provide clear and meaningful markup from the start.
HTML-based markup languages are ill suited, because they have suitable elements for only a fraction of the functional markup required. If one extends them using class attributes,
<p class="Instruction Step">Click on Edit-Copy</p>
it becomes hard to keep uncontrolled growth in check. The same, incidentally, is true of popular wikis.
XML, though flexible and universally usable as a document format, is not sufficient on its own, because this very flexibility means it cannot guarantee meaningful markup. Hence even XHTML does not solve the problems of HTML. And a Microsoft Word file saved as XML is by no means usable as an SSP format.
A good SSP system therefore relies on tried and tested markup languages such as DocBook [1] or DITA [2] and ideally supports several of them. This is where XML-based systems have an edge, because the entire technology is built on universality and interchangeability.
Even so, theory and practice are far apart. Apache Forrest [3], for example, has all the prerequisites for a good SSP system.
However, the Forrest sample application almost exclusively uses an HTML-like XML format named "document-v20" and first has to be configured with the right plug-ins for genuine SSP. Hence the system's potential is recognised only at second glance.
It would be rash, though, to judge too quickly. In Forrest, HTML or XHTML dialects now serve only as internal intermediate formats that make it easier to support additional source and publishing formats.
This aid is definitely valuable. The deciding factor is how well the information in the source format survives the two-step conversion. For good results, the complete information of the source document must still be available in the published version, a requirement that most systems have so far been unable to meet.
Grammar support
Even the best markup language is of little help if its rules are not followed. Hence grammar support, i.e. the use of grammars such as DTD, XML Schema [4] or Relax NG [5], is important for every part of the system.
Systems that do more than merely validate documents have many practical advantages: if the editor already knows and understands the markup language, it can offer the author a menu of the markup options available for a given text portion. Some advanced editors, for example oXygen [6], even explain the meaning and intended use of every possible element during selection.
The XML world, too, has some less than desirable aspects in this regard. The well-known DTD, for instance, was followed by the more powerful but more complex XML Schema. Of late, the more user-friendly Relax NG has been rapidly gaining popularity as a grammar language. Only a few systems have fully kept up with this development.
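What a grammar contributes, both to checking a document and to offering the author a menu, can be sketched with a deliberately tiny Python model. It is not a DTD, XML Schema or Relax NG processor: the element names and the GRAMMAR table are hypothetical, and real grammars also constrain order, cardinality and attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy grammar: element name -> child elements allowed inside it.
GRAMMAR = {
    "procedure": {"title", "step"},
    "step": {"para", "note"},
    "title": set(),
    "para": {"quote"},
    "note": {"para"},
    "quote": set(),
}


def allowed_children(tag):
    """What an editor could list in its menu while the cursor is inside <tag>."""
    return sorted(GRAMMAR.get(tag, set()))


def violations(elem, path=""):
    """Return messages for every element the toy grammar does not permit."""
    here = f"{path}/{elem.tag}"
    problems = []
    if elem.tag not in GRAMMAR:
        problems.append(f"{here}: unknown element")
    for child in elem:
        if elem.tag in GRAMMAR and child.tag not in GRAMMAR[elem.tag]:
            problems.append(f"{here}: <{child.tag}> not allowed here")
        problems.extend(violations(child, here))
    return problems


good = ET.fromstring(
    "<procedure><title>Copy</title><step><para>Click on Edit-Copy</para></step></procedure>"
)
bad = ET.fromstring("<procedure><para>stray text</para></procedure>")

print(allowed_children("step"))
print(violations(good))
print(violations(bad))
```

The same table serves both purposes: validation reports what breaks the rules, while the menu function only ever offers what the rules permit.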
Visually supported text editing
For a long time, working with XML editors was a questionable pleasure. While entering text was child's play thanks to grammar-driven assistance, proofreading became a strenuous exercise in concentration because of the XML tags scattered through the text. How comfortable, by comparison, are tools like OpenOffice [7] or Microsoft Word, where a heading is large and bold and a bulleted list can be identified at a glance by its bullet points?
Good XML editors like XML Spy [8], oXygen or XMLmind [9] now offer similar solutions: in addition to the XML source view, a visually appealing mode is available for display and editing. By adapting the style templates (usually CSS stylesheets [10]), authors can decide how they want text elements displayed ("what you see is what you need").
For a bulleted list, this display can be identical to the later print edition, but often it deviates. This is partly because different kinds of markup are recognised fastest on screen through different colours and prefixes, and partly because the font intended for print is difficult to read on screen.
All in all, there are major differences between the systems. It is therefore worthwhile to test an editor by working on a longer text.
From the viewpoint of the SSP system, the editor should ideally be
freely selectable,
easy to integrate into the system,
and flexibly configurable in its display.
This is a weakness of many server-based CMS systems in the open source area: visual editing of HTML is available, but support for XML dialects and grammars is a needle in a haystack.
Apache Lenya [11] has an advantage over the rest of the field here: it supports online editing of any number of XML dialects, as long as a Relax NG grammar is available. The widely used Bitflux Editor [12] goes far beyond what CSS formatting offers, because it performs visual editing using actual XSL transformations.
Publication support
For merely publishing small documents as HTML and PDF, you do not need an SSP system. Most XML editors today support transformation scenarios, and many even come with stylesheets for the most important transformations.
However, this is often not enough: what is supposed to appear on separate pages on the web has to be combined into a book with title page, page numbers and table of contents. Web pages, in turn, must fit into a prescribed website design and therefore cannot do without navigation elements and a suitable page frame.
SSP systems automate the entire publication process and integrate the transformed documents into the publication context. Forrest is a good example: every document to be transformed is integrated into the website and given header and footer sections as well as navigation elements. PDF documents, on the other hand, are given a table of contents and project-specific footers. By changing the system configuration, several documents can be combined into a single PDF document.
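Integrating transformed documents into a publication context can be pictured roughly as follows. This Python sketch is an illustration only and does not reproduce Forrest's actual mechanism; the function name, the page tuples and the HTML frame are all assumptions.

```python
from html import escape


def build_site(pages, site_title, footer):
    """Wrap already transformed page bodies in a common frame.

    pages is a list of (filename, title, body_html) tuples;
    the result maps each filename to a complete HTML page.
    """
    # One shared navigation block that links to every page of the publication.
    nav = "".join(
        f'<li><a href="{escape(name)}">{escape(title)}</a></li>'
        for name, title, _ in pages
    )
    site = {}
    for name, title, body in pages:
        site[name] = (
            "<html><head>"
            f"<title>{escape(title)} - {escape(site_title)}</title>"
            "</head><body>"
            f"<header>{escape(site_title)}</header>"
            f"<nav><ul>{nav}</ul></nav>"
            f"<main>{body}</main>"
            f"<footer>{escape(footer)}</footer>"
            "</body></html>"
        )
    return site


pages = [
    ("index.html", "Welcome", "<p>Hello</p>"),
    ("install.html", "Installation", "<p>Steps</p>"),
]
site = build_site(pages, "Manual", "Example project")
print(site["index.html"])
```

The authors only ever write the page bodies; header, navigation and footer are added in one place, so a change to the frame reaches every page on the next run.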
Daisy [13] and Lenya are even more flexible: publications can be compiled and published from any selection of documents. The DITA Toolkit [14], too, uses map files to recombine content, once written, into all kinds of publications.
Content management functions
Content management, i.e. the systematic management of content, its relationships and its change history, is an important aspect of every SSP system. However, these functions need not always be built into the system. Apache Forrest and the DITA Toolkit show how even complex documentation projects can be managed solely on the basis of the file system.
The DITA Toolkit is particularly consistent in this regard and stores all relevant data, including the relations between documents, in XML files using a standardised format. This simple and robust approach makes installation and administration easy and makes the system more transparent, even to lay persons.
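How relations between documents can live in a plain XML file may be easier to see with a small example. The map below follows the basic shape of a DITA map (a map element containing nested topicref elements with href attributes), but the file names and the reading_order function are hypothetical, and the sketch is in no way a substitute for the toolkit's own processing.

```python
import xml.etree.ElementTree as ET

# Hypothetical map: the nesting of topicref elements defines the publication structure.
MAP_XML = """
<map>
  <title>User Guide</title>
  <topicref href="intro.dita">
    <topicref href="install.dita"/>
  </topicref>
  <topicref href="usage.dita"/>
</map>
"""


def reading_order(map_xml):
    """Return (depth, href) pairs for every topicref, in document order."""
    order = []

    def walk(elem, depth):
        # findall("topicref") matches direct children only, so nesting is preserved.
        for child in elem.findall("topicref"):
            order.append((depth, child.get("href")))
            walk(child, depth + 1)

    walk(ET.fromstring(map_xml), 0)
    return order


for depth, href in reading_order(MAP_XML):
    print("  " * depth + href)
```

Because the structure is an ordinary file, it can be edited with any editor and kept under version control alongside the topics it refers to.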
Hybrid systems like Daisy take a middle path: They store the data in a relational database, but keep only clearly structured XML documents in it.
Other systems that are also based on Apache Cocoon [15], as well as wikis such as DokuWiki [16], seem at first sight to be database-backed, but actually store their content in separate files on the server's file system. The user thus still has to deal with the internal organisation of the system, although the files can be exported from it in their original format at any time.
However, tasks such as backup, data export or versioning become considerably more complicated the moment application servers and databases come into play, not to mention the difficulties of collaborating in a work group over the Internet or of using the system offline on a laptop. File-system-based solutions are therefore worth considering for small work groups or for working across a WAN.
A more user-friendly option is to use database technology for storing secondary data such as the index of headings or lists of links. This makes it possible to flag broken cross-references while a page is being edited. Systems like Daisy make inserting cross-references child's play and are far ahead of the DITA Toolkit or Forrest in terms of comfort.
Modularity of content
Closely linked to the internal structure of the data storage is the modularity of content, in particular the level at which (simple) content modules can be stored and reused.
Daisy has the most sophisticated solution here: any content can be reused. The DITA Toolkit also has functions for this. In general, however, some research is needed to find out whether and how a desired function can be used within a given system.
Versioning
Versioning is indispensable for ongoing projects, especially because "hand-made" solutions are hardly workable for work groups. An SSP system should therefore at least be versionable, even if it does not come with its own solution. For file-based systems, Subversion [17], an excellent version control system from the world of software development, is a good choice. To start with, it can be installed purely locally without much difficulty; if required, it enables smooth and secure group work over the Internet, especially with members who are not constantly online. In addition, thanks to its widespread use, it is a good deal more reliable than any home-grown versioning system and far easier to use than CVS.
Flexibility in working modes
Integrated systems like Daisy offer online editing of documents only. The document is requested from the server, edited in the browser and then sent back to the server. The result is immediately visible on screen. This is wonderful if you have constant access to the server, but not very practical if you have to edit a document on a train.
The DITA Toolkit goes the opposite way. All documents can be edited as files with any editor, and the transformation takes place in a processing run (batch processing). Depending on the size of the publication, it can take a while before you see results. This is tedious for small corrections.
Forrest, on the other hand, takes a more elegant middle path: in its dynamic mode you can transform a changed document within seconds and view it in context. In addition, the static mode can regenerate the entire publication in a batch run at any time.
Publication methods
Publishing raises some very different questions. While almost all systems can create a PDF document, they often have specific server requirements for online publication on the web.
Many wikis expect a server with PHP and MySQL support. Cocoon-based systems require an application server such as Apache Tomcat, which makes hosting more expensive.
Forrest, Lenya, Daisy and the DITA Toolkit can also export a publication as static HTML pages, which can then be published on any web server at low cost and with high performance. Publishing on CD is likewise possible without extra effort or cost.
Publishing and workflows
Where workflow support is necessary, for example for proofreading documents before publication, Forrest and the DITA Toolkit are of very limited use. Lenya and Daisy handle this task with sophisticated functions.
Strategic aspects
Introducing a new SSP system always comes at a high price, with considerable investment in system configuration, training and familiarisation of staff. The future viability of the system is therefore an important selection criterion.
Avoid tying yourself to one operating system. Even if your world is currently full of Windows computers, you have every reason to choose a platform-independent system. Countless users of specialised software learnt this lesson when their applications did not work properly on the then new Windows Vista.
By opting for open source software, you are also free of any vendor lock-in. More important, however, is to keep content and metadata consistently in vendor-independent XML formats like DocBook or DITA.
XML island solutions such as Microsoft's "Office Open XML", which are usable only with Microsoft Word, are no alternative. Openness counts for little if it is simply too expensive to develop compatible software. For formats that conform to public standards, by contrast, tried and tested, inexpensive software can always be found with which you can continue to use your documents.
Open source special features
In some respects, open source software differs fundamentally from proprietary software, not least in the role of the community. The community of developers and users is highly important in open source projects. As a user of open source software, you can influence the development of your software, or even advance it by commissioning work from developers.
Anyone who regards open source projects as nothing but suppliers of cheap software and keeps all enhancements to themselves will get little support from the community and forfeits the chance of valuable help from experienced developers.
English is the standard in almost all communities. A prerequisite for user support as well as for discussions about a project's development is therefore a basic knowledge of English and, of course, the willingness to use it. Anyone who shies away from this would do well to seek support from German-speaking members of a community.
Nor is it taboo to mix open source systems with commercial ones, since there is not always a useful open source solution for a given problem. Forrest, for example, is real fun when used together with a (commercial) visual XML editor for editing text.
Conclusion
The perfect SSP system, covering all conceivable use cases, simply does not exist, not even in the open source area. Nevertheless, open systems offer attractive conditions for extensible and future-proof solutions that are suitable for many cases of technical documentation.
Ferdinand Soethe, a software architect, advises clients on selecting, adapting and implementing complex software systems. Since 1996, the use of SSP systems for technical documentation has been a significant part of his work. The author is a member of the Project Management Committee of Apache Forrest and has presented the project in numerous talks at home and abroad.