Project Gutenberg Projects
Convert a treasure trove of Project Gutenberg novels to Markdown in preparation for thematic typesetting.
I hope to give readers a thorough understanding of where presentation logic can creep into content and the ways that it can be avoided. If you’d rather dig into delicious programming bits, see the XHTML to Markdown section.
Project Gutenberg is so notoriously inconsistent that it could use a trigger warning for anyone who has an obsessive-compulsive personality disorder (ahem). This inconsistency means that automatic typesetting—reaching professional level—would be extraordinarily tedious, if not altogether impossible. That laborious overhead prevents easily showcasing popular fiction in a variety of beautiful forms.
Fortunately, others have engaged in projects to ease typesetting prose from Project Gutenberg. Notably, these include HTML Writers Guild and Standard Ebooks. Both of these come with their own set of technical challenges, so let’s explore some of the technical quagmires that caused clear losses in the battle of content versus presentation.
A number of attempts were made to normalize—or suggest formats for normalizing—Project Gutenberg over the years, including:
- HTML Writers Guild
- Standard Ebooks
- Project Gutenberg 2 (defunct)
- TEI: Text Encoding Initiative
- Brandon D. Perkins’ Thesis
- Cynthia L. Blue’s Thesis
Of all these projects, the most amenable to automatic typesetting are those produced by Standard Ebooks and HTML Writers Guild. The benefit of using HTML Writers Guild is their semantic markup and simple document type definition (DTD) file. Standard Ebooks, as the name suggests, are brilliantly standardized and have an excellent Manual of Style that describes what to expect from the XHTML.
At time of writing, the official XML documents listed on the Guild’s website are mostly unavailable. Thankfully, the Internet Archive, a random GitHub page, and a GitLab page contain copies of their work.
HTML Writers Guild
As great as they are, the HTML Writers Guild’s XML documents couple words to style in a few ways. This section describes the problems and proposes how to solve them. The issues are presented as shown in the original marked-up file, with only the indentation of the document’s XML structure edited for readability.
By Lines and Subtitles
The “by line” is what I call the line that introduces the author’s name. Here we can spot a few problems:
<frontmatter>
  <titlepage>
    <title>OLIVER TWIST</title>
    <para> OR</para>
    <subtitle> THE PARISH BOY'S PROGRESS</subtitle>
    <para>BY </para>
    <author>CHARLES DICKENS</author>
  </titlepage>
</frontmatter>
First, all element texts are capitalized. Although converting to Title Case is straightforward, formal names have edge conditions—such as possessives, van Dykes, or McLeods—that take some care. It is easier for a computer to convert mixed case to uppercase than the other way around. Arguably, how the case is presented is, well, presentation logic.
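To illustrate, here is a minimal Python sketch of the uppercase-to-mixed-case direction (the exception tables are hypothetical and would need to grow per book; this is not part of the Guild’s tooling):

```python
def title_case(text):
    """Convert an all-caps title to Title Case, patching edge cases
    that the naive str.title() gets wrong (possessives, surnames)."""
    # Hypothetical exception tables; real books would need more entries.
    fixed = {"mcleod": "McLeod", "van": "van"}
    minor = {"of", "the", "a", "an", "and", "or"}
    words = []
    for i, word in enumerate(text.lower().split()):
        if word in fixed:
            words.append(fixed[word])
        elif i > 0 and word in minor:
            words.append(word)  # keep minor words lowercase mid-title
        else:
            # Capitalizing only the first letter preserves "Boy's";
            # str.title() would produce "Boy'S".
            words.append(word[0].upper() + word[1:])
    return " ".join(words)

print(title_case("THE PARISH BOY'S PROGRESS"))  # The Parish Boy's Progress
```

Every surname and particle becomes another entry in the exception table, which is exactly why the mixed-case form belongs in the content and the all-caps form in the presentation layer.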
Second, there’s a linear organization for the title page, as signified by the OR and BY paragraphs that smack of presentation presumptions. Admittedly, the para elements can be ignored, but it leaves an unsettling feeling because we don’t know if an automated process is going to miss a key phrase inserted in a para that was meant to be inserted above or below a semantic element. Another way to mark up the document would be to prevent para elements within the titlepage element. Using an attribute—like alt for “alternative title”—to suggest the meaning of the subtitle then allows the presentation layer to insert the OR if desired:
<title>OLIVER TWIST</title>
<subtitle alt="true">THE PARISH BOY'S PROGRESS</subtitle>
<author>CHARLES DICKENS</author>
BY by itself is redundant (we all know that authors write books), repetitious (yes, all books), and presentation rather than metadata. Whether the author’s name is introduced with “by” is logic that belongs outside the book’s content.
The XML fragment with the problems resolved resembles:
<frontmatter>
  <titlepage>
    <title>Oliver Twist</title>
    <subtitle alt="true">The Parish Boy's Progress</subtitle>
    <author>Charles Dickens</author>
  </titlepage>
</frontmatter>
Page Numbering
In a printed book or preformatted eBook, page numbers are incredibly useful. Within a plain text file, however, page numbers interfere with automatic typesetting because factors that affect the page count—page dimensions, font sizes, chapter sink, and margins—are not yet realized.
Some of the texts embed page numbers within the text, such as the following snippet from Moby Dick:
reveries. Some leaning against the spiles; some seated upon the pier-heads;
some looking over the bulwarks glasses!
..
<p 2 >
of ships from China; some high aloft in the rigging, as if striving to get a
still better seaward peep. But these are all landsmen; of week days pent up
Now we have three problems, grammar notwithstanding. First, there’s no way to tell whether the text after the page number should join with the text before the page number, such as at a paragraph boundary. Second, while computers are exceptional at counting, humans will make many data entry errors, such as double-counted pages:
..
<p 104 >
Mr. Flask --good-bye, and good luck to ye all --and this day three years I'll
<!-- skipped for brevity -->
heavy-hearted cheers, and blindly plunged like fate into the lone Atlantic.
..
<p 104 >
Third, the page numbers themselves were formatted inconsistently, which would have to be taken into account when writing a regular expression to eliminate the numbers and join the ~566 paragraphs together:
..
<p 109n. >
See subsequent chapters for something more on this head.
..
<p 110n. >
See subsequent chapters for something more on this head.
..
<p 110 >
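As a sketch (assuming the variants above are representative, which would need auditing against the whole corpus), one Python pattern handles all three marker forms:

```python
import re

# Matches <p 104 >, <p 109n. >, and <p 110 >: digits, an optional "n."
# footnote suffix, and inconsistent spacing inside the angle brackets.
PAGE_MARKER = re.compile(r"<p\s+\d+(?:n\.?)?\s*>")

def strip_page_markers(text):
    """Delete embedded page markers, then collapse leftover whitespace
    so the surrounding prose rejoins cleanly."""
    text = PAGE_MARKER.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

print(strip_page_markers("heavy-hearted cheers, <p 104 > and blindly plunged"))
```

Deciding whether the text on either side of a marker belongs to the same paragraph, however, still requires a human.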
Breathe. Remember to breathe. Moby Dick cannot be typeset automatically without extensive edits by a human, such as those made to produce the Standard Ebooks version.
Table of Contents
A nice feature included in the XML versions is that the tables of contents have been normalized in toc elements like the following:
<toc>
  <title>CONTENTS</title>
  <subtitle>Book the First--Recalled to Life</subtitle>
  <item>Chapter I The Period</item>
You know where this is going, though: all of the toc elements contain presentation logic that is also duplicated within the text. The following markup within the text body sheds light on the duplication:
<bookbody>
  <part>
    <titlepage>
      <title>Book the First--Recalled to Life</title>
    </titlepage>
For our purposes, we’ll ignore the toc element because the typesetting engine will recreate it automatically from the chapter headings.
Chapters in these files resemble the following:
<chapheader>
  <chapnum>I</chapnum>
  <title>The Period</title>
</chapheader>
Computers really do excel at counting. Whether to use Roman, Arabic, or Egyptian numerals is a design decision. We can safely ignore the chapnum element.
The DTD could be changed to suggest a numeral style that captures how the original publication was printed, which would cleanly separate concerns in a machine-readable fashion:
<chapheader numeral="roman">
  <title>The Period</title>
</chapheader>
Capitalization
In Huckleberry Finn, Robinson Crusoe, The Red Badge of Courage, and other texts, sometimes the first word of a chapter was entered in uppercase. Sometimes words within paragraphs were added in uppercase for emphasis, like “through” in Tom Sawyer or A Tale of Two Cities.
In the latter case, modern typesetting would prefer to use italics or bold to make the words stand out. In the former, it is the job of the presentation layer, be it cascading-style sheets, ConTeXt setups, LaTeX packages, or custom SILE extensions.
To clarify with an example from Tom Sawyer, this:

<para> SATURDAY morning was come, and all

would be stored instead as:

<para> Saturday morning was come, and all
Paragraphs
When writing in plain text, applying word wrap, line breaks, and paragraph breaks consistently can be difficult for the uninitiated. For the most part, the HTML Writers Guild made wonderfully consistent and machine-parsable paragraphs. Inside The Insidious Dr. Fu Manchu, however, are lines that cannot be parsed into paragraphs:
<para>
"That will do," said Smith, and I thought I detected a note of

triumph in his voice.

"But stay! Take us through to the back of the house."
</para>
Fortunately, it looks like the entire file has been double-spaced consistently, so it would be simple enough to fix with a regular expression applied using vim. The fix forces all lines in a paragraph to immediately follow each other without an intermediary blank line:
<para>
"That will do," said Smith, and I thought I detected a note of
triumph in his voice.
"But stay! Take us through to the back of the house."
</para>
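The same repair can be sketched in Python (a rough equivalent of the vim substitution; it assumes that blank lines immediately before a tag mark real structural boundaries, which holds for this file but would need verification elsewhere):

```python
import re

def tighten_paragraphs(xml):
    """Remove blank lines between consecutive prose lines, leaving
    blank lines that precede a tag (such as a following <para>)
    untouched."""
    return re.sub(r"\n\s*\n(?!\s*<)", "\n", xml)

broken = '<para>\n"That will do," said Smith.\n\n"But stay!"\n</para>'
print(tighten_paragraphs(broken))
```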
Moby Dick suffers from this affliction inconsistently. Moreover, multiple paragraphs are embedded within a single para element, in violation of the one paragraph per para element rule. Once again, the Standard Ebooks version provides a cleaner semantic markup:
<p>“So it is, so it is; if we get it.”</p>
<p>“I was speaking of the oil in the hold, sir.”</p>
Quotes
Using entities (such as paired left- and right-double quotes) allows complex nested quotes to be typeset unambiguously:
<para> “Violet said, ‘Rose yelled, “I'm cybed!” in elation,’” said Redd. </para>
This would produce:
“Violet said, ‘Rose yelled, “I’m cybed!” in elation,’” said Redd.
Most of the texts embed curly quotes directly into the text.
We could get a jump on burgeoning commodity text-to-speech (TTS) software by marking document speech as follows:
<para>
<q s="Redd">Violet said, <q s="Violet">Rose yelled, <q s="Rose">I'm cybed!</q> in elation,</q></q> said Redd.
</para>
The s attribute abbreviates “speaker” to reduce repetitive strain injuries. This eliminates ambiguity, eliminates obscure entities, is machine-readable, and enables TTS engines to change voices appropriately. Notice that because the quotes are nested, whether the TTS switches voices within nesting can be decided when exporting to audio.
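To see the idea in action, here is a Python sketch (built on the hypothetical q/s markup above, not on any existing Project Gutenberg format) that walks the nesting and yields one entry per speaker:

```python
import xml.etree.ElementTree as ET

def speech_turns(fragment):
    """Collect (speaker, quoted text) pairs from nested q elements so
    a TTS engine could assign a distinct voice to each speaker."""
    turns = []

    def walk(element):
        if element.tag == "q":
            # itertext() flattens the nested quotes into one utterance.
            turns.append((element.get("s"), "".join(element.itertext())))
        for child in element:
            walk(child)

    walk(ET.fromstring(fragment))
    return turns

para = ('<para><q s="Redd">Violet said, <q s="Violet">Rose yelled, '
        '<q s="Rose">I\'m cybed!</q> in elation,</q></q> said Redd.</para>')
for speaker, text in speech_turns(para):
    print(speaker, "->", text)
```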
It’s also extensible, meaning that expressiveness can be added if desired:
<q s="Rose" e="joy">I'm cybed!</q>
Spacing
Quite often, especially in poetry, people use spacing to signify markup. This happens even when semantic markup exists to separate poetic forms from the prose. For example, Call of the Wild uses:
<poem>
<line> "Old longings nomadic leap,</line>
<line> Chafing at custom's chain;</line>
<line> Again from its brumal sleep</line>
<line> Wakens the ferine strain."</line>
</poem>
Prester John obfuscates the markup to make the poem easier to read (or perhaps edit):
<poem><verse><line> 'Diving as if condemned to lave</line><line> Some demon's subterranean cave,</line><line> Who, prisoned by enchanter's spell,</line><line> Shakes the dark rock with groan and yell.' </line></verse></poem>
Adventures of Robin Hood provides a clever twist where the indentation is given outside of the lines to indent—transformation engines will ignore the whitespace by default:
<song>
<verse>
    <line>"_In peascod time, when hound to horn</line>
    <line>Gives ear till buck be killed,</line>
    <line>And little lads with pipes of corn</line>
    <line>Sit keeping beasts afield_--"</line>
</verse>
</song>
Treasure Island attempts to mark lines of a poem’s verse with classes such as indent3 and indent6, but these classes are extraneous and repetitious:
<poem>
<verse>
<line class="indent3"> If sailor tales to sailor tunes,</line>
<line class="indent6"> Storm and adventure, heat and cold,</line>
The whitespace can be removed, but for the cases where recreating the poem’s form would be laborious to codify, a special syntax is needed:
<poem type="ekphrastic"> </poem>
TEI’s poetry markup is comprehensive and a good source of ideas, but too verbose for simple poems found in fiction novels.
Section Breaks
Last on the list are section breaks. In professionally typeset novels, ornate illustrations can sometimes replace manuscript asterisks (* * *). In Call of the Wild, we find:
<para> * * * </para>
Preferably, a semantic section break element would be useful.
Or even one of these lesser-known, archaic, perilous, substandard tags:
<br class="section" /> <hr />
GITenberg
GITenberg has the same goal as Standard Ebooks. The main difference is that GITenberg aims to use AsciiDoc. While this is a step forward, there appears to be little attempt at giving a deeper semantic meaning to the prose. Other issues:
- Text includes page numbers.
- Class names bereft of meaning.
- Inconsistent use of header tags.
- Tables used for formatting.
- Mixes content with presentation.
- No distinction between salutations and valedictions.
- Posted letters are not easily distinguished from prose.
- Poems or verses are not marked as such.
- Possible use of br elements to replicate pagination.
While GITenberg is a marked improvement over Project Gutenberg, it would be rather arduous to typeset its novels automatically.
Standard Ebooks
Standard Ebooks are superior to the HTML Writers Guild books in many ways. Additionally, they appear to be kept up-to-date with a growing library of classics. With respect to automatic typesetting, here are some issues (none of which are insurmountable):
- Separate files. The chapters and metadata are in separate XML files, which must be recombined. This can be accomplished either using their epub tools or XSLT.
- Unicode characters. Unicode characters, such as hair spaces, embed presentation within the prose that must be removed because typesetting engines often have their own rules for typesetting hair spaces, em dashes, ellipsis, and similar.
- Blockquotes. Not all XHTML blockquote elements are classified, which makes detecting extended quotations a chore, and therefore difficult to prefix with > in Markdown.
Given the extraordinary consistency and detailed attention to modern typography, typesetting Standard Ebooks will produce the most aesthetically pleasing results.
XHTML to Markdown
There are a number of steps necessary to convert Standard Ebooks to Markdown. Broadly, these include:
- Download the book.
- Read the metadata file.
- Extract the title and author.
- Concatenate the chapters sequentially.
- Export each formatted chapter.
Even though ConTeXt can typeset XML documents, we’ll use XSLT—the verbose language only gurus grok without gripes—to convert XHTML into a Markdown document that pandoc can read to produce a native ConTeXt file.
Download and install the following tools before beginning:

- Java (a runtime capable of executing .jar files)
- Saxon HE (the XSLT processor)
Once installed, set an environment variable named SAXON_JAR to the fully qualified path (directory plus file name) for saxon-he-10.0.jar. Substitute the version of the software that was downloaded, if different.
Ensure the XSLT processor can run before continuing:
java -jar $SAXON_JAR
Download a Book
Once the requirements are met, open a terminal then run the following commands to download Jane Austen’s Pride and Prejudice:
mkdir -p $HOME/dev/writing/book/novels
cd $HOME/dev/writing/book/novels
git clone \
  https://github.com/standardebooks/jane-austen_pride-and-prejudice
The novel is downloaded.
Create a new file named se2md.xsl (meaning an extensible stylesheet for transforming Standard Ebooks to Markdown) that contains the following:
<?xml version="1.0"?>
<xsl:stylesheet version="3.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xsl:output method="text" encoding="utf-8" />

  <xsl:template match="/">
    <xsl:text>hello, world
</xsl:text>
  </xsl:template>
</xsl:stylesheet>
We’ll refer to the file as the stylesheet and the file to parse (content.opf) as the source document. Although the metadata files are always named content.opf, the epub specification defines the file name in ./src/epub/META-INF/container.xml. If ever Standard Ebooks renames the metadata file, the stylesheets will fail. Deriving the name from container.xml would be more robust: an adventure that is all yours.
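A sketch of that adventure in Python (the sample document mirrors the container format defined by the epub Open Container Format specification):

```python
import xml.etree.ElementTree as ET

NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}

def opf_path(container_xml):
    """Return the package document path declared by container.xml,
    instead of hard-coding content.opf."""
    root = ET.fromstring(container_xml)
    return root.find("./c:rootfiles/c:rootfile", NS).get("full-path")

sample = """<?xml version="1.0"?>
<container version="1.0"
    xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="epub/content.opf"
        media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""

print(opf_path(sample))  # epub/content.opf
```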
Confirm that the source document can be opened by running the XSLT processor with the stylesheet (-xsl:) and source document (-s:) options:

cd $HOME/dev/writing/book/novels
java -jar $SAXON_JAR \
  -xsl:se2md.xsl \
  -s:jane-austen_pride-and-prejudice/src/epub/content.opf
If the transformation worked, you should see:

hello, world
The metadata has been read by the XSLT processor, even though the stylesheet makes no use of it.
Title and Author
Replace the contents of se2md.xsl with the following:
<?xml version="1.0"?>
<!DOCTYPE xsl:stylesheet [
  <!ENTITY nl "
">
]>
<xsl:stylesheet version="3.0"
  xmlns:opf="http://www.idpf.org/2007/opf"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:dc="http://purl.org/dc/elements/1.1/">
  <xsl:output method="text" encoding="utf-8" />

  <xsl:template match="/ | opf:package | opf:metadata">
    <xsl:apply-templates />
  </xsl:template>

  <xsl:template match="dc:title[@id='title'] | dc:creator[@id='author']">
    <xsl:text>::: </xsl:text>
    <xsl:value-of select="@id" /><xsl:text>&nl;</xsl:text>
    <xsl:apply-templates /><xsl:text>&nl;</xsl:text>
    <xsl:text>:::&nl;&nl;</xsl:text>
  </xsl:template>

  <xsl:strip-space elements="*" />
  <xsl:template match="*" />
</xsl:stylesheet>
Run the same command to invoke the XSLT processor as before. The output should now resemble:
::: title
Pride and Prejudice
:::

::: author
Jane Austen
:::
A few key lines could use explaining, the first being:
<xsl:template match="/ | opf:package | opf:metadata">
As the XSLT processor reads the source document, the match attribute instructs the processor to look for any of the following elements:
- / – The root element, which comes before all elements.
- opf:package – Matches the package element within the opf XML namespace. You can see the XML namespace (xmlns) by opening content.opf in a plain text editor. Notice how package specifies an XML namespace (xmlns="http://www.idpf.org/2007/opf"). In our stylesheet, that same namespace is declared with an opf prefix. When the XSLT processor detects a package element in the source document, the match criteria in our stylesheet fires and the contents of the template are applied.
- opf:metadata – We also have to match the metadata (in the same namespace) because that’s where the title and author elements are nested.
Each template then uses <xsl:apply-templates /> to tell the XSLT processor to continue matching and applying additional templates, recursively, as it reads through the source file’s nested hierarchy.
The next line of interest is similar to the previous template:
<xsl:template match="dc:title[@id='title'] | dc:creator[@id='author']">
Like before, our stylesheet defines the dc namespace to be the same as the dc namespace declared inside the source document. This allows us to match both dc:title and dc:creator elements. We further specify the criteria by using an id attribute to home in on the exact value we want to include in the output document.
Another notable line follows:
<xsl:value-of select="@id" /><xsl:text>&nl;</xsl:text>
Upon matching the id attribute value of either title or author, we write said attribute value into the output document verbatim. The only issue here is that if the order of dc:title and dc:creator are swapped inside content.opf then the output document will be incorrectly ordered.
Concatenate the Chapters
If any part of the implementation could be considered fun, this would be it. Let’s break down the overall steps we want to accomplish:
- Combine all the chapters.
- For each section, export its heading.
- Transform all relevant XHTML elements into Markdown.
- Map Unicode characters to Markdown equivalents.
The first step is accomplished using the following dense snippet:
<xsl:template match="opf:manifest">
  <xsl:variable name="book">
    <book>
      <xsl:copy-of select="document(
        opf:item[
          @media-type='application/xhtml+xml' and
          substring( @id, 0, 8 )='chapter']/@href, . )
        /h:html/h:body/h:section" />
    </book>
  </xsl:variable>
  <xsl:apply-templates select="$book" />
</xsl:template>
This creates a variable named book that contains the following overall XML structure for all concatenated chapters:
<book>
  <section epub:type="volume">
    <section epub:type="part">
      <section epub:type="chapter">
        <p>First section's text.</p>
      </section>
    </section>
  </section>
  <section epub:type="volume">
    <section epub:type="part">
      <section epub:type="chapter">
        <p>Second section's text.</p>
      </section>
    </section>
  </section>
</book>
Without wrapping the book element around each chapter’s enclosing section element, there would be no easy way to detect whether a section has a preceding section. We need to check for preceding sections to determine whether the volume or part for a particular chapter is a continuation of the previous volume or part.
Next up is this beast:
<xsl:copy-of select="document(
  opf:item[
    @media-type='application/xhtml+xml' and
    substring( @id, 0, 8 )='chapter']/@href, . )
  /h:html/h:body/h:section" />
Opening content.opf reveals its structure with respect to the chapter items:

<package>
  <metadata>
  <manifest>
    <item href="text/chapter" id="chapter" media-type="...">
The template declaration (<xsl:template match="opf:manifest">) matches on manifest (in the “opf” namespace), which provides local access to its nested item elements (in the same namespace). We want to extract the href attribute from all item elements to get the relative path to each chapter’s file. After getting its relative path, we want to read that chapter’s XHTML content. Therefore:
- copy-of – creates a deep copy of whatever was selected;
- document( – calls the document function to read an XML file;
- opf:item[ – selects opf:item elements meeting specific criteria;
- @media-type='application/xhtml+xml' – focuses on opf:item elements that are marked as XML documents;
- and substring( @id, 0, 8 )='chapter'] – and have an id attribute beginning with the word chapter (see aside, below);
- /@href – extracts the href attribute from the opf:item, for example text/chapter-1.xml, which is passed into the document function;
- . – instructs the XSLT processor to use a relative path from the XML document’s directory when reading the files; and
- /h:html/h:body/h:section – means to discard the html and body elements from the XHTML document, returning only the section elements.
Aside, Standard Ebooks does not have a machine-readable way to tell chapter files apart from other file types. We fudge it by checking that the @id attribute of each item in the manifest starts with chapter. String comparisons are almost always brittle solutions in software development because there is no guaranteed contract that defines a small, finite set of possible values that everyone agrees upon. (For example, if a group of editors translated the books into French, they could prefix the chapter files with chapitre instead, which would break the stylesheet’s code.) Ideally, each item element would be classified with a value that could be used to distinguish chapter files from supplementary files.
The second step entails exporting the heading for each chapter file. Replace the contents of se2md.xsl with the following:
<?xml version="1.0"?>
<!DOCTYPE xsl:stylesheet [
  <!ENTITY nl "
">
]>
<xsl:stylesheet version="3.0"
  xmlns:opf="http://www.idpf.org/2007/opf"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:h="http://www.w3.org/1999/xhtml"
  xmlns:epub="http://www.idpf.org/2007/ops">
  <xsl:output method="text" encoding="utf-8" />

  <xsl:template match="/ | opf:package | opf:metadata">
    <xsl:apply-templates />
  </xsl:template>

  <xsl:template match="dc:title[@id='title'] | dc:creator[@id='author']">
    <xsl:text>::: </xsl:text>
    <xsl:value-of select="@id" /><xsl:text>&nl;</xsl:text>
    <xsl:apply-templates /><xsl:text>&nl;</xsl:text>
    <xsl:text>:::&nl;&nl;</xsl:text>
  </xsl:template>

  <xsl:template match="opf:manifest">
    <xsl:variable name="book">
      <book>
        <xsl:copy-of select="document(
          opf:item[
            @media-type='application/xhtml+xml' and
            substring( @id, 0, 8 )='chapter']/@href, . )
          /h:html/h:body/h:section" />
      </book>
    </xsl:variable>
    <xsl:apply-templates select="$book" />
  </xsl:template>

  <xsl:template match="book">
    <xsl:apply-templates />
  </xsl:template>

  <xsl:template match="h:section[matches( @epub:type, '.*' )]">
    <xsl:if test="count( preceding::h:section[@id=current()/@id] ) = 0">
      <xsl:for-each select="0 to count( ancestor::h:section )">
        <xsl:text>#</xsl:text>
      </xsl:for-each>
      <xsl:text> </xsl:text>
      <xsl:value-of select="@id" />
      <xsl:text>&nl;&nl;</xsl:text>
    </xsl:if>
    <xsl:apply-templates />
  </xsl:template>

  <xsl:strip-space elements="*" />
  <xsl:template match="*" />
</xsl:stylesheet>
Re-run the XSLT processor to see:
::: title
Pride and Prejudice
:::

::: author
Jane Austen
:::

# chapter-1

# chapter-2
When run against Victor Hugo’s Les Misérables, we see:
::: title
Les Misérables
:::

::: author
Victor Hugo
:::

# volume-1

## book-1-1

### chapter-1-1-1

### chapter-1-1-2

... skipped for brevity...

## book-1-2

... skipped for brevity...

# volume-2

## book-2-1

### chapter-2-1-1
Each XHTML chapter file may repeat the volume and part number. This interferes with our ability to both autogenerate a table of contents and start each volume or part on a new page. Recall that we introduced a book element to nest all the concatenated XHTML document sections together. Let’s look a little closer at how this is leveraged:
<xsl:template match="h:section[matches( @epub:type, '.*' )]">
  <xsl:if test="count( preceding::h:section[@id=current()/@id] ) = 0">
    <xsl:for-each select="0 to count( ancestor::h:section )">
      <xsl:text>#</xsl:text>
    </xsl:for-each>
    <xsl:text> </xsl:text>
    <xsl:value-of select="@id" />
    <xsl:text>&nl;&nl;</xsl:text>
  </xsl:if>
  <xsl:apply-templates />
</xsl:template>
The above template is fairly generic in that it isn’t specific to any one type of section. It handles volume, part, chapter, and any other nesting levels or names that the XHTML throws at it. Upon inspection:
- h:section[matches( @epub:type, '.*' )] – Matches any section element (in the XHTML namespace) that has an epub:type attribute having at least one character.
- count( ... ) = 0 – Guards against emitting redundant headings.
- preceding::h:section[@id=current()/@id] – Collects all previous section elements that have the same id attribute as the current section. For example, if the current section has an id of volume-1 and a previous section element had an id of volume-1, then the result from the count function will be greater than zero. Thus repeated sections are skipped.
- 0 to count( ancestor::h:section ) – Iterates up the nested chain of section elements such that the nesting depth controls the number of # symbols written. If the book has volumes, parts, and chapters, then each chapter will be marked using three # symbols.
Note that the text for each heading is really a placeholder. When styling the chapters using ConTeXt, the text will be rewritten altogether. To reiterate, the choice of how to represent numerals is a presentation decision.
For the third step, we want to convert each XHTML element into its equivalent Markdown. In the interest of brevity, here’s how this is accomplished for a few simple XHTML elements:
<xsl:template match="h:p">
  <xsl:apply-templates />
  <xsl:text>&nl;&nl;</xsl:text>
</xsl:template>

<xsl:template match="h:em | h:i">
  <xsl:text>_</xsl:text>
  <xsl:apply-templates />
  <xsl:text>_</xsl:text>
</xsl:template>

<!-- Bold is swapped for small caps by the typesetting engine. -->
<xsl:template match="h:strong | h:b">
  <xsl:text>**</xsl:text>
  <xsl:apply-templates />
  <xsl:text>**</xsl:text>
</xsl:template>

<xsl:template match="h:abbr | h:span">
  <xsl:apply-templates />
</xsl:template>
And so on. The full conversion is quite long; having explained the high- and many low-level concepts necessary to do the conversion, we’ll forgo delving into the technical minutiae of the stylesheet code for converting the remaining XHTML elements. Go on, thank me for sparing you.
The fourth and final step isn’t immediately obvious and may not be entirely necessary, if you are fine with letting pandoc and ConTeXt figure out how to handle Unicode characters. Otherwise, inject the following code into the stylesheet:
<xsl:output method="text" encoding="utf-8"
  use-character-maps="ununicode" />

<!-- Map specific Unicode characters to Markdown equivalents. -->
<xsl:character-map name="ununicode">
  <!-- hair space -->
  <xsl:output-character character="&#x200A;" string="" />
  <!-- ellipsis -->
  <xsl:output-character character="…" string="..." />
  <!-- en-dash -->
  <xsl:output-character character="–" string="--" />
  <!-- em-dash -->
  <xsl:output-character character="—" string="---" />
  <!-- two em-dash -->
  <xsl:output-character character="⸺" string="--- ---" />
  <!-- three em-dash -->
  <xsl:output-character character="⸻" string="--- --- ---" />
</xsl:character-map>
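For anyone preprocessing outside of XSLT, the same mapping is a short Python sketch (the character set mirrors the map above):

```python
# Mirror of the XSLT character map: typographic Unicode characters
# mapped back to Markdown-friendly ASCII sequences.
UNUNICODE = str.maketrans({
    "\u200a": "",             # hair space
    "\u2026": "...",          # ellipsis
    "\u2013": "--",           # en-dash
    "\u2014": "---",          # em-dash
    "\u2e3a": "--- ---",      # two-em dash
    "\u2e3b": "--- --- ---",  # three-em dash
})

def ununicode(text):
    return text.translate(UNUNICODE)

print(ununicode("wait\u2014no\u2026"))  # wait---no...
```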
As a starting point, the downloadable stylesheet (below) transforms many div environments, poetry, tables, and more.
The end? Well, almost.
The Markdown output contains many blocks that resemble:
::: annotation
Text
:::
These will have to be translated into ConTeXt environments and styled separately. For now, our mission is accomplished: by and large, we have translated classic novels marked up by Standard Ebooks into Markdown.
Download the complete stylesheet and build script, released under an MIT license. Be sure to copy the files into $HOME/bin for the build script to work.
In practice, communicating and formalizing a syntax that wholly separates content from presentation is hard. Even when the intent is clear, such as with the HTML Writers Guild and Standard Ebooks, there are a plethora of ways that the two get inseparably mingled. In this review of Project Gutenberg Projects, we encountered the following issues and ideas:
- By line – “By” is implied just by a work being authored
- Subtitle – “Or” can be made machine-readable
- Page numbering – Avoid transcribing numbers altogether
- Table of Contents – Machine-generate from chapter titles
- Chapter numbers – Machine-generate and avoid styling
- Capitalization – Avoid all caps, let typesetting change the case
- Paragraphs – One paragraph per element, keep lines together
- Quotes – Consider marking up speech using TTS-friendly elements
- Spacing – Prefer semantic markup, avoid indenting with spaces
- Section breaks – Use semantic markup that a computer can style
Standard Ebooks avoid many pitfalls in their separation of content from presentation.
About the Author
My career has spanned tele- and radio communications, enterprise-level e-commerce solutions, finance, transportation, modernization projects in both health and education, and much more.
Delighted to discuss opportunities to work with revolutionary companies combatting climate change.