Typesetting Markdown -- Part 1: Build Script
This series describes a way to typeset Markdown content using the powerful typesetting engine ConTeXt.
Introduction
Separating document text (the written word) from its appearance (colours, fonts, and layout) makes it possible to create consistent corporate branding, ease documentation maintenance, simplify collaborative real-time editing, reliably embed machine-generated information, simultaneously publish multiple digital formats, and increase productivity.
Markdown is a document format designed to make writing documents easy. Here’s an example:
# Prolonged Bombardment
## 4,500 to 3,500
Late into its development, Earth sustained impacts from comets, asteroids, and huge celestial objects astronomers call _planetesimals_...
ConTeXt can help reformat such text into documents resembling:
Using the same concepts, it’s possible to generate technical documentation:
Overview
Provided enough interest, the series will include the following parts:
- Build Script -- create user-friendly shell scripts
- Tool Review -- describe how the toolset works
- Automagicify -- continuously integrate typesetting
- Theme Style -- define colours, fonts, and layout
- Interpolation -- define and use external variables
- Computation -- leverage R for calculations
- Mathematics -- beautifully typeset equations
- Annotations -- apply different styles to annotated text
- Figures -- draw figures using MetaPost
Requirements
Readers must have some programming experience to follow along and must be familiar with Linux or similar operating systems.
Have the following tools ready for this part:
- bash, a command language
Shell Script Template
When performing the same steps many times---such as compiling a document---it’s convenient to have one script that performs all those steps. A user-friendly shell script:
- can run from any directory;
- can show useful usage information;
- parses and uses command-line arguments;
- informs the user of missing software requirements; and
- displays meaningful logging messages while running.
Let’s create a reusable template that addresses these requirements.
Any Directory
When writing a bash script that can be launched from any directory, first determine the fully qualified path to the script itself, as the following lines demonstrate:
#!/usr/bin/env bash
readonly SCRIPT_SRC="$(dirname "${BASH_SOURCE[0]}")"
readonly SCRIPT_DIR="$(cd "$SCRIPT_SRC" >/dev/null 2>&1 && pwd)"
readonly SCRIPT_NAME=$(basename "$0")
The first line indicates that the script must be run using bash because the script uses bash-specific features. Accordingly, the script is named build, rather than build.sh.
The second line extracts the path to the directory where the script resides, regardless of where or how the script was invoked, and has a few parts:
- $(...) -- runs the command given inside the parentheses
- dirname -- command to extract the directory path to a given file
- ${...} -- gets a variable value (in this case, the script’s filename)
- "..." -- double quotes ensure filename spaces are handled correctly
The third line changes to the script’s directory, and, if successful, captures the fully qualified path to the script, excluding the script’s filename:
- >/dev/null -- discard messages written to standard output
- 2>&1 -- also discard messages written to standard error
- && -- only run the second command if the first succeeded
- pwd -- print the working directory to standard output
Any information printed by the third line to standard output is captured by the SCRIPT_DIR variable. Changing to the script’s directory within the script itself allows the script to run successfully even when invoked from a different working directory.
The fourth line uses the basename command to capture only the filename of the script that was run---without any directory names. This will be useful when informing the user how to use the script.
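The behaviour of dirname and basename can be sketched with a hypothetical path (the /home/user/project location is purely illustrative):

```shell
#!/usr/bin/env bash
# Illustrative only: split a hypothetical script path into its parts.
path="/home/user/project/build"

dir="$(dirname "${path}")"    # directory portion
name="$(basename "${path}")"  # filename portion

echo "${dir}"   # /home/user/project
echo "${name}"  # build
```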
Script Functions
Functions are useful ways to reuse and organise code. When scripting, it is expedient to see the high-level structure near the top of the file. There are a couple of ways to declare functions in bash, the most syntactically terse being:
function_name() {
  echo "function_name called"
}
The parentheses on the first line instruct bash that a new function is being declared with the given name. When called, the code within the curly braces is executed.
Entry Point
For the script template, a main() function is introduced near the top of the file:
main() {
  parse_commandline "$@"

  if [ -n "${ARG_HELP}" ]; then
    show_usage
    exit 3
  fi

  log "Check for missing software requirements"
  validate_requirements

  if [ "${REQUIRED_MISSING}" -gt "0" ]; then
    exit 4
  fi

  cd "${SCRIPT_DIR}" && execute_tasks
}
There is nothing special about the name main(): it is merely a convention that indicates to human readers where to find the script’s starting point. The main() function’s overall algorithm is straightforward:
- Update the script’s settings using command-line options.
- Take no significant action if help is requested.
- Ensure the required software packages are available.
- Execute all tasks starting from the script’s directory.
Notice that log(), parse_commandline(), validate_requirements(), and other functions are called but not yet declared. When bash runs a script, it does so from the top down. Function declarations are not executed until specifically invoked by the script. Since the entire script is loaded into memory before running, main() will be invoked as the last line in the script’s file (not yet shown), after all the functions have been declared. This allows the script to be organised by importance and algorithmic flow.
The following line calls parse_commandline() and passes into it all the parameters that were passed into main() by way of the $@ variable:
parse_commandline "$@"
If only four options were possible, it may be tempting to write:
parse_commandline "$1" "$2" "$3" "$4"
However, using $@ means that the line of code need not change when additional command-line options are added. Plus, it’s fewer keystrokes.
After parsing the command-line options, the following line determines whether to display a friendly help message:
if [ -n "${ARG_HELP}" ]; then
Another useful convention is to name variables with prefixes that suggest their kind of usage (a variation on Apps Hungarian notation, not the abysmal Systems Hungarian). For scripting, any variable prefixed with ARG_ denotes a value that can be set using a command-line argument. The line above checks whether the ARG_HELP variable contains something other than the empty string.
If not empty (-n), inform the user of the available command-line options and subsequently terminate the script with exit code 3, the first non-reserved exit code:
show_usage
exit 3
Finally, change to the script’s directory and execute all the tasks required:
cd "${SCRIPT_DIR}" && execute_tasks
As every script has a different purpose, the template defines an effectively empty placeholder for execute_tasks().
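The placeholder itself is not listed in this part; a minimal stand-in---to be replaced with real build steps per project---might look like:

```shell
#!/usr/bin/env bash
# Hypothetical no-op placeholder; real scripts replace the body with build steps.
execute_tasks() {
  echo "No tasks defined"
}

execute_tasks
```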
Useful Usage
While competing syntaxes exist for standard usage messages, scripts do not tend to be as complex as applications. As such, a simple approach to help is generally sufficient:
show_usage() {
  printf "Usage: %s [OPTION...]\n" "${SCRIPT_NAME}" >&2
  printf " -d, --debug\t\tLog messages while processing\n" >&2
  printf " -h, --help\t\tShow this help message then exit\n" >&2
}
The help message is written to standard error (>&2) because the script uses standard output exclusively for log messages. This is a convention for this particular script.
Command-line Parsing
Unlike typical computer languages, bash does not name function parameters. Instead, parameters are numbered $1 to $9, which is limiting; however, workarounds exist for using an arbitrary number of parameters.
There are many ways to parse command-line options. My preference is to avoid combined short options (e.g., -vfd) while offering short and long options, which reduces the parsing logic to:
parse_commandline() {
  while [ "$#" -gt "0" ]; do
    local consume=1

    case "$1" in
      -d|--debug)
        ARG_DEBUG="true"
        ;;
      -h|-\?|--help)
        ARG_HELP="true"
        ;;
      *)
        # Skip argument
        ;;
    esac

    shift ${consume}
  done
}
The first line of the function loops while positional parameters remain:
while [ "$#" -gt "0" ]; do
The $# variable represents the number of parameters passed into the function. By default, the shift command consumes a single parameter, thereby decreasing the value of $#. For each iteration of the loop, the number of parameters consumed is controlled by the value stored in consume. The loop ends when no parameters remain; since at least one parameter is always consumed per iteration, the loop is guaranteed to terminate (as long as consume is never programmed to be less than 1).
Next, $1 becomes the first positional parameter in the list of ever-shifting command-line arguments passed into the function. Each successive loop iteration changes the value of $1 because the following line removes one or more parameters:
shift ${consume}
When the loop completes, the command-line arguments will have been parsed and---by coding to convention---assigned to global variables having an ARG_ prefix. In this way the code does not introduce arbitrary limits on the number of arguments it accepts.
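To see the loop in action, the parsing function can be exercised in isolation; the trailing echo is only for demonstration:

```shell
#!/usr/bin/env bash
# Re-run the parsing loop from above against a sample argument list.
unset ARG_DEBUG ARG_HELP

parse_commandline() {
  while [ "$#" -gt "0" ]; do
    local consume=1
    case "$1" in
      -d|--debug) ARG_DEBUG="true" ;;
      -h|-\?|--help) ARG_HELP="true" ;;
      *) ;;  # Skip argument
    esac
    shift ${consume}
  done
}

parse_commandline --debug ignored --help
echo "debug=${ARG_DEBUG} help=${ARG_HELP}"
```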
To parse an option that takes an additional argument, introduce a new condition that consumes two parameters instead of one. For example, to parse a filename option (e.g., -f file.txt), write:
-f|--filename)
  ARG_FILENAME="$2"
  consume=2
  ;;
The value for the second argument is stored in $2, which is assigned to the global ARG_FILENAME variable.
Missing Requirements
Making user-friendly shell scripts means informing users of what’s required to run the software. Ideally, scripts would ask users for permission to install the required software packages. This approach has two problems. First, package managers differ from system to system (apt, brew, choco, cydia, dpkg, install, macports, pacman, portage, rpm, smit, tazpkg, yum, and zypper, to run the alphabet); there is no POSIX-compliant install command that “just works” for the most common use cases---installing and uninstalling software---across platforms. Second, the same software package may differ in name and content between distributions.
That leaves checking the requirements and informing users what they must install themselves:
required() {
  local missing=0

  if ! command -v "$1" > /dev/null 2>&1; then
    warning "Missing requirement: install $1 ($2)"
    missing=1
  fi

  REQUIRED_MISSING=$(( REQUIRED_MISSING + missing ))
}
The command command is a POSIX way to discover whether a particular program can be run from the command line. Using command instead of which is strongly recommended for bash scripts.
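A quick sanity check shows the pattern; sh is used as the probe here because it exists on any POSIX system:

```shell
#!/usr/bin/env bash
# command -v returns success when the program is found, failure otherwise.
if command -v sh > /dev/null 2>&1; then
  echo "found sh"
fi

if ! command -v no-such-program-here > /dev/null 2>&1; then
  echo "missing no-such-program-here"
fi
```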
The following line displays a program name to install along with a URL:
warning "Missing requirement: install $1 ($2)"
Rather than force users to run the script several times to discover all the missing requirements, the following line tallies the number of missing commands:
REQUIRED_MISSING=$(( REQUIRED_MISSING + missing ))
If REQUIRED_MISSING is found to be greater than zero, the script will terminate due to the following lines:
if [ "${REQUIRED_MISSING}" -gt "0" ]; then
  exit 4
fi
Reusing the required() function resembles the following:
validate_requirements() {
  required context "https://wiki.contextgarden.net"
  required pandoc "https://www.pandoc.org"
  required gs "https://www.ghostscript.com"
}
Note how neither show_usage() nor validate_requirements() terminates the script. To do so would violate the single responsibility principle. That is, the only reason show_usage() should change is to update the help message; if show_usage() contained an exit statement, then the function would have more than one reason to change: usage updates, exit code values, and program control flow.

Broadening the function’s scope by including and in the function name, such as show_usage_and_exit(), does not render the single responsibility principle inapplicable any more than calling a duck a disco ball makes our feathered friend reflective. On the contrary, the word and in a function name suggests that the principle has been violated.
Colourful Logging
Informational messages for this script have three flavours: log, warning, and error. A log message displays to the user what the script is about to do (or has done). A warning message indicates a problem that doesn’t necessarily mean the script will fail or has failed. An error is a fatal condition that requires fixing. Whether or not log messages are displayed is controlled by the ARG_DEBUG variable; warning and error messages are always displayed.
ANSI escape sequences can help call users’ attention to problems or key details. A reusable function to display a line of text in a particular colour could look as follows:
coloured_text() {
  printf "%b%s%b\n" "$2" "$1" "${COLOUR_OFF}"
}
The coloured_text() function accepts the following parameters:

- $1 -- text message to display; and
- $2 -- text message colour.
Where the code gets tricky is understanding how the function’s parameters are used by the printf command. The printf command is given the following arguments:

- "%b%s%b\n" -- specifies how to format all subsequent arguments;
- "$2" -- text message colour;
- "$1" -- text message to display; and
- "${COLOUR_OFF}" -- escape sequence to stop colouring text.
Consider the printf argument "%b%s%b\n", known as the format specifier:

- %b (the first) is replaced by the ANSI escape sequence for the text message’s colour;
- %s is replaced by the text message to write;
- %b (the second) is replaced by the ANSI escape sequence for turning off coloured text; and
- \n writes a newline character.
Effectively, the first and second %b are replaced by $2 and COLOUR_OFF, respectively. Since %s is bookended by %b specifiers, the result is that the script writes the following:

- Colour On ($2)
- Text Message ($1)
- Colour Off (${COLOUR_OFF})
Using the function is far simpler than explaining how it works:
warning() {
  coloured_text "$1" "${COLOUR_WARNING}"
}
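The error() function is not listed in this part; following the same pattern, it would presumably look like the sketch below (COLOUR_ERROR and its underlying constant are defined later, under Constants):

```shell
#!/usr/bin/env bash
# Sketch: error() mirrors warning(), differing only in the colour used.
readonly COLOUR_DKRED='\033[31m'
readonly COLOUR_OFF='\033[0m'
readonly COLOUR_ERROR=${COLOUR_DKRED}

coloured_text() {
  printf "%b%s%b\n" "$2" "$1" "${COLOUR_OFF}"
}

error() {
  coloured_text "$1" "${COLOUR_ERROR}"
}

error "Fatal condition"
```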
Any time warning() is called, its text message parameter ($1) is displayed in the warning colour, described later. An example warning call looks like:
warning "Install pandoc (https://pandoc.org)"
The warning() and error() functions differ only by colour, whereas the log() function is slightly more feature-rich:
log() {
  if [ -n "${ARG_DEBUG}" ]; then
    printf "[%s] " "$(date +%H:%M:%S.%4N)"
    coloured_text "$1" "${COLOUR_LOGGING}"
  fi
}
Setting ARG_DEBUG to true, false, or any non-empty string will enable logging, since -n examines string length, not contents. This isn’t a logic issue, because users cannot set the value of ARG_DEBUG directly, though it may be considered a maintenance issue.
Every logging statement is prefixed with the current time in hours (%H), minutes (%M), seconds (%S), and ten-thousandths of a second (%4N). When improving shell script performance, it is useful to see which commands take the most time. For long-running scripts, it may be helpful to include the date.
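The timestamp format can be tried on its own; note that %N (nanoseconds) is a GNU date extension, so this sketch assumes GNU coreutils:

```shell
#!/usr/bin/env bash
# Build the log prefix: hours, minutes, seconds, and four sub-second digits.
stamp="$(date +%H:%M:%S.%4N)"
printf "[%s] message\n" "${stamp}"
```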
Constants
Defining constant ANSI escape sequence colours makes for convenient references, such as:
readonly COLOUR_BLUE='\033[1;34m'
readonly COLOUR_PINK='\033[1;35m'
readonly COLOUR_DKGRAY='\033[30m'
readonly COLOUR_DKRED='\033[31m'
readonly COLOUR_YELLOW='\033[1;33m'
readonly COLOUR_OFF='\033[0m'
Directly using these colour names elsewhere in the script is vulgar because when someone decides to update the colour, what often happens is:

- colour constants are changed, e.g., COLOUR_BLUE is no longer blue; or
- colour constants are swapped with different colours within the script.
Either of these actions makes the script more time-consuming to change, which decreases maintainability. By defining a set of logical colours that are used throughout the script consistently, changing colours is isolated to a single place in the code:
readonly COLOUR_LOGGING=${COLOUR_BLUE}
readonly COLOUR_WARNING=${COLOUR_YELLOW}
readonly COLOUR_ERROR=${COLOUR_DKRED}
Extending this to allow user-controlled colours (or themes) would be trivial. Furthermore, this concept is especially applicable to cascading stylesheets.
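As one hedged sketch of such an extension, a hypothetical THEME environment variable could select the logical colours; the variable name and the mono theme are inventions for illustration, not part of the template:

```shell
#!/usr/bin/env bash
# Hypothetical theme switch: pick logical colours from an environment variable.
readonly COLOUR_BLUE='\033[1;34m'
readonly COLOUR_DKGRAY='\033[30m'

if [ "${THEME:-}" = "mono" ]; then
  readonly COLOUR_LOGGING=${COLOUR_DKGRAY}
else
  readonly COLOUR_LOGGING=${COLOUR_BLUE}
fi

echo "logging colour set"
```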
Initial State
Shell scripts cannot depend on variables being empty: variables could have been prepopulated using values from the environment. To handle such situations, clear the variables used by the script with unset:
unset ARG_HELP
unset ARG_DEBUG
unset REQUIRED_MISSING
Call Main
The last line of the script calls the entry point:
main "$@"
All command-line arguments, denoted by $@, are passed into main(), which subsequently passes them into parse_commandline(). Enclosing $@ in double quotes is important when parsing arguments that take strings, for example:
./script -d --message "Strings are single argument values"
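The difference quoting makes can be demonstrated by counting how many arguments arrive; the helper names below are illustrative:

```shell
#!/usr/bin/env bash
# Compare quoted "$@" against unquoted $@ for arguments containing spaces.
count_args() {
  echo "$#"
}

demo() {
  QUOTED="$(count_args "$@")"  # argument boundaries preserved
  UNQUOTED="$(count_args $@)"  # re-split on whitespace
}

demo -d --message "Strings are single argument values"
echo "quoted=${QUOTED} unquoted=${UNQUOTED}"
```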
Linting
The term “lint”, with respect to software development, refers to unwanted bits of fiber and fluff found in sheep’s wool: an analogy for undesirable bits in code. Linters can warn developers about syntax errors, undeclared variables, deprecated language features, and more. Such software tools are especially useful for interpreted languages like bash. Use shellcheck to report possible issues:
shellcheck -s bash build
Download
Download the starter build script, distributed under the MIT license.
Alternatives
See also bash3boilerplate, which has similar goals and additional features.
Summary
This part introduced a user-friendly reusable build script template. Part 2 walks through how pandoc and ConTeXt can generate a PDF file from a Markdown document.
Contact
About the Author
My career has spanned tele- and radio communications, enterprise-level e-commerce solutions, finance, transportation, modernization projects in both health and education, and much more.
Delighted to discuss opportunities to work with revolutionary companies combatting climate change.