Typesetting Markdown -- Part 1: Build Script
This series describes a way to typeset Markdown content using the powerful typesetting engine ConTeXt.
Introduction
Separating document text (the written word) from its appearance (colours, fonts, and layout) makes it possible to create consistent corporate branding, ease documentation maintenance, simplify collaborative real-time editing, reliably embed machine-generated information, simultaneously publish multiple digital formats, and increase productivity.
Markdown is a document format designed to make writing documents easy. Here’s an example:
# Prolonged Bombardment
## 4,500 to 3,500
Late into its development, Earth sustained impacts from comets, asteroids, and huge celestial objects astronomers call _planetesimals_...
ConTeXt can help reformat such text into documents resembling:
Using the same concepts, it’s possible to generate technical documentation:
Overview
Provided enough interest, the series will include the following parts:
- Build Script -- create user-friendly shell scripts
- Tool Review -- describe how the toolset works
- Automagicify -- continuously integrate typesetting
- Theme Style -- define colours, fonts, and layout
- Interpolation -- define and use external variables
- Computation -- leverage R for calculations
- Mathematics -- beautifully typeset equations
- Annotations -- apply different styles to annotated text
- Figures -- draw figures using MetaPost
Requirements
Readers must have some programming experience to follow along and must be familiar with Linux or similar operating systems.
Have the following tools ready for this part:
- bash, a command language
Shell Script Template
When performing the same steps many times---such as compiling a document---it’s convenient to have one script that performs all those steps. A user-friendly shell script:
- can run from any directory;
- can show useful usage information;
- parses and uses command-line arguments;
- informs the user of missing software requirements; and
- displays meaningful logging messages while running.
Let’s create a reusable template that addresses these requirements.
Any Directory
When writing a bash script that can be launched from any directory, first determine the fully qualified path to the script itself, as the following lines demonstrate:
#!/usr/bin/env bash
readonly SCRIPT_SRC="$(dirname "${BASH_SOURCE[0]}")"
readonly SCRIPT_DIR="$(cd "$SCRIPT_SRC" >/dev/null 2>&1 && pwd)"
readonly SCRIPT_NAME=$(basename "$0")
The first line indicates that the script must be run using bash because the script uses bash-specific features. Accordingly, the script is named build, rather than build.sh.
The second line extracts the path to the directory where the script resides, regardless of where or how the script was invoked, and has a few parts:
- $(...) -- runs the command given inside the parentheses
- dirname -- command to extract the directory path to a given file
- ${...} -- gets a variable value (in this case, the script’s filename)
- "..." -- double quotes ensure filename spaces are handled correctly
The third line changes to the script’s directory, and, if successful, captures the fully qualified path to the script, excluding the script’s filename:
- >/dev/null -- discard messages written to standard output
- 2>&1 -- also discard messages written to standard error
- && -- only run the second command if the first succeeded
- pwd -- print the working directory to standard output
Any information printed by the third line to standard output is captured by the SCRIPT_DIR variable. Changing to the script’s directory within the script itself allows the script to run successfully even when invoked from a different working directory.
The fourth line uses the basename command to capture only the filename of the script that was run---without any directory names. This will be useful when informing the user how to use the script.
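The behaviour of dirname and basename can be sketched with a hypothetical path (the /home/user/project location is purely illustrative):

```shell
#!/usr/bin/env bash
# Illustrative only: split a hypothetical script path into its parts.
path="/home/user/project/build"

dir="$(dirname "${path}")"    # directory portion
name="$(basename "${path}")"  # filename portion

echo "${dir}"   # /home/user/project
echo "${name}"  # build
```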
Script Functions
Functions are useful ways to reuse and organise code. When scripting, it is expedient to see the high-level structure near the top of the file. There are a couple of ways to declare functions in bash, the most syntactically terse being:
function_name() {
  echo "function_name called"
}
The parentheses on the first line instruct bash that a new function is being declared with the given name. When called, the code within the curly braces is executed.
Entry Point
For the script template, a main() function is introduced near the top of the file:
main() {
  parse_commandline "$@"

  if [ -n "${ARG_HELP}" ]; then
    show_usage
    exit 3
  fi

  log "Check for missing software requirements"
  validate_requirements

  if [ "${REQUIRED_MISSING}" -gt "0" ]; then
    exit 4
  fi

  cd "${SCRIPT_DIR}" && execute_tasks
}
There is nothing special about the name main(): it is merely a convention that indicates to human readers where to find the script’s starting point. The main() function’s overall algorithm is straightforward:
- Update the script’s settings using command-line options.
- Take no significant action if help is requested.
- Ensure the required software packages are available.
- Execute all tasks starting from the script’s directory.
Notice that log(), parse_commandline(), validate_requirements(), and other functions are called but not yet declared. When bash runs a script, it does so from the top down. Function declarations are not executed until specifically invoked by the script. Since the entire script is loaded into memory before running, main() will be invoked as the last line in the script’s file (not yet shown), after all the functions have been declared. This allows the script to be organised by importance and algorithmic flow.
The following line calls parse_commandline() and passes into it all the parameters that were passed into main() by way of the $@ variable:
parse_commandline "$@"
If only four options were possible, it may be tempting to write:
parse_commandline "$1" "$2" "$3" "$4"
However, using $@ means that the line of code need not change when additional command-line options are added. Plus, it’s fewer keystrokes.
After parsing the command-line options, the following line determines whether to display a friendly help message:
if [ -n "${ARG_HELP}" ]; then
Another useful convention is to name variables with prefixes that suggest their kind of usage (a variation on Apps Hungarian notation, not the abysmal Systems Hungarian). For scripting, any variable prefixed with ARG_ denotes a value that can be set using a command-line argument. The line above checks whether the ARG_HELP variable contains something other than the empty string.
If not empty (-n), inform the user of the available command-line options and subsequently terminate the script with exit code 3, the first non-reserved exit code:
show_usage
exit 3
Finally, change to the script’s directory and execute all the tasks required:
cd "${SCRIPT_DIR}" && execute_tasks
As every script has a different purpose, the template defines an effectively empty placeholder for execute_tasks().
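The placeholder itself is not listed in this part; a minimal stand-in---to be replaced with real build steps per project---might look like:

```shell
#!/usr/bin/env bash
# Hypothetical no-op placeholder; real scripts replace the body with build steps.
execute_tasks() {
  echo "No tasks defined"
}

execute_tasks
```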
Useful Usage
While competing syntaxes exist for standard usage messages, scripts do not tend to be as complex as applications. As such, a simple approach to help is generally sufficient:
show_usage() {
  printf "Usage: %s [OPTION...]\n" "${SCRIPT_NAME}" >&2
  printf " -d, --debug\t\tLog messages while processing\n" >&2
  printf " -h, --help\t\tShow this help message then exit\n" >&2
}
The help message is written to standard error (>&2) because the script uses standard output exclusively for log messages. This is a convention for this particular script.
Command-line Parsing
Unlike typical computer languages, bash does not name function parameters. Instead, parameters are numbered $1 to $9, which is limiting; however, workarounds exist for using an arbitrary number of parameters.
There are many ways to parse command-line options. My preference is to avoid combined short options (e.g., -vfd) while offering short and long options, which reduces the parsing logic to:
parse_commandline() {
  while [ "$#" -gt "0" ]; do
    local consume=1

    case "$1" in
      -d|--debug)
        ARG_DEBUG="true"
        ;;
      -h|-\?|--help)
        ARG_HELP="true"
        ;;
      *)
        # Skip argument
        ;;
    esac

    shift ${consume}
  done
}
The first line of the function loops while positional parameters remain:
while [ "$#" -gt "0" ]; do
The $# variable represents the number of parameters passed into the function. By default, the shift command consumes a single parameter, thereby decreasing the value of $#. For each iteration of the loop, the number of parameters consumed is controlled by the value stored in consume. The loop ends when no parameters remain; since at least one parameter is always consumed per iteration, the loop is guaranteed to terminate (as long as consume is never programmed to be less than 1).
Next, $1 becomes the first positional parameter in the list of ever-shifting command-line arguments passed into the function. Each successive loop iteration changes the value of $1 because the following line removes one or more parameters:
shift ${consume}
When the loop completes, the command-line arguments will have been parsed and---by coding to convention---assigned to global variables having an ARG_ prefix. In this way the code does not introduce arbitrary limits on the number of arguments it accepts.
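To see the loop in action, the parsing function can be exercised in isolation; the trailing echo is only for demonstration:

```shell
#!/usr/bin/env bash
# Re-run the parsing loop from above against a sample argument list.
unset ARG_DEBUG ARG_HELP

parse_commandline() {
  while [ "$#" -gt "0" ]; do
    local consume=1
    case "$1" in
      -d|--debug) ARG_DEBUG="true" ;;
      -h|-\?|--help) ARG_HELP="true" ;;
      *) ;;  # Skip argument
    esac
    shift ${consume}
  done
}

parse_commandline --debug ignored --help
echo "debug=${ARG_DEBUG} help=${ARG_HELP}"
```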
To parse an option that takes an additional argument, introduce a new condition that consumes two parameters instead of one. For example, to parse a filename option (e.g., -f file.txt), write:
-f|--filename)
  ARG_FILENAME="$2"
  consume=2
  ;;
The value for the second argument is stored in $2, which is assigned to the global ARG_FILENAME variable.
Missing Requirements
Making user-friendly shell scripts means informing users of what’s required to run the software. Ideally, scripts would ask users for permission to install the required software packages. This approach has two problems. First, package managers differ from system to system (apt, brew, choco, cydia, dpkg, install, macports, pacman, portage, rpm, smit, tazpkg, yum, and zypper, to run the alphabet); there is no POSIX-compliant install command that “just works” for the most common use cases---installing and uninstalling software---across platforms. Second, the same software package may differ in name and content between distributions.
That leaves checking the requirements and informing users what they must install themselves:
required() {
  local missing=0

  if ! command -v "$1" > /dev/null 2>&1; then
    warning "Missing requirement: install $1 ($2)"
    missing=1
  fi

  REQUIRED_MISSING=$(( REQUIRED_MISSING + missing ))
}
The command command is a POSIX way to discover whether a particular program can be run from the command line. Using command instead of which is strongly recommended for bash scripts.
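A quick sanity check shows the pattern; sh is used as the probe here because it exists on any POSIX system:

```shell
#!/usr/bin/env bash
# command -v returns success when the program is found, failure otherwise.
if command -v sh > /dev/null 2>&1; then
  echo "found sh"
fi

if ! command -v no-such-program-here > /dev/null 2>&1; then
  echo "missing no-such-program-here"
fi
```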
The following line displays a program name to install along with a URL:
warning "Missing requirement: install $1 ($2)"
Rather than force users to run the script several times to discover all the missing requirements, the following line tallies the number of missing commands:
REQUIRED_MISSING=$(( REQUIRED_MISSING + missing ))
If REQUIRED_MISSING is found to be greater than zero, the script will terminate due to the following lines:
if [ "${REQUIRED_MISSING}" -gt "0" ]; then
  exit 4
fi
Reusing the required() function resembles the following:
validate_requirements() {
  required context "https://wiki.contextgarden.net"
  required pandoc "https://www.pandoc.org"
  required gs "https://www.ghostscript.com"
}
Note how neither show_usage() nor validate_requirements() terminates the script. To do so would violate the single responsibility principle. That is, the only reason show_usage() should change is to update the help message; if show_usage() contained an exit statement, then the function would have more than one reason to change: usage updates, exit code values, and program control flow.

Broadening the function’s scope by including and in the function name, such as show_usage_and_exit(), does not render the single responsibility principle inapplicable any more than calling a duck a disco ball makes our feathered friend reflective. On the contrary, the word and in a function name suggests that the principle has been violated.
Colourful Logging
Informational messages for this script have three flavours: log, warning, and error. A log message displays to the user what the script is about to do (or has done). A warning message indicates a problem that doesn’t necessarily mean the script will fail or has failed. An error is a fatal condition that requires fixing. Whether or not log messages are displayed is controlled by the ARG_DEBUG variable; warning and error messages are always displayed.
ANSI escape sequences can help call users’ attention to problems or key details. A reusable function to display a line of text in a particular colour could look as follows:
coloured_text() {
  printf "%b%s%b\n" "$2" "$1" "${COLOUR_OFF}"
}
The coloured_text() function accepts the following parameters:

- $1 -- text message to display; and
- $2 -- text message colour.
Where the code gets tricky is understanding how the function’s parameters are used by the printf command. The printf command is given the following arguments:

- "%b%s%b\n" -- specifies how to format all subsequent arguments;
- "$2" -- text message colour;
- "$1" -- text message to display; and
- "${COLOUR_OFF}" -- escape sequence to stop colouring text.
Consider the printf argument "%b%s%b\n", known as the format specifier:

- %b (the first) is replaced by the ANSI escape sequence for the text message’s colour;
- %s is replaced by the text message to write;
- %b (the second) is replaced by the ANSI escape sequence for turning off coloured text; and
- \n writes a newline character.
Effectively, the first and second %b are replaced by $2 and COLOUR_OFF, respectively. Since %s is bookended by %b specifiers, the result is that the script writes the following:

- Colour On ($2)
- Text Message ($1)
- Colour Off (${COLOUR_OFF})
Using the function is far simpler than explaining how it works:
warning() {
  coloured_text "$1" "${COLOUR_WARNING}"
}
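The error() function is not listed in this part; following the same pattern, it would presumably look like the sketch below (COLOUR_ERROR and its underlying constant are defined later, under Constants):

```shell
#!/usr/bin/env bash
# Sketch: error() mirrors warning(), differing only in the colour used.
readonly COLOUR_DKRED='\033[31m'
readonly COLOUR_OFF='\033[0m'
readonly COLOUR_ERROR=${COLOUR_DKRED}

coloured_text() {
  printf "%b%s%b\n" "$2" "$1" "${COLOUR_OFF}"
}

error() {
  coloured_text "$1" "${COLOUR_ERROR}"
}

error "Fatal condition"
```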
Any time warning() is called, its text message parameter ($1) is displayed in the warning colour, described later. An example warning call looks like:
warning "Install pandoc (https://pandoc.org)"
The warning() and error() functions differ only by colour, whereas the log() function is slightly more feature-rich:
log() {
  if [ -n "${ARG_DEBUG}" ]; then
    printf "[%s] " "$(date +%H:%M:%S.%4N)"
    coloured_text "$1" "${COLOUR_LOGGING}"
  fi
}
Setting ARG_DEBUG to true, false, or any non-empty string will enable logging, since -n examines string length, not contents. This isn’t a logic issue, because users cannot set the value of ARG_DEBUG directly, though it may be considered a maintenance issue.
Every logging statement is prefixed with the current time in hours (%H), minutes (%M), seconds (%S), and ten-thousandths of a second (%4N). When improving shell script performance, it is useful to see which commands take the most time. For long-running scripts, it may be helpful to include the date.
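The timestamp format can be tried on its own; note that %N (nanoseconds) is a GNU date extension, so this sketch assumes GNU coreutils:

```shell
#!/usr/bin/env bash
# Build the log prefix: hours, minutes, seconds, and four sub-second digits.
stamp="$(date +%H:%M:%S.%4N)"
printf "[%s] message\n" "${stamp}"
```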
Constants
Defining constant ANSI escape sequence colours makes for convenient references, such as:
readonly COLOUR_BLUE='\033[1;34m'
readonly COLOUR_PINK='\033[1;35m'
readonly COLOUR_DKGRAY='\033[30m'
readonly COLOUR_DKRED='\033[31m'
readonly COLOUR_YELLOW='\033[1;33m'
readonly COLOUR_OFF='\033[0m'
Directly using these colour names elsewhere in the script is vulgar because when someone decides to update the colour, what often happens is:

- colour constants are changed, e.g., COLOUR_BLUE is no longer blue; or
- colour constants are swapped with different colours within the script.
Either of these actions makes the script more time-consuming to change, which decreases maintainability. By defining a set of logical colours that are used throughout the script consistently, changing colours is isolated to a single place in the code:
readonly COLOUR_LOGGING=${COLOUR_BLUE}
readonly COLOUR_WARNING=${COLOUR_YELLOW}
readonly COLOUR_ERROR=${COLOUR_DKRED}
Extending this to allow user-controlled colours (or themes) would be trivial. Furthermore, this concept is especially applicable to cascading stylesheets.
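As one hedged sketch of such an extension, a hypothetical THEME environment variable could select the logical colours; the variable name and the mono theme are inventions for illustration, not part of the template:

```shell
#!/usr/bin/env bash
# Hypothetical theme switch: pick logical colours from an environment variable.
readonly COLOUR_BLUE='\033[1;34m'
readonly COLOUR_DKGRAY='\033[30m'

if [ "${THEME:-}" = "mono" ]; then
  readonly COLOUR_LOGGING=${COLOUR_DKGRAY}
else
  readonly COLOUR_LOGGING=${COLOUR_BLUE}
fi

echo "logging colour set"
```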
Initial State
Shell scripts cannot depend on variables being empty: variables could have been prepopulated using values from the environment. To handle such situations, clear the variables used by the script with unset:
unset ARG_HELP
unset ARG_DEBUG
unset REQUIRED_MISSING
Call Main
The last line of the script calls the entry point:
main "$@"
All command-line arguments, denoted by $@, are passed into main(), which subsequently passes them into parse_commandline(). Enclosing $@ in double quotes is important when parsing arguments that take strings, for example:
./script -d --message "Strings are single argument values"
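The difference quoting makes can be demonstrated by counting how many arguments arrive; the helper names below are illustrative:

```shell
#!/usr/bin/env bash
# Compare quoted "$@" against unquoted $@ for arguments containing spaces.
count_args() {
  echo "$#"
}

demo() {
  QUOTED="$(count_args "$@")"  # argument boundaries preserved
  UNQUOTED="$(count_args $@)"  # re-split on whitespace
}

demo -d --message "Strings are single argument values"
echo "quoted=${QUOTED} unquoted=${UNQUOTED}"
```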
Linting
The term “lint”, with respect to software development, refers to unwanted bits of fiber and fluff found in sheep’s wool: an analogy for undesirable bits in code. Linters can warn developers about syntax errors, undeclared variables, deprecated language features, and more. Such software tools are especially useful for interpreted languages like bash. Use shellcheck to report possible issues:
shellcheck -s bash build
Download
Download the starter build script, distributed under the MIT license.
Alternatives
See also bash3boilerplate, which has similar goals and additional features.
Summary
This part introduced a user-friendly reusable build script template. Part 2 walks through how pandoc and ConTeXt can generate a PDF file from a Markdown document.
Contact
About the Author
My career has spanned tele- and radio communications, enterprise-level e-commerce solutions, finance, transportation, modernization projects in both health and education, and much more.
Delighted to discuss opportunities to work with revolutionary companies combatting climate change.