String replacements in command lines

Often a program will want to change a parameter in a command line; an example would be to run a command on a user-supplied file. To make it more concrete, let's say that our program converts DVI files to PostScript and to PDF. It does so by enlisting the help of two other programs, dvips and dvipdf. These commands are to be invoked as follows:

dvips -o output input

dvipdf input output

If we want to write this program using a POSIX shell-like language, it would probably look something like Figure 1.

Figure 1. Implementation using Bash

#!/bin/bash
# $1 is the file we want to convert
dvips -o out.ps "$1"
dvipdf "$1" out.pdf

$1 is a positional parameter: when the program is run with arguments, it takes on the value of the first argument, that is, the first word after the command name itself.

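When this program is run with the command convert "my file.dvi", $1 will be replaced with my file.dvi, and so the command lines executed will be:

dvips -o out.ps "my file.dvi"

dvipdf "my file.dvi" out.pdf

In terms of argument vectors, these translate to:

{ "dvips", "-o", "out.ps", "my file.dvi" }

{ "dvipdf", "my file.dvi", "out.pdf" }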
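
If we want to check this splitting for ourselves, Python's shlex module gives a reasonable approximation of the shell's word splitting and quote removal. This is only a sketch: shlex.split() performs no parameter expansion or command substitution, and the helper name argument_vector is made up for this illustration.

```python
import shlex


def argument_vector(command_line):
    # Approximates POSIX shell word splitting and quote removal only;
    # no parameter expansion or command substitution takes place here.
    return shlex.split(command_line)


for command_line in ('dvips -o out.ps "my file.dvi"',
                     'dvipdf "my file.dvi" out.pdf'):
    print(argument_vector(command_line))
```

The quoted file name comes out as a single list entry in both cases, spaces and all.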

So long as we don't care about checking for errors, this implementation is exactly what we wanted. It will even cope with spaces in the file name, because the double quotes will keep it all together as one ‘word’, hence one entry in the array of strings given to main(). (Of course, in order to be useful both dvips and dvipdf would need to handle spaces in filenames too.) Easy enough, so now let's try that in Python using os.system().

Figure 2. Implementation using Python

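#!/usr/bin/python
import os
import sys
# sys.argv[1] is the file we want to convert; it is pasted straight
# into the command string that os.system() hands to the shell
os.system('dvips -o out.ps "' + sys.argv[1] + '"')
os.system('dvipdf "' + sys.argv[1] + '" out.pdf')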

(Here, sys.argv[1] is the same string as $1 was before.)

Although it might look the same, this implementation in Python is quite different to the one in Figure 1. Imagine that the file name happens to contain a dollar sign (by chance or by malice). If by malice, the miscreant can use parentheses in the (made up) file name to execute commands of their own: command substitution of ‘$(command)’ will cause the command to be executed. With the Bash implementation, this was not possible because the malicious file name was held in a positional parameter and substituted into the command line using parameter expansion, and the shell does not rescan the result of a parameter expansion for further command substitutions. Obviously with this small example there isn't a lot of scope for malice, but there are plenty of situations where there is: web-based interfaces, mail and print filters, and so on.
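
To see the problem in action without needing dvips, here is a small sketch that builds its command string the same way Figure 2 does. It makes a few assumptions: printf stands in for dvips so that the result is visible and harmless, a POSIX sh is on the PATH, Python is 3.7 or later, and the function name run_like_figure_2 is invented for this example.

```python
import subprocess


def run_like_figure_2(filename):
    # Same string concatenation as Figure 2, with printf standing in
    # for dvips so that we can see which argument the shell produced.
    command = 'printf "%s\\n" "' + filename + '"'
    result = subprocess.run(["sh", "-c", command],
                            capture_output=True, text=True)
    return result.stdout


# An innocent name arrives untouched ...
print(run_like_figure_2("my file.dvi"))
# ... but here the shell runs `echo injected` and pastes in its output.
print(run_like_figure_2("$(echo injected).dvi"))
```

The second call reports injected.dvi rather than the name we passed in, which shows that the embedded command really was executed.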

To fix this up, we could try adding our own quoting to the argument: whenever we find a dollar sign, replace it with a backslash followed by a dollar sign (this takes away its special meaning). Are we safe yet? Well, no, because the miscreant can still use backticks (another form of command substitution), or just break out of the quotation by using their own quotation marks in the file name, such as ‘"; cat /etc/passwd; echo "’, and still get their own programs run.
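
The following sketch applies that dollar-sign escaping and then tries the two remaining tricks against it. As before, printf stands in for dvips, echo injected stands in for whatever the miscreant would really run, and the function names are made up for the illustration.

```python
import subprocess


def escape_dollars(filename):
    # The incomplete fix: only dollar signs lose their special meaning.
    return filename.replace("$", "\\$")


def run_with_dollar_escaping(filename):
    # printf stands in for dvips, as a harmless way to see the argument.
    command = 'printf "%s\\n" "' + escape_dollars(filename) + '"'
    result = subprocess.run(["sh", "-c", command],
                            capture_output=True, text=True)
    return result.stdout


# $(...) is now inert, but backticks and stray double quotes are not.
for name in ("$(echo injected).dvi",
             "`echo injected`.dvi",
             '"; echo injected; echo "'):
    print(repr(run_with_dollar_escaping(name)))
```

Only the first file name comes back unchanged; in the other two cases the echo command runs.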

So we need to, for want of a real word, enquote the file name: escape any double quotes, backslashes, backticks, and dollar signs we find. Or, more simply, we could use single-quote characters instead of double-quote characters, and escape any stray single-quote characters in the file name by closing the quotation, adding a backslash-escaped single quote, and opening the quotation again (command substitutions and parameter expansions are not performed inside single-quoted words).
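
A minimal sketch of the single-quote approach might look like this. The helper single_quote is hypothetical and written out by hand to show the idea; Python's standard library offers shlex.quote() for the same job, and the sketch runs both. printf again stands in for dvips.

```python
import shlex
import subprocess


def single_quote(filename):
    # Inside single quotes only the single quote itself is special, so
    # close the quotation, add an escaped quote, and reopen it: '\''
    return "'" + filename.replace("'", "'\\''") + "'"


def run_quoted(filename, quote=single_quote):
    # printf stands in for dvips; the quoted name is one shell word.
    command = 'printf "%s\\n" ' + quote(filename)
    result = subprocess.run(["sh", "-c", command],
                            capture_output=True, text=True)
    return result.stdout


hostile = "it's $(echo injected) `echo injected` \"; echo injected; echo \".dvi"
print(run_quoted(hostile) == hostile + "\n")
print(run_quoted(hostile, quote=shlex.quote) == hostile + "\n")
```

Both comparisons should come out true: the file name reaches printf exactly as written, and nothing inside it is run.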

An easier way of doing it is presented in Figure 3.

Figure 3. Alternative implementation in Python

#!/usr/bin/python
import os
import sys
# Put the file name in the environment; the shell expands "$file" itself
os.environ['file'] = sys.argv[1]
os.system('dvips -o out.ps "$file"')
os.system('dvipdf "$file" out.pdf')

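With this implementation, the shell (which is what os.system invokes in this instance) does the tricky stuff for us: the file name reaches the command line through parameter expansion, just as $1 did in Figure 1, so nothing inside it is interpreted as shell syntax.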
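
We can try the same hostile file names against the environment-variable approach. The sketch below is a variation rather than a copy of the figure: it passes a modified copy of the environment to the child instead of changing os.environ, it uses subprocess.run() rather than os.system() so that the output can be captured, and printf once more stands in for dvips. The function name is invented for the example.

```python
import os
import subprocess


def run_with_environment(filename):
    # Copy our environment and add the file name to the copy, leaving
    # os.environ itself untouched.
    env = dict(os.environ, file=filename)
    # The command line is a constant string; only the shell's own
    # parameter expansion ever touches the file name.
    result = subprocess.run(["sh", "-c", 'printf "%s\\n" "$file"'],
                            env=env, capture_output=True, text=True)
    return result.stdout


hostile = "$(echo injected) `echo injected` \"; echo injected; echo \".dvi"
print(run_with_environment(hostile) == hostile + "\n")
```

Because the command string never changes, there is nothing for us to quote or escape.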

Where we went wrong before was in trying to modify the command line that we had. When the shell interprets the command line, it needs to apply quoting rules to it, and parameter expansion rules, and command substitution rules, and several others, before it is finally split into the words that make up the argument vector, the argv parameter that gets given to main(). Trying to insert arbitrary strings into a command line without knowing these rules can end up causing problems, as we've seen.
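
Since the argument vector is what the program finally receives, another option, not one the text above proposes, is to build that vector ourselves and skip the shell altogether. The sketch below assumes Python's subprocess module with a list of arguments; a tiny Python child process stands in for dvips and simply reports the arguments it was given, and the function name is hypothetical.

```python
import json
import subprocess
import sys

# Stand-in for dvips: a child that prints the arguments it received.
REPORT_ARGV = "import json, sys; print(json.dumps(sys.argv[1:]))"


def arguments_seen_by_child(filename):
    # Each list entry becomes exactly one argv entry in the child; no
    # shell is involved, so there are no quoting rules to get wrong.
    argv = [sys.executable, "-c", REPORT_ARGV, "-o", "out.ps", filename]
    result = subprocess.run(argv, capture_output=True, text=True)
    return json.loads(result.stdout)


print(arguments_seen_by_child("my file.dvi"))
print(arguments_seen_by_child("$(echo injected).dvi"))
```

In both cases the file name should show up as a single, unmodified entry after -o and out.ps.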