Sed Rules Reference

This reference was assembled while closely looking at the information provided in Sed - An Introduction and Tutorial. In March 2008, the author, Bruce Barnett, gave permission to use that information for the reference.

The essential command: s for substitution

The substitute command is the s. It changes all occurrences of a regular expression into another value. A simple example changes "day" to "night":

s/day/night/

This substitute command consists of four parts.

s         substitute command
/../../   delimiters
day       regular expression pattern string
night     replacement string

The regular expressions are covered in detail in another reference.

The slash as a delimiter

The character after the s is the delimiter. Conventionally it is a slash, but it can be anything that you want.

For example, it can be an underscore instead of the slash:

s_day_night_

Or colons:

s:day:night:

Or you can use the "|" character.

s|day|night|

Select the one that you want. As long as it is not in the string that you are looking for, anything works. Also remember that three delimiters are needed. If you get a "Unterminated 's' command" it is because you are missing one of them.

Using & as the matched string

At times one may want to search for a pattern and add some characters, like parenthesis, around or near the pattern that was found. It is easy to do this if looking for a particular string:

s/abc/(abc)/

This won't work if you don't know exactly what you will find. How can you put the string you found in the replacement string if you don't know what it is?

The solution requires the special character "&." It corresponds to the pattern found.

s/[a-z]*/(&)/

There can be any number of "&" in the replacement string. You could also double a pattern, e.g. the first number of a line:

s/[0-9]*/& &/

If the input for this rule is "123 abc", then the output will be "123 123 abc".

Let's slightly amend this example. Sed will match the first string, and make it as greedy as possible. The first match for '[0-9]*' is the first character on the line, as this matches zero of more numbers. So if the input was "abc 123" the output would be unchanged, except for a space before the letters. A better way to duplicate the number is to make sure it matches a number:

s/[0-9][0-9]*/& &/

The input of "123 abc" becomes "123 123 abc".

Using \1 to keep part of the pattern

The use of "(" ")" and "1" has been described in the reference on regular expressions. To review, the escaped parenthesis (that is, parenthesis with backslashes before them) remember portions of the regular expression. The "\1" is the first remembered pattern, and the "\2" is the second remembered pattern. If you wanted to keep the first word of a line, and delete the rest of the line, mark the important part with the parenthesis:

s/\([a-z]*\)*/\1/

If you want to switch two words around, you can remember two patterns and change the order around:

s/\([a-z]*\) \([a-z]*\)/\2 \1/

The "\1" doesn't have to be in the replacement string. It can be in the pattern you are searching for. If you want to eliminate duplicated words, you can try:

s/\([a-z]*\)\1/\1/

You can have up to nine values: "\1" thru "\9."

Substitute Flags

You can add additional flags after the last delimiter. These flags can specify what happens when there is more than one occurrence of a pattern on a single line, and what to do if a substitution is found.

/g - Global replacement

Most Unix utilties work on files, reading a line at a time. Sed, by default, is the same way. If you tell it to change a word, it will only change the first occurrence of the word on a line. You may want to make the change on every word on the line instead of the first. For an example, let's place parentheses around words on a line. Instead of using a pattern like "[A-Za-z]*" which won't match words like "won't," we will use a pattern, "[^ ]*," that matches everything except a space. Well, this will also match anything because "*" means zero or more. To work around a possible bug in sed, you must avoid matching the null string when using the "g" flag to sed. A work-around example is: "[^ ][^ ]*." The following will put parenthesis around the first word:

s/[^ ]*/(&)/

If you want it to make changes for every word, add a "g" after the last delimiter and use the work-around:

s/[^ ][^ ]*/(&)/g

Is sed recursive?

Sed only operates on patterns found in the incoming data. That is, the input line is read, and when a pattern is matched, the modified output is generated, and the rest of the input line is scanned. The "s" command will not scan the newly created output. That is, you don't have to worry about expressions like:

s/loop/loop the loop/g

This will not cause an infinite loop. If a second "s" command is executed, it could modify the results of a previous command. I will show you how to execute multiple commands later.

/1, /2, etc. Specifying which occurrence

With no flags, the first pattern is changed. With the "g" option, all patterns are changed. If you want to modify a particular pattern that is not the first one on the line, you could use "\(" and "\)" to mark each pattern, and use "\1" to put the first pattern back unchanged. This next example keeps the first word on the line but deletes the second:

s/\([a-zA-Z]*\) \([a-zA-Z]*\) /\1 /

This rule is ugly! There is an easier way to do this. You can add a number after the substitution command to indicate you only want to match that particular pattern. Example:

s/[a-zA-Z]* //2

You can combine a number with the g (global) flag. For instance, if you want to leave the first world alone alone, but change the second, third, etc. to DELETED, use /2g:

s/[a-zA-Z]* /DELETED /2g

Don't get /2 and \2 confused. The /2 is used at the end. \2 is used in inside the replacement field.

Note the space after the "*" character. Without the space, sed will run a long, long time. (Note: this bug is probably fixed by now.) This is because the number flag and the "g" flag have the same bug. You should also be able to use the pattern

s/[^ ]*//2

but this also eats CPU.

The number flag is not restricted to a single digit. It can be any number from 1 to 512. If you wanted to add a colon after the 80th character in each line, you could type:

s/./&:/80

Combining substitution flags

You can combine flags when it makes sense. Some things that don't work are using a number and a "g" flag, because this is inconsistent.

Scripts

If you have a large number of sed rules, you can put then into a script.

A sedscript could look like this:

# sed comment - This script changes lower case vowels to upper case
s/a/A/g
s/e/E/g
s/i/I/g
s/o/O/g
s/u/U/g

When there are several commands in one file, each command must be on separate line.

Comments

Sed comments are lines where the first non-white character is a "#." On many systems, sed can have only one comment, and it must be the first line of the script. On modern systems, you can have several comment lines anywhere in the script.

Multiple rules and order of execution

As we explore more of the commands of sed, the commands will become complex, and the actual sequence can be confusing. It's really quite simple. Each line is read in. The various commands, in order specified by the user, has a chance to operate on the input line. After the substitutions are made, the next command has a chance to operate on the same line, which may have been modified by earlier commands. If you ever have a question, the best way to learn what will happen is to just try it out. If a complex command doesn't work, make it simpler. If you are having problems getting a complex script working, break it up into two smaller scripts and do them one by one.

Addresses and Ranges of Text

You have only learned one command, and you can see how powerful sed is. However, all it is doing is a search and substitute. That is, the substitute command is treating each line by itself, without caring about nearby lines. What would be useful is the ability to restrict the operation to certain lines. Some useful restrictions might be:

- Specifying a line by its number.

- Specifying a range of lines by number.

- All lines containing a pattern.

- All lines from the beginning of a file to a regular expression.

- All lines from a regular expression to the end of the file.

- All lines between two regular expressions.

Sed can do all that and more. Every command in sed can be assigned an address, range or restriction like the above examples. The restriction or address immediately precedes the command:

restriction command

Restricting to a line number

The simplest restriction is a line number. If you wanted to delete the first number on line 3, just add a "3" before the command:

3 s/[0-9][0-9]*//

Patterns

Many Unix utilities use a slash to search for a regular expression. Sed uses the same convention, provided you terminate the expression with a slash. To delete the first number on all lines that start with a "#," use:

/^#/ s/[0-9][0-9]*//

I placed a space after the "/expression/" so it is easier to read. It isn't necessary, but without it the command is harder to fathom. Sed does provide a few extra options when specifying regular expressions. But I'll discuss those later. If the expression starts with a backslash, the next character is the delimiter. To use a comma instead of a slash, use:

\,^#, s/[0-9][0-9]*//

The main advantage of this feature is searching for slashes. Suppose you wanted to search for the string "/usr/local/bin" and you wanted to change it for "/common/all/bin." You could use the backslash to escape the slash:

/\/usr\/local\/bin/ s/\/usr\/local/\/common\/all/

It would be easier to follow if you used an underline instead of a slash as a search. This example uses the underline in both the search command and the substitute command:

\_/usr/local/bin_ s_/usr/local_/common/all_

This illustrates why sed scripts get the reputation for obscurity. I could be perverse and show you the example that will search for all lines that start with a "g," and change each "g" on that line to an "s:"

/^g/s/g/s/g

Adding a space and using an underscore after the substitute command makes this much easier to read:

/^g/ s_g_s_g

Er, I take that back. It's hopeless. There is a lesson here: Use comments liberally in a sed script. Comments are a Good Thing. You may have understood the script perfectly when you wrote it. But six months from now it could look like a confusing thing.

Ranges by line number

You can specify a range on line numbers by inserting a comma between the numbers. To restrict a substitution to the first 100 lines, you can use:

1,100 s/A/a/

If you know exactly how many lines are in a file, you can explicitly state that number to perform the substitution on the rest of the file. In this case, assume that there are 532 lines in the file:

101,532 s/A/a/

An easier way is to use the special character "$," which means the last line in the file.

101,$ s/A/a/

The "$" is one of those conventions that mean "last" in several utilities.

Ranges by patterns

You can specify two regular expressions as the range. Assuming a "#" starts a comment, you can search for a keyword, remove all comments until you see the second keyword. In this case the two keywords are "start" and "stop:"

/start/,/stop/ s/#.*//

The first pattern turns on a flag that tells sed to perform the substitute command on every line. The second pattern turns off the flag. If the "start" and "stop" pattern occurs twice, the substitution is done both times. If the "stop" pattern is missing, the flag is never turned off, and the substitution will be performed on every line until the end of the file.

You should know that if the "start" pattern is found, the substitution occurs on the same line that contains "start." This turns on a switch, which is line oriented. That is, the next line is read. If it contains "stop" the switch is turned off. Otherwise, the substitution command is tried. Switches are line oriented, and not word oriented.

You can combine line numbers and regular expressions. This example will remove comments from the beginning of the file until it finds the keyword "start:"

1,/start/ s/#.*//

This example will remove comments everywhere except the lines between the two keywords:

1,/start/ s/#.*//' -e '/stop/,$ s/#.*//

The last example has a range that overlaps the "/start/,/stop/" range, as both ranges operate on the lines that contain the keywords. I will show you later how to restrict a command up to, but not including the line containing the specified pattern.

Before I start discussing the various commands, should show know that some commands cannot operate on a range of lines. I will let you know when I mention the commands. I will describe three commands in this secton. One of the three cannot operate on a range.

Delete with d

Using ranges can be confusing, so you should expect to do some experimentation when you are trying out a new script. A useful command deletes every line that matches the restriction: "d." If you want to look at the first 10 lines of a file, you can use:

11,$ d

which is similar in function to the head command. If you want to chop off the header of a mail message, which is everything up to the first blank line, use:

1,/^$/ d

The range for deletions can be regular expressions pairs to mark the begin and end of the operation. Or it can be a single regular expression. Deleting all lines that start with a "#" is easy:

/^#/ d

Removing comments and blank lines takes two commands. The first removes every character from the "#" to the end of the line, and the second deletes all blank lines:

s/#.*//
/^$/ d

A third one should be added to remove all blanks and tabs immediately before the end of line:

s/#.*//
s/[ ^I]*$//
/^$/ d

The character "^I" is a CRTL-I or tab character. You would have to explicitly type in the tab. Note the order of operations above, which is in that order for a very good reason. Comments might start in the middle of a line, with white space characters before them. Therefore comments are first removed from a line, potentially leaving white space characters that were before the comment. The second command removes all trailing blanks, so that lines that are now blank are converted to empty lines. The last command deletes empty lines. Together, the three commands removes all lines containing only comments, tabs or spaces.

This demonstrates the pattern space sed uses to operate on a line. The actual operation sed uses is:

- Copy the input line into the pattern space.

- Apply the first sed command on the pattern space, if the address restriction is true.

- Repeat with the next sed expression, again operating on the pattern space.

- When the last operation is performed, write out the pattern space and read in the next line from the input file.

Reversing the restriction with !

Sometimes you need to perform an action on every line except those that match a regular expression, or those outside of a range of addresses. The "!" character, which often means not in Unix utilities, inverts the address restriction.

The q or quit command

There is one more simple command that can restrict the changes to a set of lines. It is the "q" command: quit. the third way to duplicate the head command is:

11 q

which quits when the eleventh line is reached. This command is most useful when you wish to abort the editing after some condition is reached.

The "q" command is the one command that does not take a range of addresses. Obviously the command

1,10 q

cannot quit 10 times. Instead

1 q

or

10 q

is correct.

Grouping with { and }

The curly braces, "{" and "}," are used to group the commands.

Hardly worth the build up. All that prose and the solution is just matching squigqles. Well, there is one complication. Since each sed command must start on its own line, the curly braces and the nested sed commands must be on separate lines.

Previously, I showed you how to remove comments starting with a "#." If you wanted to restrict the removal to lines between special "begin" and "end" key words, you could use:

/begin/,/end/ {
s/#.*//
s/[ ^I]*$//
/^$/ d
p
}

These braces can be nested, which allow you to combine address ranges. You could perform the same action as before, but limit the change to the first 100 lines:

1,100 {
/begin/,/end/ {
s/#.*//
s/[ ^I]*$//
/^$/ d
p
}
}

You can place a "!" before a set of curly braces. This inverts the address, which removes comments from all lines except those between the two reserved words:

/begin/,/end/ !{
s/#.*//
s/[ ^I]*$//
/^$/ d
p
}

The # Comment Command

As we dig deeper into sed, comments will make the commands easier to follow. Older versions of sed only allow one line as a comment, and it must be the first line. Recent versions allow more than one comment, and these comments don't have to be first. An example could be:

Adding, Changing, Inserting new lines

Sed has three commands used to add new lines to the output stream. Because an entire line is added, the new line is on a line by itself to emphasize this. There is no option, an entire line is used, and it must be on its own line. If you are familiar with many unix utilities, you would expect sed to use a similar convention: lines are continued by ending the previous line with a "\". The syntax to these commands are finicky, like the "r" and "w" commands.

Append a line with 'a'

The "a" command appends a line after the range or pattern. This example will add a line after every line with "WORD:

/WORD/ a\
Add this line after every line with WORD

Insert a line with 'i'

You can insert a new line before the pattern with the "i" command:

/WORD/ i\
Add this line before every line with WORD

Change a line with 'c'

You can change the current line with a new line.

/WORD/ c\
Replace the current line with the line

A "d" command followed by a "a" command won't work, as I discussed earlier. The "d" command would terminate the current actions. You can combine all three actions using curly braces:

/WORD/ {
i\
Add this line before
a\
Add this line after
c\
Change the line to this one
}

Leading tabs and spaces in a sed script

Sed ignores leading tabs and spaces in all commands. However these white space characters may or may not be ignored if they start the text following a "a," "c" or "i" command. If you want to keep leading spaces, and not care about which version of sed you are using, put a "\" as the first character of the line:

a\
\        This line starts with a tab

Adding more than one line

All three commands will allow you to add more than one line. Just end each line with a "\:"

/WORD/ a\
Add this line\
This line\
And this line

Adding lines and the pattern space

I have mentioned the pattern space before. Most commands operate on the pattern space, and subsequent commands may act on the results of the last modification. The three previous commands, like the read file command, add the new lines to the output stream, bypassing the pattern space.

Address ranges and the above commands

Some commands can take a range of lines, and others cannot. To be precise, the commands "a," "i," "r," and "q" will not take a range like "1,100" or "/begin/,/end/." The "c" or change command allows this, and it will let you change several lines into one:

/begin/,/end/ c\
***DELETED***

If you need to do this, you can use the curly braces, as that will let you perform the operation on every line:

1,$ {
a\
}

Multi-Line Patterns

Most UNIX utilities are line oriented. Regular expressions are line oriented. Searching for patterns that covers more than one line is not an easy task. (Hint: It will be very shortly.)

Sed reads in a line of text, performs commands which may modify the line, and outputs modification if desired. The main loop of a sed script looks like this:

1. The next line is read from the input file and placed in the pattern space. If the end of file is found, and if there are additional files to read, the current file is closed, the next file is opened, and the first line of the new file is placed into the pattern space.

2. The line count is incremented by one.

3. Each sed command is examined. If there is a restriction placed on the command, and the current line in the pattern space meets that restriction, the command is executed. Some commands, like "n" or "d" cause sed to go to the top of the loop. The "q" command causes sed to stop. Otherwise the next command is examined.

3. After all of the commands are examined, the pattern space is output.

The restriction before the command determines if the command is executed. If the restriction is a pattern, and the operation is the delete command, then the following will delete all lines that have the pattern:

/PATTERN/ d

If the restriction is a pair of numbers, then the deletion will happen if the line number is equal to the first number or greater than the first number and less than or equal to the last number:

10,20 d

If the restriction is a pair of patterns, there is a variable that is kept for each of these pairs. If the variable is false and the first pattern is found, the variable is made true. If the variable is true, the command is executed. If the variable is true, and the last pattern is on the line, after the command is executed the variable is turned off:

/begin/,/end/ d

Whew! That was a mouthful. Hey, I said it was a review. If you have read the previous lessons, you should have breezed through this. You may want to refer back to this review, because I covered several subtle points.

Transform with y

If you wanted to change a word from lower case to upper case, you could write 26 character substitutions, converting "a" to "A," etc. Sed has a command that operates like the tr program. It is called the "y" command. For instance, to change the letters "a" through "f" into their upper case form, use:

y/abcdef/ABCDEF/

I could have used an example that converted all 26 letters into upper case, and while this column covers a broad range of topics, the "column" prefers a narrower format.

If you wanted to convert a line that contained a hexadecimal number (e.g. 0x1aff) to upper case (0x1AFF), you could use:

/0x[0-9a-zA-Z]*/ y/abcdef/ABCDEF

This works fine if there are only numbers in the file. If you wanted to change the second word in a line to upper case, you are out of luck - unless you use multi-line editing.

Displaying control characters with a l

The "l" command prints the current pattern space. It is therefore useful in debugging sed scripts. It also converts unprintable characters into printing characters by outputting the value in octal preceded by a "\" character.

The Hold Buffer

So far we have talked about three concepts of sed: (1) The input stream or data before it is modified, (2) the output stream or data after it has been modified, and (3) the pattern space, or buffer containing characters that can be modified and send to the output stream.

There is one more "location" to be covered: the hold buffer or hold space. Think of it as a spare pattern buffer. It can be used to "copy" or "remember" the data in the pattern space for later. There are five commands that use the hold buffer.

Exchange with x

The "x" command eXchanges the pattern space with the hold buffer. By itself, the command isn't useful. Executing the sed command

x

as a filter adds a blank line in the front, and deletes the last line. It looks like it didn't change the input stream significantly, but the sed command is modifying every line.

The hold buffer starts out containing a blank line. When the "x" command modifies the first line, line 1 is saved in the hold buffer, and the blank line takes the place of the first line. The second "x" command exchanges the second line with the hold buffer, which contains the first line. Each subsequent line is exchanged with the preceding line. The last line is placed in the hold buffer, and is not exchanged a second time, so it remains in the hold buffer when the program terminates, and never gets printed. This illustrates that care must be taken when storing data in the hold buffer, because it won't be output unless you explicitly request it.

Hold with h or H

The "x" command exchanges the hold buffer and the pattern buffer. Both are changed. The "h" command copies the pattern buffer into the hold buffer. The pattern buffer is unchanged.

Flow Control

As you learn about sed you realize that it has it's own programming language. It is true that it's a very specialized and simple language. What language would be complete without a method of changing the flow control? There are three commands sed uses for this. You can specify a label with an text string followed by a colon. The "b" command branches to the label. The label follows the command. If no label is there, branch to the end of the script. The "t" command is used to test conditions.

Testing with t

You can execute a branch if a pattern is found. You may want to execute a branch only if a substitution is made. The command "t label" will branch to the label if the last substitute command modified the pattern space.

One use for this is recursive patterns. Suppose you wanted to remove white space inside parenthesis. These parentheses might be nested. That is, you would want to delete a string that looked like "( ( ( ())) )." The sed expressions

s/([ ^I]*)/g

would only remove the innermost set. You would have to pipe the data through the script four times to remove each set or parenthesis. You could use the regular expression

s/([ ^I()]*)/g

but that would delete non-matching sets of parenthesis. The "t" command would solve this:

:again
s/([ ^I]*)//g
t again

Passing regular expressions as arguments

In the earlier scripts, I mentioned that you would have problems if you passed an argument to the script that had a slash in it. In fact, and regular expression might cause you problems. A script like the following is asking to be broken some day:

s/'"$1"'//g

If the argument contains any of these characters in it you may get a broken script: "/\.*[]^$." One solution is to have the user put a backslash before any of these characters when they pass it as an argument. However, the user has to know which characters are special.