Introduction to AWK
AWK is a powerful but simple scripting language designed for text extraction and processing. Due to its versatility and simple usage, it is a widely-used tool, and there are entire books written on it.
Why use AWK?
Although it's possible to translate any AWK script into C for faster processing, AWK is often easier to write and debug. Thus, even though a C program may execute faster, it's often preferable to use AWK for its simplicity and ease of use.
Origins of the name
The name AWK comes from the initials of its authors, Alfred Aho, Peter Weinberger and Brian Kernighan, who developed it at Bell Labs in the 1970s. AWK is also a homophone of auk, the bird that serves as its mascot.

AWK provides command line users with a variety of functions. You may use a single AWK script to process several files within a pipeline or apply the commands to several files at once. Here are just a few of the features that AWK provides:
- Text formatting
- Formatted text outputs
- Mathematical and string operations
- Field extraction and rearrangement
In short, AWK can be of immense help with anything that has to do with text processing or data-table manipulation.
Variations of AWK
There have been a variety of AWK implementations as users started expanding on the language.
- awk: Original awk.
- nawk: New and improved awk. Used by OS X.
- gawk: GNU awk. Ships with most Linux distributions.
- mawk: Very fast AWK implementation based on a bytecode interpreter.
Installing gawk
For this tutorial, we'll be sticking to gawk, which is the GNU version of AWK (GNU is a suite of free, open-source utilities). To install gawk on a Debian-based Linux platform, use the apt-get package manager.
$ sudo apt-get update
$ sudo apt-get install gawk
For RPM-based Linux distributions, use yum.
$ sudo yum install gawk
On Mac OS X, use Homebrew, a package manager for OS X.
$ brew install gawk
Commenting in AWK
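One last thing to mention before moving on: to comment in AWK, use the hash symbol (#). Everything from the # to the end of the line is ignored. Note that two forward slashes (//) do not start a comment in AWK; they form an empty regular expression.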
Notes before getting started
AWK can be a difficult language to learn at first, as many of its concepts, syntax and notations are intertwined with each other. Thus, learning one piece of AWK often requires knowing some other parts. Because of this, some of the lessons in this tutorial may introduce concepts that will be covered in more detail in a future lesson. So if you see something new that is not explained in detail, try your best to understand it and sit tight until we cover it more formally later. Now let's get started!
Awk's Workflow BEGIN, BODY, and END blocks
Let's begin by looking at the step-by-step methodology of how AWK works. AWK starts by executing the BEGIN block. It then enters the BODY block, reading in a record, executing the commands, and repeating until the input is exhausted. Finally, the END block is executed.
- Execute awk commands from BEGIN block.
- Read in a line from the input stream (either from a file or directly from stdin) and store it in memory.
- Execute the awk commands on a line.
- Repeat if not end of file.
- Execute awk commands from END block.
An AWK program consists of three kinds of blocks: BEGIN, BODY, and END. The BEGIN and END blocks provide the startup and cleanup actions of our program. The BODY block consists of pattern & action pairings.
The BEGIN block executes just once, before any input is read, acting to initialize the program. Here, we can set variables such as FS, RS and ORS, overriding their default values. Additionally, we may print a header for a data table if one does not already exist.
BEGIN {
    # initialize variables and other commands
}
BODY Block
The BODY block runs on every input line that matches an optional pattern. Note that you don't need any keyword before the opening curly brace of the BODY block.
/pattern/ { actions }
END block
The END block is the last block of code to be executed, once the input is exhausted. It is often used to produce summary reports. Precede the block with the END keyword.
END {
    # cleanup
}
The BEGIN and END patterns can occur in any order within the awk program, but convention holds that BEGIN should come first and END should be last. If there are multiple BEGIN or END blocks, they are executed in the order in which they appear in the AWK file.
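To see this ordering in action, here is a minimal sketch (not part of the original lesson) with two BEGIN blocks and two END blocks. It uses plain awk and a made-up two-line input; gawk should behave the same way.

```shell
# Feed two records to a program with duplicate BEGIN and END blocks.
# The input lines "x" and "y" are made up for illustration.
out=$(printf 'x\ny\n' | awk '
BEGIN { print "begin 1" }
BEGIN { print "begin 2" }
      { print "record: " $0 }
END   { print "end 1" }
END   { print "end 2" }
')
echo "$out"
```

Both BEGIN blocks should print before any record is read, and both END blocks after the last record, each pair in the order written.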
Example
Let's now look at an example AWK script. The syntax and variables have not been covered yet, but we want to give you the gist of what a basic AWK script looks like.
Assume we want to perform two tasks on the grades.txt datafile below: 1) create a header, and 2) find out how many students received a B in the class.
# grades.txt
Gil Conrad 98 93 94 A
Vern Wynne 85 78 93 B
Ingram Dannie 84 85 94 B+
Wright Morty 75 76 79 C+
Johnnie Adair 78 94 87 B
Now we can write our awk script test.awk.
# test.awk
BEGIN {
    # Print the header out before starting anything
    printf "FName\tLName\tExam1\tExam2\tFinal\tGrade\n";
    # Initialize any variables
    n = 0;
}
{
    # Print each line (called "record")
    print $0
    # If the sixth column (called "field") is a B, then increment n
    if ($6 == "B") { ++n }
}
END {
    # Wrap things up and print out summary variables
    print "Number of students with a B in the class = " n;
}
To apply our awk script via the command line, use the -f option.
$ gawk -f test.awk grades.txt
FName LName Exam1 Exam2 Final Grade
Gil Conrad 98 93 94 A
Vern Wynne 85 78 93 B
Ingram Dannie 84 85 94 B+
Wright Morty 75 76 79 C+
Johnnie Adair 78 94 87 B
Number of students with a B in the class = 2
Now let's move on to Records and Fields, one of the main backbones of AWK.
Records and Fields RS, RT, ORS, FS, OFS, $n
The backbone of AWK's programming model consists of two pieces: 1) records & fields, along with 2) patterns & actions. Let's look at the first core component here, then move onto patterns & actions in the next lesson.
What are records and fields?
AWK views each input stream as a collection of records. Records can be thought of as individual lines, which are then divided into fields (each data cell). Take a look at the figure below, which displays the grades.txt file.
Record separators (RS & RT)
To specify the character that separates records, we use the built-in RS variable. In the original AWK implementation, the RS variable had to be a single literal character such as the newline, or an empty string. In other implementations such as gawk, RS may be a regular expression.
When RS is a regular expression, RS holds the literal regex, while RT holds the text that actually matched it.
$ echo firstRecord 111111 secondRecord 222222 thirdRecord 333333 lastRecord |
> gawk 'BEGIN { RS = "([[:digit:]]+)" }
> { print "RS = " RS " and RT = " RT }'
RS = ([[:digit:]]+) and RT = 111111
RS = ([[:digit:]]+) and RT = 222222
RS = ([[:digit:]]+) and RT = 333333
RS = ([[:digit:]]+) and RT = 
This code snippet sets the RS variable to a regex that matches one or more digits. Notice how the RS variable displays the literal regex, while RT displays the text that matched it.
Output Record Separator (ORS)
The Output Record Separator (ORS) specifies what should come after a record is printed. The default is a newline character.
In this example, we read and print out the current record in our buffer (denoted by $0), followed by a space and a plus (+) symbol.
$ echo 'hello; nihao; hola; anyonghasaeyo' |
> gawk 'BEGIN { RS = ";"; ORS = " +"}
> { print $0 }'
hello + nihao + hola + anyonghasaeyo
Field separators (FS)
Fields are separated according to the FS variable. The default value is a single space, which AWK treats specially: fields are separated by runs of one or more whitespace characters, and leading and trailing whitespace on the line is ignored. Thus, the following two lines look the same to AWK.
Joe John Johanna
   Joe   John   Johanna   
To specify a literal single space, enclose the space in brackets, such that FS = "[ ]".
The field separator may be set with the -F option on the command line, or by assigning FS in the BEGIN block.
$ echo 'Joe John Johanna' |
> gawk -F' ' '{ print NF ":" $0 }'
3:Joe John Johanna
# Same command as above but using the BEGIN block
$ echo 'Joe John Johanna' |
> gawk 'BEGIN { FS=" " }
> { print NF ":" $0 }'
3:Joe John Johanna
# Changing the FS character
$ echo '   Joe   John   Johanna   ' |
> gawk -F'[ ]' '{ print NF ":" $0 }'
13:   Joe   John   Johanna   
Here we can see that the -F option is used to set the FS variable straight from the command line. We'll formally learn how to use AWK via the command line in a future lesson.
Output Field Separator (OFS)
The Output Field Separator, or OFS, stores the string that is placed between fields on output. By default, it is a single space.
$ echo 'John Mary; Jacob Teresa; Bob Claire' |
> gawk 'BEGIN { OFS=" loves "; RS=";" }
> { print $1, $2 }'
John loves Mary
Jacob loves Teresa
Bob loves Claire
Field access ($n)
You may have noticed the use of the $0 variable in the previous examples. This variable stores the current record. To access fields, we can simply use a $ followed by the field number (e.g. $1 for the first field, $2 for the second, and so on).
$ echo 'uno dos tres' | gawk -F' ' '{ print "The second field is: " $2; print "The entire record is: " $0 }'
The second field is: dos
The entire record is: uno dos tres
Note that field numbers start at 1 and not 0, unlike most programming languages with zero-based indexing.
Field to integer conversion
Field numbers are truncated to integer values. Thus, $(2*2), $(8/2), $"4.41" and $4 all refer to the fourth field. Note that negative field numbers have no meaning.
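As a small sketch of computed field numbers, the snippet below picks fields using expressions instead of literal numbers. The input line and the variable n are made up for illustration, and it sticks to expressions that evaluate to whole numbers so that it should behave the same in any POSIX awk.

```shell
# Each expression after $ is evaluated first, then used as the field number.
# n is a hypothetical helper variable; NF is the number of fields (5 here).
out=$(echo 'a b c d e' | awk '{ n = 2; print $(2*2), $(8/2), $(n+2), $NF }')
echo "$out"   # d d d e
```

The parentheses matter: $(n+2) is the fourth field, whereas $n+2 would add 2 to the value of the second field.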
Patterns and Actions print
In the previous lesson, we looked at records & fields, and saw how AWK is able to parse and manipulate them. While learning this, you may have noticed that each action performed is enclosed within braces, which then applies to all records.
Patterns
But what if we just wanted to apply actions to lines that match a specific pattern? We can do this by preceding the action with a regular expression pattern.
/pattern/ { action } # Action is applied only to those records that match pattern
/pattern/ # Print all lines matching /pattern/
{ action } # Apply actions on all lines
This allows us to select which lines to apply our actions to. If you do not specify a pattern, then the action is applied to all lines. On the other hand, if there is no action, then all lines with the specified pattern are printed out.
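The three forms can be tried side by side with a short sketch. The three-line fruit list is invented sample data, and plain awk is used so the snippet is not tied to gawk.

```shell
# Made-up sample data: a name and a count on each line.
data=$(printf 'apple 3\nbanana 5\ncherry 7\n')

# Pattern and action: print the count only for lines containing "an".
both=$(printf '%s\n' "$data" | awk '/an/ { print $2 }')

# Pattern only: the default action prints the whole matching record.
pattern_only=$(printf '%s\n' "$data" | awk '/rr/')

# Action only: the action runs on every record.
action_only=$(printf '%s\n' "$data" | awk '{ print $1 }')

echo "$both"           # 5
echo "$pattern_only"   # cherry 7
echo "$action_only"
```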
Actions
Actions tell AWK how to process a specific record or part of its fields. Let's look at the print action, as it's the most basic thing you can do with a record.
Printing
When print is called, it prints the record followed by the output record separator (ORS), the default of which is a newline character. In the following example, all records will be printed. We have already seen how to specify the entire record (with $0) and a specific field n (with $n) using the dollar symbol.
$ echo ' uno dos tres ' | gawk -F' ' '{ print $0 }'
uno dos tres
# Default is to print the record
$ echo ' uno dos tres ' | gawk -F' ' '{ print }'
uno dos tres
# Print a specific field only
$ echo ' uno dos tres ' | gawk -F' ' '{ print $2 }'
dos
Printing by pattern
Now we can follow a certain pattern and print only those that match. For this example, we'll use grades.txt, which is a file containing grade reports of five students.
$ gawk '$6 ~ /B/ { print $0 }' grades.txt
Vern Wynne 85 78 93 B
Ingram Dannie 84 85 94 B+
Johnnie Adair 78 94 87 B
# Print last name, first name for students with a B
$ awk '/B/ {print $2 "\t" $1}' grades.txt
Wynne Vern
Dannie Ingram
Adair Johnnie
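Here, we use the ~ operator to select only those records whose sixth field matches the pattern. The second command matches /B/ against the entire record instead, which selects the same students in this file.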
Calling GAWK from the command line
Let's take some time to formally learn how to call gawk from the command line. The main gawk options and parameters are as follows.
The -F option sets the field separator (FS).
The -v var=value parameter assigns a value to a variable before the program starts executing. Such variables may be accessed even in the BEGIN block.
Arguments that come after the -- are passed to the awk program itself rather than being interpreted as gawk options.
As a one-liner
We have already seen how to use one-liners to apply awk statements to one or more files.
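As a quick refresher, here is a hedged sketch of a one-liner applied to two files at once. The files one.txt and two.txt are hypothetical and are created in a temporary directory just for this example.

```shell
# Create two small throwaway files (names and contents are made up).
tmp=$(mktemp -d)
printf 'a 1\nb 2\n' > "$tmp/one.txt"
printf 'c 3\n'      > "$tmp/two.txt"

# The same program runs over both files, one after the other.
out=$(awk '{ print $2, $1 }' "$tmp/one.txt" "$tmp/two.txt")
echo "$out"

rm -r "$tmp"
```

The program swaps the two fields of every record, so the output should be three lines, starting with "1 a".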
Pipelining
You may also choose to incorporate awk within a pipeline. Simply give the program as the first argument and omit the file names, so that awk reads from standard input.
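A minimal pipeline sketch might look like the following. The three input rows are invented, and sort is included only to show awk sitting in the middle of a longer pipeline.

```shell
# awk reads whatever the previous command writes to standard output.
out=$(printf 'b 2\na 1\nc 3\n' | sort | awk '{ sum += $2 } END { print NR " rows, total " sum }')
echo "$out"   # 3 rows, total 6
```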
Applying an awk file
It can be a hassle to type out every awk statement on the command line. Instead, we can save our awk commands in a file and apply it with the -f option.
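The sketch below writes a tiny program file and then runs it with -f. The file name count.awk and its contents are hypothetical, chosen only to keep the example short.

```shell
tmp=$(mktemp -d)

# count.awk (hypothetical name): count the records that are not empty.
cat > "$tmp/count.awk" <<'EOF'
NF > 0 { n++ }
END    { print n + 0 }
EOF

# Three input lines, one of them blank.
out=$(printf 'a\n\nb\n' | awk -f "$tmp/count.awk")
echo "$out"   # 2

rm -r "$tmp"
```

Writing n + 0 in the END block forces a numeric result, so the program prints 0 rather than an empty line when nothing matches.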
Assigning Options
We may also allow the user to set some variables from the command line. To do so, use the -v option.
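Here is a short sketch of -v in use. The variable names min and name are made up for this example, as is the input.

```shell
# Pass a threshold into the program; min is a hypothetical variable name.
out=$(printf '10\n25\n40\n' | awk -v min=20 '$1 >= min { print $1 }')
echo "$out"

# Variables set with -v are already available in the BEGIN block.
greeting=$(awk -v name=World 'BEGIN { print "Hello, " name }')
echo "$greeting"   # Hello, World
```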
Separate parameters per file
Say you have several files, but each file has its own awk variables that apply to it. We can do this all in one line.
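One way to do this, sketched below under the assumption that each file needs its own field separator, is to place var=value assignments between the file names. awk applies each assignment just before it opens the file that follows it. The two files here are hypothetical and are created only for the demonstration.

```shell
tmp=$(mktemp -d)
printf 'a:b:c\n' > "$tmp/colon.txt"   # colon-separated (made-up data)
printf 'x,y,z\n' > "$tmp/comma.txt"   # comma-separated (made-up data)

# FS=: is in effect for colon.txt, FS=, for comma.txt.
out=$(awk '{ print $2 }' FS=: "$tmp/colon.txt" FS=, "$tmp/comma.txt")
echo "$out"

rm -r "$tmp"
```

The second field of each file is printed, so the output should be "b" followed by "y". Unlike -v, assignments given this way are not yet visible in the BEGIN block.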
Modes
gawk also comes with the --profile option, which can be used to gather profiling statistics from the execution of the program.
Furthermore, there is a debug mode, which is enabled by the --debug option.
To revert gawk back to the traditional (awk) mode, use the --traditional option.
Further reading
To obtain a more complete list of awk options, use the --help parameter or the man command.
Predefined Variables in Awk
Awk contains a slew of helpful predefined variables. Let's look at them and how we can incorporate them into our awk scripts.
Command Line Arguments
Awk allows you to access any command line arguments that the user may have passed in via the command line. ARGC gives the argument count, while ARGV provides an array of argument values.
It is possible to modify ARGC and ARGV. When removing elements from the end of ARGV, be sure to decrement ARGC.
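As a minimal sketch, the loop below lists the arguments passed to awk. The words alpha, beta and gamma are placeholders; because the program has only a BEGIN block, awk never tries to open them as files. ARGV[0] holds the name of the awk command itself, so the loop starts at 1.

```shell
# ARGC counts ARGV[0] as well, so the last argument is ARGV[ARGC-1].
out=$(awk 'BEGIN { for (i = 1; i < ARGC; i++) print i ": " ARGV[i] }' alpha beta gamma)
echo "$out"
```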
Environment variables
You may access the user's environment variables with the built-in array ENVIRON.
$ awk 'BEGIN { print ENVIRON["HOME"]; print ENVIRON["USER"] }'
You can add, delete and modify entries as needed. POSIX requires that subprocesses inherit the environment in effect when awk is started.
Scalar variables
The following are scalar variables that hold a single value.
- FILENAME: Name of the current input file.
- FNR: Record number in the current input file.
- FS: Field separator (default: " ").
- NF: Number of fields in the current record.
- NR: Record number in the job.
- OFS: Output field separator (default: " ").
- ORS: Output record separator (default: "\n").
- RS: Input record separator (default: "\n"); may be a regular expression in gawk and mawk only.
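To make the difference between NR and FNR concrete, here is a hedged sketch that prints several of these variables for every record. The files a.txt and b.txt are hypothetical; they are created in a temporary directory, and awk is run from inside it so that FILENAME shows the short names.

```shell
tmp=$(mktemp -d)
printf 'one two\nthree\n' > "$tmp/a.txt"
printf 'four five six\n'  > "$tmp/b.txt"

# Columns printed: file name, NR (overall), FNR (per file), NF (field count).
out=$(cd "$tmp" && awk '{ print FILENAME, NR, FNR, NF }' a.txt b.txt)
echo "$out"

rm -r "$tmp"
```

NR keeps counting across files, while FNR restarts at 1 when b.txt is opened.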
Printing records of a specific length
AWK provides built-in functions that we will see later. The built-in length function, for instance, makes it possible to select records by their length.
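A minimal sketch of selecting records by length could look like this. The three input lines and the threshold of 8 characters are arbitrary choices for illustration.

```shell
# Print only records longer than 8 characters, prefixed with their length.
out=$(printf 'short\na much longer line\nmid-size\n' |
      awk 'length($0) > 8 { print length($0) ": " $0 }')
echo "$out"   # 18: a much longer line
```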
Each of the following programs prints every record:
1
NR > 0 {print}
1 {print}
{print}
{print $0}
$ echo 'one two three four' | awk '{ print $1, $2, $3 }'
one two three
$ echo 'one two three four' | awk '{ OFS="..."; print $1, $2, $3 }'
one...two...three
$ echo 'one two three four' | awk '{ OFS="\n"; print $1, $2, $3 }'
one
two
three
$ echo 'one two three four' | awk '{ OFS="\n"; print $0 }'
one two three four
$ echo 'one two three four' | awk '{ OFS="\n"; $1 = $1; print $0 }'
one
two
three
four
Reassigning a field (here, $1 = $1) forces reassembly of the record with the new output field separator.
Pattern Expressions
You may also use built-in awk variables within your pattern. Here is a short list of expressions you may use. We will go through built-in variables in a future lesson, but here's a quick sneak-peek.
- NF == 0: Select empty records.
- NF > 4: Select records containing more than 4 fields.
- NF < 4: Select records containing fewer than 4 fields.
- $1 ~ /Ingram/: Select records whose first field contains "Ingram".
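These expressions can be tried out with a short sketch. The four sample records below, including one blank line, are invented for the purpose.

```shell
# Made-up input: 5 fields, an empty record, 2 fields, 3 fields.
data=$(printf 'a b c d e\n\nx y\nIngram Dannie 84\n')

empty=$(printf '%s\n' "$data" | awk 'NF == 0 { n++ } END { print n + 0 }')
many=$(printf '%s\n' "$data"  | awk 'NF > 4')
few=$(printf '%s\n' "$data"   | awk 'NF < 4 { print NF }')
first=$(printf '%s\n' "$data" | awk '$1 ~ /Ingram/')

echo "$empty"   # 1
echo "$many"    # a b c d e
echo "$few"
echo "$first"   # Ingram Dannie 84
```

Note that NF < 4 also selects the empty record, whose field count is 0.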
Pattern range expressions
In addition to patterns above, we can specify a range of text. We may do this with two expressions separated by a comma.
- (FNR == 3), (FNR == 5): Select records 3 through 5 (inclusive).
- /<[Hh][Tt][Mm][Ll]>/, /<\/[Hh][Tt][Mm][Ll]>/: Select the body of an HTML document.
- /[aeiouy][aeiouy]/, /[^aeiouy][^aeiouy]/: Select from two vowels to two nonvowels.
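As a sketch of how a range behaves, the snippet below applies two range patterns to made-up input. A range starts at the first record matching the left-hand expression and runs through the next record matching the right-hand one, both inclusive.

```shell
# Records 3 through 5 of a six-line input (line contents are invented).
lines=$(printf 'l1\nl2\nl3\nl4\nl5\nl6\n' | awk 'FNR == 3, FNR == 5')
echo "$lines"

# Everything from the opening html tag to the closing one, case-insensitively.
html=$(printf 'junk\n<HTML>\nbody text\n</html>\nmore junk\n' |
       awk '/<[Hh][Tt][Mm][Ll]>/, /<\/[Hh][Tt][Mm][Ll]>/')
echo "$html"
```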