CAZine: issue 2, August 2009

Coders Corner

Featuring Sed & AWK

By: Gamma Cpt Zerhash

This is a new column geared toward introducing the community to new and interesting ways to code. We will look into various languages as well as concepts and platforms. The idea is to teach something useful and open doors to further research.

This issue we will touch upon a lesser-known evil: sed and AWK. For those less familiar, shell programming works quite differently from programming in a typical language. The language is defined less by libraries and end-user documentation than by programs and their man pages. There was a time when Windows had a decent MS-DOS framework, but that has somewhat dwindled. With UNIX, and later Linux, a more or less standardized set of programs emerged. So when I refer to shell programming here, I am referring only to Linux platforms.

Sed is a ‘S’tream ‘Ed’itor, and hence it is very good at handling extremely large files. Rather than slurping everything into memory, sed takes a minimalist approach and works through its input one line at a time. Sed has been a large influence on the way Perl does regular expressions. AWK is designed for processing text-based data and is named after its authors: Alfred Aho, Peter Weinberger and Brian Kernighan.

To demonstrate the language in its simplest form we will use a common problem. When running a website, you want to see if there are bad links on your site, and maybe other warnings. We will use AWK to go through your error logs and see which pages are being reported, then follow up with sed to fix the problems.

So, lo and behold, logsearch.awk:

#!/bin/awk -f
        {
                if (NR==1){
                        startdate = $1 " " $2 " " $3 " "$4 " " $5;
                }
        }
/\[error\]/ {
                error[substr($0,index($0,$9))]++;
                errorline++;
        }
/\[warn\]/ {
                warn[substr($0,index($0,$9))]++;
                warnline++;
        }
END     {
                print "Frequency\tError";
                for (i in error){
                        print error[i] "\t\t" i
                }
                print "Frequency\tWarning";
                for (i in warn){
                        print warn[i] "\t\t" i
                }
                print "Total Errors: \t" errorline;
                print "Total Warnings: " warnline;
                print "Start Date: \t" startdate;
                print "End Date: \t" $1 " " $2 " " $3 " " $4 " " $5
        }

In order to run this file you will have to change the permission to execute:

chmod u+x logsearch.awk

Then to run it:

$> ./logsearch.awk /location/of/logfile.log

Line by line

The shebang line tells the system that we want awk to handle this file. The '-f' means that awk will read its program from a file, in this case the script itself. If awk lives somewhere else on your system, such as /usr/bin/awk, adjust the path.
#!/bin/awk -f
As is so common in the K&R way, we have an opening and closing brace around our statements. Before the braces comes a pattern, which is optional. If there is no pattern the block runs for every record; otherwise a regex decides which records it runs on. There are also the special patterns BEGIN and END, which tell awk to run a block before the first record is read or after the last one has been processed.
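
To see when each kind of block fires, here is a minimal sketch. The three input lines and the variable names (out, count) are made up for illustration and are not part of logsearch.awk.

```shell
# Feed three made-up lines to awk and show when each kind of block runs.
out=$(printf 'alpha\nbeta [error] oops\ngamma\n' | awk '
BEGIN       { print "start" }            # runs once, before any input is read
/\[error\]/ { print "match: " $0 }       # runs only for lines matching the regex
            { count++ }                  # no pattern: runs for every line
END         { print "lines: " count }    # runs once, after the last line
')
printf '%s\n' "$out"
```

This should print "start", then the one matching line, then "lines: 3".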

In the first block we have an if statement which uses a special variable, NR. NR is the number of the current record. I want to record the start date of the log file, so for the first record I set the variable startdate.
        {
                if (NR==1){
                        startdate = $1 " " $2 " " $3 " "$4 " " $5;
                }
        }
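Now the start date looks pretty funny compared to other languages.
                        startdate = $1 " " $2 " " $3 " "$4 " " $5;
This is because AWK is a stream-oriented, record-based language. As it reads and operates line by line, it splits each record into fields and sets the special variables $1, $2, $3 ... $n, with $0 holding the whole record. Putting strings next to each other concatenates them, so this line glues the first five fields, which in this log format hold the timestamp, back together with spaces. Fields are separated by whitespace by default, but that can be changed to anything else, like a colon for passwd files.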
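
As a rough illustration of changing the separator, the sketch below uses awk's -F option on two made-up passwd-style records. The records and the variable name out are assumptions for the example only.

```shell
# Made-up passwd-style records: fields are separated by colons, not spaces.
out=$(printf 'root:x:0:0:root:/root:/bin/bash\nwww:x:33:33:web:/var/www:/bin/sh\n' |
  awk -F: '{ print NR ": " $1 " uses " $7 " (" NF " fields)" }')
printf '%s\n' "$out"
```

Here NR is the record number, NF is the number of fields in the current record, and $1 and $7 are the login name and the shell. The same separator can also be set inside a script by assigning FS in a BEGIN block.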

Pattern Matching

As I have mentioned, we can set aside blocks for specific patterns. AWK patterns are extended regular expressions as defined by POSIX, and GNU awk adds some extensions of its own. This isn't a lesson on regexes, so we will keep it very simple here and look for [error] and [warn]. The square brackets are escaped with backslashes because unescaped brackets would define a character class.
/\[error\]/ {
                error[substr($0,index($0,$9))]++;
                errorline++;
        }
/\[warn\]/ {
                warn[substr($0,index($0,$9))]++;
                warnline++;
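        }
When we find a match we add to an associative array to keep track of what we found. The expression substr($0,index($0,$9)) takes everything from the ninth field to the end of the line, which in this log format is the message text, and uses it as the array key; the ++ counts how often that message has shown up. Variables in AWK are loosely typed, so warn[blah] could hold anything, and the same goes for 'blah'.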

The END

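When the script is done I want to print a report of what I found. The for loop is new here: it walks through the keys of the array for us, like a for-each loop in any number of other languages. The End Date line works because the common awk implementations (gawk, mawk and the original awk) keep the fields of the last record available inside END; some older versions do not.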
END     {
                print "Frequency\tError";
                for (i in error){
                        print error[i] "\t\t" i
                }
                print "Frequency\tWarning";
                for (i in warn){
                        print warn[i] "\t\t" i
                }
                print "Total Errors: \t" errorline;
                print "Total Warnings: " warnline;
                print "Start Date: \t" startdate;
                print "End Date: \t" $1 " " $2 " " $3 " " $4 " " $5
        }
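
One thing to be aware of is that a for (i in array) loop visits the keys in no particular order, so the report will not come out sorted. A simple way around that, sketched here with made-up input and my own variable names, is to let awk do the counting and hand the ordering to sort.

```shell
# Count made-up one-word records, then sort by frequency, highest first.
out=$(printf 'b\na\nb\nc\nb\na\n' |
  awk '{ seen[$0]++ } END { for (k in seen) print seen[k] "\t" k }' |
  sort -rn)
printf '%s\n' "$out"
```

The same pipe could be applied to the output of logsearch.awk if the report printed only the frequency lines.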

The Clean Up

So at this point we have found a number of errors and warnings, and the report looks much neater than a web log file. We can see which files are being requested that don't exist. The most common fix is to change the wrong location to the right one.

Here is where we bring in sed.
find ./ -iname '*html' -type f -exec sed -i 's/string1/string2/' {} \;
We start off with the find command. You will have to be in the directory of your website.
find ./ -iname '*html'
This will recursively go through your website and pull all files ending with html, ignoring case. The pattern is quoted so that the shell hands it to find untouched instead of expanding it itself. The -type f further restricts the matches to regular files, not directories or anything else. Now the -exec option has find run a command on each of these files. For each file that find retrieves, its name is passed to the executed command where '{}' is placed. Note that the terminating semicolon is escaped as \; so that the shell does not treat it as its own command separator and passes it on to find, where it marks the end of the -exec command.
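
Before letting sed loose on a whole site, it can help to see which files would be touched. The sketch below swaps sed for grep -l, which only lists the files containing the string. The temporary directory, file names and the path /old/page.html are all invented for the example.

```shell
# Build a throwaway site with two HTML files and one text file.
site=$(mktemp -d)
mkdir -p "$site/docs"
printf '<a href="/old/page.html">x</a>\n' > "$site/index.html"
printf '<p>nothing to fix</p>\n' > "$site/docs/about.HTML"
printf '/old/page.html\n' > "$site/notes.txt"

# List the HTML files that contain the string; nothing is modified.
hits=$(cd "$site" && find . -type f -iname '*html' -exec grep -l '/old/page.html' {} \;)
printf '%s\n' "$hits"
rm -rf "$site"
```

Only ./index.html should be listed: about.HTML matches the name test but does not contain the string, and notes.txt contains the string but is not an HTML file.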

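Now for the sed part:
sed -i 's/string1/string2/'
This is a very simple devil. On each line it replaces the first match of string1 with string2, so you can have a wrong file path replaced with the correct one. Keep in mind that string1 is a regular expression, and that a path containing slashes needs them escaped or a different delimiter. The -i tells sed to edit the file in place, saving the results back to that file instead of printing them.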

There are a lot more options here, which allow you to back up content and do more complicated operations.
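
As one example of those options, here is a sketch that keeps a backup copy while editing in place. It assumes a sed that accepts a suffix attached to -i (GNU sed does); the file contents and the /old/ and /new/ paths are made up.

```shell
# Create a made-up page with two links that point at the wrong location.
page=$(mktemp)
printf '<a href="/old/a.html">a</a> <a href="/old/b.html">b</a>\n' > "$page"

# -i.bak edits the file in place and keeps the original as "$page.bak".
# Using | as the delimiter avoids escaping the slashes in the paths;
# the trailing g replaces every match on a line, not only the first.
sed -i.bak 's|/old/|/new/|g' "$page"

after=$(cat "$page")
before=$(cat "$page.bak")
printf 'after:  %s\nbefore: %s\n' "$after" "$before"
rm -f "$page" "$page.bak"
```

If something goes wrong, the .bak file still holds the original content and can simply be copied back.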

The Home Stretch

So these are two very basic commands, and with them we managed to do a lot of work with very little code. This is characteristic of UNIX-based systems. On other platforms this may be a long process and involve a lot of clicking.

As this is a new column, the focus is on introducing the community to new languages and tools. It isn't meant to be an absolute newcomer's guide, otherwise we would lose all functionality and purpose. It is meant, however, to inspire you to dive in and get more familiar on your own. That being said, I will leave you with some good reference material.

Book

sed & awk, second edition by Dale Dougherty and Arnold Robbins

Tutorial

AWK Tutorial by Greg Goebel

AWK Tutorial by Bruce Barnett

Sed Tutorial by Bruce Barnett

Man Pages

man page for awk

man page for sed

man page for find