
My Reading List

Here are the blogs and websites that I read; I hope you find them useful.

Personal:

http://zenhabits.net/

http://sethgodin.typepad.com/

Technical:

http://stackoverflow.com/

http://highscalability.com/all-time-favorites

https://blog.twitter.com/engineering

https://www.facebook.com/Engineering

Field separator in shell script

When analysing data from an input file, we usually split it on a separator character (comma, space, etc.) that depends on the data. This post is about how the default internal field separator (IFS) works in shell scripts on Unix-like systems.

I ran into a problem while parsing an access log (see the sample log below). The requirement was to copy the entries of a specific domain into a separate file, i.e. to read the log lines for test1.com and write them to filtered.log.

Access log pattern: (each field is tab separated)

Source_IP      HTTP_Version    Domain  Port    Access_Method   Url_Path        Referer         Agent   Bytes_In        Bytes_Out       Http_Status     Time

### Sample log : (access.log)

xx.xx.xx.xx     HTTP/1.0        test1.com       8080    GET     /images/logo.png        "http://test1.com/home.html"     "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"       490     7560    200     2013-08-06 20:00:02

yy.yy.yy.yy     HTTP/1.0        new2.org       8080    GET     /slide/2.jpg    "http://new2.org/about.html"     "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.0 Safari/537.36"   452     170     304     2013-08-06 20:03:07

zz.zz.zz.zz     HTTP/1.0        test1.com       8080    GET     /images/new.png "http://test1.com/contact.html"     "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"       492     7604    200     2013-08-06 20:11:12

### Script used to parse the log : (parse_logs.sh)

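    # Loop over the contents of access.log
    for i in `cat access.log`
    do
        # found is 1 if the current item contains test1.com, otherwise 0
        found=`echo $i | grep "test1.com" | wc -l`
        if [ $found -gt 0 ]
        then
            echo $i >> filtered.log
        fi
    done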

### Expected output: (filtered.log)

xx.xx.xx.xx     HTTP/1.0        test1.com       8080    GET     /images/logo.png        "http://test1.com/home.html"    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"       490     7560    200     2013-08-06 20:00:02

zz.zz.zz.zz     HTTP/1.0        test1.com       8080    GET     /images/new.png "http://test1.com/contact.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"       492     7604    200     2013-08-06 20:11:12

### Actual output: (filtered.log)

test1.com

"http://test1.com/home.html"

test1.com

"http://test1.com/contact.html"

Reason and Fix :

The expected result was that the lines containing test1.com would be written to the output file (filtered.log), but only the individual words containing test1.com were written. The cause is the behaviour of the internal field separator (IFS).

About IFS : http://en.wikipedia.org/wiki/Internal_field_separator

The logic inside the for loop was meant to run once for each line of the access log. However, the shell splits the unquoted output of the cat command on the characters in IFS, which by default are the whitespace characters space, tab, and newline. The loop therefore runs once per word rather than once per line, grep is executed against each word, and the output is wrong.
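To see the splitting in isolation, here is a minimal sketch. The helper name count_iterations and the sample text are made up for illustration and are not part of the original script. It counts how many times a for loop runs over the same two-line text, first with the default IFS characters and then with a newline only.

```shell
#!/bin/sh
# Illustrative sketch only: count_iterations and the sample text are hypothetical.

tab=$(printf '\t')
newline='
'

# Two lines of made-up text; the first line also contains a space
sample="a${tab}b c${tab}d${newline}e${tab}f${tab}g"

# Print how many times a for loop iterates over text $1
# when IFS is set to $2 during the unquoted expansion.
count_iterations() {
    saved_ifs=$IFS
    IFS=$2
    n=0
    for word in $1; do      # unquoted on purpose: field splitting happens here
        n=$((n + 1))
    done
    IFS=$saved_ifs
    echo "$n"
}

default_ifs=" ${tab}${newline}"             # space, tab, newline
count_iterations "$sample" "$default_ifs"   # prints 7 (one per word)
count_iterations "$sample" "$newline"       # prints 2 (one per line)
```

The same text yields seven iterations under the default separators and two when only the newline is a separator, which matches the word-by-word output seen above.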

To change this, I set the IFS variable to the newline character alone, and the script worked as expected. It is important to restore IFS to its old value at the end of the code block, so that later commands are not affected.
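One way to avoid the restore step, sketched here as an alternative and not as the approach used in this post, is to confine the IFS change to a subshell. The helper name filter_lines and the shortened log text are assumptions made for the example.

```shell
#!/bin/sh
# Illustrative sketch only: filter_lines and the sample log are hypothetical.

tab=$(printf '\t')
newline='
'

# Print every line of text $1 that contains the literal string $2.
filter_lines() {
    (
        IFS=$newline    # this change is visible only inside the subshell
        set -f          # disable filename expansion for the unquoted $1
        for line in $1; do
            case $line in
                *"$2"*) printf '%s\n' "$line" ;;
            esac
        done
    )
}

# Shortened, made-up log with two tab-separated lines
log="1.1.1.1${tab}test1.com${tab}GET${newline}2.2.2.2${tab}new2.org${tab}GET"

ifs_before=$IFS
filter_lines "$log" "test1.com"
[ "$IFS" = "$ifs_before" ] && echo "IFS is unchanged after the call"
```

Because the parentheses start a subshell, the calling shell keeps its own IFS and nothing needs to be saved or restored.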

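You can inspect the current IFS value with this command at a shell prompt:

    echo "$IFS" | cat -vte

With the default value this prints a space followed by ^I (the tab) and $ (end of line), and then a second line containing only $ (the newline).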

### Modified script: (parse_logs_new.sh)

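    # Save the current IFS so that it can be restored later
    OLDIFS=$IFS
    # Split on newlines only ($'\n' is bash/ksh/zsh syntax for a newline)
    IFS=$'\n'
    for i in `cat access.log`
    do
        found=`echo "$i" | grep "test1.com" | wc -l`
        if [ $found -gt 0 ]
        then
            echo "$i" >> filtered.log
        fi
    done
    # Restore the original IFS
    IFS=$OLDIFS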
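As an alternative sketch (not the fix used above), a while loop with read processes the file line by line without changing IFS for the rest of the script, and it avoids filename expansion on lines that contain characters such as ? or *. The function name filter_log and the demo file contents are assumptions for illustration.

```shell
#!/bin/sh
# Illustrative sketch only: filter_log and the demo data are hypothetical.

# Print every line of file $1 that contains the literal string $2.
filter_log() {
    # "IFS=" applies to this read command only and keeps leading/trailing
    # whitespace; -r keeps backslashes literal. The extra test handles a
    # final line that has no trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            *"$2"*) printf '%s\n' "$line" ;;
        esac
    done < "$1"
}

# Demo with a small made-up, tab-separated log
demo_log=$(mktemp)
printf '1.1.1.1\ttest1.com\tGET\n2.2.2.2\tnew2.org\tGET\n3.3.3.3\ttest1.com\tGET\n' > "$demo_log"
filter_log "$demo_log" "test1.com"
rm -f "$demo_log"
```

With the real log this would be called as filter_log access.log test1.com >> filtered.log. For this particular task a single grep -F 'test1.com' access.log > filtered.log would probably be enough, but the loop form is handy when each line needs further processing.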