Using Perl and Regular Expressions to Process HTML Files – Part 2

In this article we will discuss how to change the contents of an HTML file by running a Perl script on it.

Note: To ensure that the code is displayed correctly, the example code in this article uses square brackets ‘[..]’ in HTML tags instead of angle brackets ‘<..>’. In your own files and scripts, type the tags with angle brackets.

The file we are going to process is called file1.htm:

[html]
[head]
[title]Sample HTML File[/title]
[link rel="stylesheet" type="text/css" href="style.css"]
[/head]
[body]
[h1]Introduction[/h1]
[p]Welcome to the world of Perl and regular expressions[/p]
[h2]Programming Languages[/h2]
[table border="1" width="400"]
[tr][th colspan="2"]Programming Languages[/th][/tr]
[tr][td]Language[/td][td]Typical use[/td][/tr]
[tr][td]JavaScript[/td][td]Client-side scripts[/td][/tr]
[tr][td]Perl[/td][td]Processing HTML files[/td][/tr]
[tr][td]PHP[/td][td]Server-side scripts[/td][/tr]
[/table]
[h1]Summary[/h1]
[p]JavaScript, Perl, and PHP are all interpreted programming languages.[/p]
[/body]
[/html]

Imagine that we need to change both occurrences of [h1]heading[/h1] to [h1 class="big"]heading[/h1]. This is not a big change, and it could easily be done manually or with a simple search and replace, but we’re just getting started here.

To do this, we could use the following Perl script (script1.pl):

1 open (IN, "file1.htm");
2 open (OUT, ">new_file1.htm");
3 while ($line = <IN>) {
4     $line =~ s/[h1]/[h1 class="big"]/;
5     (print OUT $line);
6 }
7 close (IN);
8 close (OUT);

Note: You don’t need to enter the line numbers. I’ve included them simply so that I can reference individual lines in the script.

Let’s look at each line of the script.

Line 1

In this line file1.htm is opened so that it can be processed by the script. To process the file, Perl uses something called a filehandle, which provides a kind of link between the script and the operating system and holds information about the file that is being processed. I’ve called this “opening” filehandle ‘IN’, but I could have used anything within reason. Filehandles are normally written in capitals.

Line 2

This line creates a new file called ‘new_file1.htm’, which is written to by using another filehandle, OUT. The ‘>’ just before the filename indicates that the file will be written to.

Line 3

This line sets up a loop in which each line of file1.htm is read into $line and examined individually.

Line 4

This is the regular expression. It searches for the first occurrence of [h1] on each line of file1.htm and, if it finds it, changes it to [h1 class="big"].

Looking at Line 4 in more detail:
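
As a minimal sketch of what the substitution on line 4 does, the snippet below applies it to two sample lines. The subroutine name add_big_class is made up for this illustration and is not part of script1.pl, and the tags are written with real angle brackets, as they would appear in an actual HTML file.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of script1.pl): applies the same
# substitution as line 4 to one line of text and returns the result.
sub add_big_class {
    my ($line) = @_;
    # s/PATTERN/REPLACEMENT/ replaces the first match of PATTERN in $line.
    $line =~ s/<h1>/<h1 class="big">/;
    return $line;
}

# A line containing the tag is changed...
print add_big_class("<h1>Introduction</h1>\n");
# ...and a line without it passes through untouched.
print add_big_class("<h2>Programming Languages</h2>\n");
```

Because there is no /g modifier after the final slash, only the first match on a line is replaced. That is enough for file1.htm as long as each heading sits on its own line.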

John is a web developer working for My Health Questions Matter, a company dedicated to helping patients to get the most out of their interaction with health care professionals such as doctors, midwives, and consultants by generating a set of health questions a patient can ask at an appointment.
