Thursday, August 7, 2008

How To Count Lines Of Text (LOC) in Documents

Suppose you have a (large) number of files and you want to know how many lines of text each contains. In computer programming that number is known as Lines Of Code, or LOC, and is a rough measure of a program's size and complexity. Or maybe you are a writer or editor who keeps each chapter of a book in a separate file. Or maybe you have a data sample that came in plain text files and you need to know how many entries (records) each file has.

There are, of course, many ways to do it. The most straightforward one is to simply open each file in your preferred text editor, be it Notepad, Notepad++, UltraEdit or whatever, press CTRL-END to go to the end of the text, and look at the current line number.

This option is OK for a small set of files, say, ten or fifteen. But if you have a larger set, fifty or more, it will probably take too much time. So, let's look at alternatives:

I. Using a DOS Batch

This option is simple and very quick to build. Experienced VBScript writers will prefer the next one since it's more flexible, but casual users should default to this one. The two are very similar and give about the same basic result.

The Basics: A DOS script is a script for DOS (whose descendant is now called the Windows Command Prompt). Old timers will remember dealing with .bat files in the old DOS days, usually to pass parameters when calling a program. But, in fact, DOS bat files could do much more even back then, and now they are a very handy tool for any Windows user who doesn't want to deal with the intricacies of VBS programming.

Restrictions: Can only be applied to plain-text (non-formatted) files.
Downsides: It can only count the lines. The VBScript solution can also apply filters, etc.
Recommended: If all you need is the total number of lines for each file.
How to:

A DOS bat is a text file with the .bat extension (for instance, example.bat) that contains native DOS commands. The bat is purely interpreted -- it's as if each command were typed separately by the user at the command line. To create one, open a new file in your favorite text editor (Notepad, UltraEdit, Notepad++, etc.) and save it with the .bat extension (countLOC.bat). Saving the file in c:\ makes it easy to reach from the command prompt. The first example only counts the lines of a single text file, data\text.txt, a path relative to the folder you run the script from:

countLOC.bat (1)
  1. @echo off
  2. REM The following line sets delayed expansion, which is used to
  3. REM make sure variables have dynamic value.

  4. SETLOCAL ENABLEDELAYEDEXPANSION

  5. REM The following FOR loop reads the file line by line
  6. FOR /F "tokens=*" %%j in (data\text.txt) do (
  7. set /a numLines=!numLines!+1
  8. )
  9. echo data\text.txt !numLines!
To run the script above, you will need to open the DOS command prompt (now simply called the command prompt) and type the name of the script:
  1. Open the DOS prompt (Start → Run, "cmd").
  2. First switch to the disk partition where the files are by typing the letter of the partition followed by a colon, like "c:", then Enter.
  3. If your shell tells you you're in a different folder, just type "cd \", which will bring you to the root.
  4. Run the program by typing its name at the prompt:

    c:\>countLOC
Now, let's examine the code. The first line disables echo, which means that the commands themselves won't be displayed on the screen, only their output. Remove this line if you want to debug the bat, but put it back afterwards so that the display stays clean.

Lines 2 and 3 begin with REM, which is short for REMARK. Use it as a comment mark (anything following REM is ignored). SETLOCAL in line 4 does what the comment explains: in a DOS bat, a variable (declared with a SET command) written as %numLines% is expanded only once, when the whole FOR block is read, so it would show the same value on every pass of the loop. With delayed expansion enabled, writing the variable as !numLines! reads its current value each time the line runs.

Finally, it's necessary to understand the FOR command. FOR is a very handy tool that lets script writers repeat a task once for each item in a set. Its help is very comprehensive (type c:\>for /?) if you want a description of its uses. The most interesting form is /f, which uses files (or text, actually) as input. The command will scan the file, "data\text.txt", line by line. Usually, this would be combined with another command to extract some information from the file. In our case, though, we just run the code on line 7, which increments the numLines variable for each line read. Keep in mind that FOR /F skips empty lines, so blank lines are not counted.

The /a option in SET tells the command to interpret its input as an arithmetic expression, so line 7 adds one to numLines. If you run the above script, you will see output like this:

C:\>countLOC.bat
data\text.txt 9

The next goal is to do this with all the files inside a certain folder. Take a look at the next version:

countLOC.bat (2)
  1. @echo off

  2. REM The following line sets delayed expansion, which is used to make sure variables
  3. REM have real dynamic value.

  4. SETLOCAL ENABLEDELAYEDEXPANSION

  5. REM Get DIR from input
  6. SET DIR=%1

  7. for /f "tokens=*" %%i in ('dir /b /a-d %DIR%') do (
  8. REM Reset the counter so each file starts from zero
  9. set numLines=0
  10. for /f "tokens=*" %%j in (%DIR%\%%i) do (
  11. set /a numLines=!numLines!+1
  12. )
  13. echo %%i !numLines!
  14. )
The main differences are the SET DIR line and the new outer FOR loop. As you can see, we've nested our previous FOR loop in another FOR loop, so that it is repeated for each file in a given folder. The outer FOR takes as its argument the command dir /b /a-d %DIR%, whose output is passed to FOR as if it were a text file, so each file name in the folder is read in turn. The /b switch tells DIR to run in bare mode, which outputs just the file names, without header or trailer, and /a-d tells it not to list directories.

Finally, I've "parametrized" the folder name. A DOS bat can receive arguments from the command line through %1, %2, etc. So the user must pass the folder name as an argument to the bat, which is data (that is, c:\data) in our example:

C:\>countLOC data
asd.txt 7
cassi.txt 10
DBD_All.txt 8926
SourceSafe.txt 8927

The script is ready. All you have to do is copy the solution from countLOC.bat (2) and save it as countLOC.bat. In case you want to learn more about DOS scripting, point your browser to Rob van der Woude's Scripting Pages.

II. Using a VBScript

Using a scripting language is probably the best way to go, since you can customize the solution to do whatever you want. This solution is based on an article by Hey, Scripting Guy!, from MSDN. I've used it more than once to count LOC in source files and records in data files.

The basics: You will need to write a very simple VBScript to do it. VBScript isn't the best scripting language out there, sure, but it has a great advantage: any Windows installation comes with it. To 'program', just open your preferred text editor (we love text editors, don't we?) and save the file as something.vbs (as usual, "something" is anything you want). Then double-click this file in Windows Explorer and it will be compiled and run.

"Compiled?", you ask. Sort of. VBScript is normally described as an interpreted language, but the script engine parses the whole file into an internal form (not native machine code) before it runs anything. That is why a syntax error anywhere in the file is reported as a "compilation error" before the first line executes.

Restrictions: Can only be applied to non-formatted files.
Downsides: VBScript programming is a bit complex for newbies.
Recommended: If you want to apply some filters to the file, like checking whether a line is a comment, etc.

How to:

So, let's write this program. First you must understand that, to access files and other operating system functions, VBScript uses a "component" from Windows, the FileSystemObject. It doesn't really matter what that object is, but if you are interested, just google it. To create this object, write this in the file:

countLOC.vbs (1):

Set fso = CreateObject("Scripting.FileSystemObject")

This will create an object named fso (short for file system object) which we'll use to access files and folders. Let's say now that you want to open a file named "text.txt", which resides in "c:\data". All you have to do is:

countLOC.vbs (2):

Set fso = CreateObject("Scripting.FileSystemObject")
Set objTextFile = fso.OpenTextFile("c:\data\text.txt", 1)


This gives you objTextFile, which you will use to manipulate the text file itself. Looking at the arguments (the stuff enclosed in parentheses), you will notice a '1' being passed. This tells fso that the file is to be opened for reading only, preventing the script from messing with your text. Now we need to count the lines in text.txt. We do that by reading until the end of the file and then checking how many lines were read:

countLOC.vbs (3):

Const ForReading = 1

Set fso = CreateObject("Scripting.FileSystemObject")

Set objTextFile = fso.OpenTextFile("c:\data\text.txt", ForReading)


objTextFile.ReadAll
Wscript.Echo(objTextFile.Line)

In the code above, we made two modifications. First, the number that tells fso that the file is for reading only was replaced by a constant, so that our code becomes more readable. Second, we read the whole file using the method ReadAll and then printed, through Wscript.Echo, the number of the current line of the file. Since we read the whole file, the line printed will be the last one.

The code above is the skeleton of what we will use to create the report containing the line count of every file. The first difference is that we'll get a list of all the files in the folder and then analyze each one of them. To do so, we will use a loop in our script. The second difference is that we will write the result to a file instead of showing pop-ups. It's easy but, if you are not really interested in learning the hows and whys, skip to the last code listing.

The way to retrieve the file listing from the folder is to use the method fso.GetFolder("folderPath"), which returns a folder object, and then use this object's Files collection to retrieve the listing. We iterate through the listing with a For Each loop:

countLOC.vbs (4)
  1. On Error Resume Next
  2. Const ForReading = 1
  3. Set fso = CreateObject("Scripting.FileSystemObject")

  4. ' Open an output file to write results. Existing files will be overwritten.
  5. Set reportFile = fso.CreateTextFile("c:\FileList.txt", True)

  6. ' Get the file listing for the given folder
  7. Set folder = fso.GetFolder("c:\data")

  8. ' Iterate through the files
  9. For Each fileIdx In folder.Files
  10. ' Open the file using the name from fileIdx (file index)
  11. Set objTextFile = fso.OpenTextFile("c:\data\" & fileIdx.Name, ForReading)
  12. objTextFile.ReadAll

  13. ' Output to the report file
  14. reportFile.WriteLine(fileIdx.Name & ";" & objTextFile.Line)
  15. ' Close the file before moving on to the next one
  16. objTextFile.Close

  17. Next
  18. ' Close the report file
  19. reportFile.Close
As you can see, we made many modifications. Line 1 tells the script that, when a statement raises an error, it should carry on with the next statement instead of stopping. Line 5 creates the output file and line 7 creates a folder object that can be used to access the file listing. Line 9 begins the loop, which sets the variable fileIdx to each file in the folder in turn. As you can see in lines 11 and 14, fileIdx.Name returns the name of the file.
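One catch with On Error Resume Next is that a failed OpenTextFile leaves objTextFile pointing at whatever was opened before, so the report can end up with a stale number. Here is a minimal sketch of checking the Err object after the open. The helper name TryCountLines, the -1 return value and the script name in the usage comment are my own assumptions for illustration, not part of the original article:

```vbscript
Const ForReading = 1

' Hypothetical helper (an assumption, not from the original script): returns
' the line count of one file, or -1 when the file cannot be opened.
Function TryCountLines(fso, path)
    Dim ts
    TryCountLines = -1

    ' Inside a function, On Error Resume Next only covers this function.
    On Error Resume Next
    Err.Clear
    Set ts = fso.OpenTextFile(path, ForReading)
    If Err.Number <> 0 Then
        ' The open failed, so there is nothing to read or close.
        Err.Clear
        Exit Function
    End If

    If ts.AtEndOfStream Then
        ' An empty file has nothing to read, so report zero lines.
        TryCountLines = 0
    Else
        ts.ReadAll
        TryCountLines = ts.Line
    End If
    ts.Close
End Function

' Usage (hypothetical script name): cscript tryCount.vbs c:\data\text.txt
If WScript.Arguments.Count > 0 Then
    Set fso = CreateObject("Scripting.FileSystemObject")
    WScript.Echo TryCountLines(fso, WScript.Arguments(0))
End If
```

Inside the For Each loop you could then write the -1 to the report, or skip the file, instead of trusting objTextFile.Line blindly.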

This is the gist of it. The script accesses a folder, "c:\data", iterates through all the files in it and writes each file's LOC to "c:\FileList.txt". But we want to make this into a real tool, right? So what's wrong? Well, the script needs to be edited every time you want to change the folder that contains the files or the name of the report. So let's parametrize those:

countLOC.vbs (5)
  1. On Error Resume Next
  2. Const ForReading = 1

  3. ' Get first parameter as the foldername
  4. sFolder = WScript.Arguments.Item(0)
  5. ' Second parameter is the report name
  6. sReport = WScript.Arguments.Item(1)

  7. If sFolder = "" Or sReport = "" Then
  8. Wscript.Echo "Invalid Syntax. Expected: countLOC.vbs folderName reportName"
  9. Wscript.Quit
  10. End If

  11. ' Create a file system object that will help us access files
  12. Set fso = CreateObject("Scripting.FileSystemObject")

  13. ' Open an output file to write results. Existing files will be overwritten.
  14. Set reportFile = fso.CreateTextFile(sReport, True)
  15. ' Get the file listing for the given folder
  16. Set folder = fso.GetFolder(sFolder)

  17. ' Iterate through the files
  18. For Each fileIdx In folder.Files
  19. ' Open the file using the name from fileIdx (file index)
  20. Set objTextFile = fso.OpenTextFile(sFolder & "\" & fileIdx.Name, ForReading)
  21. objTextFile.ReadAll

  22. ' Output to the report file
  23. reportFile.WriteLine(fileIdx.Name & ";" & objTextFile.Line)
  24. ' Close the file before moving on to the next one
  25. objTextFile.Close

  26. Next

  27. ' Close the report file
  28. reportFile.Close
Lines 4 and 6 load the arguments from the command line. Lines 7 through 10 check that the parameters were passed correctly. To use the code above, you will have to open the command prompt (Start → Run, "cmd") and type

c:\>cscript countLOC.vbs folder output

Where folder is the path of the folder that contains the files you want to analyze and output is the name of the report.
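The emptiness test in the script won't catch a mistyped folder name. As a rough sketch, the check could be extended with the FolderExists method of the file system object. The helper name CheckArguments and its messages are assumptions I made up for illustration:

```vbscript
' Hypothetical helper (an assumption, not from the original script): returns
' an error message, or "" when the two arguments look usable.
Function CheckArguments(fso, sFolder, sReport)
    CheckArguments = ""
    If sFolder = "" Or sReport = "" Then
        CheckArguments = "Invalid Syntax. Expected: countLOC.vbs folderName reportName"
    ElseIf Not fso.FolderExists(sFolder) Then
        CheckArguments = "Folder not found: " & sFolder
    End If
End Function

Set fso = CreateObject("Scripting.FileSystemObject")
If WScript.Arguments.Count >= 2 Then
    problem = CheckArguments(fso, WScript.Arguments(0), WScript.Arguments(1))
Else
    problem = CheckArguments(fso, "", "")
End If

' This sketch only reports the problem; a real script would stop here,
' for example with WScript.Quit.
If problem <> "" Then WScript.Echo problem
```

Checking WScript.Arguments.Count first also means the script doesn't have to lean on On Error Resume Next to survive a missing argument.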

Now, a seasoned VBScript writer can adapt the code above and, instead of using ReadAll, read the file line by line and check whether each line is empty or not, whether it's a comment or not, etc.
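For the curious, here is a minimal sketch of that line-by-line idea. It counts only the lines that are neither blank nor comments. The function name CountCodeLines and the commentMark parameter are assumptions for illustration (pass "'" for VBScript sources, "REM " for bat files, or "" to disable the comment filter):

```vbscript
Const ForReading = 1

' Hypothetical helper (an assumption, not from the original article): counts
' the lines of a file that are neither blank nor start with commentMark.
Function CountCodeLines(fso, path, commentMark)
    Dim ts, text, total, isComment
    total = 0
    Set ts = fso.OpenTextFile(path, ForReading)
    Do Until ts.AtEndOfStream
        ' Trim removes leading and trailing spaces (but not tabs).
        text = Trim(ts.ReadLine)

        isComment = False
        If Len(commentMark) > 0 Then
            ' Case-insensitive comparison of the start of the line.
            isComment = (StrComp(Left(text, Len(commentMark)), commentMark, vbTextCompare) = 0)
        End If

        If Len(text) > 0 And Not isComment Then total = total + 1
    Loop
    ts.Close
    CountCodeLines = total
End Function

' Usage (hypothetical script name): cscript countCode.vbs c:\data\script.vbs
If WScript.Arguments.Count > 0 Then
    Set fso = CreateObject("Scripting.FileSystemObject")
    WScript.Echo CountCodeLines(fso, WScript.Arguments(0), "'")
End If
```

In the report script, the call to CountCodeLines would take the place of the ReadAll and objTextFile.Line pair inside the For Each loop.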