Spell checking Java source code

If your engineering team is like mine, it’s geographically distributed. Chances are that English is not everyone’s first language either. (To be clear, if you compare my Russian, Chinese or Indian to “their” English, I’m the one that comes up short!) So, I’ve been trying to determine an automated way to spell check Java-based source code artifacts.

Recently I found a set of open source tools that, when cobbled together, get me close to an ideal result. The answer began with finding a Google Code project called bSpell, which in turn brought in the following supporting cast:

  • Ant (build automation) – version 1.7 or later for formalized library (antlib) support
  • Checkstyle (coding standard adherence analysis)
  • Cobertura (code coverage analysis)
  • fant (Ant + Maven-ishness)
  • JUnit (unit testing)
  • PMD (static rules analysis)

While Ant and JUnit are like old friends, it was nice to put “faces with names” for these other new acquaintances. For example, FxCop is a staple in my .NET-oriented development for best practice and coding standard rule checking; now I can experiment with the likes of Checkstyle and PMD to (hopefully) get similar results in Java.

To build bSpell, you first have to build fant. To build fant, you need to have Cobertura and PMD installed as Ant “extensions” (i.e. under an “extensions” sub-folder of ${user.home}/.ant). You may want to upgrade the versions of JUnit and Checkstyle packaged with fant, too. You may also have to update other dependency versions referenced in build scripts (e.g. <fant-root>\etc\ant-inc\common.xml’s reference to PMD). A successful build of fant will result in <fant-root>\build\dist\fant-0.1.jar. A successful build of bSpell will result in <bSpell-root>\build\dist\bspell-0.1.jar.

To install bSpell et al. into your Ant environment, where Cobertura and PMD are also installed under ${user.home}/.ant/extensions, you need to copy your bSpell, fant, JUnit, and Checkstyle JAR files into ${user.home}/.ant/lib.
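If you would rather script that copy step than do it by hand, something along these lines might work. It is only a sketch: the class name InstallAntLibs and the "staging" folder are my own placeholders, and it assumes you have already gathered the built JAR files into a single directory.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

public class InstallAntLibs {

    /** Copies every *.jar directly under sourceDir into libDir, creating libDir if needed. */
    public static List<Path> copyJars(Path sourceDir, Path libDir) throws IOException {
        Files.createDirectories(libDir);
        List<Path> copied = new ArrayList<>();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(sourceDir, "*.jar")) {
            for (Path jar : jars) {
                Path target = libDir.resolve(jar.getFileName());
                Files.copy(jar, target, StandardCopyOption.REPLACE_EXISTING);
                copied.add(target);
            }
        }
        return copied;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical staging folder holding the bSpell, fant, JUnit, and Checkstyle JARs.
        Path staging = Paths.get(args.length > 0 ? args[0] : "staging");
        // Per-user Ant library directory, as described above.
        Path antLib = Paths.get(System.getProperty("user.home"), ".ant", "lib");
        for (Path installed : copyJars(staging, antLib)) {
            System.out.println("Installed " + installed);
        }
    }
}
```

Only files ending in .jar are copied, and existing copies are overwritten, so rerunning it after a rebuild of fant or bSpell refreshes the installed versions.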

To incorporate bSpell into your Ant-based build process, create a new target in the appropriate XML build script. I called mine “spellcheck”:

<target name="spellcheck" description="Check DFS for proper spelling">
   <taskdef name="bspell" classname="com.google.bspell.ant.BSpellTask" />
   <bspell>
      <!-- The spellconfiguration element that names the four input files described below belongs here. -->
      <!-- Spell check the project (e.g. Java, properties, XML, XSD, WSDL, HTML and text content). -->
      <fileset dir="${basedir}" includes="**/*.*"/>
   </bspell>
</target>

The first thing to note is that the example bspell task call on the bSpell home page references an incorrect element name: it should read spellconfiguration as above, not configuration.

Next, note that there are four file-based inputs to the bspell task as follows:

  • project-bspell.config – I simply copied <bspell-root>\etc\spellCheck.config, renamed it within my project, and left its contents untouched. There are a number of settings in it; however, there isn’t any documentation to speak of, so you have to scour the bSpell source code to understand valid ranges and the effects of change.
  • english.jar – I simply copied this JAR file from <bspell-root>\etc into my project area as-is to improve project tool source control. Presumably there are non-English dictionaries that can be applied, too; however, I didn’t go looking for them since my needs are English-based.
  • project-bspell.reserved – bSpell will use this file to instruct its spell checker to ignore certain words. The baseline file I used, %ANT_HOME%\etc\fant\bspell\reserved.dict, contained two sets of words: one associated with “general” and another associated with “java.” I ended up leaving the Java word list as-is and simply added words to the general list. Given that I didn’t modify my bSpell configuration file (e.g. IGNORE_MIXED_CASE=true and IGNORE_UPPER_CASE=false), I had to specify both all-lowercase and all-uppercase values for acronyms to be ignored. There is no documentation on, for example, the effect of adding a word to the general list versus the Java list, or on whether additional, named word lists can be created, so it may be worth scouring the bSpell sources for details.
  • project-bspell.registry – bSpell will use the file’s extension to detect which parser should be used. In the case of .java, it will load the JavaParser to parse the Java source code (i.e. com.google.bspell.parsers.JavaParser.class within bspell-0.1.jar). Out of the box, bSpell provides one other parser: TxtParser. You associate parsers with file extensions in the registry file. I associated com.google.bspell.parsers.TxtParser with txt, properties, xml, xsd, wsdl, and html. If you specify a file type in your bspell task file sets that isn’t listed in your registry or recognized by default in bSpell (i.e. .java), bSpell throws a java.lang.RuntimeException.
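
To make the registry behavior described above concrete, here is a small sketch of an extension-to-parser lookup. To be clear, this is not bSpell’s actual code: the class ParserRegistrySketch and its methods are hypothetical, and the lowercasing of extensions is my own assumption. Only the two parser class names and the RuntimeException on an unknown extension come from what I observed.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class ParserRegistrySketch {

    // Extension -> parser class name.
    private final Map<String, String> parsersByExtension = new HashMap<>();

    public ParserRegistrySketch() {
        // .java is recognized by default, without a registry entry.
        parsersByExtension.put("java", "com.google.bspell.parsers.JavaParser");
    }

    /** Associates one parser class with any number of file extensions. */
    public void register(String parserClassName, String... extensions) {
        for (String ext : extensions) {
            // Assumption: extensions are matched case-insensitively.
            parsersByExtension.put(ext.toLowerCase(Locale.ROOT), parserClassName);
        }
    }

    /** Returns the parser class name for a file, or fails if its extension is unknown. */
    public String parserFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        String parser = parsersByExtension.get(ext);
        if (parser == null) {
            throw new RuntimeException("No parser registered for file: " + fileName);
        }
        return parser;
    }

    public static void main(String[] args) {
        ParserRegistrySketch registry = new ParserRegistrySketch();
        registry.register("com.google.bspell.parsers.TxtParser",
                "txt", "properties", "xml", "xsd", "wsdl", "html");
        System.out.println(registry.parserFor("Foo.java"));
        System.out.println(registry.parserFor("build.xml"));
    }
}
```

The practical takeaway is the failure mode: a broad file set such as **/*.* will pick up images and other binaries, so either register every extension you include or narrow the file set.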

A run of “ant spellcheck” from the command line produced a spell check analysis of over 600 files in under two minutes on an average workstation. So, I can now spell check my Java projects in the time it takes me to grab a fresh soda from the office refrigerator. Nice!

So, what’s missing?

  • Well, for starters, bSpell (and fant) are only at version 0.1 (as numerically determined by their committers). Nevertheless, it would be useful to have a ready-to-deploy package for bSpell rather than have to go through a reasonably involved build-then-deploy process, as noted above.
  • bSpell needs documentation–not just Javadoc, but user guide style content (e.g. explaining bSpell configuration file values and how they alter tool behavior). There is a bSpell wiki, but it is empty at the time of this post.
  • bSpell needs more built-in parsers. While associating com.google.bspell.parsers.TxtParser with txt and properties works, associating TxtParser with xml, xsd, wsdl, and html leads to extra work in reserved dictionary definition. That’s because TxtParser isn’t designed to ignore language keywords and other common-but-not-true-English constructs in such document types.
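
To illustrate the last point, here is a rough sketch of the kind of pre-processing an XML-aware parser might do: hand the spell checker only the character data and skip element and attribute names entirely. This is not part of bSpell; XmlTextExtractor is a hypothetical name, it uses the JDK’s standard SAX parser, and how such a class would plug into bSpell’s parser interface is something I have not investigated.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class XmlTextExtractor {

    /** Returns the words found in character data, skipping element and attribute names. */
    public static List<String> extractWords(String xml) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times for one text node, so no separator here.
                text.append(ch, start, length);
            }

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                text.append(' '); // element boundaries separate words
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                text.append(' ');
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);

        List<String> words = new ArrayList<>();
        for (String token : text.toString().split("[^A-Za-z']+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) throws Exception {
        // "adress" is deliberately misspelled; "xsd", "annotation", etc. never reach the checker.
        String xml = "<xsd:annotation xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\">"
                + "<xsd:documentation>Customer billing adress</xsd:documentation>"
                + "</xsd:annotation>";
        System.out.println(extractWords(xml));
    }
}
```

With something like this in front of the checker, tokens such as xsd and xmlns would no longer need entries in the reserved dictionary.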

Thanks to James Mao & Co. for bSpell. (I see that James and Glen are both committers on Apache CXF, too.) Keep up the good work, guys!

Update 10/8/2007: A colleague of mine kindly reminded me that modifying the actual Ant distribution is the “old” way of doing things, and was replaced by the ${user.home}/.ant/lib directory several Ant releases ago. So, I have corrected my post above to reflect a more appropriate way to achieve the same results. Thanks, Martin.