Automation components in UsefulChem
Initial work with Excel / Excel VBA:
Molecule
entries in http://usefulchem-molecules.blogspot.com are characterized
primarily by a UC number (e.g., UC0188), a SMILES notation, and an
image, although other information, such as CAS number, is often
added. To summarize and expand on this data in a convenient format,
a program in Microsoft Excel Visual Basic for Applications (VBA)
(http://showme.physics.drexel.edu/usefulchem/Software/MoleculeBlogInfo/MoleculeBlogInfo.zip)
was developed which downloads this page, parses out the desired
information, and generates a spreadsheet
(http://showme.physics.drexel.edu/usefulchem/Software/MoleculeBlogInfo/usefulchem-molecules/usefulchem-molecules.xls)
in which each row represents one blog entry. Given that the blog
format itself is rather loose – for example, the SMILES entry might
be prefixed by “SMILES” or “SMILES:” – and can change over
time, the search criteria for fields were made fully configurable by
placing them in an initialization (.ini) file.
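To illustrate the idea of configurable search criteria, here is a minimal Java sketch (not the original VBA code). It assumes the configuration is a set of simple key=value lines readable by java.util.Properties. The key name "smiles.pattern", the regular expression, and the class name FieldExtractor are hypothetical and are not taken from the actual MoleculeBlogInfo .ini file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: key names and patterns are illustrative assumptions.
class FieldExtractor {
    // Hypothetical configuration text. Backslashes are doubled twice:
    // once for the Java string literal, once because Properties treats
    // '\' as an escape character.
    static final String SAMPLE_INI = "smiles.pattern = SMILES:?\\\\s*(\\\\S+)\n";

    private final Properties config = new Properties();

    FieldExtractor(String iniText) {
        try {
            config.load(new StringReader(iniText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the first capture group of the configured pattern, or null. */
    String extract(String fieldKey, String blogText) {
        String regex = config.getProperty(fieldKey);
        if (regex == null) {
            return null; // no search criterion configured for this field
        }
        Matcher matcher = Pattern.compile(regex).matcher(blogText);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        FieldExtractor extractor = new FieldExtractor(SAMPLE_INI);
        // Both spellings of the prefix are matched by the same pattern.
        // CCO is an arbitrary example SMILES, not a real blog entry.
        System.out.println(extractor.extract("smiles.pattern", "Example entry SMILES: CCO")); // prints CCO
        System.out.println(extractor.extract("smiles.pattern", "Example entry SMILES CCO"));  // prints CCO
    }
}
```

Because the pattern lives in the configuration text rather than in the code, a change in the blog's formatting would only require editing that text.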
Additional
information beyond that provided by the blog, such as links to
suppliers, was desired, and for this purpose several different
freely available software packages and libraries were used.
Molecular weight information and molecular format files (CML, MOL)
were generated from the SMILES using the CDK Java libraries, while
InChI descriptors were produced by OpenBabel. Image files were at
first generated using ChemSketch, although these are now simply
downloaded directly from the blog itself. Supplier information was
acquired by sending HTTP GET requests to chmoogle.com (now
eMolecules.com), and processing the responses gleaned from this
service.
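The supplier lookup described above amounts to building a query URL from a SMILES string and issuing an HTTP GET. The sketch below shows one way this might look in Java. The endpoint and the "smiles" parameter name are placeholders, since the actual query format used for chmoogle.com is not given here, and the class name SupplierQuery is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class SupplierQuery {
    // Placeholder endpoint: NOT the real chmoogle.com / eMolecules.com URL.
    static final String BASE_URL = "http://example.org/search";

    /** Builds the GET URL; SMILES characters such as '(' and '=' must be encoded. */
    static String buildUrl(String smiles) {
        try {
            return BASE_URL + "?smiles=" + URLEncoder.encode(smiles, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /** Sends the GET request and returns the response body as text. */
    static String fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
            return body.toString();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        // Only the URL is built here; no network request is made.
        System.out.println(buildUrl("CC(=O)O"));
    }
}
```

The response would then be parsed for supplier links in whatever format the service returns.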
In
addition to the spreadsheet, this software also creates HTML and CML
files (e.g.,
http://showme.physics.drexel.edu/usefulchem/Software/MoleculeBlogInfo/usefulchem-molecules/UC0088.htm)
for each blog entry, which in combination allow the molecules in the
blog to be viewed with the Jmol applet.
RSS feeds and Automation Software in Java:
The
spreadsheet format for the usefulchem-molecules blog was a useful
beginning. It was, however, not very amenable to automated data
processing or other kinds of display desired, particularly for the
internet/web. An initial attempt to address these deficiencies
involved modifying the Excel VBA software to generate an RSS 1.0 feed
(http://showme.physics.drexel.edu/usefulchem/Software/MoleculeBlogInfo/usefulchem-molecules/usefulchem-molecules.rss)
of the blog data in addition to its other output. The advantage to
having the data in a feed is that it can then be viewed using any number
of available desktop or web-based readers, such as RSS Bandit
(http://www.rssbandit.org) or Bloglines (http://www.bloglines.com).
Furthermore, as RSS is simply XML, feeds can contain other XML
formatted data, such as Chemical Markup Language (CML). Thus, a feed
can be downloaded and parsed for its CML by software such as
Bioclipse (http://www.bioclipse.net) or Jmol
(http://jmol.sourceforge.net).
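As a rough illustration of how software could pull CML out of such a feed, the following Java sketch parses a small RSS 1.0 document with a namespace-aware DOM parser and collects the id attribute of every CML molecule element. The sample feed is invented for this example and is not the layout of the actual usefulchem-molecules feed. The class name CmlFeedParser is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class CmlFeedParser {
    static final String CML_NS = "http://www.xml-cml.org/schema";

    // Invented sample feed; the real feed's structure may differ.
    static final String SAMPLE_FEED =
          "<?xml version=\"1.0\"?>\n"
        + "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\"\n"
        + "         xmlns=\"http://purl.org/rss/1.0/\"\n"
        + "         xmlns:cml=\"http://www.xml-cml.org/schema\">\n"
        + "  <item rdf:about=\"http://example.org/UC0088\">\n"
        + "    <title>UC0088</title>\n"
        + "    <cml:molecule id=\"UC0088\"/>\n"
        + "  </item>\n"
        + "  <item rdf:about=\"http://example.org/UC0188\">\n"
        + "    <title>UC0188</title>\n"
        + "    <cml:molecule id=\"UC0188\"/>\n"
        + "  </item>\n"
        + "</rdf:RDF>\n";

    /** Returns the id of every CML molecule element, in document order. */
    static List<String> moleculeIds(String feedXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed to match elements by namespace
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(feedXml.getBytes(StandardCharsets.UTF_8)));
            NodeList molecules = doc.getElementsByTagNameNS(CML_NS, "molecule");
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < molecules.getLength(); i++) {
                ids.add(((Element) molecules.item(i)).getAttribute("id"));
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse feed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(moleculeIds(SAMPLE_FEED));
    }
}
```

Matching on the namespace URI rather than the "cml:" prefix keeps the code working even if a feed binds the CML namespace to a different prefix.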
A
shortcoming of using Excel VBA is that it does not easily lend itself
to automation. Also, it is neither truly an open source development
platform nor portable to other operating systems such as Unix or
Macintosh. Therefore, to address these shortcomings, I rewrote the
VBA code in the Java programming language, which is both free (see
http://java.sun.com/javase/downloads/index.jsp
to download the Java Development Kit) and is implemented on all major
operating systems. Once in Java, it was straightforward to set the
software up as a service to be run periodically. As a result, the
RSS feed and associated files are now regenerated automatically
whenever additions or changes are made to the usefulchem-molecules blog.
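One way such a periodic service could be arranged in Java is sketched below: the page is fetched on a fixed schedule, and the feed is regenerated only when a hash of the page content differs from the previous run. This is an assumption about how change detection might be done, not a description of the actual software. The class name BlogWatcher and its methods are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

class BlogWatcher {
    private byte[] lastDigest; // null until the first page has been seen

    /** True if the content differs from the previous call (always true on the first call). */
    synchronized boolean hasChanged(String pageContent) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(pageContent.getBytes(StandardCharsets.UTF_8));
            boolean changed = !Arrays.equals(digest, lastDigest);
            lastDigest = digest;
            return changed;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    /** Polls fetchPage every periodMinutes and runs regenerate when the page changes. */
    ScheduledExecutorService start(Supplier<String> fetchPage, Runnable regenerate,
                                   long periodMinutes) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                if (hasChanged(fetchPage.get())) {
                    regenerate.run();
                }
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all later runs, so report and continue.
                System.err.println("Update check failed: " + e);
            }
        }, 0, periodMinutes, TimeUnit.MINUTES);
        return scheduler;
    }

    public static void main(String[] args) {
        BlogWatcher watcher = new BlogWatcher();
        System.out.println(watcher.hasChanged("version 1")); // prints true
        System.out.println(watcher.hasChanged("version 1")); // prints false
        System.out.println(watcher.hasChanged("version 2")); // prints true
    }
}
```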
A zip
file containing both the source and compiled code for the Java
software to convert the usefulchem-molecules blog to an RSS feed can
be found at
http://showme.physics.drexel.edu/usefulchem/Software/Java/MoleculeBlogInfo/MoleculeBlogInfo.zip.
CMLRSSReader:
Having
an RSS feed with special fields provides a launching platform of
essentially unlimited opportunities for further treatment of chemical
information. Standard RSS readers, however, rarely display much
more than a few standard fields in a feed. Furthermore, they are
not extendable or configurable to include additional processing via
plug-ins or “hook” programs on a feed, its entries, or the
various specialized fields it can contain. Thus, a specialized
reader seemed necessary.
Writing
a simple feed reader is actually not a particularly difficult
software project, and there is a lot of help available in books and
web sites (I used “RSS and Atom Programming” from Wrox books
(Wrox.com) as a guide for all my RSS programming). I have developed
such a reader, again using Java, which begins to address some of our
specialized requirements for feeds containing CML and other chemical
information. This reader and associated software, which can be
downloaded from
http://showme.physics.drexel.edu/usefulchem/Software/Java/CMLRSSReader/CMLRSSReader.zip,
is still at an early stage in development and can currently handle
only RSS 1.0 feeds (and so far has only been tested on the
usefulchem-molecules and two other closely related feeds), but
demonstrates some of what can be done along the lines described above.
In addition to the standard reader features of automatically
downloading and managing multiple feeds, displaying information
contained in their item entries, and tracking new or changed items,
the software also allows specialized programs to be executed on the
feeds themselves and their contents. In its current form, programs
can be configured to run after feed file download and/or processing.
These programs can be written in any language, even DOS BAT files
(although Java must be used on processed feeds, as they are stored
via Java serialization), and can perform any processing/reporting
desired, such as calculations using the CML in the feed, internet
searches, database entry, and/or e-mailing results to the interested
parties.
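The Java-only restriction on processed feeds follows from how Java serialization works: the stored bytes can only be read back by Java code that has the same classes available. The sketch below shows a round trip for a made-up item class. FeedItem, ProcessedFeedStore, and their fields are hypothetical and do not reflect the classes actually used in CMLRSSReader.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical item class; a hook program needs this same class to read the data.
class FeedItem implements Serializable {
    private static final long serialVersionUID = 1L;
    final String id;
    final String smiles;

    FeedItem(String id, String smiles) {
        this.id = id;
        this.smiles = smiles;
    }
}

class ProcessedFeedStore {
    /** Serializes the items to a byte array (a file stream would work the same way). */
    static byte[] save(List<FeedItem> items) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new ArrayList<>(items));
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the items back, as a Java hook program would. */
    @SuppressWarnings("unchecked")
    static List<FeedItem> load(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (List<FeedItem>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("FeedItem class not on the classpath", e);
        }
    }

    public static void main(String[] args) {
        List<FeedItem> items = new ArrayList<>();
        items.add(new FeedItem("item-1", "CCO")); // arbitrary example values
        List<FeedItem> restored = load(save(items));
        System.out.println(restored.get(0).id + " " + restored.get(0).smiles);
    }
}
```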
Two
examples of this capability are already being used to automatically
generate and upload information for display on the web. One,
ExtractHTMLPages, is a Java program that parses the
usefulchem-molecules feed file for its item
fields and generates an HTML file for each item. ExtractHTMLPages
also generates an index file
(http://showme.physics.drexel.edu/usefulchem/Software/MoleculeBlogInfo/usefulchem-molecules/Items/UsefulChemistryMolecules.html)
of the item HTML files which, using a combination of JavaScript and
HTML iframes, allows any of them to be selected for viewing from a
drop-down list. When CMLRSSReader downloads a feed, which it does
whenever the feed has been updated (which in the case of
usefulchem-molecules, occurs whenever the blog
is updated), it automatically runs ExtractHTMLPages, generating and
uploading all of these files to the web server.
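The drop-down-plus-iframe arrangement can be sketched as a small Java routine that writes the index page as a string. The file naming (item id plus ".html"), the element id "viewer", and the class name IndexPageBuilder are assumptions made for this example. The markup produced by the real ExtractHTMLPages may differ.

```java
import java.util.Arrays;
import java.util.List;

class IndexPageBuilder {
    /** Builds an index page whose drop-down list loads the selected item page into an iframe. */
    static String build(List<String> itemIds) {
        StringBuilder html = new StringBuilder();
        html.append("<html><body>\n");
        // When the selection changes, JavaScript points the iframe at the chosen file.
        html.append("<select onchange=\"document.getElementById('viewer').src = this.value;\">\n");
        for (String id : itemIds) {
            html.append("  <option value=\"").append(id).append(".html\">")
                .append(id).append("</option>\n");
        }
        html.append("</select>\n");
        // Show the first item initially, or a blank frame if there are no items.
        String first = itemIds.isEmpty() ? "about:blank" : itemIds.get(0) + ".html";
        html.append("<iframe id=\"viewer\" src=\"").append(first)
            .append("\" width=\"100%\" height=\"600\"></iframe>\n");
        html.append("</body></html>\n");
        return html.toString();
    }

    public static void main(String[] args) {
        System.out.print(build(Arrays.asList("UC0088", "UC0188")));
    }
}
```

The ids are written into the page unescaped, which is adequate for simple identifiers such as UC numbers but would need HTML escaping for arbitrary text.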
The
other example, ExtractNewItems, is a Java program which works with
processed feeds to record and detail changes to the feed. When new
items are added to the usefulchem-molecules feed, or new information
about an item is added or modified, ExtractNewItems generates and
uploads two files: newItems.html
(http://showme.physics.drexel.edu/usefulchem/Software/MoleculeBlogInfo/usefulchem-molecules/newItems.html)
and newItems.xls. True to their names, these files list items that
have been added or updated since the last time the program was run.
Ultimately, the reason for a new listing will also be given, such as
new supplier information, but this is not currently implemented.
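Detecting new or changed items comes down to comparing the current state of the feed with the state recorded on the previous run. A minimal sketch, assuming each item is reduced to an id and a content string (both hypothetical representations, as is the class name ItemDiff), might look like this:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ItemDiff {
    /**
     * Returns the ids (sorted) of items that are absent from the previous
     * snapshot or whose content differs from it.
     */
    static List<String> newOrChanged(Map<String, String> previous,
                                     Map<String, String> current) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> entry : current.entrySet()) {
            String oldContent = previous.get(entry.getKey());
            if (oldContent == null || !oldContent.equals(entry.getValue())) {
                result.add(entry.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> previous = new HashMap<>();
        previous.put("A", "old text");
        previous.put("B", "old text");

        Map<String, String> current = new HashMap<>();
        current.put("A", "old text");   // unchanged
        current.put("B", "new text");   // changed
        current.put("C", "first text"); // new

        System.out.println(newOrChanged(previous, current)); // prints [B, C]
    }
}
```

Reporting the reason for each listing would require comparing individual fields rather than a single content string.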
Future Directions:
Quite
a bit of ground has been covered, and a lot of evolution has occurred,
since the initial work with Excel VBA. A certain amount of
consolidation and strategic consideration would seem to be worthwhile
at this point. To begin, the numerous web sites and pages generated
would benefit from some organization. This can be done with a single
page, or small set of pages, providing links to and descriptions of
the various software tools and the pages they generate.
Second,
although I have tried to make the CML RSS reader software highly
flexible, it needs to be tested for compatibility with other RSS 1.0
feeds containing CML if it is to become of general use to the
scientific community. Additional development is almost certainly
going to be needed here (no one should expect to be that
lucky!). I am also eager to see how the reader might interact with
other software, such as Bioclipse, for example in providing CML and
other data in automated fashion. This should prove fruitful, as
Bioclipse obviously provides so much more in the way of processing
and visualization tools than the reader itself. Other enhancements
include a replacement for Java’s JEditorPane for displaying item
data (JEditorPane’s handling of HTML is fairly primitive), other
improvements to the user interface, and more configurable program
extensions and/or plug-ins.
Finally,
a lot of technologies have yet to be explored in this area. One
excellent candidate is the combination of Ajax in HTML pages with
chemical information web services. Ajax provides the ability to
dynamically query web sites and services without the overhead in time
and resources of retransmitting/reloading entire pages. In
conjunction with JavaScript events and dynamic HTML, this can
essentially turn an ordinary browser into a full-featured software
user interface. Ajax also appears quite easy to use. For some
simple examples of what can be done with Ajax, see
http://showme.physics.drexel.edu/usefulchem/Software/Ajax/UsefulChemistryMolecules/UsefulChemistryMolecules.htm
and
http://showme.physics.drexel.edu/usefulchem/Software/Ajax/UsefulChemistryMolecules/UsefulChemistryMolecules2.htm
(simply hover over any of the UC numbers).
Also,
I have just begun to learn about OpenOffice, and hope to convert the
Excel applications to it.