I am mid-way through a project which will query various web sites, process the data and dump any new items found into a database. The data will be presented in a grid with a check box on each row. The user can then run through the rows unchecking any which are not of interest. The unchecked rows aren’t physically deleted but will not be presented to the user.
This is a tactical application and works fine with the sites I targeted. I’ll be thinking next about how to make it generic so it can be easily adapted to new sites and changes in site layout. Rather than present immature code, for now I’m setting out below some general notes which may be of use for doing the analysis for this kind of application.
Querying and Presenting the Data
The goal will be to use “Linq to Xml” to query the data for presentation. By using Linq instead of XPath we ensure that only one data access technology will ever be needed in the application.
This means that in some cases the data will need to be converted to XML, and then to an “XDocument” (the latter conversion is very simple).
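A minimal sketch of that conversion followed by a query, assuming the data arrives as an XML string. The class name, the method name and the "item", "title" and "new" names are hypothetical and only there for illustration:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class ItemQueries
{
    // Parses an XML string into an XDocument and returns the titles of all
    // <item> elements whose "new" attribute is "true".
    // The element and attribute names are assumptions, not a real schema.
    public static List<string> TitlesOfNewItems(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        return doc.Descendants("item")
                  .Where(i => (string)i.Attribute("new") == "true")
                  .Select(i => (string)i.Element("title"))
                  .ToList();
    }
}
```

Once everything is an XDocument, the same style of query applies whatever the original source was.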
Ways to the Data
Web Services – Standard documentation will be a good guide.
HTTP GET – No issues here, it’s easy to add your query filters to the url.
HTTP POST – You will need to build up a string which can be posted to the site. Key issues are handling the ViewState and sending back the hidden controls. It will be useful to have some means of comparing your results with what the browser sends. I used a network sniffer to capture my POSTs.
I decided to use the .NET libraries directly rather than the browser control, which will inherit all the settings from the browser and will presumably execute any scripts that are returned. I don’t think either feature is desirable, as each means a loss of control.
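The GET and POST cases above might be sketched roughly as follows. This assumes the page containing the form has already been converted to an XDocument with lower-case element names. RequestBuilder and its method names are hypothetical, and the hidden inputs it sends back would include __VIEWSTATE on an ASP.NET site:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class RequestBuilder
{
    // Encodes name/value pairs as name=value&name=value, escaping both parts.
    public static string Encode(IEnumerable<KeyValuePair<string, string>> fields)
    {
        return string.Join("&", fields.Select(f =>
            Uri.EscapeDataString(f.Key) + "=" + Uri.EscapeDataString(f.Value)));
    }

    // HTTP GET: append the query filters to the address as a query string.
    public static string GetUrl(string baseUrl,
                                IEnumerable<KeyValuePair<string, string>> filters)
    {
        string query = Encode(filters);
        return query.Length == 0 ? baseUrl : baseUrl + "?" + query;
    }

    // HTTP POST: send back every hidden input found in the page,
    // followed by the caller's own fields.
    public static string PostBody(XDocument page,
                                  IEnumerable<KeyValuePair<string, string>> ownFields)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        foreach (XElement input in page.Descendants()
                                       .Where(e => e.Name.LocalName == "input"))
        {
            string type = (string)input.Attribute("type");
            string name = (string)input.Attribute("name");
            if (name != null &&
                string.Equals(type, "hidden", StringComparison.OrdinalIgnoreCase))
            {
                string value = (string)input.Attribute("value") ?? "";
                pairs.Add(new KeyValuePair<string, string>(name, value));
            }
        }
        pairs.AddRange(ownFields);
        return Encode(pairs);
    }
}
```

The resulting address or body can then be handed to something like WebClient, with the Content-Type header set to application/x-www-form-urlencoded for the POST. Comparing the body against a captured browser POST is still the best check that nothing has been missed.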
The Raw Data
Web Service – These return a valid XML document with a self-describing structure, so there’s not much more to say. Just convert it to an XDocument and make it available through Linq.
RSS Feeds – These return a valid XML document, so there is no need for tidying up; all you need to do is convert it to an XDocument for Linq. The tags don’t carry any meaning though, so you’ll still need to do some string processing.
Atom – I haven’t dealt with this as I haven’t had an Atom data source to work with, but it’s based on XML and can probably be handled the same way as RSS.
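For the RSS case, a rough sketch of pulling each entry out of the feed, assuming a plain RSS 2.0 document in which the item, title and link elements carry no namespace. RssItems and TitlesAndLinks are hypothetical names:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class RssItems
{
    // Returns a (title, link) pair for each <item> element in the feed.
    // Missing elements come back as empty strings.
    public static List<KeyValuePair<string, string>> TitlesAndLinks(string rssXml)
    {
        XDocument feed = XDocument.Parse(rssXml);
        return feed.Descendants("item")
                   .Select(i => new KeyValuePair<string, string>(
                       (string)i.Element("title") ?? "",
                       (string)i.Element("link") ?? ""))
                   .ToList();
    }
}
```

Whatever sits inside each item's description would still need the string processing mentioned above.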
Web Site – I originally thought I’d find a .NET library which would build an HTML document from the raw HTML returned so that I could walk through the collections. It doesn’t seem to be that simple. I experimented with one library, but the C# version appears to have significant bugs.
So it seemed easiest to convert directly to an XML document, since I’ll need that for my Linq queries. SgmlReader does this well and the DLL version will be easy to integrate into your project. There’s no meaning in the structure of this XML document, so you’ll still have to do some string processing. By the way, the conversion from an XmlDocument to an XDocument is very simple:
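using System.Xml;       // XmlDocument, XmlNodeReader
using System.Xml.Linq;  // XDocument

// Reads the XmlDocument through a node reader and loads the result into an XDocument.
private XDocument XmlDocumentToXDocument(XmlDocument doc)
{
    return XDocument.Load(new XmlNodeReader(doc));
}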
There are two types of work to be done, and a site may need either or both:
Walking through any HREF links and querying them.
Parsing the data returned in markup elements and attributes.
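The first of these, collecting the links to follow, could look something like this. It is a sketch that assumes the page is already an XDocument with lower-case element names. LinkWalker is a hypothetical name, and keeping only http and https targets is my own choice, made to skip mailto and script links:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class LinkWalker
{
    // Returns the absolute target of every <a href="..."> in the page,
    // resolving relative links against the address the page came from.
    public static List<Uri> Links(XDocument page, Uri pageAddress)
    {
        var links = new List<Uri>();
        foreach (XElement a in page.Descendants()
                                   .Where(e => e.Name.LocalName == "a"))
        {
            string href = (string)a.Attribute("href");
            Uri target;
            if (!string.IsNullOrEmpty(href) &&
                Uri.TryCreate(pageAddress, href, out target) &&
                (target.Scheme == Uri.UriSchemeHttp ||
                 target.Scheme == Uri.UriSchemeHttps))
            {
                links.Add(target);
            }
        }
        // The same target often appears more than once on a page.
        return links.Distinct().ToList();
    }
}
```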
The tags will just contain HTML keywords, so the basic approach is to find something which identifies a piece of data (on the site I’m working with it’s words like “Name”, “Telephone” etc., followed by a colon), and try to extract the data from whatever follows it.
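As a rough illustration of that label-and-colon approach, the sketch below looks through the text nodes of the page for each label. It assumes the label starts its text node and that the value either follows the colon or sits in the next non-empty text node. LabelledData and Extract are hypothetical names:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class LabelledData
{
    // For each label, finds the first text node starting with "Label:" and
    // takes whatever follows the colon. If nothing follows it in that node,
    // the next non-empty text node is used instead.
    public static Dictionary<string, string> Extract(XDocument page,
                                                     IEnumerable<string> labels)
    {
        List<string> texts = page.DescendantNodes().OfType<XText>()
                                 .Select(t => t.Value.Trim())
                                 .Where(s => s.Length > 0)
                                 .ToList();
        var found = new Dictionary<string, string>();
        foreach (string label in labels)
        {
            string prefix = label + ":";
            for (int i = 0; i < texts.Count; i++)
            {
                if (!texts[i].StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
                    continue;
                string value = texts[i].Substring(prefix.Length).Trim();
                if (value.Length == 0 && i + 1 < texts.Count)
                    value = texts[i + 1];
                found[label] = value;
                break; // only the first occurrence of each label is used
            }
        }
        return found;
    }
}
```

Labels that never appear are simply absent from the result, which makes it easy to see what is still missing.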
There are several issues to deal with. If the data is contained in an HTML table, there may be a middle cell where most of the text is found and which has no label. I am thinking of writing new code to process an HTML table as a unit. It should then be possible to match the data to “labels”. Provided there are not many unlabelled cells left over, they can be tested to make sure that they don’t contain certain types of HTML markup and then mapped to any expected data which is still missing.
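One possible shape for that table code is sketched below. It is not the finished design. It assumes a label is a cell whose text ends in a colon and that its value sits in the next cell of the same row. The type names and the list of disqualifying markup are hypothetical, and rows of nested tables are read along with the outer table:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public sealed class TableData
{
    public Dictionary<string, string> Labelled { get; } =
        new Dictionary<string, string>();
    public List<string> Unlabelled { get; } = new List<string>();
}

public static class TableReader
{
    // Assumed list of markup that disqualifies an unlabelled cell.
    private static readonly string[] Excluded = { "a", "img", "script", "table" };

    public static TableData Read(XElement table)
    {
        var result = new TableData();
        foreach (XElement row in table.Descendants()
                                      .Where(e => e.Name.LocalName == "tr"))
        {
            List<XElement> cells = row.Elements()
                .Where(e => e.Name.LocalName == "td" || e.Name.LocalName == "th")
                .ToList();
            for (int i = 0; i < cells.Count; i++)
            {
                string text = cells[i].Value.Trim();
                if (text.EndsWith(":", StringComparison.Ordinal) && i + 1 < cells.Count)
                {
                    // A label cell: pair it with the cell to its right.
                    result.Labelled[text.TrimEnd(':').Trim()] =
                        cells[i + 1].Value.Trim();
                    i++; // the value cell has been consumed
                }
                else if (text.Length > 0 &&
                         !cells[i].Descendants()
                                  .Any(d => Excluded.Contains(d.Name.LocalName)))
                {
                    // Plain text with no label: keep it for later matching.
                    result.Unlabelled.Add(text);
                }
            }
        }
        return result;
    }
}
```

The unlabelled list is what would then be matched against any expected data that is still missing.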
Some data may also be found in “drilldowns” to other pages or in a JavaScript window which pops up on the same page.
All the above problems might be dealt with by segmenting the page. If tables, HREFs and popup windows can all be separated first and treated individually it will probably be a cleaner way to process the page.
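A first attempt at that segmentation might look like the sketch below, which works on a copy of the page and lifts out the outermost tables and then the remaining links, leaving everything else behind. The type names are hypothetical, element names are assumed to be lower case, and popup windows are not handled at all:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public sealed class PageSegments
{
    public List<XElement> Tables { get; } = new List<XElement>();
    public List<XElement> Links { get; } = new List<XElement>();
    public XDocument Remainder { get; set; }
}

public static class PageSegmenter
{
    public static PageSegments Split(XDocument page)
    {
        // Work on a deep copy so the caller's document is left untouched.
        var copy = new XDocument(page);
        var segments = new PageSegments { Remainder = copy };

        // Outermost tables only; nested tables travel with their parent.
        List<XElement> tables = copy.Descendants()
            .Where(e => e.Name.LocalName == "table" &&
                        !e.Ancestors().Any(a => a.Name.LocalName == "table"))
            .ToList();
        foreach (XElement table in tables)
        {
            table.Remove();
            segments.Tables.Add(table);
        }

        // Links that were not inside a table.
        List<XElement> links = copy.Descendants()
            .Where(e => e.Name.LocalName == "a" && e.Attribute("href") != null)
            .ToList();
        foreach (XElement link in links)
        {
            link.Remove();
            segments.Links.Add(link);
        }
        return segments;
    }
}
```

Each segment can then be passed to whichever piece of code understands it, with the remainder left for the plain label-and-colon search.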