Retrieving Web Data Programmatically in .NET 3.5

December 21, 2009

I am mid-way through a project which will query various web sites, process the data and dump any new items found into a database. The data will be presented in a grid with a check box on each row. The user can then run through the rows unchecking any which are not of interest. The unchecked rows aren’t physically deleted but will not be presented to the user.

This is a tactical application and works fine with the sites I targeted. I’ll be thinking next about how to make it generic so it can be easily adapted to new sites and changes in site layout. Rather than present immature code, for now I’m setting out below some general notes which may be of use for doing the analysis for this kind of application.

Querying and Presenting the Data

The goal will be to use “Linq to Xml” to query the data for presentation. By using Linq instead of XPath we ensure that only one data access technology will ever be needed in the application.

This means that in some cases the data will need to be converted to XML, and then to an “XDocument” (the latter conversion is very simple).

Ways to the Data

Web Services – Standard documentation will be a good guide.

HTTP GET – No issues here, it’s easy to add your query filters to the url.

HTTP POST – You will need to build up a string which can be posted to the site. Key issues are handling the ViewState and sending back the hidden controls. It will be useful to have some means of comparing your results with what the browser sends. I used a network sniffer to capture my POSTs.

I decided to use the .NET libraries directly rather than the browser control which will inherit all the setting from the browser, and will presumably execute the scripts that are returned. I don’t think either feature is desirable as there is a loss of control.

The Raw Data

Web Service – These return a valid xml document with a self-describing structure so there’s not much more to say. Just convert it to XDocument and make it available through Linq.

RSS Feeds – These return a valid xml document so there is no need for tidying up, all you need to do is convert it to an XDocument for Linq. The tags don’t carry any meaning though, so you’ll still need to do some string processing.

Atom – Haven’t dealt with this as I haven’t had an Atom data source to deal with – but it’s based on XML and can probably handled the same way as RSS.

Web Site – I originally thought I’d find a .NET library which would build an HTML document from the raw HTML returned so that I could walk through the collections. It’s doesn’t seem to be that simple. I experimented with one library, but the C# version appears to have significant bugs.

So it seemed easiest to convert directly to an XML document, since I’ll need that for my Linq queries. SgmlReader does this well and the dll version will be easy to integrate into your project. There’s no meaning in the structure of this XML document, so you’ll still have to do some string processing. By the way the conversion from an XmlDocument to an XDocument is very simple:

using System.Xml.Linq;
private XDocument XmlDocumentToXDocument(XmlDocument doc)
{
return Xdocument.Load(new XmlNodeReader(doc));
}

There are several types of work to be done:

Walking through any HREF links and querying them.

and/or

Parsing the data returned in markup elements and attributes

The tags will just contain HTML keywords, so the basic approach is to find something which identifies a piece of data (on the site I’m working with it’s words like “Name”, “Telephone” etc., followed by a colon), and try to extract the data from whatever follows it.

There are several issues to deal with. If the data is contained in an HTML table there may be a middle cell where most of the text is found and there is no label. I am thinking of writing new code to process an HTML table as a unit. It should then be possible to match the data to “labels”. Provided there are not many unlabelled cells left over, they can be tested to make sure that they don’t contain certain types of HTML markup and then mapped any expected data which is still missing.

Some data may also be found in “drilldowns” to other pages or as a javascript window which pops up on the same page.

All the above problems might be dealt with by segmenting the page. If tables, HREFs and popup windows can all be separated first and treated individually it will probably be a cleaner way to process the page.

Sql Tricks for Tricky Logic (3)

December 8, 2009

This is the last posting in my brief discussion of SQL tricks. I’m dealing here with an issue which causes quite a bit of confusion to newer developers. What’s the point in having LEFT and RIGHT joins defined in the SQL standard, how do you decide which to use, and is there a way to keep things simple?

The two most common types of SQL join are INNER and OUTER. The words “left” and “right” are not used with inner joins. This is because the joined tables are both treated in the same way. Rows will not appear in the result set except where the join condition/s are satisfied.

In an OUTER join tables are treated differently. You will see all rows from the “outer” table, plus any rows from the inner table satisfying the join condition. In a row where no inner table match is found, its columns are padded with Nulls. The purpose of the “LEFT” or “RIGHT” in the join is to indicate which one is the outer table.

Take the following join expression:“TableA LEFT OUTER JOIN TableB”. The words “left outer” indicate that the outer table is TableA, since it is named to the left of the keyword “JOIN”. So that’s the table that will be returned in full. Rows from Table B will only return where they match the join condition. Can a left outer join be rewritten as a right outer join? Yes, where the entire query includes only two tables, there is no difference. You’ll very rarely see a RIGHT OUTER JOIN for simple queries though. It doesn’t do anything more than a left join, and since most people find the left join much more readable there’s no need for the right join.

Where an query involves more than two tables a case is sometimes made for using right outer joins.

There is an exhaustive discussion of the issue here (and this should also be read in conjunction with it). I can’t add much to what’s said, but I wanted to point up the alternative to the RIGHT join that I prefer, namely the nested join. Please note that in SQL Server you can’t use the bracketed syntax as in the blog, so to make clear what’s going on I comment and indent it as follows:

SELECT *
FROM TableA
LEFT
OUTER JOIN TableB
–nested join–
JOIN TableC ON TableB.Col1 = TableC.Col1
ON
TableA.Col1 = TableB.Col1

If you don’t like the nested join the alternatives are:

  • Derived tables (also known as inline views)
  • CTEs (Common Table Expressions, also known as “With Clauses”)

…or if you don’t require your query to be self-contained, you could create a view or function to resolve the inner join for tables B and C, which can then be left outer joined to TableA, or even use a table variable or temporary table.

In most cases I want to avoid dependencies so I opt for one of the self-contained approaches. In the past I’ve always used derived tables. but nowadays I prefer the nested join. It seems to go better with my coding style since my column lists are vertical with each name on its own line. If you use a derived table the select list has to be repeated, which means quite a bulky expression.

I hope these three articles will be helpful in writing simpler SQL.

SQL Tricks for Tricky Logic (2)

October 23, 2009

I’ve been working with Linq in .NET lately and there have been a few posts I would have liked to add but since I said I was going to blog SQL tricks I’m putting other topics to one side for the moment.

So… another situation where it would be useful to have an SQL trick is the case of the filtered outer join. Suppose I have a chain of high street charity shops. Every shop should have a manager, who may be the only staff member (OK, not a fully-developed scenario). I want to find the manager’s name for each branch as well as listing any branches which have no manager, and new branches which have no staff at all.

The SQL coding style I prefer puts equi-joins between columns in the JOIN clause. All other types of column matching and filters go into the WHERE clause. Unfortunately, this doesn’t work with a Left Outer join as you can see in the example below (written in SQL Server 2008 Transact SQL). There’s an answer though: put the filter in the JOIN itself. Since this breaks my self-imposed style guide I usually comment it as shown to draw attention to the fact.

/*PLEASE REPLACE “SMART” SINGLE QUOTES BEFORE RUNNING THIS*/

SET NOCOUNT ON

CREATE TABLE Branch
(
BranchId int NOT NULL
)
INSERT INTO Branch
(
BranchId
)
VALUES(1),(2),(3)

CREATE TABLE BranchStaff
(
BranchId int NOT NULL
,
Name nvarchar(20) NOT NULL
,
Position nvarchar(20) NOT NULL)

INSERT INTO BranchStaff
(
BranchId
,Name
,Position
)
VALUES(1, ‘John Smith, ‘Branch Manager’),(2, ‘Arthur Jones’, ‘Sales Assistant’)

PRINT ‘***INNER JOIN:’

SELECT
B.BranchId

,Name
,Position
,‘Includes only rows matched in both tables.’ As Notes
FROM
Branch B
INNER JOIN BranchStaff S
ON B.BranchId = S.BranchId

PRINT ‘***Unfiltered LEFT OUTER JOIN:’

SELECT
B.BranchId
,Name
,Position
,CASE WHEN S.BranchId Is Not Null THEN ELSE ‘No match for this row in the main table so joined table columns padded with nulls.’ END As Notes
FROM
Branch B
LEFT OUTER JOIN BranchStaff S
ON B.BranchId = S.BranchId

PRINT ‘First Attempt – LEFT OUTER JOIN with filter in WHERE clause:’
SELECT
B.BranchId
,Name
,Position
,‘Acts like an Inner Join because filter excludes unmatched rows in joined table.’ As Notes
FROM
Branch B
LEFT OUTER JOIN BranchStaff S
ON B.BranchId = S.BranchId
WHERE S.Position = ‘Branch Manager’

PRINT ‘Second Attempt – Filter in the JOIN:’
SELECT *
,
‘Correct! Includes all rows from main table plus filtered rows in joined table’ As Notes
FROM
Branch B
LEFT OUTER JOIN BranchStaff S
ON B.BranchId = S.BranchId
/*not equijoin*/
And S.Position = ‘Branch Manager’

DROP TABLE Branch
DROP TABLE BranchStaff

SQL Tricks for Tricky Logic

September 19, 2009

SQL Tricks for Tricky Logic

One reason occasionally offered for using dynamic SQL widely, or even for keeping business logic out of database stored procedures, is the difficulty of coding optional filters in SQL. Sometimes we may want a row from table X where column A = this, column B = this and column C = this. Other times we may want all the rows where column A = this but we’ll accept any value in columns B and C.

I’m not taking a position here on where the business logic belongs – but is it hard to code optional filters in SQL? No it sure isn’t. I’ll illustrate this in SQL Server 2008 Transact SQL as follows (NB: you may have to replace the quotation marks with “unsmart” single quotes to get the SQL to run):

SET NOCOUNT ON

CREATE TABLE T(A int NOT NULL, B int NOT NULL, C int NOT NULL, NullableD int NULL)

INSERT INTO T ( A ,B ,C, NullableD ) VALUES ( 1 ,1 ,1 , Null)
INSERT
INTO T ( A ,B ,C,NullableD ) VALUES ( 2 ,2 ,2 ,2)

DECLARE @Filter1 int = 1
DECLARE
@Filter2 int = 1
DECLARE
@Filter3 int = 1
DECLARE
@LoopIdx int = 0

WHILE @LoopIdx < 4
BEGIN

PRINT ‘@Filter1:’ + COALESCE(CAST(@Filter1 as nvarchar(1)), ‘Null’)
PRINT ‘@Filter2:’ + COALESCE(CAST(@Filter2 as nvarchar(1)), ‘Null’)
PRINT ‘@Filter3:’ + COALESCE(CAST(@Filter3 as nvarchar(1)), ‘Null’)

PRINT ‘(@LoopIdx:’ + CAST(@LoopIdx as nvarchar(1)) + ‘)’
/****THIS IS THE IMPORTANT BIT****/
SELECT
* FROM T WHERE A = COALESCE(@Filter1, A) And B = COALESCE(@Filter2, B) And C=COALESCE(@Filter3, C)
/****END OF IMPORTANT BIT****/

SELECT @Filter3 = Null

IF @LoopIdx = 1
SELECT
@Filter2 = Null

IF @LoopIdx = 2
SELECT
@Filter1 = Null

SELECT @LoopIdx = @LoopIdx + 1

END

So any filter coded as COLUMN = COALESCE(@Filter,  COLUMN) will be optional. The stored procedure is defined with all the parameters that might be used.When calling it just pass in Null for any parameter which you don’t want to filter on.

I make no comment about how the optimizer will interpret this as it would require a lot of testing, but I started using this coding style in Oracle and SQL Server years ago and it hasn’t caused me any performance problems I’m aware of. I’m personally willing to sacrifice some performance for clarity and ease of maintenance – but if speed is the absolute priority it would no doubt be worth testing the Null and Not Null filters with different data distributions.

Another point to consider is what would happen with a nullable column. Assuming your database settings treat the result of [Null compared with Null] as Unknown, the logic won’t work. A null column value won’t be equal to itself – but if the column is non-numeric one possible approach is this:

SELECT * FROM T WHERE COALESCE(NullableD, ‘€£$UNLIKELYVALUE*&^’) = COALESCE(@Filter, ‘€£$UNLIKELYVALUE*&^’)

Note that wrapping the column in a function will prevent indexes being used but since a unique index can’t be created on a nullable column this doesn’t make a lot of difference in practice. Of course, if your database treats Null as equal to Null this whole approach is not useful, but I’ve come across very few databases set up like that.

In the case of date or numeric columns it’s harder to be sure that the “unlikely” value will never occur, and actually I think the following SQL is clearer and therefore better in all cases:

WHERE (NullableD = @Filter Or @Filter Is Null)

NB: The brackets keep the “Or” from combining with any other expressions in the WHERE clause.

If you drop the following code into the same SQL Server window you can check how it works:

DECLARE @Filter int
SELECT NullableD FROM T
WHERE (NullableD =@Filter Or @Filter Is Null)

SELECT @Filter = 2
SELECT NullableD FROM T
WHERE (NullableD = @Filter Or @Filter Is Null)

So that’s all there is to it. There are some other filtering issues when using left outer joins. I hope to deal that topic in a future posting.

Setting up a SQL Server Express 2008 Client Server Application for Access over a Peer-to-Peer Network

July 1, 2009

Today’s blog records the .NET and database technical issues I encountered setting up a SQL Server Express 2008 client server application on a peer-to-peer network.

Integrated security apparently doesn’t work on a peer-to-peer network so it’s necessary to set the server to mixed security. The connection string is stored in a config file. It doesn’t contain the user name and password, which are provided by the user through a login form which pops up when the application is started. Storing and retrieving the connection string from app.config presents a few minor issues:

In Visual Studio the “Project > {Your Project Name} Properties > Settings” location provides an easy way to define and validate a connection string in your application – but the connection string will apparently be “baked in” to your dll. I don’t think this presents a risk in relation to my project, but it seems like a bad habit to do the settings this way, so I edited app.config manually which does not have this effect.

One setting which you may need to check in Visual Studio is “app.config>Properties>Copy to Output Directory”. To make sure that Visual Studio automatically copies the file to the directory where it compiles the assembly, set this to “Copy always” or “Copy if newer”.

In writing code to retrieve the settings, the obvious “no no”, is using the configuration objects in the System namespace. This is now deprecated so you’ll be nagged to death about using obsolete code if you stick with it, and if you treat warnings as errors when you compile you’ll have bigger problems. Set a reference to System.Configuration and use that. Unfortunately many web code examples do utilise the obsolete namespace.

There were a few other issues which became clear to me once I understood key pieces of information: first off, my connection would always fail the login no matter how I configured it. Eventually I looked closely at the string brought back and found that it was completely different to my setting. I saw a mention of asp, so the config being read probably had something to do with the Visual Studio web server. The key point was that first item in the connection strings collection may come from a file higher up the search chain than your own app.config. To ensure that your local app.config is the one used, add a “clear” element before the “add” element where your connection string is defined. I’ll provide the format of the connection string I used later in this posting.

To avoid confusion when debugging, it’s also useful to know that {appname}.vshost.exe.config is the config file used when running the project in Visual Studio. This is just a copy made at run time though, so there are no practical implications

My project includes a library and the connection is set up by a helper object in the library. It might seem like a good idea to have a config file in the library so that all connection details are centralised, but this won’t work. No app.config is added when a Class Library is created, and
this is for the very good reason that it will be ignored in favour of the application’s config file. You can access settings from the library, but they will be defined by the calling application, not by the library.

I want to create and drop connections as needed. I pop up a login form to collect the user name and password. These are stored as module-level variables and the form is a Singleton so that it can log in and pass back a new connection when required.

There was quite a bit of good news when I looked at the database security. The application is used remotely by a normal “non-dev” user. The user needs enough permissions but not too many of course. My application works completely through stored procedures which simplifies requirements. Recent versions of SQL Server have features which make security even simpler. Schemas can now be defined within the database. Unlike Oracle, the schema owner need not be a user with the same name. This means that you can segregate objects in whatever way you like without creating an equal number of schema owners.

It’s useful to keep development and maintenance procedures in their own schemas for example. Having got them out of the way, it’s then a good idea to make sure that all application schemas are owned by one user. That will mean that each stored procedure has an “ownership chain”. Why does this matter? If a procedure has an ownership chain, there’s only one permission required to run it (assuming there is no EXECUTE AS CALLER clause in the definition). The permission needed is EXECUTE on the stored procedure; no permissions will be needed for the objects referenced by the procedure. If some of those objects are owned by other users though, the ownership chain will be not be formed and you will need to grant explicit permissions in relation to them.

Another new simplifying feature that now comes into play is the ability to grant user permissions on a schema. So GRANT EXECUTE ON SCHEMA::{your schema name} TO {user name} will ensure that every stored procedure in the schema will be executable by the user in question.

Having created stored procedures to do all the CRUD operations, I certainly don’t want my client to be able to read or write data in the tables directly. The new, simple way of ensuring this is to assign the user to two fixed database roles, db_denydatareader and db_denydatawriter. This will take effect across the whole database. The command used are as follows:
EXEC sp_addrolemember [db_denydatareader] , ‘{user name}’
EXEC sp_addrolemember [db_denydatawriter] , ‘{user name}’

With these measures I believe I’ve got a security model that allows a user all the access they need, but only through the stored procedures I’ve created. Incidentatlly, by using the EXECUTE AS {login} command in SQL Server Management Studio I was able to do some testing as the user from my normal Dev session.

The next step job was to enable connectivity from the user’s machine. I decided not to keep things simple and not enable SQL Browsing. This meant that I would have to provide the IP address of the host and the port that the database server was listening on. My program determines whether it is local to the database server or not. If it’s local it uses a connection string with “Integrated Security=SSPI”, and if it’s remote it uses the following:

“Application Name={the name of the application};Data Source={host machine IP address},{port number};Initial Catalog={database name};User ID=;Password=;Integrated Security=false”

The user name and password are inserted into this string at run time, as I mentioned above.

By default SQL Server assigns the port dynamically, so it was necessary to make the services needed listen on a defined port instead. SQL Server Configuration Manager is used for this, I won’t go into detail as the information is available fairly widely. I still could not telnet onto the database server , the reason being that the host machine’s firewall was still blocking the traffic. By opening the firewall and establishing a new incoming rule I opened up the port and was able to connect.

I decided to try the Visual Studio “Build>Publish” command and install to a subdirectory of C:\Users\Public (operating system is Vista – sigh…) so that the client machine would have access to the files. The process ran through without errors, but the browser window popped up at the end with a “File Not Found” error. I decided to ignore this and try to install it on the client machine anyway. I mapped a drive to the installation folder and it installed perfectly.

The client machine runs XP and you get the warnings about the publisher of the software not being verified. This isn’t the biggest of my worries so I’ll look at it later. To upgrade the software it seems to be necessary to go to “Add/Remove Programs” and uninstall it first, but this is because I selected the “Yes, this application is available online or offline” option in the Publish wizard. So that’s about it. I’m sure this posting will be useful to me next time I have to deploy an application, as by then I’ll need to relearn a lot of the detail. This was a few weeks ago so I may have missed a detail or two, and if so I’d be glad to hear about it.

The same application also queries an Oracle database so I’ll cover the work involved setting that up in a future post.

Database Object Types – Some Design Considerations

July 1, 2009

There’s now a wide range of objects that a database developer can use to solve problems. For this post or two I’m considering certain objects (as implemented in MS SQL Server) and how I think they are probably best used .

Stored Procedure
A stored procedure has the full range of DML operations and you can raise errors in the procedure, useful if you want to make run-time checks on data. It will return data to a client program, but within the database you can’t re-use the result set. That is to say, you wouldn’t be able to SELECT the proc results and join them to another table. In MS SQL Server you can EXECUTE the stored procedure into a temporary table (or possibly a permanent table) and join to that, but it isn’t standard SQL and was a bit clumsy last time I used the technique.

If your result set is a “foundation” element that will be re-used in other queries, it will probably be better to use a view, which can be treated just like a table in your SQL, or a function. The view or function can still be wrapped in a stored procedure to offer a consistent interface for client programs and allowing you to raise errors as above.

Trigger
A trigger is the best solution for auditing changes to a table. Consider alternatives if you’re willing to trust that all developers will write comprehensive auditing code for their programs, that it will be possible to merge the data from these individual audit systems, and that no-one apart from dbas will have direct access to the data so as to bypass the auditing.

In the systems I work on that’s never the case, so I use audit triggers. They’re the only way to make sure that changes are recorded whatever program or access method is in use. Seems paranoic? No doubt, but if you can’t be 100% sure that your audit records are available and complete, why bother auditing at all?

I try not to use triggers for any other purpose. You can’t pass in parameters to a trigger so it’s not as natural to code as a stored procedure. A trigger is easily forgotten and may only come to mind when it’s slowing down a big UPDATE or INSERT (probably in a script you’re running to solve a critical production issue). As a maintenance programmer I really don’t like the idea that code could suddenly jump out at me without warning. I’m sure most others feel the same, so if you want your application to survive and thrive during the maintenance phase, don’t get “trigger-happy”…

There are traps in the coding of the trigger. You have to consider whether the trigger itself would generate further trigger activity and whether you might get into an endless loop. I think the question to ask is – are there really no alternatives to using a trigger here? I’ve almost always found one, even for validating multiple rows of data being inserted into a table – one of the times when a trigger would be a “natural” solution.

View

These are virtual tables based on select statements. They can include multiple tables or views and subject to restrictions they can span partitions. INSERTS, UPDATES and DELETES can be done on views provided the SQL behind meets certain restrictions (with additional restrictions in the case of partitioned views).

It may be possible to bypass the DDL restrictions on some views by using an INSTEAD OF trigger. This can be attached to a view – even if the view isn’t defined with the SCHEMABINDING option (see below). I think the same caveats apply here as with ordinary triggers (see above). An INSTEAD OF trigger might be helpful if some client applications make direct inserts into tables – and the schema has changed so that the tables needed are no longer available. Apart from that they should probably be avoided, especially if access to the underlying tables is still allowed. Object views use INSTEAD OF triggers to keep the underlying relational data in synch with the object properties. That’s another debate which I won’t enter into here.

You can’t ORDER a view except if there is a TOP clause. One trick to force ordering is to create a view with a TOP 100 PERCENT clause, but apparently the ordering in that case is still not guaranteed.

Oracle is more liberal as to where it allows sorting. MS SQL Server’s prohibition on view and function ordering is probably a stricter implementation of relational theory which states that the definition of a table (or a virtual table I presume) does not include ordering of its rows. Lack of sorting isn’t a major issue. Explicit ordering should always be added to code released to Production – it’s just a convenience when you’re developing to have your data “ready-sorted”.

Note: Schemabinding means binding the view to the database schema. This means that any changes to the schema must leave the view in a consistent state. An error will prevent the dropping of any tables or columns which appear in the view.

Materialised Views
These are known as Indexed Views in MS SQL Server, and are used to boost performance. When a query references a view, its SQL is expanded by having the definition of the view written. This happens at run time, and if it imposes a big overhead, the view can be bound to the schema. The view can then be indexed and much of the work is therefore done before the data is queried.

The materialized view is an advanced technical feature, apparently more commonly used in data warehousing than in the kind of conventional reporting systems that I work on. It needs careful use. I’m going to consider Oracle as well as SQL Server in this posting; I’m more familiar with materialized views in Oracle and they seem to be more configurable (with more opportunity to get into trouble). There are a lot of technical details regarding how materialized views can be built and restrictions that apply. It seems to me that the two most important factors are:

1.Data refreshes
The Materialized view can speed up performance for SELECT queries, but the trade-off is added overhead for inserts, updates and deletes. So a materialized view will not help if all of the tables are volatile, with lots of DML going on.

2.Query Rewrites
This feature allows the optimizer to look at queries submitted and check them against the materialized view. Where the materialized view can satisfy part or all of the query, it will be “written in” to the query plan.

Point 1 is a normal design issue, and the worst penalty for a bad decision might seem to be a loss of performance. You could get into serious trouble in a live system though, if you work around the performance issue by taking control of the data refreshes. Oracle allows you to specify when these will happen, which is OK as long as there is guaranteed to be no change to the data between refreshes – or you’re happy to defer data changes to a batch process occuring at set times (and everyone else feels the same way).

Point 2 seems to be the feature that really needs a lot of thinking about, and I worked on a system where this` issue cropped up. Remember that once query rewrite is enabled, the optimiser will patch the materialized view into any submitted query as it sees fit.

This means that by enabling query rewrite on your materialized view, you could be redeveloping code that you’ve never seen. The optimizer considers the query predicate in making its decision. If there’s a filter on the materialized view it shouldn’t get used unless it’s compatible with the submitted query – but will you take that on trust, or will you test it? It won’t be easy to find the code that might be affected, and clearly you can’t test code that hasn’t been deployed yet, but such code could also be affected.

There’s a final nasty twist if you’ve got query rewrite enabled and the data refresh isn’t automatic. I assume that query rewrite doesn’t force a refresh (or the performance benefits would disappear) so – you could be patching old data into queries instead of the up-to-date information available in underlying tables. It’s possible to disable or re-enable the materialized view in line with the updating policy but this would need careful management.

For me the overall message is that as long as the environment is well controlled the materialized view could be a useful tool. For example, a good security model would help since there’ll be much more control over where the view could be used. Someone with oversight of the whole system at a good level of detail could manage the decision on whether data refreshes should be controlled instead of automatic. Such was not the case in the system I worked on…

Generating Template Code From Utility Stored Procedures

July 1, 2009

Like many maintenance programmers I’ve been the victim of badly-thought out database or program design. Sometimes the application is written in “CV++”, cramming in every trendy feature imaginable, presumably to add another bullet point to the programmer’s resume. Perhaps it was written in a hurry or without doing any real analysis.

Sometimes the perpetrator is someone else, other times it’s me. Unfortunately I do have to take my place along with many others as the source of half-baked ideas causing hours of pointless work. I do try not to repeat my mistakes though. I’m going to try recording my thoughts here on issues and solutions I’ve found.

Before I start I ought to mention one factor which I think is the root of many design problems – boredom. There’s no law of nature that says everything important will be interesting. So I’m not guaranteeing that each topic will be exciting (even to an IT guy).

My background is in the financial sector. Many aspects of the banking business are dramatic and possibly exciting, as we’ve found in the last year. Probably the design of a table to record trader’s limits is not one of those aspects. That’s precisely why I’d like to get it right first time so I don’t have to think about it again! So I’ll probably be talking about details more often than not, and they may be related to finance but most will probably have a wider application. Any comments and suggestions welcome.

I thought I’d start with an example stored MS SQL Server INSERT stored procedure based on my model. Nearly all of the inserts in my accounts system are done through stored procs.

It would be a heavy burden both to write all those procedures and make sure they conform to standard – unless you have a “helper” procedure which creates the text. That’s the way I do it. If the stored proc makes more than one insertion a transaction should be included so that everything rolls back. I may amend it later to show this. Here’s the basic frame work with comments on the features (they don’t normally appear in the procs of course)

CREATE PROCEDURE [Acs].[spAccountPersonal_I]
–ALTER PROCEDURE [Acs].[spAccountPersonal_I]
/*
Input, output and input/output params have i_, o_ and io_ prefixes, making it easier to tell them apart from local variables
*/
@i_BookName nvarchar(100)
,@i_CategType nvarchar(50)
,@i_CategName nvarchar(50)
,@i_DateFrom datetime
,@i_DateTo datetime
,@i_PYID int

/*
Debug param always included even if there is no code utilising it. Default to 0 means it doesn’t have to be included in a call. Statements giving more detail such as intermediate data can be wrapped with if @i_debug = 1
*/
,@i_Debug int = 0

AS
BEGIN

/*
Usual statement to suppress return of row count
*/

SET NOCOUNT ON

DECLARE @Msg nvarchar(4000)
DECLARE @RowCount int
DECLARE @ErrNum int

/*
Enable SQL Server TRY/CATCH error trapping
*/

BEGIN TRY

INSERT INTO
tblAccountPersonal
(
BookName
,CategType
,CategName
,DateFrom
,DateTo
,PYID
)
VALUES
(
@i_BookName
,@i_CategType
,@i_CategName
,@i_DateFrom
,@i_DateTo
,@i_PYID
)

/*
Capture row count directly after insert statement
before it is reset by subsequent statements
*/

SELECT @RowCount = @@ROWCOUNT
/*
Set a message text which can be used whether or not an error will be raised
*/
SELECT @Msg = CAST(@RowCount as nvarchar(5)) + ‘ ‘
+ ‘row/s inserted into tblAccountPersonal.’
IF @RowCount = 1
/*
Level 10 is information level so in production the client will ignore it.
The developer executing with SQL Management Studio gets confirmation that the row was inserted.
*/
RAISERROR(@Msg, 10, 1, 1)
ELSE
/*
Since data was supplied through input params exactly one row
should have been inserted therefore raise an error if anything else
happens
*/
RAISERROR(@Msg, 16, 1, 1)
END TRY

BEGIN CATCH
/*
Store error value for use lower down
*/

SELECT @ErrNum = @@ERROR

/*
Pass CATCH block function return values to a procedure which will build
an informative error message
*/
SELECT @Msg = Err.ErrMsg(ERROR_NUMBER(), ERROR_PROCEDURE(), ERROR_LINE(), ERROR_MESSAGE())

/*
Raise the error again with the new message
*/
RAISERROR(@Msg, 16, 1, 1)

END CATCH

/*
Return 0 if the function executed correctly otherwise pass back error number
*/
RETURN COALESCE(@ErrNum, 0)

END

Using the User-defined Table types with ADO.Net structured parameter type

July 1, 2009

Late in March I released (started using a lot) a small C#/SQL Server project. It utilises the new User-Defined Table Type in SQL Server 2008 and the “structured” parameter type in ADO.NET. This allows a table originated on the client to be passed in to a stored procedure as a parameter. Oracle has had table data types for a while of course, and there are still some limitations to the SQL Server version, but it’s certainly very easy to use.

The application is an uploader of OCR’d bank statement data to the database, performing validations on the way in so that data in the final table is up to standard.

I’ve found that the structured type makes a very neat design possible The essentials of the design are as follows. The target table is tblBankStatement and I have defined a table type which mirrors its columns, datatypes and primary key.

Using system tables I created a stored procedure providing a dataset of column names, types and sizes, plus whether each column is nullable or not. Apart from the primary key, most of them are nullable. I don’t think the table type should have too many business rules otherwise it could get out of sync with any changes in the base tables.

The application has to upload data from a variety of bank statements. They all differ as to number and order of columns supplied. Some have debit and credit columns, others a single plus/minus column etc.. I created another table in the database to store layout details for each source document. By left outer joining this layout to the list of columns in the datatype it’s possible to see which columns are supplied and in what order – and which data will have to be created artificially.

The client code starts by bringing back this metadata which it uses to build a DataTable to supply as the value for the ADO.NET structured-type parameter.

I created my own C# UploadDataTable and UploadDataColumn types by subclassing DataTable and DataColumn. The DataColumn base class has an Ordinal property – I added SourceOrdinal, which, where not Null, indicates the position of the column in the source data. I also added a SourceColumns property to the DataTable providing the collection of columns expected in the source data. Once the UploadDataTable is built, the client code can iterate through the SourceColumns collection and test that there are no gaps or type mismatches in the data.

Any columns in the table type not found in SourceColumns will have to be created in code.
For example, the primary key of tblBankStatement includes [LineNum], a column which doesn’t occur in all source files. In that case the AutoIncrement property of the DataColumn is used to generate the numbers. [StatDate] is another column in the primary key, but may not be in the file. If the DefaultValue property is set to the Statement Date, the column will be assigned correctly. These were the only two methods I needed to use.

Once the data has been assembled and passed to the data loader stored procedure it’s very easy to pass the table on to other procedures and functions which can do specialised validation such as checking that the balance brought forward matches the closing balance of the previous statement.

This lightens the work done by the data loading procedure and makes it more readable since it’s now a controller. One criticism could be that most of the business rules are now in the database, but that’s the way my system works. I tend to think that the best place for business logic is one place, and many of the banking systems I’ve worked on have started with a tiered design but gradually the business logic has leaked across to the database. Since it’s so common to find some business logic in the database I decided to put all of it there.

July 1, 2009

It’s kludge time folks – but a relatively up-to-the-minute kludge at least, using  the new SQL Server 2008 user-defined table type. One of the things I’d like to be able to do with this is to define a function that returns a table of my chosen type, making the following code possible (I haven’t of course defined the function so it won’t execute):

CREATE TYPE Test.TestType
AS TABLE
(
C1 int NOT NULL
,C2 int NOT NULL
)

DECLARE @Vals Test.TestType
INSERT INTO @Vals SELECT * FROM Test.TestTypeFunction()

Perhaps I’d better quickly justify the use of SELECT * FROM. I’ve had the misfortune to be a maintenance programmer on an application which made a lot of use of this construct so I don’t lightly drop it into my code. For this statement though, the INSERT and SELECT column sets are guaranteed compatible – provided the function returns a table of the same table type as the caller’s variable.

So everything’s fine as long as we can guarantee that the function returns the expected table type. This function will have to be a “Multi-statement table-valued function”, since a table-type variable has to be declared to hold the data returned. That won’t be allowed in an inline function. Syntax for declaring a multi-statement table-valued function is as follows:

CREATE FUNCTION [ schema_name. ] function_name ( [ { @parameter_name [ AS ] [ type_schema_name. ] parameter_data_type [ = default ] [READONLY] } [ ,...n ] ] )
RETURNS @return_variable TABLE <table_type_definition>
[ WITH <function_option> [ ,...n ] ]
[ AS ]
BEGIN
function_body
RETURN END [ ; ]

Here’s where a small problem pops up. The bold line shows where you make the table declaration for the data returned. Instead of including a table definition, I wish I could write the following:
RETURNS @return_variable TABLE Test.TestType
…but apparently this isn’t possible. Of course I could simply paste in the definition of the table type. I would then have lost my guarantee of table type compatibility – anyone would be able to redefine the table in the function.

I think it’s very useful to have that guarantee so I don’t mind a little bit of kludgery in order to get it. The following script shows how the function can be tied in to the data type even though my handy syntax isn’t available. I’m using the WITH SCHEMABINDING option so that the table type can’t be dropped without re-editing the function. This means that an explicit column list has to be provided within the function but that’s not a problem since we don’t want “SELECT *” becoming a habit…

CREATE TYPE Test.TestType
AS TABLE
(
C1 int NOT NULL
,C2 int NOT NULL
)

GO

CREATE FUNCTION Test.TestTypeFunction
(
)
RETURNS
@ReturnValue TABLE
(
C1 int NOT NULL
,C2 int NOT NULL
) WITH SCHEMABINDING /*
prevents Test.TestType being dropped
without amending this function
*/

AS
BEGIN
DECLARE @T Test.TestType
INSERT INTO @ReturnValue
SELECT 1, 1
/*

UNION ALL ensures function output matches Test.TestType column list

*/

UNION ALL

SELECT C1, C2 FROM @T

RETURN
END

GO

/*CALL THE FUNCTION*/

DECLARE @Vals Test.TestType

INSERT INTO @Vals SELECT* FROM Test.TestTypeFunction()

SELECT* FROM @Vals

GO

DROP FUNCTION Test.TestTypeFunction
DROP TYPE Test.TestType

I think that being able to call the function as shown makes the jiggery-pokery worthwhile. The comments inside the function are needed though. Anyone seeing this for the first time may well be confused or irritated to see a UNION with an empty table.

Incidentally, with a stored procedure there’s no problem achieving the same goal. A variable of the table type you want can easily be defined inside the procedure. Instead of returning results to the client, the SELECT statement/s populate the variable. The output from the procedure then becomes a SELECT from it.

CLR Stored Procedures in SQL Server 2008

June 11, 2009

I recently got around to setting up Integrated CLR in my SQL Server 2008 installation, and coding a couple of stored procedures. These notes reflect my first look and a minor “gotcha” to be aware of.

Apparently the performance can be as good as Transact-SQL code in many situations, unless there’s a lot of data access going on. Pipelined functions may even offer a performance advantage in cases where a large amount of data must be delivered to the client.

I’m comfortable with Transact-SQL so for the moment I’ll be using CLR integration only in situations where it offers a very clear advantage over conventional stored procedures and functions. My CLR code will be doing some calculating and manipulating of data rather than just shovelling back the results of basic SQL.

I had a problem with most of the online examples of CLR stored procedures. They demonstrate using SqlContext.Pipe.ExecuteAndSend to return the results of an SQL query directly to the client, but I want to manipulate the data before sending it to the client. Other examples demonstrating how to send a scalar value, a stream of text or a single data record are no more useful.

I finally found what I was looking for at http://msdn.microsoft.com/en-us/library/microsoft.sqlserver.server.sqlpipe.sendresultsrow.aspx

The above is a teaching example and doesn’t query the database but it has the important information you need to build CLR procedures that can return synthetic data. I adapted it for my own use and the pseudo-code is as follows:

//Create DataTable from an sql query
//Build an SqlDataRecord
//Call method to signal that results will start to be returned
//Loop through rows in the DataTable
//Instantiate an object to process values from the current DataTable row
//Assign values from the business object to the SqlDataRecord
//Send the row to the client
//Call method to signal that results are complete

I won’t provide an example of the above since it’s all straightforward stuff. The code below should be enough to get started.

A key point to remember is that an SqlDataRecord is exactly that – a single record. I started out by thinking of it as a table so I came across a minor “gotcha” which I illustrate in the code samples below. SqlProcedure1 returns the correct data but as a collection of one-line resultsets. I can’t see this being useful very often.

I made a step in the right direction with StoredProcedure2, which returns one resultset, but then I noticed that SqlContext.Pipe.SendResultsRow(record) doesn’t clear the columns. They retain whatever value was last assigned to them. This makes sense since it is a record and we haven’t done anything to clear the columns, we’ve only passed the values to the client. StoredProcedure3 returns on resultset with correct values, and also uses the inbuilt method of setting the columns to Null, avoiding type matching problems.

using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public partial class StoredProcedures
{
[Microsoft.SqlServer.Server.SqlProcedure]
public static void StoredProcedure1()
{
//
//Sends back correct results but one resultset returned
//for each iteration of the for loop
//
SqlMetaData dateCol = new SqlMetaData(“Date”, SqlDbType.Date);
for(int i = 0; i < 10; i++)
{
SqlDataRecord record = new SqlDataRecord(dateCol);
SqlContext.Pipe.SendResultsStart(record);
//Every other row returned should have a null value
if (i % 2 == 0)
{
record.SetDateTime(0, DateTime.Today.AddDays(i));
}
SqlContext.Pipe.SendResultsRow(record);
SqlContext.Pipe.SendResultsEnd();
}
}
[Microsoft.SqlServer.Server.SqlProcedure]
public static void StoredProcedure2()
{
//
//One resultset returned but where value should be Null
//the value from the preceding row is passed back
//
SqlMetaData dateCol = new SqlMetaData("Date", SqlDbType.Date);
SqlDataRecord record = new SqlDataRecord(dateCol);

SqlContext.Pipe.SendResultsStart(record);
for (int i = 0; i < 10; i++)
{
if (i % 2 == 0)
{
record.SetDateTime(0, DateTime.Today.AddDays(i));
}
SqlContext.Pipe.SendResultsRow(record);
}
SqlContext.Pipe.SendResultsEnd();
}
[Microsoft.SqlServer.Server.SqlProcedure]
public static void StoredProcedure3()
{
//
//One resultset returned with correct values
//
SqlMetaData dateCol = new SqlMetaData("Date", SqlDbType.Date);
SqlDataRecord record = new SqlDataRecord(dateCol);

SqlContext.Pipe.SendResultsStart(record);
for (int i = 0; i < 10; i++)
{
if (i % 2 == 0)
{
record.SetDateTime(0, DateTime.Today.AddDays(i));
}
else
{
//avoids problem with non-nullable data type
record.SetDBNull(0);
}
SqlContext.Pipe.SendResultsRow(record);
}
SqlContext.Pipe.SendResultsEnd();
}

};


Follow

Get every new post delivered to your Inbox.