Michael Friis' Blog

Linq-to-SQL, group-by, subqueries and performance

If you’re using Linq-to-SQL and doing a group-by that selects columns other than those in the grouping key, performance might suffer. This is because there is no good translation of such queries to SQL, and Linq-to-SQL has to resort to doing multiple subqueries. Matt Warren explains here. I experienced this firsthand when grouping a lot of geocoded events by latitude and longitude and selecting a few more columns (EventId and CategoryId in the example below):

var grouped =
    from e in db.Events
    group e by new { e.Lat, e.Lng } into g
    select new
    {
        g.Key.Lat,
        g.Key.Lng,
        es = g.Select(_ => new { _.EventId, _.CategoryId })
    };
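
To see why this hurts, it helps to look at the shape of the SQL that Linq-to-SQL ends up issuing for a query like the one above. The sketch below is simplified and not the literal generated SQL, but the pattern is the important part: one query for the distinct group keys, then one query per group for the non-key columns, which turns into an N+1 round-trip festival when there are many distinct coordinates:

-- 1) One query fetching the distinct grouping keys
SELECT [t0].[Lat], [t0].[Lng]
FROM [dbo].[Events] AS [t0]
GROUP BY [t0].[Lat], [t0].[Lng]

-- 2) ...followed by one query per group for the columns outside the key
--    (@x1/@x2 stand in for the Lat/Lng of the group in question)
SELECT [t0].[EventId], [t0].[CategoryId]
FROM [dbo].[Events] AS [t0]
WHERE ([t0].[Lat] = @x1) AND ([t0].[Lng] = @x2)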

One possible solution is to fetch all the events, call ToList() and do the grouping in memory.

var foo =
    from e in db.Events
    select new { e.Lat, e.Lng, e.EventId, e.CategoryId };

var bar = from e in foo.ToList()
            group e by new { e.Lat, e.Lng } into g
            select new
            {
                g.Key.Lat,
                g.Key.Lng,
                es = g.Select(_ => new { _.EventId, _.CategoryId })
            };

EC2 SQL Server backup strategies and tactics

The many backup modes offered by Microsoft SQL Server, combined with the prodigious hardware options offered on Amazon EC2, can make choosing a backup strategy for your setup a little confusing. In this post, I’ll sketch some options and end with a simple PowerShell script, usable on both Express and Standard versions, that’ll back up your database to S3.

To start with, you should probably be running your database off an EBS (Elastic Block Storage) volume. They can sustain many more random IOPS than instance disks (good for typical workloads) and they live independently of your instances. While I haven’t had an instance die from under me, if one should cop it, all data on the local disks will be gone-gone.

EBS volumes can fail too, however, and will do so at an annualised rate of 0.1% to 0.5% according to Amazon. You may decide this is good enough for your purposes and leave it at that. Bear in mind, however, that this failure rate is compounded by other factors, such as Windows or SQL Server malfunctioning and corrupting the volume, you pressing the wrong button in the AWS console or Management Studio, a disgruntled employee doing it on purpose, or something else entirely. In other words, you should take backups.

A simple approach is to use the snapshotting feature of EBS. This basically saves the (diff of the) contents of your volume to S3, from whence it can be restored to life if something happens to the volume. I’ve used this to muck around with test environments and such. It works fine and could conceivably be automated using the AWS API. It’s a rather low-level approach though, and you could easily find yourself restoring from a snapshot taken with SQL Server’s pants around its ankles, in the middle of a transaction. While SQL Server is obviously capable of recovering from such an indiscretion and rolling back to a safe state, this can be something of a hassle.

Another option is to do normal backups to another EBS volume mounted on the same instance. While I have no knowledge of Amazon datacenter topologies, one could fear that different EBS volumes attached to the same instance end up being hosted on the same EBS-SAN-thingamabob, the death of which would then also be your undoing.

You could also copy backup-files to another instance mounting its own EBS volume, or set up replication — allowing you to recover very quickly. Note that SQL Server Express can subscribe to a Standard instance in a replication setup, although it cannot publish. Your replicated instance could even live in a different availability zone, although you would then incur bandwidth cost on exchanged data on top of the cost of running an extra instance.

The approach we ended up taking uses S3, however. Amazon promises S3 to be very safe (“no single point of failure”), and it has the added benefit of being available independently of EC2 instances. To do a backup, we basically do a full database backup to one of the local disks and then move the file to S3. This is handled by a PowerShell script invoked as a scheduled task, making it usable on SQL Server Express instances (where the native SQL Server backup scheduling is not otherwise available). To handle the S3 interaction, we use the free CloudBerry snap-in. A few gotchas:

  1. If you’re running on an x64 system, install the snap-in with that .NET version
  2. You probably have to modify the PowerShell script execution policy on your instance
  3. You need the DotNetZip lib for zipping

Some possible improvements are zipping of files and shrinking of the log file before upload (*both added February 1, 2009*) and perhaps an incremental backup scheme.
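
For the incremental scheme, SQL Server’s differential backups would be an obvious building block. A rough sketch, reusing the database name and backup folder placeholders from the script below (full backups taken e.g. weekly, differentials in between):

-- Full backup; the differentials below only make sense relative to the latest full backup
BACKUP DATABASE [yourdb] TO DISK = N'C:\path\tobackupfolder\yourdb_full.bak'
        WITH NOFORMAT, INIT, NAME = N'full backup', STATS = 10

-- Differential backup; contains only the changes made since the last full backup
BACKUP DATABASE [yourdb] TO DISK = N'C:\path\tobackupfolder\yourdb_diff.bak'
        WITH DIFFERENTIAL, NOFORMAT, INIT, NAME = N'differential backup', STATS = 10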

Script is included below.

# This Powershell script is used to backup a SQL Server database and move the backup file to S3
# It can be run as a scheduled task like this:
# C:\WINDOWS\system32\WindowsPowerShell\v1.0\powershell.exe &'C:\Path\dbbackup.ps1'
# Written by Michael Friis (http://friism.com)

$key = "yourkey"
$secret = "yoursecret"
$localfolder = "C:\path\tobackupfolder"
$s3folder = "somebucket/backup/"
$name = Get-Date -uformat "backup_%Y_%m_%d"
$filename = $name + ".bak"
$zipfilename = $name + ".zip"
$dbname = "yourdb"
$dblogname = "yourdb_log"
$ziplibloc = "C:\pathto\ziplib\Ionic.Utils.Zip.dll"

# Remove existing db backup file
if(Test-Path -path ($localfolder + "\" + $filename)) { Remove-Item ($localfolder + "\" + $filename) }

$query =
"
USE {2}
GO

DBCC SHRINKFILE({3})

GO

BACKUP DATABASE [{2}] TO  DISK = N'{0}\{1}'
        WITH NOFORMAT, NOINIT,  NAME = N'backup', SKIP, REWIND, NOUNLOAD,  STATS = 10
GO
declare @backupSetId as int
select @backupSetId = position from msdb..backupset
where database_name=N'{2}' and backup_set_id=(select max(backup_set_id)
from msdb..backupset where database_name=N'{2}' )

if @backupSetId is null
begin
        raiserror(N'Verify failed. Backup information for database ''{2}'' not found.', 16, 1)
end
RESTORE VERIFYONLY FROM  DISK = N'{0}\{1}'
        WITH  FILE = @backupSetId,  NOUNLOAD,  NOREWIND" -f $localfolder, $filename, $dbname, $dblogname

sqlcmd -Q $query -S "."

# Remove existing zip file
if(Test-Path -path ($localfolder + "\" + $zipfilename)) { Remove-Item ($localfolder + "\" + $zipfilename) }

#Zip the backup file
[System.Reflection.Assembly]::LoadFrom($ziplibloc);
$zipfile =  new-object Ionic.Utils.Zip.ZipFile($localfolder + "\" + $zipfilename);
$e= $zipfile.AddFile($localfolder + "\" + $filename)
$zipfile.Save()

#Upload to S3
Add-PSSnapin CloudBerryLab.Explorer.PSSnapIn
$s3 = Get-CloudS3Connection -Key $key -Secret $secret
$destination = $s3 | Select-CloudFolder -path $s3folder
$src = Get-CloudFilesystemConnection | Select-CloudFolder $localfolder
$src | Copy-CloudItem $destination -filter $zipfilename
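
A backup routine is only as good as the restores it allows, so it’s worth occasionally pulling one of the zips back down from S3 and checking that it restores cleanly. A minimal sketch, using the folder and file naming from the script above (substitute the actual backup date):

-- Restore the full backup produced by the script above; REPLACE overwrites the existing database
RESTORE DATABASE [yourdb]
FROM DISK = N'C:\path\tobackupfolder\backup_YYYY_MM_DD.bak'
WITH REPLACE, STATS = 10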

Techniques for unique, correct and fast geo-queries II

You may remember that my last post on this topic ended with a question. Just to recap: we have a table with a lot of rows that have geographical coordinates, and we want to find a random subset that lies within a given map window. The problem with the query demonstrated last time was that there are rows with the exact same coordinates, and for those coordinates the query would always return the same row (namely the one with the highest intid).

I posted a question on Stackoverflow and Tom H. came up with a solution that was 90% there. The full query looks like this:

with cte as
(
    select
        intid,
        row_number() over
            (
                    partition by geoLat, geoLng
                    order by newid()
            ) as row_num,
        count(intid) over (partition by geoLat, geoLng) as TotalCount
    from
        documents d
    where
        d.geoLat < @maxLat
            and d.geoLat > @minLat
            and
            (
                ((@maxLng > @minLng) and
                    (d.geoLng < @maxLng and d.geoLng > @minLng))
                or
                ((@maxLng < @minLng) and
                    ((d.geoLng > @minLng and d.geoLng < 180)
                        or (d.geoLng > -180 and d.geoLng < @maxLng))
                )
            )
)
select top (@maxcount)
    dd.*, cte.intid, rand(cte.intid)
from
    cte, documents dd
where
    row_num = 1 + floor(rand() * TotalCount) and cte.intid = dd.intid
order by newid()

The query uses Common Table Expressions, which I’ve dabbled in before. Looking at the execution plan makes my head hurt, but it’s at least as fast as the original version. See the new version in action at the TEDBot site.
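
If you want to verify the “at least as fast” claim against your own data, SQL Server’s statistics output is the easiest way to compare the two versions side by side:

-- Report I/O and timing per statement, then run the old and the new query
-- and compare logical reads and elapsed time
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- ... run both versions of the query here ...

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;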

Processing and SQL Server on Windows

Tim Regan has a comprehensive description of how he got SQL Server 2008 running with Processing. I found some steps to be superfluous while others were missing (it was still an invaluable guide, though); here’s how I did it:

  1. Get the SQL Server JDBC driver; be sure to get the Windows version
  2. Unpack it somewhere and copy sqljdbc.jar to libraries\sqljdbc\library in your Processing folder
  3. Copy sqljdbc_auth.dll from the enu\auth\x86 folder of the JDBC driver to C:\Windows\System32
  4. In your sketch, do Sketch->Import Library->sqljdbc
  5. You can now use the helper class from Tim’s post (which is inspired by the one in Ben Fry’s book)

Note that it is apparently not necessary to muck around with the CLASSPATH. Happy querying!

Techniques for unique, correct and fast geo-queries

UPDATE: Better solution here.

Here’s the scenario: You have a lot (a million+) of geotagged addresses in a database and you want to show them as markers on Google Maps. Some further requirements are:

  • You only want to show some constant (10) amount at any given time, since too many markers clutter the map and kill the browser
  • You want the markers shown for a given frame to be selected randomly so that a user looking at some particular area with many geocoded addresses is not always shown the same 10 markers
  • You want the markers not to be on top of each other, so that even though many addresses in the database have the same coordinates (e.g. “Copenhagen, Denmark”), you only return one result per coordinate (this is to avoid placing two or more markers in the same spot)
  • You want it to run at interactive speed on a crummy box

Imagine you have a Documents table with columns including geoLng, geoLat representing geographical coordinates and intid, a unique integer identifier. @maxcount is the maximum number of rows desired while @maxLat, @minLat, @maxLng and @minLng define the corners of the map viewport. This query will do the job:

select *
from Documents dd
where dd.intid in
(
    select top (@maxcount) max(intid)
    from Documents d
    where d.geoLat < @maxLat
        and d.geoLat > @minLat
        and
        (
            ((@maxLng > @minLng) and (d.geoLng < @maxLng and d.geoLng > @minLng))
            or
            ((@maxLng < @minLng) and
            (
                (d.geoLng > @minLng and d.geoLng < 180) or
                (d.geoLng > -180 and d.geoLng < @maxLng))
            )
        )
    group by d.geoLng, d.geoLat
    order by newid()
)

“What’s this monkeying around with the longitudes?” I hear you cry. Well, if the map viewport straddles the International Dateline (which is somewhere in the Pacific), Google Maps will feed you viewport corners that “wrap around”, and that has to be handled. If-elses in SQL are a mess, so it’s better to cook up some pure boolean voodoo like the above. “Who the hell looks at Google Maps of the International Dateline?” you ask. Good point, but it’s actually not that uncommon. Looking at Japan at low zoom levels will often provoke this behaviour, as will looking at French Polynesia. Note that this trick is not needed for latitudes because the map doesn’t wrap vertically.

The slowest bit will usually be the order by newid() part, which gives each row in the prospective result a new (random) guid and then sorts all the rows based on this column. It can be replaced with tablesample calls, which are much faster but also a lot more erratic.
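
For reference, a tablesample-based variant could look something like the sketch below (same parameters as the query above). Tablesample picks whole pages rather than individual rows, which is what makes it fast and also what makes the number and distribution of returned rows erratic:

-- Sample roughly one percent of the table's pages, then apply the viewport filter;
-- fast, but the row count (possibly zero) varies from run to run
-- (longitude filter omitted for brevity)
select top (@maxcount) intid, geoLat, geoLng
from Documents tablesample (1 percent)
where geoLat < @maxLat and geoLat > @minLat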

There’s a very slight problem with this query in that the max(intid) will always cause the same row to be returned for any given Lat/Lng coordinate. The use of max(intid) is completely arbitrary and min(intid) would work too. Had there been a rand(intid) aggregate, the problem would have been solved. I haven’t figured out an elegant solution to this problem yet. Note that max() doesn’t work on the guids produced by newid().

To get this to perform, the tables in question are organised around a clustered (geoLat, geoLng, intid) index. You can see somewhat more elaborate versions of this query doing their thing at the TEDBot website.
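
For reference, that index could be created along these lines (the index name is made up, and this assumes the table doesn’t already have a different clustered index, e.g. on the primary key):

-- One clustered index per table: rows are physically ordered by coordinates,
-- so the viewport filter and the (geoLat, geoLng) grouping become range-friendly
create clustered index IX_Documents_geo
on Documents (geoLat, geoLng, intid)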

Attribute summing in CRM entity hierarchies

I’ve implemented some rather hefty Reporting Services (RS) reports lately, with some of the requirements including recursive summing of attribute values through hierarchies of CRM accounts. This sort of aggregation may be possible in RS, but I dislike that tool with a vengeance and prefer doing math in the database.

My first attempt (and the one that is now running at the client) used recursive SQL functions. The scoping rules for recursive functions are highly non-intuitive, however, and when I look at the functions now, I have only a vague idea of why, let alone how, they work. Further, functions have to be created in the database independently of uploading the reports that use them, a minor deployment headache. Clearly there must be a better way…

Enter Common Table Expressions, a new feature introduced in SQL Server 2005. The syntax, and exactly how the recursion works, is somewhat opaque; the best explanation I’ve found is this article. You’ll probably need to read it if you want to grok this bit of SQL, which computes the total number of employees in the parent/child hierarchy of each account:

with num_employees as
(
	select fa.accountid,
		accountid as original_id,
		fa.[name] as original_name,
		isnull(fa.numberofemployees,0) as numberofemployees
	from filteredaccount fa

	union all

	select fa.accountid,
		ne.original_id,
		ne.original_name,
		isnull(fa.numberofemployees,0) as numberofemployees
	from filteredaccount fa
		inner join num_employees ne on fa.parentaccountid = ne.accountid
)

select sum(numberofemployees) as numberofemployees, original_name
from num_employees
group by original_id,original_name

Here’s similar code for a prefilterable report dataset:

with num_employees as
(
	select accountid,
		accountid as original_id,
		name as original_name,
		isnull(numberofemployees, 0) as numberofemployees
	from filteredAccount as CRMAF_fa

	union all

	select fa.accountid,
		ne.original_id,
		ne.original_name,
		isnull(fa.numberofemployees, 0) as numberofemployees
        from filteredAccount as fa
		inner join num_employees as ne on fa.parentaccountid = ne.accountid
)
select sum(numberofemployees) as numberofemployees, original_name
from num_employees as num_empl_1
group by original_id, original_name

The most elegant formulation I’ve come up with is based on this example, however:

with num_employees as
(
        select          accountid,
                        sum(numberofemployees) as numberofemployees
        from            filteredaccount
        group by        accountid

        union all

        select          fa.parentaccountid,
                        y.numberofemployees
        from            filteredaccount as fa
			inner join num_employees as y on y.accountid = fa.accountid
)
select sum(numberofemployees), accountid
from num_employees
group by accountid

I challenge you, the reader, to come up with a more concise example :-).

Using SQL Server Native Web Services

This week I spent some time implementing an integration-scenario for a client. There were some complicating factors that we managed to overcome, and in case it may have wider applicability, I’ve written up our experiences.

The client has a successful e-commerce solution and wanted updated information on their customers and their orders available in Dynamics CRM, so that complaints and other service requests could be handled consistently. For this sort of scenario, we would typically use ImportManager Studio, a superb piece of kit from CRM Extensions. Given an ODBC connection to a data source and access to the CRM web service, it will merrily integrate data into almost all parts of the CRM data model.

Unfortunately, the e-commerce database was not hosted at the client’s premises. A rickety VPN line was available but deemed unsuitable, and the hosting company was unwilling to open port 1433 directly to the SQL Server — a reasonable stand. We decided that a web service extracting information from the database, deployed somewhere on the e-commerce website, would provide a solution acceptable to all, and I set to work.

First up, I wrote stored procedures to extract the required data. The database layout was very… different, but the forces of good prevailed. Next up was the implementation of the web services themselves. The group architect was assigned to the project, and figuring I had better do it right, I set about creating business-entity classes with associated data accessors. This involved lots of tedious copy-paste boilerplate coding, but after a while I was serving out tidy XML:

<ArrayOfOrder>
    <Order>
        <OrderNum>123</OrderNum>
        <OrderDate>03-08-2007</OrderDate>
        .
        .
        .
        <OrderLines>
            <OrderLine>
                <Linenr>1</Linenr>
                <Productcode>321</Productcode>
                .
                .
                .
            </OrderLine>
        </OrderLines>
    </Order>
</ArrayOfOrder>

Before moving on to the customers, I paused and looked at the code. While I was happy with the output, the associated effort and complexity seemed excessive. Indeed, each order attribute exposed through the web service was named, renamed, referenced or manhandled in some other way no less than 10 times: twice in the stored procedure, twice in the data-access class, twice again in the entity-class constructor and four times around the property with its associated getters and setters. While Visual Studio or SQL Server checked the validity of most of these occurrences (and their number can be significantly reduced in .NET 3.5), several were just free-floating strings. Worse, there was the risk of similar-looking attributes being cross-assigned due to some cut-‘n-paste error — I had found several of these already. Further, if the customer decided they needed more attributes migrated, adding them would be a significant hassle.

Leaning back, unhappy with this state of affairs, I tried to come up with a better, simpler solution. It was then I remembered something about SQL Server having its own web service thingamabob, and — after some bumbling around — lo and behold: Native XML Web Services for Microsoft SQL Server 2005. The introduction sums it up pretty well:

Microsoft SQL Server 2005 provides a standard mechanism for accessing the database engine using SOAP via HTTP. Using this mechanism, you can send SOAP/HTTP requests to SQL Server to execute:

  • Transact-SQL batch statements, with or without parameters.
  • Stored procedures, extended stored procedures, and scalar-valued user-defined functions.

Neat, huh? SQL Server seems to accomplish this feat by registering itself directly with the OS (through http.sys), independently of any IIS, ensuring that relevant requests come its way. The documentation on the “create endpoint” statement is terse and I couldn’t find any non-trivial examples. After some trying, I managed to concoct a statement palatable to the sophisticated tastes of SQL Server:

create endpoint Orders
	state = started
as http
(
	path='/sql/orders',
	authentication = (integrated),
	ports = (clear),
	clear_port = 8080,
	site = 'db-server'
)
for soap
(
	webmethod 'GetOrdersModifiedAfter'
		(name = 'Orderdb.dbo.GetOrdersModifiedAfter'),
	batches = disabled,
	wsdl = default,
	database = 'Orderdb',
	namespace = 'foospace',
	schema = standard
)
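
To sanity-check that the endpoint was actually created and is listening, or to get rid of it again while experimenting, the endpoint catalog view comes in handy:

-- state_desc should read STARTED for the endpoint defined above
select name, protocol_desc, state_desc
from sys.endpoints
where name = 'Orders'

-- and to remove it again:
-- drop endpoint Orders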

Obviously these web services can only be made accessible from the server running SQL Server (i.e. deploying them to the web-server running the e-commerce stuff was not an option). So — having validated the approach — I had to check with the hosting guys that it was OK to poke a hole in the firewall to the SQL server on port 8080. Security for this approach should be solid: Requests are authenticated at the http-level before being checked again by SQL Server. As a further measure at the host-site, only requests from the CRM-server IP are allowed through the firewall.

The downside to this approach was that my web service now disgorged object arrays containing ADO.NET DataSets (that is, if you’ve been a good boy…). The slightly added complexity in the code interacting with the web service was well worth the elimination of the entire middle tier on the database end, though. The DataSet can be found with this code (more than one object may be returned, so watch out):

private static DataSet GetDataSet(object[] objectArray)
{
	DataSet ds = null;
	foreach (object o in objectArray)
	{
		if (o is DataSet)
		{
			ds = (DataSet)o;
		}
	}
	return ds;
}

So there you have it: Found an off-site SQL Server with interesting data to be had? Native Web Services provides quick and low-maintenance access to it.