T-SQL syntax analysis

It is nice to be able to analyze the code for either code review automation or rule enforcement. While Roslyn is an extremely useful platform for C# code analysis, more often than not, .net applications come backed by SQL Server. And this means one thing…

We need Roslyn for SQL!

We probably are not getting full blown SQL dependency tracking system just yet. But apparently Microsoft does offer a SqlParser library that is very capable of handling most of static code analysis on SQL. Let us run through a reasonably common scenario. Developers rely on feature flags for all new functionality. Product Owner explicitly enables features are they become ready. States are kept in the database and managed via simple scripts. Since we release code multiple times a day, unfinished features can ship to production as long as respective flags are safely turned off. Now our task is to ensure that change scripts only create feature flag definitions but never turn them on by default.

A bit of theory

Parsing language (usually) happens in stages:

  1. Lexical analysis (convert character stream to basic building blocks – tokens)
  2. Syntactic analysis (group tokens into constructs according to selected grammar – parse tree)
  3. Semantic analysis (convert parse tree into abstract syntax tree)

Once we’ve got the Abstract Syntax Tree (AST), we can interact with it and perform certain operations depending on node type and context. Probably the easiest way to traverse such tree would be to implement a Visitor. Both Roslyn and SQL Parser offer stubs to aid implementing this pattern.

Sample scripts

This particular example revolves around two workflows:

-- case 1 - definŠµ new feature flag
INSERT INTO dbo.FeatureFlag ( [Key], Value ) VALUES ( @featureFlagId, N'true')
-- case 2 - update existing flag
UPDATE dbo.FeatureFlag SET Value = 'true' WHERE [Key] = @featureFlagId -- enable/disable flag state

Let’s imagine we’ve got a directory where developers put new scripts. We need to scan it and figure out whether any scripts attempt to update the Value in our table:

using Microsoft.SqlServer.Management.SqlParser.Parser;
......	
	foreach (var file in Directory.EnumerateFiles(@"C:\ChangeScriptsFolder"))
	{
		var result = Parser.Parse(File.ReadAllText(file)); // parse SQL. That's all code we need to get an Abstract Syntax Tree
		var visitor = new FeatureFlagStateVisitor(file); // prep our visitor
		result.Script.Accept(visitor); // run
		if (!visitor.Result.IsSuccess)
		{
			string errorList = visitor.Result.Errors.Aggregate(new StringBuilder(), (sb, error) => sb.AppendLine(error)).ToString();
			throw new Exception($"FeatureFlag state must be disabled on build. Sql text analysis indicates following issues: {errorList}");
		}
	}
.......

And the visitor itself can look something like this:

    public class SqlAnalysisResult
    {
        public bool IsSuccess { get; set; }
        public IList<string> Errors { get; set; }
    }
    public class FeatureFlagStateVisitorContext
    {
        public bool IsFtUpdating { get; set; }
        public bool IsFtInserting { get; set; }
        public string StatementLocation { get; set; }
        public int InsertColumnRef { get; set; }
    }
    public class FeatureFlagStateVisitor: SqlCodeObjectRecursiveVisitor
    {
        public SqlAnalysisResult Result { get; set; }
        private readonly FeatureFlagStateVisitorContext _context;
        private readonly string _fileName;

        public FeatureFlagStateVisitor(string fileName)
        {
            Result = new SqlAnalysisResult
            {
                Errors = new List<string>(),
                IsSuccess = true
            };
            _context = new FeatureFlagStateVisitorContext();
            _fileName = fileName;
        }

        public override void Visit(SqlInsertSpecification codeObject)
        {
            if (codeObject.Target.Sql.CaseInsensitiveContains("FeatureFlag"))
            {
                _context.InsertColumnRef = codeObject.TargetColumns.IndexOf(codeObject.TargetColumns.FirstOrDefault(c => c.ColumnName.Sql.Contains("Value")));
                if (_context.InsertColumnRef >= 0)
                {
                    _context.IsFtInserting = true;
                    _context.StatementLocation = $"L{codeObject.StartLocation.LineNumber}:{codeObject.StartLocation.Offset} - L{codeObject.EndLocation.LineNumber}:{codeObject.EndLocation.Offset}";
                }
            }
            base.Visit(codeObject);
            _context.IsFtInserting = false;
        }

        public override void Visit(SqlRowConstructorExpression codeObject)
        {
            if (_context.IsFtInserting && codeObject.Values[_context.InsertColumnRef].Sql.CaseInsensitiveContains("true"))
            {
                Result.IsSuccess = false;
                Result.Errors.Add($"INSERT {_fileName} - {_context.StatementLocation}");
            }
            base.Visit(codeObject);
        }

        public override void Visit(SqlUpdateSpecification codeObject)
        {
            if (codeObject.Target.Sql.CaseInsensitiveContains("FeatureFlag"))
            {
                _context.IsFtUpdating = true;
                _context.StatementLocation = $"L{codeObject.StartLocation.LineNumber}:{codeObject.StartLocation.Offset} - L{codeObject.EndLocation.LineNumber}:{codeObject.EndLocation.Offset}";
            }
            base.Visit(codeObject);
            _context.IsFtUpdating = false;
        }

        public override void Visit(SqlColumnAssignment codeObject)
        {
            if (_context.IsFtUpdating && codeObject.Column.Sql.CaseInsensitiveContains("Value") && codeObject.Value.Sql.CaseInsensitiveContains("true"))
            {
                Result.IsSuccess = false;
                Result.Errors.Add($"UPDATE {_fileName} - {_context.StatementLocation}");
            }
            base.Visit(codeObject);
        }
    }

The idea is pretty simple – we keep track where we are at the moment and refer back to this context when making final decision on SqlColumnAssignment/SqlRowConstructorExpression level.

SQL Server computed columns: not just mathematics

Computed columns in SQL server are very handy when we’ve got a formula that we can rely on and some CPU cycles to spare. There’s heaps of articles online about what these are, so we’d not repeat it here. Let us rather look at one peculiar case…

What if I’ve got two bit columns?

A very plausible scenario indeed. Let’s create a table like so:

CREATE TABLE Test (
Id INT PRIMARY KEY,
Check1 BIT NOT NULL,
Check2 BIT NOT NULL,
CheckAll AS (Check1 = 1 OR Check2 = 1) -- Incorrect syntax near '='.
)

that didn’t seem to work and the error message is a bit unhelpful: we know it’s wrong but why?

Reading the documentation once again

computed_column_expression
Is an expression that  defines the value  of a computed column. A computed column is a virtual column that is not physically stored in the table but is computed from an expression that uses other columns in the same table. For example, a computed column could have the definition: cost AS price * qty. The  expression can be a noncomputed column name, constant, function, variable , and any combination of these connected by one or more operators. The expression cannot be a subquery or include an alias data type.

Okay, this kinda hints at the fact we need to return a value. So a simple search condition does not cut it. Let us try one more time:

CREATE TABLE Test (
Id INT PRIMARY KEY,
Check1 BIT NOT NULL,
Check2 BIT NOT NULL,
CheckAll AS (CASE WHEN Check1 = 1 OR Check2 = 1 THEN 1 ELSE 0 END) -- this time is better! we can roll with it
)

Take it up another notch

The above example works but looks a bit too verbose and explicit. Is there a better way? If we deal with bit columns, we might be in luck and opt for t-sql bitwise operators:

CREATE TABLE Test (
Id INT PRIMARY KEY,
Check1 BIT NOT NULL,
Check2 BIT NOT NULL,
CheckAll AS (Check1|Check2) 
)