Channel: World of Whatever

Variable scoping in TSQL isn't a thing


It's a pop quiz kind of day: run the code through your mental parser.


BEGIN TRY
DECLARE @foo varchar(30) = 'Created in try block';
DECLARE @i int = 1 / 0;
END TRY
BEGIN CATCH
PRINT @foo;
SET @foo = 'Catch found';
END CATCH;

PRINT @foo;
  • It won't compile since @foo goes out of scope for both the catch and the final line
  • It won't compile since @foo goes out of scope for the final line
  • It prints "Created in try block" and then "Catch found"
  • I am too fixated on your form not having a submit button

Crazily enough, the last two are correct. It seems that, unlike every other language I've worked with, T-SQL variables aren't block scoped: every variable shares the same batch-level scope regardless of where in the script it is declared. Demo the first
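
A minimal sketch of my own (not in the original post): a DECLARE inside a block that can never execute still creates the variable for the whole batch; only the assignment is skipped.

IF 1 = 0
BEGIN
    DECLARE @bar varchar(30) = 'Never assigned';
END;

-- Prints "Declared, but NULL" because @bar exists yet was never assigned
PRINT COALESCE(@bar, 'Declared, but NULL');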

Wanna see something even more crazy? Check this version out


BEGIN TRY
DECLARE @i int = 1 / 0;
DECLARE @foo varchar(30) = 'Created in try block';
END TRY
BEGIN CATCH
PRINT @foo;
SET @foo = 'Catch found';
END CATCH;

PRINT @foo;

As above, the scoping of variables remains the same, but the forced divide-by-zero error occurs before the declaration and initialization of our variable @foo. The result? @foo exists, because the DECLARE was still processed when the batch was parsed, but it remains uninitialized since the value assignment never ran, as evidenced by the first PRINT in the CATCH block. Second demo

What's all this mean? SQL's weird.


Rename default constraints

This week I'm dealing with synchronizing tables between environments and it seems that regardless of what tool I'm using for schema compare, it still gets hung up on the differences in the default names for constraints. Rather than fight that battle, I figured it'd greatly simplify my life to systematically rename all my constraints to non-default names. The naming convention I went with is DF__schema name_table name_column name. I know that my schemas/tables/columns don't have spaces or "weird" characters in them, so this works for me. Use this at your own risk, and if you are on pre-2012, the CONCAT call will need to be adjusted to classic string concatenation, a.k.a. +.

DECLARE @query nvarchar(4000);
DECLARE
    CSR CURSOR
FAST_FORWARD
FOR
SELECT
    CONCAT('ALTER TABLE ', QUOTENAME(S.name), '.', QUOTENAME(T.name), ' DROP CONSTRAINT [', DC.name, '];', CHAR(10)
    , 'ALTER TABLE ', QUOTENAME(S.name), '.', QUOTENAME(T.name)
    , ' ADD CONSTRAINT [', 'DF__', (S.name), '_', (T.name), '_', C.name, ']'
    , ' DEFAULT ', DC.definition, ' FOR ', QUOTENAME(C.name)) AS Query
FROM
    sys.schemas AS S
    INNER JOIN
        sys.tables AS T
        ON T.schema_id = S.schema_id
    INNER JOIN
        sys.columns AS C
        ON C.object_id = T.object_id
    INNER JOIN
        sys.default_constraints AS DC
        ON DC.parent_object_id = T.object_id
        AND DC.object_id = C.default_object_id
WHERE
    DC.name LIKE 'DF__%'
    AND DC.name <> CONCAT('DF__', (S.name), '_', (T.name), '_', C.name);

OPEN CSR;
FETCH NEXT FROM CSR INTO @query;
WHILE @@FETCH_STATUS = 0
BEGIN
    BEGIN TRY
        EXECUTE sys.sp_executesql @query, N'';
    END TRY
    BEGIN CATCH
        PRINT ERROR_MESSAGE();
        PRINT @query;
    END CATCH
    FETCH NEXT FROM CSR INTO @query;
END
CLOSE CSR;
DEALLOCATE CSR;
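
For pre-2012 instances, the CONCAT calls would be swapped for classic + concatenation. A rough sketch of my own (not tested against those versions) of just the query portion:

SELECT
    'ALTER TABLE ' + QUOTENAME(S.name) + '.' + QUOTENAME(T.name)
    + ' DROP CONSTRAINT ' + QUOTENAME(DC.name) + ';' + CHAR(10)
    + 'ALTER TABLE ' + QUOTENAME(S.name) + '.' + QUOTENAME(T.name)
    + ' ADD CONSTRAINT [DF__' + S.name + '_' + T.name + '_' + C.name + ']'
    + ' DEFAULT ' + DC.definition + ' FOR ' + QUOTENAME(C.name) AS Query
FROM
    sys.schemas AS S
    INNER JOIN
        sys.tables AS T
        ON T.schema_id = S.schema_id
    INNER JOIN
        sys.columns AS C
        ON C.object_id = T.object_id
    INNER JOIN
        sys.default_constraints AS DC
        ON DC.parent_object_id = T.object_id
        AND DC.object_id = C.default_object_id
WHERE
    DC.name LIKE 'DF__%'
    AND DC.name <> 'DF__' + S.name + '_' + T.name + '_' + C.name;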

Generate TSQL time slices

I had some log data I wanted to bucket into 15-second time slices, and I figured if I've solved this once, I will need to do it again, so to the blog machine! This uses LEAD, TIMEFROMPARTS, and ROW_NUMBER to accomplish the slicing.

SELECT
    D.Slice AS SliceStart
    , LEAD
      (
        D.Slice
        , 1
        -- Default to midnight
        , TIMEFROMPARTS(0, 0, 0, 0, 0)
      ) OVER (ORDER BY D.Slice) AS SliceStop
    , ROW_NUMBER() OVER (ORDER BY D.Slice) AS SliceLabel
FROM
(
    -- Generate 15 second time slices
    SELECT
        TIMEFROMPARTS(A.rn, B.rn, C.rn, 0, 0) AS Slice
    FROM
        (SELECT TOP (24) -1 + ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM sys.all_objects AS AO) AS A(rn)
        CROSS APPLY (SELECT TOP (60) (-1 + ROW_NUMBER() OVER (ORDER BY (SELECT NULL))) FROM sys.all_objects AS AO) AS B(rn)
        -- 4 values since we'll aggregate to 15 seconds
        CROSS APPLY (SELECT TOP (4) (-1 + ROW_NUMBER() OVER (ORDER BY (SELECT NULL))) * 15 FROM sys.all_objects AS AO) AS C(rn)
) AS D;

That looks like a lot, but it really isn't. Starting from the innermost query, we select the top 24 rows from sys.all_objects and use the ROW_NUMBER function to generate a monotonically increasing set of values, 1...24. However, since the allowable range of hours is 0 to 23, I deduct one from each value (A). I repeat this pattern to generate minutes (B), except we take the top 60. Since I want 15-second intervals, for the seconds query I only take the top 4 values. I deduct one so we have {0,1,2,3} and then multiply by 15 to get my increments (C). If you want different time slices, that's the part of the pattern to modify.

Finally, having 3 columns of numbers, I use TIMEFROMPARTS to build a time data type with the least amount of precision, present that as "Slice", and encapsulate it all in a derived table (D). Running that query gets me a list of periods, but I don't know what the end of each period is.

We can calculate the end period by using the LEAD function. I present my original Slice as SliceStart. I then use the LEAD function to fetch the next (1) value based on the Slice column. In the case of 23:59:45, the "next" value in our data set is NULL. To address that scenario, we pass a default value, midnight, as the third argument to LEAD.
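
If you wanted a different slice width, you'd adjust the driver queries accordingly. A minimal sketch of my own (not from the original post) for 5-minute slices: 24 hours crossed with 12 five-minute buckets, with seconds pinned at zero.

-- 5 minute slices: 24 hours x 12 buckets of 5 minutes
SELECT
    TIMEFROMPARTS(A.rn, B.rn, 0, 0, 0) AS Slice
FROM
    (SELECT TOP (24) -1 + ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM sys.all_objects AS AO) AS A(rn)
    CROSS APPLY (SELECT TOP (12) (-1 + ROW_NUMBER() OVER (ORDER BY (SELECT NULL))) * 5 FROM sys.all_objects AS AO) AS B(rn);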

Biml build Database collection nodes


Biml build Database collection nodes aka what good are Annotations

In many of the walkthroughs on creating relational objects via Biml, it seems like people skim over the Databases collection. There's nothing built into the language to really support the creation of database nodes. The import database operations are focused on tables and schemas and assume the database node(s) have been created. I hate assumptions.

Connections.biml

Add a biml file to your project to house your connection information. I'm going to create two connections to the same Adventureworks 2014 database.


<Biml xmlns="http://schemas.varigence.com/biml.xsd">
    <Connections>
        <OleDbConnection Name="Aw2014" ConnectionString="Provider=SQLNCLI11;Server=localhost\dev2016;Initial Catalog=AdventureWorks2014;Integrated Security=SSPI;" />
        <OleDbConnection Name="Aw2014Annotated" ConnectionString="Provider=SQLNCLI11;Server=localhost\dev2016;Initial Catalog=AdventureWorks2014;Integrated Security=SSPI;">
            <Annotations>
                <Annotation Tag="DatabaseName">AdventureWorks2014</Annotation>
                <Annotation Tag="ServerName">localhost\dev2016</Annotation>
                <Annotation Tag="Provider">SQLNCLI11</Annotation>
            </Annotations>
        </OleDbConnection>
    </Connections>
</Biml>

The only material difference between the first and second OleDbConnection node is the declaration of Annotations on the second instance. Annotations are free form entities that allow you to enrich your Biml with more metadata. Here I've duplicated the DatabaseName, ServerName and Provider properties from my connection string. As you'll see in the upcoming BimlScript, prior proper planning prevents poor performance.

Databases.biml

The Databases node is a collection of AstDatabaseNode. They require a Name and a ConnectionName to be valid. Let's look at how we can construct these nodes based on information in our project. The first question I'd ask is what metadata do I readily have available? My Connections collection - that is already built out so I can enumerate through the items in there to populate the value of the ConnectionName. The only remaining item then is the Name for our database. I can see three ways of populating it: parsing the connection string for the database name, instantiating the connection manager and querying the database name from the RDBMS, or pulling the database name from our Annotations collection.

The basic approach would take the form


<Databases>
<#
string databaseName = string.Empty;

foreach(AstOleDbConnectionNode conn in this.RootNode.OleDbConnections)
{
databaseName = "unknown";
// Logic goes here!
#>
<Database Name="<#= conn.Name #>.<#= databaseName#>" ConnectionName="<#= conn.Name #>" />
<#
}
#>
</Databases>

The logic we stuff in there can be as simple or complex as needed but the end result would be a well formed Database node.

Parsing

Connection strings are delimited (semicolon) key value pairs unique to the connection type. In this approach we'll split the connection string by semicolon and then split each resulting entity by the equals sign.


// use string parsing and prayer
try
{
string [] kVpish;
KeyValuePair<string, string> kvp ;
// This is probably the most fragile as it needs to take into consideration all the
// possible connection styles. Get well acquainted with ConnectionStrings.com
// Split our connnection string based on the delimiter of a semicolon
foreach (var element in conn.ConnectionString.Split(new char [] {';'}))
{
kVpish = element.Split(new char[]{'='});
if(kVpish.Count() > 1)
{
kvp = new KeyValuePair<string, string>(kVpish[0], kVpish[1]);
if (String.Compare("Initial Catalog", kvp.Key, StringComparison.InvariantCultureIgnoreCase) == 0)
{
databaseName = kvp.Value;
}
}
}
}
catch (Exception ex)
{
databaseName = string.Format("{0}_{1}", "Error", ex.ToString());
}

The challenge with this approach is its fragility. As ConnectionStrings.com can attest, there are a lot of ways of constructing a connection string, and they don't agree on what the database name property is called.

Query

Another approach would be to instantiate the connection manager and then query the information schema equivalent and ask the database what the database name is. SQL Server makes this easy


// Or we can use database connectivity
// This query would need to have intelligence built into it to issue correct statement
// per database. This works for SQL Server only
string queryDatabaseName = @"SELECT db_name(db_id()) AS CurrentDB;";
System.Data.DataTable dt = null;
dt = ExternalDataAccess.GetDataTable(conn, queryDatabaseName);
foreach (System.Data.DataRow row in dt.Rows)
{
databaseName = row[0].ToString();
}

The downside to this is that, much like parsing the connection string, it's going to be provider specific. Plus, this will be the slowest option due to the cost of instantiating connections, to say nothing of ensuring the build server has all the correct drivers installed.

Annotations

"Prior proper planning prevents poor performance" so let's spend a moment up front to define our metadata and then use Linq to extract that information. Since we can't guarantee that a connection node has an Annotation tag named DatabaseName, we need to test for the existence (Any) and if we find one, we'll extract the value.


if (conn.Annotations.Any(an => an.Tag=="DatabaseName"))
{
// We use the Select method to pull out only the thing, Text, that we are interested in
databaseName = conn.Annotations.Where(an => an.Tag=="DatabaseName").Select(t => t.Text).First().ToString();
}

The downside to this approach is that it requires planning and a bit of double entry, as you need to keep the metadata (annotations) synchronized with the actual connection string. But since we're the automating kind of people, that shouldn't be a problem...

Databases.biml

Putting it all together, our Databases.biml file becomes


<Biml xmlns="http://schemas.varigence.com/biml.xsd">
<#@ template tier="10" #>
<Databases>
<#
foreach(AstOleDbConnectionNode conn in this.RootNode.OleDbConnections)
{
string databaseName = "unknown";

// Test whether the annotation collection contains a tag named DatabaseName
if (conn.Annotations.Any(an => an.Tag=="DatabaseName"))
{
// We use the Select method to pull out only the thing, Text, that we are interested in
databaseName = conn.Annotations.Where(an => an.Tag=="DatabaseName").Select(t => t.Text).First().ToString();
}
else
{
// No annotations found
bool useStringParsing = true;
if (useStringParsing)
{
// use string parsing and prayer
try
{
string [] kVpish;
KeyValuePair<string, string> kvp ;
// This is probably the most fragile as it needs to take into consideration all the
// possible connection styles. Get well acquainted with ConnectionStrings.com
// Split our connnection string based on the delimiter of a semicolon
foreach (var element in conn.ConnectionString.Split(new char [] {';'}))
{
kVpish = element.Split(new char[]{'='});
if(kVpish.Count() > 1)
{
kvp = new KeyValuePair<string, string>(kVpish[0], kVpish[1]);
if (String.Compare("Initial Catalog", kvp.Key, StringComparison.InvariantCultureIgnoreCase) == 0)
{
databaseName = kvp.Value;
}
}
}
}
catch (Exception ex)
{
databaseName = string.Format("{0}_{1}", "Error", ex.ToString());
}
}
else
{
// Or we can use database connectivity
// This query would need to have intelligence built into it to issue correct statement
// per database. This works for SQL Server only
string queryDatabaseName = @"SELECT db_name(db_id()) AS CurrentDB;";
System.Data.DataTable dt = null;
dt = ExternalDataAccess.GetDataTable(conn, queryDatabaseName);
foreach (System.Data.DataRow row in dt.Rows)
{
databaseName = row[0].ToString();
}
}

}

#>
<Database Name="<#= conn.Name #>.<#= databaseName#>" ConnectionName="<#= conn.Name #>" />
<#
}
#>
</Databases>
</Biml>

And that's what it takes to use BimlScript to build out the Database collection nodes based on annotations, string parsing, or database querying. Use Annotations to enrich your Biml objects with good metadata and then you can use them to simplify future operations.

Python pyinstaller erroring with takes 4 positional arguments but 5 were given


Pyinstaller is a program for turning python files into executable programs. This is helpful as it removes the requirement for having the python interpreter installed on a target computer. What was really weird was I could generate a multi-file package (pyinstaller .\MyFile.py) but not a onefile version.

C:\tmp>pyinstaller -onefile .\MyFile.py

Traceback (most recent call last):
File "C:\Program Files (x86)\Python35-32\Scripts\pyinstaller-script.py", line 9, in
load_entry_point('PyInstaller==3.2.1', 'console_scripts', 'pyinstaller')()
File "c:\program files (x86)\python35-32\lib\site-packages\PyInstaller\__main__.py", line 73, in run
args = parser.parse_args(pyi_args)
File "c:\program files (x86)\python35-32\lib\argparse.py", line 1727, in parse_args
args, argv = self.parse_known_args(args, namespace)
File "c:\program files (x86)\python35-32\lib\argparse.py", line 1759, in parse_known_args
namespace, args = self._parse_known_args(args, namespace)
File "c:\program files (x86)\python35-32\lib\argparse.py", line 1967, in _parse_known_args
start_index = consume_optional(start_index)
File "c:\program files (x86)\python35-32\lib\argparse.py", line 1907, in consume_optional
take_action(action, args, option_string)
File "c:\program files (x86)\python35-32\lib\argparse.py", line 1835, in take_action
action(self, namespace, argument_values, option_string)
TypeError: __call__() takes 4 positional arguments but 5 were given

What's the root cause? The argument is --onefile, with two leading hyphens, not -onefile.

SQL Server Query Metadata


SQL Server Query Metadata

Pop quiz: how do you determine the metadata of a query in SQL Server? For a table, you can query sys.schemas/sys.tables/sys.columns, but a query? You might start pulling the query apart and looking up each column and its metadata, but then you have to factor in function calls and suddenly you're writing a parser within your query and you have an infinite recursion error.

But, if you're on SQL Server 2012+, you have a friend in sys.dm_exec_describe_first_result_set.

Let's start with a random query from Glen Berry's diagnostic query set


-- Drive level latency information (Query 28) (Drive Level Latency)
-- Based on code from Jimmy May
SELECT tab.[Drive], tab.volume_mount_point AS [Volume Mount Point],
CASE
WHEN num_of_reads = 0 THEN 0
ELSE (io_stall_read_ms/num_of_reads)
END AS [Read Latency],
CASE
WHEN num_of_writes = 0 THEN 0
ELSE (io_stall_write_ms/num_of_writes)
END AS [Write Latency],
CASE
WHEN (num_of_reads = 0 AND num_of_writes = 0) THEN 0
ELSE (io_stall/(num_of_reads + num_of_writes))
END AS [Overall Latency],
CASE
WHEN num_of_reads = 0 THEN 0
ELSE (num_of_bytes_read/num_of_reads)
END AS [Avg Bytes/Read],
CASE
WHEN num_of_writes = 0 THEN 0
ELSE (num_of_bytes_written/num_of_writes)
END AS [Avg Bytes/Write],
CASE
WHEN (num_of_reads = 0 AND num_of_writes = 0) THEN 0
ELSE ((num_of_bytes_read + num_of_bytes_written)/(num_of_reads + num_of_writes))
END AS [Avg Bytes/Transfer]
FROM (SELECT LEFT(UPPER(mf.physical_name), 2) AS Drive, SUM(num_of_reads) AS num_of_reads,
SUM(io_stall_read_ms) AS io_stall_read_ms, SUM(num_of_writes) AS num_of_writes,
SUM(io_stall_write_ms) AS io_stall_write_ms, SUM(num_of_bytes_read) AS num_of_bytes_read,
SUM(num_of_bytes_written) AS num_of_bytes_written, SUM(io_stall) AS io_stall, vs.volume_mount_point
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
INNER JOIN sys.master_files AS mf WITH (NOLOCK)
ON vfs.database_id = mf.database_id AND vfs.file_id = mf.file_id
CROSS APPLY sys.dm_os_volume_stats(mf.database_id, mf.[file_id]) AS vs
GROUP BY LEFT(UPPER(mf.physical_name), 2), vs.volume_mount_point) AS tab
ORDER BY [Overall Latency] OPTION (RECOMPILE);

Drive | Volume Mount Point | Read Latency | Write Latency | Overall Latency | Avg Bytes/Read | Avg Bytes/Write | Avg Bytes/Transfer
C:C:\00064447449331990

The results of the query aren't exciting, but what are the columns and expected data types? Pre-2012, most people dump the query results into a table with an impossible filter like WHERE 1=2 and then query the above system tables.
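
For reference, a minimal sketch of that pre-2012 pattern, using sys.dm_os_wait_stats as a stand-in for the real query and a throwaway #QueryShape temp table of my own naming:

-- Materialize an empty copy of the result set, then ask tempdb's catalog what came back
SELECT
    *
INTO
    #QueryShape
FROM
    sys.dm_os_wait_stats AS W
WHERE
    1 = 2;

SELECT
    C.name
    , T.name AS type_name
    , C.max_length
    , C.precision
    , C.scale
FROM
    tempdb.sys.columns AS C
    INNER JOIN
        tempdb.sys.types AS T
        ON T.user_type_id = C.user_type_id
WHERE
    C.object_id = OBJECT_ID('tempdb..#QueryShape');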

With the power of SQL Server 2012+, let's see what we can do. I'm going to pass our query in as the first argument and specify NULL for the next two parameters.


SELECT
DEDFRS.column_ordinal
, DEDFRS.name
, DEDFRS.is_nullable
, DEDFRS.system_type_name
, DEDFRS.max_length
, DEDFRS.precision
, DEDFRS.scale
FROM
sys.dm_exec_describe_first_result_set(N'
SELECT tab.[Drive], tab.volume_mount_point AS [Volume Mount Point],
CASE
WHEN num_of_reads = 0 THEN 0
ELSE (io_stall_read_ms/num_of_reads)
END AS [Read Latency],
CASE
WHEN num_of_writes = 0 THEN 0
ELSE (io_stall_write_ms/num_of_writes)
END AS [Write Latency],
CASE
WHEN (num_of_reads = 0 AND num_of_writes = 0) THEN 0
ELSE (io_stall/(num_of_reads + num_of_writes))
END AS [Overall Latency],
CASE
WHEN num_of_reads = 0 THEN 0
ELSE (num_of_bytes_read/num_of_reads)
END AS [Avg Bytes/Read],
CASE
WHEN num_of_writes = 0 THEN 0
ELSE (num_of_bytes_written/num_of_writes)
END AS [Avg Bytes/Write],
CASE
WHEN (num_of_reads = 0 AND num_of_writes = 0) THEN 0
ELSE ((num_of_bytes_read + num_of_bytes_written)/(num_of_reads + num_of_writes))
END AS [Avg Bytes/Transfer]
FROM (SELECT LEFT(UPPER(mf.physical_name), 2) AS Drive, SUM(num_of_reads) AS num_of_reads,
SUM(io_stall_read_ms) AS io_stall_read_ms, SUM(num_of_writes) AS num_of_writes,
SUM(io_stall_write_ms) AS io_stall_write_ms, SUM(num_of_bytes_read) AS num_of_bytes_read,
SUM(num_of_bytes_written) AS num_of_bytes_written, SUM(io_stall) AS io_stall, vs.volume_mount_point
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
INNER JOIN sys.master_files AS mf WITH (NOLOCK)
ON vfs.database_id = mf.database_id AND vfs.file_id = mf.file_id
CROSS APPLY sys.dm_os_volume_stats(mf.database_id, mf.[file_id]) AS vs
GROUP BY LEFT(UPPER(mf.physical_name), 2), vs.volume_mount_point) AS tab
ORDER BY [Overall Latency] OPTION (RECOMPILE);'
, NULL, NULL) AS DEDFRS;

Look at our results. Now you can see the column names from our query, their basic type and whether they're nullable. That's pretty freaking handy.

column_ordinal | name | is_nullable | system_type_name | max_length | precision | scale
1 | Drive | 1 | nvarchar(2) | 4 | 0 | 0
2 | Volume Mount Point | 1 | nvarchar(256) | 512 | 0 | 0
3 | Read Latency | 1 | bigint | 8 | 19 | 0
4 | Write Latency | 1 | bigint | 8 | 19 | 0
5 | Overall Latency | 1 | bigint | 8 | 19 | 0
6 | Avg Bytes/Read | 1 | bigint | 8 | 19 | 0
7 | Avg Bytes/Write | 1 | bigint | 8 | 19 | 0
8 | Avg Bytes/Transfer | 1 | bigint | 8 | 19 | 0

I'm thinking that I can use this technique against an arbitrary source of queries to build out the result tables and then ETL data into them. That should simplify my staging step for table loads. What can you use this for? Add links in the comments showing how you use sys.dm_exec_describe_first_result_set.
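
As a rough sketch of that idea (mine, with a trivial query and a made-up dbo.Staged target), the DMV output folds fairly naturally into a CREATE TABLE statement:

DECLARE @query nvarchar(max) = N'SELECT 100 AS demo, GETDATE() AS LoadDate;';

SELECT
    CONCAT
    (
        'CREATE TABLE dbo.Staged ('
        , STUFF
        (
            (
                SELECT
                    CONCAT(', ', QUOTENAME(DEDFRS.name), ' ', DEDFRS.system_type_name
                        , CASE DEDFRS.is_nullable WHEN 1 THEN ' NULL' ELSE ' NOT NULL' END)
                FROM
                    sys.dm_exec_describe_first_result_set(@query, NULL, NULL) AS DEDFRS
                ORDER BY
                    DEDFRS.column_ordinal
                FOR XML PATH('')
            )
            , 1, 2, ''
        )
        , ');'
    ) AS CreateTableStatement;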

Biml Query Table Builder


Biml Query Table Builder

We previously noted the awesomeness that is SQL Server 2012+'s ability to expose a query's metadata. Let's look at how we can couple that information with creating Biml Table objects.

Prep work

Add a static Biml file to your project that defines an OLE DB connection and then a database and schema, e.g.


<Biml xmlns="http://schemas.varigence.com/biml.xsd">
    <Connections>
        <OleDbConnection Name="msdb" ConnectionString="Provider=SQLNCLI11;Data Source=localhost\dev2016;Integrated Security=SSPI;Initial Catalog=msdb" />
    </Connections>
    <Databases>
        <Database Name="msdb" ConnectionName="msdb" />
    </Databases>
    <Schemas>
        <Schema Name="dbo" DatabaseName="msdb" />
    </Schemas>
</Biml>

Now that we have a database connection named msdb and a valid database and schema, save the file and let's get into the good stuff.

Given the reference query in the previous post, Drive level latency information, we would need to declare the following Biml within our Tables collection.


<Table Name="Query 28" SchemaName="msdb.dbo">
    <Columns>
        <Column Name="Drive" DataType="String" Length="4" Precision="0" Scale="0" IsNullable="true" />
        <Column Name="Volume Mount Point" DataType="String" Length="512" Precision="0" Scale="0" IsNullable="true" />
        <Column Name="Read Latency" DataType="Int64" Length="8" Precision="19" Scale="0" IsNullable="true" />
        <Column Name="Write Latency" DataType="Int64" Length="8" Precision="19" Scale="0" IsNullable="true" />
        <Column Name="Overall Latency" DataType="Int64" Length="8" Precision="19" Scale="0" IsNullable="true" />
        <Column Name="Avg Bytes/Read" DataType="Int64" Length="8" Precision="19" Scale="0" IsNullable="true" />
        <Column Name="Avg Bytes/Write" DataType="Int64" Length="8" Precision="19" Scale="0" IsNullable="true" />
        <Column Name="Avg Bytes/Transfer" DataType="Int64" Length="8" Precision="19" Scale="0" IsNullable="true" />
    </Columns>
</Table>

That could be accomplished purely within the declarative nature of Biml, wherein we do lots of text nuggets <#= "foo" #>, but that's going to be ugly to maintain as there's a lot of conditional logic to muck with. Instead, I'm going to create a C# method that returns the Biml table object (AstTableNode). To do that, we will need to create a Biml class nugget <#+ #>. I ended up creating two methods: GetAstTableNodeFromQuery and a helper method to translate the SQL Server data types into something Biml understood.


<#+
/// <summary>
/// Build out a Biml table based on the supplied query and connection.
/// This assumes a valid SQL Server 2012+ OLEDB Connection is provided but the approach
/// can be adapted based on providers and information schemas.
/// We further assume that column names in the query are unique.
/// </summary>
/// <param name="connection">An OleDbConnection</param>
/// <param name="query">A SQL query</param>
/// <param name="schemaName">The schema our table should be created in</param>
/// <param name="queryName">A name for our query</param>
/// <returns>An AstTableNode built from the query's first result set</returns>
public AstTableNode GetAstTableNodeFromQuery(AstOleDbConnectionNode connection, string query, string schemaName, string queryName)
{
string template = @"SELECT
DEDFRS.name
, DEDFRS.is_nullable
, DEDFRS.system_type_name
, DEDFRS.max_length
, DEDFRS.precision
, DEDFRS.scale
FROM
sys.dm_exec_describe_first_result_set(N'{0}', NULL, NULL) AS DEDFRS ORDER BY DEDFRS.column_ordinal;"
;
AstTableNode atn = null;

atn = new AstTableNode(null);
atn.Name = queryName;
atn.Schema = this.RootNode.Schemas[schemaName];
string queryActual = string.Format(template, query.Replace("'", "''"));

string colName = string.Empty;
string typeText = string.Empty;
System.Data.DbType dbt = DbType.UInt16;
int length = 0;
int precision = 0;
int scale = 0;

try
{
System.Data.DataTable dt = null;
dt = ExternalDataAccess.GetDataTable(connection, queryActual);
foreach (System.Data.DataRow row in dt.Rows)
{
try
{
AstTableColumnBaseNode col = new AstTableColumnNode(atn);
// This can be empty -- see DBCC TRACESTATUS (-1)
if(row[0] == DBNull.Value)
{
atn.Annotations.Add(new AstAnnotationNode(atn){Tag = "Invalid", Text = "No Metadata generated"});
break;
}
else
{
colName = row[0].ToString();
}

typeText = row[2].ToString();
dbt = TranslateSqlServerTypes(row[2].ToString());
length = int.Parse(row[3].ToString());
precision = int.Parse(row[4].ToString());
scale = int.Parse(row[5].ToString());

col.Name = colName;
col.IsNullable = (bool)row[1];
col.DataType = dbt;
col.Length = length;
col.Precision = precision;
col.Scale = scale;

atn.Columns.Add(col);
}
catch (Exception ex)
{
// Something went awry with making a column for our table
AstTableColumnBaseNode col = new AstTableColumnNode(atn);
col.Name = "FailureColumn";
col.Annotations.Add(new AstAnnotationNode(col){Tag = "colName", Text = colName});
col.Annotations.Add(new AstAnnotationNode(col){Tag = "typeText", Text = typeText});
col.Annotations.Add(new AstAnnotationNode(col){Tag = "dbtype", Text = dbt.ToString()});
col.Annotations.Add(new AstAnnotationNode(col){Tag = "Error", Text = ex.Message});
col.Annotations.Add(new AstAnnotationNode(col){Tag = "Stack", Text = ex.StackTrace});
atn.Columns.Add(col);
}
}
}
catch (Exception ex)
{
// Table level failures
AstTableColumnBaseNode col = new AstTableColumnNode(atn);
col.Name = "Failure";
col.Annotations.Add(new AstAnnotationNode(col){Tag = "Error", Text = ex.ToString()});
col.Annotations.Add(new AstAnnotationNode(col){Tag = "SourceQuery", Text = query});
col.Annotations.Add(new AstAnnotationNode(col){Tag = "QueryActual", Text = queryActual});
atn.Columns.Add(col);
}
return atn;
}

/// <summary>
/// A rudimentary method to convert SQL Server data types to Biml types. Doesn't cover
/// UDDT, sql_variant(well)
/// </summary>
/// <param name="typeName">Data type with optional length/scale/precision</param>
/// <returns>Best approximation of a SQL Server data type</returns>
public DbType TranslateSqlServerTypes(string typeName)
{
// typeName might contain length - strip it
string fixedName = typeName;
if(typeName.Contains("("))
{
fixedName = typeName.Substring(0, typeName.IndexOf("("));
}
// Approximate translation of https://msdn.microsoft.com/en-us/library/System.Data.DbType.aspx
// https://docs.microsoft.com/en-us/dotnet/framework/data/adonet/sql-server-data-type-mappings
Dictionary<string, DbType> translate = new Dictionary<string, DbType> {

{"bigint", DbType.Int64 }
, {"binary", DbType.Binary }
, {"bit", DbType.Boolean }
, {"char", DbType.AnsiStringFixedLength }
, {"date", DbType.Date }
, {"datetime", DbType.DateTime }
, {"datetime2", DbType.DateTime2 }
, {"datetimeoffset", DbType.DateTimeOffset }
, {"decimal", DbType.Decimal }
, {"float", DbType.Double }
//, {"geography",
//, {"geometry",
//, {"hierarchyid",
, {"image", DbType.Binary }
, {"int", DbType.Int32 }
, {"money", DbType.Decimal }
, {"nchar", DbType.StringFixedLength }
, {"ntext", DbType.String }
, {"numeric", DbType.Decimal }
, {"nvarchar", DbType.String }
, {"real", DbType.Single }
, {"smalldatetime", DbType.DateTime }
, {"smallint", DbType.Int16 }
, {"smallmoney", DbType.Decimal }
, {"sql_variant", DbType.Object }
, {"sysname", DbType.String }
, {"text", DbType.String }
, {"time", DbType.Time }
, {"timestamp", DbType.Binary }
, {"tinyint", DbType.Byte }
, {"uniqueidentifier", DbType.Guid }
, {"varbinary", DbType.Binary }
, {"varchar", DbType.AnsiString }
, {"xml", DbType.Xml }
};

try
{
return translate[fixedName];
}
catch
{
return System.Data.DbType.UInt64;
}
}
#>

Good grief, that's a lot of code, how do I use it? The basic usage would be something like


<Tables>
<#= GetAstTableNodeFromQuery(this.RootNode.OleDbConnections["msdb"], "SELECT 100 AS demo", "dbo", "DemoQuery").GetBiml() #>
</Tables>

The call to GetAstTableNodeFromQuery returns an AstTableNode, which is great, but what we really want is the Biml behind it, so we chain a call to .GetBiml() onto the end.

What would make that better, though, is to make it a little more dynamic. Let's improve the code to create tables based on pairs of names and queries. I'm going to use a Dictionary called namedQueries to hold the names and queries and then enumerate through them, calling our GetAstTableNodeFromQuery for each entry.


<#
Dictionary<string, string> namedQueries = new Dictionary<string,string>{{"Query 28", @"-- Drive level latency information (Query 28) (Drive Level Latency)
-- Based on code from Jimmy May
SELECT tab.[Drive], tab.volume_mount_point AS [Volume Mount Point],
CASE
WHEN num_of_reads = 0 THEN 0
ELSE (io_stall_read_ms/num_of_reads)
END AS [Read Latency],
CASE
WHEN num_of_writes = 0 THEN 0
ELSE (io_stall_write_ms/num_of_writes)
END AS [Write Latency],
CASE
WHEN (num_of_reads = 0 AND num_of_writes = 0) THEN 0
ELSE (io_stall/(num_of_reads + num_of_writes))
END AS [Overall Latency],
CASE
WHEN num_of_reads = 0 THEN 0
ELSE (num_of_bytes_read/num_of_reads)
END AS [Avg Bytes/Read],
CASE
WHEN num_of_writes = 0 THEN 0
ELSE (num_of_bytes_written/num_of_writes)
END AS [Avg Bytes/Write],
CASE
WHEN (num_of_reads = 0 AND num_of_writes = 0) THEN 0
ELSE ((num_of_bytes_read + num_of_bytes_written)/(num_of_reads + num_of_writes))
END AS [Avg Bytes/Transfer]
FROM (SELECT LEFT(UPPER(mf.physical_name), 2) AS Drive, SUM(num_of_reads) AS num_of_reads,
SUM(io_stall_read_ms) AS io_stall_read_ms, SUM(num_of_writes) AS num_of_writes,
SUM(io_stall_write_ms) AS io_stall_write_ms, SUM(num_of_bytes_read) AS num_of_bytes_read,
SUM(num_of_bytes_written) AS num_of_bytes_written, SUM(io_stall) AS io_stall, vs.volume_mount_point
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
INNER JOIN sys.master_files AS mf WITH (NOLOCK)
ON vfs.database_id = mf.database_id AND vfs.file_id = mf.file_id
CROSS APPLY sys.dm_os_volume_stats(mf.database_id, mf.[file_id]) AS vs
GROUP BY LEFT(UPPER(mf.physical_name), 2), vs.volume_mount_point) AS tab
ORDER BY [Overall Latency] OPTION (RECOMPILE);"
}};
#>
<Tables>
<# foreach(var kvp in namedQueries){ #>
<#= GetAstTableNodeFromQuery(this.RootNode.OleDbConnections["msdb"], kvp.Value, "dbo", kvp.Key).GetBiml() #>
<# } #>
</Tables>

How can we improve this? Let's get rid of the hard coded query names and actual queries. Tune in to the next installment to see how we'll make that work.

Full code is over on github

Broken View Finder


Broken View Finder

Shh, shhhhhh, we're being very very quiet, we're hunting broken views. Recently, we were asked to migrate some code changes and after doing so, the requesting team told us we had broken all of their views, but they couldn't tell us what was broken, just that everything was. After a quick rollback to snapshot, thank you Red Gate SQL Compare, I thought it'd be enlightening to see whether anything was broken before our code had been deployed.

You'll never guess what we discovered </clickbait>

How can you tell a view is broken

The easiest way is SELECT TOP 1 * FROM dbo.MyView; but then you need to figure out all of your views.

That's easy enough, SELECT * FROM sys.schemas AS S INNER JOIN sys.views AS V ON V.schema_id = S.schema_id;
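
Putting those two together, the brute-force test would look something like this rough sketch of mine (not from the post): generate a probe query per view, then execute each one and catch the failures.

SELECT
    CONCAT('SELECT TOP 1 * FROM ', QUOTENAME(S.name), '.', QUOTENAME(V.name), ';') AS ProbeQuery
FROM
    sys.views AS V
    INNER JOIN
        sys.schemas AS S
        ON S.schema_id = V.schema_id;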

But you know, there's something built into SQL Server that will actually test your views - sys.sp_refreshview. That's much cleaner than running sys.sp_executesql with our SELECT TOP 1s


-- This script identifies broken views
-- and at least the first error with it
SET NOCOUNT ON;
DECLARE
CSR CURSOR
FAST_FORWARD
FOR
SELECT
CONCAT(QUOTENAME(S.name), '.', QUOTENAME(V.name)) AS vname
FROM
sys.views AS V
INNER JOIN
sys.schemas AS S
ON S.schema_id = V.schema_id;

DECLARE
@viewname nvarchar(776);
DECLARE
@BROKENVIEWS table
(
viewname nvarchar(776)
, ErrorMessage nvarchar(4000)
, ErrorLine int
);

OPEN
CSR;
FETCH
NEXT FROM CSR INTO @viewname;

WHILE
@@FETCH_STATUS = 0
BEGIN

BEGIN TRY
EXECUTE sys.sp_refreshview
@viewname;
END TRY
BEGIN CATCH
INSERT INTO @BROKENVIEWS(viewname, ErrorMessage, ErrorLine)
VALUES
(
@viewname
, ERROR_MESSAGE()
, ERROR_LINE()
);

END CATCH

FETCH
NEXT FROM CSR INTO @viewname;
END

CLOSE CSR;
DEALLOCATE CSR;

SELECT
B.*
FROM
@BROKENVIEWS AS B

Can you think of ways to improve this? Either way, happy hunting!


Temporal table maker


Temporal table maker

This post is another in the continuing theme of "making things consistent." We were voluntold to help another team get their staging environment set up. Piece of cake, SQL Compare made it trivial to snap the tables over.

Oh, we don't want these tables in Custom schema, we want them in dbo. No problem, SQL Compare again and change owner mappings and bam, out come all the tables.

Oh, can we get this in near real-time? Say every 15 minutes. ... Transaction replication to the rescue!

Oh, we don't know what data we need yet so could you keep it all, forever? ... Temporal tables to the rescue?

Yes, temporal tables is perfect. But don't put the history table in the same schema as the table, put in this one. And put all of that in its own file group.

And that's what this script does. It

  • generates a table definition for an existing table, copying it into a new schema while also adding in the start/stop columns for temporal tables.
  • creates the clustered columnstore index command
  • creates a non-clustered index against the start/stop columns and the natural key(s)
  • Alters the original table to add in our start/stop columns with defaults and the period
  • Alters the original table to turn on versioning

    How does it do all that? It finds all the tables that exist in our source schema and don't yet exist in the target schema. I build out a SELECT * query against each table and feed it into sys.dm_exec_describe_first_result_set to identify the columns. And since sys.dm_exec_describe_first_result_set so nicely brings back the data type with length, precision, and scale specified, we might as well use that as well. By specifying a value of 1 for the browse_information_mode parameter, we get the key columns defined for us, which is handy when we want to make our non-clustered index.
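
    A quick sketch of my own showing what browse mode 1 adds; is_part_of_unique_key is what the non-clustered index builder below leans on.

    SELECT
        DEDFRS.name
        , DEDFRS.system_type_name
        , DEDFRS.is_part_of_unique_key
    FROM
        sys.dm_exec_describe_first_result_set(N'SELECT * FROM msdb.dbo.sysjobs;', N'', 1) AS DEDFRS;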


    DECLARE
    @query nvarchar(4000)
    , @targetSchema sysname = 'dbo_HISTORY'
    , @tableName sysname
    , @targetFileGroup sysname = 'History'

    DECLARE
    CSR CURSOR
    FAST_FORWARD
    FOR
    SELECT ALL
    CONCAT(
    'SELECT * FROM '
    , s.name
    , '.'
    , t.name)
    , t.name
    FROM
    sys.schemas AS S
    INNER JOIN sys.tables AS T
    ON T.schema_id = S.schema_id
    WHERE
    1=1
    AND S.name = 'dbo'
    AND T.name NOT IN
    (SELECT TI.name FROM sys.schemas AS SI INNER JOIN sys.tables AS TI ON TI.schema_id = SI.schema_id WHERE SI.name = @targetSchema)

    ;
    OPEN CSR;
    FETCH NEXT FROM CSR INTO @query, @tableName;
    WHILE @@FETCH_STATUS = 0
    BEGIN
    -- do something
    SELECT
    CONCAT
    (
    'CREATE TABLE '
    , @targetSchema
    , '.'
    , @tableName
    , '('
    , STUFF
    (
    (
    SELECT
    CONCAT
    (
    ','
    , DEDFRS.name
    , ' '
    , DEDFRS.system_type_name
    , ' '
    , CASE DEDFRS.is_nullable
        WHEN 1 THEN ''
        ELSE 'NOT '
      END
    , 'NULL'
    )
    FROM
    sys.dm_exec_describe_first_result_set(@query, N'', 1) AS DEDFRS
    ORDER BY
    DEDFRS.column_ordinal
    FOR XML PATH('')
    )
    , 1
    , 1
    , ''
    )
    , ', SysStartTime datetime2(7) NOT NULL'
    , ', SysEndTime datetime2(7) NOT NULL'
    , ')'
    , ' ON '
    , @targetFileGroup
    , ';'
    , CHAR(13)
    , 'CREATE CLUSTERED COLUMNSTORE INDEX CCI_'
    , @targetSchema
    , '_'
    , @tableName
    , ' ON '
    , @targetSchema
    , '.'
    , @tableName
    , ' ON '
    , @targetFileGroup
    , ';'
    , CHAR(13)
    , 'CREATE NONCLUSTERED INDEX IX_'
    , @targetSchema
    , '_'
    , @tableName
    , '_PERIOD_COLUMNS '
    , ' ON '
    , @targetSchema
    , '.'
    , @tableName

    , '('
    , 'SysEndTime'
    , ',SysStartTime'
    , (
    SELECT
    CONCAT
    (
    ','
    , DEDFRS.name
    )
    FROM
    sys.dm_exec_describe_first_result_set(@query, N'', 1) AS DEDFRS
    WHERE
    DEDFRS.is_part_of_unique_key = 1
    ORDER BY
    DEDFRS.column_ordinal
    FOR XML PATH('')
    )
    , ')'
    , ' ON '
    , @targetFileGroup
    , ';'
    , CHAR(13)
    , 'ALTER TABLE '
    , 'dbo'
    , '.'
    , @tableName
    , ' ADD '
    , 'SysStartTime datetime2(7) GENERATED ALWAYS AS ROW START HIDDEN'
    , ' CONSTRAINT DF_'
    , 'dbo_'
    , @tableName
    , '_SysStartTime DEFAULT SYSUTCDATETIME()'
    , ', SysEndTime datetime2(7) GENERATED ALWAYS AS ROW END HIDDEN'
    , ' CONSTRAINT DF_'
    , 'dbo_'
    , @tableName
    , '_SysEndTime DEFAULT DATETIME2FROMPARTS(9999, 12, 31, 23,59, 59,9999999,7)'
    , ', PERIOD FOR SYSTEM_TIME (SysStartTime, SysEndTime);'
    , CHAR(13)
    , 'ALTER TABLE '
    , 'dbo'
    , '.'
    , @tableName
    , ' SET (SYSTEM_VERSIONING = ON (HISTORY_TABLE = '
    , @targetSchema
    , '.'
    , @tableName
    , '));'

    )

    FETCH NEXT FROM CSR INTO @query, @tableName;
    END
    CLOSE CSR;
    DEALLOCATE CSR;

    Lessons learned

    The examples I cobbled together from MSDN were great, until they weren't. Be wary of anyone who doesn't specify lengths - one example used datetime2 for the start/stop columns, the other specified datetime2(0). The default precision with datetime2 is 7, which is very much not 0. Those data type differences made the temporal table and its history table incompatible.

    Cleaning up from that mess was ugly. I couldn't drop the start/stop columns until I dropped the PERIOD. One doesn't drop a PERIOD though; one has to DROP PERIOD FOR SYSTEM_TIME.
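
    A sketch of my own of the cleanup order that works; dbo.MyTable, its dbo_HISTORY counterpart, and the constraint names stand in for the real objects.

    -- Versioning off first, then the period, then the columns and their defaults
    ALTER TABLE dbo.MyTable SET (SYSTEM_VERSIONING = OFF);
    ALTER TABLE dbo.MyTable DROP PERIOD FOR SYSTEM_TIME;
    ALTER TABLE dbo.MyTable DROP CONSTRAINT DF_dbo_MyTable_SysStartTime;
    ALTER TABLE dbo.MyTable DROP CONSTRAINT DF_dbo_MyTable_SysEndTime;
    ALTER TABLE dbo.MyTable DROP COLUMN SysStartTime, SysEndTime;
    DROP TABLE dbo_HISTORY.MyTable;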

    I prefer to use the *FromParts methods where I can, so that's in my default instead of casting strings. Out, ambiguity of internationalization!

    This doesn't account for tables with bad names or tables without primary/unique keys defined. My domain was clean, so beware of treating this as a general purpose temporal table maker.

    Improvements

    How can you make this better? My hard coded dbo should have been abstracted out to a @sourceSchema variable. I should have used QUOTENAME for all my entity names. I could have stuffed all those commands into a table or invoked them directly with an sp_executesql call. I should have abused CONCAT more. Wait, that's done. That's very well done.

    Finally, you are responsible for the results of this script. Don't run it anywhere without evaluating and understanding the consequences.

What's my transaction isolation level


    What's my transaction isolation level

    That's an easy question to answer - StackOverflow has a fine answer.

    But what if I use sp_executesql to run some dynamic SQL - does it default to the connection's isolation level? If I change the isolation level within the query, does it propagate back to the invoker? That's a great question, William. Let's find out.


    SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

    SELECT CASE transaction_isolation_level
    WHEN 0 THEN 'Unspecified'
    WHEN 1 THEN 'ReadUncommitted'
    WHEN 2 THEN 'ReadCommitted'
    WHEN 3 THEN 'Repeatable'
    WHEN 4 THEN 'Serializable'
    WHEN 5 THEN 'Snapshot' END AS TRANSACTION_ISOLATION_LEVEL
    FROM sys.dm_exec_sessions
    where session_id = @@SPID;

    DECLARE
    @query nvarchar(max) = N'-- Identify iso level
    SELECT CASE transaction_isolation_level
    WHEN 0 THEN ''Unspecified''
    WHEN 1 THEN ''ReadUncommitted''
    WHEN 2 THEN ''ReadCommitted''
    WHEN 3 THEN ''Repeatable''
    WHEN 4 THEN ''Serializable''
    WHEN 5 THEN ''Snapshot'' END AS TRANSACTION_ISOLATION_LEVEL
    FROM sys.dm_exec_sessions
    where session_id = @@SPID;

    SET TRANSACTION ISOLATION LEVEL READ COMMITTED;

    -- Test iso level
    SELECT CASE transaction_isolation_level
    WHEN 0 THEN ''Unspecified''
    WHEN 1 THEN ''ReadUncommitted''
    WHEN 2 THEN ''ReadCommitted''
    WHEN 3 THEN ''Repeatable''
    WHEN 4 THEN ''Serializable''
    WHEN 5 THEN ''Snapshot'' END AS TRANSACTION_ISOLATION_LEVEL
    FROM sys.dm_exec_sessions
    where session_id = @@SPID'

    EXECUTE sys.sp_executesql @query, N'';

    SELECT CASE transaction_isolation_level
    WHEN 0 THEN 'Unspecified'
    WHEN 1 THEN 'ReadUncommitted'
    WHEN 2 THEN 'ReadCommitted'
    WHEN 3 THEN 'Repeatable'
    WHEN 4 THEN 'Serializable'
    WHEN 5 THEN 'Snapshot' END AS TRANSACTION_ISOLATION_LEVEL
    FROM sys.dm_exec_sessions
    where session_id = @@SPID;

    I begin my session in read uncommitted, a.k.a. "nolock". I then run dynamic SQL which identifies my isolation level (still read uncommitted), changes it to a different level (confirmed at read committed), and then exits; checking my final state, I'm back at read uncommitted.

    Finally, thanks to Andrew Kelly (b|t) for answering the #sqlhelp call.

    Python Azure Function requestor's IP address


    Python Azure Function requestor's IP address

    I'm working on an anonymous-level Azure Function in python and couldn't find where the IP address of the caller, if applicable, is stored. It's in the request headers, which makes sense, but I only figured that out after spending far too much time looking in all the wrong places. A minimal reproduction would look something like


    import os
    iptag = "REQ_HEADERS_X-FORWARDED-FOR"
    ip = "Tag name:{} Tag value:{}".format(iptag, os.environ[iptag])
    print(ip)

    Now, something to note is that it will return not only the IP address but the port the call came in through. Thus, I see a value of 192.168.1.200:33496 instead of just the ipv4 value.

    Knowing where to look, I can see that the heavy lifting had already been done by the most excellent HTTPHelper but as a wise man once said: knowing is half the battle.


    import os
    from AzureHTTPHelper import HTTPHelper
    http = HTTPHelper()
    #Notice the lower casing of properties here and the trimming of the type (REQ_HEADERS)
    iptag = "x-forwarded-for"
    ip = "Tag name:{} Tag value:{}".format(iptag, http.headers[iptag])
    print(ip)

    Yo Joe!


    Staging Metadata Framework for the Unknown


    Staging metadata framework for the unknown

    That's a terrible title but it's the best I got. A client would like to report out of ServiceNow some metrics not readily available in the PowerBI App. The first time I connected, I got a quick look at the Incidents and some of the data we'd be interested in but I have no idea how that data changes over time. When you first open a ticket, maybe it doesn't have a resolved date or a caused by field populated. And since this is all web service stuff and you can customize it, I knew I was looking at lots of iterations to try and keep up with all the data coming back from the service. How can I handle this and keep sane? Those were my two goals. I thought it'd be fun to share how I solved the problem using features in SQL Server 2016.

    To begin, I created a database called RTMA to perform my real-time metrics analysis:

    CREATE DATABASE RTMA;

    With that done, I created a schema within my database:

    USE RTMA;
    GO
    CREATE SCHEMA ServiceNow AUTHORIZATION dbo;

    Next, we need a table to hold our discovery metadata.


    CREATE TABLE
    ServiceNow.ColumnSizing
    (
    EntityName varchar(30) NOT NULL
    , CollectionName varchar(30) NOT NULL
    , ColumnName varchar(30) NOT NULL
    , ColumnLength int NOT NULL
    , InsertDate datetime NOT NULL
    CONSTRAINT DF_ServiceNow_ColumnSizing_InsertDate DEFAULT (GETDATE())
    );

    CREATE CLUSTERED COLUMNSTORE INDEX
    CCI_ServiceNow_ColumnSizing
    ON ServiceNow.ColumnSizing;
    The idea for this metadata table is that we'll just keep adding more information in for the entities we survey. All that matters is the largest length for a given combination of Entity, Collection, and Column.

    In the following demo, we'll add 2 rows into our table. The first batch will be our initial sizing and then "something" happens and we discover the size has increased.


    INSERT INTO
    ServiceNow.ColumnSizing
    (
    EntityName
    , CollectionName
    , ColumnName
    , ColumnLength
    , InsertDate
    )
    VALUES
    ('DoesNotExist', 'records', 'ABC', 10, current_timestamp)
    , ('DoesNotExist', 'records', 'BCD', 30, current_timestamp);

    Create a base table for our DoesNotExist. What columns will be available? I know I'll want my InsertDate and that's the only thing I'll guarantee to begin. And that's ok because we're going to get clever.


    DECLARE @entity nvarchar(30) = N'DoesNotExist'
    -- <Entity/> and <Columns/> are placeholder tokens swapped out by the REPLACE calls below
    , @Template nvarchar(max) = N'DROP TABLE IF EXISTS ServiceNow.Stage<Entity/>;
    CREATE TABLE
    ServiceNow.Stage<Entity/>
    (
    <Columns/>
    InsertDate datetime CONSTRAINT DF_ServiceNow_Stage<Entity/>_InsertDate DEFAULT (GETDATE())
    );
    CREATE CLUSTERED COLUMNSTORE INDEX
    CCI_ServiceNow_Stage<Entity/>
    ON
    ServiceNow.Stage<Entity/>;'
    , @Columns nvarchar(max) = N'';

    DECLARE @Query nvarchar(max) = REPLACE(REPLACE(@Template, '<Entity/>', @Entity), '<Columns/>', @Columns);
    EXECUTE sys.sp_executesql @Query, N'';

    We now have a table with one column so let's look at using our synthetic metadata (ColumnSizing) to augment it. The important thing to understand in the next block of code is that we'll use FOR XML PATH('') to concatenate rows together and the CONCAT function to concatenate values together.

    See more here for the XML PATH "trick"

    If we're going to define columns for a table, it follows that we need to know what table needs what columns and what size those columns should be. So, let the following block be that definition.


    DECLARE @Entity varchar(30) = 'DoesNotExist';

    SELECT
    CS.EntityName
    , CS.CollectionName
    , CS.ColumnName
    , MAX(CS.ColumnLength) AS ColumnLength
    FROM
    ServiceNow.ColumnSizing AS CS
    WHERE
    CS.ColumnLength > 0
    AND CS.ColumnLength =
    (
    SELECT
    MAX(CSI.ColumnLength) AS ColumnLength
    FROM
    ServiceNow.ColumnSizing AS CSI
    WHERE
    CSI.EntityName = CS.EntityName
    AND CSI.ColumnName = CS.ColumnName
    )
    AND CS.EntityName = @Entity
    GROUP BY
    CS.EntityName
    , CS.CollectionName
    , CS.ColumnName;

    We run the above query and that looks like what we want so into the FOR XML machine it goes.

    DECLARE @Entity varchar(30) = 'DoesNotExist'
    , @ColumnSizeDeclaration varchar(max);

    ;WITH BASE_DATA AS
    (
    -- Define the base data we'll use to drive creation
    SELECT
    CS.EntityName
    , CS.CollectionName
    , CS.ColumnName
    , MAX(CS.ColumnLength) AS ColumnLength
    FROM
    ServiceNow.ColumnSizing AS CS
    WHERE
    CS.ColumnLength > 0
    AND CS.ColumnLength =
    (
    SELECT
    MAX(CSI.ColumnLength) AS ColumnLength
    FROM
    ServiceNow.ColumnSizing AS CSI
    WHERE
    CSI.EntityName = CS.EntityName
    AND CSI.ColumnName = CS.ColumnName
    )
    AND CS.EntityName = @Entity
    GROUP BY
    CS.EntityName
    , CS.CollectionName
    , CS.ColumnName
    )
    SELECT DISTINCT
    BD.EntityName
    , (
    SELECT
    CONCAT
    (
    ''
    , BDI.ColumnName
    , ' varchar('
    , BDI.ColumnLength
    , '),'
    )
    FROM
    BASE_DATA AS BDI
    WHERE
    BDI.EntityName = BD.EntityName
    AND BDI.CollectionName = BD.CollectionName
    FOR XML PATH('')
    ) AS ColumnSizeDeclaration
    FROM
    BASE_DATA AS BD;

    That looks like a lot, but it's not. Run it and you'll see we get one row with two elements: "DoesNotExist" and "ABC varchar(10),BCD varchar(30),". That trailing comma is going to be a problem; that's generally why you see people either use a leading delimiter and STUFF to remove it or, in the case of a trailing delimiter, LEFT with LEN - 1.
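
    For reference, a small sketch of my own of those two usual cleanups:

    DECLARE @leading varchar(100) = ',ABC varchar(10),BCD varchar(30)';
    DECLARE @trailing varchar(100) = 'ABC varchar(10),BCD varchar(30),';

    -- Leading delimiter: STUFF replaces the first character with an empty string
    SELECT STUFF(@leading, 1, 1, '') AS CleanedLeading;

    -- Trailing delimiter: LEFT keeps everything but the final character
    SELECT LEFT(@trailing, LEN(@trailing) - 1) AS CleanedTrailing;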

    But we're clever and don't need such tricks. If you look at the declaration for @Template, we assume there will *always* be a final column of InsertDate, which doesn't have a comma preceding it. Always define the rules to favor yourself. ;)

    Instead of the static table declaration we used, let's marry our common table expression, CTE, with the table template.


    DECLARE @entity nvarchar(30) = N'DoesNotExist'
    -- <Entity/> and <Columns/> are placeholder tokens swapped out by the REPLACE calls below
    , @Template nvarchar(max) = N'DROP TABLE IF EXISTS ServiceNow.Stage<Entity/>;
    CREATE TABLE
    ServiceNow.Stage<Entity/>
    (
    <Columns/>
    InsertDate datetime CONSTRAINT DF_ServiceNow_Stage<Entity/>_InsertDate DEFAULT (GETDATE())
    );
    CREATE CLUSTERED COLUMNSTORE INDEX
    CCI_ServiceNow_Stage<Entity/>
    ON
    ServiceNow.Stage<Entity/>;'
    , @Columns nvarchar(max) = N'';

    -- CTE logic patched in here

    ;WITH BASE_DATA AS
    (
    -- Define the base data we'll use to drive creation
    SELECT
    CS.EntityName
    , CS.CollectionName
    , CS.ColumnName
    , MAX(CS.ColumnLength) AS ColumnLength
    FROM
    ServiceNow.ColumnSizing AS CS
    WHERE
    CS.ColumnLength > 0
    AND CS.ColumnLength =
    (
    SELECT
    MAX(CSI.ColumnLength) AS ColumnLength
    FROM
    ServiceNow.ColumnSizing AS CSI
    WHERE
    CSI.EntityName = CS.EntityName
    AND CSI.ColumnName = CS.ColumnName
    )
    AND CS.EntityName = @Entity
    GROUP BY
    CS.EntityName
    , CS.CollectionName
    , CS.ColumnName
    )
    SELECT DISTINCT
    @Columns = (
    SELECT
    CONCAT
    (
    ''
    , BDI.ColumnName
    , ' varchar('
    , BDI.ColumnLength
    , '),'
    )
    FROM
    BASE_DATA AS BDI
    WHERE
    BDI.EntityName = BD.EntityName
    AND BDI.CollectionName = BD.CollectionName
    FOR XML PATH('')
    )
    FROM
    BASE_DATA AS BD;

    DECLARE @Query nvarchar(max) = REPLACE(REPLACE(@Template, '<Entity/>', @Entity), '<Columns/>', @Columns);
    EXECUTE sys.sp_executesql @Query, N'';

    Bam, look at it now. We took advantage of the new DROP IF EXISTS (DIE) syntax to drop our table and we've redeclared it, nice as can be. Don't take my word for it though, ask the system tables what they see.


    SELECT
    S.name AS SchemaName
    , T.name AS TableName
    , C.name AS ColumnName
    , T2.name AS DataTypeName
    , C.max_length
    FROM
    sys.schemas AS S
    INNER JOIN
    sys.tables AS T
    ON T.schema_id = S.schema_id
    INNER JOIN
    sys.columns AS C
    ON C.object_id = T.object_id
    INNER JOIN
    sys.types AS T2
    ON T2.user_type_id = C.user_type_id
    WHERE
    S.name = 'ServiceNow'
    AND T.name = 'StageDoesNotExist'
    ORDER BY
    S.name
    , T.name
    , C.column_id;
    Excellent, we now turn on the actual data storage process and voila, we get a value stored into our table. Simulate it with the following.

    INSERT INTO ServiceNow.StageDoesNotExist
    (ABC, BCD) VALUES ('Important', 'Very, very important');
    Truly, all is well and good.

    *time passes*

    Then, this happens


    WAITFOR DELAY ('00:00:03');

    INSERT INTO
    ServiceNow.ColumnSizing
    (
    EntityName
    , CollectionName
    , ColumnName
    , ColumnLength
    , InsertDate
    )
    VALUES
    ('DoesNotExist', 'records', 'BCD', 34, current_timestamp);
    Followed by

    INSERT INTO ServiceNow.StageDoesNotExist
    (ABC, BCD) VALUES ('Important','Very important, yet ephemeral data');
    To quote Dr. Beckett: Oh boy

    What are all the functions and their parameters?


    What are all the functions and their parameters?

    File this one under: I wrote it once, may I never need it again

    In my ever-expanding quest for gathering all the metadata, how could I determine the metadata for all my table valued functions? No problem, that's what sys.dm_exec_describe_first_result_set is for: SELECT * FROM sys.dm_exec_describe_first_result_set(N'SELECT * FROM dbo.foo(@xmlMessage)', N'@xmlMessage nvarchar(max)', 1) AS DEDFRS

    Except, I need to know parameters. And I need to know parameter types. And order. Fortunately, sys.parameters and sys.types make this easy. The only ugliness comes from the double invocation of the row rollups.



    SELECT
    CONCAT
    (
    ''
    , 'SELECT * FROM '
    , QUOTENAME(S.name)
    , '.'
    , QUOTENAME(O.name)
    , '('
    -- Parameters here without type
    , STUFF
    (
    (
    SELECT
    CONCAT
    (
    ''
    , ','
    , P.name
    , ''
    )
    FROM
    sys.parameters AS P
    WHERE
    P.is_output = CAST(0 AS bit)
    AND P.object_id = O.object_id
    ORDER BY
    P.parameter_id
    FOR XML PATH('')
    )
    , 1
    , 1
    , ''
    )

    , ') AS F;'
    ) AS SourceQuery
    , (
    STUFF
    (
    (
    SELECT
    CONCAT
    (
    ''
    , ','
    , P.name
    , ''
    , CASE
    WHEN T2.name LIKE '%char' THEN CONCAT(T2.name, '(', CASE P.max_length WHEN -1 THEN 'max' ELSE CAST(P.max_length AS varchar(4)) END, ')')
    WHEN T2.name = 'time' OR T2.name ='datetime2' THEN CONCAT(T2.name, '(', P.scale, ')')
    WHEN T2.name = 'numeric' THEN CONCAT(T2.name, '(', P.precision, ',', P.scale, ')')
    ELSE T2.name
    END
    )
    FROM
    sys.parameters AS P
    INNER JOIN
    sys.types AS T2
    ON T2.user_type_id = P.user_type_id
    WHERE
    P.is_output = CAST(0 AS bit)
    AND P.object_id = O.object_id
    ORDER BY
    P.parameter_id
    FOR XML PATH('')
    )
    , 1
    , 1
    , ''
    )
    ) AS ParameterList
    FROM
    sys.schemas AS S
    INNER JOIN
    sys.objects AS O
    ON O.schema_id = S.schema_id
    WHERE
    O.type IN ('FT','IF', 'TF');

    How you use this is up to you. I plan on hooking it into the Biml Query Table Builder to simulate tables for all my TVFs.

    Pop quiz - altering column types


    Pop quiz

    Given the following DDL


    CREATE TABLE dbo.IntToTime
    (
    CREATE_TIME int
    );

    What will be the result of issuing the following command?

    ALTER TABLE dbo.IntToTime ALTER COLUMN CREATE_TIME time NULL;

    Clearly, if I'm asking, it's not what you might expect. How can an empty table not allow you to change data types? Well, it seems time and datetime2 are special cases, as they'll raise errors of the form

    Msg 206, Level 16, State 2, Line 47 Operand type clash: int is incompatible with time

    If you're in this situation and need to get the type converted, you'll need to make two hops, one to varchar and then to time.


    ALTER TABLE dbo.IntToTime ALTER COLUMN CREATE_TIME varchar(10) NULL;
    ALTER TABLE dbo.IntToTime ALTER COLUMN CREATE_TIME time NULL;

    Altering table types, part 2


    Altering table types - a compatibility guide

    In yesterday's post, I altered a table type. Pray I don't alter them further. What else is incompatible with an integer column? It's just a morbid curiosity at this point as I don't recall having ever seen this after working with SQL Server for 18 years. Side note, dang I'm old

    How best to answer the question? By interrogating sys.types and throwing operations against the wall to see what does and doesn't stick.


    DECLARE
    @Results table
    (
    TypeName sysname, Failed bit, ErrorMessage nvarchar(4000)
    );

    DECLARE
    @DoOver nvarchar(4000) = N'DROP TABLE IF EXISTS dbo.IntToTime;
    CREATE TABLE dbo.IntToTime (CREATE_TIME int);'
    , @alter nvarchar(4000) = N'ALTER TABLE dbo.IntToTime ALTER COLUMN CREATE_TIME @type'
    , @query nvarchar(4000) = NULL
    , @typeName sysname = 'datetime';

    DECLARE
    CSR CURSOR
    FORWARD_ONLY
    FOR
    SELECT
    T.name
    FROM
    sys.types AS T
    WHERE
    T.is_user_defined = 0

    OPEN CSR;
    FETCH NEXT FROM CSR INTO @typeName
    WHILE @@FETCH_STATUS = 0
    BEGIN
    BEGIN TRY
    EXECUTE sys.sp_executesql @DoOver, N'';
    SELECT @query = REPLACE(@alter, N'@type', @typeName);
    EXECUTE sys.sp_executesql @query, N'';

    INSERT INTO
    @Results
    (
    TypeName
    , Failed
    , ErrorMessage
    )
    SELECT @typeName, CAST(0 AS bit), ERROR_MESSAGE();
    END TRY
    BEGIN CATCH
    INSERT INTO
    @Results
    (
    TypeName
    , Failed
    , ErrorMessage
    )
    SELECT @typeName, CAST(1 AS bit), ERROR_MESSAGE()
    END CATCH
    FETCH NEXT FROM CSR INTO @typeName
    END
    CLOSE CSR;
    DEALLOCATE CSR;

    SELECT
    *
    FROM
    @Results AS R
    ORDER BY
    2,1;
TypeName          Failed  ErrorMessage
bigint            0
binary            0
bit               0
char              0
datetime          0
decimal           0
float             0
int               0
money             0
nchar             0
numeric           0
nvarchar          0
real              0
smalldatetime     0
smallint          0
smallmoney        0
sql_variant       0
sysname           0
tinyint           0
varbinary         0
varchar           0
date              1       Operand type clash: int is incompatible with date
datetime2         1       Operand type clash: int is incompatible with datetime2
datetimeoffset    1       Operand type clash: int is incompatible with datetimeoffset
geography         1       Operand type clash: int is incompatible with geography
geometry          1       Operand type clash: int is incompatible with geometry
hierarchyid       1       Operand type clash: int is incompatible with hierarchyid
image             1       Operand type clash: int is incompatible with image
ntext             1       Operand type clash: int is incompatible with ntext
text              1       Operand type clash: int is incompatible with text
time              1       Operand type clash: int is incompatible with time
timestamp         1       Cannot alter column 'CREATE_TIME' to be data type timestamp.
uniqueidentifier  1       Operand type clash: int is incompatible with uniqueidentifier
xml               1       Operand type clash: int is incompatible with xml
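
For the operand-clash types, the two-hop workaround from the previous post should still apply; a minimal sketch for date (untested against the rest of the list):

ALTER TABLE dbo.IntToTime ALTER COLUMN CREATE_TIME varchar(10) NULL;
ALTER TABLE dbo.IntToTime ALTER COLUMN CREATE_TIME date NULL;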

    Pop Quiz - REPLACE in SQL Server


    It's amazing the things I've run into with SQL Server this week that I never noticed. In today's pop quiz, let's look at REPLACE


    DECLARE
    @Repro table
    (
    SourceColumn varchar(30)
    );

    INSERT INTO
    @Repro
    (
    SourceColumn
    )
    SELECT
    D.SourceColumn
    FROM
    (
    VALUES
    ('None')
    , ('ABC')
    , ('BCD')
    , ('DEF')
    )D(SourceColumn);

    SELECT
    R.SourceColumn
    , REPLACE(R.SourceColumn, 'None', NULL) AS wat
    FROM
    @Repro AS R;

    In the preceding example, I load 4 rows into a table and call the REPLACE function on it. Why? Because some numbskull front end developer entered None instead of a NULL for a non-existent value. No problem, I will simply replace all None with NULL. So, what's the value of the wat column?

Well, if you're one of those people who reads the instruction manual before attempting anything, you'd have seen that REPLACE "Returns NULL if any one of the arguments is NULL." Otherwise, you're like me, thinking "maybe I put the arguments in the wrong order." Nope. REPLACE(R.SourceColumn, 'None', '') AS EmptyString does work, though. So what the heck? Guess I'll actually read the manual... No, wait, I can just use NULLIF to turn the empty strings into NULLs: NULLIF(REPLACE(R.SourceColumn, 'None', ''), '') AS EmptyStringToNull
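
Putting those attempts side by side - this is just the snippets above rolled into one query against the same table variable, nothing new:

SELECT
R.SourceColumn
, REPLACE(R.SourceColumn, 'None', NULL) AS wat
, REPLACE(R.SourceColumn, 'None', '') AS EmptyString
, NULLIF(REPLACE(R.SourceColumn, 'None', ''), '') AS EmptyStringToNull
FROM
@Repro AS R;

The wat column comes back NULL for every single row, EmptyString swaps None for an empty string, and EmptyStringToNull finally returns NULL for the None rows and the original value for everything else.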

Much better: replace all my instances of None with an empty string and then convert anything that is an empty string to NULL. Wait, what? You know what would be better? Skipping the REPLACE call altogether.


    SELECT
    R.SourceColumn
    , NULLIF(R.SourceColumn, 'None') AS MuchBetter
    FROM
    @Repro AS R;

    Moral of the story and/or quiz: once you have a working solution, rubber duck out your approach to see if there's an opportunity for improvement (only after having committed the working version to source control).

    Python pandas repeating character tester


At one of our clients, we are profiling data. They have a mainframe that has been running for so long, they no longer have SMEs for their data. We've been able to leverage Service Broker to provide a real-time (under 3 seconds) remote file store for their data. It's pretty cool, but now they are trying to do something with the data, so we need to understand what the data looks like. We're using a mix of TSQL and python to understand nullability, value variances, etc. One of the "interesting" things we've discovered is that they loved placeholder values. Everyone knows a date of 88888888 is a placeholder for the actual date, which they'll get two steps later in the workflow. Except sometimes we use 99999999, because the eights are the placeholder for the time.

Initially, we were just searching for one sentinel value, then two values, until we saw the bigger pattern of "repeated values probably mean something." For us, this matters because we then need to discard those rows when testing for data type suitability. 88888888 isn't a valid date, so our logic might determine that the column is best served by a numeric data type. Unless we exclude the eights value, in which case we get a 100% match rate on the column's ability to be converted to a date.
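
On the TSQL side of the profiling, a rough sketch of the same idea - flag the all-one-character values so they can be excluded before scoring type suitability - might look like the following. The values and column name here are made up for illustration.

SELECT
D.SourceValue
, CASE
WHEN D.SourceValue = REPLICATE(LEFT(D.SourceValue, 1), LEN(D.SourceValue)) THEN 1
ELSE 0
END AS IsRepeatedCharacter
, TRY_CONVERT(date, D.SourceValue) AS AsDate
FROM
(
VALUES
('88888888')
, ('99999999')
, ('20180305')
, (NULL)
) D(SourceValue);

Rows where IsRepeatedCharacter is 1 get discarded before we score the column, which is how 88888888 stops poisoning an otherwise perfectly convertible date column.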

    How can we determine if a string is nothing but repeated values in python? There's a very clever test from StackOverflow

source == source[0] * len(source)

I would read that as "is the source variable exactly equal to the first character of source repeated for the length of source?"

    And that was good, until we hit a NULL (None in python-speak). We then took advantage of the ternary equivalent in python to make it

    (source == source[0] * len(source)) if source else False

    Enter Pandas (series)

Truth is a funny thing in a Pandas Series. Really, it is: "The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()." We were trying to apply the above function the same way we had been doing everything else

    df.MyColumn.str.len()
# this will fail magnificently
    (df.MyColumn == df.MyColumn[0] * len(df.MyColumn)) if df.MyColumn else False

It took me a while since I hadn't really used the pandas library beyond running what my coworker had done. What I needed to do was get a row context in which to apply the true/false calculation. As it stands, the Series machinery wants to aggregate the booleans, or something like that. And it makes sense from a SQL perspective: you can't really apply aggregates to bit fields (beyond COUNT).

    So, what's the solution? As always, you're likely to say the exact thing you're looking for. In this case, apply was the keyword.

    df.MyColumn.apply(lambda source: (source == source[0] * len(source)) if source else False)

    Full code you can play with would be


    import pandas
    import pprint

def isRepeated(src):
    return (src == src[0] * len(src)) if src else False

    df = pandas.DataFrame({"MyCol":pandas.Series(['AB', 'BC', 'BB', None])})

    pprint.pprint(df)

    print()
    # What rows have the same character in all of them?

    pprint.pprint(df.MyCol.apply(lambda source:(source == source[0] * len(source)) if source else False))
    #If you'd like to avoid the anonymous function...
    pprint.pprint(df.MyCol.apply(isRepeated))

In short, python is glorious and I'm happy to be writing in it again ;)


    2018 MVP Summit retrospective


    Another year of the MVP Summit is in the bag and as always, I have months worth of learning I'm excited to do.

    Thank you

I'd like to extend a hearty thank you to Microsoft and the various teams for hosting us. I can't imagine the sheer number of hours spent in preparation, the actual time not-spent-working-on-technology-X, much less the expense of caffeinating, feeding, lodging, and transporting us.

    What I'm excited about

    Stream Analytics

We have a high performance (60M messages per day, averaging 130ms throughput) messaging system that allows us to expose mainframe data as a SQL Server database for analytics. The devil with Service Broker is that there's no built-in monitoring. We have a clever dashboard built on a Power BI streaming dataset that provides an at-a-glance health check for data processing. What we need, though, is something that can drive action based on changes. The September changes in Stream Analytics look like the perfect fit. They allow us to detect not just hard limits (we've violated our 3 second SLA) but the squishier metrics: a background process just woke up and swamped us with a million rows in the past three minutes, or our processing time is trending upwards and someone needs to figure out why.

    SQL Graph improvements

    While we are not yet using graph features, I can see opportunities for it with our client that I want to build some proof of concept models.

    Cosmos DB

    Alongside the Stream Analytics improvements, perhaps we need to feed the change data into Cosmos and then leverage the Change Feed support to push to analytics processing. And just generally, I need to invest some time in Apache Spark. I also learned that I don't need to discover all the patterns for lambda architecture as it's already out there with a handy URL to boot.

    Cognitive Services

Ok, while picking up information about this was just to scratch a very silly itch, I was impressed by how easy it was from the web interface. I have bird feeders, and even though most seed will state that squirrels are not interested in it, that's a downright lie.

    Don't mind me, I'm just a fuzzy bird

    I want a camera pointed at my bird feeder and if a squirrel shows, I want to know about it. I used about a dozen pictures of my bird feeders with and without my nemesis to train the model and then fed back assorted photos to see how smart it was. Except for an image of a squirrel hiding in shadow, it was able to give me high confidence readings on what was featured in the photo. Here we can see that my dog is neither a bird nor a squirrel.
    Not a squirrel, just a lazy dog

    I'm so excited to get these bots built out. One for the Raspberry Pi to detect presence at the feeder and then an Azure based recognizer for friend versus foe. Once that's done, the next phase will be to identify specific bird species. And then tie it to feed type and feeder style (tray/platform versus house versus tube) and time of day and ... yes, lot of fun permutations that are easily available without having to learn all the computer vision and statistics. Feel free to give it a whirl at https://customvision.ai

    SQLOps studio

    This is the new cross platform SQL Server Management Studio replacement - sort of. It's not designed to do everything SSMS does but instead the vision is to solve the most needed problems and with the open source model, the community can patch in their own solutions. I'm excited to put together a better reporting interface for the SSISDB. Something that you can actually copy text out of - how crazy is that?

    Azure Data Lake Analytics

    It had been a year since I had worked through some of the ADLA/USQL so it was good to get back into the language and environment. I need to get on a project that is actually using the technology though to really cement my knowledge.

    What I learned

In October of 2016, I launched Sterling Data Consulting as my company. I sub under a good friend and it's been an adventure running a business, but I don't feel like I'm really running a business since I have no other business. One of my TODOs at the conference was to talk to other small shop owners to see if I could discover their "secret sauce." While I got assorted feedback, the two I want to send a special thank you to are John Sterrett of Procure SQL and Tim Radney. Their advice ranged from the straightforward ("I don't know what you do", "are you for hire?") to thoughts on lead acquisition and my lack of vision for sales.

Tim was also my roommate and it was great just getting to know him. We traded Boy Scout leader stories and he had excellent ideas for High Adventure fundraisers since that's something our troop is looking to do next year. For being a year younger than me, he sure had a lot more wisdom on the things I don't do or don't do well. You should check him out at the Atlanta SQL Saturday and attend his precon on Common SQL Server mistakes and how to avoid them.

    Photos

    Bellevue is less scenic than Seattle but the sunshine and warmth on Tuesday made for some nice photos of the treehouses. Yes, the Microsoft Campus has adult sized treehouses in it. How cool is that?

    Sort SQL Server tables into similarly sized buckets


You need to do something to all of the tables in SQL Server. That something can be anything: reindex/reorg, export the data, perform some other maintenance - it really doesn't matter. What does matter is that you'd like to get it done sooner rather than later. If time is no consideration, then you'd likely just do one table at a time until you've done them all. Sometimes, though, a maximum degree of parallelization of one is less than ideal. You're paying for more than one processor core, you might as well use it. The devil in splitting a workload out is ensuring the tasks are well balanced. When I'm staging data in SSIS, I often use a row count as an approximation for time cost. It's not perfect - a million-row table 430 columns wide might actually take longer than the 250-million-row key-value table.

A sincere tip of the hat to Daniel Hutmacher (b|t) for his answer on this StackExchange post. He has some great logic for sorting tables into approximately equally sized bins and it performs reasonably well.


    SET NOCOUNT ON;
    DECLARE
    @bucketCount tinyint = 6;

IF OBJECT_ID('tempdb..#work') IS NOT NULL
BEGIN
DROP TABLE #work;
END

CREATE TABLE #work (
_row int IDENTITY(1, 1) NOT NULL,
[SchemaName] sysname,
[TableName] sysname,
[RowsCounted] bigint NOT NULL,
GroupNumber int NOT NULL,
moved tinyint NOT NULL,
PRIMARY KEY CLUSTERED ([RowsCounted], _row)
    );

    WITH cte AS (
    SELECT B.RowsCounted
    , B.SchemaName
    , B.TableName
    FROM
    (
    SELECT
    s.[Name] as [SchemaName]
    , t.[name] as [TableName]
    , SUM(p.rows) as [RowsCounted]
    FROM
    sys.schemas s
LEFT OUTER JOIN
sys.tables t
ON s.schema_id = t.schema_id
LEFT OUTER JOIN
sys.partitions p
ON t.object_id = p.object_id
LEFT OUTER JOIN
sys.allocation_units a
ON p.partition_id = a.container_id
WHERE
p.index_id IN (0,1)
AND p.rows IS NOT NULL
AND a.type = 1
GROUP BY
s.[Name]
, t.[name]
) B
)

INSERT INTO #work ([RowsCounted], SchemaName, TableName, GroupNumber, moved)
SELECT [RowsCounted], SchemaName, TableName, ROW_NUMBER() OVER (ORDER BY [RowsCounted]) % @bucketCount AS GroupNumber, 0
    FROM cte;


    WHILE (@@ROWCOUNT!=0)
    WITH cte AS
    (
    SELECT
    *
, SUM(RowsCounted) OVER (PARTITION BY GroupNumber) - SUM(RowsCounted) OVER (PARTITION BY (SELECT NULL)) / @bucketCount AS _GroupNumberoffset
    FROM
    #work
    )
    UPDATE
    w
    SET
    w.GroupNumber = (CASE w._row
    WHEN x._pos_row THEN x._neg_GroupNumber
    ELSE x._pos_GroupNumber
    END
    )
    , w.moved = w.moved + 1
    FROM
#work AS w
INNER JOIN
(
SELECT TOP 1
    pos._row AS _pos_row
    , pos.GroupNumber AS _pos_GroupNumber
    , neg._row AS _neg_row
    , neg.GroupNumber AS _neg_GroupNumber
    FROM
    cte AS pos
INNER JOIN
    cte AS neg
    ON pos._GroupNumberoffset > 0
    AND neg._GroupNumberoffset < 0
    AND
    --- To prevent infinite recursion:
    pos.moved < @bucketCount
    AND neg.moved < @bucketCount
WHERE --- must improve positive side's offset:
    ABS(pos._GroupNumberoffset - pos.RowsCounted + neg.RowsCounted) <= pos._GroupNumberoffset
    AND
    --- must improve negative side's offset:
    ABS(neg._GroupNumberoffset - neg.RowsCounted + pos.RowsCounted) <= ABS(neg._GroupNumberoffset)
    --- Largest changes first:
ORDER BY
    ABS(pos.RowsCounted - neg.RowsCounted) DESC
    ) AS x
    ON w._row IN
    (
    x._pos_row
    , x._neg_row
    );

    Now what? Let's look at the results. Run this against AdventureWorks and AdventureWorksDW


    SELECT
    W.GroupNumber
    , COUNT_BIG(1) AS TotalTables
    , SUM(W.RowsCounted) AS GroupTotalRows
    FROM
#work AS W
    GROUP BY
    W.GroupNumber
    ORDER BY
    W.GroupNumber;


    SELECT
    W.GroupNumber
    , W.SchemaName
    , W.TableName
    , W.RowsCounted
    , COUNT_BIG(1) OVER (PARTITION BY W.GroupNumber ORDER BY (SELECT NULL)) AS TotalTables
    , SUM(W.RowsCounted) OVER (PARTITION BY W.GroupNumber ORDER BY (SELECT NULL)) AS GroupTotalRows
    FROM
#work AS W
    ORDER BY
    W.GroupNumber;

    For AdventureWorks (2014), I get a nice distribution across my 6 groups. 12 to 13 tables in each bucket and a total row count between 125777 and 128003. That's less than 2% variance between the high and low - I'll take it.

    If you rerun for AdventureWorksDW, it's a little more interesting. Our 6 groups are again filled with 5 to 6 tables but this time, group 1 is heavily skewed by the fact that FactProductInventory accounts for 73% of all the rows in the entire database. The other 5 tables in the group are the five smallest tables in the database.

I then ran this against our data warehouse-like environment. We had 1206 tables in there for 3,283,983,766 rows (3.3 billion). The query went from instantaneous to about 15 minutes, but now I've got a starting point for bucketing my tables into similarly sized groups.
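
To put the buckets to work, the last step is turning #work into a per-group worklist. A quick sketch - the REORGANIZE below is just a stand-in for whatever your "something" happens to be:

SELECT
W.GroupNumber
, CONCAT('ALTER INDEX ALL ON ', QUOTENAME(W.SchemaName), '.', QUOTENAME(W.TableName), ' REORGANIZE;') AS MaintenanceCommand
FROM
#work AS W
ORDER BY
W.GroupNumber
, W.RowsCounted DESC;

Each GroupNumber then becomes the queue for one agent job, SSIS package, or whatever else is doing the parallel work.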

    What do you think? How do you plan to use this? Do you have a different approach for figuring this out? I looked at R but without knowing what this activity is called, I couldn't find a function to perform the calculations.

    A date dimension for SQL Server


    The most common table you will find in a data warehouse will be the date dimension. There is no "right" implementation beyond what the customer needs to solve their business problem. I'm posting a date dimension for SQL Server that I generally find useful as a starting point in the hopes that I quit losing it. Perhaps you'll find it useful or can use the approach to build one more tailored to your environment.

As the comments indicate, this will create a DW schema and a table named DW.DimDate, and then populate the date dimension from 1900-01-01 to 2079-06-06, endpoints inclusive. I also patch in 9999-12-31 as a well known "unknown" date value. Sure, it's odd to have an incomplete year - this is your opportunity to tune the supplied code ;)


    -- At the conclusion of this script, there will be
    -- A schema named DW
    -- A table named DW.DimDate
    -- DW.DimDate will be populated with all the days between 1900-01-01 and 2079-06-06 (inclusive)
    -- and the sentinel date of 9999-12-31

IF NOT EXISTS
    (
    SELECT * FROM sys.schemas AS S WHERE S.name = 'DW'
    )
    BEGIN
    EXECUTE('CREATE SCHEMA DW AUTHORIZATION dbo;');
    END
    GO
IF NOT EXISTS
(
SELECT * FROM sys.schemas AS S INNER JOIN sys.tables AS T ON T.schema_id = S.schema_id
WHERE S.name = 'DW' AND T.name = 'DimDate'
    )
    BEGIN
CREATE TABLE DW.DimDate
(
DateSK int NOT NULL
, FullDate date NOT NULL
, CalendarYear int NOT NULL
, CalendarYearText char(4) NOT NULL
, CalendarMonth int NOT NULL
, CalendarMonthText varchar(12) NOT NULL
, CalendarDay int NOT NULL
, CalendarDayText char(2) NOT NULL
, CONSTRAINT PK_DW_DimDate
PRIMARY KEY CLUSTERED
    (
    DateSK ASC
    )
    WITH (ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, DATA_COMPRESSION = PAGE)
    , CONSTRAINT UQ_DW_DimDate UNIQUE (FullDate)
    );
    END
    GO
    WITH
    -- Define the start and the terminal value
    BOOKENDS(FirstDate, LastDate) AS (SELECT DATEFROMPARTS(1900,1,1), DATEFROMPARTS(9999,12,31))
    -- itzik ben gan rapid number generator
    -- Builds 65537 rows. Need more - follow the pattern
    -- Need fewer rows, add a top below
    , T0 AS
    (
    -- 2
    SELECT 1 AS n
UNION ALL SELECT 1
    )
    , T1 AS
    (
    -- 2^2 => 4
    SELECT 1 AS n
    FROM
    T0
    CROSS APPLY T0 AS TX
    )
    , T2 AS
    (
-- 4 * 4 => 16
    SELECT 1 AS n
    FROM
    T1
    CROSS APPLY T1 AS TX
    )
    , T3 AS
    (
-- 16 * 16 => 256
    SELECT 1 AS n
    FROM
    T2
    CROSS APPLY T2 AS TX
    )
    , T4 AS
    (
-- 256 * 256 => 65536
    -- or approx 179 years
    SELECT 1 AS n
    FROM
    T3
    CROSS APPLY T3 AS TX
    )
    , T5 AS
    (
-- 65536 * 65536 => basically infinity
    SELECT 1 AS n
    FROM
    T4
    CROSS APPLY T4 AS TX
    )
    -- Assume we now have enough numbers for our purpose
    , NUMBERS AS
    (
    -- Add a SELECT TOP (N) here if you need fewer rows
    SELECT
CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS int) - 1 AS number
    FROM
    T4
    UNION
    -- Build End of time date
    -- Get an N value of 2958463 for
    -- 9999-12-31 assuming start date of 1900-01-01
    SELECT
    ABS(DATEDIFF(DAY, BE.LastDate, BE.FirstDate))
    FROM
    BOOKENDS AS BE
    )
    , DATES AS
    (
    SELECT
    PARTS.DateSk
    , FD.FullDate
    , PARTS.CalendarYear
    , PARTS.CalendarYearText
    , PARTS.CalendarMonth
    , PARTS.CalendarMonthText
    , PARTS.CalendarDay
    , PARTS.CalendarDayText
    FROM
    NUMBERS AS N
    CROSS APPLY
    (
    SELECT
    DATEADD(DAY, N.number, BE.FirstDate) AS FullDate
    FROM
    BOOKENDS AS BE
    )FD
    CROSS APPLY
    (
    SELECT
CAST(CONVERT(char(8), FD.FullDate, 112) AS int) AS DateSk
    , DATEPART(YEAR, FD.FullDate) AS [CalendarYear]
    , DATENAME(YEAR, FD.FullDate) AS [CalendarYearText]
    , DATEPART(MONTH, FD.FullDate) AS [CalendarMonth]
    , DATENAME(MONTH, FD.FullDate) AS [CalendarMonthText]
    , DATEPART(DAY, FD.FullDate) AS [CalendarDay]
    , DATENAME(DAY, FD.FullDate) AS [CalendarDayText]

    )PARTS
    )
INSERT INTO
    DW.DimDate
    (
    DateSK
    , FullDate
    , CalendarYear
    , CalendarYearText
    , CalendarMonth
    , CalendarMonthText
    , CalendarDay
    , CalendarDayText
    )
    SELECT
    D.DateSk
    , D.FullDate
    , D.CalendarYear
    , D.CalendarYearText
    , D.CalendarMonth
    , D.CalendarMonthText
    , D.CalendarDay
    , D.CalendarDayText
    FROM
    DATES AS D
WHERE NOT EXISTS
    (
    SELECT * FROM DW.DimDate AS DD
    WHERE DD.DateSK = D.DateSk
    );
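
A quick sanity check after the load - not part of the script above - shows the row count, the real bookends, and the sentinel:

SELECT
COUNT_BIG(1) AS TotalRows
, MIN(DD.FullDate) AS FirstDate
, MAX(CASE WHEN DD.FullDate < DATEFROMPARTS(9999, 12, 31) THEN DD.FullDate END) AS LastRealDate
, MAX(DD.FullDate) AS SentinelDate
FROM
DW.DimDate AS DD;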