Sunday, July 26, 2015

SQL Server Brain Basher of the Week #022 - Instance Name

Here is the question for the week;

Can I name my Named Instance as MSSQLServer?

Generally, when we install SQL Server as a named instance, we set a meaningful name for it based on the requirement. However, we cannot name it any way we want; there is a set of rules we must follow when naming it.

For more info on instances and rules, read this: https://msdn.microsoft.com/en-us/library/ms143531.aspx.

Now the question: can we name an instance MSSQLServer? Yes, you can if you do not have a default instance, but it will not be considered a named instance. Once the installation is done with the MSSQLServer name, it will be treated as the default instance and you can connect to it the same way you connect to a default instance.
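
As a quick check, a query like the one below (a minimal sketch, assuming the installation completed with the name MSSQLServer) confirms how the instance is treated; SERVERPROPERTY('InstanceName') returns NULL for the default instance.

-- returns NULL when connected to the default instance
SELECT SERVERPROPERTY('InstanceName') AS InstanceName
 , @@SERVERNAME AS ServerName;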

Here is some additional info: SQL Server: Why we need multiple instances in production environment?

Saturday, July 25, 2015

Can we connect to a named instance without Browser Service running?

SQL Server Browser Service is one of the services installed when SQL Server is installed. Generally, this service is required to be running for accessing named instances since ports or named pipes are assigned dynamically to named instances.

At the start, it claims UDP port 1434 and collects the ports and named pipes of all instances by reading the registry. When a client makes a request to an instance, the client network library sends a UDP message to the server using port 1434 and the Browser service responds with the correct port number or named pipe of the requested instance. The client network library uses the received information for completing the connection and then communicates with the required instance.

For more info on Browser Service, read this: https://technet.microsoft.com/en-us/library/ms181087(v=sql.105).aspx

As per the above, we cannot access a named instance if the Browser service is not running. But there are certain situations that allow us to make connections to instances even when the service is disabled. An instance can be accessed when;
  • Connecting to the default instance that uses port 1433 or pipe \sql\query.
  • Connecting to an instance that uses a specific port number.
  • Connecting to an instance using a connection string that contains the port number.
Here is an example. I have two named instances called DINESH-PC\SQLTest and DINESH-PC\SQLTest2. The Shared Memory and Named Pipes protocols are disabled for both; the SQLTest instance is set with dynamic ports and SQLTest2 is set with a specific port.
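
If you need to check which TCP port an instance is currently listening on, a query like the one below helps (a minimal sketch; the port number mentioned in the comment is only an assumed example).

-- run on the instance to see which TCP port the current connection uses;
-- with the Browser service stopped, clients have to specify this port
-- in the connection string, e.g. DINESH-PC,1455 (assumed port number)
SELECT local_tcp_port
FROM sys.dm_exec_connections
WHERE session_id = @@SPID;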




Let's disable the Browser Service and test.



Connecting to SQLTest


Connecting to SQLTest2



Wednesday, July 22, 2015

Formatting for 2 digits number: Month, Day, Hour, Minute

I have seen many ways of formatting a number related to a month, day, hour, or minute as a two-digit number. It is all about checking whether the given value has two digits and, if not, adding a leading zero to make it a two-digit number. Some use an IF statement, some use a CASE statement, and some use CONVERT and SUBSTRING. What is the best way? Based on the implementation requirement, you may use an appropriate one, but I feel the best way is to just add a leading zero without checking the length and then take the right two digits, as below;

DECLARE @GivenDate date = getdate();
DECLARE @GivenMonth1 int = 7;
DECLARE @GivenMonth2 int = 12;

SELECT RIGHT('0' + CONVERT(varchar(2), MONTH(@GivenDate)), 2) AS MonthOfGivenDateWithTwoDigits;
SELECT RIGHT('0' + CONVERT(varchar(2), @GivenMonth1), 2) AS MonthOfGivenMonthWithTwoDigits;
SELECT RIGHT('0' + CONVERT(varchar(2), @GivenMonth2), 2) AS MonthOfGivenMonthWithTwoDigits;
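
The same pattern works for the other parts mentioned in the title; here is a small sketch (assuming the current date and time as the input) applying it to day, hour, and minute.

DECLARE @Now datetime = GETDATE();

SELECT RIGHT('0' + CONVERT(varchar(2), DAY(@Now)), 2) AS DayWithTwoDigits
 , RIGHT('0' + CONVERT(varchar(2), DATEPART(hour, @Now)), 2) AS HourWithTwoDigits
 , RIGHT('0' + CONVERT(varchar(2), DATEPART(minute, @Now)), 2) AS MinuteWithTwoDigits;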


Tuesday, July 21, 2015

How to create a Hive table and execute queries using HDInsight

I wrote a post on Hive (What is Hive, What is Hive Database, What is Hive Table?) discussing key elements of Hive. If you want to try it out, there are multiple ways of doing it: you can do it with a Hadoop cluster configured in your environment, using a sandbox provided by vendors, or using a cloud computing platform and infrastructure like Microsoft Azure or Amazon. This post speaks about how to use HDInsight, which is Microsoft's Hadoop cloud cluster, for performing Hive related operations.

First of all, you need to make sure that a Storage and HDInsight Cluster are created with your Azure account. This post explains how to do it: How to navigate HDInsight cluster easily: CloudXplorer.

Let's try to create a simple External table using a file that holds data like below;



Let's place this file in one of the HDFS locations; in my case, I have created a folder called /MyFiles/CustomerSource and placed the file in that folder using CloudXplorer.


In order to create an External table and query it, there are many ways of doing it. For implementations, PowerShell, .NET or even SSIS can be used, but for this, let's use the standard interface given with HDInsight: Query Console.


You need a user name and password for opening Query Console. If you have not given a specific user name when creating the cluster, your user name is admin. The password you have to use is the one you entered when creating the cluster. Once submitted, you should see the below page and should click on the Hive Editor link.




Here is the code for creating an External table called customer. Note that complex data types such as map, array, and struct have been used for handling data in the text file, and no specific database is mentioned, hence the table will be created in the default database. The last clause of the statement points to the location where we have the data files.

create external table customer
(
 id int
 , name struct<fname:string, lname:string> -- nested types assumed based on the sample query below
 , telephone map<string, string>
 , ranks array<int>
)
row format delimited
fields terminated by '|'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n'
stored as textfile
location '/MyFiles/CustomerSource/';
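
For clarity, here are the kinds of rows the table above expects, given the delimiters used; the values below are only assumed samples, not the actual file content.

1|John,Perera|mobile:0771234567,home:0112345678|1,2,3
2|Jane,Silva|mobile:0779876543|4,5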

Place the code in the Editor and click on Submit. Once submitted, a job will be created, and you should see the status of the job change from Initialize to Running and then Completed. Once completed, you can click on the View Details link of the job and see details related to the execution.


Done. The External table has been created. Now we should be able to query the file using the table created; of course, this is Schema-on-read. Let's execute a query like below;

select name.lname, telephone["mobile"], ranks[0]
from customer;


Click on View Details to see the result. If needed, the result can be downloaded too.


Let's see more practical examples with PowerShell and SSIS in future posts.

Monday, July 20, 2015

SQL Server Brain Basher of the Week #021 - Documentation

Here is the Brain Basher of the week. It is related to Database Administrators, but even if you are a Database Engineer, it is very important to know.

What is a run book?

A run book is documentation that is compiled and maintained by Database Administrators. Although it is a practice that is a must, many Database Administrators (in a way, all IT engineers) do not compile many documents related to the tasks they are responsible for. However, a well-maintained, detailed run book is invaluable in many situations. A run book generally includes the following;
  • Contact details of everyone responsible and their responsibilities
  • Configuration settings of all databases, instances, files and details of them.
  • Standard maintenance procedures, scheduled operations and details of them.
  • Procedures to be followed with disaster recovery.
Assume that something happens with one of the SQL Server instances and the DBA is not on duty. What should you do? What is the immediate action you should take? Who should be contacted? If the DBA maintains a well-organized run book, you have answers for all these questions.

This is not only for us when the DBA is not available; it is helpful for the DBA too. When something has not run as expected, the run book will help the DBA to find reasons, related operations, and actions to be taken, as no one can memorize everything related to the job.

Sunday, July 19, 2015

SQL Server Management Studio 2015 June - Preview is available for downloading

Microsoft SQL Server Management Studio 2015 June Preview is available for downloading. This comes as a separate installation, not as a part of the standard SQL Server installation. I believe that the whole idea of this is making Management Studio a standalone tool for managing all SQL products including Azure databases.

As per MSDN, enhancements made are;
  • New SSMS Installer - SSMS can now be installed with a light weight stand-alone web installer.
  • SSMS Updates - Receive notification within SSMS when new updates are available, and choose to install them at your convenience.
  • Enhanced SSMS support for Azure SQL Database – Several fixes and enhancements, including expanded SQL Server Management Objects (SMO) coverage, and an updated Import/Export wizard.
Here is the page for downloading: https://msdn.microsoft.com/en-us/library/mt238290.aspx


Installation is very simple and lightweight. Just as with SQL Server 2016, it is recommended to install this on a separate machine as it is still a Preview.





Saturday, July 18, 2015

SQL Server: Why we need multiple instances in production environment?

Installing multiple copies of SQL Server on a single machine is not uncommon, but it is not recommended in a production environment. A copy of the SQL Server installation is called an instance. Multiple instances are installed when we need multiple environments such as development and test, and when we do not have multiple physical machines.

Although it is not recommended to have a multi-instance setup in production, there are cases where we install multiple instances on a single server. We install;
  • when we have different types of applications that require different versions of SQL Server. If we do not have different servers, two different versions can be installed side-by-side using multiple instances.
  • when we need different security configurations for different applications. Assume that one application needs to connect only with Windows Authentication and should not allow Mixed Authentication, and another application requires both. This can be solved by installing multiple instances.
  • when we need different levels of service for databases. There are some instances which need different SLAs (Service Level Agreements) for different databases.
  • when we need a different collation set up for databases. Even though it is possible to have different collations for databases in a single instance, tempdb can be configured only with one collation. This leads us to have multiple instances if required.
  • when we need different server level configurations for specific applications. We may have an application that needs standard settings but another that needs something like optimize for ad hoc workloads enabled (see the sketch after this list).
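
Here is a minimal sketch of that last case, showing how the setting could be enabled on one instance while another instance keeps the default.

-- run on the instance that needs the non-standard setting
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'optimize for ad hoc workloads', 1;
RECONFIGURE;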

Friday, July 17, 2015

What is Hive, What is Hive Database, What is Hive Table?

When I go through the Apache Software Foundation, the most attractive and most relevant projects I see for me are Hadoop and Hive. Hadoop is an open source platform that offers highly optimized distributed processing and distributed storage and can be configured with inexpensive infrastructure. Scalability is one of the key advantages of Hadoop: it can be started with a few servers and can be scaled out to thousands without any issue.

For more info on Hadoop:




What is Hive now? Hive is a supporting project that was originally developed by Facebook as an abstraction layer on top of the MapReduce model. In order to understand Hive, MapReduce has to be understood.

MapReduce is a solution for scaling data processing and is one of the main components of Hadoop. This means that it helps to do parallel data processing using multiple machines in Hadoop (or HDFS). It is considered a framework as it is used for writing programs for distributed data processing. MapReduce programs require two separate and distinct methods: Map and (optionally) Reduce. Functions related to these can be written using languages like Java, Perl, and Python.

When it comes to a complex MapReduce implementation, many who do data analysis, including database engineers, find it difficult because the languages to be used are not that familiar to them. Not only that, there are other constraints like the time it takes to implement, the complexity, and less re-usability. This made the Facebook team implement Hive.

Hive is a data warehouse infrastructure built on top of Hadoop. It provides a SQL-like language called HiveQL, allowing us to query data files in HDFS and process data using familiar SQL-like techniques. Hive converts our SQL commands to MapReduce jobs, hence we do not need to worry about the MapReduce implementation.

What is a Hive Database? Even though it appears like a relational database we are familiar with, it is not. It is just a name that can be used for grouping a set of tables created, or it can be considered a namespace (like we use to group our classes and methods in .NET). When a Hive database is created, it creates a folder with the given name suffixed with .db. If the location is not specified, it will be created in the /hive/warehouse folder, else the folder will be created in the given location in HDFS. For example, the following code will create a folder called sales.db inside the /hive/warehouse folder.

CREATE DATABASE sales;
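
If a specific location is needed, the statement below (a sketch; the HDFS path and the comment text are assumed) shows the same database created with COMMENT and LOCATION clauses.

-- the given HDFS path is used as the database folder
-- instead of the default /hive/warehouse location
CREATE DATABASE IF NOT EXISTS sales
COMMENT 'Database for sales related tables'
LOCATION '/myhive/sales.db';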


What is a Hive Table? It uses a similar concept. When we create a table with a relational database management system, it creates a container for us with the constraints we add, like columns, data types, and rules, and allows us to add records matching those constraints. A Hive table does not create a container for us like that; it creates a schema on top of a data file we have placed in HDFS, or data files we are supposed to place, as it uses schema-on-read, not schema-on-write (read this: What is Schema-on-write and Schema-on-Read?).

Hive supports two types of tables: Hive Managed Tables and External Tables. A Hive Managed Table creates a subfolder inside the database folder with a schema. Later we can place files into that folder; this is how the record-insert process works, though Hive does not offer interactive queries like INSERT, UPDATE, and DELETE. Managed tables are maintained by Hive, and dropping the table will drop the files placed in it too.

An External Table helps us to create a schema for reading data in files. A LOCATION clause is required when an external non-partitioned table is created, pointing to the folder that holds the data files.

Here is an example for a managed table

USE sales;
CREATE TABLE customer
(
    customerid int
    , name string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE; 
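
Once the managed table exists, one way of getting data into it is the LOAD DATA statement, which moves a file from an HDFS location into the table's folder; here is a small sketch (the staging path is just an assumed example).

-- moves the file from the staging location into the customer table's folder
LOAD DATA INPATH '/staging/customer.txt' INTO TABLE customer;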

Here is an example for an external table

CREATE EXTERNAL TABLE customer
(
    customerid int
    , name string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/xyz/abc/lmn'; 


Once the table is created, SQL-like queries can be used for accessing data in the files.

Let's see more on these with samples in future posts.

Wednesday, July 15, 2015

MDX and Reporting Services: How to remove duplicates from a parameter?

What if you get duplicate values from a dimension that is going to be used for loading a Reporting Services parameter? Generally, you do not get duplicates from a dimension because we do not configure the dimension like that. For example, we do not expect duplicate names from the Product dimension. But there can be an exception that shows duplicate names with unique keys, something like below;

Id Name ...
1 Product A ...
2 Product B ...
3 Product A ...
4 Product C ...

If the Id has been used as the key and the Name as the name of the attribute, even though we write an MDX query like below, we still get duplicates as they have unique keys.

SELECT 
  {  } ON COLUMNS, 
  { Distinct([Product].[Product].MEMBERS) } ON ROWS 
FROM [Cube];

The best way is to fix this at the database level without taking it to the reporting level. But if we cannot do anything at the database level (assume we have no permission and have to use it as it is) and still should not show duplicates in the parameter, what is the best approach?

I have solved this in two ways. One is to execute the MDX from the database engine, connecting to the OLAP database using a linked server. In this case, a TSQL SELECT statement is used and the source for the parameter is the database engine, not the OLAP database.

SELECT
 DISTINCT [Measures.ProductName]
FROM OpenQuery(LinkedServerForOLAP,
 'WITH 
 MEMBER [Measures.ProductName] AS [Product].[Product].CurrentMember.Name
 SELECT 
    { [Measures.ProductName] } ON COLUMNS, 
    { Distinct([Product].[Product].MEMBERS) } ON ROWS 
  FROM [Cube]')

The other method is to write VB.NET code inside the report for removing duplicates. This requires a hidden parameter that holds data coming from the OLAP database and another new parameter for showing values. The new parameter has to be loaded from a function written in the Code segment that accesses the hidden parameter for getting values, programmatically removes duplicates, and returns distinct values.

Here is the second way;

First create a hidden parameter. Make sure it is set as hidden.


Then set Available Values and Default Values with the dataset created based on the OLAP database.



The first part is done. Then implement the following code in the Code segment.

Function GetDistinctLabels(parameter As Parameter) As String()

    ' Get all labels of the hidden parameter and sort them
    Dim labels As Object() = parameter.Label
    System.Array.Sort(labels)

    Dim x As Integer = 0

    ' Compact the sorted array, skipping any value equal to the previous one
    ' (AndAlso is used so labels(i - 1) is not evaluated when i = 0)
    For i As Integer = 0 To labels.Length - 1

        If i > 0 AndAlso labels(i).Equals(labels(i - 1)) Then
            Continue For
        End If
        labels(x) = labels(i)
        x += 1
    Next

    ' Copy only the distinct values into the array that is returned
    Dim distinctlabels As [String]() = New [String](x - 1) {}
    System.Array.Copy(labels, 0, distinctlabels, 0, x)

    Return distinctlabels

End Function


After that, load the main parameter that is used for showing values as below;

Set the Available Values using Specify values and call the function written above for loading distinct values, passing the hidden parameter. Do it for both Label and Value, as shown below.
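
For instance, the expression would look like the one below, assuming the hidden parameter is named HiddenProduct (a hypothetical name).

=Code.GetDistinctLabels(Parameters!HiddenProduct)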


Set Default Values using same technique if required.


That is all. I have used both methods and both worked fine. However, if the keys of these labels have to be passed to another dataset for loading more data, additional work is required for maintaining the keys of the parameter.

Monday, July 13, 2015

MDX Top function

If we need to get the Top n records from a relational database table, we simply use SELECT TOP (n).... But how do you do the same with a multi-dimensional database like an Analysis Services database? MDX is the language we use for querying Analysis Services databases, or cubes. The MDX function TopCount is the known function for a similar operation, but it is often believed that it has to be used with a numeric expression and that the result of TopCount is based on the values of that numeric expression.

In a way, it is true. But what if we need to get a set of dimension members using Top functionality without using a numeric expression? To elaborate, if I want to get 5 products from the Product table, I can write something like below using TSQL;

USE AdventureWorks2014;
GO

SELECT TOP (5) Name
FROM Production.Product;

If I want to do the same with the Product dimension in a cube, how can it be done? Can I use TopCount without a numeric expression? Yes, it is possible. The numeric expression for the TopCount function is optional, hence it works like below;

SELECT 
 {} ON 0
 , {TopCount([Product].[Product Categories].[Product].Members, 3)
  } ON 1
FROM [Adventure Works];


In addition to that, the Head function can be used too.

SELECT 
 {} ON 0
 , {Head([Product].[Product Categories].[Product].Members, 3)
  } ON 1
FROM [Adventure Works];


Sunday, July 12, 2015

SQL Server Brain Basher of the Week #020 - Lock Escalation

If you have read my previous post "SQL Server: Locking cost, concurrency cost and concurrency issues [Discussion on Transactions - III]", then you have seen lock escalation. Lock escalation is a process that converts fine-grain locks (row level, page level locks) to fewer coarse-grain locks (table level locks). This increases the concurrency cost and reduces the locking cost. This week's Brain Basher is based on this, and it is;

What triggers Lock Escalation?

Generally, it gets triggered when a single statement acquires at least 5,000 locks. If escalation is not successful, it attempts the same for every 1,250 new locks acquired. This happens only when Lock Escalation is not disabled on the table by using the ALTER TABLE SET LOCK_ESCALATION option (a sketch of it is shown below). If you need a sample code to see this, please visit the above link.

As per MSDN;

Lock escalation is triggered when lock escalation is not disabled on the table by using the ALTER TABLE SET LOCK_ESCALATION option, and when either of the following conditions exists:
  • A single Transact-SQL statement acquires at least 5,000 locks on a single nonpartitioned table or index.
  • A single Transact-SQL statement acquires at least 5,000 locks on a single partition of a partitioned table and the ALTER TABLE SET LOCK_ESCALATION option is set to AUTO.
  • The number of locks in an instance of the Database Engine exceeds memory or configuration thresholds.
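
Here is a minimal sketch of the option mentioned above, assuming a table named dbo.InternetSalesTest exists.

-- disable lock escalation for this table;
-- other valid values are TABLE (the default) and AUTO
ALTER TABLE dbo.InternetSalesTest SET (LOCK_ESCALATION = DISABLE);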


SQL Server: Locking cost, concurrency cost and concurrency issues [Discussion on Transactions - III]

Continuing notes related to the discussion....

Another area that was discussed is how SQL Server places locks for transactions and whether the cost of a transaction goes up based on the locking mode used. What exactly happens when we query, or when we execute an action query? By default, SQL Server decides what type of locking mode should be used for our query, whether it should be shared or exclusive, and the type of resources that should be locked: RID, pages, tables, database, etc.

If SQL Server has to maintain a large number of locks, the maintenance cost goes up and it might affect performance too. However, this might increase concurrency. For example, assume that we update 6,000 records in a table. Considering many factors, SQL Server might first obtain fine-grained locks (row or page level locks) only for the records that are being updated. This makes sure that other records in the table are not locked and can be accessed by other users, increasing the concurrency and reducing the concurrency cost. However, the cost of maintaining 6,000 locks may be very high. If SQL Server detects that the cost is higher than expected, it tries to escalate the fine-grained locks to more coarse-grained locks (table level locks), decreasing the locking cost and increasing the concurrency cost (more and more users will be blocked). The below diagram shows this;


This code shows how locks are obtained for a transaction that updates 4,000 records;

-- starting a transaction
BEGIN TRAN;

-- update 4000 records
UPDATE TOP (4000) InternetSalesTest
 SET DiscountAmount = SalesAmount * (5.0/100)
WHERE MONTH(OrderDate) = 5;

-- check locks
EXEC sp_lock;



Since only the records that are being updated are locked, users can access other records without any issue. This increases concurrency, reducing the concurrency cost. However, it increases the locking cost.

When 6,000 records are getting updated;

-- rollback the previous transaction
ROLLBACK TRAN;

-- start a new transaction
BEGIN TRAN;

-- update 6,000 records
UPDATE TOP (6000) InternetSalesTest
 SET DiscountAmount = SalesAmount * (5.0/100)
WHERE MONTH(OrderDate) = 5;

-- check locks
EXEC sp_lock;


Since the entire table is locked now, it decreases concurrency, increasing the concurrency cost. No other users can access the table, though the locking cost is low now.

This behavior is changeable; however, it is always recommended not to change the default behavior unless you know exactly what you are doing.

We discussed concurrency issues too. There are four well-known issues related to concurrency which can be addressed by changing the locking mechanism through isolation levels. They are Dirty reads, Lost updates, Inconsistent analysis, and Phantom reads.

I have written an article on this in 2011; you can have a look at it at: http://www.sql-server-performance.com/2007/isolation-levels-2005/

Saturday, July 11, 2015

Hadoop cluster and how it stores a file when deploying

Hadoop is no longer a new word; everyone knows it and everyone knows why we need it. During my last two presentations, I explained the Hadoop cluster and how it stores a file when a file is placed. Here is the image I used for explaining the Hadoop cluster.


A Hadoop cluster consists of a couple of components, with Master nodes and Slave nodes as the main components. Master nodes are responsible for managing and coordinating services and tasks (e.g., using the Name Node) and Slave nodes are responsible for storing and processing data, providing resources like CPU and memory.

Generally, Hadoop is configured on rack-based servers. On top of each rack, a network switch is configured for intra-rack communication, and another network switch handles communication between the rack switches and the client that runs Hadoop client-related software.

Hadoop uses HDFS for holding files. It is responsible for breaking large files into smaller chunks (128 MB, configurable), placing them on different slave nodes, and replicating them for providing high availability. Here is a video that shows how a file is distributed in the Hadoop Distributed File System;


Friday, July 10, 2015

Understanding SQL Server Recovery Process [Discussion on Transactions - II]

This is the continuation of the post I made: Understanding transaction properties with Microsoft SQL Server.

During the same discussion, we talked about how SQL Server handles incomplete transactions in the event of a power failure or system crash. In order to understand this, we need to understand a process called Recovery Process. This process takes place every time SQL Server is started or when a restore operation is performed.

The Recovery Process makes sure that incomplete transactions are either rolled back or rolled forward. This is possible because the necessary information is available in the log file even though data files are not updated with the modifications performed by our transactions. When a modification is performed, the log is updated with information such as values related to the modification, the type of the change, page numbers, and the date and time of the beginning and end. SQL Server makes sure that the message that says "transaction is completed" is sent to the client only after the relevant information is written to the log. Remember, the physical log is immediately updated with our transactions, but data pages on the disk (data files) are not updated immediately. SQL Server updates the data pages loaded into the buffer, and pages on the disk are updated by a process called Checkpoint.

Checkpoint writes updated data pages in memory to the data files. In addition to that, it updates the log file with information related to transactions that are in progress.
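
Here is a minimal sketch of where these pieces fit, assuming a hypothetical dbo.Accounts table; CHECKPOINT can also be issued manually as shown.

BEGIN TRAN;

-- the modification is recorded in the log and applied to pages in the buffer
UPDATE dbo.Accounts -- hypothetical table
 SET Balance = Balance - 100
WHERE AccountId = 1;

COMMIT TRAN;

-- writes the updated (dirty) pages in the buffer to the data files
CHECKPOINT;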

When the Recovery Process runs, it checks for transactions that have updated the log but not the data files. If found, it redoes the transactions, applying the modifications recorded in the log to the data files. This is called the redo phase of recovery.

If the Recovery Process finds incomplete transactions in the log file, it checks for updated data pages in the data file related to those transactions and, using the information available in the log file, it undoes all their modifications from the data file. This is called the undo phase of recovery.

This video shows how the Recovery Process works with complete and incomplete transactions. It shows how transactions are started on the time line, how the log is updated, and how the data files are updated. Finally, it shows how the Recovery Process works on all transactions and recovers them, performing either a Redo or an Undo.



Tuesday, July 7, 2015

Understanding transaction properties with Microsoft SQL Server [Discussion on Transactions - I]

I had a chance to discuss an old topic with a few of my colleagues; it was on transactions and isolation levels. Although it is old and well known to experienced engineers, it is something new and important to know for many. Therefore, I thought to make some notes on the discussion; here is the first one.

What is a transaction? The simplest way to understand it is to consider multiple operations to be done as a single operation. In the database management system world, it is a unit of work that contains one or more operations querying data, modifying data, or modifying the definition of the data. With SQL Server, we implement explicit transactions using BEGIN TRAN, COMMIT TRAN and ROLLBACK TRAN. In addition to that, SQL Server supports implicit transactions and auto-commit transactions.
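
Here is a minimal sketch of an explicit transaction, assuming a hypothetical dbo.Accounts table; the whole unit either commits or rolls back.

BEGIN TRY
 BEGIN TRAN;

 UPDATE dbo.Accounts SET Balance = Balance - 100 WHERE AccountId = 1;
 UPDATE dbo.Accounts SET Balance = Balance + 100 WHERE AccountId = 2;

 COMMIT TRAN;
END TRY
BEGIN CATCH
 -- if any statement fails, undo the whole unit of work
 IF @@TRANCOUNT > 0
  ROLLBACK TRAN;
END CATCH;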

Transactions have four properties. Here are details we discussed on them aligning with SQL Server.

Atomicity: A transaction is considered an atomic unit of work. As explained above, it includes one or more operations. However, transaction completion is not just completing one operation; either all operations involved should complete or none of them should. SQL Server automatically rolls back transactions that have errors if it runs in auto-commit mode, and we need to roll back ourselves if explicit or implicit transactions are used. If the system fails before the completion of the transaction, upon restarting the service, SQL Server undoes all incomplete transactions during the Recovery Process.

Consistency: This refers to the state of the data (or database). When a transaction needs to be started on data, the data needs to be in a consistent state in order to be accessed. In SQL Server, if data is being modified, with the default behavior, the state of the data is not consistent and the transaction has to wait, unless a different isolation level is used. SQL Server allows the transaction to start once the data is consistent, and the transaction goes through all the integrity rules set and completes, bringing the data into another consistent state.

Isolation: This refers to controlling data access as the transaction expects. Basically, if a transaction expects to get exclusive access even for data reading, this property makes sure that the data is at the required consistency level for accessing. Once allowed, if other concurrent transactions require the same data, access is allowed based on the way the data is being accessed by the first transaction. SQL Server offers two methods for handling isolation: locking and versioning. Both locking and versioning are available in an on-premises SQL Server installation and the default is locking. Locking makes sure that a transaction cannot read data if it is inconsistent. Versioning is a bit different; if data is inconsistent, it allows reading the previous consistent state of the data. This is the default for Azure SQL Databases. This can be controlled by implementing different isolation levels with transactions.

Durability: Once the transaction is committed, it is considered durable even in a disaster. SQL Server uses a write-ahead mechanism for handling this; it makes sure committed transactions are written to the transaction log before writing them to the data files. If something happens before committed data is written to the data file, the data can still be recovered using the info recorded in the log file. Generally, at the restart of the SQL Server service, it checks the transaction log, rolling forward transactions that are written to the log but not to the data files and rolling back transactions that are incomplete.

Monday, July 6, 2015

SQL Server 2016 features: Polybase: SQL Server connector to Hadoop

** The latest article on this based on RC3 is available as SQL Server 2016 - Manipulating data in Azure Storage using PolyBase.

Polybase is not a new product in the Microsoft SQL Server product suite, but it is new to many of us, or it is not a well known component to developers as it was limited to one edition. Microsoft started adding Big Data related components to the Microsoft platform a long time ago and, as a result of it, HDInsight (Hadoop in Azure) and the Analytics Platform System (Parallel Data Warehouse) appliance were introduced. Polybase was part of APS and it was limited to that edition. In a way, it acted as a bridge between Hadoop and relational databases, allowing us to query files (structured, semi-structured and unstructured) in Hadoop using a familiar language: SQL.

The good news is, now Microsoft has included this functionality in other editions as part of Microsoft SQL Server 2016. It allows you to process data in files (obviously large files) using SQL as if they were tables in the database, when the files are stored in Azure Blob Storage or Hadoop.

As you know, SQL Server 2016 is still in CTP, and it is CTP 2.1 at the time this post is written. This post shows how to use Polybase when text files are stored in Azure Hadoop. Note that the code written is based on the current version and the implementation may change with future releases.

How do you start? First of all, you need SQL Server 2016 installed in your machine. Here are the steps for installing it: Installing SQL Server 2016 CTP 2.1 - Step by Step and Issues faced

Once installed, you need to enable hadoop connectivity using sp_configure. If you run sp_configure, you will see that it is set to zero.



Since I am going to access Hadoop configured in Azure, it needs a value like 4. For more information on the values that can be set, read this: https://msdn.microsoft.com/en-us/library/mt143174.aspx
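
Here is a minimal sketch of setting it with sp_configure; the value used is the one mentioned above.

-- set the connectivity level mentioned above
EXEC sp_configure 'hadoop connectivity', 4;
RECONFIGURE;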

Once set, restart the SQL Server service and the Hadoop related services (shown in the image below). Note that those services are not visible via SQL Server Configuration Manager.




In order to test Polybase, we need an HDInsight cluster configured in Azure. If you have already created one, make a note of the storage name and its primary key. If you have not created one, have a look at this post for creating one and accessing it via CloudXplorer: How to navigate HDInsight cluster easily: CloudXplorer

Let's create a simple text file for testing and place it in HDInsight. I have created a simple text file (of course, it is too small and not the best example for Hadoop, but we will use it for testing) and uploaded it to a new folder called testfolder in HDInsight using CloudXplorer.



Here are the next steps:

1. Enable the 4631 flag using TRACEON and create a Credential for accessing the Azure storage.

-- create a database for testing
CREATE DATABASE TestPolybase;
GO

-- connect with it
USE TestPolybase;
GO

-- enable 4631
DBCC TRACEON(4631, -1);

-- create the master key
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'Pa$$w0rd';

-- create credential for the azure storage
-- Note that even though this creates it
-- I had an error with CREATE EXTERNAL DATA SOURCE
-- hence, did not use this
-- changed core-site.xml instead

--CREATE CREDENTIAL Azure01DineshCredential
--WITH IDENTITY = 'azure01dineshhdistorage' -- storage name
-- , SECRET = 'abc23232njjj'; -- primary key


However, this did not work for me. CREATE CREDENTIAL allowed me to create it, but I could not use it with CREATE EXTERNAL DATA SOURCE. It threw the following error;

Msg 46516, Level 16, State 26, Line 16
The specified credential cannot be found

The reason for this could be that the credential was added to the server, not to the database. However, I could not add it to the database either;



Therefore, I added the credential manually to the core-site.xml file. The file is generally located in:
 C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn\Polybase\Hadoop\conf 



For more info on this, read this: https://msdn.microsoft.com/en-us/library/ms189522.aspx


2. Then we need to create an External Data Source. This sets the location we need to access for querying files. Note that I have not used CREDENTIAL keyword and LOCATION has been set with storage name and container name.

-- create the external data source
-- note that CREDENTIAL has not been used
CREATE EXTERNAL DATA SOURCE Azure01DineshHDIstorage
WITH
(
 TYPE = HADOOP,
 -- wasbs://containername@storagename.blob.core.windows.net/
 LOCATION = 'wasbs://azure01dineshhdinsight@azure01dineshhdistorage.blob.core.windows.net/'
 --, CREDENTIAL = Azure01DineshCredential
);


For more info on this, read this: https://msdn.microsoft.com/en-us/library/dn935022.aspx


3. The next step is creating a File Format. This describes the format of the stored file so that it can be read as a table.

-- create the file format required
CREATE EXTERNAL FILE FORMAT FileFormat
WITH
(
 FORMAT_TYPE = DELIMITEDTEXT
 , FORMAT_OPTIONS
  (
   FIELD_TERMINATOR = ','
  )
);


For more info on this, read: https://msdn.microsoft.com/en-us/library/dn935026.aspx


4. Now we need to create an External Table that references data stored in the location we specify. Here is the code for creating the table. Since I have stored the Log.txt file in the testfolder created in HDInsight, Location is set to '/testfolder'. Data_Source is set to the one I created and File_Format is set to the format created. Reject_type and Reject_value indicate that up to 10 invalid records will be ignored.

-- create the table using file format created
-- and for the file uploaded
CREATE EXTERNAL TABLE LogData
(
 id int
 , userid int
 , nofseconds int
)
WITH
(
 LOCATION = '/testfolder'
 , DATA_SOURCE = Azure01DineshHDIstorage
 , FILE_FORMAT = FileFormat
 , REJECT_TYPE = VALUE
 , REJECT_VALUE = 10
);


For more info on this, read: https://msdn.microsoft.com/en-us/library/dn935021.aspx

All done. If you go through Object Explorer, you should see all we have created.



Now we should be able to query the file like below;
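
Here is a sketch of the kind of query that can be run against the LogData table created above; the column list follows the table definition.

-- the file is queried as if it were a table
SELECT TOP (10) id, userid, nofseconds
FROM LogData;

-- simple aggregations over the file work too
SELECT userid, SUM(nofseconds) AS TotalSeconds
FROM LogData
GROUP BY userid;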



Since this is still being implemented, we cannot test many things against Hadoop, but we can understand how it is going to work in future releases.

Sunday, July 5, 2015

SQL Server Brain Basher of the Week #019 - TSQL

Here is a challenge related to TSQL.

What would be the result of below query?

SELECT *
FROM ( VALUES
(1, 'abc'),
(2, 'lmp'),
(3, 'xyz')) AS d (Id, Value);


  1. Error that says Incorrect syntax near the keyword 'values'..
  2. Error that says Column name or number of supplied values does not match table definition.
  3. Error that says Incorrect syntax near the keyword 'SELECT'.
  4. Tabular resultset with two columns and three rows.
Though it looks like the syntax of the code is wrong, it works without any issue. The reason for this is the enhanced VALUES clause. The VALUES clause was enhanced in SQL Server 2008 to support multiple row inserts with a single INSERT statement. This enhancement was not limited to that; it also allows us to construct a derived table using a standard table value constructor, as above.

Now you know the answer. If you execute the above, you will see a resultset with two columns named Id and Value and three rows.
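
For comparison, here is the multi-row insert form of the same table value constructor, using a hypothetical dbo.Demo table.

-- hypothetical table used only for the illustration
CREATE TABLE dbo.Demo (Id int, Value varchar(10));

INSERT INTO dbo.Demo (Id, Value)
VALUES
(1, 'abc'),
(2, 'lmp'),
(3, 'xyz');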


Friday, July 3, 2015

What is Schema-on-write and Schema-on-Read?

When subjects related to Database Management Systems are discussed, the term Schema-on-write is not widely used (or not popular), though that is what we have been using as a method for writing and holding data. This method forces the record to be matched with the defined schema before writing it into the storage. For example, in order to hold Customer data, we create a table in a traditional database with appropriate columns and fixed data types. When a record is inserted, the record has to be aligned with the defined columns (the defined structure) and the record gets validated (via the columns defined, their types and constraints) before being inserted into the table. This slows down the insert operation, but the consistency of the record is guaranteed. This method is called Schema-on-write.
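
Here is a minimal sketch of the idea with a hypothetical Customer table; the write is validated against the defined schema, and a record that violates it is rejected at write time.

CREATE TABLE dbo.Customer
(
 CustomerId int NOT NULL PRIMARY KEY
 , Name varchar(100) NOT NULL
 , CreditLimit decimal(18,2) CHECK (CreditLimit >= 0)
);

-- this insert fails the CHECK constraint, hence the record is rejected before it is stored
INSERT INTO dbo.Customer (CustomerId, Name, CreditLimit)
VALUES (1, 'Customer A', -100);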



Schema-on-read is a method that matches data to the schema as it is read from storage. Traditional database management systems do not employ this, but newer platforms like Hadoop use methods like this for processing data, specifically for semi-structured and unstructured data. For example, if there is a dataset that holds some log records formatted as semi-structured or unstructured data, its content can be read with a schema defined as per the requirement. The requirement may need only a few elements of the content, and the schema defined for reading addresses only the required elements. There can be another requirement on the same content that needs to read it differently; a new schema is then applied when reading. This slows down the read operation as the data being read has to be checked against the defined schema.


Thursday, July 2, 2015

Installing SQL Server 2016 CTP 2.1 - Step by Step and Issues faced

Microsoft has made SQL Server 2016 CTP 2 available and it can be downloaded and installed for seeing new features. It is available for download at: https://www.microsoft.com/en-us/evalcenter/evaluate-sql-server-2016

Since this is still a CTP, you may get issues while installing and working with it; therefore, it is always recommended to install it on a separate box. If you do not have a physical box, install it in a VM.

Here are the steps for installing SQL Server 2016 CTP 2.1. I have added only most relevant steps with issues I faced and workaround applied.

As usual, you need some prerequisites. Make sure the following are installed on your machine (or VM) before launching the installation.

Microsoft .NET Framework 3.5 Service Pack 1

Oracle JRE 7 Update 51 (64-bit)

Once installed, start with SQL Server installation setup;

1. Once the setup is launched, SQL Server Installation Center appears; select the Installation tab and click on New SQL Server stand-alone installation or add features to an existing installation.


2. When you get the Product Key page, DO NOT SELECT Specify a free edition. If you select it, the installation might get stuck and you will not be able to continue.


Select Enter the product key radio button and click on Next. You do not need a key to continue.



3. Accept license terms and continue with License Terms page.

4. Next is Microsoft Update. It is always better to check for updates.

5. Continue with Install Setup Files.

6. Continue with Install Rules. Make sure that the status of everything is Passed.

7. Next is Setup Role. Select SQL Server Feature Installation and click Next.



8. Select your required features from Feature Selection. This is what I selected.




9. If all prerequisites are not installed, it will be reported on the Feature Rules page. If no issues are reported, click on Next to continue.



10. Since it is a new installation, install it as the default instance. If needed, it can be installed as a named instance too.




11. Next is the Server Configuration page that allows you to configure service accounts. Configure them as you want and continue.

12. Make sure Mixed Authentication is enabled on the Database Engine Configuration page and, if required, change the folder paths for databases on the Data Directories tab. Add the users you need as administrators of the instance.



13. If Analysis Services is selected, the Analysis Services Configuration page will appear. Select the Server Mode and then add users who need administrative permissions on Analysis Services.



14. Select the Reporting Services Mode if you have selected Reporting Services as part of the installation.



15. Once everything is done, the summary will be displayed; review it and continue.



16. Installation Progress page will show you the progress of the installation.



17. Once the installation is completed, final page will be shown with a message.


You can have a look at new things related to SQL Server 2016 in the usual window: Management Studio. I have highlighted some new things that appeared when I opened it; you will find more.


At this moment, no AdventureWorks databases are available for 2016 but older version databases can be used for testing.