Sunday, June 25, 2017

Database Normalization - 1NF, 2NF, 3NF, BCNF, 4NF and 5NF with examples

Normalization is a process of identifying the optimal grouping (relations at the end) for attributes that satisfies data requirements in an organization. It is a database design technique we use after completing ER modeling. This process identifies relationships between attributes (called functional dependencies) and applies series of tests described as normal forms.

There are many articles written on this and you can find examples for almost all normal forms. However many articles explains the theory with the same scenario, hence thought to make a post with different set of examples that I use for my lectures.

As explained above, the purpose of normalization is to identify the best grouping for attributes that ultimately forms relations. Some of the characteristics of relations formed are;
  • Support of data requirements with minimal number of attributes
  • Relation holds attributes with a close logical relationship (functional dependency)
  • Relation holds minimal redundancy with each attribute (except foreign keys)
    • increasing the performance of updates
    • reducing the storage consumption
    • avoiding update anomalies (insertion, modification and deletion)

Functional Dependency

Let's try to understand functional dependency first. This speaks about the relationship between attributes in a relation. For an example, if EmployeeCode and FirstName are attributes of Employee relation, we can say that FirstName is functionally dependent on EmployeeCode. This means, each EmployeeCode is associated with exactly one value of FirstName. We denote this like;

EmployeeCode  -> FirstName

Remember, in above relationship, we call the EmployeeCode as determinant. Basically the left-hand side of the arrow is considered as the determinant. The relationship between left to right is always one to one (1:1).

If the right-hand attribute is fully dependent on left-hand side, we call this dependency as full functional dependency. If the left-hand side is a composite one (two or more attributes) and right-hand side can be determined by part of left-hand side, then the dependency is considered as partial dependency (you will see an example of it later).

Why need to identify functional dependencies? One of the reasons for that is, identifying the best candidate for the primary key. Once functional dependencies are identified, we can analyze all and select the most suitable determinant as the primary key.

First Normal Form (1NF)

The definition of this goes as A relation in which the intersection of each row and column contains one and only one value. Let's try to understand this with an example. The following table shows an identified relation for Student Registration for courses. As you see, a tuple represents a registration that is done for a date.

In order to make sure that the relation is normalized for 1NF, we need to make sure that;
  • No multiple values in intersection of each row and column
  • No repeatable groups in attributes (like Course1, Course2, Course3... columns)
  • Order of attributes and tuples are insignificant
  • No duplicate tuples in the relation.
You can see that Course attribute has multiple value that violates the 1NF. There are multiple ways for addressing this but if I need to handle it without decomposing the relation, I can organize my tuples as below.

Since the relation has no multiple values in intersections and no repeatable groups, it is now a 1NF relation.

Second Normal Form (2NF)

The definition of second normal form is A relation that is in First Normal form and every non-primary-key attribute is fully dependent on the primary key. What is says is, there should not be partial dependency between primary key and non-primary key.

Let's try to set the primary key for above table. For that, let's list out some functional dependencies;

StudentCode, Course  ->  Name, Town, Province, Course, DateRegistered
StudentCode  ->  Name
Town  ->  Province

Considering above identified functional dependencies, I can easily pick the first one, that is StudentCode, Course as my primary key because the combination of them can be used for identifying the tuple easily.

Okay, now the primary key is StudentCode+Course. However, we know that StudentCode  -> Name relationship is still exist. This means that Name can be determined by part of the primary key, that is partial dependency. This is the violation of second normal form.

We can decompose the relation now into two for making sure that relations do not violating the 2NF.

Note that you will not see violation of 2NF if the primary key is based on just one attribute.

Third Normal Form (3NF)

This normal form speaks about transitive dependency. The definition goes as A relation that is in First and Second Normal form and in which no non-primary-key attribute is transitively dependent on the primary key.

This says that we should remove transitive dependency if they are exist. What is transitive dependency? It is a condition such as in Student relation, StudentCode determines the Town (StudentCode  ->  Town  - There is only one two associated with a given StudentCode) and Town determines the Province (Town  ->  Province), therefore StudentCode determines Province (Note that, as per this relation StudentCode detemines Province but the issue is it can be determined by Town too). This is transitive dependency. In other words, if you see that Attribute A determines B (A  ->  B) and B determines C (B  ->  C), then A determines C (A  ->  C).

For removing transitive dependency, we need to decompose the relation.

Boyce-Codd Normal Form (BCNF / 3.5NF)

This is an extension of 3NF and it is sometime treated as 3.5NF. This makes the 3NF more stronger by making sure that every non-primary-key determinant is a candidate key with identified functional dependencies. The definition goes as A relation is in BCNF, if and only if, every determinant is a candidate key.

What does it exactly means? You have already seen that we can identify many functional dependencies in a relation and we pick one for defining the primary key. The determinants of other identified functional dependencies can be candidate keys for the primary key or they might not be qualified for the primary key. If you see that all determinants are qualified, means you can mark them as the primary key if need, then your relation (table) is in BCNF.

Take this example.

Assume that business rules related to this relation are as follows;
  1. Course has one or more subjects.
  2. Course is managed by one or more lecturers.
  3. Subject is taught by one or more lecturers.
  4. Lecturer teaches only one subject.
If you consider the primary key of this table is Course + Subject, then no violation of 1NF, 2NF and 3NF. Let's list out all possible functional dependencies.
  1. Course, Subject  ->  Lecturer
  2. Course, Lecturer  ->  Subject
  3. Lecturer  ->  Subject
Now, based on the identified functional dependencies, see whether you can make determinants as candidate keys. If you take the first one, we can clearly say that Course + Subject is a candidate key. Second one that is Course + Lecturer is also a candidate key as we can identify tuples uniquely using it. However the determinant of the third one cannot be used as a candidate key because it has duplicates. You cannot make Lecturer as a primary key. Now you have a determinant that cannot be set as a primary key, hence it violates BCNF.

In order to make the table BCNF table, need to decompose as below.

Forth Normal Form (4NF)

This normal form handles multi-valued dependencies caused by 1NF. When we see repeated groups or multiple values in an intersection, we add additional tuples removing multiple values. That is what we do with 1NF. When there are two multi-value attributes in a relation, then each value in one of the attributes has to be repeated with every value of the other attribute. This situation is referred as a multi-valued dependency. See below relation;

If we apply 1NF to this relation;

The definition of the multi-valued dependency goes as Represent a dependency between attributes in a relation, such that for each value of A there is a set of values for B and set of values for C. However the set of values for B and C are independent of each other. This dependency denotes as A ->> B.

See the CustomerContacts table. CustomerCode determines multiple Telephone (CustomerCode ->> Telephone) and CustomerCode determines multiple Address (CustomerCode  ->>  Telephone).

The forth normal form is describes as A relation that is in Boyce-Codd normal form and does not contain nontrivial multi-valued dependencies. This talks about one type of multi-valued dependency that is nontrivial. Trivial relationship means; if B is subset of A or A U B = R. Else it is Nontrivial. As you see, CustomerContact contains nontrivial dependencies, hence need to decompose the table as below.

Fifth Normal Form (5NF)

In order to normalize relations, we decompose the relations into multiple relations. Although multiple divided relations optimize transactions and avoid anomalies, it adds a cost for data retrieval as relations have to be rejoined. The biggest risk with rejoining is, producing inaccurate outputs in certain conditions. 

When we decompose a relation into two relations, the resulting relations have the property called lossless-join that makes sure rejoining two relations produce the original relation. The definition of lossless-join is, A property of decomposition, which ensures that no spurious tuples are generated when relations are reunited through a natural join operation.

Now let's try to understand Fifth Normal Form. When decomposing a relation into multiple relations for minimizing redundency, it might introduce join dependency, that might create spurious tuples when they are reunited. The definition of Join Dependency goes as for a relation R with subsets of the attributes of R denoted as A, B, ..., Z a relation R satisfies a join dependency if and only if every legal value of R is equal to the join its projections on A, B, ..., Z. Considering this, definition of Fifth normal form goes as A relation that has no join dependency.

Since this is very rare to see in database design, let's try to understand with an example. See the following table that contains how Lecturers teaches Subjects related to Courses.

Assume that the tuples are formed based on the following scenario;
  • Whenever Lecturer L1 teaches Subject S1,
  • and Course C1 has Subject S1,
  • and Lecturer L1 teaches at least one subject in Course C1,
  • then Lecturer L1 will be teaching Subject S1 in Course C1.
Note that we have this scenario for explaining the 5NF, otherwise you will not see it properly.

Now if I try to decompose this relation into two relations for minimizing redundant data, I will be having these two tables (Sequences are added for understanding joins only);

Now, if I need to rejoin these with Natural Join (Read about Join at: ), this will be the result.

See the highlighted one. It is the result of Join Dependency. It is a spurious tuple which is not valid. In order to avoid it, in order to make the relations as 5NF relation, let's introduce another relation like below;

Now, if we rejoin all three, we will not see any issue.

Therefore, in order to minimize redundant data and make sure no join dependency, make sure relations are formed as 5NF relations.

Wednesday, June 21, 2017

Power BI Parameters - What are they and when they can be used?

Microsoft Power BI supports adding parameters for queries and use them with various areas. However, since it has some limitations and cannot use with all areas, it has not been noticed much. Let's talk about Power BI parameters and see the way of using it.

Let's take a scenario and see how parameters can help on that. I have a set of CSV files that shows monthly sales. We need to to create a report using one of the files and then publish. Whenever the report needs to be refreshed for a different file, I need to change the data source by changing many things. What if I can just open the report using Power BI Desktop and change the source using a menu and get the report refreshed rather changing many things?

It can be done with Power BI Parameters. We can create a parameter with predefined values (in this case, source file names along with paths) and use it for changing the source.

Power BI Parameter can be created in three ways;
  • Parameter that accepts a value via an input box
  • Parameter that is formed using a predefined list
  • Parameter that is formed using another query
Parameter with a predefined list
Let's focus on the second one first. Here are the steps for creating a Power BI report with a parameter;

1. Open Power BI and connect with a CSV file. In my case, the file name is Sales_201501.csv.

2. Once the file is loaded, click Edit Queries in Home ribbon for opening Query Editor. You should one query under Queries

Parameters can be created only with Query Editor. However, created parameters can be accessed in both Query Editor and Data Model.

3. Click Manage Parameters in Home ribbon. Once the dialog box is opened, click New to create a new Parameter.

4. Name the parameter as SourceFile

5. Select Text for Type.

6. Select List of values for Suggested values.

7. Enter File Names along with paths.

8. Set Default Value as you need. This is the default value for the designer.

9. Set Current Value as you need. This setting is for the end user, or to use with the report level.

Now you should see the parameter in Queries. If you want to open it as another table in the data model, you can right click on it and select Enable Load. We do not need to enable this option for using the parameter at report level.

Now we need to make sure that source of our file is set with the created parameter. There are two ways of doing it. One is open the Advanced Editor and change M Code. Other way is, using Data source settings. Let's use the second option. If you need to know how the M Code is getting changed, open it and see after the change.

10. Click Data source settings in the Home ribbon. This opens the Data Source Setting dialog box.

11. Select the source and click Change Source.... This opens a dialog box that allows you to change settings related to your source.

12. You should notice that the button given for File Path has an option for creating parameters and selecting a parameter.

13. Select Parameter as the option and select name parameter created.

14. Click OK and close the Data source settings.

15. All set. Click Close & Apply for saving the queries and closing Query Editor.

16. Create a visual as you can distinguish results when different file is selected. This is what I created from my file.

Let's change the source using the parameter now.

17. Click down-arrow in Edit Queries in Home ribbon. You should see three options. Select the last one that is Edit Parameter.

18. You should see Enter Parameters dialog box that shows all parameters. Change the source file to different one.

19. Once clicked OK, you should be prompted for Applying. Apply the settings and you should see the report with new source now.

Like this way, we can use the parameter in other sections like Filtering rows in Query Editor, Replacing, etc. We still cannot use the parameter with DAX and it is one limitations.

Saturday, June 17, 2017

Relational Algebra - Joins - Theta Join, Equijoin, Natural Join, Outer Join, Semijoin

I wrote a post on Relational Algebra that discusses most of operations related to it. This is the continuation of it and this explains Join Operations related to Relational Algebra. You may find that it is different with Joins we do now but it is the foundation for all type of joins we do with our relational databases.

Join Operations

Join is one of the main operations we perform against relations (tables) for retrieving data. It is done over the Cartesian Product of the two operand relations, using a SELECT statement with a Predicate. We are familiar with Joins like INNER JOIN, OUTER JOIN and CROSS JOIN but initially there were introduced with types like Theta Join, Equijoin, Natural Join, Outer Join and Semijoin. Modern DBMSs have enhanced these and have different implementations and that is why we do not see these types with mentioned names. But let's try to understand each of these types and how they are represented with Relational Algebra. Note that I have used different SQL Statement but it can be implemented using many techniques. Remember, almost all Joins are based on Cartesian Products.

Theta Join

This is based on a Predicate added to a Cartesian Product. In simple term, if you have joined two tables using CROSS JOIN, then you can add a filter to the result using one of the comparison operators. See the example given. Note that it can be implemented using SELECTION over a Cartesian Product as well.


This is same as Theta Join but the comparison operator is equal. Generally, if the operator of the Theta Join is equal operator (=), then the join is called as Equijoin instead of Theta Join, Here are two examples;

Natural Join

Natural Join is an Equijoin of two relations over all common attributes. In other words, when joining two tables, join is done using all common columns. Therefore, explicit Predicate is not required. See the sample given. I have used NATURAL JOIN which is not available with some DBMSs. Note that Common Attributes are not duplicated.

Outer Join

This join type includes both matching and no matching values from one relation and matching values from the other relation when two relations are joined. The relation that returns all tuples is determined using the Symbol used for the operation. If the Symbol is opened for the Left Relation, it is considered as the relation that returns all tuples. This is implemented using either LEFT or RIGHT in SQL.


Here is the last one. This join performs a join operations over two relations and projects over the attributes of first operand (or the relation). With this join, tuples can be limited for the join operation by adding a predicate, increasing the performance of the join operation.


Wednesday, June 14, 2017

Power BI - Scheduled Refresh section is not available in Settings

Power BI Desktop allows us to import data from files such as CSV and Excel. Once imported, it automatically creates a Model that can be used for creating reports. Good thing is, when we publish the report, Dataset that is created with the model is also getting published. This makes sure that we can create a schedule and refresh the Dataset published for seeing latest.

See below image. It shows the Gateway connection and Scheduled refresh settings for an uploaded text file. The Dataset created for the file is PowerBI Test Report.

Now if I get the settings related to the Dataset called Sales_201501;

I will not see same settings. What is the reason? It is also based on a text file.

This will happen when the uploaded Dataset is not a Model. This file has been uploaded using Power BI Services (, not using Power BI Desktop. Unlike Power BI Desktop, Power BI Services does not create a Model when a file is uploaded and uploaded file will be sill treated as a Dataset but not as a Model.

If you need to get your Dataset continuously refreshed by creating a schedule, then make sure that it is created using Power BI Desktop.

You can get the uploaded file refreshed automatically if the file is in either SharePoint or OneDrive.

Saturday, June 10, 2017

Power BI - Cannot highlight or pin visuals when grouped with shapes

Power BI allows us to add shapes like rectangle or oval for marking specific area, showing covered visuals as they are related and operate together. This is really useful when you need to highlight a group of visuals that has same behavior or the visuals are related to same business process.

When you have visuals, you can add the required shape and change properties as you need. The below image shows an added rectangle with specific Title and filled color with transperancy set to 75%.

Once added and set, it can be nicely seen with the report. But you may notice that you have lost some of the functionalities. You cannot highlight a slice of Pie Chart and make other visuals filtered. You cannot pin the visuals for Dashboards when they are published to online service.

However, this can be handled. The reason for this is, the added rectangle appears on top of visuals. If you send the rectangle "back", then this issue gets sorted.

As you see, now you can highlight slices in the Pie Chart and you can pin visuals to a Dashboard.

Wednesday, June 7, 2017

Naming columns in the View

How do you name columns returned from your created view? Generally, you do not need to explicitly name them as the names of view-columns are derived from base-tables. But, if there is any ambiguity in the name for a column or calculated column, then it must be named explicitly.

What are the ways of naming columns?

There are two ways of naming columns in the view. One way is, mentioning names along with CREATE VIEW statement. This requires names for all columns, you cannot just name set of columns in the view. Here is an example of it;

Other way is, adding aliases to columns. Most of us prefer this way because we do not need to name all columns explicitly. In most cases, we need to name columns that has ambiguity in the name or it is a calculated column. Remember if above technique is used for naming columns, aliases added will be overwritten.

Tuesday, June 6, 2017

Creating Horizontal Views and Vertical Views

Everyone has either used or created views for getting or creating a dynamic result using one or multiple tables. And everyone knows the usage of views and purpose of creating them. But many do not how they have been categorized; such as Horizontal and Vertical. Here is a simple explanation on them.

Horizontal Views

If you think about the standard definition of a view, it is a virtual relation (table) that is result of one or more relational operations on one or more relations (tables). In simple terms, is a logical table defined with a SELECT statement based on one or more tables. Now, what is a Horizontal View? It is a view that limits the number of records to be produced. We create Horizontal Views by adding WHERE clause.

USE AdventureWorks2014;

CREATE VIEW dbo.Sales_2012
SELECT * FROM Sales.SalesOrderHeader
WHERE YEAR(OrderDate) = 2012;

Vertical Views

This view results set of selected columns from one or more tables. We implement this by selecting required columns, optionally adding the WHERE clause.

USE AdventureWorks2014;

CREATE VIEW dbo.Sales_2012_Summary
SELECT SalesOrderNumber, OrderDate, SubTotal 
FROM Sales.SalesOrderHeader
WHERE YEAR(OrderDate) = 2012;

In addition to that, there is another type called Grouped and joined views that is actually either a Horizontal View or Vertical View with aggregations and multiple joins, hence I do not think that it has to be taken as a major type.

Monday, June 5, 2017

What should be considered on Visualization?

Creating a report with stunning, meaningful and appropriate visuals is the key for delivering the information quickly, hence special consideration has to be made when designing reports. Otherwise, the created report will not be attractive and users will not see what they want and as they want.

There are few things need to be considered when designing reports;
  • Placement - you need to make sure that visuals added to the report are placed or positioned at the right place. How do we decide? Generally, most important visuals (A KPI for an example) should be placed in the top-left corner of the report because it is the place everyone looks at immediately. Not only that, when the report is opened using a mobile device, this makes sure that the important ones are displayed first. In addition to that, make sure the space between visuals is consistence and sized appropriately.
  • Appropriate Visuals - If something can be displayed using a Column Chart does not mean that it is the best visual for showing it. Therefore, right visual should be selected for displaying information. For an example, if just one value has to be shown, something like Revenue for the month, a visual like KPI or Card can be used. If something needs to be shown with values against a set of parameters, something like, showing revenue by month should be displayed with Column Chart. If something needs to be shown against set of parameters using percentage, not values, Pie or Donut will be appropriate.
  • Story Telling - Make sure that you report contains relevant and related information only, it should not be cluttered. If the report is created for Sales Department, make sure it shows information related to it. Not only that, if the report contains set of visuals related another area, example, a process, highlight it or group them. You can add a unfilled rectangle covering visuals related.
  • Formatting - The size of the visual, font used, font size, color and labels, all matters. Make sure that the title is properly displayed for visuals and you can make some visuals large and some small based on the importance of them. Same goes for font size. If it is a KPI and it is most important one, increasing the font size of it will definitely add a value to the report. Another important thing in formatting is, showing numerals properly. No need to show the whole number unless the visual is something like table. Generally, if the number is really large, consider to show it in thousands or millions.
  • Coloring - There are two things to consider with coloring; background color and colors on visuals. Generally light background color will do a lot as visuals can be highlighted with most of the colors. Therefore it is recommended to make the background light, something like white. However, making the background dark has become a trend because it makes the dashboard elegant. But when selecting colors for visuals, you need to be very careful as dark colors in visuals may not blend properly with a dark background. Another consideration is, company theme. Your company may have a color theme for all web applications and you might have to stick to it. If so, select appropriate colors for visuals based on the theme.
    What about colors used with visuals? One thing you need to remember is, the standard color-set used with BI. Generally, we use Green for Excellent, Amber for Neutral and Red for worse. Therefore, you should use a color like Red for indicating a success of something. Another thing is, do not use all these three colors with all types of visuals. For an example, when a KPI is shown, one color is used for either coloring font or background for the visual that is fine. Similarly, these three colors can be used with a tabular visual with all possible cells that is something you need to consider. You may see your table busy when many cells are colored with all three colors, hence it is better to color only one type of cells with one color.
There are many more tiny things to consider but above considered as the most important considerations. List anything if you think that I have missed some.

Sunday, June 4, 2017

Calculate the average value against a large table - SQL Server Brain Basher of the Week #068

Let's talk about another Interview Question related to SQL Server development. This is about one of the aggregate functions that is AVG. Here is the question;

You have a table called SalesOrderHeader and it has millions of records. It has a column called Freight and you need to the average of it. You need to make sure that only 10% of records is used for calculating the average.

What would be the best way? The standard answer is, write a query to get 10% of records using TOP and calculate the average.

SELECT AVG(t.Freight) AverageFreight
 FROM Sales.SalesOrderHeader) t

But this might not give you the accurate average as you consider only first set of records. It will be more accurate if you pick records randomly and then calculate the average. You may add ORDER BY to your statement with NEW_ID function.

SELECT AVG(t.Freight) AverageFreight
 FROM Sales.SalesOrderHeader

Second method gives the most accurate value as it picks records randomly. However the cost is high with the statement. There is another way to achieve the same. It is using TABLESAMPLE operator.

SELECT AVG(Freight) AverageFreight
FROM Sales.SalesOrderHeader

Here is the comparison between both methods. Notice the total cost.

As you see, TABLESAMPLE gives better performance than the first method.

Since it picks records randomly, the average it returns different at each execution. If you need to same value for all your executions, REPEATABLE option has to be used with repeat_seed

SELECT AVG(Freight) AverageFreight
FROM Sales.SalesOrderHeader

You will get the same average as long as the repeat_seed is same.

Saturday, June 3, 2017

Power BI - No page to enter SQL Server credentials when connecting - Here is the reason

Have you experienced that Power BI just connects with your SQL Server without requesting your login credentials? Or you might need to connect with your SQL Server using someone else credentials but Power BI continues to the next page without prompting the page related to credentials.

This is not a bug or issue, this is because of the cache maintained by Power BI Desktop.

When I try to connect with my SQL Server using Power BI Desktop, I get the initial page where I can enter server name and database name. But when I click on OK, it straightaway opens the page for table selection instead of showing the page related to authentication.

I know that this happens because I have connected to the same server before and my connection is cached. What if I want to connect the same using a different account?

Clearing the data source cache

It is possible to clear the saved information related to previously connected sources. You can open this setting using File menu -> Options and settings -> Data source settings. This page shows all connections you have made previously; Data sources in current file shows all connections you have made with opened file and Global permissions shows all connections you have made with all files

You can clear cached permissions by selecting the connection and clicking Clear Permissions. Once cleared, you should see the page that requests credentials for connecting with the source.

Friday, June 2, 2017

Power BI does not show all columns in SQL Server View

I experienced a strange issue with Power BI today, I might have missed something or there is something unknown to me. This is what I experienced; Power BI does not show all columns when trying to import records from a SQL Server view.

Just for simulating the issue, I created a view with AdventureWorks2014, using three tables, Product, ProductSubCategory, and ProductCategory, and named as vProduct.

USE [AdventureWorks2014]

 SELECT p.ProductID, p.Name Product, s.Name SubCategory, c.Name Category
 FROM Production.Product p
  INNER JOIN Production.ProductSubcategory s
   ON p.ProductSubcategoryID = s.ProductSubcategoryID
  INNER JOIN Production.ProductCategory c
   ON s.ProductCategoryID = c.ProductCategoryID;

Then I granted SELECT permission to Jack user.

GRANT SELECT ON dbo.vProduct TO Jack;

If Jack executes a SELECT against the view, he sees records with all columns.

Now, if Jack connects to the database with Power BI, this is what he sees.

Anyone has experienced this? I made a search but could not find a solution. I might have made a mistake or missed something but could not figure it out. Appreciate if you can share your thoughts or a solution if you have already seen this and sorted out.

I used the latest Power BI version that is May 2017.

Note that I see the same result for both Import and Direct Query.

Wednesday, May 31, 2017


Few days back, I wrote an article on SQL Server default behavior when it expands files with AutoGrowth enabled. Where there are multiple files in the file group and all files are fully filled, SQL Server expands only one file at a time, making the data distribution inconsistence. Read more on it: SQL Server does not always write data to files, proportional to the amount of free space in each file

There is a facility to change this behavior with SQL Server 2016. By default File Group is set to AUTOGROW_SINGLE_FILE but if we set it to AUTOGROW_ALL_FILES, then it changes the default behavior and grows all files when required.

Let me take the same code used with my previous code and show you.

Let's create a database with an additional file group that contains two file. Let's set the new file group as the default and create a table on it. The below code does it and inserts 900 records to the table. And it checks the spaced used;

USE master;


-- Two additional data files are added under FG1
( NAME = N'TestDatabase'
 , FILENAME = N'D:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\TestDatabase.mdf' 
( NAME = N'TestDatabase_Data1'
 , FILENAME = N'D:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\TestDatabase_Data1.ndf' 
( NAME = N'TestDatabase_Data2'
 , FILENAME = N'D:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\TestDatabase_Data2.ndf' 
( NAME = N'TestDatabase_log'
 , FILENAME = N'D:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\TestDatabase_log.ldf' 

USE TestDatabase;

-- Create a table
-- This will be created on FG1 as it is the default
CREATE TABLE dbo.TestTable
 Id int identity(1,1) Primary Key
 , Name char(8000) not null

-- Inserting 900 records
INSERT INTO dbo.TestTable
 (Name) VALUES (Replicate('a', 8000));
GO 900

 file_id, name, physical_name
 , Convert(Decimal(12,2),Round(size/128.000,2)) AS [FILE_SIZE_MB]
 , Convert(Decimal(12,2),Round(FileProperty(name,'SpaceUsed')/128.000,2)) AS [SPACE_USED_MB]
 , Convert(Decimal(12,2),Round((size-FileProperty(name,'SpaceUsed'))/128.000,2)) AS [FREE_SPACE_MB]
 , Convert(Decimal(12,2),Round((size-FileProperty(name,'SpaceUsed'))/128.000/(size/128.000),2)) AS [FREE_PCT_MB]
FROM sys.database_files;

As you see, both files are almost filled. Let's change the default setting of the file group and insert some additional records.


Done. Now let's see whether only one file or both files have been grown.

 file_id, name, physical_name
 , Convert(Decimal(12,2),Round(size/128.000,2)) AS [FILE_SIZE_MB]
 , Convert(Decimal(12,2),Round(FileProperty(name,'SpaceUsed')/128.000,2)) AS [SPACE_USED_MB]
 , Convert(Decimal(12,2),Round((size-FileProperty(name,'SpaceUsed'))/128.000,2)) AS [FREE_SPACE_MB]
 , Convert(Decimal(12,2),Round((size-FileProperty(name,'SpaceUsed'))/128.000/(size/128.000),2)) AS [FREE_PCT_MB]
FROM sys.database_files;

Both files have been grown. This makes sure that data is getting distributed properly and I believe that this should be the standard setting for our databases.

Tuesday, May 30, 2017

Power BI - Changes you should know on Power BI Free Account from June 01

As per the announcement made by Microsoft on Power BI early this month,  Following things will be changed on Power BI Free account;

  • Almost all Power BI Pro features will be available for Free accounts, except sharing and collaboration. Sharing and Collaboration will only be available for Pro account.
  • All types of data sources can be accessed using Free account.
  • Following are increased: 10GB storage, 8 times daily refresh and 1 million rows/hour streaming data.
  • Capabilities NOT included: Group workspaces (now it is app-workspace), Export to PowerPoint, Export to CSV/Excel and Analyze in Excel.
As you see, some of important things are going to vanish while some are getting added. If you have been using Free account, you will be offered a Pro trial account for one year to continue with what you have been doing.

For more info on this, read this thread.

Monday, May 29, 2017

Understanding Relational Algebra

When the relational model was introduced, in order to work with the model or in order to retrieve or update data in the model, languages were introduced. They are called as relational languages. Initially, two languages: Relational Algebra and Relational Calculus were introduced by Codd during 1971 as basis of relational languages.

When you start studying databases and its related languages, these are the two main languages you learn first, and of course, they are not much user-friendly. If you have worked with modern Database Management Systems such as Microsoft SQL Server or Oracle, then you know that how TSQL or PLSQL is powerful and richer when you compare to Relational Algebra and Relational Calculus. However, if you are to study database and related languages, it is always better to start with basis as it explains the fundamentals.

The Relational Algebra

It is a high-level procedural language. It describes how an operation is performed between two relations (tables: that is the word familiar to us that results another relation. The operation is described with expressions and it can be nested, means, an output of one relation can be used to performed another operation.

This language is considered as theoretical language as well as relation-at-a-time language. It manipulates all tuples (records) using one statement without looping them. There are many different types of operations in Relational Algebra but Codd originally proposed eight operations and they are categoried into two: Unary and Binary.

Unary Operations

There are two operations defined under this: Selection and Projection. They are called as unary operations since they operate on only one relation.

Binary Operations

There are six operations under this category: Cartesian Product, Union, Set Difference, Join, Intersection and Division. Unlike unary operations, they work on pairs of relations.

Here are some sample images I used in my classes. Note that SQLs can be written in different ways for getting the output. The one I have written is not the only way for getting the result.


There are set of symbols that are used with operations, here are the standard symbols used with Relational Algebra.


This operation returns tuples from a single relation based on the specified predicate (condition). Multiple predicates can be added using AND, OR and NOT.


This operation returns a relation that contains a vertical subset of used relation. In other words, it returns set of tuples only with specified attributes.

Cartesian Product

This operation creates a relation combining two relations, concatenating every tuple in one relation with every tuple in other relation. In simple term, if the first table has 100 records with 5 attributes and other table has 50 records with 10 attributes, this creates an output with 5000 records (100 * 50) and 15 attributes (5 + 10).


This operation makes a relation containing all tuples from first relation or second relation or both first and second relations. This eliminates duplicates. Remember, both relations used must be union-compatible.

Set Difference

This operation creates a relation containing tuples in first relation that are not in second relation. Just like the previous one, both relations must be union-compatible.


This operation creates a relation containing tuples in both first and second relations. Both relations must be union-compatible.


This operation creates a relation containing selected attributes in first relation, matching with every tuple in second relation. See the image; It tries to answer Which customers are registered from ALL the countries ranked as 2.

I have made a separate post on Joins. Here is the link for it: