Supporting R in SQL Server 2016
SQL Server R Services combine the power and flexibility of the open-source R language with enterprise-level tools for data storage and management, workflow development, and reporting and visualization. This article introduces SQL R Services and the R language.
What Is R and How Is It Used?
R is the most widely used language for statistics, data mining, and machine learning. (R is also the name of the environment and the engine that executes code written in the R language.) You need to learn how to develop R programs like you need to learn any other programming language.
Before going deeper into the R language, let’s explain what the terms statistics, data mining, and machine learning mean. Statistics is the study and analysis of data collection and the interpretation and presentation of the results.
Data mining is a set of powerful analysis techniques used to discover data patterns and rules that might improve a business. Both statistics and data mining usually use samples. For example, rather than using all of the United States census data, they use representative samples.
Machine learning is programming a system to solve a given problem automatically, using a data set. You might use sample data or complete data to do this.
All these terms overlap somewhat. However, a small but important difference emerges when you think of the users of each method. Statistics is a science, and its users are scientists. Data mining is used for business, and its users are typically businesspeople. Machine learning ‘users’ are often the machines running the programs. But in many cases the same algorithms are used, so there is really a lot of similarity among these branches of applied mathematics in analytical applications.
Finding Standard R Tools
You can download R from the Comprehensive R Archive Network (CRAN) site. You can get an R engine for Windows, Linux, or Mac OS X. After installation, you start working in an interactive mode. You can use the R Console client tool to write the code line by line.
There are many additional tools; the most popular free tool for writing and executing R code is RStudio, which you can download here.
The Drawbacks of Working in R
Of course, the R language and the R engine have their own issues. Security is a potential one. For example, installing additional packages that enhance and add functionalities directly from the code, like many R users typically do, might not be compatible with an organization’s security policies.
In addition, most of the calculations in R are not scalable. This might not be an issue for statistical and data mining analyses, because these typically work with samples. However, machine learning algorithms can consume huge amounts of data, which can create a problem with scale.
R Support in SQL Server and Other Products
In the SQL Server suite, any SQL Server Analysis Services (SSAS) version after 2000 supports data mining. SSAS has included some very popular algorithms with excellent explanatory visualizations, and its data mining capabilities are relatively easy to use. However, the number of algorithms is limited, and the whole statistical analysis is missing in the SQL Server suite.
Until the integration of R, that is. By introducing R in SQL Server, Microsoft has made a quantum leap forward in the statistics, data mining, and machine learning departments.
In SQL Server 2016, you now get a highly scalable R engine – even if not every function and algorithm is rewritten as a scalable one. Still, you will probably find what you need for analyzing a big dataset. You can store an R data mining or machine learning model in a SQL Server table and use it for prediction on new data. You can even store graphs in a binary column and use them in SQL Server Reporting Services (SSRS) reports.
Finally, R support is not limited to SQL Server. You can use R code in Power BI Desktop and Power BI Service. It’s also supported in Azure Machine Learning (Azure ML) experiments.
Microsoft provides a highly-scalable R engine in two flavors:
- R Services (In-Database) – This is the installation that integrates R with SQL Server. It includes a database service that runs outside the SQL Server Database Engine and provides a communication channel between the Database Engine and R runtime. You can install R Services with the SQL Server setup. The R engine includes open-source R components and a set of scalable R packages.
- Microsoft R Server – This standalone R server has the same open and scalable packages, but it runs on multiple platforms.
Security in SQL Server R Services
For security purposes, installing additional packages in In-Database R Services is not that simple; installing packages must be done by a DBA rather than through the R code itself.
In SQL Server, you call R code through a stored procedure. This way, a DBA can apply SQL Server security to the code. In addition, you need SQL Server (or a Windows login or Windows) to run the code that uses SQL Server R Services. This login must have permissions for SQL Server objects, including accessing the database where you run the R code, permission to read SQL Server data, and (if you need to store the results in a table) permission to write data.
For development, you must prepare a client installation. You can do this by downloading Microsoft R Client (start the download here). This installation includes the open source R engine and the scalable packages. You’ll probably also want to install a development IDE, either RStudio or R Tools for Visual Studio.
Of course, you can also download and install the Developer Edition of SQL Server 2016 instead. This way, you get both the Database Engine and the R runtime with its scalable packages.
SQL Server R Services Architecture
The following figure shows the communication process between SQL Server and the R engine.
Here’s what’s going on:
- In the SQL Server Database Engine, you run an R script using the
sys.sp_execute_external_scriptstored procedure. SQL Server sends the request to the Launchpad, a new service that supports the execution of external scripts.
- The Launchpad service starts the launcher appropriate to the script’s language. Currently, the sole available service is RLauncher, so you can only do external scripts in R for the moment. However, the infrastructure is there to support additional programming languages when the time comes.
- RLauncher starts RTerm, the R terminal application for executing R scripts.
- RTerm sends the script to the BxlServer. This is a new executable used for communication between SQL Server and the R engine. The scalable R functions are implemented in this executable.
- The BxlServer uses SQL Satellite, a new extensibility API, to manage fast data transferal between SQL Server and the external runtime. (Again, R is currently the only supported environment.)
Testing an Example of R in SQL Server
Let’s try executing an R script in SQL Server. First, you need to enable the execution of external scripts in the SQL Server Database Engine. Then you can execute the R code with
Configuring SQL Server to Execute R Code
We’ll use the
sys.sp_configure system stored procedure to enable external scripts. You can do this with the following code:
USE master; EXEC sys.sp_configure 'show advanced options', 1; RECONFIGURE EXEC sys.sp_configure 'external scripts enabled', 1; RECONFIGURE;
Executing R Code in SQL Server
Now you can prepare your R code for execution. As mentioned, this is done with the
sys.sp_execute_external_script system stored procedure. The most important parameters are:
@language– This is currently limited to R.
@script– This is the actual script in the external language.
@input_data_1_name– This is the name for the data frame (as seen in the R code’s
@scriptparameter for the first input dataset); the default name is
@input_data_1– This is the T-SQL query that specifies the first input data set.
@output_data_1_name– This is the name of the R object, most probably a data frame, that holds the output data set; the default name is
WITH RESULT SETS– This is where you specify the column names and data types of the R script output as seen in SQL Server.
An Example of Predictive Analytics with R in SQL Server
I’ve made the following example a bit complex to demonstrate the capabilities of R. We’ll make use of the dbo.vTargetMail view from the AdventureWorksDW2014 demo database. Specifically, we’ll be predicting the
BikeBuyer column by executing R code from inside SQL Server. Note: We’re using the Logistic Regression algorithm.
Here’s the actual code:
-- Change the context to the demo database USE AdventureWorksDW2014; -- Input data for the R code is SQL Server data DECLARE @input AS NVARCHAR(MAX) SET @input = N' SELECT CustomerKey, MaritalStatus, Gender, TotalChildren, NumberChildrenAtHome, EnglishEducation AS Education, EnglishOccupation as Occupation, HouseOwnerFlag, NumberCarsOwned, CommuteDistance, Region, BikeBuyer FROM dbo.vTargetMail;' ; -- The R code that uses the inout data and performs predictions DECLARE @RCode AS NVARCHAR(MAX); SET @RCode = N' TMSQL <- rxImport(inData = sqlTM, reportProgress = 3); bbLogR <- rxLogit(BikeBuyer ~ MaritalStatus + Gender + TotalChildren + NumberChildrenAtHome + Education + Occupation + HouseOwnerFlag + NumberCarsOwned + CommuteDistance + Region, data = TMSQL); prtm <- rxPredict(modelObject = bbLogR, data = TMSQL, outData = NULL, predVarNames = "BikeBuyerPredict", type = "response", checkFactorLevels=FALSE, extraVarsToWrite = c("CustomerKey"), writeModelVars = TRUE, overwrite = TRUE); OutputDataSet <- prtm'; -- Executing the R code EXEC sys.sp_execute_external_script -- In SQL Server 2016,the only external language supported is R @language = N'R', -- R code parameter - R code from the variable @RCode @script = @RCode, -- Defining the input data - T-SQL query from the variable @input @input_data_1 = @input, -- Input data name for the R code @input_data_1_name = N'sqlTM' -- Defining the result set structure WITH RESULT SETS (( BikeBuyerPredict FLOAT, CustomerKey INT, BikeBuyer INT, MaritalStatus NCHAR(1), Gender NCHAR(1), TotalChildren INT, NumberChildrenAtHome INT, Education NVARCHAR(40), Occupation NVARCHAR(100), HouseOwnerFlag NCHAR(1), NumberCarsOwned INT, CommuteDistance NVARCHAR(15), Region NVARCHAR(50) ));
The abbreviated results show a couple of rows with only four columns. However, this is enough to see the actual and the predicted value of the
BikeBuyer column – and to see that the R algorithm works.
BikeBuyerPredict CustomerKey BikeBuyer MaritalStatus ----------------- ----------- --------- ------------- 0.797310057618303 11000 1 M 0.743479932863631 11001 1 S 0.74022757411714 11002 1 M
SQL Server and R: A Powerful Combination
Supporting R in SQL Server is probably SQL Server 2016’s most important new feature. The integration with the Database Engine solves the issues that prevented R usage in many enterprises: scalability and security. SQL Server and R together make a compelling platform for the most advanced analytical tasks.