Tutorials
This page contains tutorials, in the form of Jupyter notebooks, that will let you interact with and learn how to use SML.
Contents
- Introductory Material
- Reading in Data
- Separating Keywords
- Cleaning Up Data
- Partitioning Datasets
- Using Classification Algorithms
- Using Clustering Algorithms
- Using Regression Algorithms
- Saving / Loading Models
- Plotting Datasets and Metrics
Introductory Material
We assume that you have installed SML; if not, please see Instructions for Installing SML before continuing. The examples below use publicly available data that you can download manually; for simplicity's sake, you can also run the following Python script to obtain the same data and begin running the examples.
Reading in Data
When reading in data using SML, you must use the READ keyword followed by the path to the file, for example:
query = 'READ "/path/to/dataset" '
You can also provide optional arguments by enclosing them in parentheses, for example:
query = 'READ "/path/to/dataset" (sep = ",", header=None) '
The READ keyword's optional arguments are:
- header - Specifies the header of a dataset. (Defaults to None when no header is present; otherwise pass in a list of variable names.)
- sep - Specifies the delimiter of a dataset. (For fixed-width files, use ‘\s+’.)
- dtypes - Specifies the datatype of each column in a dataset.
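For example, a minimal end-to-end sketch of issuing a READ query from Python is shown below. It assumes SML exposes an execute() entry point that accepts a query string; the import name and the dataset path are placeholders for illustration, so adjust them to match your installation and data.
from sml import execute  # assumed entry point; adjust to match your SML installation
# Read a comma-separated file with no header row (path is a placeholder)
query = 'READ "/path/to/iris.data" (sep = ",", header = None)'
execute(query)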
The table below contains examples of SML reading in data from various datasets. To view the tutorials using the READ keyword, click the hyperlinks in the Tutorials column. All of these datasets can be downloaded by clicking the hyperlinks in the Acknowledgement column.
Dataset | Task | Acknowledgement | Tutorials |
---|---|---|---|
Iris | READ | link | notebook |
Auto-MPG | READ | link | notebook |
Wine | READ | link | notebook |
Separating Keywords
To separate keywords in a query, we use the AND keyword to specify that another action will be performed. As you'll find in subsequent sections, you can combine keywords to form complicated queries. For now, consider the following example:
query = 'READ "/path/to/data" (separator = "\s+", header = None) AND \
REPLACE ("?", "mode") AND SPLIT (train = .8, test = .2, validation = .0) AND \
REGRESS (predictors = [2,3,4,5,6,7,8], label = 1, algorithm = simple)'
While you haven't formally been introduced to the REPLACE, SPLIT, and REGRESS keywords yet, this query will perform the following steps:
- Read the dataset, delimited by “\s+”, with no header.
- Replace any values of “?” with the mode of each affected column.
- Split the data using an 80/20 split for training and testing, respectively.
- Perform regression using columns 2-8 of the dataset as features and column 1 as the label, with simple linear regression as the algorithm.
Currently it's not important to know exactly what every keyword in the query does; what matters is that each keyword is delimited by the AND keyword. You'll see the AND keyword used throughout the subsequent sections.
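To make the flow concrete, the sketch below runs the combined query end to end. As before, it assumes SML exposes an execute() entry point that takes the query string; the import name and the dataset path are assumptions made for illustration.
from sml import execute  # assumed entry point; adjust to match your SML installation
# READ, REPLACE, SPLIT, and REGRESS run in order, each separated by AND
query = 'READ "/path/to/data" (separator = "\s+", header = None) AND \
REPLACE ("?", "mode") AND SPLIT (train = .8, test = .2, validation = .0) AND \
REGRESS (predictors = [2,3,4,5,6,7,8], label = 1, algorithm = simple)'
execute(query)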
___
Cleaning Up Data
When working with datasets, values may be missing, or NaNs, NAs, and other troublesome values may be present. You can replace these values in SML by using the REPLACE keyword. The following example shows the syntax for the REPLACE keyword:
'READ "/path/to/data" (separator = ",", header = None) AND REPLACE (missing = "NaN", strategy = "mode")'
The REPLACE keyword requires the value you wish to replace, followed by the metric used to fill in that value for the column. In the code snippet above, you read in some hypothetical dataset and then replace any value of ‘NaN’ with the mode of its column. Currently the following metrics have been implemented:
- mode
- mean
- drop column (removes a column if it contains at least one occurrence of the value being replaced)
- minimum
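For example, the sketch below replaces missing markers with the column mean instead of the mode. The execute() entry point and the dataset path are assumed names used only for illustration.
from sml import execute  # assumed entry point; adjust to match your SML installation
# Replace every "?" in the dataset with the mean of its column
query = 'READ "/path/to/data" (separator = ",", header = None) AND REPLACE (missing = "?", strategy = "mean")'
execute(query)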
Dataset | Task | Acknowledgement | Tutorial |
---|---|---|---|
Titanic | READ + REPLACE | link | notebook |
Auto-MPG | READ + REPLACE | link | notebook |
Partitioning Datasets
In almost all machine learning workflows it's useful to split a dataset into training and testing sets. To split data with SML, you use the SPLIT keyword. The following example shows the syntax for the SPLIT keyword:
query = 'READ "/path/to/data" (separator = ",", header = None) AND SPLIT (train = 0.8, test = 0.2)'
The SPLIT keyword requires train and test values that sum to 1, enclosed in parentheses. Here we read some hypothetical dataset using the READ keyword, include the AND keyword to specify that an additional command follows in the query, and then use the SPLIT keyword to request that 80% of the dataset be used for training and the remaining 20% for testing.
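A validation portion can also be requested, as in the query shown in the Separating Keywords section. The hedged sketch below asks for a 70/20/10 split; the execute() entry point and the dataset path are assumptions for illustration, and the three fractions are assumed to sum to 1.
from sml import execute  # assumed entry point; adjust to match your SML installation
# Hold out 70% for training, 20% for testing, and 10% for validation
query = 'READ "/path/to/data" (separator = ",", header = None) AND SPLIT (train = 0.7, test = 0.2, validation = 0.1)'
execute(query)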
The table below contains examples of SML reading in data from various datasets and splitting the data into training and testing sets. To view the tutorials for the SPLIT keyword, click the hyperlinks in the Tutorial column. All of these datasets can be downloaded by clicking the hyperlinks in the Acknowledgement column.
Dataset | Task | Acknowledgement | Tutorial |
---|---|---|---|
Iris | READ + SPLIT | link | notebook |
Auto-MPG | READ + SPLIT | link | notebook |
Wine | READ + SPLIT | link | notebook |
Using Classification Algorithms
If you want to run a classification algorithm using SML, you use the CLASSIFY keyword. The algorithms currently available for classification are:
- Support Vector Machines (SVM)
- Naive Bayes
- Random Forest
- Logistic Regression
- K-Nearest Neighbors
Consider the following code snippet, which shows the syntax for the CLASSIFY keyword:
'CLASSIFY (predictors = [1,2,3,4], label = 5, algorithm = svm)'
The syntax is to specify CLASSIFY followed by, enclosed in parentheses: the columns of the dataset that you want to use as features, the label you want to classify, and the algorithm that you want to use.
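In practice, CLASSIFY is chained after READ and SPLIT, as in the hedged sketch below; the execute() entry point and the Iris dataset path are assumptions for illustration.
from sml import execute  # assumed entry point; adjust to match your SML installation
# Read the data, split it 80/20, then train an SVM on columns 1-4 with column 5 as the class label
query = 'READ "/path/to/iris.data" (separator = ",", header = None) AND \
SPLIT (train = .8, test = .2) AND \
CLASSIFY (predictors = [1,2,3,4], label = 5, algorithm = svm)'
execute(query)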
The table below contains examples of SML reading in data from various datasets, splitting the data into training and testing sets, and performing classification with a specific classification algorithm. To view the tutorials for the CLASSIFY keyword, click the hyperlinks in the Tutorial column. All of these datasets can be downloaded by clicking the hyperlinks in the Acknowledgement column.
It's worth noting that for the Titanic, Chronic Kidney Disease, and U.S. Census datasets the REPLACE keyword is also used; this keyword is covered in the Cleaning Up Data section above.
Dataset | Task | Algorithm | Acknowledgement | Tutorial |
---|---|---|---|---|
Iris | READ + SPLIT + CLASSIFY | SVM | link | notebook |
Spam Detection | READ + SPLIT + CLASSIFY | Naive Bayes | link | notebook |
Titanic | READ + REPLACE + SPLIT + CLASSIFY | Random Forest | link | notebook |
Chronic Kidney Disease | READ + REPLACE + SPLIT + CLASSIFY | Logistic Regression | link | notebook |
U.S. Census | READ + REPLACE + SPLIT + CLASSIFY | Logistic Regression | link | notebook |
Using Clustering Algorithms (Tutorials Still Under Construction)
If you want to run clustering algorithms using SML, you use the CLUSTER keyword. The algorithms currently available for clustering are:
- K-Means Clustering
Consider the following code snippet, which shows the syntax for the CLUSTER keyword:
'CLUSTER (predictors = [1,2,3,4,5,6,7], algorithm = kmeans)'
The syntax is to specify CLUSTER followed by, enclosed in parentheses: the columns of the dataset that you want to use as features and the algorithm that you want to use.
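As with classification, CLUSTER is typically chained after READ and SPLIT. The hedged sketch below clusters the first seven columns with k-means; the execute() entry point and the Seeds dataset path are assumptions for illustration.
from sml import execute  # assumed entry point; adjust to match your SML installation
# Read the data, split it 80/20, then run k-means over columns 1-7
query = 'READ "/path/to/seeds_dataset.txt" (separator = "\s+", header = None) AND \
SPLIT (train = .8, test = .2) AND \
CLUSTER (predictors = [1,2,3,4,5,6,7], algorithm = kmeans)'
execute(query)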
The table below contains examples of SML reading in data from various datasets, splitting the data into training and testing sets, and performing clustering with a specific clustering algorithm. To view the tutorials for the CLUSTER keyword, click the hyperlinks in the Tutorial column. All of these datasets can be downloaded by clicking the hyperlinks in the Acknowledgement column.
Dataset | Task | Algorithm | Acknowledgement | Tutorial
:---: | :---: | :---: | :---: | :---:
Seeds | READ + SPLIT + CLUSTER | K-Means | link | notebook
Wine | READ + SPLIT + CLUSTER | ? | link | notebook
___
Using Regression Algorithms
If you want to run regression algorithms using SML, you use the REGRESS keyword. The algorithms currently available for regression are:
- Simple Linear Regression
- Ridge Regression
- Lasso Regression
- Elastic Net Regression
Consider the following code snippet, which shows the syntax for the REGRESS keyword:
'REGRESS (predictors = [1,2,3,4,5,6,7,8,9], label = 10, algorithm = ridge)'
The syntax is to specify REGRESS followed by, enclosed in parentheses: the columns of the dataset that you want to use as features, the label that you want to predict, and the algorithm that you want to use for regression.
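A full regression query chains REGRESS after READ and SPLIT, as in the hedged sketch below; the execute() entry point and the dataset path are assumptions for illustration.
from sml import execute  # assumed entry point; adjust to match your SML installation
# Read the data, split it 80/20, then fit ridge regression using columns 1-9 to predict column 10
query = 'READ "/path/to/data" (separator = ",", header = None) AND \
SPLIT (train = .8, test = .2) AND \
REGRESS (predictors = [1,2,3,4,5,6,7,8,9], label = 10, algorithm = ridge)'
execute(query)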
The table below contains examples of SML reading in data from various datasets, splitting the data into training and testing sets, and performing regression with a specific regression algorithm. To view the tutorials for the REGRESS keyword, click the hyperlinks in the Tutorial column. All of these datasets can be downloaded by clicking the hyperlinks in the Acknowledgement column.
Dataset | Task | Algorithm | Acknowledgement | Tutorial
:---: | :---: | :---: | :---: | :---:
Auto-MPG | READ + REPLACE + SPLIT + REGRESS | Simple Linear Regression | link | notebook
Computer Hardware | READ + SPLIT + REGRESS | Ridge Regression | link | notebook
Boston Housing | READ + SPLIT + REGRESS | Elastic Net | link | notebook
___
Saving / Loading Models
It's possible to save models and reuse them later. To save a model in SML, you use the SAVE keyword.
Consider the following code snippet, which shows the syntax for the SAVE keyword:
'SAVE "path/to/save/model"'
The syntax is to specify SAVE followed by the path where the model will be saved, enclosed in "".
To use the model again, you use the LOAD keyword. Consider the following code snippet, which shows the syntax for the LOAD keyword:
'LOAD /path/to/load/model'
The syntax is to specify LOAD followed by the path to the saved model.
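The hedged sketch below trains a model, saves it, and later reloads it. It assumes that SAVE can be chained onto a query with AND, as the Task column of the table below suggests, and that SML exposes an execute() entry point; the import name and all paths are placeholders for illustration.
from sml import execute  # assumed entry point; adjust to match your SML installation
# Train a simple linear regression model and persist it (chaining SAVE with AND is assumed here)
save_query = 'READ "/path/to/data" (separator = "\s+", header = None) AND \
REPLACE ("?", "mode") AND SPLIT (train = .8, test = .2) AND \
REGRESS (predictors = [2,3,4,5,6,7,8], label = 1, algorithm = simple) AND \
SAVE "/path/to/save/model"'
execute(save_query)
# Later, reload the saved model without retraining
load_query = 'LOAD /path/to/load/model'
execute(load_query)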
The table below contains an example of SML reading in the Auto-MPG dataset, splitting it into training and testing sets, and performing simple linear regression; the model is then saved and reloaded. To view the tutorial for the SAVE and LOAD keywords, click the hyperlink in the Tutorial column. Again, you can download the Auto-MPG dataset by clicking the hyperlink in the Acknowledgement column.
Dataset | Task | Algorithm | Acknowledgement | Tutorial |
---|---|---|---|---|
Auto-MPG | READ + REPLACE + SPLIT + REGRESS + SAVE + LOAD | Simple Linear Regression | link | notebook |
Plotting Datasets and Metrics of Algorithms
When using SML, it's possible to plot datasets or metrics of algorithms. The syntax is to specify PLOT followed by the model type and the plot types, enclosed in parentheses. Consider the following code snippet, which shows the syntax for the PLOT keyword:
'PLOT (modelType="AUTO", plotTypes="AUTO")'
Here we're telling SML to generate plots, based on the model type (regression, classification, or clustering), that provide the most useful information about the model and dataset.
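The hedged sketch below appends PLOT to a classification query, which the Task column of the table below suggests is the typical usage; the execute() entry point and the Iris dataset path are assumptions for illustration.
from sml import execute  # assumed entry point; adjust to match your SML installation
# Classify the Iris data, then let SML choose suitable plots for a classification model
query = 'READ "/path/to/iris.data" (separator = ",", header = None) AND \
SPLIT (train = .8, test = .2) AND \
CLASSIFY (predictors = [1,2,3,4], label = 5, algorithm = svm) AND \
PLOT (modelType="AUTO", plotTypes="AUTO")'
execute(query)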
The table below contains examples of SML reading in data from various datasets, splitting the data into training and testing sets, and performing a machine learning task with a specific algorithm. To view the tutorials for the PLOT keyword, click the hyperlinks in the Tutorial column. All of these datasets can be downloaded by clicking the hyperlinks in the Acknowledgement column.
Dataset | Task | Algorithm | Acknowledgement | Tutorial |
---|---|---|---|---|
Iris | READ + SPLIT + CLASSIFY + PLOT | SVM | link | notebook |
Auto-MPG | READ + SPLIT + REGRESS + PLOT | Simple Linear Regression | link | notebook |
Seeds | READ + SPLIT + CLUSTER + PLOT | K-Means | link | notebook |