Tutorials

This page contains tutorials in the form of Jupyter notebooks that let you interact with SML and learn how to use it.

Introductory Material

We assume that you have installed SML; if not, please see Instructions for Installing SML before continuing. The examples below use publicly available data, which you can download manually. For simplicity’s sake, you can instead run the following Python script to obtain the same data and begin running the examples.


Reading in Data

When reading in data using SML, use the READ keyword followed by the path to the file, for example:

query = 'READ "/path/to/dataset" '

You can also provide optional arguments by enclosing them in () after the path, for example:

query = 'READ "/path/to/dataset" (sep = ",", header=None) '

The optional arguments for READ are:

  • header - Argument used to specify the header of a dataset. (Defaults to None, meaning no header is present; otherwise pass in a list of variable names.)
  • sep - Argument used to specify what a dataset is delimited by. (For fixed width files, use ‘\s+’)
  • dtypes - Argument used to specify the datatype of each column in a dataset.
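
As a minimal sketch of how the optional arguments fit into a query string, the helper below assembles a READ clause in plain Python. The function name build_read_query is hypothetical and is not part of SML; it only formats text and does not check which argument names SML accepts.

```python
def build_read_query(path, **options):
    """Return a READ clause for `path`, with any optional arguments in ()."""
    clause = f'READ "{path}"'
    if options:
        # Quote string values; leave None, numbers, and lists as written.
        rendered = ", ".join(
            f'{name} = "{value}"' if isinstance(value, str) else f"{name} = {value}"
            for name, value in options.items()
        )
        clause += f" ({rendered})"
    return clause


# Hypothetical usage with the two arguments shown in the example above.
query = build_read_query("/path/to/dataset", sep=",", header=None)
print(query)
```

Building the clause from keyword arguments keeps the quoting of the path and of string values in one place, which is where hand-written queries tend to go wrong.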

The table below contains examples of SML reading in data from various datasets. To view the tutorials that use the READ keyword, click on the hyperlinks in the Tutorial column. All of these datasets can be downloaded by clicking the hyperlinks in the Acknowledgement column.

Dataset | Task | Acknowledgement | Tutorial
:---: | :---: | :---: | :---:
Iris | READ | link | notebook
Auto-MPG | READ | link | notebook
Wine | READ | link | notebook

Separating Keywords

To separate keywords in a query, use the AND keyword, which specifies that another action will be performed. As you’ll find in subsequent sections, you can combine keywords to form complicated queries. For now, consider the following example:

query = (
    r'READ "/path/to/data" (separator = "\s+", header = None) AND '
    'REPLACE (missing = "?", strategy = "mode") AND '
    'SPLIT (train = .8, test = .2, validation = .0) AND '
    'REGRESS (predictors = [2,3,4,5,6,7,8], label = 1, algorithm = simple)'
)

While you haven’t formally been introduced to the REPLACE, SPLIT, and REGRESS keywords yet, this query will perform the following steps:

    1. Read the dataset, delimited by “\s+” with no header.
    2. Replace any values of “?” with the mode of the column.
    3. Split the data using an 80/20 split for training and testing respectively.
    4. Perform regression using columns 2-8 of the dataset as features and column 1 as the label. The algorithm that SML will use is simple linear regression.
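
The chaining described in these steps can be sketched in plain Python. The helpers chain and clauses_of are hypothetical names for illustration and are not part of SML; the REPLACE clause uses the missing/strategy form shown in the Cleaning Up Data section, and the split is naive in that it assumes no quoted value contains " AND ".

```python
import re


def chain(*clauses):
    """Join keyword clauses into one query, delimited by AND."""
    return " AND ".join(clause.strip() for clause in clauses)


def clauses_of(query):
    """Split a query back into clauses (naive: no quoted ' AND ' allowed)."""
    return re.split(r"\s+AND\s+", query.strip())


query = chain(
    r'READ "/path/to/data" (separator = "\s+", header = None)',
    'REPLACE (missing = "?", strategy = "mode")',
    "SPLIT (train = .8, test = .2, validation = .0)",
    "REGRESS (predictors = [2,3,4,5,6,7,8], label = 1, algorithm = simple)",
)

# List the keyword that starts each step, in execution order.
for step, clause in enumerate(clauses_of(query), start=1):
    print(step, clause.split()[0])
```

Keeping each clause as its own string makes it easy to add or remove a step without touching the rest of the query.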

For now, it’s not important to know exactly what every keyword in the query does. What matters is that keywords are delimited by the AND keyword. In the subsequent sections you’ll start to see the AND keyword used.

Cleaning Up Data

When working with datasets, values may be missing, or NaNs, NAs, and other troublesome values may be present. You can replace these values in SML by using the REPLACE keyword. The following example shows the syntax for the REPLACE keyword:

'READ "/path/to/data" (separator = ",", header = None) AND REPLACE (missing = "NaN",  strategy = "mode")'

The REPLACE keyword requires the value that you wish to replace (missing), followed by the metric used to compute its replacement (strategy). In the code snippet above, you read in some hypothetical dataset and then replace any value of ‘NaN’ with the mode of the column. Currently the following metrics have been implemented:

  • mode
  • mean
  • drop column (removes a column if it contains at least one occurrence of the value to replace)
  • minimum
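
To make the metrics concrete, here is a small standard-library sketch of what each one computes for a single column. It is an illustration only and not SML's implementation; the function replace_values and the sample column are assumptions made for this example.

```python
from statistics import mean, mode


def replace_values(column, missing, strategy):
    """Replace every `missing` entry in `column` according to `strategy`."""
    present = [value for value in column if value != missing]
    if strategy == "drop column":
        # The whole column is dropped if the value occurs at least once.
        return None if len(present) < len(column) else list(column)
    # Compute the fill value from the entries that are present.
    fill = {"mode": mode, "mean": mean, "minimum": min}[strategy](present)
    return [fill if value == missing else value for value in column]


horsepower = [130, 165, "?", 150, 130]  # made-up sample values
print(replace_values(horsepower, "?", "mode"))         # [130, 165, 130, 150, 130]
print(replace_values(horsepower, "?", "mean"))         # [130, 165, 143.75, 150, 130]
print(replace_values(horsepower, "?", "drop column"))  # None
```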

Dataset | Task | Acknowledgement | Tutorial
:---: | :---: | :---: | :---:
Titanic | READ + REPLACE | link | notebook
Auto-MPG | READ + REPLACE | link | notebook

Partitioning Datasets

In almost all machine learning situations, it’s useful to split a dataset into training and testing sets. To split data with SML, use the SPLIT keyword. The following example shows the syntax for the SPLIT keyword:

query = 'READ "/path/to/data" (separator = ",", header = None) AND SPLIT (train = 0.8, test = 0.2)'

The SPLIT keyword requires train and test to have numerical values that add up to 1, enclosed in (). Here we read some hypothetical dataset using the READ keyword. We then include the keyword AND, which specifies that an additional command will be used in the query. Finally, the SPLIT keyword specifies that we want 80% of the dataset to be used for training and the other 20% for testing.
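
The arithmetic behind an 80/20 split can be sketched with the standard library. This is not how SML performs the split internally; the function split_rows, the shuffling step, and the seed parameter are assumptions made for illustration.

```python
import random


def split_rows(rows, train, test, seed=0):
    """Shuffle `rows` and partition them by the train/test fractions."""
    if abs(train + test - 1.0) > 1e-9:
        raise ValueError("train and test must add up to 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train)
    return shuffled[:cut], shuffled[cut:]


# 150 rows split 80/20 gives 120 training rows and 30 testing rows.
train_rows, test_rows = split_rows(range(150), train=0.8, test=0.2)
print(len(train_rows), len(test_rows))
```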

The table below contains examples of SML reading in data from various datasets and splitting the data into training and testing sets. To view the tutorials for the SPLIT keyword, click on the hyperlinks in the Tutorial column. All of these datasets can be downloaded by clicking the hyperlinks in the Acknowledgement column.

Dataset | Task | Acknowledgement | Tutorial
:---: | :---: | :---: | :---:
Iris | READ + SPLIT | link | notebook
Auto-MPG | READ + SPLIT | link | notebook
Wine | READ + SPLIT | link | notebook

Using Classification Algorithms

To run a classification algorithm using SML, use the CLASSIFY keyword. The algorithms currently available for classification are:

  • Support Vector Machines (SVM)
  • Naive Bayes
  • Random Forest
  • Logistic Regression
  • K-Nearest Neighbors

Consider the following code snippet with respect to the syntax for the CLASSIFY keyword:

'CLASSIFY (predictors = [1,2,3,4], label = 5, algorithm = svm)'

CLASSIFY takes the following arguments enclosed in (): the columns of the dataset to use as features (predictors), the column to classify (label), and the algorithm to use (algorithm).

The table below contains examples of SML reading in data from various datasets, splitting the data into training and testing sets, and performing classification over the dataset with a classification algorithm. To view the tutorials for the CLASSIFY keyword, click on the hyperlinks in the Tutorial column. All of these datasets can be downloaded by clicking the hyperlinks in the Acknowledgement column.

It’s worth noting that the Titanic, Chronic Kidney Disease, and U.S. Census tutorials also use the REPLACE keyword, which is covered in the Cleaning Up Data section above.

Dataset | Task | Algorithm | Acknowledgement | Tutorial
:---: | :---: | :---: | :---: | :---:
Iris | READ + SPLIT + CLASSIFY | SVM | link | notebook
Spam Detection | READ + SPLIT + CLASSIFY | Naive Bayes | link | notebook
Titanic | READ + REPLACE + SPLIT + CLASSIFY | Random Forest | link | notebook
Chronic Kidney Disease | READ + REPLACE + SPLIT + CLASSIFY | Logistic Regression | link | notebook
U.S. Census | READ + REPLACE + SPLIT + CLASSIFY | Logistic Regression | link | notebook

Using Clustering Algorithms (Tutorials Still Under Construction)

To run a clustering algorithm using SML, use the CLUSTER keyword. The algorithms currently available for clustering are:

  • K-Means Clustering

Consider the following code snippet with respect to the syntax for the CLUSTER keyword:

'CLUSTER (predictors = [1,2,3,4,5,6,7], algorithm = kmeans)'

CLUSTER takes the following arguments enclosed in (): the columns of the dataset to use as features (predictors) and the algorithm to use (algorithm).

The table below contains examples of SML reading in data from various datasets, splitting the data into training and testing sets, and performing clustering over the dataset with a clustering algorithm. To view the tutorials for the CLUSTER keyword, click on the hyperlinks in the Tutorial column. All of these datasets can be downloaded by clicking the hyperlinks in the Acknowledgement column.

Dataset | Task | Algorithm | Acknowledgement | Tutorial
:---: | :---: | :---: | :---: | :---:
Seeds | READ + SPLIT + CLUSTER | K-Means | link | notebook
Wine | READ + SPLIT + CLUSTER | ? | link | notebook

Using Regression Algorithms

To run a regression algorithm using SML, use the REGRESS keyword. The algorithms currently available for regression are:

  • Simple Linear Regression
  • Ridge Regression
  • Lasso Regression
  • Elastic Net Regression

Consider the following code snippet with respect to the syntax for the REGRESS keyword:

'REGRESS (predictors = [1,2,3,4,5,6,7,8,9], label = 10, algorithm = ridge)'

REGRESS takes the following arguments enclosed in (): the columns of the dataset to use as features (predictors), the column to predict (label), and the regression algorithm to use (algorithm).
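
The CLASSIFY, CLUSTER, and REGRESS snippets share the same shape, so one small formatter can produce all three. The sketch below is plain string building under that assumption; model_clause is a hypothetical helper, not an SML function, and the only algorithm identifiers used are the ones that appear in the snippets on this page.

```python
def model_clause(keyword, predictors, algorithm, label=None):
    """Format a CLASSIFY, CLUSTER, or REGRESS clause as a string."""
    args = [f"predictors = [{','.join(map(str, predictors))}]"]
    if label is not None:
        # CLUSTER has no label, so the argument is optional here.
        args.append(f"label = {label}")
    args.append(f"algorithm = {algorithm}")
    return f"{keyword} ({', '.join(args)})"


print(model_clause("CLASSIFY", [1, 2, 3, 4], "svm", label=5))
print(model_clause("CLUSTER", range(1, 8), "kmeans"))
print(model_clause("REGRESS", range(1, 10), "ridge", label=10))
```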

The table below contains examples of SML reading in data from various datasets, splitting the data into training and testing sets, and performing regression over the dataset with a specific regression algorithm. To view the tutorials for the REGRESS keyword, click on the hyperlinks in the Tutorial column. All of these datasets can be downloaded by clicking the hyperlinks in the Acknowledgement column.

Dataset | Task | Algorithm | Acknowledgement | Tutorial
:---: | :---: | :---: | :---: | :---:
Auto-MPG | READ + REPLACE + SPLIT + REGRESS | Simple Linear Regression | link | notebook
Computer Hardware | READ + SPLIT + REGRESS | Ridge Regression | link | notebook
Boston Housing | READ + SPLIT + REGRESS | Elastic Net | link | notebook

Saving / Loading Models

It’s possible to save models and reuse them later. To save a model in SML you use the SAVE keyword. Consider the following code snippet with respect to the syntax for the SAVE keyword:

'SAVE "path/to/save/model"'

The syntax is to specify SAVE followed by the path to save the model enclosed in "".

To use this model again you use the LOAD keyword. Consider the following code snippet with respect to the syntax for the LOAD keyword:

'LOAD /path/to/load/model'

The syntax is to specify LOAD followed by the path to the saved model.
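
Conceptually, SAVE and LOAD form a round trip: a fitted model is written to disk and later read back unchanged. The sketch below illustrates that idea with Python's pickle module and a stand-in dictionary. It is an analogy only; how SML serializes models is not described here, and the model contents are made up.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a fitted model; the values are made up for illustration.
model = {"algorithm": "simple", "coefficients": [0.5, -1.2], "intercept": 3.0}

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "model.pkl"
    path.write_bytes(pickle.dumps(model))       # analogous to SAVE
    restored = pickle.loads(path.read_bytes())  # analogous to LOAD

# The restored object is equal to the one that was saved.
print(restored == model)
```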

The table below contains an example of SML reading in the Auto-MPG dataset, splitting the data into training and testing sets, and performing regression with simple linear regression. The model is saved and then reloaded. To view the tutorial for the SAVE and LOAD keywords, click on the hyperlink in the Tutorial column. Again, you can download the Auto-MPG dataset by clicking on the hyperlink in the Acknowledgement column.

Dataset | Task | Algorithm | Acknowledgement | Tutorial
:---: | :---: | :---: | :---: | :---:
Auto-MPG | READ + REPLACE + SPLIT + REGRESS + SAVE + LOAD | Simple Linear Regression | link | notebook

Plotting Datasets and Metrics of Algorithms

When using SML, it’s possible to plot datasets or metrics of algorithms. The syntax is PLOT followed by the model type and the plot types enclosed in (). Consider the following code snippet with respect to the syntax for the PLOT keyword:

'PLOT (modelType="AUTO", plotTypes="AUTO")'

Here we’re telling SML to generate plots based on the modelType (Regression, Classification, Clustering) that would provide the best information about the model and dataset.

The table below contains examples of SML reading in data from various datasets, splitting the data into training and testing sets, and performing some machine learning task over the dataset with a specific algorithm. To view the tutorials for the PLOT keyword, click on the hyperlinks in the Tutorial column. All of these datasets can be downloaded by clicking the hyperlinks in the Acknowledgement column.

Dataset | Task | Algorithm | Acknowledgement | Tutorial
:---: | :---: | :---: | :---: | :---:
Iris | READ + SPLIT + CLASSIFY + PLOT | SVM | link | notebook
Auto-MPG | READ + SPLIT + REGRESS + PLOT | Simple Linear Regression | link | notebook
Seeds | READ + SPLIT + CLUSTER + PLOT | K-Means | link | notebook