Classes

Class 6 - 12/11

Code from this class can be found here

Class 5 - 12/5

Semantic organization of text

Analyzing text:

text is messy - machine learning algorithms like well defined, fixed length inputs and outputs

Bag of words model:

The bag-of-words model is a way of representing text data when modeling text with machine learning algorithms. We will use it for feature extraction on text.

Bag of words (BOW) is a representation of text that describes the occurrence of words within a document.

It involves two things:

a vocabulary of known words
a measure of the presence of those known words

It’s called bag of words because we discard the structure of the text and focus only on whether the words appear in the document and how frequently, not where they occur or in what order.

There are a few different ways to approach bag of words:

Count Occurrence

Let’s look at an example:

He drinks a whiskey drink
He drinks a vodka drink
He drinks a lager drink
He drinks a cider drink

We can look at each line as a “document”, and all four lines together as the corpus. First - what is the unique vocabulary of this corpus? (ignoring case and punctuation)

he
drinks
a
whiskey
drink
vodka
lager
cider

This is a unique vocabulary of 8 words out of a corpus of 20 words

Next we create document vectors - the goal is to turn each document into a vector that we can use as input or output for a model. We have a vocabulary of 8 words, so our vector length is 8. Each position holds the count of one vocabulary word, in the order listed above. Thus, our first document becomes:

[1, 1, 1, 1, 1, 0, 0, 0]
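
The counting steps above can be sketched in a few lines of plain Python. This is only a rough illustration, not the class notebook: the helper names (tokenize, build_vocabulary, count_vector) are made up for this example, and the vocabulary is kept in order of first appearance so it matches the list above.

```python
import re

documents = [
    "He drinks a whiskey drink",
    "He drinks a vodka drink",
    "He drinks a lager drink",
    "He drinks a cider drink",
]

def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())

def build_vocabulary(docs):
    """Collect the unique words of the corpus in order of first appearance."""
    vocabulary = []
    for doc in docs:
        for word in tokenize(doc):
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary

def count_vector(doc, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    words = tokenize(doc)
    return [words.count(term) for term in vocabulary]

vocabulary = build_vocabulary(documents)
vectors = [count_vector(doc, vocabulary) for doc in documents]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Every vector has the same length as the vocabulary, which is what gives the model its fixed length input.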

For a large corpus, we get something called a sparse vector - the vocabulary is huge, so each document vector is mostly zeros. Sparse vectors take more memory and computation to work with. We can shrink the vocabulary by ignoring stop words, punctuation and case, fixing misspellings, and stemming (playing -> play). The problem with counting occurrences is that high frequency words can dominate the model and cause bias. Think about it - less frequent words might be more important.

TF-IDF

TF-IDF takes another approach - the idea that high frequency words may not provide much information gain. In other words, rare words contribute more weight to the model.

Term Frequency: a scoring of the frequency of the word in the current document. Inverse Document Frequency: a scoring of how rare the word is across documents.

The scores are a weighting, where not all words are equally important or interesting.

Term Frequency (TF): a scoring of the frequency of the word in the current document. Since every document is different in length, a term may appear many more times in long documents than in shorter ones. The term frequency is therefore often divided by the document length to normalize it.

      number of times term t appears in the document
tf =   ------------------------------------------
          total number of terms in the document

Inverse Document Frequency (IDF): a scoring of how rare the word is across documents. The rarer the term, the higher the IDF score.

                 total number of documents
idf = log_e(-----------------------------------)
            number of documents with term in it

tfidf is the multiplication of these two factors

tfidf = tf * idf
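
Here is a minimal sketch of the three formulas above, run on the drinks example from the bag of words section. The function names are made up for this example, and it assumes the term appears in at least one document (otherwise the idf division fails). Libraries such as scikit-learn use slightly different smoothed variants, so their numbers will not match these exactly.

```python
import math

# Each document is already tokenized into lowercase words.
documents = [
    "he drinks a whiskey drink".split(),
    "he drinks a vodka drink".split(),
    "he drinks a lager drink".split(),
    "he drinks a cider drink".split(),
]

def term_frequency(term, doc_words):
    """Times the term appears in the document / total terms in the document."""
    return doc_words.count(term) / len(doc_words)

def inverse_document_frequency(term, all_docs):
    """Natural log of (total documents / documents containing the term)."""
    containing = sum(1 for words in all_docs if term in words)
    return math.log(len(all_docs) / containing)

def tfidf(term, doc_words, all_docs):
    """tfidf = tf * idf"""
    return term_frequency(term, doc_words) * inverse_document_frequency(term, all_docs)

# "he" is in every document, so its idf is log(4/4) = 0 and it carries no weight.
print(tfidf("he", documents[0], documents))
# "whiskey" is in only one document, so it scores higher.
print(tfidf("whiskey", documents[0], documents))
```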

Doc2vec

Distributed Representations of Sentences and Documents - introduced in 2014 https://arxiv.org/abs/1405.4053. We will use it to perform feature extraction on a corpus. Unlike bag of words models, the idea with Doc2Vec is to maintain the ordering and semantics of the words. This should give us better features than tfidf.

The idea for doc2vec started with Word2Vec. Word2vec is a three layer neural network with one input layer, one hidden layer and one output layer. The input layer corresponds to signals for the context (surrounding words) and the output layer corresponds to signals for the predicted target word. As the training procedure repeats this process over a large number of sentences (or phrases), the weights “stabilize”. These weights are then used as the vectorized representations of words.
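
To make the context -> target idea concrete, here is a small sketch of how the training examples could be built from a sentence. It does not train a network; it only shows which surrounding words would be fed in and which word the network would be asked to predict. The function name and the window parameter are assumptions for this illustration.

```python
def context_target_pairs(words, window=2):
    """Return (context_words, target_word) pairs for one tokenized sentence.

    The context is up to `window` words on each side of the target.
    """
    pairs = []
    for i, target in enumerate(words):
        start = max(0, i - window)
        context = words[start:i] + words[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

sentence = "he drinks a whiskey drink".split()
pairs = context_target_pairs(sentence, window=1)

for context, target in pairs:
    print(context, "->", target)
```

Unlike the bag of words vectors, these pairs depend on word order, which is why the learned weights end up carrying information about how words are used.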

Additional Resources:

Class 4 - 11/20

Feature extraction and exploration methodologies

What is feature extraction?

In a multi-layer neural network used for classification, the final layer is typically a softmax layer
It takes a list of numbers from the network and “squashes” them into probabilities that add up to 1
The values from the layer before this final layer are known as the logits, and can serve as a feature vector
This is the “fingerprint” of the image, and can be used to compare images
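
A rough sketch of both ideas: softmax turning raw numbers into probabilities, and comparing two feature vectors. Cosine similarity is just one common way to compare vectors and is an assumption here, not necessarily what was used in class. The example numbers are made up.

```python
import math

def softmax(values):
    """Squash a list of numbers into probabilities that sum to 1."""
    highest = max(values)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cosine_similarity(a, b):
    """Compare two non-zero feature vectors: 1 means same direction, 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

probabilities = softmax([2.0, 1.0, 0.1])
print(probabilities, sum(probabilities))

# Hypothetical feature vectors for three images.
image_a = [0.9, 0.1, 0.3]
image_b = [0.8, 0.2, 0.4]
image_c = [0.1, 0.9, 0.0]
print(cosine_similarity(image_a, image_b))  # similar "fingerprints"
print(cosine_similarity(image_a, image_c))  # less similar
```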

KMeans Clustering

KMeans clustering is an unsupervised machine learning method that allows us to “automatically” find clusters of similar data points. It requires that we know how many clusters we want ahead of time (this is our “K” value).
We can determine the optimal “K” value using the elbow method - does not necessarily apply to the work in this class
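
For the curious, here is a bare-bones sketch of how KMeans works under the hood: assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat. The points are made up, the starting centroids are picked at random from the data (the seed parameter is an assumption to keep the run repeatable), and a real project would more likely use a library implementation.

```python
import math
import random

def kmeans(points, k, iterations=10, seed=0):
    """Cluster points into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k random data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: label each point with the index of its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(point, centroids[c]))
            for point in points
        ]
        # Update step: move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```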

UMAP

UMAP is a newer dimensionality reduction method that is reported to outperform t-SNE - the previous SOTA for dimensionality reduction and visualization
Dimensionality reduction is about learning the “latent” features in your data
A lot of the features are redundant or can be condensed
For example - mnist has 784 dimensions, we shouldn’t need this many numbers to represent simple digits

Additional resources:

Good in-depth explanations of feature extraction here and here
How KMeans clustering works
A very mathy discussion about UMAP
UMAP
If you’re curious about t-SNE
Rasterfairy

Class 4 - supplemental

Multilayer Perceptrons and Gradient Descent

Plain vanilla aka multilayer perceptron

We are going to use tensorflow 2.0

mnist : 28 x 28 numbers - 784 total numbers

each one holds the grayscale value of that pixel (0 - 1)

general structure of the network:

input -> hidden layer -> output

so how does this work?

when the network sees some specific features, certain parts of it activate in response

just like our perceptron, we have weights for each connection

we take the weighted sum, and then calculate our activation. We want our activations to be between 0 and 1. In order to achieve this, we use the sigmoid function : 1 / (1 + e^-x). So the activation of the layer is a measure of how positive the weighted sum is.
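
As a small sketch, here is the activation of a single neuron computed exactly as described: a weighted sum passed through the sigmoid. The bias term is an assumption added for completeness (it is not mentioned above), and the input and weight values are made up.

```python
import math

def sigmoid(x):
    """1 / (1 + e^-x): squashes any number into the range 0 to 1."""
    return 1 / (1 + math.exp(-x))

def neuron_activation(inputs, weights, bias=0.0):
    """Weighted sum of the inputs, then sigmoid."""
    weighted_sum = sum(i * w for i, w in zip(inputs, weights)) + bias
    return sigmoid(weighted_sum)

print(sigmoid(0))                                    # exactly halfway
print(neuron_activation([1.0, 0.5], [2.0, -4.0]))    # weighted sum is 0
print(neuron_activation([1.0, 0.5], [3.0, 1.0]))     # positive sum, closer to 1
print(neuron_activation([1.0, 0.5], [-3.0, -1.0]))   # negative sum, closer to 0
```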

Stochastic Gradient Descent:

an optimization algorithm used to train machine learning algorithms, most notably artificial neural networks used in deep learning
gradient refers to the calculation of an error gradient, or the “slope of error” and descent refers to moving down along that slope towards some minimum level of error
we define some ‘cost function’ - a value that represents how “wrong” the network is with respect to its weights
instead of thinking of a massive function with tons of variables, we can consider a function of one single weight, C(w), shaped like a parabola that opens upward
how do we find a weight value such that it is a minimum of this function? That’s pretty easy for a function with just a few variables but with 40k variables, it’s a very difficult problem
instead, we start at any old point and determine which direction we should move
we calculate the slope of the function at that point - the downhill direction is the one that lowers the cost
we then take a small step in that direction after each batch
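
The single-weight picture can be sketched like this. The cost function C(w) = (w - 3)^2 is made up for the example (its minimum is at w = 3), and the learning_rate and steps values are arbitrary assumptions.

```python
def cost(w):
    """A toy cost function C(w): a parabola with its minimum at w = 3."""
    return (w - 3) ** 2

def slope(w):
    """Derivative of the cost with respect to w."""
    return 2 * (w - 3)

def gradient_descent(start, learning_rate=0.1, steps=100):
    """Start anywhere, then repeatedly take a small step downhill."""
    w = start
    for _ in range(steps):
        w -= learning_rate * slope(w)  # move against the slope
    return w

best_w = gradient_descent(start=-5.0)
print(best_w, cost(best_w))
```

A positive slope means the cost rises to the right, so we step left, and vice versa. That is why the update subtracts the slope.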

Some terminology:

sample - a single row of input data that is fed into the algorithm, paired with an expected output that the prediction is compared against. a training set consists of many samples
batch size - a hyperparameter that defines the number of samples to work through before updating the internal model parameters
epoch - hyperparameter that defines the number of times the algorithm will work through an entire dataset
SGD - stochastic gradient descent, described above
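
Here is a toy sketch of how these terms fit together in a training loop. The model is a single weight w that should learn y = 2x, the data is made up, and the function name and default values are assumptions for illustration only.

```python
import random

def train(samples, batch_size, epochs, learning_rate=0.01, seed=0):
    """Fit y = w * x with mini-batch SGD; returns (w, number_of_updates)."""
    rng = random.Random(seed)
    w = 0.0
    updates = 0
    for _ in range(epochs):                      # one epoch = one pass over all samples
        shuffled = samples[:]
        rng.shuffle(shuffled)                    # the "stochastic" part
        for start in range(0, len(shuffled), batch_size):
            batch = shuffled[start:start + batch_size]
            # Average slope of the squared error (w*x - y)^2 over the batch.
            gradient = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= learning_rate * gradient        # one update per batch
            updates += 1
    return w, updates

# Each sample is an (input, expected output) pair.
samples = [(float(x), 2.0 * x) for x in range(1, 9)]
weight, updates = train(samples, batch_size=4, epochs=50)
print(weight, updates)
```

With 8 samples and a batch size of 4 there are 2 updates per epoch, so 50 epochs gives 100 updates in total.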

Additional resources:

But what is a Neural Network? - Deep learning, chapter 1
Gradient descent, how neural networks learn - Deep learning, chapter 2
NOC 10.4: Neural Networks: Multilayer Perceptron Part 1 - The Nature of Code
NOC 10.5: Neural Networks: Multilayer Perceptron Part 2 - The Nature of Code
The Building Blocks of Interpretability
Introduction to Multilayer Neural Networks with TensorFlow’s Keras API
ml4a Demo: MNIST input - this should clarify the difference between input and first layer
ml4a Demo: MNIST forward pass - try refreshing and look at the probability update
Jay Alammar - A Visual And Interactive Look at Basic Neural Network Math

Perceptrons

Earlier we covered machine learning methods such as linear regression and KNN.

Supervised Machine Learning is all about ‘learning’ a function given a training set of examples.

Machine learning methods should derive a function that can generalize well to inputs not in the training set, since then we can actually apply it to inputs for which we do not have an output.

In this class we are going to cover the simplest model of an artificial neural network out there.

What is an artificial neural network (ANN)? What is a perceptron?

These all started as attempts to mathematically model the neuron. The question generally was - what kind of computational systems can be made that are inspired by the biological neuron?

The perceptron is a form of supervised learning that can learn to separate two classes, as long as the data is linearly separable (a straight line, or a plane in higher dimensions, can divide the classes).
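
A minimal perceptron sketch, trained on the logical AND gate, which is linearly separable. The learning rate, number of epochs and the choice of AND as the dataset are assumptions for this example, not the exercise from class.

```python
def predict(inputs, weights, bias):
    """Fire (1) if the weighted sum is positive, otherwise stay off (0)."""
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1 if total > 0 else 0

def train_perceptron(data, learning_rate=1.0, epochs=20):
    """Nudge the weights toward the correct answer every time a guess is wrong."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in data:
            error = target - predict(inputs, weights, bias)  # -1, 0 or 1
            weights = [w + learning_rate * error * i for w, i in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias

# Training set: (inputs, expected output) for logical AND.
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_gate)

for inputs, target in and_gate:
    print(inputs, "->", predict(inputs, weights, bias), "expected", target)
```

The same code would never settle on a dataset like XOR, because no single straight line separates its two classes.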

Further material:

Watch:

Read:

Datasets and Scraping

What are datasets? How are they used?

Image Datasets:

Recommendation Systems:

Music:

Million Song dataset

More here

Other interesting datasets:

Scraping datasets

Can’t find a way to download what you want? First try seeing if anyone else has downloaded it yet. Is there a download functionality? Try contacting someone and seeing if they’ll share the data.

Is scraping legal?

You still need to keep copyright in mind - you could violate copyright by redisplaying content.

Class 3 - 11/13

Vectors and classification

What is a vector?

A vector has magnitude
A vector has direction

A vector’s magnitude is the length - the size. A vector is a set of instructions on how to get from the tail to the tip

Why vectors?

Vectors can be used for physics. Vectors can explain movement, static forces, and so many other things. Vectors can be used to describe position in a space. But when we talk about space we often think of 2d or 3d. Vectors can exist in many more dimensions. And sometimes it makes sense to pair a bunch of numbers together! There are a bunch of operations we want to do on all of the numbers at once, so we consider them together.

Vector math we reviewed:

Add/sub
mult/div
Magnitude
Normalize
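
The operations above can be sketched with plain tuples. The function names are made up for this example, they work for any number of dimensions, and normalize assumes the vector is not all zeros.

```python
import math

def add(a, b):
    """Add two vectors component by component."""
    return tuple(x + y for x, y in zip(a, b))

def sub(a, b):
    """Subtract vector b from vector a."""
    return tuple(x - y for x, y in zip(a, b))

def mult(v, scalar):
    """Scale a vector by a number."""
    return tuple(x * scalar for x in v)

def div(v, scalar):
    """Shrink a vector by a number."""
    return tuple(x / scalar for x in v)

def magnitude(v):
    """Length of the vector (Pythagorean theorem)."""
    return math.sqrt(sum(x * x for x in v))

def normalize(v):
    """Same direction, but with a magnitude of 1."""
    return div(v, magnitude(v))

v = (3.0, 4.0)
print(add(v, (1.0, 1.0)))
print(sub(v, (1.0, 1.0)))
print(mult(v, 2))
print(magnitude(v))
print(normalize(v))
```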

KNN

“Tell me who your neighbor is, and I’ll tell you who you are”
“K-Nearest Neighbor” is a machine learning algorithm used for both classification and regression. It is a “lazy learning” algorithm because there really is no learning at all. New data points are classified / valued according to a distance comparison with every data point in a training set. (source: https://github.com/nature-of-code/NOC-S17-2-Intelligence-Learning/blob/master/week3-classification-regression/README.md)
Nice interactive demo here a bit further down the page
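
A short sketch of KNN classification, separate from the class notebook. The training points and labels are made up, and straight-line (Euclidean) distance with a majority vote is assumed.

```python
import math
from collections import Counter

def knn_classify(query, training_set, k=3):
    """Label a new point by majority vote among its k nearest neighbors."""
    # "Lazy learning": no training step, just measure the distance to every point.
    neighbors = sorted(training_set, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Each entry is (position vector, label).
training_set = [
    ((1.0, 1.0), "a"), ((1.0, 2.0), "a"), ((2.0, 1.0), "a"),
    ((8.0, 8.0), "b"), ((8.0, 9.0), "b"), ((9.0, 8.0), "b"),
]

print(knn_classify((1.5, 1.5), training_set))
print(knn_classify((8.5, 8.5), training_set))
```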

The notebook from the class should be listed on the “notebooks” page.

Class 2 - 11/6

What is scripting?

A scripting or script language is a programming language for a special run-time environment that automates the execution of tasks; the tasks could alternatively be executed one-by-one by a human operator. Scripting languages are often interpreted. - Wikipedia

Python is a scripting language that we will be using in this class. It is a highly versatile, high-level, general-purpose programming language that is quickly becoming one of the most popular languages.

Using the terminal, we installed Homebrew

First, make sure you have Xcode command-line tools installed. Note - this is not the Xcode editor. These are separate tools that run from your command line. Run the commands below (without the $ at the beginning of the line):

$ xcode-select --install

Next, we install Homebrew by running the following command:

$ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Run the following command once you’re done to ensure Homebrew is installed and working properly:

$ brew doctor

Next, we install Python

$ brew install python

Check your install with the following command

$ python3 --version

It should report a version of Python 3 or higher

Next we can use pip, Python’s package manager, to install Jupyter and other dependencies:

$ pip3 install jupyterlab
$ pip3 install matplotlib

We then run the command below to open a jupyter notebook. Make sure you are in the correct directory when you run this command.

$ jupyter notebook

You can review the in class exercises we did here:

Intro to Python

Some recommended resources:

learnpython.org
Codecademy
Freecodecamp
Learn Python 3 the Hard Way (book)
A good post on whitespace for those who are curious
Another good post on scoping for index variables - or why “fruit” persisted outside the scope of our for loop

Class 1 - 10/30

Introductory class. We began by having everyone answer the following questions:

Why are you taking this class?
What is your background and experience? What are you interested in?
What did you do this summer?

This class is about using machine learning to explore archives.

The data problem - we have more information now than ever before, and it is growing exponentially.

Cultural institutions are digitizing and open sourcing massive collections. Beyond this, what is a dataset? Perhaps shopping inventories are datasets? Movies are a dataset. Google photos is a dataset.

There are growing datasets that we want to be able to search on a semantic basis.

At the same time, people expect better and better digital experiences. Companies like Google, Uber, etc are spending tons on design to create the best experiences they can. They are setting the bar for online experiences.

Design is about sorting/organizing/making information digestible. We are designing the experience of exploring a dataset using machine learning as a design tool.

A main concept for the class is using machine learning and AI to augment or facilitate, not to create.

We looked at the following references (the first six do not use ML):

And more listed here

We also learned how to use the command line to do simple tasks

Quick guide here

Important commands to be familiar with:

pwd
ls
cd
mkdir
rmdir
rm
mv
cp