Classes
Class 6 - 12/11
Code from this class can be found here
Class 5 - 12/5
Semantic organization of text
Analyzing text:
- Text is messy - machine learning algorithms like well-defined, fixed-length inputs and outputs
Bag of words model:
- The bag-of-words model is a way of representing text data when modeling text with machine learning algorithms. We will use it for feature extraction on text.
Bag of words (BOW) is a representation of text that describes the occurrence of words within a document
It measures:
- the vocabulary of words in the document
- the measure of presence of known words
It’s called bag of words because we discard the structure of the text and focus only on whether the words appear in the document and how frequently, not on where they occur or in what order
There are a few different ways to approach bag of words:
Count Occurrence
Let’s look at an example:
- He drinks a whiskey drink
- He drinks a vodka drink
- He drinks a lager drink
- He drinks a cider drink
We can look at each line as a “document”, and all four lines together as the corpus. First - what is the unique vocabulary of this corpus (ignoring case and punctuation)?
- he
- drinks
- a
- whiskey
- drink
- vodka
- lager
- cider
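This is a unique vocabulary of 8 words out of a corpus of 20 words
Next we create document vectors - the goal is to turn each document into a vector that we can use as input or output for a model. We have a vocabulary of 8 words, so our vector length is 8, with one slot per vocabulary word in the order listed above. Each slot holds the number of times that word occurs in the document. Thus, our first document becomes: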
[1, 1, 1, 1, 1, 0, 0, 0]
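The counting step can be sketched in a few lines of plain Python. This is a minimal illustration, not the code from class: the helper names (`tokenize`, `build_vocabulary`, `count_vector`) are made up for this example, and the vocabulary is kept in order of first appearance so the vectors line up with the list above.

```python
import string

documents = [
    "He drinks a whiskey drink",
    "He drinks a vodka drink",
    "He drinks a lager drink",
    "He drinks a cider drink",
]

def tokenize(text):
    """Lowercase the text, strip punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

def build_vocabulary(docs):
    """Collect the unique words in order of first appearance."""
    vocabulary = []
    for doc in docs:
        for word in tokenize(doc):
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary

def count_vector(doc, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    tokens = tokenize(doc)
    return [tokens.count(word) for word in vocabulary]

vocabulary = build_vocabulary(documents)
vectors = [count_vector(doc, vocabulary) for doc in documents]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Every document gets a vector of the same length (8), no matter how long the document is, which is what makes the representation usable as a fixed-length model input.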
For a large corpus, the vocabulary is huge, so each document vector contains many zeros. This is called a sparse vector, and sparse vectors are harder to compute with. We can shrink the vocabulary by ignoring stop words, punctuation, and case, fixing misspellings, and stemming (playing -> play). The problem with raw counts is that high-frequency words can dominate the model and cause bias. Think about it - less frequent words might be more important
TF-IDF
TF-IDF takes another approach - high-frequency words may not provide much information gain. In other words, rarer words are given more weight in the model.
Term Frequency: a scoring of the frequency of the word in the current document. Inverse Document Frequency: a scoring of how rare the word is across documents.
The scores act as a weighting, since not all words are equally important or interesting.
Term Frequency (TF): a scoring of the frequency of the word in the current document. Since documents differ in length, a term may appear many more times in long documents than in shorter ones. The term frequency is therefore often divided by the document length to normalize.
$$\mathrm{tf} = \frac{\text{number of times term } t \text{ appears in the document}}{\text{total number of terms in the document}}$$
Inverse Document Frequency (IDF): a scoring of how rare the word is across documents. The rarer the term, the higher the IDF score.
$$\mathrm{idf} = \log_e\left(\frac{\text{total number of documents}}{\text{number of documents with term } t \text{ in it}}\right)$$
tfidf is the product of these two factors
tfidf = tf * idf
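Here is a small sketch of these formulas applied to the drinks example from the count section. The function names are hypothetical, and the code assumes every term we ask about appears in at least one document (otherwise the IDF would divide by zero). Library implementations often use smoothed variants of these formulas, so their numbers may differ from this direct translation.

```python
import math

# Each document is already tokenized (lowercased, split on whitespace).
docs = [
    "he drinks a whiskey drink".split(),
    "he drinks a vodka drink".split(),
    "he drinks a lager drink".split(),
    "he drinks a cider drink".split(),
]

def term_frequency(term, doc_tokens):
    """Occurrences of the term divided by the document length."""
    return doc_tokens.count(term) / len(doc_tokens)

def inverse_document_frequency(term, all_docs):
    """Natural log of (total documents / documents containing the term)."""
    containing = sum(1 for tokens in all_docs if term in tokens)
    return math.log(len(all_docs) / containing)

def tfidf(term, doc_tokens, all_docs):
    """Product of the two factors above."""
    return term_frequency(term, doc_tokens) * inverse_document_frequency(term, all_docs)

# "whiskey" appears in only one of the four documents, so it scores high.
print(tfidf("whiskey", docs[0], docs))
# "he" appears in every document, so its IDF (and TF-IDF) is zero.
print(tfidf("he", docs[0], docs))
```

The words shared by every document ("he", "drinks", "a", "drink") score zero, while the one word that distinguishes each document carries all of the weight.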
Doc2vec
Doc2vec was introduced in 2014 in the paper Distributed Representations of Sentences and Documents (https://arxiv.org/abs/1405.4053). We will use it to perform feature extraction on a corpus. Unlike bag of words models, the idea with Doc2Vec is to maintain the ordering and semantics of the words. This should give us better features than tfidf.
The idea for doc2vec started with Word2Vec. Word2vec is a three-layer neural network with one input layer, one hidden layer, and one output layer. The input layer corresponds to signals for the context (surrounding words) and the output layer corresponds to signals for the predicted target word. As the training procedure repeats this process over a large number of sentences (or phrases), the weights “stabilize”. These weights are then used as the vectorized representations of words.
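To make the "context in, target word out" idea concrete, the sketch below builds the (context, target) training pairs that such a network would see. It only prepares the pairs and does not train anything. The function name and the `window` parameter are assumptions for illustration, not part of any library.

```python
def context_target_pairs(tokens, window=2):
    """Pair each word (the target) with the words around it (the context).

    `window` is how many words to take on each side of the target.
    """
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        context = tokens[start:i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

sentence = "he drinks a whiskey drink".split()
pairs = context_target_pairs(sentence, window=1)

for context, target in pairs:
    print(context, "->", target)
```

During training, the network is shown each context and asked to predict the matching target, and the weights it learns along the way become the word vectors.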
Additional Resources:
- Reading and Writing Electronic Text by Allison Parrish (syllabus)
- Document Embedding Techniques
- An Introduction to Bag-of-Words in NLP
- 3 basic approaches in Bag of Words which are better than Word Embeddings
- A Gentle Introduction to the Bag-of-Words Model
- How does doc2vec represent feature vector of a document?
- A gentle introduction to Doc2Vec
- Doc2vec tutorial
- Distributed Representations of Sentences and Documents
- Doc2Vec to wikipedia articles
- Gensim: Core Concepts
- Gensim: Doc2Vec
Class 4 - 11/20
Feature extraction and exploration methodologies
What is feature extraction?
- In a multi-layer neural network used for classification, the final layer is typically a softmax
- It takes a list of numbers from the network and “squashes” them into probabilities that sum to 1
- The numbers going into the softmax are known as the logits; these (or the activations of the layer just before them) serve as the feature vector
- This is the “fingerprint” of the image, and can be used to compare images
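A rough sketch of both ideas in plain Python: a softmax that squashes a list of numbers into probabilities, and a cosine similarity that compares two feature vectors. The logits and feature vectors below are made-up numbers, not output from a real network, and cosine similarity is just one common way to compare fingerprints.

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the largest value avoids overflow and does not change the result.
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cosine_similarity(a, b):
    """Near 1.0 means the vectors point the same way; 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

probabilities = softmax([2.0, 1.0, 0.1])
print(probabilities)

# Hypothetical feature vectors for three images.
image_a = [0.9, 0.1, 0.4]
image_b = [0.8, 0.2, 0.5]
image_c = [0.1, 0.9, 0.0]
print(cosine_similarity(image_a, image_b))  # similar images
print(cosine_similarity(image_a, image_c))  # dissimilar images
```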
KMeans Clustering
- KMeans clustering is an unsupervised machine learning method that allows us to “automatically” find clusters of similar data points. It requires that we know how many clusters we want ahead of time (this is our “K” value).
- We can estimate a good “K” value using the elbow method, although this does not necessarily apply to the work in this class
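The algorithm itself is a loop of two steps: assign every point to its nearest centroid, then move each centroid to the average of its points. Below is a bare-bones sketch under simplifying assumptions: the first K points are used as the starting centroids (real implementations usually pick them randomly) and the loop runs for a fixed number of iterations. The data points are invented for the example.

```python
import math

def kmeans(points, k, iterations=10):
    """Cluster points (tuples of numbers) into k groups."""
    # Assumption: start from the first k points so the result is repeatable.
    centroids = list(points[:k])
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assignments

points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 0.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, k=2)
print(centroids)
print(assignments)
```

With feature vectors in place of these 2D points, the same loop groups similar images together.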
UMAP
- UMAP is a newer dimensionality reduction method that is typically faster than t-SNE (the previous state of the art for dimensionality reduction and visualization) and often preserves more of the global structure of the data
- Dimensionality reduction is about learning the “latent” features in your data
- A lot of the features are redundant or can be condensed
- For example, MNIST has 784 dimensions (one per pixel) - we shouldn’t need this many numbers to represent simple digits
Additional resources:
- Good in-depth explanations of feature extraction here and here
- How KMeans clustering works
- A very mathy discussion about UMAP
- UMAP
- If you’re curious about t-SNE
- Rasterfairy
Class 4 - supplemental
Multilayer Perceptrons and Gradient Descent
Plain vanilla aka multilayer perceptron
We are going to use TensorFlow 2.0
MNIST: each image is 28 x 28 pixels - 784 numbers in total
each number holds the grayscale value of that pixel (0 - 1)
general structure of the network:
input -> hidden layer -> output
so how does this work?
when the network sees some specific features, certain parts of it activate in response
just like our perceptron, we have weights for each connection
we take the weighted sum, and then calculate our activation. We want our activations to be between 0 and 1. In order to achieve this, we use the sigmoid function: 1 / (1 + e^-x), where x is the weighted sum. So the activation of a neuron is a measure of how positive its weighted sum is.
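As a rough sketch, here is one layer of that calculation in plain Python (in class we use TensorFlow, which does this for us). The weights and inputs are made-up numbers, and the bias term added to each weighted sum is an assumption that these notes do not cover.

```python
import math

def sigmoid(x):
    """Squash any number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_activations(inputs, weights, biases):
    """Compute the activation of every neuron in one layer.

    `weights` holds one list of weights per neuron, one weight per input.
    """
    activations = []
    for neuron_weights, bias in zip(weights, biases):
        weighted_sum = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        activations.append(sigmoid(weighted_sum))
    return activations

inputs = [0.0, 0.5, 1.0]            # e.g. three grayscale pixel values
weights = [[0.2, -0.4, 0.6],        # weights into hidden neuron 0
           [-1.0, 0.5, 0.3]]        # weights into hidden neuron 1
biases = [0.0, 0.1]                 # assumption: one bias per neuron

hidden = layer_activations(inputs, weights, biases)
print(hidden)
```

For MNIST, `inputs` would hold 784 values and each hidden neuron would have 784 weights.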
Stochastic Gradient Descent:
- an optimization algorithm used to train machine learning models, most notably the artificial neural networks used in deep learning
- gradient refers to the calculation of an error gradient, or the “slope of error”, and descent refers to moving down along that slope towards some minimum level of error
- we define some ‘cost function’ - a value that represents how “wrong” the network is with respect to its weights
- instead of thinking of a massive function with tons of variables, we can consider a function of one single weight, C(w), shaped like a parabola that opens upward
- how do we find a weight value that minimizes this function? That’s pretty easy for a function with just a few variables, but with 40k variables it’s a very difficult problem
- instead, we start at any old point and determine which direction we should move
- we calculate the slope of the function at that point and use it to determine which direction is downhill (opposite to the slope)
- we then take a bunch of small steps in that direction, one per batch
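The single-weight picture can be sketched directly. The cost function below, C(w) = (w - 3)^2, is an invented example with its minimum at w = 3, and `learning_rate` (the size of each step) is a name introduced for this sketch. This is plain gradient descent on one known function; the "stochastic" part comes from estimating the slope on a batch of samples instead.

```python
def cost(w):
    """A made-up cost function: a parabola with its minimum at w = 3."""
    return (w - 3.0) ** 2

def slope(w):
    """Derivative of the cost function at w."""
    return 2.0 * (w - 3.0)

def gradient_descent(start, learning_rate=0.1, steps=50):
    """Repeatedly take a small step downhill from the starting weight."""
    w = start
    for _ in range(steps):
        # Move against the slope: positive slope means go left, negative means go right.
        w -= learning_rate * slope(w)
    return w

w_min = gradient_descent(start=-4.0)
print(w_min, cost(w_min))
```

No matter where we start, the steps shrink as the slope flattens out near the bottom of the curve.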
Some terminology:
- sample - a single row of data: the inputs fed into the algorithm plus the expected output that the prediction is compared against. A training set consists of many samples
- batch size - a hyperparameter that defines the number of samples to work through before updating the internal model parameters
- epoch - one full pass through the entire training dataset; the number of epochs is a hyperparameter
- SGD - stochastic gradient descent, described above
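The way these terms fit together is easiest to see as a loop. This sketch uses placeholder data (the numbers 0 to 9 stand in for samples) and only counts the weight updates rather than performing them. The helper name `iterate_batches` is hypothetical.

```python
import random

def iterate_batches(samples, batch_size):
    """Yield the samples in consecutive chunks of at most batch_size."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

samples = list(range(10))  # stand-in for 10 training samples
batch_size = 4
epochs = 3

rng = random.Random(0)     # fixed seed so the shuffling is repeatable
updates = 0
for epoch in range(epochs):
    rng.shuffle(samples)   # visit the samples in a new order each epoch
    for batch in iterate_batches(samples, batch_size):
        # A real training loop would compute the error on this batch
        # and adjust the weights here.
        updates += 1

print(updates)
```

With 10 samples and a batch size of 4, each epoch has three batches (4, 4, and 2 samples), so three epochs produce nine weight updates.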
Additional resources:
- Gradient descent, how neural networks learn - Deep learning, chapter 2
- NOC 10.4: Neural Networks: Multilayer Perceptron Part 1 - The Nature of Code
- NOC 10.5: Neural Networks: Multilayer Perceptron Part 2 - The Nature of Code
- Introduction to Multilayer Neural Networks with TensorFlow’s Keras API
- ml4a Demo: MNIST input - this should clarify the difference between input and first layer
- ml4a Demo: MNIST forward pass - try refreshing and look at the probability update
- Jay Alammar - A Visual And Interactive Look at Basic Neural Network Math
Perceptrons
Earlier we covered machine learning methods such as linear regression and KNN.
Supervised Machine Learning is all about ‘learning’ a function given a training set of examples.
Machine learning methods should derive a function that can generalize well to inputs not in the training set, since then we can actually apply it to inputs for which we do not have an output.
In this class we are going to cover the simplest model of an artificial neural network out there.
What is an artificial neural network (ANN)? What is a perceptron?
These all started as attempts to mathematically model the neuron. The question generally was - what kind of computational systems can be made that are inspired by the biological neuron?
The perceptron is a supervised learning algorithm that can separate two classes when the data is linearly separable.
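A minimal sketch of a perceptron learning the logical AND function, which is linearly separable. It assumes a step activation (output 1 when the weighted sum is at least 0), a bias term, and the classic update rule of nudging each weight by the error times its input. The function names and the learning rate are choices made for this example.

```python
def predict(weights, bias, inputs):
    """Step activation: 1 if the weighted sum is at least 0, otherwise 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total >= 0 else 0

def train_perceptron(samples, learning_rate=1.0, epochs=20):
    """Adjust the weights a little every time a prediction is wrong."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias

# Training set: (inputs, expected output) for logical AND.
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(and_gate)
for inputs, target in and_gate:
    print(inputs, "->", predict(weights, bias, inputs), "expected", target)
```

The same code cannot learn XOR, because no single straight line separates its two classes.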
Further material:
Watch:
- 10.1: Introduction to Neural Networks - The Nature of Code
- 10.2: Neural Networks: Perceptron Part 1 - The Nature of Code (first 12 minutes)
- 3.1: Introduction to Session 3 - What is Machine Learning?
Read:
- A ‘Brief’ History of Neural Nets and Deep Learning
- Nature of Code Chapter 10. Neural Networks
- A Quick Introduction to Neural Networks
- Nature of Code Intelligence and learning - Week 4 Neural Networks
- A Visual and Interactive Guide to the Basics of Neural Networks (really great interactive)
- Calculate the Decision Boundary of a Single Perceptron - Visualizing Linear Separability
Datasets and Scraping
What are datasets? How are they used?
Image Datasets:
Recommendation Systems:
Music:
More here
Other interesting datasets:
Scraping datasets
Can’t find a way to download what you want? First try seeing if anyone else has downloaded it yet. Is there a download functionality? Try contacting someone and seeing if they’ll share the data.
You still need to keep copyright in mind - you could violate copyright by redisplaying content.
Class 3 - 11/13
Vectors and classification
What is a vector?
- A vector has magnitude
- A vector has direction
A vector’s magnitude is the length - the size. A vector is a set of instructions on how to get from the tail to the tip
Why vectors?
Vectors can be used for physics. Vectors can explain movement, static forces, and so many other things. Vectors can be used to describe position in a space. But when we talk about space we often think of 2d or 3d. Vectors can exist in many more dimensions. And sometimes it makes sense to pair a bunch of numbers together! There are a bunch of operations we want to do on all of the numbers at once, so we consider them together
Vector math we reviewed:
- Add/sub
- mult/div
- Magnitude
- Normalize
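For reference, here are those operations written out in plain Python on lists of numbers. The function names are chosen for this sketch. In practice NumPy arrays give you the same operations with less code.

```python
import math

def add(a, b):
    """Add two vectors component by component."""
    return [x + y for x, y in zip(a, b)]

def subtract(a, b):
    """Subtract vector b from vector a component by component."""
    return [x - y for x, y in zip(a, b)]

def multiply(v, scalar):
    """Scale a vector: same direction, different length."""
    return [x * scalar for x in v]

def divide(v, scalar):
    """Shrink a vector by a scalar."""
    return [x / scalar for x in v]

def magnitude(v):
    """Length of the vector (Pythagorean theorem in any number of dimensions)."""
    return math.sqrt(sum(x * x for x in v))

def normalize(v):
    """Same direction, but with a length of 1."""
    return divide(v, magnitude(v))

v = [3.0, 4.0]
print(add(v, [1.0, 1.0]))
print(magnitude(v))
print(normalize(v))
```

Normalizing is what lets us compare directions without caring how long the vectors are.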
KNN
- “Tell me who your neighbor is, and I’ll tell you who you are”
- “K-Nearest Neighbor” is a machine learning algorithm used for both classification and regression. It is a “lazy learning” algorithm because there really is no learning at all. New data points are classified / valued according to a distance comparison with every data point in a training set. (source: https://github.com/nature-of-code/NOC-S17-2-Intelligence-Learning/blob/master/week3-classification-regression/README.md)
- Nice interactive demo here a bit further down the page
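A small sketch of KNN classification in plain Python, assuming Euclidean distance and a simple majority vote among the K nearest points. The training points and labels are invented, and this is not the notebook code from class.

```python
import math
from collections import Counter

def knn_classify(query, training_set, k=3):
    """Label a new point by majority vote among its k nearest neighbors."""
    # Compare the query against every point in the training set.
    by_distance = sorted(training_set, key=lambda item: math.dist(query, item[0]))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Each entry is (point, label).
training_set = [
    ((1.0, 1.0), "red"),
    ((1.5, 2.0), "red"),
    ((2.0, 1.0), "red"),
    ((8.0, 8.0), "blue"),
    ((9.0, 8.5), "blue"),
    ((8.5, 9.0), "blue"),
]

print(knn_classify((2.0, 2.0), training_set))
print(knn_classify((8.0, 9.0), training_set))
```

Notice there is no training step: all of the work happens at prediction time, which is why it is called lazy.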
The notebook from the class should be listed on the “notebooks” page.
More reading:
- CS231 - Python Numpy Tutorial
- K-Nearest Neighbors Algorithm in Python and Scikit-Learn
- ml4a KNN Classification
- Pandas
- Pandas documentation
- How to master Pandas
Class 2 - 11/6
What is scripting?
- A scripting or script language is a programming language for a special run-time environment that automates the execution of tasks; the tasks could alternatively be executed one-by-one by a human operator. Scripting languages are often interpreted. - Wikipedia
Python is a scripting language that we will be using in this class. It is a highly versatile, high-level, general-purpose programming language that is quickly becoming one of the most popular languages.
Using the terminal, we installed Homebrew
First, make sure you have Xcode command-line tools installed. Note - this is not the Xcode editor. These are separate tools that run from your command line. Run the commands below (without the $ at the beginning of the line):
$ xcode-select --install
Next, we install Homebrew by running the following command:
$ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Run the following command once you’re done to ensure Homebrew is installed and working properly:
$ brew doctor
Next, we install Python
$ brew install python
Check your install with the following command:
$ python3 --version
It should report a version of Python 3 or higher.
Next we can use pip, Python’s package manager, to install Jupyter Notebook and other dependencies:
$ pip3 install jupyterlab
$ pip3 install matplotlib
We then run the command below to open a jupyter notebook. Make sure you are in the correct directory when you run this command.
$ jupyter notebook
You can review the in class exercises we did here:
Some recommended resources:
- learnpython.org
- Codecademy
- Freecodecamp
- Learn Python 3 the Hard Way (book)
- A good post on whitespace for those who are curious
- Another good post on scoping for index variables - or why “fruit” persisted outside the scope of our for loop
Class 1 - 10/30
Introductory class. We began by having everyone answer the following questions:
- Why are you taking this class?
- What is your background and experience? What are you interested in?
- What did you do this summer?
This class is about using machine learning to explore archives.
The data problem - we have more information now than ever before, and it is growing exponentially.
Cultural institutions are digitizing and open sourcing massive collections. Beyond this, what is a dataset? Perhaps shopping inventories are datasets? Movies are a dataset. Google photos is a dataset.
There are growing datasets that we want to be able to search on a semantic basis.
At the same time, people expect better and better digital experiences. Companies like Google, Uber, etc are spending tons on design to create the best experiences they can. They are setting the bar for online experiences.
Design is about sorting/organizing/making information digestible. We are designing the experience of exploring a dataset using machine learning as a design tool.
A main concept for the class is using machine learning and AI to augment or facilitate, not to create.
We looked at the following references (the first six do not use ML):
- Below the Surface
- Cooper Hewitt Collection
- Art Palette
- Coins
- NYPL
- Navigating The Green Book
- Neural Neighbors: Capturing Image Similarity
- PixPlot
- The Norwegian National Museum
- Identifying art through machine learning
- Teachable Machine
- Pattern Radio: Whale Songs
- Talk to Books
- Semantris
- Font Map
- The Infinite Drum Machine
- Bird Sounds
- Boil the Frog
- Every Noise at Once
- X Degrees of Separation
- NASA’s Visual Universe
- Curator Table
And more listed here
We also learned how to use the command line to do simple tasks
Important commands to be familiar with:
- pwd
- ls
- cd
- mkdir
- rmdir
- rm
- mv
- cp