Classes

Class 6 - 12/11

Code from this class can be found here

Class 5 - 12/5

Semantic organization of text

Analyzing text:

Bag of words model:

Bag of words (BOW) is a representation of text that describes the occurrence of words within a document.

It measures whether known words occur in the document, and how often.

It’s called bag of words because we discard the structure of the document and focus only on whether or not the words appear and how frequently, not where they occur or in what order.

There are a few different ways to approach bag of words:

Count Occurrence

Let’s look at an example:

We can look at each line as a “document”. First - what is the unique vocabulary of this corpus? (ignoring case and punctuation)

This is a unique vocabulary of 8 words out of a corpus of 20

Next we create document vectors - the goal is to turn each document into a vector that we can use as input or output for a model. We have a vocabulary of 8, so our vector length is 8. Thus, our first document becomes:

[1, 1, 1, 1, 1, 0, 0, 0]
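For illustration, here is a minimal sketch of count occurrence using scikit-learn’s CountVectorizer (scikit-learn is an assumed library choice, and the documents below are placeholders, not the example from class):

from sklearn.feature_extraction.text import CountVectorizer

# placeholder documents - one "document" per line, as in the class example
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# CountVectorizer lowercases and strips punctuation by default
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(sorted(vectorizer.vocabulary_))  # the unique vocabulary (7 words here)
print(X.toarray())                     # one count vector per document, length 7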

For a large corpus, we get something called a sparse vector - tons of vocabulary means vectors with many zeros, and sparse vectors are harder to compute with. To shrink the vocabulary we can ignore stop words, punctuation, and case, fix misspellings, and stem words (playing -> play). The problem with this approach is that high-frequency words can dominate the model and cause bias. Think about it - less frequent words might be more important.

TF-IDF

TF-IDF takes another approach - high-frequency words may not provide much information gain. In other words, rare words contribute more weight to the model.

The scores are a weighting: not all words are equally important or interesting.

Term Frequency (TF): a scoring of the frequency of the word in the current document. Since documents differ in length, a term may appear many more times in a long document than in a short one, so the term frequency is often divided by the document length to normalize.

      number of times term t appears in the document
tf =   ------------------------------------------
          total number of terms in the document


Inverse Document Frequency (IDF): a scoring of how rare the word is across documents. The rarer the term, the higher the IDF score.

                 total number of documents
idf = log_e(-----------------------------------)
            number of documents with term in it


TF-IDF is simply the product of these two factors:

tfidf = tf * idf
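A minimal sketch using scikit-learn’s TfidfVectorizer (an assumed library choice; note that scikit-learn adds smoothing terms to the idf formula, so its scores differ slightly from the plain definition above):

from sklearn.feature_extraction.text import TfidfVectorizer

# placeholder documents, same as in the count example above
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# words shared by every document ("the", "sat", "on") get a lower idf
# than words unique to one document ("cat", "dog")
print(sorted(vectorizer.vocabulary_))
print(X.toarray())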

Doc2vec

Distributed Representations of Sentences and Documents - introduced in 2014: https://arxiv.org/abs/1405.4053. We will use it to perform feature extraction on a corpus. Unlike bag of words models, the idea with Doc2Vec is to maintain the ordering and semantics of the words. This should give us better features than TF-IDF.

The idea for Doc2Vec started with Word2Vec. Word2Vec is a three-layer neural network with one input, one hidden, and one output layer. The input layer corresponds to signals for context (the surrounding words) and the output layer corresponds to signals for the predicted target word. As the training procedure repeats this process over a large number of sentences (or phrases), the weights “stabilize”. These weights are then used as the vectorized representations of words.
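A minimal sketch using gensim’s Doc2Vec implementation (gensim, the tiny corpus, and the parameters below are all illustrative assumptions):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# placeholder corpus - each document is tokenized and tagged with an id
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]
corpus = [TaggedDocument(words=d.split(), tags=[i]) for i, d in enumerate(docs)]

# vector_size is the length of the learned document vectors
model = Doc2Vec(vector_size=50, min_count=1, epochs=40)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# infer a fixed-length feature vector for a new, unseen document
vec = model.infer_vector("the cat sat on the log".split())
print(vec.shape)  # (50,)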

Additional Resources:

Class 4 - 11/20

Feature extraction and exploration methodologies

What is feature extraction?

KMeans Clustering

UMAP

Additional resources:

Class 4 - supplemental

Multilayer Perceptrons and Gradient Descent

Plain vanilla neural network, aka the multilayer perceptron

We are going to use tensorflow 2.0

MNIST: each image is 28 x 28 pixels - 784 numbers total

each one holds the grayscale value of that pixel (0 - 1)

general structure of the network:

input -> hidden layer -> output

so how does this work?

when the network sees some specific features, certain parts of it activate in response

just like our perceptron, we have weights for each connection

we take the weighted sum, and then calculate our activation. We want our activations to be between 0 and 1. To achieve this, we use the sigmoid function: 1 / (1 + e^-x). So the activation of a neuron is a measure of how positive its weighted sum is.
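A minimal sketch of such a network in TensorFlow 2.0 (the hidden layer size and training settings below are illustrative choices, not the exact ones from class):

import tensorflow as tf

# load mnist and flatten each 28 x 28 image into 784 grayscale values in 0 - 1
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784) / 255.0
x_test = x_test.reshape(-1, 784) / 255.0

# input -> hidden layer -> output, with sigmoid activations in the hidden layer
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="sigmoid", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# "sgd" is stochastic gradient descent, discussed next
model.compile(optimizer="sgd",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)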

Stochastic Gradient Descent:

Some terminology:

Additional resources:

Perceptrons

Earlier we covered machine learning methods such as linear regression and KNN.

Supervised Machine Learning is all about ‘learning’ a function given a training set of examples.

Machine learning methods should derive a function that can generalize well to inputs not in the training set, since then we can actually apply it to inputs for which we do not have an output.

In this class we are going to cover the simplest model of an artificial neural network out there.

What is an artificial neural network (ANN)? What is a perceptron?

These all started as attempts to mathematically model the neuron. The question generally was - what kind of computational systems can be made that are inspired by the biological neuron?

The perceptron is a form of supervised learning that can differentiate between linearly separable datasets.
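A minimal sketch of the perceptron learning rule in NumPy (the AND dataset, learning rate, and epoch count are illustrative assumptions):

import numpy as np

# a toy linearly separable dataset: logical AND
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w = np.zeros(2)  # one weight per input
b = 0.0          # bias
lr = 0.1         # learning rate

for epoch in range(20):
    for xi, target in zip(X, y):
        # step activation: fire if the weighted sum is positive
        prediction = 1 if np.dot(w, xi) + b > 0 else 0
        # perceptron update rule: nudge the weights toward the correct output
        error = target - prediction
        w += lr * error * xi
        b += lr * error

print(w, b)  # the learned line separating the two classes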

Further material:

Watch:

Read:

Datasets and Scraping

What are datasets? How are they used?

Image Datasets:

Recommendation Systems:

Music:

More here

Other interesting datasets:

Scraping datasets

Can’t find a way to download what you want? First, see if anyone else has already downloaded it. Is there a download functionality? Try contacting someone and seeing if they’ll share the data.

Is scraping legal?

You still need to keep copyright in mind - you could violate copyright by redisplaying content.

Class 3 - 11/13

Vectors and classification

What is a vector?

A vector’s magnitude is its length - its size. A vector is a set of instructions for how to get from the tail to the tip.

Why vectors?

Vectors can be used for physics: they can describe movement, static forces, and so many other things. Vectors can be used to describe position in a space. But when we talk about space we often think of 2D or 3D - vectors can exist in many more dimensions. And sometimes it just makes sense to pair a bunch of numbers together! There are a bunch of operations we want to perform on those numbers as a group, so we consider them together.

Vector math we reviewed:
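A quick NumPy sketch of common vector operations (assuming basics like addition, scaling, magnitude, and the dot product were among those reviewed):

import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

print(a + b)              # addition: chain two sets of instructions together
print(2 * a)              # scalar multiplication: scale the vector
print(np.linalg.norm(a))  # magnitude (length): 5.0 for [3, 4]
print(np.dot(a, b))       # dot product: how much the two vectors align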

KNN

The notebook from the class should be listed on the “notebooks” page.

More reading:

Class 2 - 11/6

What is scripting?

Python is a scripting language that we will be using in this class. It is a highly versatile, high-level, general-purpose programming language that is quickly becoming one of the most popular languages.

Using the terminal, we installed Homebrew

First, make sure you have Xcode command-line tools installed. Note - this is not the Xcode editor. These are separate tools that run from your command line. Run the commands below (without the $ at the beginning of the line):

$ xcode-select --install

Next, we install Homebrew by running the following command:

$ ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Run the following command once you’re done to ensure Homebrew is installed and working properly:

$ brew doctor

Next, we install Python

$ brew install python

Check your install with the following command

$ python --version

It should report a version of Python 3 or higher

Next we can use pip, Python’s package manager, to install Jupyter and other dependencies:

$ pip3 install jupyterlab
$ pip3 install matplotlib

We then run the command below to open a jupyter notebook. Make sure you are in the correct directory when you run this command.

$ jupyter notebook

You can review the in-class exercises we did here:

Some recommended resources:

Class 1 - 10/30

Introductory class. We began by having everyone answer the following questions:

This class is about using machine learning to explore archives.

The data problem - we have more information now than ever before, and it is growing exponentially.

Cultural institutions are digitizing and open-sourcing massive collections. Beyond this, what is a dataset? Perhaps shopping inventories are datasets? Movies are a dataset. Google Photos is a dataset.

There are growing datasets that we want to be able to search on a semantic basis.

At the same time, people expect better and better digital experiences. Companies like Google, Uber, etc are spending tons on design to create the best experiences they can. They are setting the bar for online experiences.

Design is about sorting/organizing/making information digestible. We are designing the experience of exploring a dataset using machine learning as a design tool.

A main concept for the class is using machine learning and AI to augment or facilitate, not to create.

We looked at the following references (the first six do not use ML):

And more listed here

We also learned how to use the command line to do simple tasks

Quick guide here

Important commands to be familiar with:
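A few likely candidates, assuming the standard navigation basics were covered:

$ pwd         # print the current working directory
$ ls          # list files in the current directory
$ cd projects # move into a directory (here, a folder named projects)
$ mkdir data  # make a new directory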