
Geometric view of data

Objectives

In this notebook we’ll explore how multivariate data can be represented in different orthonormal bases. In other words, how to change the coordinate axes used to describe a set of data. This notebook will scaffold your intuition for these concepts and help you understand the use of PCA in the research you will be reading about.

Overview:

  • Explore correlated multivariate data.

  • Define and visualize an arbitrary orthonormal basis.

  • Project data from one basis (Cartesian) onto another, arbitrary basis.

Specifically, the code in the interactive demos will draw random samples from a zero-mean bivariate normal distribution with a specified covariance matrix. Throughout the tutorial, we’ll imagine these samples represent the activity (firing rates) of two recorded neurons on different trials.

Setup

#@title {display-mode: "form"}

#@markdown Execute this code cell to set up the notebook

import numpy as np
import matplotlib.pyplot as plt

import ipywidgets as widgets  # interactive display
%config InlineBackend.figure_format = 'retina'
plt.style.use("https://raw.githubusercontent.com/NeuromatchAcademy/course-content/master/nma.mplstyle")


def plot_data(X):
    """
    Plots bivariate data. Includes a plot of each random variable, and a scatter
    plot of their joint activity. The title indicates the sample correlation
    calculated from the data.

    Args:
    X (numpy array of floats) :   Data matrix; each column corresponds to a
                                  different random variable

    Returns:
    Nothing.
    """

    fig = plt.figure(figsize=[10, 6])
    gs = fig.add_gridspec(2, 2)
    ax1 = fig.add_subplot(gs[0, 0])
    ax1.plot(X[:, 0], color='k')
    plt.ylabel('Neuron 1')
    plt.title('Sample var 1: {:.1f}'.format(np.var(X[:, 0])))
    ax1.set_xticklabels([])
    ax2 = fig.add_subplot(gs[1, 0])
    ax2.plot(X[:, 1], color='k')
    plt.xlabel('Sample Number')
    plt.ylabel('Neuron 2')
    plt.title('Sample var 2: {:.1f}'.format(np.var(X[:, 1])))
    ax3 = fig.add_subplot(gs[:, 1])
    ax3.plot(X[:, 0], X[:, 1], '.', markerfacecolor=[.5, .5, .5],
           markeredgewidth=0)
    ax3.axis('equal')
    plt.xlabel('Neuron 1 activity')
    plt.ylabel('Neuron 2 activity')
    plt.title('Sample corr: {:.1f}'.format(np.corrcoef(X[:, 0], X[:, 1])[0, 1]))
    plt.show()


def plot_basis_vectors(X, W):
    """
    Plots bivariate data as well as new basis vectors.

    Args:
    X (numpy array of floats) :   Data matrix; each column corresponds to a
                                  different random variable
    W (numpy array of floats) :   Square matrix representing new orthonormal
                                  basis; each column represents a basis vector

    Returns:
    Nothing.
    """

    plt.figure(figsize=[4, 4])
    plt.plot(X[:, 0], X[:, 1], '.', color=[.5, .5, .5], label='Data')
    plt.axis('equal')
    plt.xlabel('Neuron 1 activity')
    plt.ylabel('Neuron 2 activity')
    plt.plot([0, W[0, 0]], [0, W[1, 0]], color='r', linewidth=3,
           label='Basis vector 1')
    plt.plot([0, W[0, 1]], [0, W[1, 1]], color='b', linewidth=3,
           label='Basis vector 2')
    plt.legend()
    plt.show()


def plot_data_new_basis(Y):
    """
    Plots bivariate data after transformation to new bases.
    Similar to plot_data but with colors corresponding to projections onto
    basis 1 (red) and basis 2 (blue). The title indicates the sample correlation
    calculated from the data.

    Note that samples are re-sorted in ascending order for the first
    random variable.

    Args:
    Y (numpy array of floats):   Data matrix in new basis each column
                                 corresponds to a different random variable

    Returns:
    Nothing.
    """
    fig = plt.figure(figsize=[8, 4])
    gs = fig.add_gridspec(2, 2)
    ax1 = fig.add_subplot(gs[0, 0])
    ax1.plot(Y[:, 0], 'r')
    plt.ylabel('Projection \n basis vector 1')
    plt.title('Sample var 1: {:.1f}'.format(np.var(Y[:, 0])))
    ax1.set_xticklabels([])
    ax2 = fig.add_subplot(gs[1, 0])
    ax2.plot(Y[:, 1], 'b')
    plt.xlabel('Sample number')
    plt.ylabel('Projection \n basis vector 2')
    plt.title('Sample var 2: {:.1f}'.format(np.var(Y[:, 1])))
    ax3 = fig.add_subplot(gs[:, 1])
    ax3.plot(Y[:, 0], Y[:, 1], '.', color=[.5, .5, .5])
    ax3.axis('equal')
    plt.xlabel('Projection basis vector 1')
    plt.ylabel('Projection basis vector 2')
    plt.title('Sample corr: {:.1f}'.format(np.corrcoef(Y[:, 0], Y[:, 1])[0, 1]))
    plt.show()

def get_data(cov_matrix, n):
    """
    Returns a matrix of n samples from a bivariate, zero-mean Gaussian.

    Args:
    cov_matrix (numpy array of floats): desired covariance matrix
    n (scalar)                        : number of samples to draw

    Returns:
    (numpy array of floats) : samples from the bivariate Gaussian, with each
                              column corresponding to a different random
                              variable
    """

    mean = np.array([0, 0])
    X = np.random.multivariate_normal(mean, cov_matrix, size=n)

    return X

def _calculate_cov_matrix(var_1, var_2, corr_coef):
    """
    Calculates the covariance matrix based on the variances and correlation
    coefficient.
    Args:
    var_1 (scalar)          : variance of the first random variable
    var_2 (scalar)          : variance of the second random variable
    corr_coef (scalar)      : correlation coefficient
    Returns:
    (numpy array of floats) : covariance matrix
    """

    # Calculate the covariance from the variances and correlation
    cov = corr_coef * np.sqrt(var_1 * var_2)

    cov_matrix = np.array([[var_1, cov], [cov, var_2]])

    return cov_matrix

def define_orthonormal_basis(u):
    """
    Calculates an orthonormal basis given an arbitrary vector u.
    Args:
    u (numpy array of floats) : arbitrary 2-dimensional vector used for new
                                basis
    Returns:
    (numpy array of floats)   : new orthonormal basis;
                                columns correspond to basis vectors
    """

    # Normalize vector u
    u = u / np.sqrt(u[0] ** 2 + u[1] ** 2)

    # Calculate vector w that is orthogonal to u
    w = np.array([-u[1], u[0]])

    # Put in matrix form
    W = np.column_stack([u, w])

    return W

def change_of_basis(X, W):
    """
    Projects data onto new basis W.
    Args:
    X (numpy array of floats) : Data matrix; each column corresponds to a
                                different random variable
    W (numpy array of floats) : new orthonormal basis; columns correspond to
                                basis vectors
    Returns:
    (numpy array of floats)    : Data matrix expressed in new basis
    """

    # Project data onto new basis described by W
    Y = X @ W

    return Y

Section 1: (Correlated) multivariate data

Executing the code cell below will enable you to change (via an interactive slider) the correlation coefficient between the data in each of two dimensions. Get a feel for how changing the correlation coefficient affects the geometry of the simulated population data in the 2-dimensional (i.e., multivariate) space.

  1. What effect do negative correlation coefficient values have on the geometry of the population data?

  2. What correlation coefficient results in a circular data cloud?

Click here if you are interested in details about the math behind multivariate normal distributions.

To gain intuition, we will first use a simple model to generate multivariate data. Specifically, we will draw random samples from a bivariate normal distribution. This is an extension of the one-dimensional normal distribution to two dimensions, in which each \(x_i\) is marginally normal with mean \(\mu_i\) and variance \(\sigma_i^2\):

(4)\[\begin{align} x_i \sim \mathcal{N}(\mu_i,\sigma_i^2). \end{align}\]

Additionally, the joint distribution for \(x_1\) and \(x_2\) has a specified correlation coefficient \(\rho\). Recall that the correlation coefficient is a normalized version of the covariance, and ranges between -1 and +1:

(5)\[\begin{align} \rho = \frac{\text{cov}(x_1,x_2)}{\sqrt{\sigma_1^2 \sigma_2^2}}. \end{align}\]

For simplicity, we will assume that the mean of each variable has already been subtracted, so that \(\mu_i=0\) for both \(i=1\) and \(i=2\). The remaining parameters can be summarized in the covariance matrix, which for two dimensions has the following form:

(6)\[\begin{align} {\bf \Sigma} = \begin{pmatrix} \text{var}(x_1) & \text{cov}(x_1,x_2) \\ \text{cov}(x_1,x_2) &\text{var}(x_2) \end{pmatrix}. \end{align}\]

In general, \(\bf \Sigma\) is a symmetric matrix with the variances \(\text{var}(x_i) = \sigma_i^2\) on the diagonal, and the covariances on the off-diagonal. Later, we will see that the covariance matrix plays a key role in PCA.

The covariance can be found by rearranging the equation above:

(7)\[\begin{align} \text{cov}(x_1,x_2) = \rho \sqrt{\sigma_1^2 \sigma_2^2}. \end{align}\]
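
Before running the demo, you can sanity-check equation (7) numerically. The sketch below assumes the Setup cell above has been executed (it reuses the _calculate_cov_matrix and get_data helpers); the variances and rho are arbitrary illustrative values.

# Sanity check of equation (7): build a covariance matrix from variances
# and a correlation coefficient, then confirm that the sample covariance
# of data drawn with it comes out close.
rho = 0.8
cov_matrix = _calculate_cov_matrix(var_1=1.0, var_2=2.0, corr_coef=rho)
print(cov_matrix)  # off-diagonal entries = 0.8 * sqrt(1.0 * 2.0) ≈ 1.13

# Sample covariance (rowvar=False because variables are in columns);
# with many samples this should approximate cov_matrix.
X = get_data(cov_matrix, n=10000)
print(np.cov(X, rowvar=False))
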
#@title Interactive Demo 1 {display-mode: "form"}

#@markdown Execute this cell to enable the interactive demo


@widgets.interact(
    corr_coef = widgets.FloatSlider(value=.2, 
                                    min=-1, max=1, 
                                    step=0.05))
def visualize_correlated_data(corr_coef=0):
    variance_1 = 1
    variance_2 = 1

    # Compute covariance matrix
    cov_matrix = _calculate_cov_matrix(variance_1, variance_2, corr_coef)

    # Generate data with this covariance matrix
    n = 150
    X = get_data(cov_matrix, n)

    # Visualize
    plot_data(X)

Section 2: Project data onto new basis

A “basis” is the set of reference directions used to describe the data. For example, the basis of the 2-dimensional Cartesian plane is the pair of x and y axes (i.e., two orthogonal directions, each spanning the real numbers).

Data can be represented in many ways using different bases. Here, we will take bivariate (2-dimensional) data, defined by the matrix of coordinates \(\bf X\) in the original 2-D Cartesian basis, and express it in a new 2-dimensional “orthonormal” basis (a basis whose vectors are mutually orthogonal and have length 1). The bivariate data is then defined by the matrix of coordinates \(\bf Y\) in the new basis.

Click here if you are interested in some detail about the math of defining orthonormal bases.

We will define a new orthonormal basis of vectors \({\bf u} = [u_1,u_2]\) and \({\bf w} = [w_1,w_2]\). Two vectors are orthonormal if:

  1. They are orthogonal (i.e., their dot product is zero):

(8)\[\begin{align} {\bf u\cdot w} = u_1 w_1 + u_2 w_2 = 0 \end{align}\]
  2. They have unit length:

(9)\[\begin{align} ||{\bf u} || = ||{\bf w} || = 1 \end{align}\]

In two dimensions, it is easy to make an arbitrary orthonormal basis. All we need is a random vector \({\bf u}\), which we have normalized. If we now define the second basis vector to be \({\bf w} = [-u_2,u_1]\), we can check that both conditions are satisfied:

(10)\[\begin{align} {\bf u\cdot w} = - u_1 u_2 + u_2 u_1 = 0 \end{align}\]

and

(11)\[\begin{align} {|| {\bf w} ||} = \sqrt{(-u_2)^2 + u_1^2} = \sqrt{u_1^2 + u_2^2} = 1, \end{align}\]

where we used the fact that \({\bf u}\) is normalized. So, with an arbitrary input vector, we can define an orthonormal basis, which we will write in matrix form by stacking the basis vectors as its columns:

(12)\[\begin{align} {{\bf W} } = \begin{pmatrix} u_1 & w_1 \\ u_2 & w_2 \end{pmatrix}. \end{align}\]
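
As a quick numerical check, we can verify that the matrix returned by define_orthonormal_basis (from the Setup cell) satisfies both conditions above: for an orthonormal \(\bf W\), the product \({\bf W}^\top {\bf W}\) is the identity matrix. The input vector below is just an arbitrary example.

# Verify orthonormality: W.T @ W should be (approximately) the identity.
u = np.array([3.0, 1.0])  # arbitrary example input vector
W = define_orthonormal_basis(u)
print(W.T @ W)  # expect [[1, 0], [0, 1]] up to floating-point error
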

Projecting (transforming) the data from one basis into another is done using matrix multiplication:

(13)\[\begin{align} {\bf Y = X W}. \end{align}\]
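
A useful consequence of orthonormality: the inverse of \(\bf W\) is simply its transpose, so the original data can be recovered from the projection via \({\bf X = Y W}^\top\). The short sketch below (again assuming the Setup cell has run) demonstrates this round trip.

# Project data into the new basis and recover it again. Since W is
# orthonormal, W @ W.T = I, so Y @ W.T = X @ W @ W.T = X.
cov_matrix = _calculate_cov_matrix(1, 1, corr_coef=0.5)
X = get_data(cov_matrix, n=5)
W = define_orthonormal_basis(np.array([1.0, 2.0]))
Y = change_of_basis(X, W)       # Y = X W
print(np.allclose(X, Y @ W.T))  # True: the transformation is invertible
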

In the interactive demo below, you can explore the geometry of the transformed data \(\bf Y\) as you vary the choice of basis.

  • The parameter corr_coef controls the correlation between the two original dimensions.

  • The parameter \(\theta\) controls the angle (in degrees) of the first new basis vector (red, \(\bf u\)) relative to the original Cartesian basis.

    The second new basis vector will be orthogonal to the first.

Think about the following questions as you explore the demo:

  1. What happens to the projected data as you rotate the basis?

  2. How does the correlation coefficient change? How does the variance of the projection onto each basis vector change?

  3. Are you able to find a basis in which the projected data is uncorrelated? (The sketch below this list shows one systematic way to find such a basis.)
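
If question 3 has you stuck after exploring the slider, the sketch below shows one systematic approach (a preview of PCA, not part of the demo itself). It reuses the Setup-cell helpers: the eigenvectors of the sample covariance matrix form an orthonormal basis in which the projected data are uncorrelated.

# Find a decorrelating basis via eigendecomposition of the sample covariance.
cov_matrix = _calculate_cov_matrix(1, 1, corr_coef=0.8)
X = get_data(cov_matrix, n=1000)
_, W_eig = np.linalg.eigh(np.cov(X, rowvar=False))  # eigenvectors in columns
Y = change_of_basis(X, W_eig)
print(np.corrcoef(Y[:, 0], Y[:, 1])[0, 1])  # ≈ 0: projections uncorrelated
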

#@title Interactive Demo 2 {display-mode: "form"}

#@markdown Run this code cell to enable the demo

def refresh(corr_coef=0.5, theta=0):
    variance_1 = 1
    variance_2 = 1

    # Compute covariance matrix
    cov_matrix = _calculate_cov_matrix(variance_1, variance_2, corr_coef)

    # Generate data with this covariance matrix
    n = 1000
    X = get_data(cov_matrix, n)

    # Define the new basis: first basis vector at angle theta (in degrees)
    # from the x-axis, then project the data onto it
    u = [1, np.tan(theta * np.pi / 180)]
    W = define_orthonormal_basis(u)
    Y = change_of_basis(X, W)

    # Plot the original data with basis vectors (top half) and the
    # projected data (bottom half)
    fig = plt.figure(figsize=[20, 10])
    gs = fig.add_gridspec(4, 4)

    ax1 = fig.add_subplot(gs[0, 0])
    ax1.plot(X[:, 0], color='k')
    plt.ylabel('Neuron 1')
    plt.title('Sample var 1: {:.1f}'.format(np.var(X[:, 0])))
    ax1.set_xticklabels([])

    ax2 = fig.add_subplot(gs[1, 0])
    ax2.plot(X[:, 1], color='k')
    plt.xlabel('Sample Number')
    plt.ylabel('Neuron 2')
    plt.title('Sample var 2: {:.1f}'.format(np.var(X[:, 1])))

    ax3 = fig.add_subplot(gs[0:2, 1])
    ax3.plot(X[:, 0], X[:, 1], '.', markerfacecolor=[.5, .5, .5],
           markeredgewidth=0)
    ax3.plot([0, W[0, 0]], [0, W[1, 0]], color='r', linewidth=3,
            label='Basis vector 1')
    ax3.plot([0, W[0, 1]], [0, W[1, 1]], color='b', linewidth=3,
            label='Basis vector 2')
    ax3.axis('equal')
    plt.xlabel('Neuron 1 activity')
    plt.ylabel('Neuron 2 activity')
    plt.title('Sample corr: {:.1f}'.format(np.corrcoef(X[:, 0], X[:, 1])[0, 1]))

    ax4 = fig.add_subplot(gs[2, 0])
    ax4.plot(Y[:, 0], 'r')
    plt.ylabel('Projection \n basis vector 1')
    plt.title('Sample var 1: {:.1f}'.format(np.var(Y[:, 0])))
    ax4.set_xticklabels([])
    ax5 = fig.add_subplot(gs[3, 0])
    ax5.plot(Y[:, 1], 'b')
    plt.xlabel('Sample number')
    plt.ylabel('Projection \n basis vector 2')
    plt.title('Sample var 2: {:.1f}'.format(np.var(Y[:, 1])))
    ax6 = fig.add_subplot(gs[2:4, 1])
    ax6.plot(Y[:, 0], Y[:, 1], '.', color=[.5, .5, .5])
    ax6.axis('equal')
    plt.xlabel('Projection basis vector 1')
    plt.ylabel('Projection basis vector 2')
    plt.title('Sample corr: {:.1f}'.format(np.corrcoef(Y[:, 0], Y[:, 1])[0, 1]))
    plt.show()


_ = widgets.interact(refresh, corr_coef=(-1,1,0.1), theta=(-90, 90, 5))

Summary

  • In this tutorial, we learned that multivariate data can be visualized as a cloud of points in a high-dimensional vector space. The geometry of this cloud is shaped by the covariance matrix.

  • Multivariate data can be represented in a new orthonormal basis using the dot product (matrix multiplication). These new basis vectors correspond to specific mixtures of the original variables - for example, in neuroscience, they could represent different ratios of activation across a population of neurons.

  • The projected data (after transforming into the new basis) will generally have a different geometry from the original data. In particular, choosing basis vectors that are aligned with the spread of the cloud of points decorrelates the data.

  • These concepts - covariance, projections, and orthonormal bases - are key for understanding PCA, which is a foundational computational tool in a wide variety of neuroscience research.


Notation

(14)\[\begin{align} x_i &\quad \text{data point for dimension } i\\ \mu_i &\quad \text{mean along dimension } i\\ \sigma_i^2 &\quad \text{variance along dimension } i \\ \bf u, \bf w &\quad \text{orthonormal basis vectors}\\ \rho &\quad \text{correlation coefficient}\\ \bf \Sigma &\quad \text{covariance matrix}\\ \bf X &\quad \text{original data matrix}\\ \bf W &\quad \text{projection matrix}\\ \bf Y &\quad \text{transformed data}\\ \end{align}\]

This tutorial was written by Krista Perks for BIOL358 Motor Systems taught at Wesleyan University. Based on content from Neuromatch Academy 2020: Week 1, Day 5: Dimensionality Reduction by Alex Cayco Gajic and John Murray.