1. The problem

I have started the Coursera Machine Learning MOOC by Andrew Ng (see Coursera Machine Learning). In a Coursera MOOC, all videos can be downloaded from a main page where each one is tagged with a duration in minutes (see, for instance, Machine Learning Video Lectures). I wondered whether the size in bytes of a video could be predicted from its tagged duration. For instance, the first video, Welcome (7 min), corresponds to an mp4 file of 12,531,478 bytes.

2. Getting tagged durations and video sizes automatically with Python

On the download page, each video is available at the URL https://class.coursera.org/ml-005/lecture/download.mp4?lecture_id=%i, where the parameter is a video id (here, an integer between 1 and 114). Using Python, we can write a script that fetches the meta-information of each video (without downloading it) and extracts what we need: the tagged duration (which, fortunately, is part of the video filename) and the video size in bytes.

The Python urllib2 library has all the features needed to get this information automatically, and a regular expression extracts the durations.

        import re
        import urllib2

        # Look for the URL-encoded ' (X min)' tag in the video filename
        rx = re.compile(r"%20%28(\d+)%20min%29")
        course = 'ml-005' # course name
        nvideos = 114
        url = 'https://class.coursera.org/%s/lecture/download.mp4?lecture_id=%i'
        fo = open("%s.txt" % course, "a") # append mode: delete the file before re-running
        for i in range(1, nvideos + 1): # videos are numbered from 1 to nvideos
            try:
                u = urllib2.urlopen(url % (course, i))
            except urllib2.URLError: # in case a video is missing
                continue
            meta = u.info() # response headers; the body is never read
            u.close()
            length = int(meta.getheaders("Content-Length")[0]) # size in bytes
            disposition = meta.getheaders("Content-Disposition")[0] # includes the filename
            m = rx.search(disposition) # look for the minutes
            minutes = int(m.group(1)) if m else 0 # 0 when no '(X min)' tag is found
            fo.write("%i\t%i\n" % (minutes, length))
            print i, minutes, length # show progress
        fo.close()

This script generates a tab-separated file named ml-005.txt with the duration in minutes and the size in bytes of each video (see the output file). For this MOOC, we only get 113 results, since the video with id 94 is no longer referenced.
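The extraction step can be tried offline, without contacting the server. The following is a minimal Python 3 sketch, not the script used for this post: the header value is a made-up example modelled on the first video, and the function name tagged_minutes is only illustrative. It URL-decodes the Content-Disposition value first, so the pattern can be written as a plain ' (X min)'.

```python
import re
from urllib.parse import unquote

# Pattern for the ' (X min)' tag, applied after URL-decoding the header value
DURATION_RX = re.compile(r" \((\d+) min\)")

def tagged_minutes(disposition):
    """Return the tagged duration in minutes, or None if no tag is found."""
    match = DURATION_RX.search(unquote(disposition))
    return int(match.group(1)) if match else None

# Hypothetical header value (assumption), modelled on "Welcome (7 min)"
header = 'attachment; filename="Welcome%20%287%20min%29.mp4"'
print(tagged_minutes(header))  # prints 7
```

Returning None instead of 0 when no tag is found makes it easy to skip such videos rather than feed a zero duration to the regression.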

3. Using linear regression to fit the data with Octave

Then, starting from the first programming exercise of the MOOC, using Octave to fit a linear regression to these data with the gradient descent method is quite straightforward.

        % GNUPLOT setting
        graphics_toolkit gnuplot

        %% Initialization
        clear ; close all; clc

        %% ======================= Part 1: Plotting =======================
        fprintf('Plotting Data ...\n')
        data = load('ml.txt'); % output of the Python script (ml-005.txt above, assumed renamed to ml.txt)
        X = data(:, 1); y = data(:, 2)/1024/1024; % convert to MB
        m = length(y); % number of training examples

        plot(X, y, 'rx', 'MarkerSize', 10); % Plot the data
        ylabel('Size in MB'); % Set the y axis label
        xlabel('Duration in min'); % Set the x axis label

        %% =================== Part 2: Gradient descent ===================
        fprintf('Running Gradient Descent ...\n')

        X = [ones(m, 1), data(:,1)]; % Add a column of ones to x
        theta = zeros(2, 1); % initialize fitting parameters

        % Some gradient descent settings
        iterations = 1500;
        alpha = 0.01;

        % compute and display initial cost
        computeCost(X, y, theta)

        % run gradient descent
        theta = gradientDescent(X, y, theta, alpha, iterations);

        % print theta to screen
        fprintf('Theta found by gradient descent: ');
        fprintf('%f %f \n', theta(1), theta(2));

        %% ======================= Part 3: Plotting again =================
        % Plot Data
        hold on; % keep previous plot visible
        % Plot the linear fit
        plot(X(:,2), X*theta, '-')
        legend('Training data', 'Linear regression')
        hold off % don't overlay any more plots on this figure

The Octave script finally outputs a cost of 0.54824 and a theta vector of [0.407491 1.120686]. Therefore, the linear estimate of the video size y in MB as a function of its tagged duration x in minutes is given by the equation y = θ0 + θ1·x, with θ0 = 0.407491 and θ1 = 1.120686.
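The functions computeCost and gradientDescent come from the course exercise and are not listed here. To give a rough idea of what they compute, here is a minimal Python 3 sketch of batch gradient descent for a one-variable linear regression. It is an illustration under assumptions, not the course code: the function names, the tiny dataset (points lying exactly on y = 1 + 2x) and the settings alpha = 0.05 and 5000 iterations are all made up for the example.

```python
def compute_cost(xs, ys, theta):
    """Half of the mean squared error of the line theta[0] + theta[1] * x."""
    m = len(ys)
    return sum((theta[0] + theta[1] * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def gradient_descent(xs, ys, theta, alpha, iterations):
    """Update both parameters simultaneously, once per iteration."""
    m = len(ys)
    t0, t1 = theta
    for _ in range(iterations):
        errors = [t0 + t1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                             # derivative w.r.t. theta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m  # derivative w.r.t. theta1
        t0, t1 = t0 - alpha * grad0, t1 - alpha * grad1
    return t0, t1

# Made-up data lying exactly on y = 1 + 2x (assumption, for illustration only)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

print(compute_cost(xs, ys, (0.0, 0.0)))  # initial cost: 20.5
theta = gradient_descent(xs, ys, (0.0, 0.0), alpha=0.05, iterations=5000)
print("theta: %.4f %.4f" % theta)        # close to 1.0000 2.0000
```

On real data the cost does not reach zero, and the learning rate may need tuning so that the iterations converge.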

4. Graphical representation

The result of the linear regression is also plotted using Octave's graphics features.

 

We can see that the data roughly fit a straight line. The relationship between tagged durations and sizes is not exactly linear, however, since the videos may not all have the same compression rate and the tagged durations may be approximate or wrong.
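To get a feeling for how rough the fit is, the reported theta vector can be applied to the first video, Welcome (7 min), whose actual size is known. This is a small Python 3 check; the function name predict_size_mb is only illustrative.

```python
THETA0, THETA1 = 0.407491, 1.120686  # theta vector reported by the Octave script

def predict_size_mb(minutes):
    """Linear estimate of the video size in MB from its tagged duration."""
    return THETA0 + THETA1 * minutes

actual_mb = 12531478 / 1024 / 1024   # size of "Welcome (7 min)" converted to MB
predicted_mb = predict_size_mb(7)
print("predicted: %.2f MB, actual: %.2f MB" % (predicted_mb, actual_mb))
# prints "predicted: 8.25 MB, actual: 11.95 MB"
```

For this particular video the estimate is about 3.7 MB too low, which is consistent with the scatter visible around the regression line.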

5. Compressing data with PCA in Octave

The PCA script from the Coursera course can easily be adapted to compress these data. The top eigenvector lies on the first bisector y = x.
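The position of the top eigenvector can be explained with a short derivation. It assumes that featureNormalize centres each feature and scales it to unit standard deviation, and that pca diagonalizes the covariance matrix Σ of the normalized data (the usual behaviour of such helpers; their code is not shown here). With r the correlation coefficient between duration and size, and up to a constant factor:

```latex
\Sigma \;\propto\; \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix},
\qquad
\begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}
\begin{pmatrix} 1 \\ 1 \end{pmatrix}
= (1 + r) \begin{pmatrix} 1 \\ 1 \end{pmatrix},
\qquad
\begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}
\begin{pmatrix} 1 \\ -1 \end{pmatrix}
= (1 - r) \begin{pmatrix} 1 \\ -1 \end{pmatrix}
```

Since longer videos tend to be larger, r > 0, so the largest eigenvalue 1 + r belongs to the direction (1, 1), i.e. the line y = x in normalized coordinates. This holds for any two positively correlated features once they are normalized.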

        % GNUPLOT setting
        graphics_toolkit gnuplot

        %% Initialization
        clear ; close all; clc

        %  The following command loads the dataset. You should now have the 
        %  variable X in your environment
        X = load('ml.txt');
        X(:, 2) = X(:, 2) / 1024 / 1024; % convert to MB

        %  Visualize the example dataset
        plot(X(:, 1), X(:, 2), 'bo');
        axis([0 25 0 25]); axis square;

        %  Before running PCA, it is important to first normalize X
        [X_norm, mu, sigma] = featureNormalize(X);

        %  Run PCA
        [U, S] = pca(X_norm);

        %  Plot the normalized dataset (returned from pca)
        plot(X_norm(:, 1), X_norm(:, 2), 'bo');
        axis([-4 4 -4 4]); axis square

        %  Project the data onto K = 1 dimension
        K = 1;
        Z = projectData(X_norm, U, K);

        X_rec  = recoverData(Z, U, K);

        %  Draw lines connecting the projected points to the original points
        hold on;
        plot(X_rec(:, 1), X_rec(:, 2), 'ro');
        for i = 1:size(X_norm, 1)
            drawLine(X_norm(i,:), X_rec(i,:), '--k', 'LineWidth', 1);
        end
        legend('Training data', 'PCA', "location", "southeast")
        hold off

The raw and the reconstructed data can then be displayed on the same graph. A graphical comparison between PCA and linear regression (ordinary least squares) is given in this post: Principal Component Analysis vs Ordinary Least Squares: a Visual Explanation. Its examples are written in the R language.

 

6. To do…

Handle durations given in the MM:SS format and deal with failures when retrieving video meta-information.
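As a possible starting point for the first item, here is a minimal Python 3 sketch that converts an MM:SS tag into minutes. The tag format '(MM:SS)', the filename and the function name are assumptions made for the example, since the real filenames are not shown in this post.

```python
import re

# Assumed tag format: '(MM:SS)' somewhere in the video name
MMSS_RX = re.compile(r"\((\d+):([0-5]\d)\)")

def minutes_from_mmss(name):
    """Return the duration in minutes as a float, or None if no tag is found."""
    match = MMSS_RX.search(name)
    if match is None:
        return None
    return int(match.group(1)) + int(match.group(2)) / 60

print(minutes_from_mmss("Some Lecture (12:30).mp4"))  # made-up filename, prints 12.5
```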