1. The problem
I have started the Coursera Machine Learning MOOC by Andrew Ng (see Coursera Machine Learning). In Coursera MOOCs, all videos can be downloaded from a main page, where each one is tagged with a duration in minutes (see, for instance, Machine Learning Video Lectures). I wondered whether the size in bytes of a video could be predicted from its tagged duration. For instance, the first video,
Welcome (7 min), corresponds to an mp4 file of
2. Getting tagged durations and video sizes automatically with Python
From the download page, we can see that each video is downloadable with the URL:
https://class.coursera.org/ml-005/lecture/download.mp4?lecture_id=%i where the parameter is a video id (here, an integer between 1 and 114). Using Python, we can write a script that fetches the meta-information of each video (without downloading the video itself) and extracts what we need: the tagged duration (which, fortunately, is part of the video filename) and the video size in bytes.
The Python urllib2 library has all the features needed to get this information automatically, and a regular expression extracts the durations.
    import re
    import urllib2

    # looking for ' (X min)', URL-encoded, in the video filename
    rx = re.compile(r"%20%28(\d+)%20min%29")
    course = 'ml-005'  # course name
    nvideos = 114
    url = 'https://class.coursera.org/%s/lecture/download.mp4?lecture_id=%i'

    fo = open("%s.txt" % course, "a")
    for i in range(nvideos):  # videos are numbered from 1 to nvideos
        try:
            u = urllib2.urlopen(url % (course, i + 1))
        except urllib2.URLError:  # in case a video is lacking
            continue
        meta = u.info()  # meta-information of the page
        # getheader returns one string, whereas getheaders returns a list
        length = int(meta.getheader("Content-Length"))  # size in bytes
        disposition = meta.getheader("Content-Disposition", "")  # includes the filename
        m = rx.search(disposition)  # looking for minutes
        if m:
            minutes = int(m.group(1))
        else:
            minutes = 0
        fo.write("%i\t%i\n" % (minutes, length))
        print i + 1, minutes, length  # display how the process is running
    fo.close()
This script generates a tab-separated file named ml-005.txt with the duration in minutes and the size in bytes of each video (see the output file). For this MOOC, we get only 113 results, since the video with id 94 is no longer referenced.
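The script above targets Python 2 (urllib2 does not exist in Python 3). As a rough sketch only, the same idea could look like this in Python 3, assuming the server answers HEAD requests with the same Content-Length and Content-Disposition headers. The helper names video_url, parse_headers and fetch_meta are made up for this illustration, and the URL pattern is simply the one quoted above.

```python
import re
import urllib.error
import urllib.request

# Same pattern as in the script above: ' (X min)', URL-encoded, in the filename.
RX_MINUTES = re.compile(r"%20%28(\d+)%20min%29")
URL_PATTERN = "https://class.coursera.org/%s/lecture/download.mp4?lecture_id=%i"


def video_url(course, lecture_id):
    """Build the download URL of one lecture (hypothetical helper)."""
    return URL_PATTERN % (course, lecture_id)


def parse_headers(headers):
    """Return (minutes, size_in_bytes) from a mapping of HTTP headers.

    minutes is 0 when the filename carries no ' (X min)' tag.
    """
    length = int(headers.get("Content-Length", 0))
    disposition = headers.get("Content-Disposition", "")
    m = RX_MINUTES.search(disposition)
    minutes = int(m.group(1)) if m else 0
    return minutes, length


def fetch_meta(course, lecture_id):
    """Send a HEAD request (no video body is transferred); None on failure."""
    request = urllib.request.Request(video_url(course, lecture_id), method="HEAD")
    try:
        with urllib.request.urlopen(request) as response:
            return parse_headers(response.headers)
    except urllib.error.URLError:
        return None
```

Since parse_headers only needs a mapping, it can be tried offline on a plain dictionary with an invented header value before any request is sent.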
3. Using linear regression to fit the data with Octave
Then, following the first programming exercise of the MOOC, implementing a linear regression with gradient descent on these data in Octave is quite straightforward.
    % GNUPLOT setting
    graphics_toolkit gnuplot

    %% Initialization
    clear ; close all; clc

    %% ======================= Part 1: Plotting =======================
    fprintf('Plotting Data ...\n')
    data = load('ml.txt');
    X = data(:, 1);
    y = data(:, 2)/1024/1024;            % convert to MB
    m = length(y);                       % number of training examples

    plot(X, y, 'rx', 'MarkerSize', 10);  % Plot the data
    ylabel('Size in MB');                % Set the y axis label
    xlabel('Duration in min');           % Set the x axis label

    %% =================== Part 2: Gradient descent ===================
    fprintf('Running Gradient Descent ...\n')

    X = [ones(m, 1), data(:,1)];         % Add a column of ones to x
    theta = zeros(2, 1);                 % initialize fitting parameters

    % Some gradient descent settings
    iterations = 1500;
    alpha = 0.01;

    % compute and display initial cost
    computeCost(X, y, theta)

    % run gradient descent
    theta = gradientDescent(X, y, theta, alpha, iterations);

    % print theta to screen
    fprintf('Theta found by gradient descent: ');
    fprintf('%f %f \n', theta(1), theta(2));

    %% ======================= Part 3: Plotting again =================
    % Plot Data
    hold on;                             % keep previous plot visible
    % Plot the linear fit
    plot(X(:,2), X*theta, '-')
    legend('Training data', 'Linear regression')
    hold off                             % don't overlay any more plots on this figure
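The helpers computeCost and gradientDescent called above come from the course exercise and are not listed in this post. As a minimal sketch of what they are assumed to compute (half mean squared error, and batch gradient descent with a simultaneous update), here is a pure-Python version for a single feature. The data points are made up so that the expected answer is known; they are not the Coursera measurements.

```python
def compute_cost(xs, ys, theta0, theta1):
    """Half mean squared error of the line theta0 + theta1 * x."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient_descent(xs, ys, alpha, iterations):
    """Batch gradient descent for one feature, starting from theta = (0, 0)."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # simultaneous update of both parameters
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Made-up points lying exactly on y = 1 + 2x, not the Coursera data.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
theta0, theta1 = gradient_descent(xs, ys, alpha=0.05, iterations=5000)
print("theta: %f %f" % (theta0, theta1))
print("cost: %g" % compute_cost(xs, ys, theta0, theta1))
```

On this toy data the parameters should approach (1, 2); the learning rate and iteration count here were chosen for the toy data and are unrelated to the settings of the Octave script.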
The Octave script finally outputs a cost of 0.54824 and a theta vector of [0.407491 1.120686]. Therefore, the linear estimate of the video size y in MB as a function of its tagged duration x in minutes is given by the equation: y = θ0 + θ1·x, i.e. y ≈ 0.407 + 1.121·x.
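As a quick worked example, the fitted line can be evaluated for a given duration. The snippet below only plugs the theta values reported above into the equation; the function name predict_size_mb is introduced here for illustration.

```python
THETA0 = 0.407491  # intercept in MB, from the Octave output above
THETA1 = 1.120686  # slope in MB per minute, from the Octave output above


def predict_size_mb(minutes):
    """Estimated video size in MB for a tagged duration in minutes."""
    return THETA0 + THETA1 * minutes


print("%.2f MB" % predict_size_mb(7))  # estimate for a video tagged 7 min
```

For a video tagged 7 min, this gives 0.407491 + 1.120686 × 7 ≈ 8.25 MB.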
4. Graphical representation
The result of the linear regression is also plotted with Octave.
We can see that the data roughly fit a straight line. The relationship between tagged durations and sizes is not exactly linear, however, since the videos may not all have the same compression rate and the tagged durations may be rounded or contain errors.
5. Compressing data with PCA in Octave
The PCA script from the Coursera course can easily be adapted to compress these data. Since both features are normalized and positively correlated, the top eigenvector lies on the first bisector y = x.
    % GNUPLOT setting
    graphics_toolkit gnuplot

    %% Initialization
    clear ; close all; clc

    % The following command loads the dataset. You should now have the
    % variable X in your environment
    X = load('ml.txt');
    X(:, 2) = X(:, 2) / 1024 / 1024;  % convert to MB

    % Visualize the example dataset
    plot(X(:, 1), X(:, 2), 'bo');
    axis([0 25 0 25]); axis square;

    % Before running PCA, it is important to first normalize X
    [X_norm, mu, sigma] = featureNormalize(X);

    % Run PCA
    [U, S] = pca(X_norm);

    % Plot the normalized dataset (returned from featureNormalize)
    plot(X_norm(:, 1), X_norm(:, 2), 'bo');
    axis([-4 4 -4 4]); axis square

    % Project the data onto K = 1 dimension
    K = 1;
    Z = projectData(X_norm, U, K);
    X_rec = recoverData(Z, U, K);

    % Draw lines connecting the projected points to the original points
    hold on;
    plot(X_rec(:, 1), X_rec(:, 2), 'ro');
    for i = 1:size(X_norm, 1)
        drawLine(X_norm(i,:), X_rec(i,:), '--k', 'LineWidth', 1);
    end
    legend('Training data', 'PCA', "location", "southeast")
    hold off
The raw and the reconstructed data can then be displayed on the same graph. A graphical comparison between PCA and linear regression (ordinary least squares) is given in this post: Principal Component Analysis vs Ordinary Least Squares: a Visual Explanation. The examples there are written in R.
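The helpers featureNormalize, pca, projectData and recoverData used in the Octave script come from the course material and are not listed here. The pure-Python sketch below shows what they are assumed to do in the two-feature case: standardize each column, take the top eigenvector of the 2x2 covariance matrix, project on it (K = 1) and map the result back to 2-D. The (minutes, MB) pairs are invented for the illustration.

```python
import math


def feature_normalize(rows):
    """Center each column and scale it to unit (sample) standard deviation."""
    n = len(rows)
    columns = list(zip(*rows))
    mu = [sum(c) / n for c in columns]
    sigma = [math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1))
             for c, m in zip(columns, mu)]
    normed = [[(v - m) / s for v, m, s in zip(row, mu, sigma)] for row in rows]
    return normed, mu, sigma


def top_component(normed):
    """Unit eigenvector for the largest eigenvalue of the 2x2 covariance matrix."""
    n = len(normed)
    a = sum(r[0] * r[0] for r in normed) / n
    b = sum(r[0] * r[1] for r in normed) / n
    d = sum(r[1] * r[1] for r in normed) / n
    if b == 0.0:  # uncorrelated features: the axes are already principal
        return (1.0, 0.0) if a >= d else (0.0, 1.0)
    lam = (a + d) / 2 + math.sqrt(((a - d) / 2) ** 2 + b * b)
    vx, vy = b, lam - a  # solves (a - lam) * vx + b * vy = 0
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm)


def project_and_recover(normed, u):
    """Project each point on u (K = 1), then map it back to 2-D."""
    zs = [r[0] * u[0] + r[1] * u[1] for r in normed]
    recovered = [(z * u[0], z * u[1]) for z in zs]
    return zs, recovered


# Made-up (minutes, MB) pairs, not the Coursera data.
data = [(5, 6.2), (7, 8.1), (10, 11.9), (12, 13.6), (14, 16.3)]
normed, mu, sigma = feature_normalize(data)
u = top_component(normed)
zs, recovered = project_and_recover(normed, u)
print(u)
```

With standardized, positively correlated features, both coordinates of u should be equal (up to rounding), so every recovered point lies on y = x. An SVD-based implementation may return the same direction with the opposite sign.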
6. To do…
Handle durations given in the MM:SS format and deal with failures when retrieving video meta-information.
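One possible way to approach the first item, as a sketch only: decode the URL-encoded filename first, then try an MM:SS tag before falling back to the ' (X min)' tag. The exact form of an MM:SS tag in a filename is an assumption here (a parenthesized "12:30"), and tagged_minutes is a hypothetical helper name.

```python
import re
from urllib.parse import unquote

RX_MIN = re.compile(r"\((\d+) min\)")         # e.g. '(7 min)'
RX_MMSS = re.compile(r"\((\d+):([0-5]\d)\)")  # e.g. '(12:30)', assumed format


def tagged_minutes(filename):
    """Duration in minutes (float) from a URL-encoded filename, or None."""
    name = unquote(filename)
    m = RX_MMSS.search(name)
    if m:
        return int(m.group(1)) + int(m.group(2)) / 60.0
    m = RX_MIN.search(name)
    if m:
        return float(m.group(1))
    return None
```

Returning None instead of 0 when no tag is found would make it easier to drop such videos before fitting, rather than treating them as zero-minute videos.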