Becoming a Video Analyst: Reading and Manipulating Video Streams

Blog / Luis Torres / March 20, 2019

Videos are an extremely rich source of data, providing an attractive mix of a visual story and the notion of flow in time. This mix allows the data science and computer vision community to interpret data with a level of nuance that is lost in simple time-series data or still images. To take video footage and prepare it for analysis, there are a number of tools a data scientist must have in their belt. In this post, we give an introduction to video manipulation using a small set of image processing operations in Python, applied to each frame of either a video file or a webcam stream.

The Video Analyst’s Toolkit

While there are numerous tools that could go into the video analyst’s toolkit, two of the most common are OpenCV, which provides a rich library of functions for manipulating and transforming video images, and imutils, which provides convenience functions for basic image manipulation and for interfacing with matplotlib.

Reading the Frames of a Video

Did you ever create an animation flipbook by drawing independent sketches in a notebook and flipping the pages? Conceptually, this is what occurs when processing and analysing videos: for each frame of the video, do something. To start processing and analysing a video, you first need a video stream, which can either be generated and processed in real time (a webcam) or read from a pre-existing source (a video file):

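import cv2
from imutils.video import VideoStream
if args_dict.get('video', None) is None:
    video_stream = VideoStream(src=0).start()  # no file given: use the webcam
else:
    video_stream = cv2.VideoCapture(args_dict['video'])  # read from the video file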
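
The snippet above assumes a dictionary called args_dict that holds the command-line options. One way it might be built is sketched below with the standard-library argparse module. The helper name parse_args and the -v/--video flag are assumptions for illustration, not taken from the original script.

```python
import argparse


def parse_args(argv=None):
    """Return the command-line options as a plain dict (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Process a video file or a webcam stream.")
    # Optional path to a video file; when omitted, the webcam is used instead.
    parser.add_argument("-v", "--video", default=None,
                        help="path to a video file")
    return vars(parser.parse_args(argv))


# In a script you would call parse_args() with no arguments to read sys.argv.
# An explicit list is passed here so the example behaves the same everywhere.
args_dict = parse_args(["--video", "clip.mp4"])
source = "webcam" if args_dict.get("video") is None else args_dict["video"]
print(f"Reading frames from: {source}")
```

Running the script without the flag leaves args_dict["video"] as None, which is the condition the stream selection checks for.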

With our streaming source established, we simply stream until there is no more video footage available:

while True:
    # grab the next frame from the stream
    frame = video_stream.read()
    # a webcam stream returns the frame itself; a video file returns (success, frame)
    frame = frame if args_dict.get("video", None) is None else frame[1]
    # stop frame extraction once you reach the end of the video file
    if frame is None:
        break

Each frame of the video is now stored in frame as a BGR array (the convention used by OpenCV), so we are ready to apply image processing operations to each frame and display the results as a collage.
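
Because a webcam stream and a video file hand back frames in different forms, it can be convenient to hide the difference behind a small generator. The following is only a sketch: iter_frames is a hypothetical helper, and the two Fake classes merely stand in for cv2.VideoCapture (whose read() returns a (success, frame) pair) and the imutils VideoStream (whose read() returns the frame itself), so that the example runs without a camera or a video file.

```python
def iter_frames(stream, from_file):
    """Yield frames until the stream is exhausted (hypothetical helper)."""
    while True:
        result = stream.read()
        # File-style streams return (success, frame);
        # webcam-style streams return the frame only.
        frame = result[1] if from_file else result
        if frame is None:
            break
        yield frame


class FakeCapture:
    """Stand-in for a file-backed stream: read() returns (success, frame)."""

    def __init__(self, frames):
        self._frames = list(frames)

    def read(self):
        if self._frames:
            return True, self._frames.pop(0)
        return False, None


class FakeWebcam:
    """Stand-in for a webcam-style stream: read() returns the frame or None."""

    def __init__(self, frames):
        self._frames = list(frames)

    def read(self):
        return self._frames.pop(0) if self._frames else None


file_frames = list(iter_frames(FakeCapture(["f0", "f1", "f2"]), from_file=True))
cam_frames = list(iter_frames(FakeWebcam(["c0", "c1"]), from_file=False))
print(file_frames, cam_frames)
```

With a helper of this shape, the processing loop reduces to a plain for loop over frames, and the end-of-stream check lives in one place.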

Manipulating the Frames of a Video

With the video now being streamed and available for processing, you can introduce the functionality to perform the necessary transformation and analysis (e.g. localising faces on the image frame, passing the image through a classifier, or applying a model for detecting a person or object). The path we follow here is to apply a set of image processing operations, mostly colour conversions and edge detections, and merge the results into a collage:

# the steps below run inside the while loop; np refers to numpy (import numpy as np)
# re-scale the dimensions of the frame - 50% in each dimension
img_small = cv2.resize(frame, (0, 0), fx=0.5, fy=0.5)
# convert to grayscale
img_gray = cv2.cvtColor(img_small, cv2.COLOR_BGR2GRAY)
# invert the gray scale
img_gray_inv = 255 - img_gray
# apply gaussian blurring - convolve with gaussian kernel, size (9, 9)
img_blur = cv2.GaussianBlur(img_gray_inv, (9, 9), sigmaX=0, sigmaY=0)
# detect edges using the Canny method (uses img_blur, so it runs after the blur)
img_canny_edge = cv2.Canny(img_blur, 0, 50)
# generate sketch transformation
img_sketch = cv2.divide(img_gray, 255 - img_blur, scale=256)
# highlight edges in black/white space with an adaptive threshold
# (max value 255 and the mean-based adaptive method are assumed here)
img_edges = cv2.adaptiveThreshold(img_blur, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                  cv2.THRESH_BINARY, 15, 2)
# generate collage - stacking images (every tile converted to 3-channel BGR)
upper_row = np.hstack((img_small,
                       cv2.cvtColor(img_gray, cv2.COLOR_GRAY2BGR),
                       cv2.cvtColor(img_gray_inv, cv2.COLOR_GRAY2BGR)))
lower_row = np.hstack((cv2.cvtColor(img_canny_edge, cv2.COLOR_GRAY2BGR),
                       cv2.cvtColor(img_sketch, cv2.COLOR_GRAY2BGR),
                       cv2.cvtColor(img_edges, cv2.COLOR_GRAY2BGR)))

# merge rows for collage
collage = np.vstack((upper_row, lower_row))
# display the processed frames
cv2.imshow('Frame', collage)
# exit the stream if you press "q" on the keyboard
keyboard_event = cv2.waitKey(1) & 0xFF
if keyboard_event == ord("q"):
    break

Fig.1 – Image processing example.

A single example picture with the image processing operations is shown, displaying the original and grayscale images in the top row and the edges plus the sketch transformations in the bottom.
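
The colour conversions back to BGR in the collage code are there because NumPy can only stack arrays whose shapes agree: every tile in a row needs the same height and the same number of channels. Below is a small sketch with synthetic arrays in place of real frames. The tile size and the helper gray_to_bgr, a NumPy stand-in for cv2.cvtColor with cv2.COLOR_GRAY2BGR, are assumptions for illustration.

```python
import numpy as np


def gray_to_bgr(gray):
    """Repeat a single-channel image across three channels (NumPy stand-in)."""
    return np.stack([gray, gray, gray], axis=-1)


height, width = 4, 6  # arbitrary tile size for the illustration
colour_tile = np.zeros((height, width, 3), dtype=np.uint8)  # shape (4, 6, 3)
gray_tile = np.full((height, width), 128, dtype=np.uint8)   # shape (4, 6)

# Stacking colour_tile and gray_tile directly would raise a ValueError,
# because one array is 3-dimensional and the other is 2-dimensional.
upper_row = np.hstack((colour_tile,
                       gray_to_bgr(gray_tile),
                       gray_to_bgr(255 - gray_tile)))
lower_row = np.hstack((gray_to_bgr(gray_tile),) * 3)
collage = np.vstack((upper_row, lower_row))

print(collage.shape)  # (8, 18, 3): two rows of three tiles each
```

The same reasoning explains why the frame is resized once at the start: all six tiles derive from img_small, so their heights and widths match automatically.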

The next example highlights a video stream with the same image manipulations applied. In human vision, we are used to seeing the world in colour and, with it, the associated flow of time. But some features are simpler to analyse in other spaces, like the isolation of text, numbers, or object shapes using an edge detector on images, in order to increase the contrast of these regions so that we can, for instance, apply optical character recognition or project the edge silhouette of a human action. By doing this, we can reduce noise and amplify our options for potential classification tools.
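
To make the idea concrete, here is a deliberately crude edge detector in plain NumPy. It is not the Canny method used in the post, just a finite-difference gradient with a threshold, and the function name simple_edges and the threshold value are assumptions. It shows how a flat region collapses to zero while only the boundary of a shape survives.

```python
import numpy as np


def simple_edges(gray, threshold=50):
    """Mark pixels whose horizontal or vertical intensity jump exceeds threshold."""
    img = gray.astype(np.int32)  # avoid uint8 wrap-around when subtracting
    dx = np.zeros_like(img)
    dy = np.zeros_like(img)
    dx[:, 1:] = np.abs(img[:, 1:] - img[:, :-1])  # difference with the left neighbour
    dy[1:, :] = np.abs(img[1:, :] - img[:-1, :])  # difference with the upper neighbour
    edges = np.maximum(dx, dy) > threshold
    return edges.astype(np.uint8) * 255  # white edges on a black background


# Synthetic frame: dark left half, bright right half.
frame = np.zeros((4, 8), dtype=np.uint8)
frame[:, 4:] = 200

edges = simple_edges(frame)
print(edges[0].tolist())  # [0, 0, 0, 0, 255, 0, 0, 0]
```

Only the column where the brightness jumps is marked, which is the kind of reduction that makes shapes and text easier to hand to a downstream classifier.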

Fig.2 – Video stream example.


Manipulating videos for any kind of image processing or deep learning application starts with one fundamental step: learning how to read the frames of the stream and manipulate them. In this post we demonstrated the basics of reading and processing video streams, with an emphasis on pure image processing operations. In future posts on the topic, we’ll look at applying model inference for object detection or feature extraction. Stay tuned as we continue to add tools to your video processing and manipulation repertoire.

Header image courtesy of Denise Jans.
Fig.1 courtesy of Linda Xu.
Fig.2 source: Link.