How you can build practical applications by quantifying observations from video

Sandy D
7 min read · Aug 15, 2019


Providing insights to customers directly from video analysis

The saying ‘a picture is worth a thousand words’ is particularly apt today. The ability to quantify observations directly from images with computer vision techniques is now readily accessible. Scale that up to 30 or 60 frames per second of video and you’re producing novels of insight rather quickly.

In 2019 it does not yet feel like broader society appreciates the impact computer vision will have in the near term. The ability to quantify information from video will enable many beneficial applications and lead to insights which, because of time and resource limitations in the past, were practically impossible to realize.

Key Point Detection

Facial recognition is gaining popularity, particularly in China for applications in payments and check-ins. Other applications include sentiment analysis where models have been trained to classify a person’s sentiment based on their facial movements. One of the foundations of facial recognition is ‘facial key-point detection’. In key-point detection, computer vision models have been trained to perform two important tasks, classification and localization. These tasks seek to answer the questions “what is it” and “where is it”. Each point on the face has its own identifier and a set of coordinates relative to the image, usually in pixels.

https://us.norton.com/internetsecurity-iot-how-facial-recognition-software-works.html

The same technique can also be applied to the whole body, referred to as ‘body key point detection’ or ‘pose estimation’. Pose estimation works much the same way in that computer vision models have been trained to classify a number of body points, typically joints, and determine their location within an image.

Pose estimation techniques are rapidly improving. What’s more exciting is that these pre-trained models along with quality documentation are being made publicly available.

For example, you can measure how many pixels a person’s hand moves, and if you have a reference for scale (such as the person’s height) that pixel distance can be converted into meters. Combined with the frame rate of the video, you can then calculate distance per unit time, i.e. velocity. From these measurements we can also determine the state of various features within the video, such as whether a person is ascending or descending.
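As a rough sketch of that arithmetic (the function names and the numbers in the comments are purely illustrative):

# A minimal sketch: pixel displacement -> meters -> velocity,
# using the person's known height as the scale reference.

def pixels_to_meters(pixels, person_height_m, person_height_px):
    """Convert a pixel distance to meters using the person as a scale reference."""
    return pixels * (person_height_m / person_height_px)

def velocity_m_per_s(displacement_px, frames_elapsed, fps,
                     person_height_m, person_height_px):
    """Velocity = distance travelled / time elapsed, where time = frames / fps."""
    distance_m = pixels_to_meters(displacement_px, person_height_m, person_height_px)
    return distance_m / (frames_elapsed / fps)

# e.g. a wrist moving 90 px over 15 frames of 30 fps video, for a 1.8 m person
# who spans 600 px in the frame: 90 * (1.8 / 600) / (15 / 30) = 0.54 m/s
velocity_m_per_s(90, 15, 30, 1.8, 600)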

With classifications of state or activity and measures of position and time it is possible to build applications that analyze both what a person is doing and how they are doing it.

Quantifying observations from video

With that short intro it’s time to get to work. This quick how-to will cover the following:

  • Generating body key-point data from video (Open Pose)
  • Reading the key point data into Python (Pandas, JSON)
  • Performing analysis of the key-point data (Numpy, Pandas)
  • Creating custom video overlays or visualizations (Open CV)
  • Saving the original video with your newly generated overlays (Open CV)

We’ll start with a sample video, shown at the end of the article, of power lifting movements. We’ll load the video into Open Pose, which generates frame-by-frame text outputs of the key points in JSON format. The JSON files are read into pandas data frames in Python, where we’ll clean up the data and perform calculations on the key points. We’ll then use Open CV, the raw coordinates from Open Pose and our analysis points to generate video overlays that provide feedback to the user. Finally, we’ll use Open CV to save an HD video including the overlays.


Step 1 — Generating Body Key Point datasets using Open Pose

First you’ll want to set up Open Pose by following the setup instructions from the GitHub repo below:

https://github.com/CMU-Perceptual-Computing-Lab/openpose

Now take a sample video and run it through Open Pose, in this case manually from the command prompt. You’ll want to execute the commands below from the directory where you’ve installed Open Pose. In the examples below, the sample video is at examples\media\name_of_your_vid.mp4 inside the Open Pose folder.

Odds are reasonably good that you’ve recorded a video on your iPhone. To convert videos from .mov to .mp4 you can use ffmpeg. You’ll need to set up ffmpeg or an alternative video converter.

Command to convert .mov to .mp4 in ffmpeg:

ffmpeg -i your_input_video.mov -vcodec h264 -acodec mp2 your_output_video.mp4

Now to read in a sample video and generate output video overlays from Open Pose. This will save a video that you can view to understand the key points which will be generated. When starting out, your life will be easier if there is only 1 person in the video.

bin\OpenPoseDemo.exe --video examples\media\name_of_your_vid.mp4 --net_resolution "320x320" --display 0 --write_video output/name_of_output.avi

Below is the command to read in a sample video and generate output Key Points from Open Pose. This command generates the JSON text file which we’ll use to perform analysis on the key points.

bin\OpenPoseDemo.exe --video examples\media\name_of_your_vid.mp4 --net_resolution "320x320" --part_candidates --write_json output/ --display 0 --render_pose 0

You should now have a JSON file with key points in the Open Pose directory.

Step 2 — Reading key point data into Python

Now we’ll read the JSON files created by Open Pose into a Python script. Open Pose creates one JSON file per video frame, and each file contains the 25 body key points plus the background. For more, see the Open Pose output documentation:

https://github.com/CMU-Perceptual-Computing-Lab/openpose/blob/master/doc/output.md

Below are the numbered keys and their corresponding body point.

// Result for BODY_25 (25 body parts consisting of COCO + foot)
// const std::map<unsigned int, std::string> POSE_BODY_25_BODY_PARTS {
// {0, "Nose"},
// {1, "Neck"},
// {2, "RShoulder"},
// {3, "RElbow"},
// {4, "RWrist"},
// {5, "LShoulder"},
// {6, "LElbow"},
// {7, "LWrist"},
// {8, "MidHip"},
// {9, "RHip"},
// {10, "RKnee"},
// {11, "RAnkle"},
// {12, "LHip"},
// {13, "LKnee"},
// {14, "LAnkle"},
// {15, "REye"},
// {16, "LEye"},
// {17, "REar"},
// {18, "LEar"},
// {19, "LBigToe"},
// {20, "LSmallToe"},
// {21, "LHeel"},
// {22, "RBigToe"},
// {23, "RSmallToe"},
// {24, "RHeel"},
// {25, "Background"}
// };
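If you’re working in Python, it’s handy to keep the same mapping as a plain dictionary so that later analysis code can refer to joints by name; this is just a convenience, not part of the Open Pose output:

# BODY_25 key-point indices as a Python dict, mirroring the Open Pose mapping above.
BODY_25_PARTS = {
    0: "Nose", 1: "Neck", 2: "RShoulder", 3: "RElbow", 4: "RWrist",
    5: "LShoulder", 6: "LElbow", 7: "LWrist", 8: "MidHip", 9: "RHip",
    10: "RKnee", 11: "RAnkle", 12: "LHip", 13: "LKnee", 14: "LAnkle",
    15: "REye", 16: "LEye", 17: "REar", 18: "LEar", 19: "LBigToe",
    20: "LSmallToe", 21: "LHeel", 22: "RBigToe", 23: "RSmallToe", 24: "RHeel",
}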

Sample format of the JSON output from Open Pose:

{
    "version": 1.1,
    "people": [
        {
            "pose_keypoints_2d": [582.349,507.866,0.845918,746.975,631.307,0.587007,...],
            "face_keypoints_2d": [468.725,715.636,0.189116,554.963,652.863,0.665039,...],
            "hand_left_keypoints_2d": [746.975,631.307,0.587007,615.659,617.567,0.377899,...],
            "hand_right_keypoints_2d": [617.581,472.65,0.797508,0,0,0,723.431,462.783,0.88765,...],
            "pose_keypoints_3d": [582.349,507.866,507.866,0.845918,507.866,746.975,631.307,0.587007,...],
            "face_keypoints_3d": [468.725,715.636,715.636,0.189116,715.636,554.963,652.863,0.665039,...],
            "hand_left_keypoints_3d": [746.975,631.307,631.307,0.587007,631.307,615.659,617.567,0.377899,...],
            "hand_right_keypoints_3d": [617.581,472.65,472.65,0.797508,472.65,0,0,0,723.431,462.783,0.88765,...]
        }
    ],
    // If `--part_candidates` enabled
    "part_candidates": [
        {
            "0": [296.994,258.976,0.845918,238.996,365.027,0.189116],
            "1": [381.024,321.984,0.587007],
            "2": [313.996,314.97,0.377899],
            "3": [238.996,365.027,0.189116],
            "4": [283.015,332.986,0.665039],
            "5": [457.987,324.003,0.430488,283.015,332.986,0.665039],
            "6": [],
            "7": [],
            "8": [],
            "9": [],
            "10": [],
            "11": [],
            "12": [],
            "13": [],
            "14": [293.001,242.991,0.674305],
            "15": [314.978,241,0.797508],
            "16": [],
            "17": [369.007,235.964,0.88765]
        }
    ]
}

Interpreting key-point data can get tricky if you’re trying to make calculations on one person in the foreground and there are other people in the background. Point sets are in the form [x coordinate, y coordinate, confidence]; if there are two people in the frame you’ll have six elements in the array instead of three.

Part of the sample code below attempts to compensate for this by keeping the point sets which are closest to the center of the frame, assuming that is the target person for analysis.
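The embedded sample code isn’t reproduced here, but a minimal sketch of that idea, assuming the --part_candidates JSON format shown above and a folder of per-frame files in output/, might look like this:

import glob
import json

import pandas as pd

FRAME_WIDTH = 1920  # assumed width of the input video in pixels


def closest_to_center(candidates, center_x=FRAME_WIDTH / 2):
    """Candidates arrive as flat triplets [x, y, confidence, x, y, confidence, ...].
    Keep the triplet whose x coordinate is closest to the horizontal center."""
    if not candidates:
        return None, None, None
    triplets = [candidates[i:i + 3] for i in range(0, len(candidates), 3)]
    return min(triplets, key=lambda t: abs(t[0] - center_x))


rows = []
# Open Pose writes one JSON file per frame; sorted() keeps them in frame order.
for frame_idx, path in enumerate(sorted(glob.glob("output/*_keypoints.json"))):
    with open(path) as f:
        data = json.load(f)
    row = {"frame": frame_idx}
    # With --part_candidates, points are keyed by body-part index ("0" to "24").
    for part, candidates in data["part_candidates"][0].items():
        x, y, conf = closest_to_center(candidates)
        row[f"x_{part}"], row[f"y_{part}"], row[f"conf_{part}"] = x, y, conf
    rows.append(row)

df = pd.DataFrame(rows)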

Step 3 — Performing analysis on the key point data

Now you can determine, for example, whether the hand is above the shoulder by comparing the y coordinate of point 4 (right wrist) with that of point 2 (right shoulder).

Movement speed can be determined using the starting position at a particular frame (file number) and comparing to the position at a later time or frame. For this you’ll need the frames per second of the input video.

Conversion from relative distance in pixels to meters can be done using a scale ratio of pixels per meter which you can determine if you have a measure of a person’s height.

For example, these are some of the calculations in this demo (two of them are sketched in code after the list):

  • Distance between the knees
  • Rep count
  • Angle of the thigh relative to shin
  • Position of the knee relative to foot
  • Instantaneous velocity
  • Average velocity on ascent and descent
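A couple of these checks can be sketched against the data frame built in Step 2; the column names and BODY_25 indices follow the earlier snippets, and the code is illustrative rather than the exact demo implementation:

import numpy as np


def knee_angle_deg(df, hip=9, knee=10, ankle=11):
    """Angle at the right knee (thigh relative to shin), in degrees, per frame."""
    thigh = np.stack([df[f"x_{hip}"] - df[f"x_{knee}"],
                      df[f"y_{hip}"] - df[f"y_{knee}"]], axis=1)
    shin = np.stack([df[f"x_{ankle}"] - df[f"x_{knee}"],
                     df[f"y_{ankle}"] - df[f"y_{knee}"]], axis=1)
    cos = (thigh * shin).sum(axis=1) / (
        np.linalg.norm(thigh, axis=1) * np.linalg.norm(shin, axis=1))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))


def hand_above_shoulder(df, wrist=4, shoulder=2):
    """True on frames where the right wrist is above the right shoulder.
    Image y coordinates grow downward, so 'above' means a smaller y value."""
    return df[f"y_{wrist}"] < df[f"y_{shoulder}"]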

Step 4 — Creating custom video overlays in Open CV

Now that you’ve made observations and calculations you can visualize them in the form of overlays on the input video. To do this we’ll leverage Open CV with its built-in drawing functions.

You can learn more about Open CV here: https://opencv.org/

Below are examples of drawing a line between multiple points and a rectangle based on the position of the heel and toe.
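The embedded code isn’t shown here, but a minimal sketch along those lines, assuming a dictionary of per-frame pixel coordinates keyed by BODY_25 index, could be:

import cv2
import numpy as np


def draw_overlays(frame, points, label=""):
    """points: dict of body-part index -> (x, y) pixel coordinates for this frame.
    Draws a line through hip -> knee -> ankle, a box from heel to big toe,
    and an optional text label for any stats you've calculated."""
    # Polyline through the right hip (9), knee (10) and ankle (11).
    leg = np.array([points[9], points[10], points[11]], dtype=np.int32)
    cv2.polylines(frame, [leg.reshape((-1, 1, 2))], isClosed=False,
                  color=(0, 255, 0), thickness=3)

    # Rectangle spanning the right heel (24) and the right big toe (22).
    heel, toe = points[24], points[22]
    cv2.rectangle(frame, (int(heel[0]), int(heel[1])), (int(toe[0]), int(toe[1])),
                  color=(255, 0, 0), thickness=2)

    if label:
        cv2.putText(frame, label, (30, 60), cv2.FONT_HERSHEY_SIMPLEX,
                    1.0, (255, 255, 255), 2)
    return frame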

In the same way that Open Pose generates key points for every frame of the input video, you’ll want to draw overlays for every frame of the output video.

Once the drawing is complete, call the ‘show’ function to display the frame with overlays:

cv2.imshow('video title', frame)
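Note that cv2.imshow only refreshes the window when cv2.waitKey is called, so inside a frame loop you’ll typically follow it with something like cv2.waitKey(1).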

Step 5 — Saving your video with overlays

The analysis and drawing are done frame by frame and so is the writing to the output video file.

First, we’ll create the output video file:

out = cv2.VideoWriter('your_output_video.avi', cv2.VideoWriter_fourcc('M','J','P','G'), 25, (width,height))

Here width and height are read from the input video so that the output size will match the input.

While looping through all the input video frames, create your Open CV overlays and call the write function.

out.write(frame)
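A minimal end-to-end frame loop, assuming the hypothetical draw_overlays helper sketched in Step 4, might look like this:

import cv2

cap = cv2.VideoCapture("your_input_video.mp4")
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = cap.get(cv2.CAP_PROP_FPS)

out = cv2.VideoWriter("your_output_video.avi",
                      cv2.VideoWriter_fourcc('M', 'J', 'P', 'G'),
                      fps, (width, height))

frame_idx = 0
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    # Draw this frame's overlays using the key-point data for frame_idx,
    # e.g. frame = draw_overlays(frame, points_for_frame(frame_idx)),
    # where points_for_frame stands in for however you index your key-point data.
    out.write(frame)
    frame_idx += 1

cap.release()
out.release()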

Putting it all together, we’ve got the original input video, stats to communicate the analysis, and visual overlays to indicate what is being measured.

The path forward

It’s interesting to note that a license for commercial use of Open Pose can be purchased for $25,000 USD; however, this excludes uses in ‘competitive sports’. The ability to measure and analyze attributes directly from video is particularly powerful for sports analysis. For example, you can now measure the acceleration of every player, the power of their shots and possibly even gain quantitative insights into what separates the superstars from other players.

An example of this work can be seen in the video below:

Thanks to George Seif for the inspiration and contributions to the community.

Keen on practical machine learning?

Follow me on Twitter @_adux and feel free to connect on LinkedIn if you’d like to reach out.
