Repository: onlyphantom/cvessentials Branch: master Commit: e32691a5f1af Files: 48 Total size: 459.3 KB Directory structure: gitextract_nnlovcxn/ ├── .gitignore ├── README.md ├── digitrecognition/ │ ├── contourarea_01.py │ ├── contourarea_02.py │ ├── contourarea_03.py │ ├── digit_01.py │ ├── digitrec.html │ ├── digitrec.md │ ├── morphological_01.py │ ├── morphological_02.py │ ├── roi_01.py │ ├── roi_02.py │ └── utils/ │ └── enumerate.py ├── edgedetect/ │ ├── adaptivethresholding_01.py │ ├── canny_01.py │ ├── contour_01.py │ ├── contourapprox.py │ ├── edgedetect.html │ ├── edgedetect.md │ ├── gaussianblur_01.py │ ├── gradient.py │ ├── img2surface.py │ ├── intensitythresholding_01.py │ ├── kernel.html │ ├── kernel.md │ ├── meanblur_01.py │ ├── meanblur_02.py │ ├── meanblur_03.py │ ├── sharpening_01.py │ ├── sharpening_02.py │ ├── sobel_01.py │ ├── sobel_02.py │ ├── sobel_03.py │ ├── unsharpmask_01.py │ ├── unsharpmask_02.py │ └── utils/ │ └── gaussiancurve.r ├── quiz.md ├── requirements.txt ├── summarynotes/ │ └── class2201.md └── transformation/ ├── lecture_affine.html ├── lecture_affine.md ├── rotate_01.py ├── scale_01.py ├── scale_02.py ├── scale_03.py ├── scale_04.py ├── scale_05.py └── translate_01.py ================================================ FILE CONTENTS ================================================ ================================================ FILE: .gitignore ================================================ solutions/ .DS_Store .vscode/ answers.md ================================================ FILE: README.md ================================================ # Essentials of Computer Vision  A math-first approach to learning computer vision in Python. The repository will contain all HTML, PDF, Markdown, Python Scripts, data, and media assets (images or links to supplementary videos). If you wish to contribute, I need translations for Bahasa Indonesia. Please submit a Pull Request. ## Study Guide ### Chapter 1 - Affine Transformation - [Definition](transformation/lecture_affine.html#definition) - [Mathematical Definitions](transformation/lecture_affine.html#mathematical-definitions) - [Practical Examples](transformation/lecture_affine.html#practical-examples) - [Motivation](transformation/lecture_affine.html#motivation) - [Getting Affine Transformation](transformation/lecture_affine.html#getting_affine-transformation) - [Trigonometry Proof](transformation/lecture_affine.html#trigonometry-proof) - [Code Illustrations](transformation/lecture_affine.html#code-illustrations) - [Summary and Key Points](transformation/lecture_affine.html#summary-and-key-points) - Optional video - [Rotation Matrix Explained Visually](https://www.youtube.com/watch?v=tIixrNtLJ8U) - [w/ Bahasa Indonesia voiceover](https://www.youtube.com/watch?v=pWfXR_HmyUw) - References and learn-by-building modules ### Chapter 2 - Kernel Convolutions - [Definition](edgedetect/kernel.html#definition) - Optional video - [Kernel Convolutions Explained Visually](https://www.youtube.com/watch?v=WMmHcrX4Obg) - [Mathematical Definitions](edgedetect/kernel.html#mathematical-definitions) - [Padding](edgedetect/kernel.html#a-note-on-padding) - [Smoothing and Blurring](edgedetect/kernel.html#smoothing-and-blurring) - [A Note on Terminology](edgedetect/kernel.html#a-note-on-terminology) - Kernels or Filters? - Correlations vs Convolutions? 
- [Code Illustrations: Mean Filtering](edgedetect/kernel.html#code-illustrations-mean-filtering) - [Role in Convolution Neural Networks](edgedetect/kernel.html#role-in-convolutional-neural-networks) - [Handy Kernels for Image Processing](edgedetect/kernel.html#handy-kernels-for-image-processing) - [Gaussian Filtering](edgedetect/kernel.html#gaussian-filtering) - [Sharpening Kernels](edgedetect/kernel.html#sharpening-kernels) - [Gaussian Kernels for Sharpening](edgedetect/kernel.html#approximate-gaussian-kernel-for-sharpening) - [Unsharp Masking](edgedetect/kernel.html#unsharp-masking) - [Summary and Key Points](edgedetect/kernel.html#summary-and-key-points) - References and learn-by-building modules ### Chapter 3 - Edge Detection - [Definition](edgedetect/edgedetect.html#definition) - [Gradient-based Edge Detection](edgedetect/edgedetect.html#gradient-based-edge-detection) - [Sobel Operator](edgedetect/edgedetect.html#sobel-operator) - [Discrete Derivative](edgedetect/edgedetect.html#intuition-discrete-derivative) - [Code Illustrations: Sobel Operator](edgedetect/edgedetect.html#code-illustrations-sobel-operator) - [Gradient Orientation & Magnitude](edgedetect/edgedetect.html#dive-deeper-gradient-orientation-magnitude) - [Image Segmentation](edgedetect/edgedetect.html#image-segmentation) - [Intensity-based Segmentation](edgedetect/edgedetect.html#intensity-based-segmentation) - [Simple Thresholding](edgedetect/edgedetect.html#simple-thresholding) - [Adaptive Thresholding](edgedetect/edgedetect.html#adaptive-thresholding) - [Edge-based Contour Estimation](edgedetect/edgedetect.html#edge-based-contour-estimation) - [Contour Retrieval and Approximation](edgedetect/edgedetect.html#contour-retrieval-and-approximation) - [Canny Edge Detector](edgedetect/edgedetect.html#canny-edge-detector) - [Edge Thinning](edgedetect/edgedetect.html#edge-thinning) - [Hysteresis Thresholding](edgedetect/edgedetect.html#hysteresis-thresholding) - References and learn-by-building modules ### Chapter 4 - Digit Classification - [A Note on Deep Learning](digitrecognition/digitrec.html#what-about-deep-learning) - [Why not MNIST?](digitrecognition/digitrec.html#region-of-interest) - Region of Interest - [ROI identification](digitrecognition/digitrec.html#selecting-region-of-interest) - [Arc Length and Area Size](digitrecognition/digitrec.html#arc-length-and-area-size) - [Dive Deeper: ROI](digitrecognition/digitrec.html#dive-deeper-roi) - [ROI extraction](digitrecognition/digitrec.html#roi-extraction) - [Morphological Transformations](digitrecognition/digitrec.html#morphological-transformations) - [Erosion](digitrecognition/digitrec.html#erosion) - [Dilation](digitrecognition/digitrec.html#dilation) - [Opening and Closing](digitrecognition/digitrec.html#opening-and-closing) - [Learn-by-building: Morphological Transformation](digitrecognition/digitrec.html#learn-by-building-morphological-transformation) - [Seven-segment display](digitrecognition/digitrec.html#seven-segment-display) - [Practical Strategies](digitrecognition/digitrec.html#practical-strategies) - [Contour Properties](digitrecognition/digitrec.html#contour-properties) - [References and learn-by-building modules](digitrecognition/digitrec.html#references) ### Chapter 5 - Facial Recognition ## Approach and Motivation The course is foundational to anyone who wish to work with computer vision in Python. 
It covers some of the most common image processing routines, and have in-depth coverage on mathematical concepts present in the materials: - Math-first approach - Tons of sample python scripts (.py) - 45+ python scripts from chapter 1 to 4 for plug-and-play experiments - Multimedia (image illustrations, video explanation, quiz) - 57 image assets from chapter 1 to 4 for practical illustrations - 4 PDFs, and 4 HTMLs, one for each chapter - Practical tips on real-world applications The course's **only dependency** is `OpenCV`. Getting started is as easy as `pip install opencv-contrib-python` and you're set to go. ##### Question: What about deep learning libraries? No; While using deep learning for images made for interesting topics, they are probably better suited as an altogether separate course series. This course series (tutorial series) focused on the **essentials of computer vision** and, for pedagogical reasons, try not to be overly ambitious with the scope it intends to cover. There will be similarity in concepts and principles, as modern neural network architectures draw plenty of inspirations from "classical" computer vision techniques that predate it. By first learning how computer vision problems are solved, the student can compare that to the deep learning equivalent, which result in a more comprehensive appreciation of what deep learning offer to modern day computer scientists. ## Course Materials Preview: ### Python scripts  ### PDF and HTML  # Workshops I conduct in-person lectures using the materials you find in this repository. These workshops are usually paid because there are upfront costs to afford a venue and crew. Not just any venue, but a learning environment that is fully equipped (audio, desks, charging points for everyone, massive screen projector, walking space fo teaching assistants, dinner). You can follow me [on LinkedIn](http://linkedin.com/in/chansamuel/) to be updated about the latest workshops. I also make long-form programming tutorials and lessons on computer vision on [my YouTube channel](https://www.youtube.com/@SamuelChan) ### Introduction to AI in Computer Vision - 4th January 2020, Jakarta - Kantorkuu, Citywalk sudirman, Jakarta Pusat - Time: 1300-1600 - 3 hour - Fee: Free for Algoritma Alumni, 100k IDR for public ### Computer Vision: Principles and Practice - 21st and 22nd January 2020, Jakarta - Accelerice, Jl. Rasuna Said, Jakarta Selatan - Time: 1830-2130 - 6 Hour - Fee: Free for Algoritma Alumni, 1.5m IDR for public - 24th and 25th Feburary 2020, Bangkok - JustCo, Samyan Mitrtown - Time: 1830-2130 - 6 Hour - Fee: Free for Algoritma Alumni, 9000 THB for public ## Image Assets - `car2.png`, `pen.jpg`, `lego.jpg` and `sudoku.jpg` are under Creative Commons (CC) license. - `sarpi.jpg`, `castello.png`, `canal.png` and all other photography used are taken during my trip to Venice and you are free to use them. - All assets in Chapter 4 (the `digitrecognition` folder) are mine and you are free to use them. - All other illustrations are created by me in Keynote. - Videos are created by me, and Bahasa Indonesia voice over on my videos is by [Tiara Dwiputri](https://github.com/tiaradwiputri) ## New to programming? 50-minute Quick Start Here's a video: [Computer Vision Essentials 1](https://youtu.be/NWXY4ASRlgA) I created to get you through the installation and taking the first step into this lesson path. If you need help in the course, attend my in-person workshops on this topic (Computer Vision Essentials, free) throughout the course of the year. 
## Follow me - [YouTube](https://www.youtube.com/@SamuelChan) - [LinkedIn](http://linkedin.com/in/chansamuel/) - [GitHub](https://github.com/onlyphantom) ================================================ FILE: digitrecognition/contourarea_01.py ================================================ import cv2 BCOLOR = (75, 0, 130) THICKNESS = 4 img_color = cv2.imread("assets/ocbc.jpg") img_color = cv2.resize(img_color, None, None, fx=0.5, fy=0.5) img = cv2.cvtColor(img_color, cv2.COLOR_BGR2GRAY) blurred = cv2.GaussianBlur(img, (7, 7), 0) blurred = cv2.bilateralFilter(blurred, 5, sigmaColor=50, sigmaSpace=50) edged = cv2.Canny(blurred, 130, 150, 255) cv2.imshow("Outline of device", edged) cv2.waitKey(0) cnts, _ = cv2.findContours(edged, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE) # sort contours by area, and get the first 10 cnts = sorted(cnts, key=cv2.contourArea, reverse=True)[:9] cv2.drawContours(img_color, cnts, 0, BCOLOR, THICKNESS) cv2.imshow("Target Contour", img_color) cv2.waitKey(0) for i, cnt in enumerate(cnts): cv2.drawContours(img_color, cnts, i, BCOLOR, THICKNESS) print(f"ContourArea:{cv2.contourArea(cnt)}") cv2.imshow("Contour one by one", img_color) cv2.waitKey(0) ================================================ FILE: digitrecognition/contourarea_02.py ================================================ import cv2 PURPLE = (75, 0, 130) YELLOW = (0, 255, 255) THICKNESS = 4 FONT = cv2.FONT_HERSHEY_SIMPLEX img_color = cv2.imread("assets/ocbc.jpg") img_color = cv2.resize(img_color, None, None, fx=0.5, fy=0.5) img = cv2.cvtColor(img_color, cv2.COLOR_BGR2GRAY) blurred = cv2.GaussianBlur(img, (7, 7), 0) blurred = cv2.bilateralFilter(blurred, 5, sigmaColor=50, sigmaSpace=50) edged = cv2.Canny(blurred, 130, 150, 255) cv2.imshow("Outline of device", edged) cv2.waitKey(0) cnts, _ = cv2.findContours(edged, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE) # sort contours by area, and get the first 10 cnts = sorted(cnts, key=cv2.contourArea, reverse=True)[:10] for i, cnt in enumerate(cnts): cv2.drawContours(img_color, cnts, i, PURPLE, THICKNESS) x, y, w, h = cv2.boundingRect(cnt) cv2.rectangle(img_color, (x, y), (x + w, y + h), YELLOW, THICKNESS) area = round(cv2.contourArea(cnt), 1) peri = round(cv2.arcLength(cnt, closed=True), 1) print(f"ContourArea:{area}, Peri: {peri}") cv2.putText(img_color, "Area:" + str(area), (x, y - 15), FONT, 0.4, PURPLE, 1) cv2.putText(img_color, "Perimeter:" + str(peri), (x, y - 5), FONT, 0.4, PURPLE, 1) cv2.imshow("Contours", img_color) cv2.waitKey(0) ================================================ FILE: digitrecognition/contourarea_03.py ================================================ import cv2 PURPLE = (75, 0, 130) YELLOW = (0, 255, 255) THICKNESS = 4 FONT = cv2.FONT_HERSHEY_SIMPLEX img_color = cv2.imread("assets/ocbc.jpg") img_color = cv2.resize(img_color, None, None, fx=0.5, fy=0.5) img = cv2.cvtColor(img_color, cv2.COLOR_BGR2GRAY) blurred = cv2.GaussianBlur(img, (7, 7), 0) blurred = cv2.bilateralFilter(blurred, 5, sigmaColor=50, sigmaSpace=50) edged = cv2.Canny(blurred, 130, 150, 255) cv2.imshow("Outline of device", edged) cv2.waitKey(0) cnts, _ = cv2.findContours(edged, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE) # sort contours by area, and get the first 10 cnts = sorted(cnts, key=cv2.contourArea, reverse=True)[:9] cv2.drawContours(img_color, cnts, 0, PURPLE, THICKNESS) cv2.imshow("Target Contour", img_color) cv2.waitKey(0) for i in range(len(cnts)): cv2.drawContours(img_color, cnts, i, PURPLE, THICKNESS) print(f"ContourArea:{cv2.contourArea(cnts[i])}") x, y, w, h 
= cv2.boundingRect(cnts[i]) cv2.rectangle(img_color, (x, y), (x + w, y + h), YELLOW, THICKNESS) area = round(cv2.contourArea(cnts[i]), 1) peri = round(cv2.arcLength(cnts[i], closed=True), 1) print(f"ContourArea:{area}, Peri: {peri}") cv2.putText(img_color, "Area:" + str(area), (x, y - 15), FONT, 0.4, PURPLE, 1) cv2.putText(img_color, "Perimeter:" + str(peri), (x, y - 5), FONT, 0.4, PURPLE, 1) cv2.imshow("Contour one by one", img_color) cv2.waitKey(0) ================================================ FILE: digitrecognition/digit_01.py ================================================ import cv2 import numpy as np FONT = cv2.FONT_HERSHEY_SIMPLEX CYAN = (255, 255, 0) DIGITSDICT = { (1, 1, 1, 1, 1, 1, 0): 0, (0, 1, 1, 0, 0, 0, 0): 1, (1, 1, 0, 1, 1, 0, 1): 2, (1, 1, 1, 1, 0, 0, 1): 3, (0, 1, 1, 0, 0, 1, 1): 4, (1, 0, 1, 1, 0, 1, 1): 5, (1, 0, 1, 1, 1, 1, 1): 6, (1, 1, 1, 0, 0, 1, 0): 7, (1, 1, 1, 1, 1, 1, 1): 8, (1, 1, 1, 1, 0, 1, 1): 9, } # roi_color = cv2.imread("inter/dbs-roi.png") roi_color = cv2.imread("inter/ocbc-roi.png") roi = cv2.cvtColor(roi_color, cv2.COLOR_BGR2GRAY) RATIO = roi.shape[0] * 0.2 roi = cv2.bilateralFilter(roi, 5, 30, 60) trimmed = roi[int(RATIO) :, int(RATIO) : roi.shape[1] - int(RATIO)] roi_color = roi_color[int(RATIO) :, int(RATIO) : roi.shape[1] - int(RATIO)] cv2.imshow("Blurred and Trimmed", trimmed) cv2.waitKey(0) edged = cv2.adaptiveThreshold( trimmed, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY_INV, 5, 5 ) cv2.imshow("Edged", edged) cv2.waitKey(0) kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 5)) dilated = cv2.dilate(edged, kernel, iterations=1) cv2.imshow("Dilated", dilated) cv2.waitKey(0) kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 1)) dilated = cv2.dilate(dilated, kernel, iterations=1) cv2.imshow("Dilated x2", dilated) cv2.waitKey(0) kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (2, 1),) eroded = cv2.erode(dilated, kernel, iterations=1) cv2.imshow("Eroded", eroded) cv2.waitKey(0) h = roi.shape[0] ratio = int(h * 0.07) eroded[-ratio:,] = 0 eroded[:, :ratio] = 0 cv2.imshow("Eroded + Black", eroded) cv2.waitKey(0) cnts, _ = cv2.findContours(eroded, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE) digits_cnts = [] canvas = trimmed.copy() cv2.drawContours(canvas, cnts, -1, (255, 255, 255), 1) cv2.imshow("All Contours", canvas) cv2.waitKey(0) canvas = trimmed.copy() for cnt in cnts: (x, y, w, h) = cv2.boundingRect(cnt) if h > 20: digits_cnts += [cnt] cv2.rectangle(canvas, (x, y), (x + w, y + h), (0, 0, 0), 1) cv2.drawContours(canvas, cnt, 0, (255, 255, 255), 1) cv2.imshow("Digit Contours", canvas) cv2.waitKey(0) print(f"No. 
of Digit Contours: {len(digits_cnts)}") cv2.imshow("Digit Contours", canvas) cv2.waitKey(0) sorted_digits = sorted(digits_cnts, key=lambda cnt: cv2.boundingRect(cnt)[0]) canvas = trimmed.copy() for i, cnt in enumerate(sorted_digits): (x, y, w, h) = cv2.boundingRect(cnt) cv2.rectangle(canvas, (x, y), (x + w, y + h), (0, 0, 0), 1) cv2.putText(canvas, str(i), (x, y - 3), FONT, 0.3, (0, 0, 0), 1) cv2.imshow("All Contours sorted", canvas) cv2.waitKey(0) digits = [] canvas = roi_color.copy() for cnt in sorted_digits: (x, y, w, h) = cv2.boundingRect(cnt) roi = eroded[y : y + h, x : x + w] print(f"W:{w}, H:{h}") # convenience units qW, qH = int(w * 0.25), int(h * 0.15) fractionH, halfH, fractionW = int(h * 0.05), int(h * 0.5), int(w * 0.25) # seven segments in the order of wikipedia's illustration sevensegs = [ ((0, 0), (w, qH)), # a (top bar) ((w - qW, 0), (w, halfH)), # b (upper right) ((w - qW, halfH), (w, h)), # c (lower right) ((0, h - qH), (w, h)), # d (lower bar) ((0, halfH), (qW, h)), # e (lower left) ((0, 0), (qW, halfH)), # f (upper left) # ((0, halfH - fractionH), (w, halfH + fractionH)) # center ( (0 + fractionW, halfH - fractionH), (w - fractionW, halfH + fractionH), ), # center ] # initialize to off on = [0] * 7 for (i, ((p1x, p1y), (p2x, p2y))) in enumerate(sevensegs): region = roi[p1y:p2y, p1x:p2x] print( f"{i}: Sum of 1: {np.sum(region == 255)}, Sum of 0: {np.sum(region == 0)}, Shape: {region.shape}, Size: {region.size}" ) if np.sum(region == 255) > region.size * 0.5: on[i] = 1 print(f"State of ON: {on}") digit = DIGITSDICT[tuple(on)] print(f"Digit is: {digit}") digits += [digit] cv2.rectangle(canvas, (x, y), (x + w, y + h), CYAN, 1) cv2.putText(canvas, str(digit), (x - 5, y + 6), FONT, 0.3, (0, 0, 0), 1) cv2.imshow("Digit", canvas) cv2.waitKey(0) print(f"Digits on the token are: {digits}") ================================================ FILE: digitrecognition/digitrec.html ================================================
In Chapter 4: Digit Recognition, we'll add a few new techniques to our image processing toolset by attempting to build a digit recognition pipeline from start to finish. Throughout the exercise, we will get to practice the image preprocessing tricks we've picked up from previous chapters:
New methods and strategies that you'll be learning include:
To be clear, specialised deep learning libraries that have sprung up in recent years are a lot more robust in their approach. By utilizing machine learning principles (cost functions, gradient descent, etc.), these specialised libraries can handle highly complex object recognition and OCR (optical character recognition) tasks at the cost of brute computing power.
The overarching motivation of this free course, however, is to make clear to beginners what constitutes artificial intelligence, and to illustrate the principal benefits of machine learning. I try to achieve that by demonstrating, over multiple chapters of this course, how computer vision tasks were traditionally, or rather "classically", performed prior to the emergence of deep learning.
By learning the classical approaches to computer vision, the student (you) can see the effort it takes to hand-tune parameters, and this adds a new dimension of appreciation towards the self-learning methods that we'll discuss in the near future.
Do a quick Google search on "digit recognition" or "digit classification" and it's hard to find an introductory deep learning course that doesn't use the famous MNIST (Modified National Institute of Standards and Technology)[1] database. This is a handwritten digit database that has long been the de facto standard in machine learning tutorials:

But I'd argue that, for a budding computer vision developer, your learning objectives are better served by taking a different approach.
By choosing real-life images, you are confronted with a few key challenges that are not present when using a well-curated database such as MNIST. These challenges present new opportunities to learn about key concepts, such as regions of interest and morphological operations, that you will come to rely upon greatly in the future.
First, take a look at 4 real-life pictures of security tokens issued by banks and institutional agencies (left-to-right: Bank Central Asia, DBS, OCBC Bank, OneKey for Singapore Government e-services):

Notice how noisy these images are: each image is shot against a different background and under different lighting conditions, and each token differs in size, shape, and color.
Your task, as a computer vision developer, is to develop a pipeline that, in each phase, takes you closer to the goal. Roughly speaking, given the above task, we would formulate a pipeline that looks like the following:
In practice, steps (1) and (2) above are the "application" of the methods you've learned in previous chapters of this series. As we'll soon observe, we will use a combination of blurring operations and edge detection to draw our contours. One of those contours will be the LCD display containing the digits to be classified. That is our Region of Interest.

The GIF above demonstrates the code in roi_01.py; essentially, it shows the selectROI method in action. You'll commonly combine the selectROI method with either a slicing operation to crop your region of interest, or a drawing operation to call attention to a specific region of the image.
x, y, w, h = cv2.selectROI("Region of interest", img)
cropped = img[y:y + h, x:x + w]
# draw rectangle
cv2.rectangle(img_color, (x, y), (x + w, y + h), (255, 0, 0), 2)
In most cases, it simply wouldn't be realistic to render an image before manually specifying our region of interest. We'll need this operation to be as close to automatic as possible. But how exactly? That depends greatly on the specific problem set.
In some cases, the obvious choice of strategy would simply be shape recognition, say by counting the number of vertices of each contour. The following code is an example implementation of that:
# cnt = contour
peri = cv2.arcLength(cnt, True)
# contour approximation
cnt_approx = cv2.approxPolyDP(cnt, 0.03 * peri, True)
if len(cnt_approx) == 3:
    est_shape = 'triangle'
    ...
elif len(cnt_approx) == 5:
    est_shape = 'pentagon'
    ...
In other cases, you may employ a strategy that tries to match contours based on Hu moments (which we'll study in detail in future chapters).
Other methods may involve a saliency map, or a visual attention map, for ROI extraction. These methods create a new representation of the original image where each pixel's unique quality is amplified or emphasized. One example implementation on Wikipedia[2] demonstrates how straightforward this concept really is:
As you add new tools and strategies to your computer vision toolbox, you will pick up new approaches to ROI extraction. It is an interesting field of research that has been gaining a lot in popularity with the emergence of deep learning.
As for the images of bank security tokens, can you think of an approach that may be a good fit? Our region of interest is the LCD screen at the top of the button pad on each device, and they all seem to be rather consistent in shape and size. Give it some thought and read on to find out.
I've hinted at shape and size being a factor, so maybe that would be a good starting point. The good news is that OpenCV makes this incredibly easy through the contourArea() and arcLength() functions.
The following snippet of code, lifted from contourarea_01.py, finds all contours and sorts them by area in descending order before storing the first 10 in cnts:
cnts, _ = cv2.findContours(edged, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
# sort contours by contourArea, and keep the first 10
cnts = sorted(cnts, key=cv2.contourArea, reverse=True)[:10]
We can also obtain the contour area and perimeter iteratively in a for-loop, like the following:
cnts, _ = cv2.findContours(edged, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for i in range(len(cnts)):
    area = cv2.contourArea(cnts[i])
    peri = cv2.arcLength(cnts[i], closed=True)
    print(f'Area:{area}, Perimeter:{peri}')
In effect, we're looping through each contour that the findContours() operation found, and computing two values each time, area and peri.
Note that the contour perimeter is also known as the arc length. The second argument closed specifies whether the shape is a closed contour (True) or just a curve (closed=False).
Execute contourarea_01.py and observe how each contour is displayed, from the one with the largest area to the one with the least, for a total of 10 contours. As you run the script on different pictures of bank security tokens, you'll see that it does a reliable job of finding the contours, sorting them, and returning our LCD display screen as the first in the list. This makes sense, because visually it is apparent that the LCD display occupies the largest area among the closed shapes in our picture.
Use assets/dbs.jpg instead of assets/ocbc.jpg in contourarea_01.py. Were you able to extract the region of interest (LCD Display) successfully without any changes to the script?
Could we have successfully extracted our region of interest had we used arcLength in our strategy?
Suppose we only wanted to extract the region of interest and not the rest; which line of code would you change? Reflect the change in the code and execute it to confirm that you have performed this exercise correctly.
Suppose we wanted the contours sorted according to their respective areas, from smallest to largest; which line of code would you change? Reflect the change in the code and execute it to confirm that you have performed this exercise correctly.
While working through the exercises above, you may find it helpful to also draw the text describing the area size and perimeter next to each contour. I've shown you how this can be done in contourarea_02.py but the essential addition we make to the earlier code is the two calls to putText():
PURPLE = (75, 0, 130)
THICKNESS = 1
FONT = cv2.FONT_HERSHEY_SIMPLEX
cv2.putText(img_color, "Area:" + str(area), (x, y - 15), FONT, 0.4, PURPLE, THICKNESS)
cv2.putText(img_color, "Perimeter:" + str(peri), (x, y - 5), FONT, 0.4, PURPLE, THICKNESS)

With these foundations, we are now ready to write a simple utility script that:
- finds the region of interest (the LCD display) in a given image
- crops the ROI out of the original image
- saves the cropped ROI into /inter (intermediary) for the actual digit recognition later

Much of what you need to do has already been presented so far, but the core pieces, lifted from roi_02.py, are the following few lines of code:
img = cv2.imread(...)
blurred = cv2.GaussianBlur(img, (7, 7), 0)
edged = cv2.Canny(blurred, 130, 150, 255)
cnts, _ = cv2.findContours(edged, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = sorted(cnts, key=cv2.contourArea, reverse=True)[:1]
x, y, w, h = cv2.boundingRect(cnts[0])
roi = img[y:y + h, x:x + w]
cv2.imwrite("roi.png", roi)
The roi_02.py utility script uses the argparse library so the user can specify a file path with the -p (or --path) flag, like so:
python roi_02.py -p assets/ocbc.jpg
# equivalent:
python roi_02.py --path assets/ocbc.jpg
If the user does not specify a file path using the -p flag, the default value is assets/ocbc.jpg. If you wish to change this, edit roi_02.py and specify a different value for the default parameter.
parser = argparse.ArgumentParser()
parser.add_argument("-p", "--path", default="assets/ocbc.jpg")
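The parsed value is then used to read the image; a minimal sketch of how that might look (the variable names here are illustrative, not necessarily those used in roi_02.py):

args = parser.parse_args()
img = cv2.imread(args.path)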
You should run this exercise using dbs.jpg, ocbc2.jpg, or onekey.jpg at least once. Execute the script and check the inter folder to confirm that the ROI has been saved. When you're done, you are ready to move on to the next phase of the digit recognition pipeline.
Once the region of interest is obtained, we have an image that may still contain noise. This is especially the case when our ROI is obtained by means of thresholding methods, since you can expect some "non-features" (noise) to also be included in the resulting image.
To account for these imperfections, we will now perform a series of operations on our image. We'll learn what they are formally, but let's begin by seeing what it is that they offer to our image processing pipeline. I've included a picture with some random noise, as follows:

The digits "0417" are clearly discernible to the human eye despite the presence of noise. However, consider the perspective of a global thresholding operation: these pixel values are "noise" to us, but a computer has no such notion of which pixel values are meaningful and which are not. A threshold value such as the global mean will take all values into account indiscriminately. A contour finding operation will, instead of 4, return thousands of tiny round segments (they may be tiny, but they are completely valid contours).
An image processing pipeline that fails to account for these may result in sub-optimal performance or, very often, completely undesired results.
Enter two of the most fundamental morphological transformations: erosion and dilation.
Erosion "erodes away the boundaries of foreground object"[3] by sliding a kernel through the image and setting a pixel to 1 only if all the pixels under the kernel are 1.
This in effect discards pixels near the boundary and any floating pixels that are not part of a larger blob (which is what the human eye is interested in). Because pixels are eroded, your foreground object will shrink in size.
The opposite of erosion, Dilation sets a pixel to 1 if at least one pixel under the kernel is 1, essentially "growing" the foreground object.
Because of how these operations work, there are a couple of things to note:

As we read our image in grayscale mode (flags=0), we obtain a white background and a mostly-black foreground. This is illustrated in the subplot titled "Original" above. We begin our preprocessing steps by first binarizing the image (step 1), followed by inverting the colors (step 2) to get a white-on-black image.
An erosion operation is then performed (step 3). This works by creating our kernel (either through numpy or through opencv's structuring element) and sliding that kernel across our image to remove white noises in our image.
The side-effect is that our foreground object has now shrunk in size as its boundaries are eroded away. We grow it back by applying a dilation (step 4) and finally show the output, as illustrated in the bottom-right pane of the image above.
import cv2
import numpy as np

# read as grayscale
roi = cv2.imread("assets/0417s.png", flags=0)
# step 1:
_, thresh = cv2.threshold(roi, 170, 255, cv2.THRESH_BINARY)
# step 2:
inv = cv2.bitwise_not(thresh)
# step 3 (option 1):
kernel = np.ones((5, 5), np.uint8)
# step 3 (option 2):
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
eroded = cv2.erode(inv, kernel, iterations=1)
# step 4:
dilated = cv2.dilate(eroded, kernel, iterations=1)
cv2.imshow("Transformed", dilated)
cv2.waitKey(0)
OpenCV provides three shapes for our kernel:
- MORPH_RECT
- MORPH_CROSS
- MORPH_ELLIPSE

They are fed as the first argument into cv2.getStructuringElement(), with the second being the kernel size (ksize) itself. The third argument is the anchor point, which defaults to the center.
Erosion followed by Dilation is also known as Opening. It is useful for removing noise in our image. The reverse of Opening is Closing, where we first perform Dilation followed by Erosion; it is particularly suited for closing small holes inside foreground objects.
OpenCV includes the more generic morphologyEx method for all other morphological operations beyond Erosion and Dilation. The function takes an image as the first argument, an operation as the second argument, and finally the kernel. Compare how your code will differ between cv2.erode and cv2.dilate, and their respective equivalents in cv2.morphologyEx():
import cv2
import numpy as np

img = cv2.imread('image.png', 0)
kernel = np.ones((5, 5), np.uint8)

erosion = cv2.erode(img, kernel, iterations=1)
# Equivalent:
# cv2.morphologyEx(img, cv2.MORPH_ERODE, kernel, iterations=1)

dilation = cv2.dilate(img, kernel, iterations=1)
# Equivalent:
# cv2.morphologyEx(img, cv2.MORPH_DILATE, kernel, iterations=1)

opening = cv2.morphologyEx(img, cv2.MORPH_OPEN, kernel)
closing = cv2.morphologyEx(img, cv2.MORPH_CLOSE, kernel)
In the homework directory, you'll find 0417h.png. Your job is to apply what you've learned in this lesson to clean up the image. Your output should have these qualities:
- when you perform findContours() on the output, you should have exactly 4 contours
You are free to pick your strategy, but a reference solution would look like the following:

The seven-segment display (known also as "seven-segment indicator") is a form of electronic display device for displaying decimal numerals[4] widely used in digital clocks, electronic meters, calculators and banking security tokens.

This is relevant because it is the character representation of our digits in each of these security tokens. If we can isolate each digit from each other, we can iteratively predict the "class" of each digit (0 to 9). Specifically, we are going to perform a classification task based on the state of each segment.
To ease our understanding, let's refer to each segment using the letters A to G:

We can then create a lookup table that match the collective states to the corresponding class:
| Class | a | b | c | d | e | f | g |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 2 | 1 | 1 | 0 | 1 | 1 | 0 | 1 |
| 3 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
| 4 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
| 5 | 1 | 0 | 1 | 1 | 0 | 1 | 1 |
| 6 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| 7 | 1 | 1 | 1 | 0 | 0 | 1 | 0 |
| 8 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 9 | 1 | 1 | 1 | 1 | 0 | 1 | 1 |
How would we represent such a lookup table in our Python code and how would we use it? The obvious answer to the first question is a dictionary. Notice that DIGITSDICT is just a representation of the "binary state" of each segment. The digit "8", for example, corresponds to all seven segments being activated, or "on" (a state of 1).
DIGITSDICT = {
    (1, 1, 1, 1, 1, 1, 0): 0,
    (0, 1, 1, 0, 0, 0, 0): 1,
    (1, 1, 0, 1, 1, 0, 1): 2,
    (1, 1, 1, 1, 0, 0, 1): 3,
    (0, 1, 1, 0, 0, 1, 1): 4,
    (1, 0, 1, 1, 0, 1, 1): 5,
    (1, 0, 1, 1, 1, 1, 1): 6,
    (1, 1, 1, 0, 0, 1, 0): 7,
    (1, 1, 1, 1, 1, 1, 1): 8,
    (1, 1, 1, 1, 0, 1, 1): 9,
}
Then, for each digit, we would look at the pixel values in each of the seven segments, and if the majority of pixels are white, we would classify that segment as being in an activated state (1), otherwise in a state of 0. As we iterate over the 7 segments, we end up with an array of length 7, each element a binary value (0 or 1).
We would then find the corresponding value in our dictionary using that array. Your code would resemble the following:
# define the rectangle areas corresponding to each segment
sevensegs = [
    ((x0, y0), (x1, y1)),
    ((x2, y2), (x3, y3)),
    ...  # 7 of them
]
# initialize the state to OFF
on = [0] * 7
# set each segment to ON / OFF based on majority
for (i, ((p1x, p1y), (p2x, p2y))) in enumerate(sevensegs):
    # numpy slicing to extract only one region
    region = roi[p1y:p2y, p1x:p2x]
    # if majority pixels are white, set state to ON
    if np.sum(region == 255) > region.size * 0.5:
        on[i] = 1
# lookup on dictionary
digit = DIGITSDICT[tuple(on)]
# digit is one of 0-9
There are multiple ways to write a for-loop, but it's important that you are aware of the order in which your for-loop is executing. Referring to our seven-segment illustration below, the first iteration is only concerned with the state of 'A' while the second iteration handles the state of 'B', and so on.

Using enumerate, we obtain an additional counter (i) over our iterable (sevensegs); this is convenient for the purpose of setting states. At the first iteration, the first element in our list is conditionally set to 1 if more than half of the pixels in segment 'A' are white. A more detailed example of Python's enumerate is in utils/enumerate.py.
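As a quick illustration of that pattern (a simplified sketch, not the exact contents of utils/enumerate.py):

segments = ["a", "b", "c", "d", "e", "f", "g"]
on = [0] * 7
for i, segment in enumerate(segments):
    # i counts 0, 1, 2, ... while segment walks through the letters
    print(i, segment)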
If you pay close attention to the digit '0' in our LCD display, you will notice that the absence of the 'G' segment causes a pretty visible and significant gap. When you test your digit recognition script without special consideration for this attribute, you will find it consistently failing to account for the numbers "0", "1" and "7". In fact, you may not even be able to isolate the aforementioned numbers using the findContours operation, because they are treated as two disjoint pieces instead of a whole.
A reasonable strategy to handle this is the Dilation or Closing (Dilation followed by Erosion) operation that you've learned earlier.
Similarly, your ROI may necessitate other pre-processing, and the specific tactical solution varies greatly depending on the problem set at hand.
As I inspected the bounding boxes we retrieved around the LCD screens, the observation that these bounding boxes often have their digits centered around the bottom half of the display led me to insert an additional step prior to the morphological transformation in the final code solution. The step uses numpy subsetting to trim away the top 20% as well as 20% on each side of the image:
roi = cv2.imread("roi.png", flags=0)
RATIO = roi.shape[0] * 0.2
trimmed = roi[int(RATIO):, int(RATIO): roi.shape[1] - int(RATIO)]
That said, whenever possible, you want to be careful not to hand-tune your solution in a way that is overly specific to the images you have at hand, lest it only works on those specific images and not others, a phenomenon fondly termed "overfitting" in the machine learning community.
I've re-executed the solution code against some sample image sets, once with the "trimming" in-place and then without the trimming, before settling on the decision. As you will see later, the trimming improves our accuracy and is a relatively safe strategy given how every LCD screen regardless of the issuer (bank) has the same asymmetry with more "blank space" at the top half compared to the bottom half.
Furthermore, in many cases of digit recognition / digit classification, you will want to predict the class of each digit in an ordered fashion. Suppose the LCD screen contains the digits "40710382"; our algorithm should correctly isolate these digits and classify them iteratively, but do so from the leftmost digit to the rightmost. Failing to account for this may result in your algorithm correctly classifying each digit but producing an unreasonable output such as "1740238".
There are a few strategies you can employ here. We've seen in contourarea_01.py and contourarea_02.py how contours have attributes that can be retrieved using the contourArea() and arcLength() functions. Inspect the following snippet and it should help jog your memory:
cnts = sorted(cnts, key=cv2.contourArea, reverse=True)[:10]
for i, cnt in enumerate(cnts):
    cv2.drawContours(img_color, cnts, i, BCOLOR, THICKNESS)
    area = cv2.contourArea(cnt)
    peri = cv2.arcLength(cnt, closed=True)
    print(f"Area:{area}; Perimeter: {peri}")
Indeed, we're using contour area as a good indicator to search for our region of interest. When we take this idea a little further, we can place an additional constraint on our search criteria. In the following code, we draw a bounding rectangle and, for an extra layer of precaution, only keep bounding boxes that are taller than 20 pixels (step 1).
Calling boundingRect() on a contour returns 4 values, respectively the x and y coordinate along with the width and height of the contour.
We then use another property of the contour, its top-left coordinate to determine the logical order of our digits. Specifically, we use the first returned value (cv2.boundingRect(cnt)[0]) since that's the x value for the top-left coordinate of each region. By sorting against this value, our digits are stored in the Python list in an ordered fashion, determined by their respective coordinate value.
digits_cnts = []
cnts, _ = cv2.findContours(eroded, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for cnt in cnts:
    (x, y, w, h) = cv2.boundingRect(cnt)
    # step 1
    if h > 20:
        digits_cnts += [cnt]
# step 2
sorted_digits = sorted(digits_cnts, key=lambda cnt: cv2.boundingRect(cnt)[0])
When we put these together, we now have a complete pipeline:

The full solution code is in digit_01.py but the essential parts are as follow:
import cv2
import numpy as np

# step 1:
DIGITSDICT = {
    (1, 1, 1, 1, 1, 1, 0): 0,
    (0, 1, 1, 0, 0, 0, 0): 1,
    (1, 1, 0, 1, 1, 0, 1): 2,
    (1, 1, 1, 1, 0, 0, 1): 3,
    (0, 1, 1, 0, 0, 1, 1): 4,
    (1, 0, 1, 1, 0, 1, 1): 5,
    (1, 0, 1, 1, 1, 1, 1): 6,
    (1, 1, 1, 0, 0, 1, 0): 7,
    (1, 1, 1, 1, 1, 1, 1): 8,
    (1, 1, 1, 1, 0, 1, 1): 9,
}

# step 2
roi = cv2.imread("inter/ocbc-roi.png", flags=0)

# step 3
RATIO = roi.shape[0] * 0.2
roi = cv2.bilateralFilter(roi, 5, 30, 60)
trimmed = roi[int(RATIO):, int(RATIO): roi.shape[1] - int(RATIO)]

# step 4
edged = cv2.adaptiveThreshold(
    trimmed, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY_INV, 5, 5
)

# step 5
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 5))
dilated = cv2.dilate(edged, kernel, iterations=1)
eroded = cv2.erode(dilated, kernel, iterations=1)

# step 6
cnts, _ = cv2.findContours(eroded, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
digits_cnts = []
for cnt in cnts:
    (x, y, w, h) = cv2.boundingRect(cnt)
    if h > 20:
        digits_cnts += [cnt]

# step 7
sorted_digits = sorted(digits_cnts, key=lambda cnt: cv2.boundingRect(cnt)[0])

# step 8
digits = []
for cnt in sorted_digits:
    # step 8a
    (x, y, w, h) = cv2.boundingRect(cnt)
    roi = eroded[y: y + h, x: x + w]
    qW, qH = int(w * 0.25), int(h * 0.15)
    fractionH, halfH, fractionW = int(h * 0.05), int(h * 0.5), int(w * 0.25)
    # step 8b
    sevensegs = [
        ((0, 0), (w, qH)),  # a (top bar)
        ((w - qW, 0), (w, halfH)),  # b (upper right)
        ((w - qW, halfH), (w, h)),  # c (lower right)
        ((0, h - qH), (w, h)),  # d (lower bar)
        ((0, halfH), (qW, h)),  # e (lower left)
        ((0, 0), (qW, halfH)),  # f (upper left)
        # ((0, halfH - fractionH), (w, halfH + fractionH))  # center
        (
            (0 + fractionW, halfH - fractionH),
            (w - fractionW, halfH + fractionH),
        ),  # center
    ]
    # step 8c
    on = [0] * 7
    for (i, ((p1x, p1y), (p2x, p2y))) in enumerate(sevensegs):
        region = roi[p1y:p2y, p1x:p2x]
        print(
            f"{i}: Sum of 1: {np.sum(region == 255)}, Sum of 0: {np.sum(region == 0)}, Shape: {region.shape}, Size: {region.size}"
        )
        if np.sum(region == 255) > region.size * 0.5:
            on[i] = 1
    print(f"State of ON: {on}")
    # step 8d
    digit = DIGITSDICT[tuple(on)]
    print(f"Digit is: {digit}")
    digits += [digit]
    # step 9
    # FONT, CYAN and canvas (a copy of the color ROI) are defined in the full digit_01.py
    cv2.rectangle(canvas, (x, y), (x + w, y + h), CYAN, 1)
    cv2.putText(canvas, str(digit), (x - 5, y + 6), FONT, 0.3, (0, 0, 0), 1)
    cv2.imshow("Digit", canvas)
    cv2.waitKey(0)

print(f"Digits on the token are: {digits}")
- step 8a: retrieve the bounding rectangle (x, y, w, h) of our rectangular box and crop that region out of the eroded image
- step 8b: define the seven segment regions; the top bar (a), for example, is w in width and 15% the height of the full digit contour (int(h * 0.15)), starting from position (0, 0)
- step 8c: initialize the state to 0 for each of the 7 segments, then conditionally set regions with more white than black pixels to 1
- step 8d: look up the digit matching the tuple of states and append it to the digits list created at the beginning of step 8
- step 9: draw the bounding box and predicted digit on the canvas, and finally print the digits list.

An edge can be defined as a boundary between regions in an image[1]. The edge detection techniques we'll learn in this course build upon what we've learned from our lessons on kernel convolution. Edge detection is the process of using kernels to reduce the information in our data, preserving only the necessary structural properties in our image[1:1].
The gradient points in the direction of the most rapid increase in intensity. When we apply a gradient-based edge detection method, we are searching for the maxima and minima in the first derivative of the image.
When we apply our convolution to the image, we are looking for regions where there is a sharp change in intensity or color. Arguably the most common edge detection method using this approach is the Sobel Operator.
The Sobel operator applies a filtering operation to produce an image output where the edge is emphasized. It convolves our original image using two 3x3 kernels to capture approximations of the derivatives in both the horizontal and vertical directions.
The x-direction and y-direction kernels would be:
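Following the same sign convention as the numpy construction shown later in this chapter:

$$G_x = \begin{bmatrix} 1 & 0 & -1 \\ 2 & 0 & -2 \\ 1 & 0 & -1 \end{bmatrix} \qquad G_y = \begin{bmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}$$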
Each kernel is applied separately to obtain the gradient component in each orientation, $G_x$ and $G_y$. Expressed in formula, the gradient magnitude is:
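$$G = \sqrt{G_x^2 + G_y^2}$$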
Where the slope $\theta$ of the gradient is calculated as follows:
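$$\theta = \arctan\left(\frac{G_y}{G_x}\right)$$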
If the two formulas above confuse you, read on as we unpack these ideas one at a time.
In computer vision literature, you'll often hear about "taking the derivative", and this may serve as a source of confusion for beginning practitioners since "derivatives" are often thought of in the context of a continuous function. Images are 2D matrices of discrete values, so how do we wrap our heads around the idea of finding a derivative?
But why do we even bother with derivatives when this course is supposed to be about edge detection in images?

Among the many ways to answer the question, my favorite is that an image is really just a function. When we treat an image as a function, the utility of taking derivatives becomes a little more obvious. In the image below, suppose you want to count the number of windows in this area of Venezia Sestiere Cannaregio; your program can look for large derivatives, since there are sharp changes in pixel intensity from the windows to the surrounding wall:

The code to generate the surface plot above is in img2surface.py.
Going back to our x-direction kernel in the Sobel Operator.
This kernel has all zeros in its middle column, which is quite easy to intuit about. Essentially, for each pixel in our image, we want to compute its derivative in the x-direction by approximating a formula that you may have come across in your calculus class:
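$$f'(x) \approx \frac{f(x+h) - f(x)}{h}$$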
This approximation is also called 'forward difference', because we're taking a value of $x$, and computing the difference in $f(x)$ as we increment it by a small amount forward, denoted as $h$.
And as it turns out, using the 'central difference' to compute the derivative of our discrete signal can deliver better results[2]:
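$$f'(x) \approx \frac{f(x+h) - f(x-h)}{2h}$$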
To make this more concrete, we can plug the formula into an actual array of pixels:
when we set $h=2$ at the center pixel (index of value 180), we have the following:
Notice that a large part of the calculation we just performed is equivalent to a 1D convolution operation using a $\begin{bmatrix} -1 & 0 & 1 \end{bmatrix}$ kernel.
When the same 1x3 kernel $\begin{bmatrix} -1 & 0 & 1 \end{bmatrix}$ is applied on the right-most part of the image where it's just white space ([..., 255, 255, 255]), the kernel evaluates to 0. In other words, our derivative filter returns no response where it can't detect a sharp change in pixel intensity.
As a reminder, the x-direction kernel in our Sobel Operator is the following:
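$$G_x = \begin{bmatrix} 1 & 0 & -1 \\ 2 & 0 & -2 \\ 1 & 0 & -1 \end{bmatrix}$$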
This takes our 1x3 kernel and instead of convolving one row of pixels at a time, extends it to convolve at 3x3 neighborhoods at a time using a weighted average approach.
The two kernels (one for horizontal and another for vertical edge detection) can be constructed, respectively, like the following:
sobel_x = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]])
sobel_y = np.array([[1, 2, 1], [0, 0, 0], [-1, -2, -1]])
You may have guessed that, given its role in digital image processing, OpenCV includes a method that performs the Sobel Operator for us, and thankfully it does. Here's an example of using the cv2.Sobel(src, ddepth, dx, dy, dst=None, ksize) method:
gradient_x = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)
gradient_y = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)
print(f"Range: {np.min(gradient_x)} | {np.max(gradient_x)}")
# Range: -177.0 | 204.0
gradient_x = np.uint8(np.absolute(gradient_x))
gradient_y = np.uint8(np.absolute(gradient_y))
print(f"Range uint8: {np.min(gradient_x)} | {np.max(gradient_x)}")
# Range uint8: 0 | 204
cv2.imshow("Gradient X", gradient_x)
cv2.imshow("Gradient Y", gradient_y)

The code above, extracted from sobel_01.py reinforces a couple of ideas that we've been working on. It shows that:
- the raw Sobel output can contain negative values (and values above 255), which is why we keep a higher output datatype (cv2.CV_64F). OpenCV suggests to "keep the output datatype to some higher form such as cv2.CV_64F, take its absolute value and then convert back to cv2.CV_8U"[3].

While the code above certainly works, OpenCV also has a method that scales, calculates absolute values and converts the result to 8-bit. cv2.convertScaleAbs(src, dst, alpha=1, beta=0) performs the following:
gradient_x = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)
gradient_y = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)
gradient_x = cv2.convertScaleAbs(gradient_x)
gradient_y = cv2.convertScaleAbs(gradient_y)
print(f"Range: {np.min(gradient_x)} | {np.max(gradient_x)}")
At the beginning of this course I said that images are really just 2d functions before showing you the intricacies of our Sobel kernels. We saw the clever design of both the x- and y-direction kernels, by borrowing from the concept of "taking the derivatives" you often see in calculus text books.
But on a really basic level, these kernels only return the x and y edge responses. These are not the image gradient, just pure arithmetic values from following the convolution process. To get to the final form (where the edges in our image are emphasized) we still need to compute the gradient direction and magnitude for each point in our image.
This brings us back to our original formula. Recall that the x-direction and y-direction kernels are:
We understand that each kernel is applied separately to obtain the gradient component in each orientation, $G_x$ and $G_y$. What is the significance of this? Well, as it turns out, if we know the shift in the x-direction and the corresponding change in value in the y-direction, then we can use the Pythagorean theorem to approximate the "length of the slope", a concept that many of you are familiar with.
Expressed in formula, the gradient magnitude is hence:
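$$G = \sqrt{G_x^2 + G_y^2}$$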
Along with the well-known mathematical formula that is Pythagorean theorem, some of you may also have some familiarity with the three trigonometric functions. Particularly, the tangent function tells us that in a right triangle, the tangent of an angle is the length of the opposite side divided by the length of the adjacent side.
This leads us to the following expression:
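$$\tan(\theta) = \frac{G_y}{G_x}$$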
Rewriting the expression above, we arrive at the formula that captures the gradient's direction:
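$$\theta = \arctan\left(\frac{G_y}{G_x}\right)$$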

This whole idea is also illustrated in code, and the script is provided to you:
- gradient.py generates the vector field in the picture above (right)
- img2surface.py, run on the penguin image in the assets folder, generates the surface plot

Succinctly, suppose the two 3x3 kernels do not fire a response (for example, when no edges are detected in the white background of our penguin): both $G_x$ and $G_y$ will be 0, which leads to a gradient magnitude of 0. You can compute these by hand, let OpenCV's implementation handle that for you, or use numpy as illustrated in gradient.py:
dY, dX = np.gradient(img)
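From these two components, the magnitude and direction at every pixel can then be computed with plain numpy; a minimal sketch (assuming img has already been read as a grayscale array):

import numpy as np

dY, dX = np.gradient(img.astype(float))
magnitude = np.sqrt(dX ** 2 + dY ** 2)
orientation = np.arctan2(dY, dX)  # gradient direction in radians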
Image segmentation is the process of decomposing an image into parts for further analysis. This has many uses:
Current literature on image segmentation techniques can be classified into[4]:
It's important to note, however, that the rise in popularity of deep learning frameworks and techniques has ushered in a proliferation of new methods to perform what was once a highly difficult task. In future lectures, we'll explore image segmentation in far greater detail. In this course, we'll study intensity-based segmentation and edge-based segmentation methods.
Intensity-based methods are perhaps the simplest, as intensity is the simplest property that pixels can share.
To make a more concrete case of this, let's assume you're working with a team of researchers to build an AI-based "sudoku solver" that, unimaginatively, will compete against human sudoku players in an attempt to further stake the claim in an ongoing debate of AI superiority.
While your teammates work on the algorithmic design for the actual solver, your task is comparatively straightforward: write a script to scan newspaper images (or print magazines), binarize them to discard everything except the digits in the sudoku puzzle.
This presents a great opportunity to use an intensity-based segmentation technique we spoke about earlier.
In intensitythresholding_01.py, you'll find a code demonstration of the numerous thresholding methods provided by OpenCV. In total, there are 5 simple thresholding methods: THRESH_BINARY, THRESH_BINARY_INV, THRESH_TRUNC, THRESH_TOZERO and THRESH_TOZERO_INV[5].
The method call for all of them is identical:
cv2.threshold(img, thresh, maxval, type)
We specify our source image img (usually in grayscale), a threshold value thresh used to binarize the image pixels, and a max value maxval for the pixel value to use for any pixel that crosses our threshold.
The mathematical functions for each one of them:

They're collectively known as simple thresholding in OpenCV because they use a global threshold value; any pixel smaller than the threshold is set to 0, otherwise it is set to the maxval value.
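As a minimal sketch of the call in practice (the file path here is an assumption; see intensitythresholding_01.py for the full demonstration on the sudoku image):

import cv2

img = cv2.imread("assets/sudoku.jpg", flags=0)  # read as grayscale
_, binary = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)
_, binary_inv = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY_INV)
cv2.imshow("THRESH_BINARY", binary)
cv2.waitKey(0)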
These probably sound too simplistic for anything beyond the simplest of real-world images, and in the majority of cases they are. They call for proper judgment of the task at hand.
Applying the various types of simple thresholding methods to our sudoku image, we observe that the digits are for the most part extracted successfully while the background information is greatly reduced:

Refer to intensitythresholding_01.py for the full code.
As a simple homework, try to practice simple thresholding on the car2.png located in your homework folder. To reduce noise, you may have to combine a blurring operation prior to thresholding. As you practice, pay attention to the interaction between your threshold values and the output. Later in the course, you'll learn how to draw contours, which would come in handy in producing the final output:

As you work on your homework, you will notice that, given the varying lighting conditions across the different regions of our image, regardless of the global value we pick, we end up with a threshold that is either too low or too high.
Using a global value as an intensity threshold may work in particular cases but may be overly naive to perform well when, say, an image has different lighting conditions in different areas. A great example of this case is the object extraction exercise you performed using car2.png.
Adaptive thresholding is not a lot different from the aforementioned thresholding techniques, except it determines the threshold for each pixel based on its neighborhood. This in effect means that the image is assigned different thresholds across the different regions, leading to a cleaner output when our image has different degrees of illumination.

The method is called with the source image (src), a max value (maxValue), the method (adaptiveMethod), a threshold type (thresholdType), the size of the neighborhood (blockSize) and a constant (C) that is subtracted from the mean or the weighted sum of the neighborhood pixels.
mean_adaptive = cv2.adaptiveThreshold(
    img, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 11, 2
)
gaussian_adaptive = cv2.adaptiveThreshold(
    img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2
)
The code, taken from adaptivethresholding_01.py produces the following:

Edge-based segmentation separates foreground objects by first identifying all edges in our image. The Sobel Operator and other gradient-based filter functions are good and well-known candidates for such an operation.[6]
Once we obtain the edges, we perform the contour approximation operation using the findContours method in OpenCV. But what exactly are contours?
In OpenCV's words[7],
Contours can be explained simply as a curve joining all the continuous points (along the boundary), having same color or intensity. The contours are a useful tool for shape analysis and object detection and recognition.
If we have "a curve joining all the continuous points along the boundary", then we are able to extract this object. If we wish to count the number of contours in our image, the method also conveniently returns a list of all the found contours, making it easy to call len() on the list to retrieve the count.
There are three arguments to the findContours() function: the first is the source image, the second is the retrieval mode, and the last is the contour approximation method. Both the contour retrieval mode and the approximation method are discussed in the next sub-section.
(cnts, hierarchy) = cv2.findContours(
    img, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE,
)
The function returns the contours and hierarchy, with contours being a list of all the contours in the image. Each contour is a Numpy array of (x,y) coordinates of boundary points of the object, giving each contour a shape of (n, 1, 2).
What this allows us to do is to combine the contours we retrieved with the cv2.drawContours() function: either individually, exhaustively in a for-loop fashion, or everything in one go.
Assuming img being the image we want to draw our contours on, the following code demonstrates these different methods:
# draw all contours
cv2.drawContours(img, cnts, -1, (0, 255, 0), 3)

# draw the 3rd contour
cv2.drawContours(img, cnts, 2, (0, 255, 0), 3)

# draw the first, fourth and fifth contour
cnt_selected = [cnts[0], cnts[3], cnts[4]]
cv2.drawContours(img, cnt_selected, -1, (0, 255, 255), 1)

# draw the fourth contour
cv2.drawContours(img, cnts, 3, (0, 255, 0), 3)
The first argument to this function is the source image, the second is the contours as a Python list, the third is the index of the contour to draw (-1 to draw them all), and the remaining arguments are the color and thickness of the contour lines respectively.
One common mistake beginners run into is performing the findContours operation on the grayscale image instead of a binary image, leading to poorer accuracy.
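To illustrate, below is a minimal sketch of the usual pipeline (the image path is hypothetical, purely for illustration): binarize the grayscale image first, then look for contours on the binary result:
import cv2

img = cv2.imread("assets/penguins.png")  # hypothetical asset for illustration
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# binarize before findContours; passing the raw grayscale image often merges nearby objects
_, binary = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY_INV)

cnts, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
print(f"Contours found: {len(cnts)}")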
When we execute contour_01.py, we notice that the drawContour operation yields the following output:

There are 5 occurrences where our findContours function merged two penguins into a single contour because they were too close to each other. When we execute len(cnts), we will find that the returned value is 5 less than the actual count.
Try to fix contour_01.py by performing the contour approximation on our binary image using the thresholding technique you've learned in previous section.
In the findContours() function call, we passed our image as src in the first argument. The second argument is the contour retrieval mode, and the documentation lists 4 of them[8]:
- RETR_EXTERNAL: retrieves only the extreme outer contours (see image below for reference)
- RETR_LIST: retrieves all contours without establishing any hierarchical relationships
- RETR_CCOMP: retrieves all contours and organizes them into a two-level hierarchy (external boundary + boundaries of the holes)
- RETR_TREE: retrieves all of the contours and reconstructs a full hierarchy of nested contours
In our case, we don't particularly care about the hierarchy, so the second to fourth modes all have the same effect. In other cases, you may experiment with a different contour retrieval mode to obtain both the contours and the hierarchy for further processing.
What about the last parameter passed to our findContours method?
Recall that contours are just boundaries of a shape. In a sense, a contour is an array of (x, y) coordinates used to "record" the boundary of a shape. Given this collection of coordinates, we can then recreate the boundary of our shape. This begs the next question: how many sets of coordinates do we need to store to recreate our boundary?
Suppose we perform the findContours operation on an image of two rectangles. One way to record the boundary is to store as many points around these rectangle boxes as possible. When we set cv2.CHAIN_APPROX_NONE, that is in fact what the algorithm does, resulting in 658 points around the border of the top rectangle:

However, notice that a more efficient solution would be to store only the 4 coordinates at the corners of each rectangle. The contour is perfectly represented and recreated using just 4 points per rectangle, for a total of 8 points compared to 1,316. cv2.CHAIN_APPROX_SIMPLE[9] is an implementation of this idea, and you can find the sample code below:

cnts, _ = cv2.findContours(  # does this need to be changed?
    edged, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE,
)
print(f"Cnts Simple Shape (1): {cnts[0].shape}")
# return: Cnts Simple Shape (1): (4, 1, 2)
# output of cnts[0]:
# array([[[ 47, 179]],
#        [[ 47, 259]],
#        [[296, 259]],
#        [[296, 179]]], dtype=int32)

cnts2, _ = cv2.findContours(  # does this need to be changed?
    edged, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE,
)
print(f"Cnts NoApprox Shape:{cnts2[0].shape}")
# Cnts NoApprox Shape:(658, 1, 2)
The full script for the experiment above is in contourapprox.py.
You may, at this point, hop to the Learn By Building section to attempt your homework.
John Canny developed a multi-stage procedure that, some 30 years later, is "still a state-of-the-art edge detector"[10]. Better edge detection algorithms usually require greater computational resources -- and consequently longer processing times -- or a greater number of parameters, in an area where algorithm speed is oftentimes the most important criterion. For these reasons, along with its general robustness, the Canny edge algorithm has become one of the "most important methods to find edges" even in modern literature[1:2].
I said it's a multi-stage procedure because the technique, as described in his original paper, A Computational Approach to Edge Detection, works as follows[11]:
1. Compute the gradient component $G_x$ along the horizontal direction
2. Compute the gradient component $G_y$ along the vertical direction, then derive the gradient magnitude and orientation
3. Thin the edges by applying non-maximum suppression
4. Apply hysteresis thresholding to decide which of the remaining candidate pixels are really edges
Step (1) and (2) in the procedure above can be achieved using code we've written so far in our Sobel Operator scripts. We use the Sobel mask filters to compute $G_x$ and $G_y$, respectively the gradient component in each orientation. We then compute the gradient magnitude and the angle $\theta$:
Gradient magnitude: $G = \sqrt{G_x^2 + G_y^2}$
And recall that the slope $\theta$ of the gradient is calculated as follows: $\theta = \arctan\left(\frac{G_y}{G_x}\right)$
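As a minimal sketch (reusing the sudoku asset from our earlier Sobel scripts), the two quantities above can be computed directly from the Sobel gradients with numpy:
import numpy as np
import cv2

img = cv2.imread("assets/sudoku.jpg", 0)           # read as grayscale
gx = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)     # gradient along x
gy = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)     # gradient along y
magnitude = np.sqrt(gx ** 2 + gy ** 2)             # gradient magnitude
theta = np.arctan2(gy, gx)                         # gradient orientation, in radians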
Step (3) in the procedure is another common technique in computer vision known as non-maximum suppression (NMS). Let's begin by taking a look at the output of our Sobel edge detector from earlier exercises:

Notice as we zoom in on the output image that the gradient-based method did produce our strong edges, but it also produced the "weak" edges it finds in our image. Because it is not a parameterized function -- the edge is computed using the values of the gradient magnitude and direction -- we have to rely on an additional mechanism for the edge thinning operation, with the criterion being one accurate response to any given edge[12].
Non-maximum suppression helps us obtain the strongest edge by suppressing all the gradient values, i.e. setting them to 0, except for the local maxima, which indicate locations with the sharpest change of intensity value. In the words of OpenCV:
After getting gradient magnitude and direction, a full scan of image is done to remove any unwanted pixels which may not constitute the edge. For this, at every pixel, pixel is checked if it is a local maximum in its neighborhood in the direction of gradient. If point A is on the edge, and point B and C are in gradient directions, point A is checked with point B and C to see if it forms a local maximum. If so, it is considered for next stage, otherwise, it is suppressed (put to zero).
The output of step (3) is a binary image with thin edges.
The code[13] demonstrates how you would implement such an NMS for the purpose of Canny edge detection.
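For intuition only, here is a simplified, unoptimized sketch of the idea (not how OpenCV implements it internally): quantize the gradient direction into four bins and keep a pixel only if it is the local maximum along that direction.
import numpy as np

def non_max_suppression(magnitude, angle):
    # keep a pixel only if it is the local maximum along the (quantized) gradient direction
    h, w = magnitude.shape
    out = np.zeros_like(magnitude)
    angle = np.rad2deg(angle) % 180                   # fold directions into [0, 180)
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            a = angle[i, j]
            if a < 22.5 or a >= 157.5:                # ~0 degrees: compare left/right
                neighbors = (magnitude[i, j - 1], magnitude[i, j + 1])
            elif a < 67.5:                            # ~45 degrees: one diagonal
                neighbors = (magnitude[i - 1, j + 1], magnitude[i + 1, j - 1])
            elif a < 112.5:                           # ~90 degrees: compare up/down
                neighbors = (magnitude[i - 1, j], magnitude[i + 1, j])
            else:                                     # ~135 degrees: the other diagonal
                neighbors = (magnitude[i - 1, j - 1], magnitude[i + 1, j + 1])
            if magnitude[i, j] >= max(neighbors):
                out[i, j] = magnitude[i, j]           # local maximum: keep it
    return out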
The final step of this multi-stage algorithm decides which among all edges are really edges and which of them are not. It accomplishes this using two threshold values, specified when we call the cv2.Canny() function:
canny = cv2.Canny(img, threshold1=50, threshold2=180)
Any edges with an intensity gradient above threshold2 are considered edges and any edges below threshold1 are considered non-edges and so are suppressed.
The edges that lie between these two values (in our code above, edges with an intensity gradient between 50 and 180) are classified as edges if they are connected to sure-edge pixels (the ones above 180); otherwise, they are also discarded.
This stage also removes small, isolated pixel groups ("noise") on the assumption that true edges form long, connected lines.
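If it helps to see the decision rule spelled out, the following is a rough sketch (again, not OpenCV's internal implementation) using connected-component labeling from scipy: weak candidates survive only when their group also contains a strong pixel.
import numpy as np
from scipy import ndimage

def hysteresis(magnitude, low=50, high=180):
    strong = magnitude >= high                   # sure edges
    weak = (magnitude >= low) & ~strong          # candidates in between
    labels, _ = ndimage.label(strong | weak)     # connected groups of candidate pixels
    keep = np.unique(labels[strong])             # groups that touch a sure edge
    keep = keep[keep != 0]
    return np.isin(labels, keep)                 # boolean edge map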
The full procedure is implemented in a single function, cv2.Canny(), whose first three parameters are required: the input image, the first threshold value and the second threshold value. canny_01.py implements this and compares it to the Sobel edge detector we developed earlier:

In the homework directory, you'll find a picture of scattered lego bricks lego.jpg. Exactly the kind of stuff you don't want on your bedroom floor, as anyone living with kids at home would testify.
Your job is to combine what you've learned in this lesson with what you've learned in the kernel convolutions chapter (kernel.md) to build a lego brick counter.
Note that there are many ways you can build an edge detector. Given what you've learned so far, there are at least 3 equally adequate routines you can apply to this particular problem set.
For the sake of this exercise, your script should feature the use of a Sobel Operator (or a similar gradient-based edge detection method) since this is the main topic of this chapter.

S.Kaur, I.Singh, Comparison between Edge Detection Techniques, International Journal of Computer Applications, July 2016 ↩︎ ↩︎ ↩︎
Carnegie Mellon University, Image Gradients and Gradient Filtering (16-385 Computer Vision) ↩︎
Image Gradients, OpenCV Documentation ↩︎
University of Victoria, Electrical and Computer Engineering, Computer Vision: Image Segmentation ↩︎
Image Thresholding, OpenCV Documentation ↩︎
C.Leubner, A Framework for Segmentation and Contour Approximation in Computer-Vision Systems, 2002 ↩︎
Contours: Getting Started, OpenCV Documentation ↩︎
Structural Analysis and Shape Descriptors, OpenCV Documentation ↩︎
Contours Hierarchy, OpenCV Documentation ↩︎
Shapiro, L. G. and Stockman, G. C, Computer Vision, London etc, 2001 ↩︎
Bastan, M., Bukhari, S., and Breuel, T., Active Canny: Edge Detection and Recovery with Open Active Contour Models, Technical University of Kaiserslautern, 2016 ↩︎
Maini, R. and Aggarwal, H., Study and Comparison of various Image Edge Detection Techniques, International Journal of Image Processing (IJIP) ↩︎
When performing an arithmetic computation on a given image, one approach is to apply said computation in a neighborhood-by-neighborhood manner. This approach is very broadly termed a convolution. In other words, convolution is an operation between every part of an image ("pixel neighborhood") and an operator ("kernel")[1][2].
As the computation slides over each pixel neighborhood, we perform some arithmetic using the kernel, with the kernel typically being represented as a matrix or a fixed size array.
This kernel describes how the pixels in that neighborhood are combined or transformed to yield a corresponding output.
You will notice from the video that the output image now has a shape that is smaller than the original input. Mathematically, the shape of this output would be:
$\left(\frac{X_m - M_i}{s_x} + 1,\ \frac{X_n - M_j}{s_y} + 1\right)$
Where the input matrix has a size of $(X_m, X_n)$, the kernel $M$ is of size $(M_i, M_j)$, $s_x$ represents the stride over rows while $s_y$ represents the stride over columns.
In the linked video, we are sliding the kernel in both the x- and y- direction by 1 pixel at a time after each computation, giving a value of 1 for $s_x$ and $s_y$. The input matrix in our video is of size 5x5, and our kernel is of size 3x3, giving us an output size of $\frac{5 - 3}{1} + 1 = 3$, i.e. a 3x3 output.
Expressed mathematically, the full procedure as implemented in opencv looks like this for a convolution:
$H(x, y) = \sum^{M_i-1}_{i=0}\sum^{M_j-1}_{j=0} I(x+i-a_i, y+j-a_j)K(i,j)$
We'll walk through the procedure step by step, given a kernel represented by the matrix M:
Place the kernel anchor (in this case, $3$) on top of a determined pixel, with the rest of the kernel overlaying the corresponding local pixels in the image
Multiply the kernel coefficients by the corresponding image pixel values and sum the result
Replace the value at the location of the anchor in the input image with the result
Repeat the process for all pixels by sliding the kernel across the entire image, as specified by the stride (a minimal numpy sketch of this procedure follows below)
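As a minimal sketch of the four steps above (no padding, a configurable stride in both directions, written with plain numpy rather than opencv; the output is indexed by window position):
import numpy as np

def convolve2d(image, kernel, stride=1):
    # slide the kernel over every neighborhood, multiply coefficients and sum
    m_i, m_j = kernel.shape
    out_h = (image.shape[0] - m_i) // stride + 1
    out_w = (image.shape[1] - m_j) // stride + 1
    out = np.zeros((out_h, out_w))
    for x in range(out_h):
        for y in range(out_w):
            region = image[x * stride : x * stride + m_i,
                           y * stride : y * stride + m_j]
            out[x, y] = np.sum(region * kernel)
    return out

image = np.arange(25).reshape(5, 5).astype(float)   # a 5x5 input, as in the video
kernel = np.ones((3, 3)) / 9.0                      # a 3x3 mean kernel
print(convolve2d(image, kernel).shape)              # (3, 3), matching the formula above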
Keen readers may observe from executing meanblur_02.py that the original dimension of our image is preserved after the convolution. This may seem unexpected given what we know about the formula to derive the output dimension.
As it turns out, to preserve the dimension between the input and output images, a common technique known as "padding" is applied. From the documentation itself,
For example, if you want to smooth an image using a Gaussian 3 * 3 filter, then, when processing the left-most pixels in each row, you need pixels to the left of them, that is, outside of the image. You can let these pixels be the same as the left-most image pixels (“replicated border” extrapolation method), or assume that all the non-existing pixels are zeros (“constant border” extrapolation method), and so on.
The various border interpolation techniques available in opencv are as below (image boundaries are denoted with '|'):
BORDER_REPLICATE: aaaaaa|abcdefgh|hhhhhhh
BORDER_REFLECT: fedcba|abcdefgh|hgfedcb
BORDER_REFLECT_101: gfedcb|abcdefgh|gfedcba
BORDER_WRAP: cdefgh|abcdefgh|abcdefg
BORDER_CONSTANT: iiiiii|abcdefgh|iiiiiii with some specified 'i'

It is useful to remember that OpenCV only supports convolving an image where the dimension of its output matches that of the input, so in almost all cases we need a way to extrapolate an extra layer of pixels around the borders. To specify an extrapolation method, supply the filtering method with an extra argument:
cv2.GaussianBlur(..., borderType=cv2.BORDER_CONSTANT)

Given what we've just learned, we can rewrite our formula to determine the output dimensions more generally, this time incorporating a padding of $P$ pixels on each border:
$\left(\frac{X_m - M_i + 2P}{s_x} + 1,\ \frac{X_n - M_j + 2P}{s_y} + 1\right)$
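To make the extrapolation explicit, here is a small sketch (using a randomly generated stand-in image) with cv2.copyMakeBorder, which adds exactly the padding a 3x3 "same" filter needs at a stride of 1:
import numpy as np
import cv2

img = np.random.randint(0, 256, (100, 100), dtype=np.uint8)   # a stand-in 100x100 image

# pad one pixel of zeros on every side (constant border extrapolation)
padded = cv2.copyMakeBorder(img, 1, 1, 1, 1, cv2.BORDER_CONSTANT, value=0)
print(img.shape, padded.shape)   # (100, 100) (102, 102)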
Before moving on to the next section, try and think through the following problem:
In the case of a 333x333 input image, with a stride of 1 and a kernel of size 5x5, what is the amount of zero-padding you should add to the borders of your image such that the output image is also 333x333?
To fully appreciate the idea of kernel convolutions, we'll see some real examples. We'll use cv2.filter2D to convolve over our image using the following kernel:
$K = \frac{1}{25}\begin{bmatrix}1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1\end{bmatrix}$
The kernel we specified above is equivalent to a normalized box filter of size 5. Having watched the video earlier, you may intuit that the outcome of such a convolution is that each pixel in the input image is replaced by the average of the 5x5 pixels around it. You are in fact correct, and if you are skeptical, we'll see proof of this in the Code Illustrations: Mean Filtering section of this coursebook.
Mathematically, by dividing our matrix by 25 (normalizing) we apply a control that stops our pixel values from being artificially inflated, since each output pixel is now a weighted average of its neighborhood.
A Note on Terminology
Kernels or Filters?
When all we've been talking about is kernels, why is it that we're using the "filter" terminology in opencv code instead? That depends on the context. In the case of a convolutional neural network, kernel and filter are used interchangeably: they both refer to the same thing.
Some computer vision researchers have proposed a stricter definition, preferring to use the term "kernel" for a 2D array of weights, like our matrix above, and the term "filter" for the 3D structure of multiple kernels stacked together[3], a concept we'll explore further in the Convolutional Neural Network part of this course.
Correlations vs Convolutions
Imaging specialists may point to the fact that opencv does not mirror / flip the kernel around the anchor point, and hence the operation doesn't qualify as a convolution under strict definitions of digital imaging theory. For a pure implementation of a "convolution", you should instead use scipy.ndimage.convolve(src, kernel), or use cv2.filter2D in conjunction with a flip on the kernel[4]. This is in large part owed to the difference in scientific parlance adopted by the various scientific communities, a phenomenon more common than you'd expect. As an additional example, deep learning scientists using convolutional neural networks (CNN) generally refer to a non-flipped kernel when performing convolution.
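As a quick sketch of this equivalence (on a small random array, purely for illustration), flipping the kernel before cv2.filter2D should reproduce scipy's textbook convolution:
import numpy as np
import cv2
from scipy import ndimage

img = np.random.randint(0, 256, (6, 6)).astype(np.float32)
kernel = np.array([[1, 0, -1],
                   [2, 0, -2],
                   [1, 0, -1]], dtype=np.float32)

# cv2.filter2D computes a correlation; flipping the kernel in both directions turns it into a convolution
conv_cv = cv2.filter2D(img, -1, cv2.flip(kernel, -1), borderType=cv2.BORDER_CONSTANT)

# scipy.ndimage.convolve implements the textbook definition (the kernel is flipped internally)
conv_scipy = ndimage.convolve(img, kernel, mode="constant", cval=0.0)

print(np.allclose(conv_cv, conv_scipy))   # expected: True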
meanblur_01.py demonstrates the construction of a 5x5 mean average filter using np.ones((5,5))/25. Because every coefficient is basically the same, this merely replaces the value of each pixel in our input image with the average of the values in its 5x5 neighborhood.
img = cv2.imread("assets/canal.png")
mean_blur = np.ones((5, 5), dtype="float32") * (1.0 / (5 ** 2))
smoothed_col = cv2.filter2D(img, -1, mean_blur)
Alternatively, we can be explicit in our creation of the 5x5 kernel using numpy's array:
mean_blur = np.array(
    [[0.04, 0.04, 0.04, 0.04, 0.04],
     [0.04, 0.04, 0.04, 0.04, 0.04],
     [0.04, 0.04, 0.04, 0.04, 0.04],
     [0.04, 0.04, 0.04, 0.04, 0.04],
     [0.04, 0.04, 0.04, 0.04, 0.04]])
To be fully convinced that the mean filtering operation is doing what we expect it to do, we can inspect the pixel values before and after the convolution, to verify that the math checks out by hand. We do this in meanblur_02.py.
img = cv2.imread("assets/canal.png") gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) print(f'Gray: {gray[:5, :5]}') # [[ 31 27 21 17 21] # [ 77 85 86 87 90] # [205 205 215 227 222] # [224 230 222 243 249] # [138 210 206 218 242]] for i in range(3): newval = np.round(np.mean(gray[:5, i:i+5])) print(f'Mean of 25x25 pixel #{i+1}: {np.int(newval)}') # output: # Mean of 25x25 pixel #1: 152 # Mean of 25x25 pixel #2: 158 # Mean of 25x25 pixel #3: 160
The code above shows that the output of such a convolution operation beginning at the top-left region of the image would be 152. As we slide along the horizontal direction and re-compute the mean of the neighborhood, we get 158. As we slide our kernel along the horizontal direction for a second time and re-compute the mean of the neighborhood we obtain the value of 160.
If you prefer you can verify these values by hand, using the raw pixel values from gray[:5, :5] (5x5 top-left region of the image).
mean_blur = np.ones(KERNEL_SIZE, dtype="float32") * (1.0 / (5 ** 2))
smoothed_gray = cv2.filter2D(gray, -1, mean_blur)
print(f'Smoothed: {smoothed_gray[:5, :5]}')
# output:
# [[122 123 125 127 128]
#  [126 127 128 131 132]
#  [148 149 152 158 160]
#  [177 179 184 196 202]
#  [197 199 204 222 229]]
Notice from the output of our mean filter that the first anchor (center of the 5x5 neighborhood) has transformed from 215 to 152, the one to its right from 227 to 158, and so on. The math does work out, and you can observe the blur effect directly by running meanblur_02.py.
As it turns out, opencv provides a set of convenience functions to apply filtering onto our images. All three approaches below yield the same output, as can be verified from the output pixel values after executing meanblur_03.py:
# approach 1
mean_blur = np.ones(KERNEL_SIZE, dtype="float32") * (1.0 / (5 ** 2))
smoothed_gray = cv2.filter2D(gray, -1, mean_blur)

# approach 2
smoothed_gray = cv2.blur(gray, KERNEL_SIZE)

# approach 3
smoothed_gray = cv2.boxFilter(gray, -1, KERNEL_SIZE)
There are several types of kernels we can apply to achieve a blur filter on our image. The averaging filter method serves as a good introductory point because it is easy to intuit about, but it is good to know that opencv provides a collection of convenience functions, each being an implementation of some blurring filter. See Handy kernels for image processing for a list of smoothing kernels implemented in opencv.
Earlier, it was said that kernels play an integral role in all modern convolutional neural network architectures. Using TensorFlow, one would rely on the tf.nn.conv2d function to perform a 2D convolution. The syntax looks like this:
tf.nn.conv2d(
    input,
    filter,
    strides,
    padding,
    use_cudnn_on_gpu=None,
    data_format=None,
    name=None
)
Where:
- input is assumed to be a tensor of shape (batch, height, width, channels), where batch is the number of images in a minibatch
- filter is a tensor of shape (filter_height, filter_width, channels, out_channels) that specifies the learnable weights for the nonlinear transformation learned in the convolutional kernel
- strides contains the filter strides and is a list of length 4 (one for each input dimension)
- padding determines whether the input tensors are padded (with extra zeros) to guarantee that the output from the convolutional layer has the same shape as the input. padding="SAME" adds padding to the input and padding="VALID" results in no padding

Worth noting is that the input and filter parameters follow what we've implemented using opencv thus far. When we're applying a filter like the mean blur example earlier, we slide our kernel along with a stride of 1. In TensorFlow code, we would set strides=[1,1,1,1] so that the kernel slides by 1 unit across all 4 dimensions (x, y, channel, and image index).
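For concreteness, a minimal sketch follows. Note the assumptions: TensorFlow is not part of this repository's requirements.txt, and in TensorFlow 2.x the second parameter is named filters rather than filter:
import numpy as np
import tensorflow as tf   # assumption: TensorFlow 2.x installed separately

x = tf.constant(np.random.rand(1, 32, 32, 1), dtype=tf.float32)    # one 32x32 single-channel image
w = tf.constant(np.ones((5, 5, 1, 1)) / 25.0, dtype=tf.float32)    # a single 5x5 mean filter

valid = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding="VALID")  # no padding
same = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding="SAME")    # zero-padded to preserve shape
print(valid.shape, same.shape)   # (1, 28, 28, 1) (1, 32, 32, 1)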
Example of a Convolutional Neural Network architecture[5]:

Notice from the image that the dimension of our output from the first convolution layer is smaller (28x28) than its input (32x32) when we perform the operation without padding. C1 and C3 are examples of this in the above illustration.
In S1 and S2, we're applying a max-pooling filter to down-sample our image representation, allowing our network to learn the parameters from the higher-order representations in each region of the image. An example operation is depicted below:

- cv2.blur(img, KERNEL_SIZE): meanblur_03.py, replaces each pixel with the mean of its neighboring pixels
- cv2.medianBlur(img, ksize): replaces each pixel with the median of its neighboring pixels (ksize is a single odd integer)
- cv2.GaussianBlur(img, KERNEL_SIZE, 0): weighs the neighborhood with a Gaussian kernel (discussed in the next section)
- cv2.bilateralFilter(img, d, sigmaColor, sigmaSpace): smooths while preserving edges by also weighing pixels by color similarity
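A short sketch of how these convenience functions might be called on the canal image used throughout this chapter (the parameter values below are just reasonable illustrative choices, not prescriptions):
import cv2

img = cv2.imread("assets/canal.png")
mean = cv2.blur(img, (5, 5))                      # box / mean filter
median = cv2.medianBlur(img, 5)                   # kernel size is a single odd integer
gaussian = cv2.GaussianBlur(img, (5, 5), 0)       # sigma derived from the kernel size when 0
bilateral = cv2.bilateralFilter(img, 9, 75, 75)   # d, sigmaColor, sigmaSpace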
The Gaussian filter deserves its own section given its prevalence in image processing. It is achieved by convolving each point in the input array (read: each pixel in our image) with a Gaussian kernel and taking the sum to produce the output array.
If you remember your lessons from statistics, you may recall a 1D gaussian distribution looks like this:

For completeness' sake, the code to graph the distribution above is in utils/gaussiancurve.r.
For a 1-dimensional image, the pixel located in the middle would be assigned the largest weight, with the weight of its neighbours decreasing as the spatial distance between them and the center pixel increases.
For the mathematically inclined, the graphed distribution above is generated from the Gaussian function[6]: $g(x) = e^{\frac{-x^2}{2\sigma^2}}$
Where $x$ is the spatial distance between the center pixel and the corresponding neighbor unit.
For a 1D kernel of size 7, each pixel would therefore be weighted accordingly: $g(x) = \begin{bmatrix}.011 & .13 & .6 & 1 & .6 & .13 & .011\end{bmatrix}$
The above should not be hard to intuit about: if we refer back to the graphed distribution, we can see that for the center pixel (at position x=0), $g(x)$ evaluates to a value of $1$.
import numpy as np
weights = []
sd = 1
for i in range(4):
    weights += [np.round(np.exp((-i**2)/(2*sd**2)), 3)]
print(weights)
# output:
# [1.0, 0.607, 0.135, 0.011]
For a 2D kernel, the formula would take the form of: $g(x,y) = e^{\frac{-(x^2+y^2)}{2\sigma^2}}$
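As a small illustrative sketch (the size and sigma below are arbitrary choices), we can build such a 2D kernel directly from the formula and normalize it so the weights sum to 1:
import numpy as np

sd = 1.0
size = 5
ax = np.arange(size) - size // 2                          # [-2, -1, 0, 1, 2]
xx, yy = np.meshgrid(ax, ax)
kernel = np.exp(-(xx ** 2 + yy ** 2) / (2 * sd ** 2))     # g(x, y) from the formula above
kernel /= kernel.sum()                                    # normalize the weights
print(np.round(kernel, 3))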
When we compare the output of a mean filter to a gaussian filter, as in the example script in gaussianblur_01.py, we can then observe the difference in output visually:

This should also come as little surprise, since the mean filter just replaces each pixel with the average of its neighboring pixels, essentially giving a coefficient of 1 (before normalization) to every cell in the 5x5 grid.
Gaussian filters, on the other hand, weigh pixels using a Gaussian distribution (think: bell curve in 2D space) around the center pixel, such that farther pixels are given a lower coefficient than nearer ones.
The opposite of blurring would be sharpening. There are again several approaches to this, and we'll start by looking at specifically two of them.
The first approach relies on the familiar cv2.filter2D() function to apply the following kernel, and is implemented in sharpening_01.py: $K = \begin{bmatrix} -1 & -1 & -1 \\ -1 & 9 & -1 \\ -1 & -1 & -1 \end{bmatrix}$
The outcome:

We can apply the same principles behind a Gaussian kernel for sharpening operations (as opposed to blurring). The full script is in sharpening_02.py but the essential parts are as follows:
approx_gaussian = (
    np.array(
        [
            [-1, -1, -1, -1, -1],
            [-1, 2, 2, 2, -1],
            [-1, 2, 8, 2, -1],
            [-1, 2, 2, 2, -1],
            [-1, -1, -1, -1, -1],
        ]
    ) / 8.0
)
sharpen_col = cv2.filter2D(img, -1, approx_gaussian)
Notice how this method uses an approximate Gaussian kernel and that the result is an overall more natural smoothing:

The second approach is known as "unsharp masking", derived from the fact that the technique uses a blurred, or "unsharp", negative image to create a mask of the original image[7]. This technique is one of the oldest tools in photographic processing (tracing back to the 1930s), and popular tools such as Adobe Photoshop and GIMP have direct implementations of it named, appropriately, Unsharp Mask.
Lifted straight from the Wikipedia article itself, a "typical blending formula for unsharp masking is sharpened = original + (original - blurred) * amount". Amount represents how much contrast is added to the edges.
To rewrite the formula, we get:
$\begin{aligned} Sharpened & = O + (O-B) \cdot a \\ & = O + Oa - Ba \\ & = O(1+a) + B(-a)\end{aligned}$
Where $a$ is the amount, $B$ is the blurred image (mask) and $O$ is the original image. The final form is convenient because we can plug it into cv2.addWeighted and get an output. From OpenCV's documentation, the function addWeighted calculates the weighted sum of two arrays as follows: $dst(I) = saturate(src1(I) * alpha + src2(I) * beta + gamma)$
When you perform the arithmetic above, you will find that the values (e.g. src1(I) * alpha, when alpha is > 1.5, may produce values greater than 255) can fall outside the range of 0 to 255. Saturation clips the value in a way that is synonymous with the following: $Saturate(x) = min(max(round(x), 0), 255)$
The following code demonstrates the unsharp masking technique:
img = cv2.imread("assets/sarpi.png") amt = 1.5 blurred = cv2.GaussianBlur(img, (5,5), 10) unsharp = cv2.addWeighted(img, 1+amt, blurred, -amt, 0) unsharp_manual = np.clip(img * (1+amt) + blurred * (-amt), 0, 255) cv2.imshow("Unsharp Masking", unsharp)

You can find the sample code for this in unsharpmask_01.py (using addWeighted) and in unsharpmask_02.py (manual calculation) respectively.
Why go to such lengths on the mathematical ideas behind image filtering operations?
Filtering is perhaps the most fundamental operation of image processing and computer vision. In the broadest sense of the term "filtering", the value of the filtered image at a given location is a function of the values of the input image in a small neighborhood of the same location.[8]
It is fundamental to a host of common image processing techniques, from enhancements (sharpening, denoising, contrast adjustment) to edge detection, texture detection and, in the case of deep learning, feature detection.
To help with your recall, I made a simple illustration below:

Whenever you're ready, move on to edgedetect.md to learn the essentials of edge detection using kernel operations.
Making your own linear filters, OpenCV Documentation ↩︎
Bradski, Kaehler, Learning OpenCV ↩︎
Stack Exchange, https://stats.stackexchange.com/a/366940 ↩︎
R.Zadeh and B.Ramsundar, TensorFlow for Deep Learning, O'Reilly Media ↩︎
Wikipedia, Gaussian function, https://en.wikipedia.org/wiki/Gaussian_function ↩︎
W.Fulton, A few scanning tips, Sharpening - Unsharp Mask ↩︎
C. Tomasi and R. Manduchi, "Bilateral Filtering for Gray and Color Images", Proceedings of the 1998 IEEE International Conference on Computer Vision, Bombay, India. ↩︎
For completeness' sake, the code to graph the distribution above is in `utils/gaussiancurve.r`.
For a 1-dimensional image, the pixel located in the middle would be assigned the largest weight, with the weight of its neighbours decreasing as the spatial distance between them and the center pixel increases.
For the mathematically inclined, the graphed distribution above is generated from the Gaussian function[^6]:
$$g(x) = e^{\frac{-x^2}{2\sigma^2}}$$
Where $x$ is the spatial distance between the center pixel and the corresponding neighbor unit.
For a 1D kernel of size 7, each pixel would therefore be weighted accordingly:
$$g(x) = \begin{bmatrix}.011 & .13 & .6 & 1 & .6 & .13 & .011\end{bmatrix}$$
The above should not be hard to intuit about: if we refer back to the graphed distribution, we can see that for the center pixel (at position x=0), $g(x)$ evaluates to a value of $1$.
```py
import numpy as np
weights = []
sd = 1
for i in range(4):
    weights += [np.round(np.exp((-i**2)/(2*sd**2)),3)]
print(weights)
# output:
# [1.0, 0.607, 0.135, 0.011]
```
For a 2D kernel, the formula would take the form of:
$$g(x,y) = e^{\frac{-(x^2+y^2)}{2\sigma^2}}$$
When we compare the output of a mean filter to a gaussian filter, as in the example script in `gaussianblur_01.py`, we can then observe the difference in output visually:

This should also come as little surprise, since the mean filter just replaces each pixel with the average of its neighboring pixels, essentially giving a coefficient of 1 (before normalization) to every cell in the 5x5 grid.
Gaussian filters, on the other hand, **weigh pixels using a gaussian distribution** (think: bell curve in a 2d space) around the center pixel, such that farther pixels are given a lower coefficient than nearer ones.
#### Sharpening Kernels
The opposite of blurring would be sharpening. There are again several approaches to this, and we'll start by looking at specifically two of them.
The first approach relies on the familiar `cv2.filter2D()` function to apply the following kernel, and is implemented in `sharpening_01.py`:
$$K = \begin{bmatrix} -1 & -1 & -1 \\ -1 & 9 & -1 \\ -1 & -1 & -1 \end{bmatrix}$$
The outcome:

##### Approximate Gaussian Kernel for Sharpening
We can apply the same principles behind a Gaussian kernel for sharpening operations (as opposed to blurring). The full script is in `sharpening_02.py` but the essential parts are as follow:
```py
approx_gaussian = (
    np.array(
        [
            [-1, -1, -1, -1, -1],
            [-1, 2, 2, 2, -1],
            [-1, 2, 8, 2, -1],
            [-1, 2, 2, 2, -1],
            [-1, -1, -1, -1, -1],
        ]
    ) / 8.0
)
sharpen_col = cv2.filter2D(img, -1, approx_gaussian)
```
Notice how this method uses an approximate Gaussian kernel and that the result is an overall more natural smoothing:

##### Unsharp Masking
The second approach is known as "unsharp masking", derived from the fact that the technique uses a blurred, or "unsharp", negative image to create a mask of the original image[^7]. This technique is one of the oldest tools in photographic processing (tracing back to the 1930s), and popular tools such as Adobe Photoshop and GIMP have direct implementations of it named, appropriately, Unsharp Mask.
Lifted straight from the Wikipedia article itself, a "typical blending formula for unsharp masking is **sharpened = original + (original - blurred) * amount**". **Amount** represents how much contrast is added to the edges.
To rewrite the formula, we get:
$$\begin{aligned}
Sharpened & = O + (O-B) \cdot a \\
& = O + Oa - Ba \\
& = O (1+a) + B(-a)\end{aligned}$$
Where $a$ is the amount, $B$ is the blurred image (mask) and $O$ is the original image. The final form is convenient because we can plug it into `cv2.addWeighted` and get an output. From OpenCV's documentation, the function `addWeighted` calculates the weighted sum of two arrays as follows:
$$dst(I) = saturate(src1(I) * alpha + src2(I) * beta + gamma)$$
When you perform the arithmetic above, you will find that the values (e.g. `src1(I) * alpha`, when alpha is > 1.5, may produce values greater than 255) can fall outside the range of 0 to 255. Saturation clips the value in a way that is synonymous with the following:
$$Saturate(x) = min(max(round(r), 0), 255)$$
The following code demonstrates the unsharp masking technique:
```py
img = cv2.imread("assets/sarpi.png")
amt = 1.5
blurred = cv2.GaussianBlur(img, (5,5), 10)
unsharp = cv2.addWeighted(img, 1+amt, blurred, -amt, 0)
unsharp_manual = np.clip(img * (1+amt) + blurred * (-amt), 0, 255)
cv2.imshow("Unsharp Masking", unsharp)
```

You can find the sample code for this in `unsharpmask_01.py` (using `addWeighted`) and in `unsharpmask_02.py` (manual calculation) respectively.
## Summary and Key Points
Why go to such lengths on the mathematical ideas behind image filtering operations?
> Filtering is perhaps the most fundamental operation of image processing and computer vision. In the broadest sense of the term "filtering", the value of the filtered image at a given location is a function of the values of the input image in a small neighborhood of the same location.[^8]
It is fundamental to a host of common image processing techniques, from enhancements (sharpening, denoising, contrast adjustment) to edge detection, texture detection and, in the case of deep learning, feature detection.
To help with your recall, I made a simple illustration below:

Whenever you're ready, move on to `edgedetect.md` to learn the essentials of edge detection using kernel operations.
## References
[^1]: Making your own linear filters, [OpenCV Documentation](https://docs.opencv.org/2.4/doc/tutorials/imgproc/imgtrans/filter_2d/filter_2d.html)
[^2]: Bradski, Kaehler, Learning OpenCV
[^3]: Stack Exchange, https://stats.stackexchange.com/a/366940
[^4]: [OpenCV Documentation](http://docs.opencv.org/modules/imgproc/doc/filtering.html#filter2d)
[^5]: R.Zadeh and B.Ramsundar, TensorFlow for Deep Learning, O'Reilly Media
[^6]: Wikipedia, Gaussian function, https://en.wikipedia.org/wiki/Gaussian_function
[^7]: W.Fulton, A few scanning tips, Sharpening - Unsharp Mask
[^8]: C. Tomasi and R. Manduchi, "Bilateral Filtering for Gray and Color Images", Proceedings of the 1998 IEEE International Conference on Computer Vision, Bombay, India.
================================================
FILE: edgedetect/meanblur_01.py
================================================
import numpy as np
import cv2
KERNEL_SIZE = (5, 5)
img = cv2.imread("assets/canal.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
cv2.imshow("Gray", gray)
cv2.waitKey(0)
# Create the following 5x5
# np.array(
# [[0.04, 0.04, 0.04, 0.04, 0.04],
# [0.04, 0.04, 0.04, 0.04, 0.04],
# [0.04, 0.04, 0.04, 0.04, 0.04],
# [0.04, 0.04, 0.04, 0.04, 0.04],
# [0.04, 0.04, 0.04, 0.04, 0.04]])
mean_blur = np.ones(KERNEL_SIZE, dtype="float32") * (1.0 / (5 ** 2))
smoothed_col = cv2.filter2D(img, -1, mean_blur)
smoothed_gray = cv2.filter2D(gray, -1, mean_blur)
cv2.imshow("Smoothed Colored", smoothed_col)
cv2.waitKey(0)
cv2.imshow("Smoothed Gray", smoothed_gray)
cv2.waitKey(0)
================================================
FILE: edgedetect/meanblur_02.py
================================================
import numpy as np
import cv2
KERNEL_SIZE = (5, 5)
img = cv2.imread("assets/canal.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
print(f'Gray: {gray[:5, :5]}')
print(f'Shape of Original: {gray.shape}')
for i in range(3):
    newval = np.round(np.mean(gray[:5, i:i+5]))
    print(f'Mean of 25x25 pixel #{i+1}: {np.int(newval)}')
cv2.imshow("Gray", gray)
cv2.waitKey(0)
mean_blur = np.ones(KERNEL_SIZE, dtype="float32") * (1.0 / (5 ** 2))
smoothed_col = cv2.filter2D(img, -1, mean_blur)
smoothed_gray = cv2.filter2D(gray, -1, mean_blur)
cv2.imshow("Smoothed Colored", smoothed_col)
cv2.waitKey(0)
cv2.imshow("Smoothed Gray", smoothed_gray)
cv2.waitKey(0)
print(f'Smoothed: {smoothed_gray[:5, :5]}')
print(f'Shape of Smoothed: {smoothed_gray.shape}')
================================================
FILE: edgedetect/meanblur_03.py
================================================
import numpy as np
import cv2
KERNEL_SIZE = (5, 5)
img = cv2.imread("assets/canal.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
print(f'Gray: {gray[:5, :5]}')
print(f'Shape of Original: {gray.shape}')
for i in range(3):
    newval = np.round(np.mean(gray[:5, i:i+5]))
    print(f'Mean of 25x25 pixel #{i+1}: {np.int(newval)}')
cv2.imshow("Gray", gray)
cv2.waitKey(0)
smoothed_col = cv2.blur(img, KERNEL_SIZE)
# equivalently:
# smoothed_gray = cv2.boxFilter(gray, -1, KERNEL_SIZE)
smoothed_gray = cv2.blur(gray, KERNEL_SIZE)
cv2.imshow("Smoothed Colored", smoothed_col)
cv2.waitKey(0)
cv2.imshow("Smoothed Gray", smoothed_gray)
cv2.waitKey(0)
print(f'Smoothed: {smoothed_gray[:5, :5]}')
print(f'Shape of Smoothed: {smoothed_gray.shape}')
================================================
FILE: edgedetect/sharpening_01.py
================================================
import numpy as np
import cv2
img = cv2.imread("assets/canal.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
for i in range(3):
    newval = np.round(np.mean(gray[:5, i : i + 5]))
    print(f"Mean of 25x25 pixel #{i+1}: {np.int(newval)}")
cv2.imshow("Gray", gray)
cv2.waitKey(0)
sharpen = np.array([[-1, -1, -1], [-1, 9, -1], [-1, -1, -1]])
sharpen_col = cv2.filter2D(img, -1, sharpen)
sharpen_gray = cv2.filter2D(gray, -1, sharpen)
cv2.imshow("Sharpen Colored", sharpen_col)
cv2.waitKey(0)
cv2.imshow("Sharpen Gray", sharpen_gray)
cv2.waitKey(0)
================================================
FILE: edgedetect/sharpening_02.py
================================================
import numpy as np
import cv2
img = cv2.imread("assets/canal.png")
cv2.imshow("Original", img)
cv2.waitKey(0)
approx_gaussian = (
np.array(
[
[-1, -1, -1, -1, -1],
[-1, 2, 2, 2, -1],
[-1, 2, 8, 2, -1],
[-1, 2, 2, 2, -1],
[-1, -1, -1, -1, -1],
]
)
/ 8.0
)
sharpen_col = cv2.filter2D(img, -1, approx_gaussian)
cv2.imshow("Sharpen (approx. Gaussian)", sharpen_col)
cv2.waitKey(0)
cv2.waitKey(0)
================================================
FILE: edgedetect/sobel_01.py
================================================
import numpy as np
import cv2
import matplotlib.pyplot as plt
img = cv2.imread("assets/sudoku.jpg", 0)
img = cv2.medianBlur(img, 5)
img = cv2.GaussianBlur(img, (7, 7), 0)
cv2.imshow("Image", img)
cv2.waitKey(0)
gradient_x = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)
gradient_y = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)
print(f"Range: {np.min(gradient_x)} | {np.max(gradient_x)}")
gradient_x = np.uint8(np.absolute(gradient_x))
gradient_y = np.uint8(np.absolute(gradient_y))
print(f"Range uint8: {np.min(gradient_x)} | {np.max(gradient_x)}")
cv2.imshow("Gradient X", gradient_x)
cv2.waitKey(0)
cv2.imshow("Gradient Y", gradient_y)
cv2.waitKey(0)
# plt.imshow(gradient_x, cmap="gray")
# plt.show()
================================================
FILE: edgedetect/sobel_02.py
================================================
import numpy as np
import cv2
import matplotlib.pyplot as plt
img_original = cv2.imread("assets/castello.png")
img_original = cv2.cvtColor(img_original, cv2.COLOR_BGR2RGB)
img = cv2.cvtColor(img_original, cv2.COLOR_BGR2GRAY)
img = cv2.medianBlur(img, 9)
img = cv2.GaussianBlur(img, (9, 9), 0)
gradient_x = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)
gradient_y = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)
gradient_x = cv2.convertScaleAbs(gradient_x)
gradient_y = cv2.convertScaleAbs(gradient_y)
print(f"Range: {np.min(gradient_x)} | {np.max(gradient_x)}")
gradient_xy = cv2.addWeighted(gradient_x, 0.5, gradient_y, 0.5, 0)
plt.subplot(2, 2, 1), plt.imshow(img_original)
plt.title("Original"), plt.xticks([]), plt.yticks([])
plt.subplot(2, 2, 2), plt.imshow(gradient_x, cmap="gray")
plt.title("Gradient X"), plt.xticks([]), plt.yticks([])
plt.subplot(2, 2, 3), plt.imshow(gradient_y, cmap="gray")
plt.title("Gradient Y"), plt.xticks([]), plt.yticks([])
plt.subplot(2, 2, 4), plt.imshow(gradient_xy, cmap="gray")
plt.title("Gradient X and Y"), plt.xticks([]), plt.yticks([])
plt.show()
================================================
FILE: edgedetect/sobel_03.py
================================================
import numpy as np
import cv2
import matplotlib.pyplot as plt
img = cv2.imread("assets/castello.png", flags=0)
img = cv2.medianBlur(img, 9)
img = cv2.GaussianBlur(img, (9, 9), 0)
gradient_x = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)
gradient_y = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)
gradient_x = cv2.convertScaleAbs(gradient_x)
gradient_y = cv2.convertScaleAbs(gradient_y)
print(f"Range: {np.min(gradient_x)} | {np.max(gradient_x)}")
gradient_xy = cv2.addWeighted(gradient_x, 0.5, gradient_y, 0.5, 0)
plt.imshow(gradient_xy, cmap="gray")
plt.title("Sobel Edge")
plt.show()
================================================
FILE: edgedetect/unsharpmask_01.py
================================================
import numpy as np
import cv2
KERNEL_SIZE = (5, 5)
img = cv2.imread("assets/sarpi.png")
cv2.imshow("Original", img)
cv2.waitKey(0)
amt = 1.5
blurred = cv2.GaussianBlur(img, (5,5), 10)
unsharp = cv2.addWeighted(img, 1+amt, blurred, -amt, 0)
cv2.imshow("Unsharp Masking", unsharp)
cv2.waitKey(0)
================================================
FILE: edgedetect/unsharpmask_02.py
================================================
import numpy as np
import cv2
KERNEL_SIZE = (5, 5)
img = cv2.imread("assets/sarpi.png")
cv2.imshow("Original", img)
cv2.waitKey(0)
amt = 1.5
blurred = cv2.GaussianBlur(img, (5,5), 10)
unsharp_manual = np.clip(img * (1+amt) + blurred * (-amt), 0, 255)
# unsharp_manual = img * (1+amt) + blurred * (-amt)
# unsharp_manual = np.maximum(unsharp_manual, np.zeros(unsharp_manual.shape))
# unsharp_manual = np.minimum(unsharp_manual, 255 * np.ones(unsharp_manual.shape))
unsharp_manual = unsharp_manual.round().astype(np.uint8)
cv2.imshow("Unsharp Masking Manual", unsharp_manual)
cv2.waitKey(0)
================================================
FILE: edgedetect/utils/gaussiancurve.r
================================================
x <- seq(-3, 3, length=1000000)
y <- dnorm(x, mean=0, sd=1)
plot(x, y, type="l", lwd=1, ylab="g(x)")
================================================
FILE: quiz.md
================================================
## Affine Transformation
1. Which of the following constructs the correct transformation matrix to perform a 2x scaling?
- [ ] `np.float32([[2, 0, 0], [0, 2, 0]])`
- [ ] `np.float32([[0, 2, 0], [0, 2, 0]])`
- [ ] `np.float32([[2, 2, 2], [0, 0, 0]])`
- [ ] `np.float32([[2, 1, 1], [1, 2, 1]])`
2. In the case of a 333x333 input image, with a stride of 1 and a kernel of size 5x5, what is the amount of zero-padding you should add to the borders of your image such that the output image is also 333x333?
- [ ] 1
- [ ] 2
- [ ] 3
- [ ] No zero-padding
## Kernels and Convolution
3. For an input image of size 140W (Width) x 600H (Height), suppose we perform a convolution with stride S=1 using a filter of size 7W x 7H and two pixels of constant-padding (padding our image with a constant value of 5). What would the dimension of our output image be?
- [ ] 135 Width x 595 Height
- [ ] 140 Width x 600 Height
- [ ] 138 Width x 598 Height
- [ ] None of the answers above
## Thresholding Edge Detection
4. In an image with lighting conditions that result in some parts of the image being shaded differently than the others, which of the thresholding techniques may yield a more robust output?
- [ ] Pixel-intensity based thresholding
- [ ] Otsu's global thresholding method
- [ ] Adaptive thresholding
5. We want to retrieve only the extreme outer contours. We do not need to store all the boundary points to minimise redundancy and save memory requirements. Which are the values to be passed into the findContours() function?
- [ ] RETR_EXTERNAL, CHAIN_APPROX_SIMPLE
- [ ] RETR_EXTERNAL, CHAIN_APPROX_NONE
- [ ] RETR_OUTER, CHAIN_APPROX_SIMPLE
- [ ] RETR_OUTER, CHAIN_APPROX_NONE
- [ ] RETR_LIST, CHAIN_APPROX_NONE
6. Which of the following intensity gradient values would the function call cv2.Canny(img, 50, 180) determine to be a definite edge?
- [ ] 40
- [ ] 100
- [ ] 200
7. Which of the following is NOT part of the Canny Edge procedure?
- [ ] Compute gradient in each direction
- [ ] Suppress edges that are non-maximal
- [ ] Discard pixels that are more likely noise than true edges
- [ ] Retrieve only the extreme outer contours from the edges
================================================
FILE: requirements.txt
================================================
cycler==0.10.0
decorator==4.4.1
imageio==2.6.1
imutils==0.5.3
joblib==0.14.0
kiwisolver==1.1.0
mahotas==1.4.9
matplotlib==3.1.1
networkx==2.4
numpy==1.17.4
opencv-contrib-python==4.1.1.26
Pillow==8.1.1
pip==21.1
pyparsing==2.4.5
python-dateutil==2.8.1
PyWavelets==1.1.1
scikit-image==0.16.2
scikit-learn==0.21.3
scipy==1.3.2
setuptools==41.6.0
six==1.13.0
wheel==0.33.6
================================================
FILE: summarynotes/class2201.md
================================================
# Computer Vision (Chapter 1 to 3)
## Administrative Details
- Prerequisites:
- Python 3
- OpenCV
- Numpy (automatically installed as dependency to opencv)
- Tip: Use `pip install -r requirements.txt` to install from the requirement file (`requirements.txt`) in the repo. Get help from Teaching Assistant (Tommy) or myself before the beginning of the class
- Any code editor
- Atom, VSCode, Sublime etc...
- Personally, I use VSCode (free)
- Materials
- https://github.com/onlyphantom/cvessentials
- WiFi
- Network: Accelerice
- Password: gapura19
## Day 1
1. Synonymous role to data preprocessing
Data Analysis
- Read data (usually using pandas as pd)
- Inspect your data (dat.shape)
- Data Preprocessing
- Reshape, ...
2. Basic Routine
```
import cv2
import numpy as np
img = cv2.imread("Desktop/family.png")
print(img.shape) # output: (h, w, c)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
cv2.imshow("Gray Image", gray)
cv2.waitKey(0)
```
3. Affine Transformation
```
import cv2
import numpy as np
img = cv2.imread("Desktop/family.png")
(h, w, c) = img.shape
print(f'Height: {h}; Width: {w}')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# option 1: create 2x3 matrix
mat = np.float32([[1, 0, 0], [0, 1, 0]])
# option 2: ask for a 2x3 matrix
center = (w // 2, h // 2)
mat = cv2.getRotationMatrix2D(center, angle=180, scale=1)
# or derive from three pairs of corresponding points (src, dst are 3x2 np.float32 arrays)
# mat = cv2.getAffineTransform(src, dst)
transformed = cv2.warpAffine(gray, mat, dsize=(w, h))  # dsize is (width, height)
cv2.imshow("Transformed", transformed)
cv2.waitKey(0)
```
================================================
FILE: transformation/lecture_affine.html
================================================
Any transformation that can be expressed in the form of a matrix multiplication (linear transformation) followed by a vector addition (translation).
In which $A$ is a $2 \times 2$ matrix representing the linear transformation and $B$ is a $2 \times 1$ vector representing the translation:
$A = \begin{bmatrix} a_{00} & a_{01} \\ a_{10} & a_{11} \end{bmatrix}, \quad B = \begin{bmatrix} b_{00} \\ b_{10} \end{bmatrix}$
When concatenated horizontally, this can be expressed in a larger matrix: $M = \begin{bmatrix} A & B \end{bmatrix} = \begin{bmatrix} a_{00} & a_{01} & b_{00} \\ a_{10} & a_{11} & b_{10} \end{bmatrix}$
By the definition above (matmul + vector addition), affine transformation can be used to achieve:
Affine transformation preserves points, straight lines, and planes. Parallel lines will remain parallel. It does not however preserve the distance and angles between points.
We represent an Affine Transformation using a 2x3 matrix.
Consider the goal of transforming a 2D vector $X = \begin{bmatrix} x \\ y \end{bmatrix}$ using $A$ and $B$ to obtain $T$; we can do it like such: $T = A \cdot \begin{bmatrix} x \\ y \end{bmatrix} + B$
Or equivalently: $T = M \cdot \begin{bmatrix} x & y & 1 \end{bmatrix}^{T}$
In scale_04.py from the Examples and Illustrations section, you'll see that the 2x3 matrix is simply defined as such:
np.float32([[3, 0, 0], [0, 3, 0]])
When you explicitly specify a 2x3 matrix, think of the first two columns as the $A$ component, or the matrix-multiplication process. The third column, naturally, represents the $B$ component, or the vector addition process. This may sound a little abstract, so I encourage you to pause and take a look at the code below:
(h, w) = img.shape[:2]
mat = np.float32([[1, 0, -140], [0, 1, 20]])
translated = cv2.warpAffine(img, mat, (w, h))
cv2.imshow("Translated", translated)
Notice that our $A$ is an identity matrix of size 2. An identity matrix is the matrix equivalent of the scalar 1: multiplying a matrix (or vector) by the identity matrix leaves it unchanged.
Which leads to: $A \cdot \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} x \\ y \end{bmatrix}$
And our $B$, the vector addition component, moves each pixel -- or more formally, translates each pixel -- on the image by -140 in the $x$ direction and 20 in the $y$ direction. Find the full code example in translate_01.py.
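To see the arithmetic on a single pixel (the coordinates below are arbitrary, purely for illustration), we can multiply the 2x3 matrix with the pixel's homogeneous coordinates by hand:
import numpy as np

mat = np.float32([[1, 0, -140], [0, 1, 20]])
point = np.array([200, 50, 1])      # a pixel at (x=200, y=50) in homogeneous form
print(mat @ point)                  # [ 60.  70.]: moved by -140 in x and +20 in y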
Imaging systems in the real-world are often subject to geometric distortion. The distortion may be introduced by perspective irregularities, physical constraints (e.g camera placements), or other reasons.
In the field of GIS (geographic information systems), routinely one would use affine transformation to "convert" geographic coordinates into screen coordinates such that it can be displayed and presented on our handheld / navigational devices.
One may also overlay coordinate data on spatial data that references a different coordinate system, or "stitch" together different sources of data using a series of transformations.
These are but a handful of examples where one may expect to see routine use of affine transformations. If you're spending any amount of time in computer vision, a high degree of familiarity with these remapping routines in OpenCV will come in very handy.
In your learn-by-building section, you will find a less-than-perfectly-digitalized map, belitung_raw.jpg. Your job is to apply the necessary affine transformations to correct its perspective distortion and resize the map accordingly.
Given the importance of such a relation between two images, it should come as no surprise that opencv packs a number of convenience methods to help us specify this transformation. The two common use-cases are:
Manually specify the 2x3 matrix using numpy:
img = cv2.imread("our_image.png")
mat = np.float32([[3, 0, 0], [0, 3, 0]])
result = cv2.warpAffine(img, M=mat, dsize=(600, 600))
cv2.imshow("Transformed", result)
img = cv2.imread("our_image.png") coords_s = np.float32([[10, 10], [80, 10], [10, 80]]) coords_d = np.float32([[10, 10], [95, 10], [10, 80]]) mat = cv2.getAffineTransform(src=coords_s, dst=coords_d) result = cv2.warpAffine(img, M=mat, dsize=(200, 200)) cv2.imshow("Transformed", result)
Had we printed out mat from the snippet of code above, we would see a 2x3 matrix that looks like this:
[[ 1.21428571  0.         -2.14285714]
 [ 0.          1.          0.        ]]
2b [Optional]. As an extension to point (2) above, consider how we would use cv2.warpAffine to achieve a 90 degree clockwise rotation. If you have attended my Unsupervised Learning course from the Machine Learning Specialization, you will undoubtedly have seen this quick reference:

To plug that directly into the $A$ of our original formula:
A 90-degree clockwise rotation could be implemented as a 270-degree anti-clockwise rotation. Let's see this implementation in opencv:
img = cv2.imread("assets/cvess.png") (h, w) = img.shape[:2] center = (w // 2, h // 2) mat3 = cv2.getRotationMatrix2D(center, angle=270, scale=1) print(f'270 degree anti-clockwise: \n {np.round(mat3, 2)}') rotated = cv2.warpAffine(img, mat, (w, h)) cv2.imshow("Rotated", rotated) # # print output: # # 270 degree anti-clockwise: # [[ -0. -1. 400.] # [ 1. -0. 0.]]
We learned earlier that $M = \begin{bmatrix} A & B \end{bmatrix}$.
So $A$ would be [[0, -1], [1, 0]] and $B$ would be [400, 0]. Fundamentally, cv2.getRotationMatrix2D is still applying an affine transformation to map the pixels from one point to another using a 2x3 matrix.
Refer to rotate_01.py to obtain the matrix for a 180-degree rotation and a 30-degree counter-clockwise rotation.
Let's also look at another application of getAffineTransform to strengthen our understanding.
Suppose we specify $M$ to be mat = np.float32([[1, 0, 0], [0, 1, 0]]); what do you expect the transformation to be?
Take a minute to discuss with your classmates or refer back to the Mathematical Definitions section above and try to internalize this before moving forward.
To verify your answer, run scale_03.py and see if your hunch was right.
For an extra challenge, let's assume our_image.png is an image of 200x200. Pay attention to the specification of mat below; what do you expect the resulting outcome to be?
Take a minute to discuss before moving forward.
img = cv2.imread("assets/our_image.png") cv2.imshow("Original", img) # custom transformation matrix mat = np.float32([[3, 0, 0], [0, 3, 0]]) print(mat) result = cv2.warpAffine(img, M=mat, dsize=(200, 200))
You may have expected the 2x3 matrix mat to have a scaling effect on our original image. However, the required argument of dsize in our warpAffine() call constrained the output to its original dimension, 200x200, thus "cropping out" only the top left corner of the image.
Suppose we'd like to see the transformed image (scaled by 3x) in its entirety; how would we change the value passed to the dsize argument?
Refer to scale_04.py to verify that you've got this right.
This section is optional; you may choose to skip this section.
Watch Rotation Matrix Explained Visually
If you're done watching the video, see the same example being presented in code:
a = np.float32([[0, -1], [1, 0]])
x = np.float32([3, 6])
np.matmul(a, x)
# output:
# array([-6.,  3.], dtype=float32)
- getRotationMatrix2D() to get a 2x3 matrix: rotate_01.py
- getAffineTransform(), obtaining a 2x3 matrix of [[1,0,0], [0,1,0]] (no transformation): scale_01.py
- np.float32([[1,0,0], [0,1,0]]): scale_02.py
- dsize parameter in cv2.warpAffine without transformation: scale_03.py
- dsize parameter accordingly: scale_04.py
- getAffineTransform(), obtaining a 2x3 matrix of [[1,0,0], [0,1,0]]: scale_05.py
- translate_01.py

Images from imaging systems and capturing systems are often "subject to geometric distortion introduced by perspective irregularities"[1] or "deformations that occur with non-ideal camera angles"[2].
In the case of translation or scaling, we typically specify our 2x3 matrix using np.float() and feed this matrix to cv2.warpAffine()
In the case of rotation, we typically use the convenience function cv2.getAffineTransform() to obtain the 2x3 matrix before feeding it to cv2.warpAffine()
cv2.getAffineTransform(src, dst)
Parameters:
- src - Coordinates of triangle vertices in the source image
- dst - Coordinates of the corresponding triangle vertices in the destination image
In the homework directory, you'll find a digital map belitung_raw.jpg. Your job is to apply what you've learned in this lesson to restore the map by correcting its skew and resize it appropriately.
