Introduction
Two months back I took part in the Eye on Intelligence Challenge. My project was to create a sign language interpreter that uses ML on the Arty Z7 board to classify hand signs into characters in real time. Having very little prior knowledge of ML and PYNQ, I started working on the project and learning as I went. I have shared my learning journey in the blogs below:
- Sign Language Interpreter Blog 1: Project Overview and Kit Unboxing
- Sign Language Interpreter Blog 2: Hands on with Arty Z7
- Sign Language Interpreter Blog 3: Running PYNQ on Arty Z7
- Sign Language Interpreter Blog 4: Using PYNQ overlays with HDMI and USB webcam on Arty Z7
- Sign Language Interpreter Blog 5: Running CNN image classification on Arty Z7 using PYNQ
In this blog, I will put together everything I have learned so far.
ASL Dataset
First of all, we are going to need a dataset. For that, I plan to use Python to capture a 30-second video of my hand with the webcam and randomly select 200 frames from it as training data and 40 frames as test data. The image resolution will be 266x200.
Getting the video
- Creating the capture object
```python
# Imports and create capture object
import cv2
import time

# Define the desired resolution
desired_width = 640
desired_height = 480

# Initialize webcam
cap = cv2.VideoCapture(1)

# Check if the webcam is opened correctly
if not cap.isOpened():
    print("Error: Could not open webcam.")
    exit()

# Set the desired resolution
cap.set(cv2.CAP_PROP_FRAME_WIDTH, desired_width)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, desired_height)
```
- Recording the video and saving it to a file
```python
# Define the codec and create VideoWriter object
fourcc = cv2.VideoWriter_fourcc(*'XVID')
output_path = 'train_video/Z_video.avi'  # Specify the file path
out = cv2.VideoWriter(output_path, fourcc, 20.0, (desired_width, desired_height))

recording = False
start_time = None

while True:
    ret, frame = cap.read()
    if not ret:
        break

    # Display the resulting frame
    cv2.imshow('Webcam Video', frame)

    # Check for key press
    key = cv2.waitKey(1) & 0xFF
    if key == ord('r'):  # Press 'r' to start recording
        recording = True
        start_time = time.time()
        print("Recording started...")

    # Record video if 'r' was pressed
    if recording:
        out.write(frame)

        # Stop recording after 30 seconds
        if time.time() - start_time >= 30:
            recording = False
            print("Recording stopped after 30 seconds.")

    # Press 'q' to exit the loop
    if key == ord('q'):
        break

out.release()
cv2.destroyAllWindows()
```
- Closing the image capture object.
```python
# Release everything when the job is finished
cap.release()
out.release()
cv2.destroyAllWindows()
```
Saving images
- The code below randomly collects 240 frames from each video and saves them in folders named after their class. The frames are split into train and test datasets (200 for training and 40 for testing). The directory structure is shown below:
```
[current dir] > train > A > A_1.jpg, A_3.jpg, A_9.jpg ... 200 images
                        B > B_2.jpg, B_5.jpg ... 200 images
[current dir] > test  > A > A_5.jpg, A_10.jpg ... 40 images
                        B > B_1.jpg, B_6.jpg ... 40 images
```

```python
import cv2
import random
import os

count = 0

# Function to resize and save a frame
def save_frame(frame, frame_index, label):
    resized_frame = cv2.resize(frame, (266, 200))
    global count
    count = count + 1
    # The first 200 frames go to the train set, the remaining 40 to the test set
    if count <= 200:
        output_path = os.path.join(output_train_dir, f'{label}_{frame_index}.jpg')
    else:
        output_path = os.path.join(output_test_dir, f'{label}_{frame_index}.jpg')
    cv2.imwrite(output_path, resized_frame)

labels = ['A','B','C','D','E','F','G','H','I','J','K','L','M',
          'N','O','P','Q','R','S','T','U','V','W','X','Y','Z']

for label in labels:
    # Define the video file path
    video_path = 'train_video/' + label + '_video.avi'
    output_train_dir = 'train/' + label
    output_test_dir = 'test/' + label

    # Create the output directories if they don't exist
    if not os.path.exists(output_train_dir):
        os.makedirs(output_train_dir)
    if not os.path.exists(output_test_dir):
        os.makedirs(output_test_dir)

    # Open the video file
    cap = cv2.VideoCapture(video_path)

    # Get the total number of frames in the video
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    print(total_frames)

    # Select 240 random frame indices
    random_frames = random.sample(range(total_frames), 240)

    # Read and save selected frames
    for frame_index in random_frames:
        cap.set(cv2.CAP_PROP_POS_FRAMES, frame_index)
        ret, frame = cap.read()
        if ret:
            save_frame(frame, frame_index, label)

    # Release the video capture object
    cap.release()
    count = 0
    print("Saved 240 randomly selected frames as JPEG files.")
```
- Some example images are shown below
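Since the frames are just JPEG files in the train/ folders, a quick way to preview a few of them is a small matplotlib sketch like the one below (matplotlib is an extra dependency here, not used elsewhere in the project):

```python
# Preview a few randomly chosen training frames for one class
import os
import random
import cv2
import matplotlib.pyplot as plt

label = 'A'
files = random.sample(os.listdir(f'train/{label}'), 4)

fig, axes = plt.subplots(1, 4, figsize=(12, 3))
for ax, fname in zip(axes, files):
    img = cv2.imread(os.path.join('train', label, fname))
    ax.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))  # OpenCV loads BGR; convert for display
    ax.set_title(fname)
    ax.axis('off')
plt.show()
```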
Training the model
So, now we have our train and test datasets, and we need to train the model. I am using my laptop for training and for saving the model. I created a custom tinier-YOLO classifier with some help from the internet, following the steps below.
- Imports
```python
# Set up the environment
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
```
- Define the model architecture
```python
# Define the model architecture
class TinierYOLO(nn.Module):
    def __init__(self):
        super(TinierYOLO, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(8, 16, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            # Additional layers can be added as needed
        )
        self.classifier = nn.Sequential(
            nn.Linear(16 * 50 * 66, 128),  # Adjust dimensions based on input size
            nn.ReLU(),
            nn.Linear(128, 26),  # Number of classes (26 classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        return x

# Initialize the model
model = TinierYOLO()
```
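The 16 * 50 * 66 figure in the first fully connected layer comes from the two max-pooling layers halving the 200x266 input twice (200 → 100 → 50 and 266 → 133 → 66) while the second conv layer outputs 16 channels. A quick sanity check with a dummy input confirms this:

```python
# Sanity check: pass a dummy 200x266 RGB image through the feature extractor
dummy = torch.zeros(1, 3, 200, 266)
print(TinierYOLO().features(dummy).shape)  # torch.Size([1, 16, 50, 66]) -> 16 * 50 * 66 features
```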
- Preparing the dataset. My dataset consists of 266x200 images.
```python
# Prepare the dataset
# Define transformations for your dataset
transform = transforms.Compose([
    transforms.Resize((200, 266)),
    transforms.ToTensor(),
])

# Load your custom dataset
# Replace 'YourDatasetClass' with your custom dataset class if you have one
train_dataset = datasets.ImageFolder(root='train', transform=transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True)

test_dataset = datasets.ImageFolder(root='test', transform=transform)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=32, shuffle=False)
```
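One thing to note: transforms.Resize expects (height, width), so (200, 266) here matches the 266x200 (width x height) frames saved earlier.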
- Training and saving the weights.
```python
# Train the model
# Initialize the model, loss function, and optimizer
model = TinierYOLO()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
for epoch in range(10):  # Number of epochs
    model.train()
    running_loss = 0.0
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f'Epoch {epoch+1}, Loss: {running_loss / len(train_loader)}')

# Save the trained model
torch.save(model.state_dict(), 'tinier_yolo_weights.pth')
```
- Evaluating the model
```python
# Evaluate the model
# Evaluation loop
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for inputs, labels in test_loader:
        outputs = model(inputs)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Total: ' + str(total))
print(f'Accuracy: {100 * correct / total}%')
```
- Deploy the model
```python
# Deploy the model
# Save the model for deployment
torch.save(model.state_dict(), 'tinier_yolo_deployment.pth')
```
Performing classification on real-time video
Now let's perform image classification on real-time video on the laptop before running it on the FPGA board.
- Load the model and weights
```python
# Perform classification on real time video
import cv2
import PIL

# Load the model
model = TinierYOLO()
model.load_state_dict(torch.load('tinier_yolo_weights.pth', weights_only=True))
model.eval()

# Define the classes (change as per your dataset)
classes = ['A','B','C','D','E','F','G','H','I','J','K','L','M',
           'N','O','P','Q','R','S','T','U','V','W','X','Y','Z']

# Define image transformations
transform = transforms.Compose([
    transforms.Resize((200, 266)),
    transforms.ToTensor(),
])
```
- Run classification
```python
def classify_frame(frame, model, transform, classes, threshold=0.8):
    # Preprocess the frame
    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    pil_frame = PIL.Image.fromarray(frame_rgb)
    input_tensor = transform(pil_frame).unsqueeze(0)

    # Perform the classification
    with torch.no_grad():
        output = model(input_tensor)

    # Get confidence scores
    scores = nn.functional.softmax(output, dim=1)
    max_score, predicted = torch.max(scores, 1)

    # Check if confidence score is above threshold
    if max_score.item() >= threshold:
        class_name = classes[predicted.item()]
    else:
        class_name = None

    return class_name

# Initialize webcam
cap = cv2.VideoCapture(1)

# Check if the webcam is opened correctly
if not cap.isOpened():
    print("Error: Could not open webcam.")
    exit()

while True:
    ret, frame = cap.read()
    if not ret:
        break

    # Resize the frame to 266x200 before classification
    resized_frame = cv2.resize(frame, (266, 200))

    # Perform classification
    class_name = classify_frame(resized_frame, model, transform, classes)

    # Display the resulting frame with the class name if a match is found
    if class_name:
        cv2.putText(resized_frame, class_name, (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
    cv2.imshow('Webcam Video', resized_frame)

    # Press 'q' to exit the loop
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

# Release everything when done
cap.release()
cv2.destroyAllWindows()
```
- Here is a video of the live classification happening. It's not very accurate: it is only able to correctly classify B, D and W. For now, I will move forward with this.
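To check which letters the model actually gets right, rather than eyeballing the live video, a per-class accuracy pass over the test set helps. This is a minimal sketch reusing the test_loader and classes list defined above:

```python
# Per-class accuracy over the test set (assumes test_loader and classes from above)
class_correct = [0] * len(classes)
class_total = [0] * len(classes)

model.eval()
with torch.no_grad():
    for inputs, labels in test_loader:
        outputs = model(inputs)
        _, predicted = torch.max(outputs, 1)
        for label, pred in zip(labels, predicted):
            class_total[label.item()] += 1
            class_correct[label.item()] += int(pred.item() == label.item())

for i, name in enumerate(classes):
    if class_total[i] > 0:
        print(f'{name}: {100 * class_correct[i] / class_total[i]:.1f}%')
```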
Saving the network architecture and weights
Now, we must save the network architecture in JSON format along with the weights so that the classification can be run on PYNQ.
- To dump the network architecture, I used the code below.
```python
import json

def model_to_json(model):
    model_json = {
        "network": "tinier-yolo",
        "input_image": "../../tests/Test_image/tinier-yolo/input.bin",
        "verification_image": "../../tests/Test_image/tinier-yolo/verification.bin",
        "binparam": "binparam-tinier-yolo-nopool",
        "use_binparams": True,
        "binparam_skip": 1,
        "layer_skip": 0,
        "layers": []
    }

    for name, module in model.named_modules():
        if name == "":
            continue  # Skip the top-level module itself

        layer = {
            "name": name,
            "func": "",
            "input_bits": 3,
            "output_bits": 3,
            "weight_bits": 1,
            "threshold_bits": 16,
            "kernel_shape": None,
            "kernel_stride": None,
            "input_channels": None,
            "input": None,
            "output_channels": None,
            "output": None,
            "padding": None
        }

        if isinstance(module, nn.Conv2d):
            layer["func"] = "conv_layer"
            layer["kernel_shape"] = module.kernel_size[0]
            layer["kernel_stride"] = module.stride[0]
            layer["input_channels"] = module.in_channels
            layer["input"] = [module.in_channels, None, None]  # Example placeholder
            layer["output_channels"] = module.out_channels
            layer["output"] = [module.out_channels, None, None]  # Example placeholder
            layer["padding"] = module.padding[0]
        elif isinstance(module, nn.MaxPool2d):
            layer["func"] = "maxpool_layer"
            layer["kernel_shape"] = module.kernel_size
            layer["kernel_stride"] = module.stride
            layer["input_channels"] = None  # Example placeholder
            layer["input"] = [None, None, None]  # Example placeholder
            layer["output_channels"] = None  # Example placeholder
            layer["output"] = [None, None, None]  # Example placeholder
            layer["padding"] = module.padding
        elif isinstance(module, nn.Linear):
            layer["func"] = "fc_layer"
            layer["input_channels"] = module.in_features
            layer["input"] = [module.in_features]
            layer["output_channels"] = module.out_features
            layer["output"] = [module.out_features]

        model_json["layers"].append(layer)

    return model_json

# Initialize the model
model = TinierYOLO()

# Convert the model to JSON
model_json = model_to_json(model)

# Save the JSON to a file
with open('tinier_yolo_layers.json', 'w') as json_file:
    json.dump(model_json, json_file, indent=4)
```
- To dump the weights and biases, I used the code below.
```python
import os
import numpy as np

# Create directory to save weights
os.makedirs('tinier_yolo_weights', exist_ok=True)

# Extract and save weights and biases
for name, param in model.named_parameters():
    if 'weight' in name:
        weight_np = param.data.cpu().numpy()
        np.save(f'tinier_yolo_weights/{name.replace(".", "-")}-W.npy', weight_np)
    elif 'bias' in name:
        bias_np = param.data.cpu().numpy()
        np.save(f'tinier_yolo_weights/{name.replace(".", "-")}-bias.npy', bias_np)

print("Weights and biases have been saved in NumPy format.")
```
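As a quick check that the dump is usable, a saved array can be reloaded and its shape compared against the corresponding model parameter. A minimal sketch, assuming the file naming produced by the code above:

```python
# Verify a dumped weight file by reloading it and comparing shapes
import numpy as np

w = np.load('tinier_yolo_weights/features-0-weight-W.npy')
print(w.shape)  # Expected (8, 3, 3, 3) for the first Conv2d layer
print(w.shape == tuple(model.features[0].weight.shape))  # Should print True
```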
Unfortunately, after this point I couldn't complete my project. I am still stuck on how to use my custom network architecture on the Arty Z7 and run it with the overlays, and there is no more time left.
Summary
Although I couldn't complete my project on time due to some unforeseen reasons, I had a very good time working on it and I learned a lot. There is still much more to learn, and I will be continuing the project.