Drone Aerial View Segmentation
How to teach a drone to see what is below and segment objects at high resolution
Introduction
Drones have gained popularity in the past few years: compared to satellite imagery they provide high-resolution images at lower cost, with greater flexibility and a low flying altitude, and they can also carry various sensors such as magnetic sensors. This has led to increasing interest in the field.

Teaching a drone to see is quite challenging because of the bird’s-eye view: most pre-trained models are trained on the everyday, ground-level point of view found in ImageNet, PASCAL VOC, and COCO. In this project I experiment with training on drone datasets, with the following aims:
- A lightweight model (fewer parameters)
- A high score (I hope so)
- Fast inference latency.
Datasets
[2] The Semantic Drone Dataset focuses on semantic understanding of urban scenes to increase the safety of autonomous drone flight and landing procedures. The imagery depicts more than 20 houses from a nadir (bird’s-eye) view acquired at an altitude of 5 to 30 meters above ground. A high-resolution camera was used to acquire images at a size of 6000x4000 px (24 Mpx). The training set contains 400 publicly available images and the test set is made up of 200 private images.


The complexity of the dataset is limited to 20 classes (although its masks actually contain 23), listed as follows: tree, grass, other vegetation, dirt, gravel, rocks, water, paved area, pool, person, dog, car, bicycle, roof, wall, fence, fence-pole, window, door, obstacle.
Methods
Preprocessing
I resize the images to 704 x 1056, keeping the same aspect ratio as the original input. I don’t crop the images into patches for a few reasons: the objects are not too small, the full images don’t take much memory, and it saves training time. I split the dataset into three parts, training (306), validation (54), and test (40) sets, and applied HorizontalFlip, VerticalFlip, GridDistortion, RandomBrightnessContrast, and GaussNoise to the training data, with a mini-batch size of 3.
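As a reference, here is a minimal sketch of this preprocessing with albumentations; the transform names, image size, and split come from above, while the probabilities are my assumptions.

```python
import albumentations as A

# Resize to 704 x 1056 (same aspect ratio as the originals) and apply
# the listed augmentations to the training set only.
# The probabilities are assumptions; only the transform names come from the article.
train_transform = A.Compose([
    A.Resize(height=704, width=1056),
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.GridDistortion(p=0.2),
    A.RandomBrightnessContrast(p=0.3),
    A.GaussNoise(p=0.2),
])

val_transform = A.Compose([
    A.Resize(height=704, width=1056),
])

# Usage: albumentations applies the same spatial transform to image and mask.
# augmented = train_transform(image=image, mask=mask)
# image, mask = augmented["image"], augmented["mask"]
```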
Model Architecture
I use two model architectures, purposely choosing light backbones such as MobileNet and EfficientNet for computational efficiency:
- U-Net with MobileNet_V2 and EfficientNet-B3 as backbones
- FPN (Feature Pyramid Network) with EfficientNet-B3 backbone
I followed Parmar et al.’s paper [3] for the model choices (I had already trained different models before, and these choices seem to work).
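A sketch of how these three models could be built; I assume the segmentation_models_pytorch library here, since it provides U-Net and FPN decoders with MobileNet_V2 and EfficientNet-B3 encoders (the exact library is not stated above).

```python
import segmentation_models_pytorch as smp

NUM_CLASSES = 23  # the masks contain 23 classes

# U-Net with a MobileNet_V2 encoder (the lightest model)
unet_mobilenet = smp.Unet(
    encoder_name="mobilenet_v2",
    encoder_weights="imagenet",
    classes=NUM_CLASSES,
)

# U-Net with an EfficientNet-B3 encoder
unet_effnet = smp.Unet(
    encoder_name="efficientnet-b3",
    encoder_weights="imagenet",
    classes=NUM_CLASSES,
)

# FPN with an EfficientNet-B3 encoder
fpn_effnet = smp.FPN(
    encoder_name="efficientnet-b3",
    encoder_weights="imagenet",
    classes=NUM_CLASSES,
)
```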


Training Strategy
The training process has two phases:
- First, train the model for 30 epochs with CrossEntropyLoss.
- Second, continue for 20 epochs with a weighted combination of CrossEntropyLoss and Lovasz-Softmax loss to maximize the IoU score [1][4], using the One Cycle learning rate policy for both phases. The two losses, computed on the model output, are combined with a weight α; in this experiment α = 0.7 (see the sketch after this list).
- Early stopping is used: training stops if the validation loss doesn’t improve for 7 consecutive epochs.
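A minimal sketch of the phase-2 loss and the One Cycle schedule. The value α = 0.7, the loss names, the epoch count, and the batch size come from above; which term α multiplies, the optimizer, and the learning rate are my assumptions.

```python
import torch
import torch.nn as nn
import segmentation_models_pytorch as smp
from segmentation_models_pytorch.losses import LovaszLoss

ALPHA = 0.7  # weight reported above

ce_loss = nn.CrossEntropyLoss()              # a per-class weight tensor can be passed here
lovasz_loss = LovaszLoss(mode="multiclass")

def phase2_loss(output, target):
    """Weighted combination of CrossEntropy and Lovasz-Softmax (phase 2).
    Assumption: alpha scales the CrossEntropy term."""
    return ALPHA * ce_loss(output, target) + (1 - ALPHA) * lovasz_loss(output, target)

# One Cycle learning rate policy, used in both training phases.
# The optimizer and max_lr are placeholders, not values from the article.
model = smp.FPN(encoder_name="efficientnet-b3", encoder_weights="imagenet", classes=23)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
steps_per_epoch = 306 // 3                   # 306 training images, mini-batch size 3
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-3, epochs=20, steps_per_epoch=steps_per_epoch
)
```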
Results
In this experiment the FPN model gives the best result compared to U-Net; the FPN loss also decreases faster than the U-Net-based models, with less training time thanks to early stopping.
Loss
The FPN-based model converges quickly compared to the other models and reaches a minimum loss of around 0.3, while U-Net with EfficientNet-B3 and with MobileNet_v2 stay around 0.4 to 0.5. The loss increases in phase-2 training because it becomes the weighted sum of the CrossEntropy and Lovasz terms.



Evaluation Metrics
I use two metrics to evaluate model performance: Intersection over Union (IoU) and pixel accuracy. Training with the Lovasz loss boosts the IoU score, as the graph below shows: with CrossEntropy alone, the IoU gets stuck around 50%, which is clearly visible for U-Net with MobileNet_v2.



If we look at the pixel accuracy below, training with Lovasz doesn’t really increase pixel accuracy; it stays around the phase-1 level. CrossEntropy in phase 1 is better at increasing accuracy, which is why I use CrossEntropy first and continue with Lovasz.
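Both metrics are straightforward to compute from the predicted class masks; here is a minimal sketch (the per-class averaging and the handling of absent classes are my assumptions).

```python
import torch

def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted class matches the ground truth."""
    correct = (pred == target).sum().item()
    return correct / target.numel()

def mean_iou(pred, target, num_classes=23):
    """Mean Intersection over Union across classes present in prediction or ground truth."""
    ious = []
    for cls in range(num_classes):
        pred_c = pred == cls
        target_c = target == cls
        union = (pred_c | target_c).sum().item()
        if union == 0:
            continue  # class absent from both prediction and ground truth
        inter = (pred_c & target_c).sum().item()
        ious.append(inter / union)
    return sum(ious) / len(ious)

# pred = logits.argmax(dim=1)  # logits: (N, C, H, W) model output
```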



Inference
After the model has been taught to see objects, it’s time to check its performance on a test set the model has never seen before (I kept it in a bunker). The test set consists of 40 images, which I resize to 768 x 1152 (the originals are 4000 x 6000) to save my free Colab GPU. The results are quite satisfying; below are inference examples from the 3 different models.
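For reference, a sketch of the inference step on one resized test image; the normalization and device handling are my assumptions.

```python
import torch
import albumentations as A

test_transform = A.Compose([A.Resize(height=768, width=1152)])

@torch.no_grad()
def predict_mask(model, image, device="cuda"):
    """Resize a test image (H x W x 3 numpy array), run the model,
    and return the predicted class mask as an (H, W) tensor."""
    resized = test_transform(image=image)["image"]
    x = torch.from_numpy(resized).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    logits = model.eval().to(device)(x.to(device))   # (1, C, H, W)
    return logits.argmax(dim=1).squeeze(0).cpu()     # (H, W) class indices
```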




To summarize, I made the table below to compare each model’s performance. Keep in mind that inference latency really depends on your machine and image resolution (mine is 768 x 1152). In this summary I used a Google Colab GPU, which varies over time (often an Nvidia K80, T4, P4, or P100), to run inference and compute the mean time over the whole test set.
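A sketch of how that mean latency could be measured; the CUDA synchronization and timing details are my assumptions.

```python
import time
import torch

@torch.no_grad()
def mean_latency(model, images, device="cuda"):
    """Average per-image forward-pass time in seconds over the test set.
    `images` is an iterable of (1, 3, H, W) tensors, e.g. 768 x 1152."""
    model = model.eval().to(device)
    times = []
    for x in images:
        x = x.to(device)
        if device == "cuda":
            torch.cuda.synchronize()      # make sure timing covers the GPU work
        start = time.perf_counter()
        model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)
```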

Conclusions
Let’s conclude what we have accomplished so far:
- The FPN architecture works well on this dataset and outperforms the U-Net-based models, even with the same backbone.
- The Lovasz-Softmax loss works well to boost the IoU score but doesn’t really increase pixel accuracy, whereas CrossEntropy boosts pixel accuracy.
- The lightest model (fewest parameters), which also has the fastest inference latency, is U-Net with MobileNet_v2, even though it sacrifices some score.
The trade-off between score and inference time really depends on the purpose and the available computational resources.
I’m really excited to embed this model into a real drone application.
For more about my projects, please visit my GitHub.
References
[1] Berman, Maxim & Rannen, Amal & Blaschko, Matthew. (2018). The Lovász-Softmax Loss: A Tractable Surrogate for the Optimization of the Intersection-Over-Union Measure in Neural Networks. 4413–4421. 10.1109/CVPR.2018.00464.
[2] Graz University of Technology. (2019). Semantic Drone Dataset. Accessed 13 July 2020. http://dronedataset.icg.tugraz.at.
[3] Parmar, Vivek & Bhatia, Narayani & Negi, Shubham & Suri, Dr. Manan. (2020). Exploration of Optimized Semantic Segmentation Architectures for edge-Deployment on Drones.
[4] Rakhlin, Alexander & Davydow, Alex & Nikolenko, Sergey. (2018). Land Cover Classification from Satellite Imagery with U-Net and Lovász-Softmax Loss. 257–2574. 10.1109/CVPRW.2018.00048.