JISE

Recent developments in neural networks have enabled them to achieve accuracy comparable to or better than human performance on computer vision tasks such as image classification, object detection, segmentation and keypoint estimation. Large-scale datasets have been curated explicitly for each of these tasks, and researchers from around the world compete for state-of-the-art results on these benchmarks. Outside these datasets, the models may not perform equally well. Real-world applications often require multiple models to be used together to generate meaningful results. The processing time of such a pipeline is determined primarily by the slowest model, and the resources allocated to the other models remain idle in the meantime, which makes the pipeline sub-optimal. Each model also comes with its own input pre-processing, feature extraction and output post-processing, causing significant unnecessary overhead. To overcome these issues, we explored multitasking architectures that perform multiple closely related tasks together. In this work, we developed a single multitasking model that performs object detection, instance segmentation and keypoint estimation. We presume that such models are more robust to data-specific noise, because they find a better representation of the training data by learning to predict multiple closely related tasks. Our most accurate model achieved 41.2 AP on object detection, 38.2 AP on instance segmentation and 53.0 AP on keypoint estimation when evaluated on the COCO validation set. We optimised the models through layer fusion and float16 quantisation. After optimisation, this model ran at 107 frames per second (fps), while a lighter version ran at 131 fps, on an RTX 3090 GPU. We also benchmarked the two models on an NVIDIA Jetson TX2 and obtained 4.2 fps and 6.3 fps, respectively. The models were successfully deployed in a cloud-based client-server architecture, with future possibilities for on-premise deployment using edge devices.
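
The bottleneck argument and the reported throughput figures can be made concrete with a small calculation. The following is a minimal sketch, not part of the reported system: the function names and the three stage latencies (12, 20 and 15 ms) are hypothetical values chosen only for illustration, while the fps figures are the ones quoted in this abstract. It assumes the separate models run concurrently on a stream of frames, so that throughput is limited by the slowest stage.

```python
from typing import Dict, Sequence, List


def frame_latency_ms(fps: float) -> float:
    """Convert throughput (frames per second) to per-frame latency in ms."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def pipeline_fps(stage_latencies_ms: Sequence[float]) -> float:
    """Throughput of concurrently running stages: set by the slowest stage."""
    if not stage_latencies_ms:
        raise ValueError("at least one stage is required")
    return 1000.0 / max(stage_latencies_ms)


def idle_fractions(stage_latencies_ms: Sequence[float]) -> List[float]:
    """Fraction of each cycle a stage spends waiting for the slowest stage."""
    slowest = max(stage_latencies_ms)
    return [1.0 - t / slowest for t in stage_latencies_ms]


# Throughput figures quoted in the abstract (frames per second).
REPORTED_FPS: Dict[str, float] = {
    "RTX 3090, most accurate model": 107.0,
    "RTX 3090, lighter model": 131.0,
    "Jetson TX2, most accurate model": 4.2,
    "Jetson TX2, lighter model": 6.3,
}

# HYPOTHETICAL per-stage latencies (ms) for three separate models:
# detection, segmentation, keypoints. These are NOT measurements from the paper.
HYPOTHETICAL_STAGES_MS = [12.0, 20.0, 15.0]

if __name__ == "__main__":
    for name, fps in REPORTED_FPS.items():
        print(f"{name}: {frame_latency_ms(fps):.2f} ms per frame")
    print(f"separate-model pipeline: {pipeline_fps(HYPOTHETICAL_STAGES_MS):.1f} fps")
    print("idle fractions:", idle_fractions(HYPOTHETICAL_STAGES_MS))
```

Under these assumed latencies, the two faster stages would sit idle for part of every cycle, which is the inefficiency a single multitasking model is intended to avoid.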