Acivs 2020 Advanced Concepts for Intelligent Vision Systems
Feb. 10-14, 2020 Auckland, New Zealand
Acivs 2020 Abstracts
Regular papers
Paper 102: Design of Perspective Affine Motion Compensation for Versatile Video Coding (VVC)
The fundamental motion model of conventional block-based motion compensation in High Efficiency Video Coding (HEVC) is a translational motion model. However, in the real world, the motion of an object is a combination of many kinds of motion. In Versatile Video Coding (VVC), block-based 4-parameter and 6-parameter affine motion compensation (AMC) is being applied. AMC is still limited in its ability to represent the complex motions found in natural video. In this paper, we design a perspective affine motion compensation (PAMC) method which can improve coding efficiency while maintaining low computational complexity compared with the existing AMC. Because a block under the perspective motion model is not restricted to a rectangular shape, the proposed PAMC shows effective encoding performance, in particular for test sequences containing irregular object distortions or dynamic rapid motions. Our proposed algorithm is implemented on VTM 2.0. The experimental results show that the BD-rate reduction of the proposed technique reaches up to 0.30%, 0.76%, and 0.04% for the random access (RA) configuration and 0.45%, 1.39%, and 1.87% for the low delay P (LDP) configuration on the Y, U, and V components, respectively. Meanwhile, the increase in encoding complexity is within an acceptable range.
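To make the difference between the affine and perspective models concrete, the following sketch (ours, not the paper's VTM 2.0 code) shows how a 6-parameter affine warp and an 8-parameter perspective warp map a pixel position; the projective division in the perspective model is what lets a rectangular block map onto a general quadrilateral. All parameter values are illustrative.

```python
import numpy as np

def affine_warp(x, y, p):
    """6-parameter affine model: p = (a, b, c, d, e, f)."""
    a, b, c, d, e, f = p
    return a * x + b * y + c, d * x + e * y + f

def perspective_warp(x, y, p):
    """8-parameter perspective model: the division by w lets a
    rectangular block map onto a general quadrilateral."""
    a, b, c, d, e, f, g, h = p
    w = g * x + h * y + 1.0
    return (a * x + b * y + c) / w, (d * x + e * y + f) / w

# The motion vector of a pixel is the displacement to its warped position.
x, y = 8.0, 4.0                                          # pixel in the block
params = (1.0, 0.01, 2.0, -0.02, 1.0, 1.5, 1e-3, 5e-4)   # toy parameters
xp, yp = perspective_warp(x, y, params)
print("MV:", xp - x, yp - y)
```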
Paper 103: Investigation of Coding Standards Performances on Optically Acquired and Synthetic Holograms
Digital holography needs efficient coding tools that facilitate storage and transmission of this type of data in order to reach practical applications. This paper presents an experimental analysis of the performance of different coding tools for the compression of digital holograms. During the experiments, a dedicated compression architecture is employed to transform the holographic data into a representation suitable for the encoders and to perform an objective quality evaluation of the obtained results. Several state-of-the-art image and video codecs are evaluated on different reference datasets comprising different types of digital holograms. The evaluation is carried out on the reconstructed images with different metrics, and the obtained results are critically analyzed and discussed.
Paper 104: A Novel Framework for Early Fire Detection Using Terrestrial and Aerial 360-degree Images
In this paper, in order to contribute to the protection of the value and potential of forest ecosystems and the global forest future, we propose a novel fire detection framework which combines recently introduced 360-degree remote sensing technology, multidimensional texture analysis and deep convolutional neural networks. Once 360-degree data are obtained, we convert the distorted 360-degree equirectangular-projection images to cubemap images. Subsequently, we divide the extracted cubemap images into blocks of two different sizes. This allows us to apply h-LDS multidimensional spatial texture analysis to the larger blocks and then, depending on the probability of fire existence, to the smaller blocks. Thus, we aim to accurately identify candidate fire regions while reducing computational time. Finally, the candidate fire regions are fed into a CNN in order to distinguish between fire-coloured objects and fire. For evaluating the performance of the proposed framework, a dataset, namely “360-FIRE”, consisting of 100 images with unlimited field of view that contain synthetic fire, was created. Experimental results demonstrate the potential of the proposed framework.
Paper 107: Automatic Optical Inspection for Millimeter Scale Probe Surface Stripping Defects using Convolutional Neural Network
Surface defect inspection is a crucial step in the production of IC probes. The traditional way of identifying defective IC probes relies mostly on human visual examination through a microscope screen. However, this approach is affected by subjective factors and misjudgments of inspectors, and its accuracy and efficiency are not sufficiently stable. Therefore, we propose an automatic optical inspection system that incorporates the ResNet-101 deep learning architecture into the faster region-based convolutional neural network (Faster R-CNN) to detect stripping-gold defects on the IC probe surface. The training samples were collected through our designed multi-function investigation platform IMSLAB. To circumvent the challenge of insufficient images in our datasets, we introduce data augmentation using cycle generative adversarial networks (CycleGAN). The proposed system was evaluated using 133 probes. The experimental results revealed that our method achieves high accuracy in stripping defect detection. The overall mean average precision (mAP) was 0.732, and the defective IC probe classification accuracy rate was 97.74%.
Paper 110: A Local Flow Phase Stretch Transform for Robust Retinal Vessel Detection
This paper presents a new method for reliably detecting the retinal vessel tree using a local flow phase stretch transform (LF-PST). A local flow evaluator is proposed to increase the local contrast and the coherence of the local orientation of the vessel tree. This is achieved by incorporating information about the local structure and direction of vessels, which is estimated by introducing a second curvature moment evaluation matrix (SCMEM). The SCMEM evaluates vessel patterns as features having only linearly coherent curvature. We present an oriented phase stretch transform to capture retinal vessels running at various diameters and directions. The proposed method exploits the phase angle of the transform, which includes structural features of lines and curved patterns. The LF-PST produces several phase maps, in which the vessel structure is characterized along various directions. To produce an orientation-invariant response, all phases are linearly combined. The proposed method is tested on the publicly available DRIVE and IOSTAR databases with different imaging modalities and achieves encouraging segmentation results, outperforming the state-of-the-art benchmark methods.
Paper 111: Object Contour Refinement using Instance Segmentation in Dental Images
Very accurate detection is required to fit a 3D dental model onto color images for tracking the millimetric displacement of each tooth during orthodontic treatment. Detecting tooth boundaries with high accuracy in these images is a challenging task because of the varying quality and high resolution of the images. By training Mask R-CNN on a very large dataset of 170k images of patients' mouths taken with different mobile devices, we obtain reliable teeth instance segmentation, but the boundaries of each tooth are not accurate enough for dental care monitoring. To address this problem, we propose an efficient method for object contour refinement using instance segmentation (CRIS). Instance segmentation provides high-level information on the location and shape of the object to guide and locally refine the contour detection process. We evaluate the CRIS method on a large dataset of 600 dental images. Our method significantly improves the performance of several state-of-the-art contour detectors: Canny (+32.0% in ODS F-score), gPb (+17.8%), Sketch Tokens (+17.3%), Structured Edge (+12.2%), DeepContour (+15.5%), HED (+2.9%), CEDN (+2.2%), RCF (+2.2%), and also achieves the best overall result (ODS F-score of 0.819). Our CRIS method can be used with any contour detection algorithm to refine object contours. In that way, this approach is promising for other applications requiring very accurate contour detection.
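The abstract does not give implementation details, but the guiding idea (use the instance mask to decide which contour responses to keep) can be sketched as below; the band-filtering step and its parameters are our assumptions, not the authors' CRIS algorithm.

```python
import cv2
import numpy as np

def refine_contours_with_mask(edge_map, instance_mask, band=5):
    """Keep only edge responses within a narrow band around the
    instance-segmentation boundary (illustrative stand-in for CRIS).

    edge_map      -- float edge probabilities from any detector, shape (H, W)
    instance_mask -- binary mask of one tooth instance, uint8, shape (H, W)
    band          -- half-width of the tolerated band, in pixels
    """
    # Boundary of the instance mask via a morphological gradient.
    boundary = cv2.morphologyEx(instance_mask, cv2.MORPH_GRADIENT,
                                np.ones((3, 3), np.uint8))
    # Distance of every pixel to that boundary.
    dist = cv2.distanceTransform(1 - (boundary > 0).astype(np.uint8),
                                 cv2.DIST_L2, 3)
    return np.where(dist <= band, edge_map, 0.0)
```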
Paper 112: Dynamic Texture Representation Based on Hierarchical Local Patterns
A novel and effective operator, named HIerarchical LOcal Pattern (HILOP), is proposed to efficiently exploit relationships between local neighbors in a pair of adjacent hierarchical regions located around a center pixel of a textural image. Instead of being thresholded by the value of the central pixel as usual, the gray-scale of a local neighbor in a hierarchical area is compared to that of all neighbors in the other region. In order to capture shape and motion cues for dynamic texture (DT) representation, HILOP is used to investigate hierarchical relationships in the plane-images of a DT sequence. The obtained histograms are then concatenated to form a robust descriptor with high performance on the DT classification task. Experimental results on various benchmark datasets have validated the interest of our proposal.
Paper 113: Verifying Kinship from RGB-D Face Data
We present a kinship verification (KV) approach based on Deep Learning applied to RGB-D facial data. To work around the lack of an adequate 3D face database with kinship annotations, we provide an online platform where participants upload videos containing their own faces and those of their relatives. These videos are captured with ordinary smartphone cameras. We process them to reconstruct the recorded faces in three-dimensional space, generating a normalized dataset which we call Kin3D. We also combine depth information from the normalized 3D reconstructions with 2D images, composing a set of RGB-D data. Following approaches from related works, images are organized into four categories according to their respective type of kinship. For the classification, we use a Convolutional Neural Network (CNN) and a Support Vector Machine (SVM) for comparison. The CNN was tested both on a widely used 2D kinship verification database (KinFaceW-I and II) and on our Kin3D for comparison with related works. Results indicate that adding depth information improves the model's performance, increasing classification accuracy up to 90%. To the extent of our knowledge, this is the first database containing depth information for kinship verification. We provide a baseline performance to stimulate further evaluations from the research community.
Paper 115: Correction of Temperature Estimated from a Low-Cost Handheld Infrared Camera for Clinical Monitoring
The use of low-cost cameras for medical applications has its advantages, as it enables affordable and remote evaluation of health problems; however, accuracy is a limiting factor in their use. Previous studies indicate that parameters of the object position, such as the camera-object distance and the angle of view, can be used to improve temperature estimation from thermal cameras. Nevertheless, most studies focus on expensive thermal cameras with good accuracy. In this study, an innovative experimental setup is used to study the errors associated with temperature estimation from a low-cost infrared camera, the FlirOne Gen3. In our experiments, images are acquired from multiple points of view (camera-object distances and viewing angles) with a thermal camera manipulated by hand. Then, using a regression model, a correction is proposed and tested. The results show that our proposed correction improves temperature estimation and enhances thermal accuracy.
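As a minimal illustration of the regression-based correction idea (the paper's actual model form and calibration data are not given in the abstract), one could fit the reading error against the acquisition geometry and add the predicted bias back; all numbers below are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical calibration records: (distance in m, viewing angle in deg)
# and the error of the low-cost camera against a reference temperature.
X = np.array([[0.3, 0], [0.5, 0], [0.5, 30], [1.0, 0], [1.0, 45]])
err = np.array([0.1, 0.4, 0.6, 0.9, 1.4])      # reference - camera, deg C

model = LinearRegression().fit(X, err)

def corrected_temperature(t_camera, distance_m, angle_deg):
    """Add the predicted position-dependent bias back to the raw reading."""
    bias = model.predict([[distance_m, angle_deg]])[0]
    return t_camera + bias

print(corrected_temperature(36.0, 0.8, 20.0))
```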
Paper 116: Towards Approximating Personality Cues Through Simple Daily Activities
The goal of this work is to investigate the potential of using simple activity and motion patterns in a smart environment for approximating personality cues via machine learning techniques. Towards this goal, we present a novel framework for personality recognition, inspired by both Computer Vision and Psychology. Results show a correlation between several behavioral features and personality traits, as well as insights into which types of everyday tasks induce a stronger display of personality. We experiment with Support Vector Machines, Random Forests and Gaussian Process classification, achieving promising predictive ability for personality traits. The obtained results show consistency to a good degree, opening the path for applications in psychology, the game industry, ambient assisted living, and other fields.
Paper 117: Multiview 3D Markerless Human Pose Estimation
Although marker-based systems for human motion estimation provide very accurate tracking of the human body joints (at mm precision), these systems are often intrusive or even impossible to use depending on the circumstances, e.g. markers cannot be put on an athlete during competition. Instrumenting an athlete with the appropriate number of markers requires a lot of time, and these markers may fall off during the analysis, which leads to incomplete data, requires new data capturing sessions and hence wastes time and effort. Therefore, we present a novel multiview video-based markerless system that uses 2D joint detections per view (from OpenPose) to estimate their corresponding 3D positions, while tackling the people association problem in the process to allow tracking of multiple persons at the same time. Our proposed system can perform the tracking in real time at 20-25 fps. Our results show a standard deviation between 9.6 and 23.7 mm for the lower body joints based on the raw measurements only. After filtering the data, the standard deviation drops to a range between 6.6 and 21.3 mm. Our proposed solution can be applied to a large number of applications, ranging from sports analysis to virtual classrooms, where submillimeter precision is not necessarily required but the use of markers is impractical.
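The core geometric step of such a system, lifting per-view 2D joint detections to a 3D point once the cameras are calibrated, is standard linear (DLT) triangulation; the sketch below shows it for one joint and is our illustration, not the authors' code.

```python
import numpy as np

def triangulate_joint(projections, points_2d):
    """Linear (DLT) triangulation of one joint seen in several views.

    projections -- list of 3x4 camera projection matrices
    points_2d   -- list of (u, v) detections (e.g. from OpenPose), one per view
    Returns the 3D point minimising the algebraic error.
    """
    A = []
    for P, (u, v) in zip(projections, points_2d):
        A.append(u * P[2] - P[0])
        A.append(v * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.asarray(A))
    X = vt[-1]
    return X[:3] / X[3]          # dehomogenise
```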
Paper 119: Automatic Focal Blur Segmentation based on Difference of Blur Feature using Theoretical Thresholding and Graphcuts
Focal blur segmentation is one of the interesting topics in computer vision. With recent improvements in camera devices, multiple focal blur images with different focal settings can be obtained in a single shooting. Utilizing the information of multiple focal blur images is expected to improve segmentation performance. We propose an automatic focal blur segmentation method that uses a pair of focal blur images with different focal settings. A difference-of-blur feature can be obtained from an image pair focused on the object and on the background, respectively. A theoretical threshold separates the object and the background in the difference-of-blur feature space. The proposed method consists of i) theoretical thresholding in the blur feature space; and ii) energy minimization based on Graphcuts using color and blur features. We evaluate the proposed method using 12 and 48 image pairs, containing single objects and flowers, respectively. In this evaluation, the average Informedness of the initial and final segmentations is 0.897 and 0.972 for the single-object images, and 0.730 and 0.827 for the flower images, respectively.
Paper 120: Segmentation of Phase-Contrast MR Images for Aortic Pulse Wave Velocity Measurements
Aortic stiffness is an important diagnostic and prognostic parameter for many diseases and is estimated by measuring the Pulse Wave Velocity (PWV) from Cardiac Magnetic Resonance (MR) images. However, this process requires combinations of multiple sequences, which makes the acquisition long and the processing tedious. For this reason, we propose a method for segmentation and centerline extraction of the aorta from para-sagittal Phase-Contrast (PC) MR images. The method uses the order of appearance of the blood flow in PC images to track the aortic centerline from the seed start position to the seed end position of the aorta. The only required user interaction is the selection of 2 input seed points for the start and end positions of the aorta. We validate our results against ground-truth centerlines manually extracted from para-sagittal PC images and anatomical MR images. Both centerline length measurements and PWV measurements show high accuracy and low variability, which allows for use in a clinical setting. The main advantage of our method is that it requires only a velocity-encoded PC image, while being able to process images encoded in only one direction.
Paper 121: An Improved GAN Semantic Image Inpainting
Image inpainting is used to fill in missing regions based on the remaining image data. Although existing methods that use deep generative models to infer the missing content produce realistic images, the results are sometimes unsatisfactory due to numerical issues caused by unbalanced terms in the proposed cost functions. In this paper, we propose a loss that generates more plausible results. Experiments on two datasets show that our method predicts information in large missing regions and achieves pixel-level photorealism, significantly outperforming the state-of-the-art methods of Yeh et al. (2017) and Yeh et al. (2018). Having improved semantic image inpainting, we focus on applying the method to laparoscopic images that suffer from glares. The modified technique again outperforms its rivals. Moreover, it is faster than classical PDE-based inpainting techniques and, more importantly, its running time is almost independent of the size of the missing area, both critical issues in medical image processing.
Paper 124: Region Proposal Oriented Approach for Domain Adaptive Object Detection
Faster R-CNN has become a standard model in deep-learning-based object detection. However, in many cases, few annotations are available for images in the application domain, referred to as the target domain, whereas full annotations are available for closely related public or synthetic datasets, referred to as source domains. Thus, domain adaptation is needed to train a model that performs well in the target domain with few or no annotations in this target domain. In this work, we address this domain adaptation problem in the context of object detection in the case where no annotations are available in the target domain. Most existing approaches consider adaptation at both the global and instance level but without adapting the region proposal sub-network, leading to a residual domain shift. After a detailed analysis of the classical Faster R-CNN detector, we show that adapting the region proposal sub-network is crucial and propose an original way to do it. We run experiments in two different application contexts, namely autonomous driving and ski-lift video surveillance, and show that our adaptation scheme clearly outperforms the previous solution.
Paper 125: Real Time Embedded Person Detection and Tracking in Camera Streams
Shopping behaviour analysis through counting and tracking of people in shop-like environments offers valuable information for store operators and provides key insights into the store's layout (e.g. frequently visited spots). Instead of using extra staff for this, automated on-premise solutions are preferred. These automated systems should be cost effective, preferably run on lightweight embedded hardware, work in very challenging situations (e.g. handling occlusions), and preferably work in real time. We solve this challenge by implementing a real-time TensorRT-optimized YOLOv3-based pedestrian detector on a Jetson TX2 hardware platform. By combining the detector with a sparse optical flow tracker, we assign a unique ID to each customer and tackle the problem of losing partially occluded customers. Our detector-tracker solution achieves an average precision of 81.59% at a processing speed of 10 FPS. Besides valuable statistics, heat maps of frequently visited spots are extracted and used as an overlay on the video stream.
Paper 126: Clip-level Feature Aggregation: A Key Factor for Video-based Person Re-Identification
In the task of video-based person re-identification, features of persons in the query and gallery sets are compared to search for the best match. Generally, most existing methods aggregate frame-level features using a temporal method to generate clip-level features, instead of sequence-level representations. In this paper, we propose a new method that aggregates the clip-level features to obtain sequence-level representations of persons, which consists of two parts, i.e., an Average Aggregation Strategy (AAS) and Raw Feature Utilization (RFU). AAS makes use of all frames in a video sequence to generate a better representation of a person, while RFU investigates how the batch normalization operation influences feature representations in person re-identification. The experimental results demonstrate that our method can boost the performance of existing models for better accuracy. In particular, we achieve 87.7% rank-1 and 82.3% mAP on the MARS dataset without any post-processing procedure, which outperforms the existing state of the art.
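Read literally, the Average Aggregation Strategy amounts to pooling all clip-level features of a tracklet; a minimal sketch (our reading of the abstract, with an assumed L2 normalisation for cosine-style matching) is:

```python
import torch
import torch.nn.functional as F

def aggregate_sequence(clip_features):
    """Average clip-level features over the whole sequence so that every
    frame contributes, then L2-normalise the result.

    clip_features -- tensor of shape (num_clips, feat_dim)
    """
    seq_feat = clip_features.mean(dim=0)
    return F.normalize(seq_feat, dim=0)

clips = torch.randn(12, 2048)              # e.g. 12 clips of one tracklet
print(aggregate_sequence(clips).shape)     # torch.Size([2048])
```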
Paper 127: Temporal-clustering based Technique for Identifying Thermal Regions in Buildings
Nowadays, moisture and thermal leaks in buildings are detected manually by an operator, who roughly delimits the critical regions in thermal images. Nevertheless, the use of artificial intelligence (AI) techniques can greatly improve manual thermal analysis, automatically providing more precise and objective results. This paper presents a temporal-clustering-based technique that carries out the segmentation of a set of thermal orthoimages (STO) of a wall taken at different times. The algorithm has two stages: region labelling and consensus. In order to delimit regions with similar temporal temperature variation, three clustering algorithms are applied to the STO, yielding three labelled images. In the second stage, a consensus algorithm is applied to the labelled images. The method thus delimits regions with different thermal evolutions over time, each characterized by a temperature consensus vector. The approach has been tested on real scenes using a 3D thermal scanner. A case study, composed of 48 thermal orthoimages at 30-minute intervals over 24 hours, is presented.
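The region-labelling stage can be pictured as clustering each pixel's temperature time series; the sketch below uses k-means as one of the three possible labellings (the paper's exact choice of clustering algorithms and its consensus stage are not reproduced here).

```python
import numpy as np
from sklearn.cluster import KMeans

def label_thermal_regions(sto, n_regions=4):
    """Cluster pixels of a stack of thermal orthoimages (STO) by their
    temperature evolution over time.

    sto -- array of shape (T, H, W): T orthoimages of the same wall
    """
    t, h, w = sto.shape
    series = sto.reshape(t, h * w).T              # one time series per pixel
    labels = KMeans(n_clusters=n_regions, n_init=10).fit_predict(series)
    return labels.reshape(h, w)

sto = np.random.rand(48, 64, 64)                  # 48 frames of toy data
print(np.unique(label_thermal_regions(sto)))
```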
Paper 129: VLW-Net: A Very Light-Weight Convolutional Neural Network (CNN) for Single Image Dehazing
Camera imaging is one of the most important application areas of computer image and video processing. However, computational cost is usually the main reason preventing many state-of-the-art image processing algorithms from being applied in practical applications, including camera imaging. This paper proposes a very light-weight end-to-end CNN network (VLW-Net) for single image haze removal. We propose a new Inception structure. By combining it with a reformulated atmospheric scattering model, our proposed network is at least 6 times more light-weight than the state of the art. We conduct experiments on both synthetic and realistic hazy image datasets, and the results demonstrate superior performance in terms of network size, PSNR, SSIM and subjective image quality. Moreover, the proposed network can be seamlessly applied to underwater image enhancement, where we observe an obvious improvement over the state of the art.
Paper 130: Distance Weighted Loss for Forest Trail Detection using Semantic Line
Unlike structured urban roads, forest trails do not have a defined shape or appearance and have ambiguous boundaries, making them challenging to detect. In this work, we propose to train a deep convolutional encoder-decoder network with a novel distance-weighted loss function for end-to-end learning of unstructured forest trails. The forest trail is annotated with a “semantic line” representing the trail, and an L1 distance map is derived from the binarized ground truth. We propose to use the distance map to weight the loss function and guide the focus of the network onto the forest trail. The proposed loss function penalizes low activations around the ground truth and high activations in areas further away from the trail. The proposed loss function is compared against other commonly used loss functions by evaluating performance on the publicly available IDSIA forest trail dataset. The proposed method leads to higher trail detection accuracy, with 2.52%, 4.69% and 8.18% improvements in mean intersection over union (mIoU) over mean squared error, Jaccard loss and cross entropy, respectively.
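One plausible form of such a distance-weighted loss (our sketch; the paper's exact weighting is not given in the abstract, and we use a Euclidean rather than an L1 distance map) is a pixel-wise weighted cross-entropy whose weights decay with distance from the annotated line:

```python
import numpy as np
import torch
import torch.nn.functional as F
from scipy.ndimage import distance_transform_edt

def distance_weighted_bce(logits, gt_line, falloff=0.05):
    """Pixels near the annotated trail line get large weights; pixels far
    from it keep a base weight of 1, so both misses near the trail and
    confident activations far away from it are penalised.

    logits  -- raw network output, torch tensor of shape (H, W)
    gt_line -- binary "semantic line" annotation, numpy array of shape (H, W)
    """
    dist = distance_transform_edt(1 - gt_line)    # distance map off the line
    weight = torch.from_numpy(1.0 + np.exp(-falloff * dist)).float()
    target = torch.from_numpy(gt_line).float()
    return F.binary_cross_entropy_with_logits(logits, target, weight=weight)
```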
Paper 134: Exposing Presentation Attacks by a Combination of Multi-intrinsic Image Properties, Convolutional Networks and Transfer Learning
Nowadays, the adoption of face recognition for biometric authentication systems is widespread, mainly because the face is one of the most accessible biometric characteristics. Techniques that deceive these kinds of systems by using a forged biometric sample, such as a printed paper or a recorded video of a genuine access, are known as presentation attacks. Presentation attack detection is a crucial step for preventing this kind of unauthorized access to restricted areas and/or devices. In this paper, we propose a new method which relies on a combination of intrinsic image properties and deep neural networks to detect presentation attack attempts. Exploring depth, salience and illumination properties, along with a Convolutional Neural Network, the proposed method produces robust and discriminant features, which are then classified to detect presentation attack attempts. In a very challenging cross-dataset scenario, the proposed method outperforms state-of-the-art methods on two of the three evaluated datasets.
Paper 137: Natural Images Enhancement Using Structure Extraction and Retinex
Variational Retinex model-based methods for low-light image enhancement have been widely studied in recent years. In this paper, we present an enhanced variational Retinex method for low-light natural image enhancement, based on an initial smoother illumination component obtained with a structure extraction technique. The Bregman splitting algorithm is then introduced to estimate the illuminance and reflectance components. De-blocking and illuminance-component correction are applied to the enhanced reflectance to produce the final enhanced image. Moreover, the estimated smoother illumination component allows enhanced images to preserve edge details. Experimental results with comparisons demonstrate that the presented variational Retinex method can effectively enhance image quality and maintain image color.
Paper 140: Vehicles Tracking by combining Convolutional Neural Network based Segmentation and Optical Flow Estimation
Object tracking is an important proxy task towards action recognition. The recent successful CNN models for detection and segmentation, such as Faster R-CNN and Mask R-CNN, have led to an effective approach to the tracking problem: tracking-by-detection. This very fast type of tracker takes into account only the Intersection-Over-Union (IOU) between bounding boxes to match objects, without any other visual information. However, the lack of visual information in the IOU tracker, combined with the detection failures of CNN detectors, creates fragmented trajectories. Inspired by work on predicting future segmentations using optical flow, we propose an enhanced tracker based on tracking-by-detection and optical flow estimation in a vehicle tracking scenario. Our solution generates new detections or segmentations by translating the results of CNN detectors backward and forward along optical flow vectors. This task can fill in the gaps in trajectories. The qualitative results show that our solution achieves stable performance with different types of flow estimation methods. We then match the generated results with fragmented trajectories using SURF features. The DAVIS dataset is used to evaluate the best way to generate new detections. Finally, the entire process is tested on the DETRAC dataset. The qualitative results show that our methods significantly improve the fragmented trajectories.
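The baseline the paper builds on, a pure IOU tracker, fits in a few lines; the sketch below shows the greedy matching step and why a single missed detection fragments a track (it is a generic illustration, not the authors' implementation).

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / float(union)

def step_iou_tracker(tracks, detections, threshold=0.5):
    """One frame of a bare-bones IOU tracker: greedily extend each track
    with the best-overlapping detection.  A missed detection breaks the
    track; this is exactly the fragmentation the optical flow step repairs."""
    unused = list(detections)
    for track in tracks:
        best = max(unused, key=lambda d: iou(track[-1], d), default=None)
        if best is not None and iou(track[-1], best) >= threshold:
            track.append(best)
            unused.remove(best)
    tracks.extend([d] for d in unused)   # unmatched detections start new tracks
    return tracks
```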
Paper 141: Initial Pose Estimation of 3D Object with Severe Occlusion Using Deep Learning
During the last decade, augmented reality (AR) has gained explosive attention and demonstrated high potential in educational and training applications. As a core technique, AR requires a tracking method to get the 3D pose of a camera or an object. Hence, providing fast, accurate, robust, and consistent tracking methods has been a main research topic in the AR field. Fortunately, tracking the camera pose using a relatively small and less-textured known object placed in the scene has been successfully mastered through various types of model-based tracking (MBT) methods. However, MBT methods require a good initial camera pose estimator, and estimating an initial camera pose from partially visible objects remains an open problem. Moreover, severe occlusions are also a challenging problem for initial camera pose estimation. Thus, in this paper, we propose a deep learning method to estimate an initial camera pose from a partially visible object that may also be severely occluded. The proposed method handles such challenging scenarios by relying on information from detected subparts of the target object to be tracked. Specifically, we first detect subparts of the target object using a state-of-the-art convolutional neural network (CNN). The object detector returns two-dimensional bounding boxes, associated classes, and confidence scores. We then use the bounding box and class information to train a deep neural network (DNN) that regresses the camera's 6-DoF pose. After initial pose estimation, we use a tweaked version of an existing MBT method to keep tracking the target object in real time on a mobile platform. Experimental results demonstrate that the proposed method can accurately estimate initial camera poses from objects that are partially visible and/or severely occluded. Finally, we analyze the performance of the proposed method in more detail by comparing the estimation errors when different numbers of subparts are detected.
Paper 143: On the Uncertainty of Retinal Artery-vein Classification with Dense Fully-convolutional Neural Networks
Retinal imaging is a valuable tool in diagnosing many eye diseases, but it also offers a direct view of the central nervous system and its blood vessels. Accurate measurement of the characteristics of retinal vessels allows analysis not only of retinal diseases but also of many systemic diseases such as diabetes and other cardiovascular or cerebrovascular diseases. This analysis benefits from precise blood vessel characterization. Automatic machine learning methods are typically trained in a supervised manner, where a training set with ground truth data is available. Due to the difficulty of precise pixelwise labeling, the question of the reliability of a trained model arises. This paper addresses this question using Bayesian deep learning and extends recent research on the uncertainty quantification of retinal vasculature and artery-vein classification. It is shown that state-of-the-art results can be achieved by using the trained model. An analysis of the predictions for cases where the class labels are unavailable is given.
Paper 144: Design data model for Big Data Analysis System
One of the most important challenges of the modern digital world is the inflow of large amounts of information from various sources and with different characteristics, which we call Big Data. Big Data, as a complex of IT issues, requires the introduction of new data analysis techniques and technological solutions that allow valuable and useful knowledge to be extracted from it. The correct acquisition and interpretation of data will play a key role in the global and local economy as well as in social policy and large corporations. This article is a continuation of research and development work on the design of a data analysis system using artificial intelligence, in which we present a data model for this system.
Paper 145: Localization of Map Changes by Exploiting SLAM Residuals
Simultaneous Localization and Mapping is widespread in both robotics and autonomous driving. This paper proposes a novel method to identify changes in maps constructed by SLAM algorithms without feature-to-feature comparison. We use ICP-like algorithms to match frames and pose graph optimization to solve the SLAM problem. Finally, we analyze the residuals to localize possible alterations of the map. The concept was tested with 2D LIDAR SLAM problems in simulated and real-life cases.
Paper 146: Deep-Learning for Tidemark Segmentation in Human Osteochondral Tissues Imaged with Micro-computed Tomography
Three-dimensional (3D) semi-quantitative grading of pathological features in articular cartilage (AC) offers significant improvements in basic research on osteoarthritis (OA). We have previously developed a 3D protocol for imaging AC and its structures which includes staining the sample with a contrast agent (phosphotungstic acid, PTA) and subsequent scanning with micro-computed tomography. This protocol was designed to provide X-ray attenuation contrast to visualize the AC structure. However, the protocol has one major disadvantage: the loss of contrast at the tidemark (calcified cartilage interface, CCI). Accurate segmentation of the CCI can be very important for understanding the etiology of OA and for ex-vivo evaluation of tidemark condition at early OA stages. In this paper, we present the first application of Deep Learning to PTA-stained osteochondral samples that performs tidemark segmentation in a fully automatic manner. Our method is based on U-Net trained using a combination of binary cross-entropy and soft Jaccard loss. In cross-validation, this approach yielded an intersection over union of 0.59, 0.70, 0.79, 0.83 and 0.86 within 15, 30, 45, 60 and 75 micrometer padded zones around the tidemark, respectively. Our codes and the dataset, consisting of 35 PTA-stained human AC samples, are made publicly available together with the segmentation masks to facilitate the development of biomedical image segmentation methods.
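The loss named in the abstract is a standard combination; a sketch of one common formulation (the exact weighting in the paper may differ) is:

```python
import torch
import torch.nn.functional as F

def bce_soft_jaccard(logits, target, eps=1e-7, jaccard_weight=0.5):
    """Binary cross-entropy plus a soft (differentiable) Jaccard term.

    logits, target -- tensors of shape (N, 1, H, W), target in {0, 1}
    """
    bce = F.binary_cross_entropy_with_logits(logits, target)
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum()
    union = prob.sum() + target.sum() - inter
    soft_jaccard = (inter + eps) / (union + eps)
    return bce - jaccard_weight * torch.log(soft_jaccard)
```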
Paper 147: Bayesian Feature Pyramid Networks for Automatic Multi-Label Segmentation of Chest X-rays and Assessment of Cardio-Thoracic Ratio
Cardiothoracic ratio (CTR) estimated from chest radiographs is a marker indicative of cardiomegaly, the presence of which is among the criteria for heart failure diagnosis. Existing methods for automatic assessment of CTR are driven by Deep Learning-based segmentation. However, these techniques produce only point estimates of CTR, whereas clinical decision making typically needs to account for uncertainty. In this paper, we propose a novel method for automatic chest X-ray segmentation and CTR assessment. In contrast to the previous art, we propose, for the first time, to estimate CTR with uncertainty bounds. Our method is based on a Deep Convolutional Neural Network with a Feature Pyramid Network (FPN) decoder. We propose two modifications of FPN: we replace batch normalization with instance normalization and inject dropout, which allows us to obtain Monte-Carlo estimates of the segmentation maps at test time. Finally, using the predicted segmentation mask samples, we estimate CTR with uncertainty. In our experiments, we demonstrate that the proposed method generalizes well to three different test sets. Finally, we make the annotations produced by two radiologists for all our datasets publicly available.
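The Monte-Carlo part of the method can be summarised as sampling segmentation masks with dropout kept active at test time; the sketch below assumes a hypothetical `ctr_from_mask` helper (widest heart width over widest thoracic width) and is our illustration of the idea, not the authors' code.

```python
import torch

@torch.no_grad()
def ctr_with_uncertainty(model, image, ctr_from_mask, n_samples=25):
    """Monte-Carlo dropout at test time: sample several segmentation
    masks, derive a CTR value from each, report mean and spread."""
    model.train()                    # keeps the dropout layers stochastic
    ctrs = []
    for _ in range(n_samples):
        mask = torch.sigmoid(model(image)) > 0.5
        ctrs.append(ctr_from_mask(mask))
    ctrs = torch.tensor(ctrs, dtype=torch.float)
    return ctrs.mean().item(), ctrs.std().item()
```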
Paper 150: Evaluation of Unconditioned Deep Generative Synthesis of Retinal Images
Retinal images have become increasingly important in the clinical diagnostics of several eye and systemic diseases. To help medical doctors in this work, automatic and semi-automatic diagnosis methods can be used to increase the efficiency of diagnostic and follow-up processes, as well as to enable wider disease screening programs. However, training advanced machine learning methods for improved retinal image analysis typically requires large and representative retinal image data sets. Even when large data sets of retinal images are available, the occurrence of different medical conditions is unbalanced within them. Hence, there is a need to enrich existing data sets through data augmentation and by introducing noise that is essential to building robust and reliable machine learning models. One way to overcome these shortcomings relies on generative models for synthesizing images. To study the limits of retinal image synthesis, this paper focuses on deep generative models, including a generative adversarial model and a variational autoencoder, that synthesize images from noise without conditioning on any information regarding the retina. The models are trained with the Kaggle EyePACS retinal image set, and, to quantify the image quality in a no-reference manner, the generated images are compared with the retinal images of the DiaRetDB1 database using common similarity metrics.
Paper 151: Person Identification by Walking Gesture using Skeleton Sequences
We present an approach to identify people from the skeleton sequences of their walking gestures. Current works that cope with person identification either directly take raw RGB images as input or use more sophisticated devices to capture other information, e.g., depth and silhouette. However, most of these approaches are vulnerable to changes of environment and clothing. To this end, we propose an approach that exploits the uniqueness of “gait” (the manner of walking is unique to every human being) in order to achieve robustness to variations in both environment and appearance. The proposed method uses skeletal information to capture the characteristics of an individual's gait. First, we analyze the spatial relationships of joints and transform the 3D skeleton coordinates into relative distances and angles between joints. Then, a bidirectional long short-term memory is applied to explore the temporal information of the skeleton sequences. Experimental results show that the proposed method outperforms previous methods on both the BIWI and IAS-Lab datasets, gaining 9.5% and 11.2% accuracy improvements, respectively.
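The joint-coordinate transform described above can be sketched as follows; the specific choice of pairwise distances plus consecutive-triplet angles is our simplified reading, not necessarily the paper's exact feature set.

```python
import numpy as np

def gait_frame_features(joints):
    """Turn one frame of 3D joint coordinates into relative features:
    all pairwise inter-joint distances plus angles at consecutive
    joint triplets.

    joints -- array of shape (J, 3)
    """
    diffs = joints[:, None, :] - joints[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    iu = np.triu_indices(len(joints), k=1)
    features = [dists[iu]]                       # all pairwise distances
    for a, b, c in zip(joints[:-2], joints[1:-1], joints[2:]):
        v1, v2 = a - b, c - b
        cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
        features.append([np.arccos(np.clip(cos, -1.0, 1.0))])
    return np.concatenate(features)

print(gait_frame_features(np.random.rand(20, 3)).shape)   # (208,) for J = 20
```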
Paper 152: Quadratic Tensor Anisotropy Measures for Reliable Curvilinear Pattern Detection
A wide range of applications need the analysis of biomedical images as a fundamental task to extract meaningful information and allow high-throughput measurements. A new method for the detection of curve-like structures in biomedical images is presented, exploiting the local phase vector and structural anisotropy information at various directions. We introduce an oriented Gaussian derivative quadrature filter not only for estimating the local phase vectors, which include line features, but also for its immunity to inhomogeneous intensity and its capability to enhance curved structures of various diameters, leading to more reliable Hessian analysis. A novel measure function based on the Hessian tensor is proposed to detect curvilinear patterns by incorporating the anisotropic indices (coherence and linearity) of curved features, producing a uniform and strong response. The responses are maximized over multiple orientations to achieve a rotationally invariant response and to detect target structures with different widths and illuminations. The evaluation of the proposed method on the extraction of retinal vessels and leaf venation patterns exhibits its superior performance against state-of-the-art methods.
Paper 153: Learning Target-Specific Response Attention for Siamese Network Based Visual Tracking
Recently, Siamese network based visual tracking methods have shown great potential in balancing tracking accuracy and computational efficiency. These methods use two-branch convolutional neural networks (CNNs) to generate a response map between the target exemplar and each candidate patch in the search region. However, since these methods do not fully exploit the target-specific information contained in the CNN features during the computation of the response map, they are less effective at coping with target appearance variations and background clutter. In this paper, we propose a Target-Specific Response Attention (TSRA) module to enhance the discriminability of these methods. In TSRA, a channel-wise cross-correlation operation is used to produce a multi-channel response map, where different channels correspond to different semantic information. Then, TSRA uses an attention network to dynamically re-weight the multi-channel response map at every frame. Moreover, we introduce a shortcut connection strategy to generate a residual multi-channel response map for more discriminative tracking. Finally, we integrate the proposed TSRA into the classical Siamese tracker SiamFC to obtain a new tracker (called TSRA-Siam). Experimental results on three popular benchmark datasets show that the proposed TSRA-Siam outperforms the baseline tracker (i.e., SiamFC) by a large margin and obtains competitive performance compared with several state-of-the-art trackers.
Paper 156: Distributed Multi-Class Road User Tracking in Multi-Camera Network for Smart Traffic Applications
Reliable tracking of road users is one of the important tasks in smart traffic applications. In these applications, a network of cameras is often used to extend the coverage. However, efficient use of information from cameras that observe the same road user from different viewpoints is seldom explored. In this paper, we present a distributed multi-camera tracker that efficiently uses information from all cameras with overlapping views to accurately track various classes of road users. Our method is designed for deployment on smart camera networks, so that most computer vision tasks are executed locally on the smart cameras and only concise high-level information is sent to a fusion node for global joint tracking. We evaluate the performance of our tracker on a challenging real-world traffic dataset for the Turn Movement Count (TMC) application and achieve high accuracies of 93% and 83% on vehicles and cyclists, respectively. Moreover, performance testing in anomaly detection shows that the proposed method provides reliable detection of abnormal vehicle and pedestrian trajectories.
Paper 158: Using Normal/Abnormal Video Sequence Categorization for Efficient Facial Expression Recognition in the Wild
Facial expression recognition in real-world conditions, with large variations in illumination, pose, resolution, and occlusion, is a very challenging task. The majority of approaches in the literature that deal with these challenges do not take into account the varying quality of the different videos. Unlike these approaches, this paper suggests treating video sequences according to their quality. Using the Isolation Forest (IF) algorithm, the video sequences are categorized into two categories: normal videos, with clear illumination and frontal face poses, and abnormal videos, with poor illumination, varied face poses, or occluded faces. Two independent facial expression classifiers for the normal and abnormal videos are built using the Random Forest (RF) algorithm. The experiments demonstrate that processing normal and abnormal videos independently improves the efficiency of facial expression recognition in the wild.
Paper 160: Deep Learning-based Techniques for Plant Diseases Recognition in Real-Field Scenarios
Deep Learning has solved complicated applications with increasing accuracy over time. The recent interest in this technology, especially in its potential application in agriculture, has powered the growth of efficient systems to solve real problems, such as non-destructive methods for plant anomaly recognition. Despite advances in the area, performance in real-field scenarios remains lacking. To deal with those issues, our research proposes an efficient solution that provides farmers with a technology that facilitates proper crop management. We present two efficient deep learning-based techniques for plant disease recognition. The first introduces a practical solution based on a deep meta-architecture and a feature extractor to recognize plant diseases and their location in the image. The second addresses the problem of class imbalance and false positives through the introduction of a refinement function called Filter Bank. We validate the performance of our methods on our tomato plant disease and pest dataset. We collected our own data and designed the annotation process. Qualitative and quantitative results show that despite the complexity of real-field scenarios, plant diseases are successfully recognized. The insights drawn from our research help to better understand the strengths and limitations of plant disease recognition.
Paper 161: A New SVM-based Zero-watermarking Technique for 3D Videos Traitor Tracing
The watermarking layer has a crucial role in a collusion-secure fingerprinting framework, since the hidden information, or identifier, directly attached to user identification, is implanted in the media as a watermark. In this paper, we propose a new zero-watermarking technique for 3D videos based on a Support Vector Machine (SVM) classifier. The proposed scheme consists of two major contributions. The first one is the protection of both the 2D video frames and the depth maps, simultaneously and independently. Robust features are extracted from Temporally Informative Representative Images (TIRIs) of both the 2D video frames and the depth maps to construct the master shares. Then, the relationship between the identifier and the extracted master shares is generated by performing an Exclusive OR (XOR) operation. The second contribution uses the SVM and the XOR operation to estimate the watermark. Compared to other zero-watermarking techniques, the proposed scheme achieves good robustness and transparency even for long watermarks, which makes it suitable for a traitor tracing framework.
Paper 163: Fire Segmentation in Still Images
In this paper, we propose a novel approach to fire localization in images based on the state-of-the-art semantic segmentation method DeepLabV3. We compiled a data set of 1775 images containing fire from various sources, for which we created polygon annotations. The data set is augmented with hard non-fire images from the SUN397 data set. The segmentation method trained on our data set achieves better results than the state of the art on the BoWFire data set. We believe the created data set will facilitate further development of fire detection and segmentation methods, and that such methods should be based on general-purpose segmentation networks.
Paper 164: SuperNCN: Neighbourhood Consensus Network for Robust Outdoor Scenes Matching
In this paper, we present a framework for computing dense keypoint correspondences between images under strong scene appearance changes. Traditional methods, based on nearest-neighbour search in the feature descriptor space, perform poorly when environmental conditions vary, e.g. when images are taken at different times of the day or in different seasons. Our method improves keypoint correspondence in such difficult conditions. First, we use Neighbourhood Consensus Networks to build a spatially consistent matching grid between two images at a coarse scale. Then, we apply a SuperPoint-like corner detector to achieve pixel-level accuracy. Both parts use features learned with domain adaptation to increase robustness against strong scene appearance variations. The framework has been tested on the RobotCar Seasons dataset, showing a large improvement on the pose estimation task under challenging environmental conditions.
Paper 166: Red-Green-Blue Augmented Reality Tags for Retail Stores
In this paper, we introduce a new Augmented Reality (AR) Tag to enhance detection rates, accuracy and user experience in marker-based AR technologies. The tag is a colour-printed card, divided into three colour channels (red, blue, and green) that label three components: (1) an oriented marker, (2) a bar-code and (3) a graphic image, respectively. In this tag, the oriented marker is used for tag detection and orientation identification, the bar-code is for storing and retrieving numerical information (IDs of the models), and the texture image provides users with an original view of what the tag is displaying. When our new AR tags are placed in front of the camera, the corresponding 3D graphics (models of figures or products) appear directly on top of them. We can also rotate the tags to rotate the 3D graphics, and move the camera to zoom in/out or view them from a different angle. The embedded bar-code can be a 1D or 2D bar-code; the currently popular QR code can be used. Fortunately, QR codes include position detection patterns that can be used to identify the orientation of the code. Thus, the oriented marker is not needed for a QR code, and one channel is saved and used for presenting the initially displayed image. Experiments have been carried out to assess the robustness of the proposed tags. The results show that our tags and their orientations (marker stored in the blue colour channel) are relatively easy to detect using commodity webcams. The embedded QR code (painted in blue) is readable in most test cases. Compared to the ordinary black-and-white QR tag, our embedded QR code has a detection rate of 95%. The image texture, stored in the red and green channels, is relatively visible; however, the blue channel is missing, which makes it visually incorrect in some cases. Application-wise, this tag could be used in many AR applications such as shopping. Thanks to the large storage capacity of QR codes, this AR Tag is capable of storing and displaying virtual products of a much more extensive variety. The user can see the 3D figure, zoom and rotate it using intuitive on-hand controls.
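Decoding such a tag reduces to splitting colour channels and reading the QR payload from the blue one; a minimal sketch with OpenCV follows (the file name is hypothetical, and the channel roles follow our reading of the abstract's QR variant).

```python
import cv2

def decode_rgb_tag(tag_bgr):
    """Split the tag into its payloads for the QR-code variant described
    above: blue channel -> QR code (whose finder patterns also give the
    orientation), red and green channels -> texture image preview."""
    blue, green, red = cv2.split(tag_bgr)        # OpenCV stores images as BGR
    payload, points, _ = cv2.QRCodeDetector().detectAndDecode(blue)
    texture = cv2.merge([blue * 0, green, red])  # preview without the QR channel
    return payload, points, texture

image = cv2.imread("tag.png")                    # hypothetical tag photo
payload, corners, preview = decode_rgb_tag(image)
print("model id:", payload)
```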
Paper 167: VA-StarGAN: Continuous Affect Generation
Recent advances in Generative Adversarial Networks have shown impressive results for the task of facial affect synthesis. The most successful architecture is StarGAN, which is effective, but can only generate a discrete number of expressions. However, dimensional emotion representations, usually valence (indicating how positive or negative an emotional state is) and arousal (measuring the power of the emotion activation), are more appropriate to represent subtle emotions appearing in everyday human computer interactions. In this paper, we adapt StarGAN for continuous emotion synthesis and propose VA-StarGAN; we use a correlation-based loss instead of the usual MSE; we adapt the discriminator network to account for continuous output; we exploit and utilize the in-the-wild Aff-Wild and AffectNet databases; we propose a trick for generating the target domain when training the generator. Qualitative experiments illustrate the generation of realistic images, whilst comparison with state-of-the-art approaches shows the superiority of our method. Quantitative experiments (in which the synthesized images are used for data augmentation in training Deep Neural Networks) further validate our development.
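A correlation-based loss of the kind mentioned above can be written as one minus the Pearson correlation between predicted and annotated valence/arousal values; this is our sketch, and the paper's exact formulation may differ (e.g. a concordance variant).

```python
import torch

def correlation_loss(pred, target, eps=1e-8):
    """1 - Pearson correlation between predictions and annotations.

    pred, target -- 1-D tensors of per-image valence or arousal values
    """
    p = pred - pred.mean()
    t = target - target.mean()
    corr = (p * t).sum() / (p.norm() * t.norm() + eps)
    return 1.0 - corr
```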
Paper 170: Unsupervised Desmoking of Laparoscopy Images using Multi-scale DesmokeNet
The presence of surgical smoke in laparoscopic surgery reduces the visibility of the operative field. To ensure better visualization, this paper proposes an unsupervised deep learning approach for desmoking laparoscopic images. The network builds upon generative adversarial networks (GANs) and converts laparoscopic images from the smoke domain to the smoke-free domain. It comprises a new generator architecture with an encoder-decoder structure composed of multi-scale feature extraction (MSFE) blocks at each encoder block. The MSFE blocks of the generator capture features at multiple scales to obtain a robust deep representation map and help to reduce the smoke component in the image. Further, a structure-consistency loss is introduced to preserve structure in the desmoked images. The proposed network, called Multi-scale DesmokeNet, has been evaluated on laparoscopic images obtained from the Cholec80 dataset. The quantitative and qualitative results show the efficacy of the proposed Multi-scale DesmokeNet in comparison with other state-of-the-art desmoking methods.
Paper 171: EpNet: a Deep Neural Network for Ear Detection in 3D Point Clouds
The human ear is full of distinctive features, and its rigidity under facial expressions and ageing has made it attractive to biometric research communities. Accurate and robust ear detection is one of the essential steps in a biometric system, substantially affecting the efficiency of the entire identification system. Existing ear detection methods are prone to failure under typical day-to-day circumstances, such as partial occlusions due to hair or accessories, pose variations, and different lighting conditions. Recently, researchers have proposed different state-of-the-art deep neural network architectures for ear detection in two-dimensional (2D) images. However, ear detection directly from three-dimensional (3D) point clouds using deep neural networks is still an unexplored problem. In this work, we propose a deep neural network architecture named EpNet for 3D ear detection, which detects ears directly from 3D point clouds. We also propose an automatic pipeline to annotate ears in the profile face images of the UND J2 public data set. The experimental results on the public data show that our proposed method can be an effective solution for 3D ear detection.
Paper 174: Feature Map Augmentation to Improve Rotation Invariance in Convolutional Neural Networks
Whilst it is a trivial task for the human vision system to recognize and detect objects with good accuracy, making computer vision algorithms achieve the same feat remains an active area of research. The human vision system recognizes objects seen once with high accuracy despite alterations to their appearance by transformations such as rotation, translation, scaling, distortion and occlusion, making it a remarkably spatially invariant biological vision system. To make computer algorithms such as Convolutional Neural Networks (CNNs) spatially invariant, one popular practice is to introduce variations in the data set through data augmentation. This achieves good results but comes with increased computational cost. In this paper, we address the rotation transformation and, instead of using data augmentation, propose a novel method that allows CNNs to improve rotation invariance through augmentation of feature maps. This is achieved by creating a rotation transformer layer called the Rotation Invariance Transformer (RiT) that can be placed at the output end of a convolution layer. Incoming features are rotated by a given set of rotation parameters, which are then passed to the next layer. We test our technique on the benchmark CIFAR10 and MNIST datasets in a setting where our RiT layer is placed between the feature extraction and classification layers of the CNN. Our results show promising improvements in the network's ability to be rotation invariant across classes, with no increase in model parameters.
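A feature-map-rotation layer in the spirit of RiT can be sketched in a few lines of PyTorch; the right-angle rotation set and the batch-stacking below are our assumptions, as the abstract does not fix them.

```python
import torch
import torch.nn as nn

class RotationInvarianceTransformer(nn.Module):
    """Rotate incoming feature maps by a fixed set of right angles and
    pass the rotated copies on; torch.rot90 keeps this differentiable."""

    def __init__(self, k_rotations=(0, 1, 2, 3)):
        super().__init__()
        self.k_rotations = k_rotations

    def forward(self, x):                    # x: (N, C, H, W)
        rotated = [torch.rot90(x, k, dims=(2, 3)) for k in self.k_rotations]
        return torch.cat(rotated, dim=0)     # stack along the batch axis

feats = torch.randn(8, 64, 16, 16)
out = RotationInvarianceTransformer()(feats)
print(out.shape)                             # torch.Size([32, 64, 16, 16])
```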
Paper 176: Deep Convolutional Network-Based Framework for Melanoma Lesion Detection and Segmentation
Analysis of skin lesion images is crucial in melanoma detection. Melanoma is a form of skin cancer with a high mortality rate. Both semi- and fully automated systems have been proposed in the recent past for the analysis of skin lesions and the detection of melanoma. The performance of these systems has, however, been limited by the complex visual characteristics of skin lesions. Skin lesion images are characterised by fuzzy borders, low contrast between lesion and background, variability in size and resolution, and the possible presence of noise and artefacts. In this work, an efficient deep learning framework is proposed for melanoma lesion detection and segmentation. The proposed method performs pixel-wise classification of skin lesion images to identify melanoma pixels. The framework employs an end-to-end, pixel-by-pixel learning approach using Deep Convolutional Networks with a softmax classifier. We design a novel framework which learns the complex visual characteristics of skin lesions via encoder and decoder sub-networks that are connected through a series of skip pathways bringing the semantic level of the encoder feature maps closer to that of the decoder feature maps. This efficiently handles multi-size, multi-resolution and noisy skin lesion images. The proposed system was evaluated on both the ISBI 2018 and PH2 skin lesion datasets.
Paper 177: Guided Stereo to Improve Depth Resolution of a Small Baseline Stereo Camera Using an Image Sequence
Using calibrated, synchronised stereo cameras significantly simplifies multi-image 3D reconstruction, because they produce a point cloud for each frame pair, which reduces multi-image 3D reconstruction to a relatively simple process of pose estimation followed by point cloud merging. Several synchronized stereo cameras are available on the market for this purpose; however, a key problem is that they often come as fixed-baseline units. This is a problem, since it is the baseline that determines the range and resolution of the acquired 3D data. This work deals with the fairly common scenario of trying to acquire a 3D reconstruction from a sequence of images when the baseline of the camera is too small. Given such a sequence, in many cases it is possible to match each image with another in the sequence that provides a more appropriate baseline. Is there, then, still value in having calibrated stereo pairs? Not using the calibrated stereo pairs reduces the problem to monocular 3D reconstruction, which is more complex and has known issues such as scale ambiguity. This work solves the problem by proposing a guided stereo strategy that refines the coarse depth estimates from calibrated narrow stereo pairs with frames that are further away. Our experimental results are promising: they show that this problem is solvable, provided there are appropriate frames in the sequence to supplement the depth estimates from the original narrow stereo pairs.
Paper 178: CUDA Implementation of a Point Cloud Shape Descriptor Method for Archaeological Studies
In this work, we present a new approach to studying shape descriptors of archaeological objects using an implementation of the smoothed-points shape descriptor (SPSD) method, which is based on the mesh-free numerical simulation method smoothed-particle hydrodynamics. SPSD can describe the textural or morphological properties of a surface by obtaining a property field descriptor based on the per-point shape descriptors and a smoothing function over a neighborhood of each point. The neighborhood size depends on a smoothing distance function, which drives the field descriptor to focus either on small local details or on larger details over big surfaces. SPSD is designed to provide real-time scientific visualization of point cloud shape descriptors to assist in the field study of archaeological artifacts. It also has the potential to provide quantitative values (e.g. morphological properties) for artifact analysis and classification (computational and archaeological). Due to the real-time requirement of the visualisation, SPSD is implemented in CUDA using an octree to resolve the neighborhood particle interactions for each point cloud.
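The SPSD smoothing step can be pictured on the CPU as kernel-weighted averaging over each point's neighbourhood; the sketch below substitutes a KD-tree and a Gaussian kernel for the paper's CUDA octree implementation, so it illustrates the maths rather than the real-time system.

```python
import numpy as np
from scipy.spatial import cKDTree

def smoothed_descriptor_field(points, descriptors, h):
    """Smooth per-point shape descriptors over each point's neighbourhood
    with an SPH-style kernel.

    points      -- (N, 3) point cloud
    descriptors -- (N,) per-point property (e.g. curvature)
    h           -- smoothing distance controlling the neighbourhood size
    """
    tree = cKDTree(points)
    desc = np.asarray(descriptors, dtype=float)
    field = np.empty(len(points))
    for i, p in enumerate(points):
        idx = tree.query_ball_point(p, r=2.0 * h)
        d = np.linalg.norm(points[idx] - p, axis=1)
        w = np.exp(-(d / h) ** 2)             # Gaussian smoothing kernel
        field[i] = np.sum(w * desc[idx]) / np.sum(w)
    return field
```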
Paper 181: Fast Iris Segmentation Algorithm for Visible Wavelength Images Based on Multi-Color Space
Iris recognition for eye images acquired in visible wavelengths is receiving increasing attention. In visible-wavelength environments, many factors may cover or affect the iris region, which makes the iris segmentation step more difficult and challenging. In this paper, we propose a novel and fast segmentation algorithm to deal with eye images acquired in visible-wavelength environments by considering color information from multiple color spaces. Various existing color spaces, such as RGB, YCbCr, and HSV, are analyzed, and an appropriate set of color models is selected for the segmentation process. To accurately localize the iris region, a set of convenient techniques is applied to detect and remove non-iris regions such as the pupil, specular reflections, eyelids, and eyelashes. Our experimental results and comparative analysis using the UBIRIS v2 database demonstrate the efficiency of our approach in terms of segmentation accuracy and execution time.