In order for autonomous vehicles to safely navigate roadways, accurate object detection must take place before safe path planning can occur. Currently, general purpose object detection convolutional neural network (CNN) models have the highest detection accuracies of any method. However, there is a gap in the proposed detection frameworks: those that provide the high detection accuracy necessary for deployment do not perform inference in real time, and those that perform inference in real time have low detection accuracy. We propose the multimodel fusion detection system (MFDS), a sensor fusion system that combines the speed of a fast image detection CNN model with the accuracy of light detection and ranging (LiDAR) point cloud data through a decision tree approach. The primary objective is to bridge the tradeoff between performance and accuracy. The motivation for MFDS is to reduce the computational complexity associated with using a CNN model to extract features from an image. To improve efficiency, MFDS extracts complementary features from the LiDAR point cloud in order to obtain comparable detection accuracy. MFDS is novel in that it not only uses the image detections to aid three-dimensional (3D) LiDAR detection but also uses the LiDAR data to jointly bolster the image detections and provide 3D detections. MFDS achieves 3.7% higher accuracy than the base CNN detection model and is able to operate at 10 Hz. Additionally, the memory requirement for MFDS is small enough to fit on the Nvidia Tx1 when deployed on an embedded device.

Introduction

Vehicles are an integral part of the world, interwoven into our everyday tasks with the primary objective of providing point-to-point transportation. They transport goods between factories, shipping ports, and stores, and are a primary mode of transportation for many populations of the world. Vehicles may be very beneficial; however, their usefulness does not come without a cost. Every year, roughly 30,000 people are killed and over 2 million are injured in automobile accidents in the U.S. [1]. Therefore, safety is a large concern not only for government regulators but also for automotive manufacturers. Vehicles capable of driving themselves are desirable because they would never become distracted, leading to a significant improvement in vehicle safety. In 2005, Stanley, the robotic vehicle from Stanford University developed under the guidance of Sebastian Thrun, won the DARPA Grand Challenge [2]. This success sparked a surge in research and commercial work toward the development of autonomous vehicles. The rush for autonomy increased when Boss from Carnegie Mellon University won the DARPA Urban Challenge in 2007 [3]. Most vehicles are now equipped with advanced driver assistance systems, such as adaptive cruise control, that are classified as National Highway Traffic Safety Administration (NHTSA) level one or two autonomy (partial autonomy). However, one of the challenges in designing level four or five autonomous vehicles (highly or completely autonomous) is accurate perception of the world around the vehicle in all environments [4].

At the heart of perception lies computer vision. The vehicle must take in raw data from sensors such as cameras, light detection and ranging (LiDAR), and RADAR and then process it to form a meaningful representation of the world around it, including object classification, detection, and localization. These different modalities of data provide unique benefits but also have shortcomings and failure points. Each stream of data must be analyzed to gather important information about the environment around the vehicle, which is then passed to later stages of the autonomy system such as path planning.

A modern approach for object detection and classification is to use a convolutional neural network (CNN). The CNN is passed an image and outputs a set of bounding boxes with an object label for each box. The CNN learns which features to extract by optimizing its convolutional kernels, forming rich representational power to detect and classify objects. This class of algorithms yields accurate, generalizable detections. However, LiDAR data are able to reason directly with the three-dimensional (3D) world instead of a two-dimensional (2D) projection of it. Directly manipulating 3D data allows detections to be used for path planning in the next stage of an autonomous vehicle, which would not be possible with the 2D detections of an image algorithm. Although LiDAR analysis would be useful because of its 3D reasoning, detection accuracy is traditionally worse for LiDAR-based algorithms than for camera-based ones, making LiDAR-only algorithms unsuitable for autonomous vehicles [5].

In order to overcome the weaknesses of these independent modes of data, it is possible to fuse them. The two main types of fusion between image and LiDAR data are decision level and feature level fusion. Both techniques have their own weaknesses: for feature level fusion it is difficult to find equivalent representations for each type of data, while decision level fusion can be less accurate because the system relies on each set of features independently. The benefits, though, are increased detection quality over LiDAR-only algorithms and 3D detections that cannot be obtained by monocular camera algorithms, making sensor fusion algorithms well suited to produce the final output of an autonomous vehicle's perception system.

The purpose of this research was to develop the multimodel fusion detection system (MFDS) to fuse camera and LiDAR data in order to provide accurate, fast object detection and classification. MFDS aims to avoid large memory consumption while remaining accurate and operating close to real time by using information from different types of sensors to efficiently augment one another, rather than greatly increasing the computational cost of a single sensor for marginal benefit. The target deployment platform for MFDS is the Nvidia Tegra series, specifically the Tx1.

Related Works

Convolutional neural networks have seen great success in image classification [6–10]. CNNs have also proven successful in a variety of other computer vision tasks including, but not limited to, detection and segmentation [11–16]. Although classification, detection, and segmentation are all different tasks, neural networks are universal approximators and are able to learn how to perform each of them with high levels of accuracy [17]. Each task requires a specific network architecture for optimality, but due to their generality, many improvements to CNNs in one task also prove to be improvements in the others.

By 2017, image classification had reached the accuracy levels necessary for real world performance; however, to be useful for most applications, models would need to take up less memory and perform inference faster. The first major paper to address this problem of model deployment was MobileNets, which developed an accurate, small, fast classification network that was shown to perform well at transfer learning to more complicated tasks [18]. MobileNets is based upon the depthwise separable convolution, which was shown to be nearly equivalent to standard convolution. Depthwise separable convolutions are faster than standard convolutions because they are optimized around the general matrix multiplication (GEMM) routine within the cuBLAS library of CUDA, which is utilized by the cuDNN library [19]. cuDNN is a library specifically dedicated to graphics processing unit (GPU) implementations of CNNs, whose high computational complexity makes them intractable on standard central processing units (CPUs).
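To make the savings concrete, the short sketch below (illustrative only, not taken from Ref. [18]) builds one standard 3×3 convolution and one depthwise separable convolution in TensorFlow/Keras and compares their parameter counts; the input resolution and channel counts are arbitrary example values.

```python
# Minimal sketch: parameter counts of a standard convolution versus a
# depthwise separable convolution (the building block of MobileNets).
import tensorflow as tf

inputs = tf.keras.Input(shape=(224, 224, 32))

# Standard 3x3 convolution, 32 -> 64 channels: 3*3*32*64 weights (+64 biases).
standard = tf.keras.layers.Conv2D(64, 3, padding="same")(inputs)

# Depthwise separable 3x3 convolution: one 3x3 depthwise filter per input
# channel followed by a 1x1 pointwise convolution, 3*3*32 + 1*1*32*64 weights.
separable = tf.keras.layers.SeparableConv2D(64, 3, padding="same")(inputs)

std_model = tf.keras.Model(inputs, standard)
sep_model = tf.keras.Model(inputs, separable)
print("standard conv params:  ", std_model.count_params())   # 18,496
print("separable conv params: ", sep_model.count_params())   # 2,400
```

For 32 input and 64 output channels, the separable block uses roughly an eighth of the parameters of the standard convolution, which is the source of MobileNets' speed and memory advantage.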

Once classification networks had reached a high enough level of accuracy, CNNs were applied to more complicated tasks such as detection. One successful detection network is the single shot multibox detector (SSD) [20]. Instead of outputting only a class label, SSD outputs a list of detections, where each detection consists of a class label, a confidence, and the four coordinates of a bounding box. SSD is a unified detection network, meaning that it performs a single pass through the network without using the region proposal network popularized by Fast and Faster regional convolutional neural network (RCNN) [12,13]. Detections are output at several stages of the network; because the feature map size decreases as the layer depth increases, the earlier, higher resolution feature maps are responsible for smaller objects while the deeper, coarser maps handle larger objects.

In addition to camera, or image, based methods, there are LiDAR, or point cloud, based methods for object detection. As range sensors such as LiDAR have become more affordable and more readily available, point cloud detection methods have increased in popularity but are still not as accurate as image based methods [21–26]. Unlike image methods, which are almost entirely handled by deep learning, LiDAR methods utilize both modern deep learning techniques and traditional handcrafted feature extraction. This is largely due to the computational complexity of 3D convolution, which makes it difficult to form CNNs for a LiDAR point cloud. Therefore, CNN algorithms that operate on LiDAR data typically perform some sort of projection or form a 2D representation of the point cloud to which 2D convolution can be applied.

There are also fusion-based methods, which take multiple modes of data as input and output detections similar to the camera and LiDAR methods previously mentioned. These fusion methods can generally be grouped into two classes: feature fusion and decision fusion. Feature fusion methods generally involve projecting the LiDAR point cloud into a 2D space and then processing both the image and the projection with some sort of CNN so that the learned features extracted from both mediums complement one another [27–31]. The other group of fusion algorithms is decision level fusion, which performs independent detections in both mediums and then combines both sets of detections to output a single set of superior detections [32,33]. The idea behind fusion methods is to utilize the advantages of each type of data to augment one another in order to provide superior detection quality over either independent medium alone. The advantages of fusion are not relegated to learned features either and can benefit handcrafted feature extraction methods [34,35]. A novel approach that was developed at the same time as the proposed algorithm is Frustum PointNets [36]. Both the proposed algorithm and Frustum PointNets utilize a decision-feature fusion in which advanced image detection techniques are used to narrow the 3D search space for creating detections from LiDAR point clouds. Frustum PointNets differs from the proposed algorithm by only using the image detections to aid 3D LiDAR detection, whereas the proposed algorithm uses the LiDAR data to jointly bolster the image detections and provide 3D detections. To the knowledge of the authors, no fusion method has been developed with deployment on an embedded system in mind.

Methods

The proposed MFDS is a decision-feature fusion, seen in Fig. 1. At the start of the algorithm, an image-based CNN performs detection to output a set of possible objects represented as bounding boxes with classes and probabilities. The synchronized point cloud is then masked to the same viewing angles as the camera, the ground plane is removed, the cloud is transformed into the camera coordinates, and the remaining points are separated into clusters based upon Euclidean distance. Then, image detections and clusters are associated and paired together. Next, a vector of hand selected features is extracted from the cluster of each pair, if there are any, and run through a multilayer perceptron (MLP) to regress class probability, object length, distance to the object's center, and orientation of the object. The pair's confidence is adjusted, or the pair is potentially removed, based upon the output of the MLP. The output of the fusion algorithm is a set of confident, 3D localized detections. The novelty of MFDS is twofold, from both an algorithmic and a systems standpoint. From the algorithmic standpoint, MFDS is able to receive feedback on the 2D detections produced by a high quality object detector from its 3D LiDAR detector counterpart. From the systems standpoint, MFDS is both fast enough and has a small enough memory footprint to be deployed on embedded systems for real-world use. Code, along with instructions, for both training and deployment will be made publicly available upon publication.

Fig. 1
System diagram for the operation of MFDS

Creation of Fusion Models.

The first step was acquiring a pretrained image detection CNN model, specifically SSD, and then performing transfer learning so that the SSD model was fine-tuned on the KITTI dataset [5]. The SSD model selected for transfer learning used a MobileNetv1 feature extractor (pretrained on ImageNet) and had been trained on the common objects in context dataset [37]. L1 localization loss and softmax cross entropy were used to form the objective function, each with equal weighting. Hard negative mining was used along with the RMSProp optimizer and an initial learning rate of 0.0001. The KITTI dataset was split to create training and validation sets: out of the 7481 images and point clouds, 5611 were used for training and 1870 were used for validation. Next, the LiDAR processing pipeline was built up to the feature extraction stage for identified clusters. A converted LiDAR dataset was then created by running every LiDAR point cloud in the KITTI dataset through the processing pipeline to extract features of every cluster, which were then compared to KITTI's labels and saved to disk if they matched. After the converted dataset was formed, the classifying MLP was built and trained.
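For reference, a split with these counts can be reproduced by shuffling KITTI's frame indices as in the sketch below; the exact indices used by the authors are not published, so the random seed is an arbitrary assumption.

```python
import random

# Illustrative split of KITTI's 7481 training frames into the 5611/1870
# train/validation sets described above; the shuffle seed is arbitrary.
frame_ids = [f"{i:06d}" for i in range(7481)]
random.Random(0).shuffle(frame_ids)
train_ids, val_ids = frame_ids[:5611], frame_ids[5611:]
assert len(train_ids) == 5611 and len(val_ids) == 1870
```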

Feature Extraction.

After point cloud clusters had been formed, the last step before the MLP was to perform feature extraction. The features that were decided upon were the sample mean and standard deviation of the cluster's x, y, and z coordinates, the ranges in all three dimensions, as well as the ratios x/y, x/z, y/x, y/z, z/x, and z/y. Although these ratios are redundant and include inverses, they proved to be beneficial, as the MLP did not learn as well if some were removed. It is believed that this redundancy slightly reduced the complexity of the nonlinear decision boundary that the network needed to learn.

These features were chosen because they represent important geometric information about the point cloud cluster while being cheap to compute. The geometric data of the point cloud are used to augment the color intensity data stored in the camera image.
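A minimal NumPy sketch of this 15-element feature vector follows; the six ratios are assumed here to be taken between the coordinate ranges, since the exact quantity being divided is not spelled out.

```python
import numpy as np

def cluster_features(points):
    """Compute the 15-element handcrafted feature vector described above.

    `points` is an (N, 3) array of a cluster's x, y, z coordinates. The ratios
    are assumed to be ratios of the coordinate ranges (extents).
    """
    mean = points.mean(axis=0)                      # 3: sample means
    std = points.std(axis=0, ddof=1)                # 3: sample standard deviations
    rng = points.max(axis=0) - points.min(axis=0)   # 3: ranges in x, y, z

    x, y, z = rng
    ratios = np.array([x / y, x / z, y / x, y / z, z / x, z / y])  # 6: pairwise ratios

    return np.concatenate([mean, std, rng, ratios])  # 15 features total
```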

Dataset Creation.

After the LiDAR point cloud pipeline had been created, the dataset needed to be formed. To form the dataset, each point cloud in the separated training portion of KITTI was processed with the same pipeline used during inference, which included masking, ground plane extraction, transformation, and clustering. Each cluster was checked against each labeled cuboid to determine whether that cluster represented one of the labeled objects. Since the labels were not perfect, a cluster was considered a labeled object if no more than 5% of its points existed outside the label cuboid. In addition to the KITTI labels of vehicle, pedestrian, and cyclist, clusters with no class label were included in the final dataset so that the network could learn to reject objects that were not relevant, which meant that the MLP would output a score for four classes instead of three. Therefore, the final classes used were don't care, vehicle, pedestrian, and cyclist. No experimentation was performed with only the original three classes; however, the authors intuit that this would lead to a large increase in false positives and, ultimately, a weaker algorithm. This ability of the model to differentiate between object and background is a common feature of object detectors [11–15]. In total, there were 17,382 point cloud clusters: 17% were other, 69% were vehicles, 10% were pedestrians, and 3% were cyclists, as seen in Fig. 2. After the dataset's clusters were identified, each cluster had its features extracted and its class, distance to center, length, orientation, and features saved to disk. 75% of the clusters were used for training and the remaining 25% were used for validation.
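The 5% containment rule can be written as a small helper, sketched below; cuboid_contains is a hypothetical placeholder for whatever inside-cuboid test is applied to KITTI's label cuboids (center, dimensions, and yaw), which is left abstract here.

```python
import numpy as np

def matches_label(cluster_xyz, cuboid_contains, max_outside_frac=0.05):
    """Assign a KITTI label to a cluster if at most 5% of its points fall
    outside the label cuboid, as described above.

    `cluster_xyz` is an (N, 3) array; `cuboid_contains` is any callable that
    returns a boolean mask of points inside the (rotated) label cuboid.
    """
    inside = np.asarray(cuboid_contains(cluster_xyz), dtype=bool)
    frac_outside = 1.0 - inside.mean()
    return frac_outside <= max_outside_frac
```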

Fig. 2
Class representation in the created MLP cluster dataset

Multilayer Perceptron Architecture.

The MLP to classify point cloud clusters was built in Python with the TensorFlow library. The MLP's job was to predict the class of the object's cluster, the distance to the center of the object from the LiDAR sensor's origin, the size of the object along the z axis, and the rotation of the object, given the cluster's features and the trained model. The class probability distribution is an unknown nonparametric distribution; the MLP is used to form an estimate of the posterior distribution given the trained model and the input feature vector. The class probability is stated mathematically in Eq. (1). An MLP was used as the learning algorithm as opposed to a support vector machine or boosting algorithm because an MLP is able to jointly perform classification and regression, whereas other common general machine learning algorithms can only perform one or the other
$$C_i = P\left(c_i \mid \mathbf{x}_i, \Theta\right) \tag{1}$$

$C_i$ represents the class probability for the $i$th feature vector $\mathbf{x}_i$ and $\Theta$ represents the trained MLP model parameters. The distance to the center of the object, $z_i$, the size of the object, $l_i$, and the rotation of the object, $\alpha_i$, are all estimated values.

The MLP needed one output layer for every value that needed to be regressed; therefore, the MLP had four different output layers. The input layer had a size of 15, since that was the length of the feature vector; the class output layer had a size of four to be able to regress each class probability; and the distance, size, and rotation layers each had a size of one.

The number and shape of the MLP's hidden layers needed to be determined. The number of layers tested varied from one to seven and the number of neurons in each layer was varied between 10 and 200. The hidden layer configuration that converged to the best accuracy had a single hidden layer with 150 neurons.

A truncated normal distribution with a standard deviation of 0.01 was sampled to initialize all parameters. As is standard for detection networks, the distance, length, and rotation were not predicted outright; scaled versions were used instead. The entire dataset was analyzed and the maximum distance was 50 m, as was the maximum length. The maximum length was this large because of random clusters that were included in the dataset. Therefore, a scaled version of the distance and length was predicted instead of the raw values, which was easier for the network to learn. In addition, the rotation value can only assume values between −π and π, so the natural scaling factor used was π. The scalings are seen below:
$$\hat{z}_i = \frac{z_i}{50\ \mathrm{m}},\qquad \hat{l}_i = \frac{l_i}{50\ \mathrm{m}},\qquad \hat{\alpha}_i = \frac{\alpha_i}{\pi} \tag{2}$$

The activation function used for the hidden layer was a rectified linear unit, the class output layer was a softmax, the distance and length output layers were both sigmoids, and the rotation layer was a hyperbolic tangent [7]. The class layer used the softmax to squash all class probabilities between 0 and 1 and to make the sum of the probabilities equal 1. The distance and length layers used the sigmoid because the scaled predictions fell between 0 and 1: neither value could be negative, and the scaled value could never exceed 1. The rotation layer used a hyperbolic tangent since the scaled values fell between −1 and 1.
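Putting these pieces together, the following is a sketch of the cluster MLP written with the Keras functional API rather than the original TensorFlow 1.x graph code: a 15-feature input, a single hidden layer of 150 ReLU units, truncated normal initialization, and the four output heads with the activations described above.

```python
import tensorflow as tf

init = tf.keras.initializers.TruncatedNormal(stddev=0.01)   # initialization described in the text

inputs = tf.keras.Input(shape=(15,), name="cluster_features")
hidden = tf.keras.layers.Dense(150, activation="relu", kernel_initializer=init)(inputs)

# Four output heads, one per regressed quantity, with the activations listed above.
class_probs = tf.keras.layers.Dense(4, activation="softmax", kernel_initializer=init, name="cls")(hidden)
distance = tf.keras.layers.Dense(1, activation="sigmoid", kernel_initializer=init, name="dist")(hidden)  # scaled by 50 m
length = tf.keras.layers.Dense(1, activation="sigmoid", kernel_initializer=init, name="size")(hidden)    # scaled by 50 m
rotation = tf.keras.layers.Dense(1, activation="tanh", kernel_initializer=init, name="rot")(hidden)      # scaled by pi

mlp = tf.keras.Model(inputs, [class_probs, distance, length, rotation])
mlp.summary()
```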

Training.

In order to train the MLP to regress all four values simultaneously, a multipart loss function was created. The total loss is the summation of a weight regularization penalty term, $L_{reg}$, the weighted class error, $L_{class}$, and the weighted distance, length, and rotation errors, $L_{dist}$, $L_{size}$, and $L_{rot}$, respectively. The final multipart loss equation can be seen below:
$$L_{total} = L_{reg} + \frac{\lambda_{class}}{N_{total}}\sum_{i=1}^{N_{total}} L_{class}^{(i)} + \frac{1}{N_{care}}\sum_{i=1}^{N_{care}}\left(\lambda_{dist} L_{dist}^{(i)} + \lambda_{size} L_{size}^{(i)} + \lambda_{rot} L_{rot}^{(i)}\right) \tag{3}$$

$$L_{reg} = 0.001 \sum_{\theta \in \Theta} \left\lVert\theta\right\rVert_2 \tag{4}$$

$L_{class}$ was the cross entropy loss, and $L_{dist}$, $L_{size}$, and $L_{rot}$ were the smooth L1, or Huber, loss [38]. $L_{reg}$ was the summation of the L2 norm of all trainable parameters with a decay value of 0.001. Finally, $\lambda_{class} = 0.8$ and $\lambda_{dist} = \lambda_{size} = \lambda_{rot} = (1 - 0.8)/3$. The class loss weight, $\lambda_{class}$, was set higher than the other three weight terms because that output was the most important and needed to be learned to the highest degree of accuracy possible. $N_{total}$ is the total number of training examples per batch and $N_{care}$ is the number of training examples that do not have a don't care class label. The summations are over $N_{total}$ and $N_{care}$, respectively, because there are no distance, size, or rotation labels for the clusters labeled don't care. Therefore, the don't care terms needed to be excluded when computing these losses, as they would otherwise corrupt the gradients and degrade the performance of the MLP.
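A hedged sketch of this multipart loss is shown below; the tensor shapes, the convention that class 0 denotes don't care, and the omission of the weight decay term (normally attached to the layers as a regularizer) are assumptions made for illustration.

```python
import tensorflow as tf

LAMBDA_CLASS = 0.8
LAMBDA_OTHER = (1.0 - LAMBDA_CLASS) / 3.0    # shared by the dist, size, and rot terms

huber = tf.keras.losses.Huber(reduction=tf.keras.losses.Reduction.NONE)
cce = tf.keras.losses.SparseCategoricalCrossentropy()

def mfds_loss(y_cls, y_dist, y_size, y_rot, p_cls, p_dist, p_size, p_rot):
    """y_cls: (batch,) integer labels; regression targets/predictions: (batch, 1)."""
    # Class 0 is assumed to be "don't care"; only "care" examples contribute to
    # the distance, size, and rotation terms.
    care = tf.cast(tf.not_equal(y_cls, 0), tf.float32)
    n_care = tf.maximum(tf.reduce_sum(care), 1.0)

    l_class = cce(y_cls, p_cls)                                    # averaged over N_total
    l_dist = tf.reduce_sum(care * huber(y_dist, p_dist)) / n_care  # averaged over N_care
    l_size = tf.reduce_sum(care * huber(y_size, p_size)) / n_care
    l_rot = tf.reduce_sum(care * huber(y_rot, p_rot)) / n_care

    # L_reg is omitted here; in Keras it is typically attached to the layers via
    # kernel_regularizer=tf.keras.regularizers.l2(0.001).
    return LAMBDA_CLASS * l_class + LAMBDA_OTHER * (l_dist + l_size + l_rot)
```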

The batch size was set to 12 on each GPU, giving an effective batch size of 36 since three GPUs were used. The network was trained for 50,000 iterations. Both Nesterov momentum [39], with a learning rate of 0.1 and momentum of 0.9, and Adam [40] were tested; Adam converged to a higher class accuracy and was therefore used. The total loss during training can be seen in Fig. 3.

Fig. 3
Total loss during training

Due to the unknown shape of the error surface, the MLP accuracy was extremely sensitive to the initial conditions used for the learnable parameters. Because of this sensitivity, the network was trained 75 times and the network that scored the highest classification accuracy on the validation set was selected as the final network.

Association Problem.

In order to perform the fusion after trained models had been created for the image detection CNN and the LiDAR classifying MLP, image detections and clusters needed to be associated together, since potentially noisy sensor readings may detect the same object. The projection matrix from 3D LiDAR space to 2D camera image space is provided for every timestep in the KITTI dataset. The projection matrix allows the pixel location of each 3D point within the cloud to be computed. The projection can be computed in the following way:
$$\tilde{\mathbf{y}}_i = P_{rect}^{(2)}\, R_{rect}^{(0)}\, T_{velo}^{cam} \begin{bmatrix} x_i & y_i & z_i & 1 \end{bmatrix}^{T} \tag{5}$$

$$\begin{bmatrix} u_i \\ v_i \end{bmatrix} = \frac{1}{\tilde{y}_{i,3}}\begin{bmatrix} \tilde{y}_{i,1} \\ \tilde{y}_{i,2} \end{bmatrix} \tag{6}$$

After the matrix multiplication had been computed, the output vector was normalized by its last element to complete the projection from 3D LiDAR space to 2D camera image space. After the projection had been developed, each cluster's mean x, y, and z coordinates were projected into image space, and the centroid of each CNN detection's bounding box was formed. With the two sets of projected cluster centroids and bounding box centroids, a simple Euclidean distance comparison in pixel space was performed. Any pair of centroids under a certain threshold was considered to represent the same object. The threshold used was 75 pixels due to the large variance of the cluster centroids' means.
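The projection and association steps can be sketched as follows; the calibration matrix names follow the KITTI development kit, and R0_rect and Tr_velo_to_cam are assumed to have been padded to 4×4 homogeneous form.

```python
import numpy as np

def project_to_image(points_velo, P2, R0_rect, Tr_velo_to_cam):
    """Project LiDAR points (N, 3) into pixel coordinates using KITTI's
    per-frame calibration matrices, normalizing by the last element."""
    pts_h = np.hstack([points_velo, np.ones((points_velo.shape[0], 1))])  # homogeneous (N, 4)
    proj = (P2 @ R0_rect @ Tr_velo_to_cam @ pts_h.T).T                    # (N, 3)
    return proj[:, :2] / proj[:, 2:3]                                     # (u, v) per point

def associate(box_centroids_px, cluster_centroids_px, threshold_px=75.0):
    """Pair a detection's box centroid with every projected cluster centroid
    within the 75-pixel threshold (so one box may acquire several pairs)."""
    pairs = []
    for i, box_c in enumerate(box_centroids_px):
        dists = np.linalg.norm(cluster_centroids_px - box_c, axis=1)
        pairs.extend((i, int(j)) for j in np.flatnonzero(dists < threshold_px))
    return pairs
```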

Confidence Adjustment.

After each pair's cluster had been classified, a check was performed between the pair's image detection class and its cluster classification class. If the classes were not the same, the detection was eliminated; if the classes were the same, the agreed upon class's confidence was increased by 50%, and the class probability distribution was renormalized afterward. This confidence adjustment allowed uncertain image detections, which would otherwise be eliminated, to gain the required confidence to be considered true detections.
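A sketch of this rule is shown below; det_probs is assumed to be the CNN's full class probability vector for the detection.

```python
import numpy as np

def adjust_confidence(det_class, det_probs, mlp_class):
    """Eliminate the detection when the image CNN and cluster MLP disagree;
    otherwise boost the agreed class's probability by 50% and renormalize."""
    if det_class != mlp_class:
        return None                            # detection eliminated
    probs = np.asarray(det_probs, dtype=float).copy()
    probs[det_class] *= 1.5                    # 50% increase for the agreed class
    return probs / probs.sum()                 # renormalized class probabilities
```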

Multimodel Fusion Detection System Deployment Details.

The fusion algorithm was jointly implemented in Python and C++ using the robot operating system (ROS) as the communication and build platform. ROS serves multiple purposes in this code: first, to allow messages to be passed between the two languages; second, to provide powerful debugging tools such as topic monitoring and the visualization package RVIZ; and third, because MFDS needs to be easily deployable on robotic systems [41].

Multimodel fusion detection system requires two ROS nodes to operate concurrently, seen in Fig. 4; the first is a Python node that performs image detection and the second is a C++ node that performs point cloud manipulation, classification, and fusion. The Python node is subscribed to a compound data message containing both an image and a synchronized point cloud. Once the CNN finishes computing detections, another compound message is formed containing the detections and the original compound data message, which is published to the C++ node. The C++ node performs all point cloud preprocessing, cluster formation, cluster-detection pairing, feature extraction, feature classification, and fusion. The output of the C++ node is the final set of fused, 3D localized detections. Python was chosen for the image detection CNN because of Python's easy access to the TensorFlow library, eliminating the need to write the network inference code. C++ was chosen for the fusion node because the point cloud library (PCL) only operates in C-based languages.
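A minimal sketch of the Python image detection node is given below; the topic names and the compound message types (SyncedImageCloud, DetectionsWithData) are hypothetical stand-ins for the custom ROS messages described here, and run_ssd is a placeholder for the CNN inference call.

```python
#!/usr/bin/env python
# Sketch of the Python image-detection node; message types and topics are hypothetical.
import rospy
from mfds_msgs.msg import SyncedImageCloud, DetectionsWithData  # hypothetical package

def run_ssd(image_msg):
    # Placeholder for SSD inference (e.g., a TensorFlow SavedModel call).
    raise NotImplementedError

def on_synced_data(msg, publisher):
    detections = run_ssd(msg.image)                                          # CNN inference on the image
    publisher.publish(DetectionsWithData(detections=detections, data=msg))   # hand off to the C++ fusion node

if __name__ == "__main__":
    rospy.init_node("mfds_image_detector")
    pub = rospy.Publisher("/mfds/detections_with_data", DetectionsWithData, queue_size=1)
    rospy.Subscriber("/mfds/synced_data", SyncedImageCloud, on_synced_data, callback_args=pub)
    rospy.spin()
```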

Fig. 4
Fusion algorithm's compute graph

Since the LiDAR classifying MLP was trained with TensorFlow in Python but inference was performed in C++, the MLP needed to be ported over to C++ [42]. Since the network involved no convolutions and was only a simple MLP, the cuDNN library did not need to be called. An MLP is a series of matrix multiplications followed by nonlinear activation functions, so an efficient, large-scale matrix multiplication library was required. cuBLAS was chosen over Eigen due to the size of the matrices that needed to be multiplied. cuBLAS is a CUDA library that operates similarly to the basic linear algebra subprograms (BLAS) library in C++; however, cuBLAS performs the same operations on a GPU instead of a CPU. The cuBLAS Sgemm function call used to implement the MLP is the same call that MobileNets optimized its depthwise separable convolutions around [18].

Results

Light Detection and Ranging Cluster Multilayer Perceptron.

The MLP was able to learn to classify each cluster with 90% accuracy, on average, without the need for a more computationally expensive feature extraction method such as a deep architecture utilizing learned features. The MLP struggled with classifying the nothing, or don't care, class as well as cyclists. It is believed that classifying the nothing class was difficult due to the large variance in shapes and sizes, making it difficult for the network to form a relationship between the high variance features and the class label (Figs. 5 and 6). Cyclists were difficult to classify due to the relatively low number of training examples to learn from.

All distance, length, and rotation errors presented are computed after rescaling the MLP's outputs and labels back to dimensional values from their nondimensional forms. The MLP's predicted distance to the object had a mean square error (MSE) of 1–2 m², which makes it a much better predictor than the centroid of the cluster, though not an optimal one. The MLP distance prediction is an order of magnitude more accurate than the naive analysis of the point cloud. The predicted object length had an MSE of approximately a quarter of a square meter, while the average length of vehicles, pedestrians, and cyclists was 2.59 m. Unlike distance, there is no easy way to predict object length or rotation without making strong assumptions about each class's shape. Therefore, the MLP provides access to reasonable predictions for these values very quickly at the expense of some accuracy.

Fig. 5
Fig. 6

As the MLP progressed through its 75 separate training initializations, the network appeared to converge to only three different levels of accuracy. It became apparent that these three levels of accuracy were proportional to each class's representation in the dataset. The network would converge to roughly 79%, 87%, or 90%. The networks that converged to 79% had not learned how to predict pedestrians or cyclists and instead only predicted class labels of vehicles and nothing, since those made up the majority of the training examples. Convergence to 87% represented not learning cyclists, and 90% represented learning how to predict all classes. This theory was tested by tallying the outputs on the test set, which confirmed that the networks in question never output the corresponding class labels.

Multimodel Fusion Detection System and Image Detection Convolutional Neural Network.

Testing was performed on the remaining 1870 images and point clouds that were set aside in the test set. The image only detection method was analyzed in order to see how well the primary SSD model, as well as the region-based fully convolutional network (RFCN) and faster regional convolutional neural network (FRCNN) reference models, performed on the KITTI dataset, and to act as a benchmark for how much improvement MFDS provided. The KITTI dataset uses the mean average precision (mAP) metric for reporting results, which is computed by finding the area under the precision recall curve. KITTI defines a true positive as a detection that scores over 70% intersection over union (IOU) for cars and 50% for pedestrians and cyclists. However, mAP is a poor indicator of MFDS's detection quality since it is dependent on how many detections an algorithm can output.

Recall is a measure of how completely the detector's output detection set covers all labeled objects, while precision is a measure of how few incorrect detections are in the output set. Recall is largely dependent on how many detections a detector can output, generally in the range of 10–300 [43]. This variable number of detections, at varying confidence levels, allows for increased recall: as more detections are output, the likelihood of covering all labeled objects increases. MFDS operates in direct opposition to the idea of a variable number of detections with different confidence levels and works to output only as many detections as necessary, each with high confidence. This difference in output ideology means that recall is not a good evaluation metric for MFDS and, as a result, neither is mAP. When the few high confidence outputs of MFDS are used to compute the system's recall, the precision recall curve drops to zero once there are no more available detections to cover the labels. Therefore, a more applicable metric was used to evaluate the improvement of MFDS over the base SSD image detection CNN.

In order to determine the viability of the image detection CNN model for deployment, three main metrics were considered: the memory footprint on the CPU and GPU, the inference time, and our own accuracy metric called adjusted accuracy. We define adjusted accuracy, Eq. (7), as the percentage of accurate detections, with a penalty for incorrect detections, divided by the total number of labeled objects. In addition to these three metrics, it was beneficial to view the distribution of different types of detections in order to see the room for improvement that MFDS could add. The different types of detections were based upon varying combinations of confidence and correctness. Correctness was defined as the detection predicting the correct class with appropriate IOU, confidence was defined as the confidence MFDS had in the detection's class, and a miss was defined as the detection CNN placing a detection around no labeled object. MFDS's main objective was to reduce the number of unconfident but correct detections and also to reduce the number of confident misses. Therefore, MFDS is best suited to be used in conjunction with a CNN detection model that has a high number of unconfidently correct detections, a high number of confident misses, and performs inference at a high rate of speed
$$\text{adjusted accuracy} = \frac{N_{correct} - N_{incorrect}}{N_{labeled}} \tag{7}$$

The command line tool nvidia-smi was used to find each model's GPU memory consumption, and the command line tool top was used to find each model's CPU memory consumption, which included all memory required by the process. The memory consumption is much larger than the size of the trained models because it includes the memory needed to store temporary values and all additional libraries required for performing inference.

The SSD model was chosen for its small memory footprint, fast inference time, and ideal detection type distribution. Roughly 11% of SSD detections fell within the categories of unconfidently correct and confidently incorrect, seen in Fig. 7, which are the types of detections MFDS targets for elimination. MFDS takes the small, lightweight SSD model and increases the confidence of its detections in order to output high confidence, correct detections with a minimal increase in memory demands, while still being able to operate at 10 Hz, as seen in Table 1. Figures 8–10 show that MFDS increases the confidence of detections that are able to become confident enough to be final detections. MFDS's output is 96% high confidence, correct detections, which is roughly a 50% improvement over the base SSD model, seen in Fig. 7. MFDS suffers from a slight, roughly 0.8%, increase in confidently incorrect detections and misses. With the increase in the number of confident detections, MFDS was able to increase the adjusted accuracy by 3.7%. A comparison to the popular RFCN and Faster RCNN models is supplied in Table 1.

Fig. 7
SSD model detection type
Fig. 8
SSD to MFDS comparison; MFDS top and SSD bottom
Fig. 9
SSD to MFDS comparison; MFDS top and SSD bottom
Fig. 10
SSD to MFDS comparison; MFDS top and SSD bottom
Table 1
SSD results in MFDS

                         RFCN     FRCNN    SSD      MFDS
CPU memory (GB)          2.592    2.816    1.824    2.176
GPU memory (GB)          1.962    7.876    0.554    0.703
Inference time (s)       0.073    0.54     0.0167   0.1083
Adjusted accuracy (%)    58.11    57.84    37.18    40.89

In order to evaluate the inference speed of MFDS, each function in MFDS was timed, as shown in Table 2. Cluster formation took roughly half of the entire MFDS inference time, with the other half spread evenly among the other functions. Although the MFDS processing is roughly as fast as RFCN and faster than FRCNN, the times are not directly comparable because the CNN detection models operate on the GPU while MFDS operates on both the CPU and GPU with nonoptimized functions. For a speed comparison, MFDS processes a synchronized image and LiDAR point cloud roughly 60 ms faster than Frustum PointNets [36].

Table 2
Time analysis of MFDS inference

Task                             Time (s)
CNN                              0.0167
Masking                          0.014
Segmenting ground plane          0.009
Coordinate transform             0.015
Cluster formation                0.039
Detection/cluster association    0.001
Feature extraction               0.0004
Classification                   0.008
Confidence adjustment            0.0001
Total                            0.1083

One of the main sources of error in MFDS proved to be the detection-cluster association. A known problem is that when objects appear very close to one another, their bounding boxes will be nearly on top of one another and one will be erased by nonmaximal suppression [44]. MFDS is very susceptible to this problem due to the way it associates detections and clusters. An example can be seen in Figs. 11 and 12, where a cyclist partially occludes a pair of pedestrians. There is only a single bounding box after nonmaximal suppression, due to their high IOU; however, there are two different clusters paired with it, visualized as the pink and blue points in Fig. 12. This produces two final detections with the same bounding box but different 3D localized values, which, by KITTI's definition, is one true positive and one false positive, since each label can be matched to only one detection and all additional detections are considered errors.

Fig. 11
Occluded objects from the image viewpoint
Fig. 12
Occluded objects from the point cloud cluster viewpoint

Another source of error for MFDS was the false positive rate due to confident misses. Since many clusters are generated from a point cloud and are then paired with image detections by MFDS's association method, nonobject clusters make it to the cluster classification stage. The classifying MLP has a classification accuracy of 90%, which means that roughly 10% of the nonobject clusters that reach the MLP will be misclassified and, when combined with a poor confidence image detection, can become confident false detections. An example can be seen in Fig. 13, where a treetop canopy's cluster is falsely detected as a car (see Refs. [45] and [46]).

Fig. 13
False detection much higher than ground plane

Conclusion

An object detection system for autonomous vehicles was discussed in this paper. MFDS was able to increase the adjusted accuracy by 3.7% over SSD while providing detections at 10 Hz. This increase in adjusted accuracy was achieved by turning unconfident detections into confident ones through an analysis of the corresponding point cloud cluster. MFDS performed inference comparably to or faster than the reference CNN image detectors, took up significantly less memory, and provided 3D localized detections. MFDS was able to take unconfident detection proposals from the image CNN and use LiDAR data to add enough confidence for the proposals to be considered true detections. MFDS is a step toward a deployable object detection system for autonomous vehicles. It fuses information from multiple sensors to produce outputs directly usable by the path planning module of an autonomous vehicle. Although there are limitations to MFDS, the benefits and information that it produces outweigh the problems it faces.

Acknowledgment

We would like to thank both Magna and Polaris for funding Florida Tech's IGVC team.

Funding Data

  • SMART Scholarship, USD/R&E, National Defense Education Program/BA-1, Basic Research (Funder ID: 10.13039/100005713).

References

1. NHTSA, 2015, "Traffic Safety Facts 2015, A Compilation of Motor Vehicle Crash Data From the Fatality Analysis Reporting System and the General Estimates System," U.S. Department of Transportation National Highway Traffic Safety Administration, Washington, DC.
2. Thrun, S., Montemerlo, M., Dahlkamp, H., Stavens, D., Aron, A., Diebel, J., Fong, P., Gale, J., Halpenny, M., Hoffmann, G., Lau, K., Oakley, C., Palatucci, M., Pratt, V., Stang, P., Strohband, S., Dupont, C., Jendrossek, L.-E., Koelen, C., Markey, C., Rummel, C., van Niekerk, J., Jensen, E., Alessandrini, P., Bradski, G., Davies, B., Ettinger, S., Kaehler, A., Nefian, A., and Mahoney, P., 2006, "Stanley: The Robot That Won the DARPA Grand Challenge: Research Articles," J. Robot. Syst., 23(9), pp. 661–692.
3. Urmson, C., Anhalt, J., Bae, H., Bagnell, J. A. D., Baker, C. R., Bittner, R. E., Brown, T., Clark, M. N., Darms, M., Demitrish, D., Dolan, J. M., Duggins, D., Ferguson, D., Galatali, T., Geyer, C. M., Gittleman, M., Harbaugh, S., Hebert, M., Howard, T., Kolski, S., Likhachev, M., Litkouhi, B., Kelly, A., McNaughton, M., Miller, N., Nickolaou, J., Peterson, K., Pilnick, B., Rajkumar, R., Rybski, P., Sadekar, V., Salesky, B., Seo, Y.-W., Singh, S., Snider, J. M., Struble, J. C., Stentz, A. T., Taylor, M., Whittaker, W. R. L., Wolkowicki, Z., Zhang, W., and Ziglar, J., 2008, "Autonomous Driving in Urban Environments: Boss and the Urban Challenge," J. Field Rob., 25(8), pp. 425–466.
4. Berger, C., and Rumpe, B., 2014, "Autonomous Driving—5 Years After the Urban Challenge: The Anticipatory Vehicle as a Cyber-Physical System," ArXiv e-prints.
5. Geiger, A., Lenz, P., Stiller, C., and Urtasun, R., 2013, "Vision Meets Robotics: The KITTI Dataset," Int. J. Rob. Res., 32(11), pp. 1231–1237.
6. LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P., 1998, "Gradient-Based Learning Applied to Document Recognition," Proc. IEEE, 86(11), pp. 2278–2324.
7. Krizhevsky, A., Sutskever, I., and Hinton, G. E., 2012, "ImageNet Classification With Deep Convolutional Neural Networks," Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, eds., Harrahs and Harveys, Lake Tahoe, NV, pp. 1097–1105.
8. Simonyan, K., and Zisserman, A., 2014, "Very Deep Convolutional Networks for Large-Scale Image Recognition," arXiv:1409.1556 [cs.CV]. https://arxiv.org/abs/1409.1556
9. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A., 2014, "Going Deeper With Convolutions," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, June 7–12, pp. 1–9.
10. He, K., Zhang, X., Ren, S., and Sun, J., 2015, "Deep Residual Learning for Image Recognition," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, June 27–30, pp. 770–778.
11. Girshick, R. B., Donahue, J., Darrell, T., and Malik, J., 2013, "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, June 23–28, pp. 580–587.
12. Girshick, R., 2015, "Fast R-CNN," IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, Dec. 7–13, pp. 1440–1448.
13. Ren, S., He, K., Girshick, R., and Sun, J., 2015, "Faster R-CNN: Towards Real-Time Object Detection With Region Proposal Networks," IEEE Trans. Pattern Anal. Mach. Intell., 39(6), pp. 1137–1149.
14. Redmon, J., Divvala, S., Girshick, R., and Farhadi, A., 2015, "You Only Look Once: Unified, Real-Time Object Detection," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, June 27–30, pp. 779–788.
15. Redmon, J., and Farhadi, A., 2016, "YOLO9000: Better, Faster, Stronger," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, July 21–26, pp. 7263–7271.
16. Dai, J., Li, Y., He, K., and Sun, J., 2016, "R-FCN: Object Detection Via Region-Based Fully Convolutional Networks," arXiv:1605.06409 [cs.CV]. https://arxiv.org/abs/1605.06409
17. Cheang, G. H. L., 2010, "Approximation With Neural Networks Activated by Ramp Sigmoids," J. Approx. Theory, 162(8), pp. 1450–1465.
18. Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H., 2017, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications," arXiv:1704.04861 [cs.CV]. https://arxiv.org/abs/1704.04861
19. Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catanzaro, B., and Shelhamer, E., 2014, "cuDNN: Efficient Primitives for Deep Learning," arXiv:1410.0759 [cs.NE]. https://arxiv.org/abs/1410.0759
20. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S. E., Fu, C., and Berg, A. C., 2015, "SSD: Single Shot Multibox Detector," Computer Vision–ECCV 2016 (Lecture Notes in Computer Science, Vol. 9905), B. Leibe, J. Matas, N. Sebe, and M. Welling, eds., Springer, Cham, Switzerland.
21. Qi, C. R., Su, H., Mo, K., and Guibas, L. J., 2017, "PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 652–660.
22. Zhou, Y., and Tuzel, O., 2018, "VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, June 18–23, pp. 4490–4499.
23. Wang, D. Z., and Posner, I., 2015, "Voting for Voting in Online Point Cloud Object Detection," Robotics: Science and Systems, Rome, Italy, July 13–17, Paper No. 35. http://www.robots.ox.ac.uk/~mobile/Papers/2015RSS_wang.pdf
24. Engelcke, M., Rao, D., Zeng Wang, D., Hay Tong, C., and Posner, I., 2017, "Vote3Deep: Fast Object Detection in 3D Point Clouds Using Efficient Convolutional Neural Networks," IEEE International Conference on Robotics and Automation (ICRA), Singapore, May 29–June 3.
25. Song, S., and Xiao, J., 2014, Sliding Shapes for 3D Object Detection in Depth Images, Springer International Publishing, Cham, Switzerland, pp. 634–651.
26. Song, S., and Xiao, J., 2016, "Deep Sliding Shapes for Amodal 3D Object Detection in RGB-D Images," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, June 26–July 1, pp. 808–816.
27. Chen, X., Ma, H., Wan, J., Li, B., and Xia, T., 2017, "Multi-View 3D Object Detection Network for Autonomous Driving," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, July 21–26, pp. 1907–1915.
28. Xu, D., Anguelov, D., and Jain, A., 2017, "PointFusion: Deep Sensor Fusion for 3D Bounding Box Estimation," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, June 18–23, pp. 244–253.
29. Hegde, V., and Zadeh, R., 2016, "FusionNet: 3D Object Classification Using Multiple Data Representations," arXiv:1607.05695 [cs.CV]. https://arxiv.org/abs/1607.05695
30. Gao, H., Cheng, B., Wang, J., Li, K., Zhao, J., and Li, D., 2018, "Object Classification Using CNN-Based Fusion of Vision and LIDAR in Autonomous Vehicle Environment," IEEE Trans. Ind. Inf., 14(9), pp. 4224–4231.
31. Wang, Z., Zhan, W., and Tomizuka, M., 2017, "Fusing Bird Eye View LIDAR Point Cloud and Front View Camera Image for Deep Object Detection," arXiv:1711.06703 [cs.CV]. https://arxiv.org/abs/1711.06703
32. Oh, S.-I., and Kang, H.-B., 2017, "Object Detection and Classification by Decision-Level Fusion for Intelligent Vehicle Systems," Sensors, 17(1), p. 207.
33. Wu, T., Tsai, C., and Guo, J., 2017, "LiDAR/Camera Sensor Fusion Technology for Pedestrian Detection," Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Kuala Lumpur, Malaysia, Dec. 12–15, pp. 1675–1678.
34. Zhang, F., Clarke, D., and Knoll, A., 2014, "Vehicle Detection Based on LIDAR and Camera Fusion," 17th International IEEE Conference on Intelligent Transportation Systems (ITSC), Qingdao, China, Oct. 8–11, pp. 1620–1625.
35. Yohannan, B., and Chandy, D. A., 2017, "A Novel Approach for Fusing LIDAR and Visual Camera Images in Unstructured Environment," Fourth International Conference on Advanced Computing and Communication Systems (ICACCS), Coimbatore, India, Jan. 6–7, pp. 1–5.
36. Qi, C. R., Liu, W., Wu, C., Su, H., and Guibas, L. J., 2018, "Frustum PointNets for 3D Object Detection From RGB-D Data," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, June 18–23, pp. 918–927.
37. Lin, T., Maire, M., Belongie, S. J., Bourdev, L. D., Girshick, R. B., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L., 2014, "Microsoft COCO: Common Objects in Context," Computer Vision–ECCV 2014 (Lecture Notes in Computer Science, Vol. 8693), D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, eds., Springer, Cham, Switzerland.
38. Huber, P. J., 1964, "Robust Estimation of a Location Parameter," Ann. Math. Stat., 35(1), pp. 73–101.
39. Sutskever, I., Martens, J., Dahl, G., and Hinton, G., 2013, "On the Importance of Initialization and Momentum in Deep Learning," 30th International Conference on Machine Learning (ICML'13), Atlanta, GA. https://www.cs.toronto.edu/~fritz/absps/momentum.pdf
40. Kingma, D. P., and Ba, J., 2014, "Adam: A Method for Stochastic Optimization," arXiv:1412.6980 [cs.LG]. https://arxiv.org/abs/1412.6980
41. Quigley, M., Conley, K., Gerkey, B. P., Faust, J., Foote, T., Leibs, J., Wheeler, R., and Ng, A. Y., 2009, "ROS: An Open-Source Robot Operating System," ICRA Workshop on Open Source Software.
42. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X., 2015, "TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems," Accessed: Apr. 11, 2019, http://tensorflow.org/
43. Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., Fischer, I., Wojna, Z., Song, Y., Guadarrama, S., and Murphy, K., 2017, "Speed/Accuracy Trade-Offs for Modern Convolutional Object Detectors," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, July 21–26, pp. 7310–7311.
44. Bodla, N., Singh, B., Chellappa, R., and Davis, L. S., 2017, "Improving Object Detection With One Line of Code," IEEE International Conference on Computer Vision (ICCV), Venice, Italy, Oct. 22–29, pp. 5561–5569.
45. Goodfellow, I., Bengio, Y., and Courville, A., 2016, Deep Learning, MIT Press, Cambridge, MA.
46. Maddern, W., Pascoe, G., Linegar, C., and Newman, P., 2017, "1 Year, 1000 km: The Oxford RobotCar Dataset," Int. J. Rob. Res., 36(1), pp. 3–15.