NN-512, ONNX Runtime, TensorFlow, DeepSparse inference speed compared

NN-512 appeared on HN in late 2020.

No benchmarks were provided, which may be a reason why it didn't get much attention.

I decided to try NN-512 with ResNet50. It comes with this network graph as an example, and the generated ResNet50.h file contains code snippets in its comments showing how to use it.

NN-512 doesn't come with any weights / params / floats, or any examples of how to generate them.

My first attempt at saving weights was with PyTorch, but I eventually found that torchvision uses a modified ResNet:

# This variant is also known as ResNet V1.5 and improves accuracy according to
# https://ngc.nvidia.com/catalog/model-scripts/nvidia:resnet_50_v1_5_for_pytorch.
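For reference, a minimal sketch of that first attempt might look like the following (this is not the exact script; the output filename and the flat float32 dump format are assumptions, not anything NN-512 prescribes). The catch is that these are V1.5 weights, so they don't line up with the original ResNet50 graph NN-512 ships as an example.

# Sketch: dump torchvision's ResNet50 weights as flat float32 arrays.
# Caveat: torchvision's model is ResNet V1.5 (stride moved to the 3x3 conv),
# so these weights do not match the original ResNet50 graph used by NN-512.
import numpy as np
import torch
import torchvision

model = torchvision.models.resnet50(pretrained=True)
model.eval()

with open("resnet50_v15_params.bin", "wb") as f:  # hypothetical filename/format
    for name, tensor in model.state_dict().items():
        if tensor.dtype.is_floating_point:
            f.write(tensor.detach().numpy().astype(np.float32).tobytes())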

I asked the NN-512 author, 37ef, what I was doing wrong, and got some useful information.

Once I had saved the Caffe weights and checked that inference worked, I moved on to generating a graph from TensorFlow / Keras and saving the weights at the same time.
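As a rough illustration of the Keras side, assuming the Keras applications ResNet50 and a simple flat float32 dump (the real NN-512 graph description and parameter format are not shown here), it might look something like this:

# Sketch: load Keras's pretrained ResNet50 and dump its weights as flat
# float32 arrays, layer by layer, in model order. The output format here is
# an assumption; NN-512 defines its own graph and parameter conventions.
import numpy as np
from tensorflow.keras.applications import ResNet50

model = ResNet50(weights="imagenet")

with open("resnet50_keras_params.bin", "wb") as f:  # hypothetical filename
    for layer in model.layers:
        for w in layer.get_weights():  # list of numpy arrays per layer
            f.write(np.asarray(w, dtype=np.float32).tobytes())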

I compared the speed of NN-512 with TensorFlow, ONNX Runtime, and Neural Magic DeepSparse on AWS c5.large and c5.xlarge instances running Ubuntu Server 20.04 LTS.
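The benchmarks were simple timing loops. As one example of the general shape (the exact scripts differed per framework, the model path and iteration count are placeholders, and the input is just a random tensor), an ONNX Runtime measurement might look like:

# Sketch: time ResNet50 inference with ONNX Runtime and report time per image.
# The other frameworks (TF/Keras, DeepSparse, NN-512) were timed with
# equivalent loops.
import time
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("resnet50.onnx")  # hypothetical model path
input_name = session.get_inputs()[0].name

batch_size = 4
x = np.random.rand(batch_size, 3, 224, 224).astype(np.float32)

iterations = 50
session.run(None, {input_name: x})  # warm-up run

start = time.time()
for _ in range(iterations):
    session.run(None, {input_name: x})
elapsed = time.time() - start

print("time per inference: %.3f s" % (elapsed / (iterations * batch_size)))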

Results

See the HTML for the full results; I picked a rounded, average-looking value from each run. Not scientific, but quick.

Machine     Runtime      Batch size   Time per inference (s)
c5.large    TF/Keras     1            0.13
c5.large    TF/Keras     2            0.105
c5.large    TF/Keras     4            0.09
c5.large    TF/Keras     64           0.10
c5.large    DeepSparse   1            0.070
c5.large    DeepSparse   2            0.075
c5.large    DeepSparse   4            0.068
c5.large    DeepSparse   64           0.068
c5.large    NN-512       1            0.069
c5.large    ONNX         1            0.058
c5.large    ONNX         2            0.058
c5.large    ONNX         4            0.058
c5.large    ONNX         64           0.058
c5.xlarge   TF/Keras     1            0.088
c5.xlarge   TF/Keras     2            0.065
c5.xlarge   TF/Keras     4            0.05
c5.xlarge   TF/Keras     64           0.049
c5.xlarge   DeepSparse   1            0.033
c5.xlarge   DeepSparse   2            0.035
c5.xlarge   DeepSparse   4            0.032
c5.xlarge   DeepSparse   64           0.031
c5.xlarge   NN-512       1            0.035
c5.xlarge   ONNX         1            0.035
c5.xlarge   ONNX         2            0.031
c5.xlarge   ONNX         4            0.03
c5.xlarge   ONNX         64           0.03

My interpretation of the results is that NN-512 is significantly faster than TensorFlow (without looking at optimisation) and very similar in speed to DeepSparse. ONNX Runtime appears to be the fastest on c5.large, but similar to DeepSparse and NN-512 on c5.xlarge.

DeepSparse is closed source, but apparently free to use. It was also designed to be used with pruning and quantisation, which NN-512 has nothing to do with.

In short, if you want to run ConvNet inference on a CPU using open-source code, NN-512 looks fast.

Future work I'd like to see: