NN-512, ONNX Runtime, TensorFlow, DeepSparse inference speed compared

NN-512 appeared on HN in late 2020.

No benchmarks were provided, which may be a reason why it didn't get much attention.

I decided to try NN-512 with ResNet50. It comes with this network graph as an example, and the generated ResNet50.h file contains code snippets in its comments showing how to use it.

NN-512 doesn't come with any weights / params / floats, or any examples of how to generate them.

My first attempt at saving weights was with PyTorch, but I eventually found that torchvision uses a modified ResNet:

# This variant is also known as ResNet V1.5 and improves accuracy according to
# https://ngc.nvidia.com/catalog/model-scripts/nvidia:resnet_50_v1_5_for_pytorch.
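For reference, a minimal sketch of that first attempt might look like the following (this is not the exact script; the output filename and the flat float32 dump format are assumptions, not anything NN-512 prescribes). The catch is that these are V1.5 weights, so they don't line up with the original ResNet50 graph NN-512 ships as an example.

# Sketch: dump torchvision's ResNet50 weights as flat float32 arrays.
# Caveat: torchvision's model is ResNet V1.5 (stride moved to the 3x3 conv),
# so these weights do not match the original ResNet50 graph used by NN-512.
import numpy as np
import torch
import torchvision

model = torchvision.models.resnet50(pretrained=True)
model.eval()

with open("resnet50_v15_params.bin", "wb") as f:  # hypothetical filename/format
    for name, tensor in model.state_dict().items():
        if tensor.dtype.is_floating_point:
            f.write(tensor.detach().numpy().astype(np.float32).tobytes())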

I asked the NN-512 author, 37ef, what I was doing wrong, and got some useful information.

Once I had saved the Caffe weights and checked that inference worked, I moved on to generating a graph from TensorFlow / Keras and saving the weights at the same time.
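As a rough illustration of the Keras side, assuming the Keras applications ResNet50 and a simple flat float32 dump (the real NN-512 graph description and parameter format are not shown here), it might look something like this:

# Sketch: load Keras's pretrained ResNet50 and dump its weights as flat
# float32 arrays, layer by layer, in model order. The output format here is
# an assumption; NN-512 defines its own graph and parameter conventions.
import numpy as np
from tensorflow.keras.applications import ResNet50

model = ResNet50(weights="imagenet")

with open("resnet50_keras_params.bin", "wb") as f:  # hypothetical filename
    for layer in model.layers:
        for w in layer.get_weights():  # list of numpy arrays per layer
            f.write(np.asarray(w, dtype=np.float32).tobytes())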

I compared the speed of NN-512 with TensorFlow, ONNX Runtime, and Neural Magic DeepSparse on AWS c5.large and c5.xlarge instances running Ubuntu Server 20.04 LTS.
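The benchmarks were simple timing loops. As one example of the general shape (the exact scripts differed per framework, the model path and iteration count are placeholders, and the input is just a random tensor), an ONNX Runtime measurement might look like:

# Sketch: time ResNet50 inference with ONNX Runtime and report time per image.
# The other frameworks (TF/Keras, DeepSparse, NN-512) were timed with
# equivalent loops.
import time
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("resnet50.onnx")  # hypothetical model path
input_name = session.get_inputs()[0].name

batch_size = 4
x = np.random.rand(batch_size, 3, 224, 224).astype(np.float32)

iterations = 50
session.run(None, {input_name: x})  # warm-up run

start = time.time()
for _ in range(iterations):
    session.run(None, {input_name: x})
elapsed = time.time() - start

print("time per inference: %.3f s" % (elapsed / (iterations * batch_size)))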

Results

See the HTML for the full results; I picked a rounded, average-looking value from each run. Not scientific, but quick.

Machine     Runtime      Batch size   Time per inference (s)
c5.large    TF/Keras     1            0.13
c5.large    TF/Keras     2            0.105
c5.large    TF/Keras     4            0.09
c5.large    TF/Keras     64           0.10
c5.large    DeepSparse   1            0.070
c5.large    DeepSparse   2            0.075
c5.large    DeepSparse   4            0.068
c5.large    DeepSparse   64           0.068
c5.large    NN-512       1            0.069
c5.large    ONNX         1            0.058
c5.large    ONNX         2            0.058
c5.large    ONNX         4            0.058
c5.large    ONNX         64           0.058
c5.xlarge   TF/Keras     1            0.088
c5.xlarge   TF/Keras     2            0.065
c5.xlarge   TF/Keras     4            0.05
c5.xlarge   TF/Keras     64           0.049
c5.xlarge   DeepSparse   1            0.033
c5.xlarge   DeepSparse   2            0.035
c5.xlarge   DeepSparse   4            0.032
c5.xlarge   DeepSparse   64           0.031
c5.xlarge   NN-512       1            0.035
c5.xlarge   ONNX         1            0.035
c5.xlarge   ONNX         2            0.031
c5.xlarge   ONNX         4            0.03
c5.xlarge   ONNX         64           0.03

My interpretation of the results is that NN-512 is significantly faster than TensorFlow (without looking at optimisation) and very similar in speed to DeepSparse. ONNX Runtime appears to be the fastest on c5.large, but similar to DeepSparse and NN-512 on c5.xlarge.

DeepSparse is closed source, but apparently free to use. It was also designed to be used with pruning and quantisation, which NN-512 has nothing to do with.

In short, if you want to run ConvNet inference on a CPU using open-source code, NN-512 looks fast.

Future work I'd like to see: