
tensor4

tensor4 - a PyTorch-to-C++ converter built on a lightweight templated tensor library

This project was born as a fun experiment, but it can be useful because it is extremely lightweight.

Idea:

  • Uses PyTorch tracing to generate C++ code that defines the network.
  • Can be compiled to WebAssembly.
  • Single-header library.
  • No dependencies.
  • Inference only, no gradients.
  • Easy to use, simple to embed.
  • CPU only.

What can it do?

  • Converts most PyTorch graphs to C++ code.
  • Can run DenseNet, ResNet, AlexNet, and VGG16.
  • Adds a very small footprint to the executable: an executable that can run DenseNet is about 100 KB.

WebAssembly demo: link. The demo runs a DCGAN to generate MNIST digits.

Example:

import torchvision
import tensor4

alexnet = torchvision.models.alexnet(pretrained=True)
alexnet.eval()

...

out = tensor4.generate(alexnet, args=(im,))  # im is a test tensor with the same type/size as the expected input

This will produce:

Header file with the model structure and function declarations:

#include "tensor4.h"

struct AlexNet
{
 t4::tensor4f features_0_weight;
 t4::tensor1f features_0_bias;
 t4::tensor4f features_3_weight;
 t4::tensor1f features_3_bias;
 t4::tensor4f features_6_weight;
 t4::tensor1f features_6_bias;
 t4::tensor4f features_8_weight;
 t4::tensor1f features_8_bias;
 t4::tensor4f features_10_weight;
 t4::tensor1f features_10_bias;
 t4::tensor2f classifier_1_weight;
 t4::tensor1f classifier_1_bias;
 t4::tensor2f classifier_4_weight;
 t4::tensor1f classifier_4_bias;
 t4::tensor2f classifier_6_weight;
 t4::tensor1f classifier_6_bias;
};


AlexNet AlexNetLoad(const char* filename);

t4::tensor2f AlexNetForward(const AlexNet& ctx, t4::tensor4f x0);

C++ file with definitions of the forward pass function and the weight loading function:

#include "AlexNet.h"

AlexNet AlexNetLoad(const char* filename)
{
 AlexNet ctx;
 t4::model_dict dict = t4::load(filename);
 dict.load(ctx.features_0_weight, "features.0.weight", 64, 3, 11, 11);
 dict.load(ctx.features_0_bias, "features.0.bias", 64);
 dict.load(ctx.features_3_weight, "features.3.weight", 192, 64, 5, 5);
 dict.load(ctx.features_3_bias, "features.3.bias", 192);
 dict.load(ctx.features_6_weight, "features.6.weight", 384, 192, 3, 3);
 dict.load(ctx.features_6_bias, "features.6.bias", 384);
 dict.load(ctx.features_8_weight, "features.8.weight", 256, 384, 3, 3);
 dict.load(ctx.features_8_bias, "features.8.bias", 256);
 dict.load(ctx.features_10_weight, "features.10.weight", 256, 256, 3, 3);
 dict.load(ctx.features_10_bias, "features.10.bias", 256);
 dict.load(ctx.classifier_1_weight, "classifier.1.weight", 4096, 9216);
 dict.load(ctx.classifier_1_bias, "classifier.1.bias", 4096);
 dict.load(ctx.classifier_4_weight, "classifier.4.weight", 4096, 4096);
 dict.load(ctx.classifier_4_bias, "classifier.4.bias", 4096);
 dict.load(ctx.classifier_6_weight, "classifier.6.weight", 1000, 4096);
 dict.load(ctx.classifier_6_bias, "classifier.6.bias", 1000);
 return ctx;
}


t4::tensor2f AlexNetForward(const AlexNet& ctx, t4::tensor4f x0)
{
 t4::tensor4f x17 = t4::Conv2d<11, 11, 4, 4, 2, 2, 1, 1>(x0, ctx.features_0_weight, ctx.features_0_bias); //features.0
 t4::release(x0);
 t4::tensor4f x18 = t4::ReluInplace(x17); //features.1
 t4::release(x17);
 t4::tensor4f x19 = t4::MaxPool2d<3, 3, 2, 2, 0, 0>(x18); //features.2
 t4::release(x18);
 t4::tensor4f x20 = t4::Conv2d<5, 5, 1, 1, 2, 2, 1, 1>(x19, ctx.features_3_weight, ctx.features_3_bias); //features.3
 t4::release(x19);
 t4::tensor4f x21 = t4::ReluInplace(x20); //features.4
 t4::release(x20);
 t4::tensor4f x22 = t4::MaxPool2d<3, 3, 2, 2, 0, 0>(x21); //features.5
 t4::release(x21);
 t4::tensor4f x23 = t4::Conv2d<3, 3, 1, 1, 1, 1, 1, 1>(x22, ctx.features_6_weight, ctx.features_6_bias); //features.6
 t4::release(x22);
 t4::tensor4f x24 = t4::ReluInplace(x23); //features.7
 t4::release(x23);
 t4::tensor4f x25 = t4::Conv2d<3, 3, 1, 1, 1, 1, 1, 1>(x24, ctx.features_8_weight, ctx.features_8_bias); //features.8
 t4::release(x24);
 t4::tensor4f x26 = t4::ReluInplace(x25); //features.9
 t4::release(x25);
 t4::tensor4f x27 = t4::Conv2d<3, 3, 1, 1, 1, 1, 1, 1>(x26, ctx.features_10_weight, ctx.features_10_bias); //features.10
 t4::release(x26);
 t4::tensor4f x28 = t4::ReluInplace(x27); //features.11
 t4::release(x27);
 t4::tensor4f x29 = t4::MaxPool2d<3, 3, 2, 2, 0, 0>(x28); //features.12
 t4::release(x28);
 t4::tensor2f x30 = t4::Flatten<1>(x29);
 t4::release(x29);
 t4::tensor2f x31 = t4::Dropout(x30, 0.5f); //classifier.0
 t4::release(x30);
 t4::tensor2f x33 = t4::Linear(x31, ctx.classifier_1_weight, ctx.classifier_1_bias); //classifier.1
 t4::release(x31);
 t4::tensor2f x34 = t4::ReluInplace(x33); //classifier.2
 t4::release(x33);
 t4::tensor2f x35 = t4::Dropout(x34, 0.5f); //classifier.3
 t4::release(x34);
 t4::tensor2f x37 = t4::Linear(x35, ctx.classifier_4_weight, ctx.classifier_4_bias); //classifier.4
 t4::release(x35);
 t4::tensor2f x38 = t4::ReluInplace(x37); //classifier.5
 t4::release(x37);
 t4::tensor2f x39 = t4::Linear(x38, ctx.classifier_6_weight, ctx.classifier_6_bias); //classifier.6
 t4::release(x38);
 return x39;
}

It also produces a binary file with the weights.
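
For completeness, here is a minimal sketch of how the generated interface might be called from application code. ClassifyImage is a hypothetical helper, the weight file path is chosen by the caller, and preparing the input tensor (which depends on the t4 tensor API) is left out:

#include "AlexNet.h"

// Hypothetical helper: load the exported weights and run one forward pass.
// The input is expected to be a 1x3x224x224 float tensor holding the
// preprocessed image, prepared elsewhere via the t4 tensor API.
t4::tensor2f ClassifyImage(const char* weights_path, t4::tensor4f input)
{
    AlexNet net = AlexNetLoad(weights_path); // reads the weights binary
    return AlexNetForward(net, input);       // returns a 1x1000 tensor of class scores
}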

How does it compare to PyTorch?

For AlexNet and the test example:

Predictions made by tensor4:

  68.935448%: speedboat
  23.621313%: amphibian, amphibious vehicle
  2.844828%: container ship, containership, container vessel
  0.931512%: fireboat
  0.624658%: lifeboat
  0.594834%: sandbar, sand bar
  0.526897%: submarine, pigboat, sub, U-boat
  0.292151%: canoe
  0.263978%: paddle, boat paddle
  0.263804%: trimaran

PyTorch output:

  68.935245% speedboat
  23.621449% amphibian, amphibious vehicle
  2.844823% container ship, containership, container vessel
  0.931520% fireboat
  0.624658% lifeboat
  0.594838% sandbar, sand bar
  0.526899% submarine, pigboat, sub, U-boat
  0.292150% canoe
  0.263979% paddle, boat paddle
  0.263808% trimaran

The difference is due to floating-point rounding.

Inference time:

  • PyTorch CPU: 41.5 ms
  • tensor4: 82.0 ms
  • tensor4 + MKL: 32.4 ms

tensor4 has a naive GEMM implementation, but you can enable MKL's cblas_sgemm instead.

The tensor4 + MKL entry above corresponds to that case: the naive GEMM is replaced by MKL's implementation.
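
To illustrate the difference, below is a small sketch (not tensor4's actual internals) of a naive row-major single-precision GEMM next to the equivalent cblas_sgemm call; both compute C = A * B for an MxK matrix A and a KxN matrix B:

#include <mkl_cblas.h>

// Naive GEMM: C (MxN) = A (MxK) * B (KxN), all row-major.
void gemm_naive(int M, int N, int K, const float* A, const float* B, float* C)
{
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
        {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
}

// The same computation through MKL's cblas_sgemm.
void gemm_mkl(int M, int N, int K, const float* A, const float* B, float* C)
{
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K,
                1.0f, A, K,  // alpha, A, leading dimension of A
                B, N,        // B, leading dimension of B
                0.0f, C, N); // beta, C, leading dimension of C
}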