Hi, I’ve been trying to use TVM and BYOC to deploy QNN models on an NPU that supports a full-integer QNN flow. However, when I import a pre-quantized model produced by PyTorch, all `qint8` weights are converted into `fp32` params tensors, and additional `qnn.quantize` ops are inserted before `qnn.conv2d` to convert the weights back into `int8`.

Here are the first few layers of the converted Relay for the quantized ResNet-18 model from `torchvision.models.quantization.resnet`:

```
def @main(%input: Tensor[(1, 3, 224, 224), float32], %conv1_weight: Tensor[(64, 3, 7, 7), float32], %conv1_bias: Tensor[(64), float32], ...) {
%0 = qnn.quantize(%input, 0.018622f, 114, out_dtype="uint8", axis=1);
%1 = nn.pad(%0, 114f, pad_width=[[0, 0], [0, 0], [3, 3], [3, 3]]);
%2 = qnn.quantize(%conv1_weight, 0.00308922f, 0, out_dtype="int8", axis=0);
%3 = qnn.conv2d(%1, %2, 114, 0, 0.018622f, 0.00308922f, strides=[2, 2], padding=[0, 0, 0, 0], channels=64, kernel_size=[7, 7], out_dtype="int32");
%4 = qnn.quantize(%conv1_bias, 5.75275e-05f, 0, out_dtype="int32", axis=0);
%5 = nn.bias_add(%3, %4);
%6 = qnn.requantize(%5, 5.75275e-05f, 0, 0.0146432f, 0, axis=1, out_dtype="int32");
...
```

I don’t understand why the PyTorch frontend maps QNN ops this way. If I export a compiled library, all the weights are stored in `fp32`, and every extra `qnn.quantize` op carries a runtime performance cost.
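
For comparison, one possible compile-time workaround is to fold the inserted `qnn.quantize` over the constant weights: bind the params as constants and run `FoldConstant`. This is a minimal, untested sketch; it assumes a recent TVM where `FoldConstant` accepts a `fold_qnn` flag, and the input name `input` is an assumption for illustration:

```
import torch
import torchvision
import tvm
from tvm import relay

# Trace a pre-quantized torchvision model.
model = torchvision.models.quantization.resnet18(pretrained=True, quantize=True).eval()
inp = torch.rand(1, 3, 224, 224)
script = torch.jit.trace(model, inp).eval()

# "input" is an assumed input name for this sketch.
mod, params = relay.frontend.from_pytorch(script, [("input", (1, 3, 224, 224))])

# Bind the fp32 weight params as constants so the pass can see them, then
# fold qnn.quantize(constant_weight) into int8 constants at compile time.
# fold_qnn=True is assumed to be available (recent TVM only).
main = relay.build_module.bind_params_by_name(mod["main"], params)
mod = tvm.IRModule.from_expr(main)
mod = relay.transform.FoldConstant(fold_qnn=True)(mod)
```

This would remove the runtime `qnn.quantize` on the weights, but it does not change how the frontend itself maps the ops.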

My preferred way to map PyTorch QNN ops would be to copy the original `qint8` weights directly into `int8` params tensors. And since PyTorch does not quantize the bias, the frontend could quantize the `fp32` bias into `int32` params tensors at import time, using the scales of the current conv layer's weights and input (`scale_bias = scale_weights * scale_input`); a small sketch of this bias step follows the Relay below. With this scheme, the mapped Relay would look like:

```
def @main(%input: Tensor[(1, 3, 224, 224), float32], %conv1_weight: Tensor[(64, 3, 7, 7), int8], %conv1_bias: Tensor[(64), int32], ...) {
%0 = qnn.quantize(%input, 0.018622f, 114, out_dtype="uint8", axis=1);
%1 = nn.pad(%0, 114f, pad_width=[[0, 0], [0, 0], [3, 3], [3, 3]]);
%2 = qnn.conv2d(%1, %conv1_weight, 114, 0, 0.018622f, 0.00308922f, strides=[2, 2], padding=[0, 0, 0, 0], channels=64, kernel_size=[7, 7], out_dtype="int32");
%3 = nn.bias_add(%2, %conv1_bias);
%4 = qnn.requantize(%3, 5.75275e-05f, 0, 0.0146432f, 0, axis=1, out_dtype="int32");
...
```
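
For concreteness, the bias quantization the frontend would perform at import time could look like the following sketch (plain NumPy; `quantize_bias` is a hypothetical helper, not an existing TVM function):

```
import numpy as np

def quantize_bias(bias_fp32, scale_input, scale_weights):
    # The bias scale is the product of the input scale and the
    # (possibly per-channel) weight scales, so no separate bias
    # scale needs to be stored.
    scale_bias = scale_input * scale_weights
    return np.round(bias_fp32 / scale_bias).astype(np.int32)

# With the conv1 values from the Relay above:
# 0.018622 * 0.00308922 ~= 5.75275e-05, which matches the input
# scale of the qnn.requantize that follows the bias_add.
```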

Are there plans to update the PyTorch frontend to support this scheme?