Model Input Name: unique_ids_raw_output___9:0, Shape: [0]
Model Input Name: segment_ids:0, Shape: [0, 256]
Model Input Name: input_mask:0, Shape: [0, 256]
Model Input Name: input_ids:0, Shape: [0, 256]
Starting model execution...

Inputs Details:
Input Name: input_ids:0
Shape: (1, 256)
Data (first 10 values): [ 101 2054 2003 1996 3007 1997 2605 1029  102 1996]...
--------------------------------------------------
Input Name: segment_ids:0
Shape: (1, 256)
Data (first 10 values): [0 0 0 0 0 0 0 0 0 1]...
--------------------------------------------------
Input Name: input_mask:0
Shape: (1, 256)
Data (first 10 values): [1 1 1 1 1 1 1 1 1 1]...
--------------------------------------------------
Input Name: unique_ids_raw_output___9:0
Shape: (1,)
Data (first 10 values): [0]...
--------------------------------------------------
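
The feeds above can be reconstructed by hand. The first token IDs (101 = [CLS], 102 = [SEP] in the bert-base-uncased vocabulary) correspond to a SQuAD-style question, padded to the model's fixed sequence length of 256. A minimal numpy sketch, assuming those token IDs and showing only the question portion of the sequence (the real feed also fills in the context passage, which is why `segment_ids` turns to 1 after the first [SEP]):

```python
import numpy as np

# Hypothetical WordPiece IDs for "[CLS] what is the capital of france ? [SEP]"
question_ids = [101, 2054, 2003, 1996, 3007, 1997, 2605, 1029, 102]

max_seq_len = 256
input_ids = np.zeros((1, max_seq_len), dtype=np.int64)
input_mask = np.zeros((1, max_seq_len), dtype=np.int64)
segment_ids = np.zeros((1, max_seq_len), dtype=np.int64)

input_ids[0, :len(question_ids)] = question_ids
input_mask[0, :len(question_ids)] = 1   # 1 = real token, 0 = padding
# Context tokens after the first [SEP] would get segment_ids = 1.

feeds = {
    "input_ids:0": input_ids,
    "input_mask:0": input_mask,
    "segment_ids:0": segment_ids,
    "unique_ids_raw_output___9:0": np.array([0], dtype=np.int64),
}
```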
Node: unique_ids_graph_outputs_Identity__10, Execution Time: 0.000511 seconds

Node: bert/encoder/Shape, Execution Time: 0.000030 seconds

Node: bert/encoder/Shape__12, Execution Time: 0.000038 seconds

Node: bert/encoder/strided_slice, Execution Time: 0.000173 seconds

Node: bert/encoder/strided_slice__16, Execution Time: 0.000029 seconds

Node: bert/encoder/strided_slice__17, Execution Time: 0.000020 seconds

Node: bert/encoder/ones/packed_Unsqueeze__18, Execution Time: 0.000035 seconds

Node: bert/encoder/ones/packed_Concat__21, Execution Time: 0.004840 seconds

Node: bert/encoder/ones__22, Execution Time: 0.000027 seconds

Node: bert/encoder/ones, Execution Time: 0.000075 seconds

Node: bert/encoder/Reshape, Execution Time: 0.000039 seconds

Node: bert/encoder/Cast, Execution Time: 0.000020 seconds

Node: bert/encoder/mul, Execution Time: 0.007645 seconds

Node: bert/encoder/layer_9/attention/self/ExpandDims, Execution Time: 0.000020 seconds

Node: bert/encoder/layer_9/attention/self/sub, Execution Time: 0.006671 seconds

Node: bert/encoder/layer_9/attention/self/mul_1, Execution Time: 0.000213 seconds
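
The ExpandDims / sub / mul_1 trio above is the standard additive attention mask found in exported BERT graphs: compute (1 - mask) and scale it by -10000 so that padded positions receive a large negative bias before softmax. A sketch of that computation on a toy mask:

```python
import numpy as np

input_mask = np.array([[1, 1, 1, 1, 0, 0]], dtype=np.float32)  # toy length-6 mask

# ExpandDims -> (batch, 1, seq); sub -> 1 - mask; mul_1 -> scale by -10000
mask = input_mask[:, np.newaxis, :]
additive_mask = (1.0 - mask) * -10000.0
# Real tokens keep bias 0; padded positions get -10000, so their
# softmax probability is driven to ~0 in every attention layer.
```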

Node: bert/embeddings/Reshape_2, Execution Time: 0.000020 seconds

Node: bert/embeddings/Reshape, Execution Time: 0.000005 seconds

Node: bert/embeddings/GatherV2, Execution Time: 0.000162 seconds

Node: bert/embeddings/Reshape_1, Execution Time: 0.000020 seconds

Node: bert/embeddings/one_hot, Execution Time: 0.000219 seconds

Input size: (None, 256, 2, 768)
No Add node related to MatMul output: bert/embeddings/MatMul. Executing regular MatMul.
MatMul Node: bert/embeddings/MatMul, Execution Time: 0.027465 seconds
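
The one_hot node followed by this MatMul (input size with an inner dimension of 2) is how the exporter expresses the token-type embedding lookup: one-hot encode the segment IDs over the 2 segment types, then multiply by the (2, 768) embedding table. The matmul is mathematically identical to a plain gather, as this sketch with an assumed random table shows:

```python
import numpy as np

rng = np.random.default_rng(0)
token_type_table = rng.standard_normal((2, 768)).astype(np.float32)  # assumed (2, 768) table
segment_ids = np.array([0, 0, 0, 1, 1], dtype=np.int64)              # toy sequence

one_hot = np.eye(2, dtype=np.float32)[segment_ids]   # (seq, 2)
embeddings = one_hot @ token_type_table              # (seq, 768)

# Same result as a direct row gather from the table:
assert np.allclose(embeddings, token_type_table[segment_ids])
```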

Node: bert/embeddings/Reshape_3, Execution Time: 0.000025 seconds

Add Node: bert/embeddings/add, Execution Time: 0.000611 seconds

Add Node: bert/embeddings/add_1, Execution Time: 0.000467 seconds

Node: bert/embeddings/LayerNorm/moments/mean, Execution Time: 0.005089 seconds

Node: bert/embeddings/LayerNorm/moments/SquaredDifference, Execution Time: 0.000502 seconds

Node: bert/embeddings/LayerNorm/moments/SquaredDifference__72, Execution Time: 0.000517 seconds

Node: bert/embeddings/LayerNorm/moments/variance, Execution Time: 0.000074 seconds

Add Node: bert/embeddings/LayerNorm/batchnorm/add, Execution Time: 0.000064 seconds

Node: bert/embeddings/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.010280 seconds

Node: bert/embeddings/LayerNorm/batchnorm/Rsqrt__74, Execution Time: 0.005450 seconds

Node: bert/embeddings/LayerNorm/batchnorm/mul, Execution Time: 0.000053 seconds

Node: bert/embeddings/LayerNorm/batchnorm/mul_2, Execution Time: 0.000053 seconds

Node: bert/embeddings/LayerNorm/batchnorm/sub, Execution Time: 0.000069 seconds

Node: bert/embeddings/LayerNorm/batchnorm/mul_1, Execution Time: 0.000455 seconds

Add Node: bert/embeddings/LayerNorm/batchnorm/add_1, Execution Time: 0.000453 seconds
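
The eleven LayerNorm nodes above (moments/mean through batchnorm/add_1) are LayerNorm decomposed into batch-norm primitives: mean, squared difference, variance, add-epsilon, rsqrt, then the scale/shift arithmetic. A sketch that mirrors that node sequence, assuming the usual epsilon of 1e-12:

```python
import numpy as np

def layernorm_from_primitives(x, gamma, beta, eps=1e-12):
    # moments/mean, moments/SquaredDifference, moments/variance
    mean = x.mean(axis=-1, keepdims=True)
    variance = np.mean((x - mean) ** 2, axis=-1, keepdims=True)
    # batchnorm/add (+eps), batchnorm/Rsqrt
    rstd = 1.0 / np.sqrt(variance + eps)
    # batchnorm/mul (gamma*rstd), mul_1 (x*scale), mul_2 (mean*scale),
    # sub (beta - mean*scale), add_1 (sum of the two branches)
    scale = gamma * rstd
    return x * scale + (beta - mean * scale)

x = np.random.default_rng(1).standard_normal((2, 4))
out = layernorm_from_primitives(x, np.ones(4), np.zeros(4))
```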

Node: bert/encoder/Reshape_1, Execution Time: 0.000024 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_0/attention/self/value/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_0/attention/self/value/MatMul, Execution Time: 0.001809 seconds

Skipping already processed Node: bert/encoder/layer_0/attention/self/value/BiasAdd
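
The "Fusing MatMul with Add" / "Skipping already processed Node" pair above suggests a graph walk that, on reaching a MatMul, checks for a downstream Add holding the bias, executes both in one step, and marks the Add so it is skipped later. A toy sketch of that pattern; the dict-based graph and node names are illustrative, not the runtime's actual data structures:

```python
import numpy as np

# Toy two-node graph: a MatMul whose only consumer is a bias Add.
graph = {
    "matmul_out": {"op": "MatMul", "inputs": ["x", "w"]},
    "bias_out":   {"op": "Add",    "inputs": ["matmul_out", "b"]},
}

def run(graph, tensors):
    processed = set()
    for name, node in graph.items():
        if name in processed:
            continue  # "Skipping already processed Node"
        if node["op"] == "MatMul":
            a, w = (tensors[i] for i in node["inputs"])
            # Is there an Add consuming this MatMul's output?
            add = next((k for k, n in graph.items()
                        if n["op"] == "Add" and name in n["inputs"]), None)
            if add is not None:
                bias = [i for i in graph[add]["inputs"] if i != name][0]
                tensors[add] = a @ w + tensors[bias]  # fused MatMul+Add
                processed.add(add)
            else:
                tensors[name] = a @ w                 # regular MatMul
        elif node["op"] == "Add":
            x, y = (tensors[i] for i in node["inputs"])
            tensors[name] = x + y
    return tensors

rng = np.random.default_rng(0)
t = {"x": rng.standard_normal((4, 3)), "w": rng.standard_normal((3, 2)),
     "b": rng.standard_normal(2)}
out = run(graph, t)
```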

Node: bert/encoder/layer_0/attention/self/Reshape_2, Execution Time: 0.000020 seconds

Node: bert/encoder/layer_0/attention/self/transpose_2, Execution Time: 0.000505 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_0/attention/self/query/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_0/attention/self/query/MatMul, Execution Time: 0.000672 seconds

Skipping already processed Node: bert/encoder/layer_0/attention/self/query/BiasAdd

Node: bert/encoder/layer_0/attention/self/Reshape, Execution Time: 0.000020 seconds

Node: bert/encoder/layer_0/attention/self/transpose, Execution Time: 0.000450 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_0/attention/self/key/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_0/attention/self/key/MatMul, Execution Time: 0.000619 seconds

Skipping already processed Node: bert/encoder/layer_0/attention/self/key/BiasAdd

Node: bert/encoder/layer_0/attention/self/Reshape_1, Execution Time: 0.000009 seconds

Node: bert/encoder/layer_0/attention/self/MatMul__306, Execution Time: 0.000444 seconds

Input size: (12, 256, 64, 256)
No Add node related to MatMul output: bert/encoder/layer_0/attention/self/MatMul. Executing regular MatMul.
MatMul Node: bert/encoder/layer_0/attention/self/MatMul, Execution Time: 0.001491 seconds

Node: bert/encoder/layer_0/attention/self/Mul, Execution Time: 0.001327 seconds

Add Node: bert/encoder/layer_0/attention/self/add, Execution Time: 0.001349 seconds

Node: bert/encoder/layer_0/attention/self/Softmax, Execution Time: 0.009065 seconds

Input size: (12, 256, 256, 64)
No Add node related to MatMul output: bert/encoder/layer_0/attention/self/MatMul_1. Executing regular MatMul.
MatMul Node: bert/encoder/layer_0/attention/self/MatMul_1, Execution Time: 0.000635 seconds

Node: bert/encoder/layer_0/attention/self/transpose_3, Execution Time: 0.000550 seconds

Node: bert/encoder/layer_0/attention/self/Reshape_3, Execution Time: 0.000058 seconds
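
The block from MatMul through Reshape_3 is scaled dot-product attention per head: QK^T (the (12, 256, 64, 256) input size shows K already transposed), Mul by 1/sqrt(64), add the additive mask, Softmax, then the probability-weighted sum over V. A sketch with toy shapes (12 heads, head dim 64, short sequence):

```python
import numpy as np

def attention(q, k, v, additive_mask):
    # MatMul: (heads, seq, d) x (heads, d, seq) -> (heads, seq, seq)
    scores = q @ k.transpose(0, 2, 1)
    scores = scores / np.sqrt(q.shape[-1])   # Mul by 1/sqrt(64)
    scores = scores + additive_mask          # add: -10000 bias on padding
    # Softmax over the last axis (numerically stabilized)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)
    return probs @ v                         # MatMul_1

rng = np.random.default_rng(3)
h, s, d = 12, 8, 64                          # 12 heads, toy seq len 8
q, k, v = (rng.standard_normal((h, s, d)) for _ in range(3))
mask = np.zeros((1, 1, s))                   # no padding in this toy example
out = attention(q, k, v, mask)
```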

Input size: (None, 256, 768, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_0/attention/output/dense/MatMul
torch.Size([256, 768]) ,  torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_0/attention/output/dense/MatMul, Execution Time: 0.001760 seconds

Skipping already processed Node: bert/encoder/layer_0/attention/output/dense/BiasAdd

Skipping already processed Node: bert/encoder/layer_0/attention/output/add

Node: bert/encoder/layer_0/attention/output/LayerNorm/moments/mean, Execution Time: 0.000082 seconds

Node: bert/encoder/layer_0/attention/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000634 seconds

Node: bert/encoder/layer_0/attention/output/LayerNorm/moments/SquaredDifference__309, Execution Time: 0.000473 seconds

Node: bert/encoder/layer_0/attention/output/LayerNorm/moments/variance, Execution Time: 0.000054 seconds

Add Node: bert/encoder/layer_0/attention/output/LayerNorm/batchnorm/add, Execution Time: 0.000064 seconds

Node: bert/encoder/layer_0/attention/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000047 seconds

Node: bert/encoder/layer_0/attention/output/LayerNorm/batchnorm/Rsqrt__311, Execution Time: 0.000068 seconds

Node: bert/encoder/layer_0/attention/output/LayerNorm/batchnorm/mul, Execution Time: 0.000055 seconds

Node: bert/encoder/layer_0/attention/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000041 seconds

Node: bert/encoder/layer_0/attention/output/LayerNorm/batchnorm/sub, Execution Time: 0.000046 seconds

Node: bert/encoder/layer_0/attention/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000464 seconds

Add Node: bert/encoder/layer_0/attention/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000457 seconds

Input size: (None, 256, 768, 3072)
Fusing MatMul with Add for node: bert/encoder/layer_0/intermediate/dense/MatMul
torch.Size([256, 3072])
MatMul Fuse node: bert/encoder/layer_0/intermediate/dense/MatMul, Execution Time: 0.000690 seconds

Skipping already processed Node: bert/encoder/layer_0/intermediate/dense/BiasAdd

Node: bert/encoder/layer_0/intermediate/dense/Pow, Execution Time: 0.018049 seconds

Node: bert/encoder/layer_0/intermediate/dense/mul, Execution Time: 0.001407 seconds

Add Node: bert/encoder/layer_0/intermediate/dense/add, Execution Time: 0.001314 seconds

Node: bert/encoder/layer_0/intermediate/dense/mul_1, Execution Time: 0.001507 seconds

Node: bert/encoder/layer_0/intermediate/dense/Tanh, Execution Time: 0.003959 seconds

Add Node: bert/encoder/layer_0/intermediate/dense/add_1, Execution Time: 0.001380 seconds

Node: bert/encoder/layer_0/intermediate/dense/mul_2, Execution Time: 0.001314 seconds

Node: bert/encoder/layer_0/intermediate/dense/mul_3, Execution Time: 0.001374 seconds
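
The Pow / mul / add / mul_1 / Tanh / add_1 / mul_2 / mul_3 chain above is BERT's tanh approximation of GELU: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))). A sketch mapping each node to its term:

```python
import numpy as np

def gelu_tanh(x):
    # Pow: x^3; mul: * 0.044715; add: x + ...; mul_1: * sqrt(2/pi)
    inner = np.sqrt(2.0 / np.pi) * (x + 0.044715 * np.power(x, 3))
    # Tanh; add_1: + 1; mul_2: * 0.5; mul_3: * x
    return 0.5 * x * (1.0 + np.tanh(inner))

y = gelu_tanh(np.linspace(-3, 3, 7))
```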

Input size: (None, 256, 3072, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_0/output/dense/MatMul
torch.Size([256, 768]) ,  torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_0/output/dense/MatMul, Execution Time: 0.001047 seconds

Skipping already processed Node: bert/encoder/layer_0/output/dense/BiasAdd

Skipping already processed Node: bert/encoder/layer_0/output/add

Node: bert/encoder/layer_0/output/LayerNorm/moments/mean, Execution Time: 0.000100 seconds

Node: bert/encoder/layer_0/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000494 seconds

Node: bert/encoder/layer_0/output/LayerNorm/moments/SquaredDifference__313, Execution Time: 0.000547 seconds

Node: bert/encoder/layer_0/output/LayerNorm/moments/variance, Execution Time: 0.000057 seconds

Add Node: bert/encoder/layer_0/output/LayerNorm/batchnorm/add, Execution Time: 0.000063 seconds

Node: bert/encoder/layer_0/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000046 seconds

Node: bert/encoder/layer_0/output/LayerNorm/batchnorm/Rsqrt__315, Execution Time: 0.000076 seconds

Node: bert/encoder/layer_0/output/LayerNorm/batchnorm/mul, Execution Time: 0.000056 seconds

Node: bert/encoder/layer_0/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000051 seconds

Node: bert/encoder/layer_0/output/LayerNorm/batchnorm/sub, Execution Time: 0.000053 seconds

Node: bert/encoder/layer_0/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000486 seconds

Add Node: bert/encoder/layer_0/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000471 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_1/attention/self/value/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_1/attention/self/value/MatMul, Execution Time: 0.000654 seconds

Skipping already processed Node: bert/encoder/layer_1/attention/self/value/BiasAdd

Node: bert/encoder/layer_1/attention/self/Reshape_2, Execution Time: 0.000020 seconds

Node: bert/encoder/layer_1/attention/self/transpose_2, Execution Time: 0.000449 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_1/attention/self/query/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_1/attention/self/query/MatMul, Execution Time: 0.000632 seconds

Skipping already processed Node: bert/encoder/layer_1/attention/self/query/BiasAdd

Node: bert/encoder/layer_1/attention/self/Reshape, Execution Time: 0.000020 seconds

Node: bert/encoder/layer_1/attention/self/transpose, Execution Time: 0.000474 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_1/attention/self/key/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_1/attention/self/key/MatMul, Execution Time: 0.000604 seconds

Skipping already processed Node: bert/encoder/layer_1/attention/self/key/BiasAdd

Node: bert/encoder/layer_1/attention/self/Reshape_1, Execution Time: 0.000009 seconds

Node: bert/encoder/layer_1/attention/self/MatMul__320, Execution Time: 0.000483 seconds

Input size: (12, 256, 64, 256)
No Add node related to MatMul output: bert/encoder/layer_1/attention/self/MatMul. Executing regular MatMul.
MatMul Node: bert/encoder/layer_1/attention/self/MatMul, Execution Time: 0.000508 seconds

Node: bert/encoder/layer_1/attention/self/Mul, Execution Time: 0.001349 seconds

Add Node: bert/encoder/layer_1/attention/self/add, Execution Time: 0.001579 seconds

Node: bert/encoder/layer_1/attention/self/Softmax, Execution Time: 0.001335 seconds

Input size: (12, 256, 256, 64)
No Add node related to MatMul output: bert/encoder/layer_1/attention/self/MatMul_1. Executing regular MatMul.
MatMul Node: bert/encoder/layer_1/attention/self/MatMul_1, Execution Time: 0.000563 seconds

Node: bert/encoder/layer_1/attention/self/transpose_3, Execution Time: 0.000447 seconds

Node: bert/encoder/layer_1/attention/self/Reshape_3, Execution Time: 0.000047 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_1/attention/output/dense/MatMul
torch.Size([256, 768]) ,  torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_1/attention/output/dense/MatMul, Execution Time: 0.000678 seconds

Skipping already processed Node: bert/encoder/layer_1/attention/output/dense/BiasAdd

Skipping already processed Node: bert/encoder/layer_1/attention/output/add

Node: bert/encoder/layer_1/attention/output/LayerNorm/moments/mean, Execution Time: 0.000081 seconds

Node: bert/encoder/layer_1/attention/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000606 seconds

Node: bert/encoder/layer_1/attention/output/LayerNorm/moments/SquaredDifference__323, Execution Time: 0.000474 seconds

Node: bert/encoder/layer_1/attention/output/LayerNorm/moments/variance, Execution Time: 0.000053 seconds

Add Node: bert/encoder/layer_1/attention/output/LayerNorm/batchnorm/add, Execution Time: 0.000064 seconds

Node: bert/encoder/layer_1/attention/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000050 seconds

Node: bert/encoder/layer_1/attention/output/LayerNorm/batchnorm/Rsqrt__325, Execution Time: 0.000074 seconds

Node: bert/encoder/layer_1/attention/output/LayerNorm/batchnorm/mul, Execution Time: 0.000055 seconds

Node: bert/encoder/layer_1/attention/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000053 seconds

Node: bert/encoder/layer_1/attention/output/LayerNorm/batchnorm/sub, Execution Time: 0.000041 seconds

Node: bert/encoder/layer_1/attention/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000466 seconds

Add Node: bert/encoder/layer_1/attention/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000446 seconds

Input size: (None, 256, 768, 3072)
Fusing MatMul with Add for node: bert/encoder/layer_1/intermediate/dense/MatMul
torch.Size([256, 3072])
MatMul Fuse node: bert/encoder/layer_1/intermediate/dense/MatMul, Execution Time: 0.000661 seconds

Skipping already processed Node: bert/encoder/layer_1/intermediate/dense/BiasAdd

Node: bert/encoder/layer_1/intermediate/dense/Pow, Execution Time: 0.001371 seconds

Node: bert/encoder/layer_1/intermediate/dense/mul, Execution Time: 0.001382 seconds

Add Node: bert/encoder/layer_1/intermediate/dense/add, Execution Time: 0.001623 seconds

Node: bert/encoder/layer_1/intermediate/dense/mul_1, Execution Time: 0.001303 seconds

Node: bert/encoder/layer_1/intermediate/dense/Tanh, Execution Time: 0.001375 seconds

Add Node: bert/encoder/layer_1/intermediate/dense/add_1, Execution Time: 0.001320 seconds

Node: bert/encoder/layer_1/intermediate/dense/mul_2, Execution Time: 0.001378 seconds

Node: bert/encoder/layer_1/intermediate/dense/mul_3, Execution Time: 0.001307 seconds

Input size: (None, 256, 3072, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_1/output/dense/MatMul
torch.Size([256, 768]) ,  torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_1/output/dense/MatMul, Execution Time: 0.001064 seconds

Skipping already processed Node: bert/encoder/layer_1/output/dense/BiasAdd

Skipping already processed Node: bert/encoder/layer_1/output/add

Node: bert/encoder/layer_1/output/LayerNorm/moments/mean, Execution Time: 0.000084 seconds

Node: bert/encoder/layer_1/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000484 seconds

Node: bert/encoder/layer_1/output/LayerNorm/moments/SquaredDifference__327, Execution Time: 0.000571 seconds

Node: bert/encoder/layer_1/output/LayerNorm/moments/variance, Execution Time: 0.000056 seconds

Add Node: bert/encoder/layer_1/output/LayerNorm/batchnorm/add, Execution Time: 0.000055 seconds

Node: bert/encoder/layer_1/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000047 seconds

Node: bert/encoder/layer_1/output/LayerNorm/batchnorm/Rsqrt__329, Execution Time: 0.000080 seconds

Node: bert/encoder/layer_1/output/LayerNorm/batchnorm/mul, Execution Time: 0.000052 seconds

Node: bert/encoder/layer_1/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000042 seconds

Node: bert/encoder/layer_1/output/LayerNorm/batchnorm/sub, Execution Time: 0.000051 seconds

Node: bert/encoder/layer_1/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000450 seconds

Add Node: bert/encoder/layer_1/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000466 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_2/attention/self/value/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_2/attention/self/value/MatMul, Execution Time: 0.000678 seconds

Skipping already processed Node: bert/encoder/layer_2/attention/self/value/BiasAdd

Node: bert/encoder/layer_2/attention/self/Reshape_2, Execution Time: 0.000020 seconds

Node: bert/encoder/layer_2/attention/self/transpose_2, Execution Time: 0.000461 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_2/attention/self/query/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_2/attention/self/query/MatMul, Execution Time: 0.000645 seconds

Skipping already processed Node: bert/encoder/layer_2/attention/self/query/BiasAdd

Node: bert/encoder/layer_2/attention/self/Reshape, Execution Time: 0.000010 seconds

Node: bert/encoder/layer_2/attention/self/transpose, Execution Time: 0.000476 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_2/attention/self/key/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_2/attention/self/key/MatMul, Execution Time: 0.000615 seconds

Skipping already processed Node: bert/encoder/layer_2/attention/self/key/BiasAdd

Node: bert/encoder/layer_2/attention/self/Reshape_1, Execution Time: 0.000008 seconds

Node: bert/encoder/layer_2/attention/self/MatMul__334, Execution Time: 0.000464 seconds

Input size: (12, 256, 64, 256)
No Add node related to MatMul output: bert/encoder/layer_2/attention/self/MatMul. Executing regular MatMul.
MatMul Node: bert/encoder/layer_2/attention/self/MatMul, Execution Time: 0.000499 seconds

Node: bert/encoder/layer_2/attention/self/Mul, Execution Time: 0.001384 seconds

Add Node: bert/encoder/layer_2/attention/self/add, Execution Time: 0.001380 seconds

Node: bert/encoder/layer_2/attention/self/Softmax, Execution Time: 0.001305 seconds

Input size: (12, 256, 256, 64)
No Add node related to MatMul output: bert/encoder/layer_2/attention/self/MatMul_1. Executing regular MatMul.
MatMul Node: bert/encoder/layer_2/attention/self/MatMul_1, Execution Time: 0.000562 seconds

Node: bert/encoder/layer_2/attention/self/transpose_3, Execution Time: 0.000456 seconds

Node: bert/encoder/layer_2/attention/self/Reshape_3, Execution Time: 0.000037 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_2/attention/output/dense/MatMul
torch.Size([256, 768]) ,  torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_2/attention/output/dense/MatMul, Execution Time: 0.000755 seconds

Skipping already processed Node: bert/encoder/layer_2/attention/output/dense/BiasAdd

Skipping already processed Node: bert/encoder/layer_2/attention/output/add

Node: bert/encoder/layer_2/attention/output/LayerNorm/moments/mean, Execution Time: 0.000100 seconds

Node: bert/encoder/layer_2/attention/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000583 seconds

Node: bert/encoder/layer_2/attention/output/LayerNorm/moments/SquaredDifference__337, Execution Time: 0.000602 seconds

Node: bert/encoder/layer_2/attention/output/LayerNorm/moments/variance, Execution Time: 0.000071 seconds

Add Node: bert/encoder/layer_2/attention/output/LayerNorm/batchnorm/add, Execution Time: 0.000054 seconds

Node: bert/encoder/layer_2/attention/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000078 seconds

Node: bert/encoder/layer_2/attention/output/LayerNorm/batchnorm/Rsqrt__339, Execution Time: 0.000089 seconds

Node: bert/encoder/layer_2/attention/output/LayerNorm/batchnorm/mul, Execution Time: 0.000055 seconds

Node: bert/encoder/layer_2/attention/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000042 seconds

Node: bert/encoder/layer_2/attention/output/LayerNorm/batchnorm/sub, Execution Time: 0.000053 seconds

Node: bert/encoder/layer_2/attention/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000518 seconds

Add Node: bert/encoder/layer_2/attention/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000451 seconds

Input size: (None, 256, 768, 3072)
Fusing MatMul with Add for node: bert/encoder/layer_2/intermediate/dense/MatMul
torch.Size([256, 3072])
MatMul Fuse node: bert/encoder/layer_2/intermediate/dense/MatMul, Execution Time: 0.000782 seconds

Skipping already processed Node: bert/encoder/layer_2/intermediate/dense/BiasAdd

Node: bert/encoder/layer_2/intermediate/dense/Pow, Execution Time: 0.001319 seconds

Node: bert/encoder/layer_2/intermediate/dense/mul, Execution Time: 0.001400 seconds

Add Node: bert/encoder/layer_2/intermediate/dense/add, Execution Time: 0.001352 seconds

Node: bert/encoder/layer_2/intermediate/dense/mul_1, Execution Time: 0.001411 seconds

Node: bert/encoder/layer_2/intermediate/dense/Tanh, Execution Time: 0.001316 seconds

Add Node: bert/encoder/layer_2/intermediate/dense/add_1, Execution Time: 0.001329 seconds

Node: bert/encoder/layer_2/intermediate/dense/mul_2, Execution Time: 0.001370 seconds

Node: bert/encoder/layer_2/intermediate/dense/mul_3, Execution Time: 0.001295 seconds

Input size: (None, 256, 3072, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_2/output/dense/MatMul
torch.Size([256, 768]) ,  torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_2/output/dense/MatMul, Execution Time: 0.000986 seconds

Skipping already processed Node: bert/encoder/layer_2/output/dense/BiasAdd

Skipping already processed Node: bert/encoder/layer_2/output/add

Node: bert/encoder/layer_2/output/LayerNorm/moments/mean, Execution Time: 0.000085 seconds

Node: bert/encoder/layer_2/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000505 seconds

Node: bert/encoder/layer_2/output/LayerNorm/moments/SquaredDifference__341, Execution Time: 0.000457 seconds

Node: bert/encoder/layer_2/output/LayerNorm/moments/variance, Execution Time: 0.000055 seconds

Add Node: bert/encoder/layer_2/output/LayerNorm/batchnorm/add, Execution Time: 0.000064 seconds

Node: bert/encoder/layer_2/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000070 seconds

Node: bert/encoder/layer_2/output/LayerNorm/batchnorm/Rsqrt__343, Execution Time: 0.000066 seconds

Node: bert/encoder/layer_2/output/LayerNorm/batchnorm/mul, Execution Time: 0.000054 seconds

Node: bert/encoder/layer_2/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000052 seconds

Node: bert/encoder/layer_2/output/LayerNorm/batchnorm/sub, Execution Time: 0.000056 seconds

Node: bert/encoder/layer_2/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000513 seconds

Add Node: bert/encoder/layer_2/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000452 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_3/attention/self/value/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_3/attention/self/value/MatMul, Execution Time: 0.000684 seconds

Skipping already processed Node: bert/encoder/layer_3/attention/self/value/BiasAdd

Node: bert/encoder/layer_3/attention/self/Reshape_2, Execution Time: 0.000020 seconds

Node: bert/encoder/layer_3/attention/self/transpose_2, Execution Time: 0.000478 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_3/attention/self/query/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_3/attention/self/query/MatMul, Execution Time: 0.000721 seconds

Skipping already processed Node: bert/encoder/layer_3/attention/self/query/BiasAdd

Node: bert/encoder/layer_3/attention/self/Reshape, Execution Time: 0.000010 seconds

Node: bert/encoder/layer_3/attention/self/transpose, Execution Time: 0.000443 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_3/attention/self/key/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_3/attention/self/key/MatMul, Execution Time: 0.000608 seconds

Skipping already processed Node: bert/encoder/layer_3/attention/self/key/BiasAdd

Node: bert/encoder/layer_3/attention/self/Reshape_1, Execution Time: 0.000007 seconds

Node: bert/encoder/layer_3/attention/self/MatMul__348, Execution Time: 0.000437 seconds

Input size: (12, 256, 64, 256)
No Add node related to MatMul output: bert/encoder/layer_3/attention/self/MatMul. Executing regular MatMul.
MatMul Node: bert/encoder/layer_3/attention/self/MatMul, Execution Time: 0.000544 seconds

Node: bert/encoder/layer_3/attention/self/Mul, Execution Time: 0.001320 seconds

Add Node: bert/encoder/layer_3/attention/self/add, Execution Time: 0.001428 seconds

Node: bert/encoder/layer_3/attention/self/Softmax, Execution Time: 0.001303 seconds

Input size: (12, 256, 256, 64)
No Add node related to MatMul output: bert/encoder/layer_3/attention/self/MatMul_1. Executing regular MatMul.
MatMul Node: bert/encoder/layer_3/attention/self/MatMul_1, Execution Time: 0.000561 seconds

Node: bert/encoder/layer_3/attention/self/transpose_3, Execution Time: 0.000469 seconds

Node: bert/encoder/layer_3/attention/self/Reshape_3, Execution Time: 0.000038 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_3/attention/output/dense/MatMul
torch.Size([256, 768]) ,  torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_3/attention/output/dense/MatMul, Execution Time: 0.000677 seconds

Skipping already processed Node: bert/encoder/layer_3/attention/output/dense/BiasAdd

Skipping already processed Node: bert/encoder/layer_3/attention/output/add

Node: bert/encoder/layer_3/attention/output/LayerNorm/moments/mean, Execution Time: 0.000088 seconds

Node: bert/encoder/layer_3/attention/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000476 seconds

Node: bert/encoder/layer_3/attention/output/LayerNorm/moments/SquaredDifference__351, Execution Time: 0.000554 seconds

Node: bert/encoder/layer_3/attention/output/LayerNorm/moments/variance, Execution Time: 0.000055 seconds

Add Node: bert/encoder/layer_3/attention/output/LayerNorm/batchnorm/add, Execution Time: 0.000055 seconds

Node: bert/encoder/layer_3/attention/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000047 seconds

Node: bert/encoder/layer_3/attention/output/LayerNorm/batchnorm/Rsqrt__353, Execution Time: 0.000072 seconds

Node: bert/encoder/layer_3/attention/output/LayerNorm/batchnorm/mul, Execution Time: 0.000055 seconds

Node: bert/encoder/layer_3/attention/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000056 seconds

Node: bert/encoder/layer_3/attention/output/LayerNorm/batchnorm/sub, Execution Time: 0.000053 seconds

Node: bert/encoder/layer_3/attention/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000458 seconds

Add Node: bert/encoder/layer_3/attention/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000449 seconds

Input size: (None, 256, 768, 3072)
Fusing MatMul with Add for node: bert/encoder/layer_3/intermediate/dense/MatMul
torch.Size([256, 3072])
MatMul Fuse node: bert/encoder/layer_3/intermediate/dense/MatMul, Execution Time: 0.000654 seconds

Skipping already processed Node: bert/encoder/layer_3/intermediate/dense/BiasAdd

Node: bert/encoder/layer_3/intermediate/dense/Pow, Execution Time: 0.001374 seconds

Node: bert/encoder/layer_3/intermediate/dense/mul, Execution Time: 0.001344 seconds

Add Node: bert/encoder/layer_3/intermediate/dense/add, Execution Time: 0.001312 seconds

Node: bert/encoder/layer_3/intermediate/dense/mul_1, Execution Time: 0.001383 seconds

Node: bert/encoder/layer_3/intermediate/dense/Tanh, Execution Time: 0.001316 seconds

Add Node: bert/encoder/layer_3/intermediate/dense/add_1, Execution Time: 0.001338 seconds

Node: bert/encoder/layer_3/intermediate/dense/mul_2, Execution Time: 0.001379 seconds

Node: bert/encoder/layer_3/intermediate/dense/mul_3, Execution Time: 0.001310 seconds

Input size: (None, 256, 3072, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_3/output/dense/MatMul
torch.Size([256, 768]) ,  torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_3/output/dense/MatMul, Execution Time: 0.000992 seconds

Skipping already processed Node: bert/encoder/layer_3/output/dense/BiasAdd

Skipping already processed Node: bert/encoder/layer_3/output/add

Node: bert/encoder/layer_3/output/LayerNorm/moments/mean, Execution Time: 0.000085 seconds

Node: bert/encoder/layer_3/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000485 seconds

Node: bert/encoder/layer_3/output/LayerNorm/moments/SquaredDifference__355, Execution Time: 0.000449 seconds

Node: bert/encoder/layer_3/output/LayerNorm/moments/variance, Execution Time: 0.000054 seconds

Add Node: bert/encoder/layer_3/output/LayerNorm/batchnorm/add, Execution Time: 0.000064 seconds

Node: bert/encoder/layer_3/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000046 seconds

Node: bert/encoder/layer_3/output/LayerNorm/batchnorm/Rsqrt__357, Execution Time: 0.000070 seconds

Node: bert/encoder/layer_3/output/LayerNorm/batchnorm/mul, Execution Time: 0.000061 seconds

Node: bert/encoder/layer_3/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000054 seconds

Node: bert/encoder/layer_3/output/LayerNorm/batchnorm/sub, Execution Time: 0.000052 seconds

Node: bert/encoder/layer_3/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000545 seconds

Add Node: bert/encoder/layer_3/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000445 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_4/attention/self/value/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_4/attention/self/value/MatMul, Execution Time: 0.000668 seconds

Skipping already processed Node: bert/encoder/layer_4/attention/self/value/BiasAdd

Node: bert/encoder/layer_4/attention/self/Reshape_2, Execution Time: 0.000020 seconds

Node: bert/encoder/layer_4/attention/self/transpose_2, Execution Time: 0.000548 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_4/attention/self/query/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_4/attention/self/query/MatMul, Execution Time: 0.000681 seconds

Skipping already processed Node: bert/encoder/layer_4/attention/self/query/BiasAdd

Node: bert/encoder/layer_4/attention/self/Reshape, Execution Time: 0.000009 seconds

Node: bert/encoder/layer_4/attention/self/transpose, Execution Time: 0.000567 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_4/attention/self/key/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_4/attention/self/key/MatMul, Execution Time: 0.000655 seconds

Skipping already processed Node: bert/encoder/layer_4/attention/self/key/BiasAdd

Node: bert/encoder/layer_4/attention/self/Reshape_1, Execution Time: 0.000007 seconds

Node: bert/encoder/layer_4/attention/self/MatMul__362, Execution Time: 0.000541 seconds

Input size: (12, 256, 64, 256)
No Add node related to MatMul output: bert/encoder/layer_4/attention/self/MatMul. Executing regular MatMul.
MatMul Node: bert/encoder/layer_4/attention/self/MatMul, Execution Time: 0.000483 seconds
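The "No Add node related to MatMul output ... Executing regular MatMul" message implies a per-node decision: fuse the bias only when the MatMul's output feeds an Add. A sketch of that dispatch logic, with hypothetical node objects standing in for ONNX NodeProtos (`run_matmul_node` and the `consumers` map are assumptions, not the log's actual code):

```python
import numpy as np
from types import SimpleNamespace

def run_matmul_node(node, consumers, tensors):
    # consumers: tensor name -> list of downstream node objects.
    adds = [n for n in consumers.get(node.output[0], [])
            if n.op_type == "Add"]
    a, b = tensors[node.input[0]], tensors[node.input[1]]
    if adds:
        # Bias found: fuse it into the GEMM (the "Fusing MatMul with Add" path).
        bias = tensors[adds[0].input[1]]
        return a @ b + bias
    # No Add consumer: plain MatMul fallback, as for the attention-score MatMuls.
    return a @ b

mat = SimpleNamespace(op_type="MatMul", input=["x", "w"], output=["y"])
add = SimpleNamespace(op_type="Add", input=["y", "b"], output=["z"])
tensors = {"x": np.eye(2), "w": np.eye(2), "b": np.ones((2, 2))}
fused_out = run_matmul_node(mat, {"y": [add]}, tensors)
plain_out = run_matmul_node(mat, {}, tensors)
```

The attention-score MatMuls (Q·Kᵀ and probs·V) take this fallback branch because their outputs feed Mul/Softmax rather than an Add bias.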

Node: bert/encoder/layer_4/attention/self/Mul, Execution Time: 0.001326 seconds

Add Node: bert/encoder/layer_4/attention/self/add, Execution Time: 0.001472 seconds

Node: bert/encoder/layer_4/attention/self/Softmax, Execution Time: 0.001326 seconds

Input size: (12, 256, 256, 64)
No Add node related to MatMul output: bert/encoder/layer_4/attention/self/MatMul_1. Executing regular MatMul.
MatMul Node: bert/encoder/layer_4/attention/self/MatMul_1, Execution Time: 0.000573 seconds
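The node sequence MatMul → Mul → add → Softmax → MatMul_1 traced above is standard scaled dot-product attention: score Q against Kᵀ, scale by 1/√64 (the per-head dimension visible in the (12, 256, 64, 256) input sizes), add the attention-mask bias, softmax, then weight V. A numpy sketch under those shape assumptions:

```python
import numpy as np

def attention(q, k_t, v, mask):
    # attention/self/MatMul: raw scores Q @ K^T
    # (K is pre-transposed by the MatMul__NNN helper node in the log).
    scores = q @ k_t
    # attention/self/Mul + add: scale by 1/sqrt(head_dim), add mask bias.
    scores = scores * (1.0 / np.sqrt(64.0)) + mask
    # attention/self/Softmax over the key axis (numerically stabilized).
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)
    # attention/self/MatMul_1: context = probs @ V
    return probs @ v

# 12 heads, 4 query positions, head_dim 64 (seq length shortened for the demo).
q = np.zeros((12, 4, 64))
k_t = np.zeros((12, 64, 4))
v = np.ones((12, 4, 64))
mask = np.zeros((12, 4, 4))
out = attention(q, k_t, v, mask)
```

With zero scores the softmax is uniform, so averaging an all-ones V returns all ones; that makes the sketch easy to sanity-check.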

Node: bert/encoder/layer_4/attention/self/transpose_3, Execution Time: 0.000484 seconds

Node: bert/encoder/layer_4/attention/self/Reshape_3, Execution Time: 0.000037 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_4/attention/output/dense/MatMul
torch.Size([256, 768]) ,  torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_4/attention/output/dense/MatMul, Execution Time: 0.000743 seconds

Skipping already processed Node: bert/encoder/layer_4/attention/output/dense/BiasAdd

Skipping already processed Node: bert/encoder/layer_4/attention/output/add

Node: bert/encoder/layer_4/attention/output/LayerNorm/moments/mean, Execution Time: 0.000082 seconds

Node: bert/encoder/layer_4/attention/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000565 seconds

Node: bert/encoder/layer_4/attention/output/LayerNorm/moments/SquaredDifference__365, Execution Time: 0.000463 seconds

Node: bert/encoder/layer_4/attention/output/LayerNorm/moments/variance, Execution Time: 0.000060 seconds

Add Node: bert/encoder/layer_4/attention/output/LayerNorm/batchnorm/add, Execution Time: 0.000051 seconds

Node: bert/encoder/layer_4/attention/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000048 seconds

Node: bert/encoder/layer_4/attention/output/LayerNorm/batchnorm/Rsqrt__367, Execution Time: 0.000067 seconds

Node: bert/encoder/layer_4/attention/output/LayerNorm/batchnorm/mul, Execution Time: 0.000054 seconds

Node: bert/encoder/layer_4/attention/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000053 seconds

Node: bert/encoder/layer_4/attention/output/LayerNorm/batchnorm/sub, Execution Time: 0.000053 seconds

Node: bert/encoder/layer_4/attention/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000457 seconds

Add Node: bert/encoder/layer_4/attention/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000459 seconds
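The LayerNorm blocks above run as eight small nodes (moments/mean, SquaredDifference, variance, add ε, Rsqrt, two muls, sub, then mul_1/add_1). That is TensorFlow's batchnorm-style decomposition of y = γ·(x − μ)/√(σ² + ε) + β, refactored so the final step is a single multiply-add. A sketch mapping each step to its node (ε = 1e-12 is an assumption; the log does not print it):

```python
import numpy as np

def layernorm_batchnorm_form(x, gamma, beta, eps=1e-12):
    mean = x.mean(axis=-1, keepdims=True)                 # moments/mean
    var = ((x - mean) ** 2).mean(axis=-1, keepdims=True)  # SquaredDifference + variance
    rstd = 1.0 / np.sqrt(var + eps)                       # batchnorm/add + Rsqrt
    scale = gamma * rstd                                  # batchnorm/mul
    offset = beta - mean * scale                          # batchnorm/mul_2 + sub
    return x * scale + offset                             # batchnorm/mul_1 + add_1

x = np.linspace(-1.0, 1.0, 768).reshape(1, 768)
y = layernorm_batchnorm_form(x, np.ones(768), np.zeros(768))
```

Precomputing `scale` and `offset` means the full 256×768 activation is touched only twice (mul_1 and add_1), which matches those two nodes dominating the LayerNorm timings above.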

Input size: (None, 256, 768, 3072)
Fusing MatMul with Add for node: bert/encoder/layer_4/intermediate/dense/MatMul
torch.Size([256, 3072])
MatMul Fuse node: bert/encoder/layer_4/intermediate/dense/MatMul, Execution Time: 0.000646 seconds

Skipping already processed Node: bert/encoder/layer_4/intermediate/dense/BiasAdd

Node: bert/encoder/layer_4/intermediate/dense/Pow, Execution Time: 0.001339 seconds

Node: bert/encoder/layer_4/intermediate/dense/mul, Execution Time: 0.001356 seconds

Add Node: bert/encoder/layer_4/intermediate/dense/add, Execution Time: 0.001398 seconds

Node: bert/encoder/layer_4/intermediate/dense/mul_1, Execution Time: 0.001317 seconds

Node: bert/encoder/layer_4/intermediate/dense/Tanh, Execution Time: 0.001311 seconds

Add Node: bert/encoder/layer_4/intermediate/dense/add_1, Execution Time: 0.001370 seconds

Node: bert/encoder/layer_4/intermediate/dense/mul_2, Execution Time: 0.001508 seconds

Node: bert/encoder/layer_4/intermediate/dense/mul_3, Execution Time: 0.001303 seconds
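The intermediate-dense chain Pow → mul → add → mul_1 → Tanh → add_1 → mul_2 → mul_3 is BERT's tanh approximation of GELU: 0.5·x·(1 + tanh(√(2/π)·(x + 0.044715·x³))). The node-to-operation mapping below is my reading of the sequence, not something the log states explicitly:

```python
import numpy as np

def gelu_tanh(x):
    # Pow + mul + add: x + 0.044715 * x^3
    inner = x + 0.044715 * np.power(x, 3)
    # mul_1: scale by sqrt(2 / pi)
    inner = np.sqrt(2.0 / np.pi) * inner
    # Tanh + add_1 + mul_2 + mul_3: 0.5 * x * (1 + tanh(inner))
    return 0.5 * x * (1.0 + np.tanh(inner))
```

Eight elementwise nodes over the 256×3072 intermediate tensor explains why each step in this chain costs ~1.3 ms, roughly double the 768-wide LayerNorm elementwise nodes.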

Input size: (None, 256, 3072, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_4/output/dense/MatMul
torch.Size([256, 768]) ,  torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_4/output/dense/MatMul, Execution Time: 0.000987 seconds

Skipping already processed Node: bert/encoder/layer_4/output/dense/BiasAdd

Skipping already processed Node: bert/encoder/layer_4/output/add

Node: bert/encoder/layer_4/output/LayerNorm/moments/mean, Execution Time: 0.000072 seconds

Node: bert/encoder/layer_4/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000470 seconds

Node: bert/encoder/layer_4/output/LayerNorm/moments/SquaredDifference__369, Execution Time: 0.000466 seconds

Node: bert/encoder/layer_4/output/LayerNorm/moments/variance, Execution Time: 0.000052 seconds

Add Node: bert/encoder/layer_4/output/LayerNorm/batchnorm/add, Execution Time: 0.000048 seconds

Node: bert/encoder/layer_4/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000052 seconds

Node: bert/encoder/layer_4/output/LayerNorm/batchnorm/Rsqrt__371, Execution Time: 0.000066 seconds

Node: bert/encoder/layer_4/output/LayerNorm/batchnorm/mul, Execution Time: 0.000055 seconds

Node: bert/encoder/layer_4/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000052 seconds

Node: bert/encoder/layer_4/output/LayerNorm/batchnorm/sub, Execution Time: 0.000052 seconds

Node: bert/encoder/layer_4/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000466 seconds

Add Node: bert/encoder/layer_4/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000463 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_5/attention/self/value/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_5/attention/self/value/MatMul, Execution Time: 0.001840 seconds

Skipping already processed Node: bert/encoder/layer_5/attention/self/value/BiasAdd

Node: bert/encoder/layer_5/attention/self/Reshape_2, Execution Time: 0.000020 seconds

Node: bert/encoder/layer_5/attention/self/transpose_2, Execution Time: 0.000459 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_5/attention/self/query/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_5/attention/self/query/MatMul, Execution Time: 0.000622 seconds

Skipping already processed Node: bert/encoder/layer_5/attention/self/query/BiasAdd

Node: bert/encoder/layer_5/attention/self/Reshape, Execution Time: 0.000009 seconds

Node: bert/encoder/layer_5/attention/self/transpose, Execution Time: 0.000436 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_5/attention/self/key/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_5/attention/self/key/MatMul, Execution Time: 0.000607 seconds

Skipping already processed Node: bert/encoder/layer_5/attention/self/key/BiasAdd

Node: bert/encoder/layer_5/attention/self/Reshape_1, Execution Time: 0.000009 seconds

Node: bert/encoder/layer_5/attention/self/MatMul__376, Execution Time: 0.000448 seconds

Input size: (12, 256, 64, 256)
No Add node related to MatMul output: bert/encoder/layer_5/attention/self/MatMul. Executing regular MatMul.
MatMul Node: bert/encoder/layer_5/attention/self/MatMul, Execution Time: 0.000485 seconds

Node: bert/encoder/layer_5/attention/self/Mul, Execution Time: 0.001392 seconds

Add Node: bert/encoder/layer_5/attention/self/add, Execution Time: 0.001310 seconds

Node: bert/encoder/layer_5/attention/self/Softmax, Execution Time: 0.001333 seconds

Input size: (12, 256, 256, 64)
No Add node related to MatMul output: bert/encoder/layer_5/attention/self/MatMul_1. Executing regular MatMul.
MatMul Node: bert/encoder/layer_5/attention/self/MatMul_1, Execution Time: 0.000640 seconds

Node: bert/encoder/layer_5/attention/self/transpose_3, Execution Time: 0.000455 seconds

Node: bert/encoder/layer_5/attention/self/Reshape_3, Execution Time: 0.000037 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_5/attention/output/dense/MatMul
torch.Size([256, 768]) ,  torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_5/attention/output/dense/MatMul, Execution Time: 0.000660 seconds

Skipping already processed Node: bert/encoder/layer_5/attention/output/dense/BiasAdd

Skipping already processed Node: bert/encoder/layer_5/attention/output/add

Node: bert/encoder/layer_5/attention/output/LayerNorm/moments/mean, Execution Time: 0.000081 seconds

Node: bert/encoder/layer_5/attention/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000477 seconds

Node: bert/encoder/layer_5/attention/output/LayerNorm/moments/SquaredDifference__379, Execution Time: 0.000461 seconds

Node: bert/encoder/layer_5/attention/output/LayerNorm/moments/variance, Execution Time: 0.000053 seconds

Add Node: bert/encoder/layer_5/attention/output/LayerNorm/batchnorm/add, Execution Time: 0.000047 seconds

Node: bert/encoder/layer_5/attention/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000048 seconds

Node: bert/encoder/layer_5/attention/output/LayerNorm/batchnorm/Rsqrt__381, Execution Time: 0.000068 seconds

Node: bert/encoder/layer_5/attention/output/LayerNorm/batchnorm/mul, Execution Time: 0.000063 seconds

Node: bert/encoder/layer_5/attention/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000046 seconds

Node: bert/encoder/layer_5/attention/output/LayerNorm/batchnorm/sub, Execution Time: 0.000055 seconds

Node: bert/encoder/layer_5/attention/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000468 seconds

Add Node: bert/encoder/layer_5/attention/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000451 seconds

Input size: (None, 256, 768, 3072)
Fusing MatMul with Add for node: bert/encoder/layer_5/intermediate/dense/MatMul
torch.Size([256, 3072])
MatMul Fuse node: bert/encoder/layer_5/intermediate/dense/MatMul, Execution Time: 0.000666 seconds

Skipping already processed Node: bert/encoder/layer_5/intermediate/dense/BiasAdd

Node: bert/encoder/layer_5/intermediate/dense/Pow, Execution Time: 0.001391 seconds

Node: bert/encoder/layer_5/intermediate/dense/mul, Execution Time: 0.001312 seconds

Add Node: bert/encoder/layer_5/intermediate/dense/add, Execution Time: 0.001391 seconds

Node: bert/encoder/layer_5/intermediate/dense/mul_1, Execution Time: 0.001297 seconds

Node: bert/encoder/layer_5/intermediate/dense/Tanh, Execution Time: 0.001306 seconds

Add Node: bert/encoder/layer_5/intermediate/dense/add_1, Execution Time: 0.001386 seconds

Node: bert/encoder/layer_5/intermediate/dense/mul_2, Execution Time: 0.001291 seconds

Node: bert/encoder/layer_5/intermediate/dense/mul_3, Execution Time: 0.001279 seconds

Input size: (None, 256, 3072, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_5/output/dense/MatMul
torch.Size([256, 768]) ,  torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_5/output/dense/MatMul, Execution Time: 0.001012 seconds

Skipping already processed Node: bert/encoder/layer_5/output/dense/BiasAdd

Skipping already processed Node: bert/encoder/layer_5/output/add

Node: bert/encoder/layer_5/output/LayerNorm/moments/mean, Execution Time: 0.000083 seconds

Node: bert/encoder/layer_5/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000461 seconds

Node: bert/encoder/layer_5/output/LayerNorm/moments/SquaredDifference__383, Execution Time: 0.000457 seconds

Node: bert/encoder/layer_5/output/LayerNorm/moments/variance, Execution Time: 0.000056 seconds

Add Node: bert/encoder/layer_5/output/LayerNorm/batchnorm/add, Execution Time: 0.000053 seconds

Node: bert/encoder/layer_5/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000049 seconds

Node: bert/encoder/layer_5/output/LayerNorm/batchnorm/Rsqrt__385, Execution Time: 0.000066 seconds

Node: bert/encoder/layer_5/output/LayerNorm/batchnorm/mul, Execution Time: 0.000054 seconds

Node: bert/encoder/layer_5/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000052 seconds

Node: bert/encoder/layer_5/output/LayerNorm/batchnorm/sub, Execution Time: 0.000052 seconds

Node: bert/encoder/layer_5/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000465 seconds

Add Node: bert/encoder/layer_5/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000463 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_6/attention/self/value/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_6/attention/self/value/MatMul, Execution Time: 0.000639 seconds

Skipping already processed Node: bert/encoder/layer_6/attention/self/value/BiasAdd

Node: bert/encoder/layer_6/attention/self/Reshape_2, Execution Time: 0.000020 seconds

Node: bert/encoder/layer_6/attention/self/transpose_2, Execution Time: 0.000466 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_6/attention/self/query/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_6/attention/self/query/MatMul, Execution Time: 0.000643 seconds

Skipping already processed Node: bert/encoder/layer_6/attention/self/query/BiasAdd

Node: bert/encoder/layer_6/attention/self/Reshape, Execution Time: 0.000009 seconds

Node: bert/encoder/layer_6/attention/self/transpose, Execution Time: 0.000510 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_6/attention/self/key/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_6/attention/self/key/MatMul, Execution Time: 0.000669 seconds

Skipping already processed Node: bert/encoder/layer_6/attention/self/key/BiasAdd

Node: bert/encoder/layer_6/attention/self/Reshape_1, Execution Time: 0.000008 seconds

Node: bert/encoder/layer_6/attention/self/MatMul__390, Execution Time: 0.000553 seconds

Input size: (12, 256, 64, 256)
No Add node related to MatMul output: bert/encoder/layer_6/attention/self/MatMul. Executing regular MatMul.
MatMul Node: bert/encoder/layer_6/attention/self/MatMul, Execution Time: 0.000546 seconds

Node: bert/encoder/layer_6/attention/self/Mul, Execution Time: 0.002146 seconds

Add Node: bert/encoder/layer_6/attention/self/add, Execution Time: 0.001294 seconds

Node: bert/encoder/layer_6/attention/self/Softmax, Execution Time: 0.001295 seconds

Input size: (12, 256, 256, 64)
No Add node related to MatMul output: bert/encoder/layer_6/attention/self/MatMul_1. Executing regular MatMul.
MatMul Node: bert/encoder/layer_6/attention/self/MatMul_1, Execution Time: 0.000554 seconds

Node: bert/encoder/layer_6/attention/self/transpose_3, Execution Time: 0.000507 seconds

Node: bert/encoder/layer_6/attention/self/Reshape_3, Execution Time: 0.000047 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_6/attention/output/dense/MatMul
torch.Size([256, 768]) ,  torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_6/attention/output/dense/MatMul, Execution Time: 0.000683 seconds

Skipping already processed Node: bert/encoder/layer_6/attention/output/dense/BiasAdd

Skipping already processed Node: bert/encoder/layer_6/attention/output/add

Node: bert/encoder/layer_6/attention/output/LayerNorm/moments/mean, Execution Time: 0.000087 seconds

Node: bert/encoder/layer_6/attention/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000460 seconds

Node: bert/encoder/layer_6/attention/output/LayerNorm/moments/SquaredDifference__393, Execution Time: 0.000455 seconds

Node: bert/encoder/layer_6/attention/output/LayerNorm/moments/variance, Execution Time: 0.000062 seconds

Add Node: bert/encoder/layer_6/attention/output/LayerNorm/batchnorm/add, Execution Time: 0.000064 seconds

Node: bert/encoder/layer_6/attention/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000047 seconds

Node: bert/encoder/layer_6/attention/output/LayerNorm/batchnorm/Rsqrt__395, Execution Time: 0.000072 seconds

Node: bert/encoder/layer_6/attention/output/LayerNorm/batchnorm/mul, Execution Time: 0.000054 seconds

Node: bert/encoder/layer_6/attention/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000053 seconds

Node: bert/encoder/layer_6/attention/output/LayerNorm/batchnorm/sub, Execution Time: 0.000057 seconds

Node: bert/encoder/layer_6/attention/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000443 seconds

Add Node: bert/encoder/layer_6/attention/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000454 seconds

Input size: (None, 256, 768, 3072)
Fusing MatMul with Add for node: bert/encoder/layer_6/intermediate/dense/MatMul
torch.Size([256, 3072])
MatMul Fuse node: bert/encoder/layer_6/intermediate/dense/MatMul, Execution Time: 0.000655 seconds

Skipping already processed Node: bert/encoder/layer_6/intermediate/dense/BiasAdd

Node: bert/encoder/layer_6/intermediate/dense/Pow, Execution Time: 0.001311 seconds

Node: bert/encoder/layer_6/intermediate/dense/mul, Execution Time: 0.001315 seconds

Add Node: bert/encoder/layer_6/intermediate/dense/add, Execution Time: 0.001377 seconds

Node: bert/encoder/layer_6/intermediate/dense/mul_1, Execution Time: 0.001305 seconds

Node: bert/encoder/layer_6/intermediate/dense/Tanh, Execution Time: 0.001307 seconds

Add Node: bert/encoder/layer_6/intermediate/dense/add_1, Execution Time: 0.001387 seconds

Node: bert/encoder/layer_6/intermediate/dense/mul_2, Execution Time: 0.001303 seconds

Node: bert/encoder/layer_6/intermediate/dense/mul_3, Execution Time: 0.001365 seconds

Input size: (None, 256, 3072, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_6/output/dense/MatMul
torch.Size([256, 768]) ,  torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_6/output/dense/MatMul, Execution Time: 0.000988 seconds

Skipping already processed Node: bert/encoder/layer_6/output/dense/BiasAdd

Skipping already processed Node: bert/encoder/layer_6/output/add

Node: bert/encoder/layer_6/output/LayerNorm/moments/mean, Execution Time: 0.000092 seconds

Node: bert/encoder/layer_6/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000490 seconds

Node: bert/encoder/layer_6/output/LayerNorm/moments/SquaredDifference__397, Execution Time: 0.000460 seconds

Node: bert/encoder/layer_6/output/LayerNorm/moments/variance, Execution Time: 0.000055 seconds

Add Node: bert/encoder/layer_6/output/LayerNorm/batchnorm/add, Execution Time: 0.000063 seconds

Node: bert/encoder/layer_6/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000051 seconds

Node: bert/encoder/layer_6/output/LayerNorm/batchnorm/Rsqrt__399, Execution Time: 0.000071 seconds

Node: bert/encoder/layer_6/output/LayerNorm/batchnorm/mul, Execution Time: 0.000063 seconds

Node: bert/encoder/layer_6/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000045 seconds

Node: bert/encoder/layer_6/output/LayerNorm/batchnorm/sub, Execution Time: 0.000052 seconds

Node: bert/encoder/layer_6/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000481 seconds

Add Node: bert/encoder/layer_6/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000447 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_7/attention/self/value/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_7/attention/self/value/MatMul, Execution Time: 0.000656 seconds

Skipping already processed Node: bert/encoder/layer_7/attention/self/value/BiasAdd

Node: bert/encoder/layer_7/attention/self/Reshape_2, Execution Time: 0.000020 seconds

Node: bert/encoder/layer_7/attention/self/transpose_2, Execution Time: 0.000444 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_7/attention/self/query/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_7/attention/self/query/MatMul, Execution Time: 0.000674 seconds

Skipping already processed Node: bert/encoder/layer_7/attention/self/query/BiasAdd

Node: bert/encoder/layer_7/attention/self/Reshape, Execution Time: 0.000009 seconds

Node: bert/encoder/layer_7/attention/self/transpose, Execution Time: 0.000441 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_7/attention/self/key/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_7/attention/self/key/MatMul, Execution Time: 0.000600 seconds

Skipping already processed Node: bert/encoder/layer_7/attention/self/key/BiasAdd

Node: bert/encoder/layer_7/attention/self/Reshape_1, Execution Time: 0.000008 seconds

Node: bert/encoder/layer_7/attention/self/MatMul__404, Execution Time: 0.000440 seconds

Input size: (12, 256, 64, 256)
No Add node related to MatMul output: bert/encoder/layer_7/attention/self/MatMul. Executing regular MatMul.
MatMul Node: bert/encoder/layer_7/attention/self/MatMul, Execution Time: 0.000509 seconds

Node: bert/encoder/layer_7/attention/self/Mul, Execution Time: 0.001363 seconds

Add Node: bert/encoder/layer_7/attention/self/add, Execution Time: 0.001514 seconds

Node: bert/encoder/layer_7/attention/self/Softmax, Execution Time: 0.001384 seconds

Input size: (12, 256, 256, 64)
No Add node related to MatMul output: bert/encoder/layer_7/attention/self/MatMul_1. Executing regular MatMul.
MatMul Node: bert/encoder/layer_7/attention/self/MatMul_1, Execution Time: 0.000567 seconds

Node: bert/encoder/layer_7/attention/self/transpose_3, Execution Time: 0.000458 seconds

Node: bert/encoder/layer_7/attention/self/Reshape_3, Execution Time: 0.000047 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_7/attention/output/dense/MatMul
torch.Size([256, 768]) ,  torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_7/attention/output/dense/MatMul, Execution Time: 0.000650 seconds

Skipping already processed Node: bert/encoder/layer_7/attention/output/dense/BiasAdd

Skipping already processed Node: bert/encoder/layer_7/attention/output/add

Node: bert/encoder/layer_7/attention/output/LayerNorm/moments/mean, Execution Time: 0.000081 seconds

Node: bert/encoder/layer_7/attention/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000473 seconds

Node: bert/encoder/layer_7/attention/output/LayerNorm/moments/SquaredDifference__407, Execution Time: 0.000465 seconds

Node: bert/encoder/layer_7/attention/output/LayerNorm/moments/variance, Execution Time: 0.000053 seconds

Add Node: bert/encoder/layer_7/attention/output/LayerNorm/batchnorm/add, Execution Time: 0.000047 seconds

Node: bert/encoder/layer_7/attention/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000045 seconds

Node: bert/encoder/layer_7/attention/output/LayerNorm/batchnorm/Rsqrt__409, Execution Time: 0.000066 seconds

Node: bert/encoder/layer_7/attention/output/LayerNorm/batchnorm/mul, Execution Time: 0.000054 seconds

Node: bert/encoder/layer_7/attention/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000053 seconds

Node: bert/encoder/layer_7/attention/output/LayerNorm/batchnorm/sub, Execution Time: 0.000051 seconds

Node: bert/encoder/layer_7/attention/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000451 seconds

Add Node: bert/encoder/layer_7/attention/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000458 seconds

Input size: (None, 256, 768, 3072)
Fusing MatMul with Add for node: bert/encoder/layer_7/intermediate/dense/MatMul
torch.Size([256, 3072])
MatMul Fuse node: bert/encoder/layer_7/intermediate/dense/MatMul, Execution Time: 0.000650 seconds

Skipping already processed Node: bert/encoder/layer_7/intermediate/dense/BiasAdd

Node: bert/encoder/layer_7/intermediate/dense/Pow, Execution Time: 0.001369 seconds

Node: bert/encoder/layer_7/intermediate/dense/mul, Execution Time: 0.001377 seconds

Add Node: bert/encoder/layer_7/intermediate/dense/add, Execution Time: 0.001498 seconds

Node: bert/encoder/layer_7/intermediate/dense/mul_1, Execution Time: 0.001320 seconds

Node: bert/encoder/layer_7/intermediate/dense/Tanh, Execution Time: 0.001377 seconds

Add Node: bert/encoder/layer_7/intermediate/dense/add_1, Execution Time: 0.001314 seconds

Node: bert/encoder/layer_7/intermediate/dense/mul_2, Execution Time: 0.001305 seconds

Node: bert/encoder/layer_7/intermediate/dense/mul_3, Execution Time: 0.002071 seconds

Input size: (None, 256, 3072, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_7/output/dense/MatMul
torch.Size([256, 768]) ,  torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_7/output/dense/MatMul, Execution Time: 0.001035 seconds

Skipping already processed Node: bert/encoder/layer_7/output/dense/BiasAdd

Skipping already processed Node: bert/encoder/layer_7/output/add

Node: bert/encoder/layer_7/output/LayerNorm/moments/mean, Execution Time: 0.000083 seconds

Node: bert/encoder/layer_7/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000452 seconds

Node: bert/encoder/layer_7/output/LayerNorm/moments/SquaredDifference__411, Execution Time: 0.000452 seconds

Node: bert/encoder/layer_7/output/LayerNorm/moments/variance, Execution Time: 0.000056 seconds

Add Node: bert/encoder/layer_7/output/LayerNorm/batchnorm/add, Execution Time: 0.000051 seconds

Node: bert/encoder/layer_7/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000045 seconds

Node: bert/encoder/layer_7/output/LayerNorm/batchnorm/Rsqrt__413, Execution Time: 0.000071 seconds

Node: bert/encoder/layer_7/output/LayerNorm/batchnorm/mul, Execution Time: 0.000053 seconds

Node: bert/encoder/layer_7/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000052 seconds

Node: bert/encoder/layer_7/output/LayerNorm/batchnorm/sub, Execution Time: 0.000052 seconds

Node: bert/encoder/layer_7/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000450 seconds

Add Node: bert/encoder/layer_7/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000447 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_8/attention/self/value/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_8/attention/self/value/MatMul, Execution Time: 0.000658 seconds

Skipping already processed Node: bert/encoder/layer_8/attention/self/value/BiasAdd

Node: bert/encoder/layer_8/attention/self/Reshape_2, Execution Time: 0.000020 seconds

Node: bert/encoder/layer_8/attention/self/transpose_2, Execution Time: 0.000448 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_8/attention/self/query/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_8/attention/self/query/MatMul, Execution Time: 0.000630 seconds

Skipping already processed Node: bert/encoder/layer_8/attention/self/query/BiasAdd

Node: bert/encoder/layer_8/attention/self/Reshape, Execution Time: 0.000020 seconds

Node: bert/encoder/layer_8/attention/self/transpose, Execution Time: 0.000449 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_8/attention/self/key/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_8/attention/self/key/MatMul, Execution Time: 0.000614 seconds

Skipping already processed Node: bert/encoder/layer_8/attention/self/key/BiasAdd

Node: bert/encoder/layer_8/attention/self/Reshape_1, Execution Time: 0.000008 seconds

Node: bert/encoder/layer_8/attention/self/MatMul__418, Execution Time: 0.000443 seconds

Input size: (12, 256, 64, 256)
No Add node related to MatMul output: bert/encoder/layer_8/attention/self/MatMul. Executing regular MatMul.
MatMul Node: bert/encoder/layer_8/attention/self/MatMul, Execution Time: 0.000495 seconds

Node: bert/encoder/layer_8/attention/self/Mul, Execution Time: 0.001312 seconds

Add Node: bert/encoder/layer_8/attention/self/add, Execution Time: 0.001359 seconds

Node: bert/encoder/layer_8/attention/self/Softmax, Execution Time: 0.001416 seconds

Input size: (12, 256, 256, 64)
No Add node related to MatMul output: bert/encoder/layer_8/attention/self/MatMul_1. Executing regular MatMul.
MatMul Node: bert/encoder/layer_8/attention/self/MatMul_1, Execution Time: 0.000587 seconds

Node: bert/encoder/layer_8/attention/self/transpose_3, Execution Time: 0.000445 seconds

Node: bert/encoder/layer_8/attention/self/Reshape_3, Execution Time: 0.000051 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_8/attention/output/dense/MatMul
torch.Size([256, 768]) ,  torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_8/attention/output/dense/MatMul, Execution Time: 0.000746 seconds

Skipping already processed Node: bert/encoder/layer_8/attention/output/dense/BiasAdd

Skipping already processed Node: bert/encoder/layer_8/attention/output/add

Node: bert/encoder/layer_8/attention/output/LayerNorm/moments/mean, Execution Time: 0.000085 seconds

Node: bert/encoder/layer_8/attention/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000469 seconds

Node: bert/encoder/layer_8/attention/output/LayerNorm/moments/SquaredDifference__421, Execution Time: 0.000466 seconds

Node: bert/encoder/layer_8/attention/output/LayerNorm/moments/variance, Execution Time: 0.000054 seconds

Add Node: bert/encoder/layer_8/attention/output/LayerNorm/batchnorm/add, Execution Time: 0.000063 seconds

Node: bert/encoder/layer_8/attention/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000047 seconds

Node: bert/encoder/layer_8/attention/output/LayerNorm/batchnorm/Rsqrt__423, Execution Time: 0.000066 seconds

Node: bert/encoder/layer_8/attention/output/LayerNorm/batchnorm/mul, Execution Time: 0.000059 seconds

Node: bert/encoder/layer_8/attention/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000054 seconds

Node: bert/encoder/layer_8/attention/output/LayerNorm/batchnorm/sub, Execution Time: 0.000055 seconds

Node: bert/encoder/layer_8/attention/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000446 seconds

Add Node: bert/encoder/layer_8/attention/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000448 seconds

Input size: (None, 256, 768, 3072)
Fusing MatMul with Add for node: bert/encoder/layer_8/intermediate/dense/MatMul
torch.Size([256, 3072])
MatMul Fuse node: bert/encoder/layer_8/intermediate/dense/MatMul, Execution Time: 0.000650 seconds

Skipping already processed Node: bert/encoder/layer_8/intermediate/dense/BiasAdd

Node: bert/encoder/layer_8/intermediate/dense/Pow, Execution Time: 0.001652 seconds

Node: bert/encoder/layer_8/intermediate/dense/mul, Execution Time: 0.001383 seconds

Add Node: bert/encoder/layer_8/intermediate/dense/add, Execution Time: 0.001327 seconds

Node: bert/encoder/layer_8/intermediate/dense/mul_1, Execution Time: 0.001308 seconds

Node: bert/encoder/layer_8/intermediate/dense/Tanh, Execution Time: 0.001390 seconds

Add Node: bert/encoder/layer_8/intermediate/dense/add_1, Execution Time: 0.001313 seconds

Node: bert/encoder/layer_8/intermediate/dense/mul_2, Execution Time: 0.001375 seconds

Node: bert/encoder/layer_8/intermediate/dense/mul_3, Execution Time: 0.001365 seconds

Input size: (None, 256, 3072, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_8/output/dense/MatMul
torch.Size([256, 768]) ,  torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_8/output/dense/MatMul, Execution Time: 0.000986 seconds

Skipping already processed Node: bert/encoder/layer_8/output/dense/BiasAdd

Skipping already processed Node: bert/encoder/layer_8/output/add

Node: bert/encoder/layer_8/output/LayerNorm/moments/mean, Execution Time: 0.000085 seconds

Node: bert/encoder/layer_8/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000489 seconds

Node: bert/encoder/layer_8/output/LayerNorm/moments/SquaredDifference__425, Execution Time: 0.000483 seconds

Node: bert/encoder/layer_8/output/LayerNorm/moments/variance, Execution Time: 0.000054 seconds

Add Node: bert/encoder/layer_8/output/LayerNorm/batchnorm/add, Execution Time: 0.000064 seconds

Node: bert/encoder/layer_8/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000046 seconds

Node: bert/encoder/layer_8/output/LayerNorm/batchnorm/Rsqrt__427, Execution Time: 0.000073 seconds

Node: bert/encoder/layer_8/output/LayerNorm/batchnorm/mul, Execution Time: 0.000057 seconds

Node: bert/encoder/layer_8/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000053 seconds

Node: bert/encoder/layer_8/output/LayerNorm/batchnorm/sub, Execution Time: 0.000053 seconds

Node: bert/encoder/layer_8/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000444 seconds

Add Node: bert/encoder/layer_8/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000456 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_9/attention/self/value/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_9/attention/self/value/MatMul, Execution Time: 0.000708 seconds

Skipping already processed Node: bert/encoder/layer_9/attention/self/value/BiasAdd

Node: bert/encoder/layer_9/attention/self/Reshape_2, Execution Time: 0.000020 seconds

Node: bert/encoder/layer_9/attention/self/transpose_2, Execution Time: 0.000458 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_9/attention/self/query/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_9/attention/self/query/MatMul, Execution Time: 0.000642 seconds

Skipping already processed Node: bert/encoder/layer_9/attention/self/query/BiasAdd

Node: bert/encoder/layer_9/attention/self/Reshape, Execution Time: 0.000010 seconds

Node: bert/encoder/layer_9/attention/self/transpose, Execution Time: 0.000452 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_9/attention/self/key/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_9/attention/self/key/MatMul, Execution Time: 0.000621 seconds

Skipping already processed Node: bert/encoder/layer_9/attention/self/key/BiasAdd

Node: bert/encoder/layer_9/attention/self/Reshape_1, Execution Time: 0.000010 seconds

Node: bert/encoder/layer_9/attention/self/MatMul__432, Execution Time: 0.000462 seconds

Input size: (12, 256, 64, 256)
No Add node related to MatMul output: bert/encoder/layer_9/attention/self/MatMul. Executing regular MatMul.
MatMul Node: bert/encoder/layer_9/attention/self/MatMul, Execution Time: 0.000492 seconds

Node: bert/encoder/layer_9/attention/self/Mul, Execution Time: 0.001414 seconds

Add Node: bert/encoder/layer_9/attention/self/add, Execution Time: 0.001318 seconds

Node: bert/encoder/layer_9/attention/self/Softmax, Execution Time: 0.001571 seconds

Input size: (12, 256, 256, 64)
No Add node related to MatMul output: bert/encoder/layer_9/attention/self/MatMul_1. Executing regular MatMul.
MatMul Node: bert/encoder/layer_9/attention/self/MatMul_1, Execution Time: 0.000562 seconds

Node: bert/encoder/layer_9/attention/self/transpose_3, Execution Time: 0.000447 seconds

Node: bert/encoder/layer_9/attention/self/Reshape_3, Execution Time: 0.000038 seconds
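
The MatMul, Mul, add, Softmax, MatMul_1 sequence above is scaled dot-product attention: Q·Kᵀ scores scaled by 1/sqrt(head_dim), the additive mask bias derived earlier from input_mask (the mul/sub nodes near the top of the log), a softmax, then application to V. A sketch using the head layout visible in the log's input sizes (12 heads, 256 tokens, 64 dims per head):

```python
import numpy as np

def self_attention(q, k, v, mask_bias, scale):
    # MatMul (Q @ K^T), Mul (scale), add (mask bias), Softmax, MatMul_1 (@ V)
    scores = q @ k.transpose(0, 1, 3, 2) * scale + mask_bias
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v

heads, seq, head_dim = 12, 256, 64
rng = np.random.default_rng(2)
q = rng.standard_normal((1, heads, seq, head_dim)).astype(np.float32)
k = rng.standard_normal((1, heads, seq, head_dim)).astype(np.float32)
v = rng.standard_normal((1, heads, seq, head_dim)).astype(np.float32)
mask_bias = np.zeros((1, 1, 1, seq), dtype=np.float32)  # 0 for kept tokens, -10000 for padding
out = self_attention(q, k, v, mask_bias, scale=1.0 / np.sqrt(head_dim))
```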

Input size: (None, 256, 768, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_9/attention/output/dense/MatMul
torch.Size([256, 768]) ,  torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_9/attention/output/dense/MatMul, Execution Time: 0.000661 seconds

Skipping already processed Node: bert/encoder/layer_9/attention/output/dense/BiasAdd

Skipping already processed Node: bert/encoder/layer_9/attention/output/add

Node: bert/encoder/layer_9/attention/output/LayerNorm/moments/mean, Execution Time: 0.000082 seconds

Node: bert/encoder/layer_9/attention/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000456 seconds

Node: bert/encoder/layer_9/attention/output/LayerNorm/moments/SquaredDifference__435, Execution Time: 0.000499 seconds

Node: bert/encoder/layer_9/attention/output/LayerNorm/moments/variance, Execution Time: 0.000067 seconds

Add Node: bert/encoder/layer_9/attention/output/LayerNorm/batchnorm/add, Execution Time: 0.000051 seconds

Node: bert/encoder/layer_9/attention/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000047 seconds

Node: bert/encoder/layer_9/attention/output/LayerNorm/batchnorm/Rsqrt__437, Execution Time: 0.000076 seconds

Node: bert/encoder/layer_9/attention/output/LayerNorm/batchnorm/mul, Execution Time: 0.000052 seconds

Node: bert/encoder/layer_9/attention/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000051 seconds

Node: bert/encoder/layer_9/attention/output/LayerNorm/batchnorm/sub, Execution Time: 0.000052 seconds

Node: bert/encoder/layer_9/attention/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000524 seconds

Add Node: bert/encoder/layer_9/attention/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000565 seconds

Input size: (None, 256, 768, 3072)
Fusing MatMul with Add for node: bert/encoder/layer_9/intermediate/dense/MatMul
torch.Size([256, 3072])
MatMul Fuse node: bert/encoder/layer_9/intermediate/dense/MatMul, Execution Time: 0.000738 seconds

Skipping already processed Node: bert/encoder/layer_9/intermediate/dense/BiasAdd

Node: bert/encoder/layer_9/intermediate/dense/Pow, Execution Time: 0.001530 seconds

Node: bert/encoder/layer_9/intermediate/dense/mul, Execution Time: 0.001426 seconds

Add Node: bert/encoder/layer_9/intermediate/dense/add, Execution Time: 0.001411 seconds

Node: bert/encoder/layer_9/intermediate/dense/mul_1, Execution Time: 0.001332 seconds

Node: bert/encoder/layer_9/intermediate/dense/Tanh, Execution Time: 0.001435 seconds

Add Node: bert/encoder/layer_9/intermediate/dense/add_1, Execution Time: 0.001343 seconds

Node: bert/encoder/layer_9/intermediate/dense/mul_2, Execution Time: 0.001372 seconds

Node: bert/encoder/layer_9/intermediate/dense/mul_3, Execution Time: 0.001386 seconds
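
The Pow, mul, add, mul_1, Tanh, add_1, mul_2, mul_3 chain above is the tanh approximation of GELU used in BERT's intermediate layer, exported as eight elementwise nodes. A sketch mapping each node to its term:

```python
import numpy as np

def gelu_tanh(x):
    # Pow(x, 3), mul(0.044715), add(x)     -> inner polynomial
    # mul_1(sqrt(2/pi)), Tanh, add_1(1.0)  -> tanh term
    # mul_2(0.5), mul_3(x)                 -> final product
    inner = np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)
    return x * 0.5 * (1.0 + np.tanh(inner))

y = gelu_tanh(np.linspace(-5, 5, 11).astype(np.float32))
```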

Input size: (None, 256, 3072, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_9/output/dense/MatMul
torch.Size([256, 768]) ,  torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_9/output/dense/MatMul, Execution Time: 0.001089 seconds

Skipping already processed Node: bert/encoder/layer_9/output/dense/BiasAdd

Skipping already processed Node: bert/encoder/layer_9/output/add

Node: bert/encoder/layer_9/output/LayerNorm/moments/mean, Execution Time: 0.000101 seconds

Node: bert/encoder/layer_9/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000596 seconds

Node: bert/encoder/layer_9/output/LayerNorm/moments/SquaredDifference__439, Execution Time: 0.000592 seconds

Node: bert/encoder/layer_9/output/LayerNorm/moments/variance, Execution Time: 0.000066 seconds

Add Node: bert/encoder/layer_9/output/LayerNorm/batchnorm/add, Execution Time: 0.000058 seconds

Node: bert/encoder/layer_9/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000059 seconds

Node: bert/encoder/layer_9/output/LayerNorm/batchnorm/Rsqrt__441, Execution Time: 0.000091 seconds

Node: bert/encoder/layer_9/output/LayerNorm/batchnorm/mul, Execution Time: 0.000063 seconds

Node: bert/encoder/layer_9/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000061 seconds

Node: bert/encoder/layer_9/output/LayerNorm/batchnorm/sub, Execution Time: 0.000057 seconds

Node: bert/encoder/layer_9/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000564 seconds

Add Node: bert/encoder/layer_9/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000584 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_10/attention/self/value/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_10/attention/self/value/MatMul, Execution Time: 0.001988 seconds

Skipping already processed Node: bert/encoder/layer_10/attention/self/value/BiasAdd

Node: bert/encoder/layer_10/attention/self/Reshape_2, Execution Time: 0.000020 seconds

Node: bert/encoder/layer_10/attention/self/transpose_2, Execution Time: 0.000438 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_10/attention/self/query/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_10/attention/self/query/MatMul, Execution Time: 0.000623 seconds

Skipping already processed Node: bert/encoder/layer_10/attention/self/query/BiasAdd

Node: bert/encoder/layer_10/attention/self/Reshape, Execution Time: 0.000009 seconds

Node: bert/encoder/layer_10/attention/self/transpose, Execution Time: 0.000460 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_10/attention/self/key/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_10/attention/self/key/MatMul, Execution Time: 0.000663 seconds

Skipping already processed Node: bert/encoder/layer_10/attention/self/key/BiasAdd

Node: bert/encoder/layer_10/attention/self/Reshape_1, Execution Time: 0.000009 seconds

Node: bert/encoder/layer_10/attention/self/MatMul__446, Execution Time: 0.000453 seconds

Input size: (12, 256, 64, 256)
No Add node related to MatMul output: bert/encoder/layer_10/attention/self/MatMul. Executing regular MatMul.
MatMul Node: bert/encoder/layer_10/attention/self/MatMul, Execution Time: 0.000487 seconds

Node: bert/encoder/layer_10/attention/self/Mul, Execution Time: 0.001345 seconds

Add Node: bert/encoder/layer_10/attention/self/add, Execution Time: 0.001318 seconds

Node: bert/encoder/layer_10/attention/self/Softmax, Execution Time: 0.001414 seconds

Input size: (12, 256, 256, 64)
No Add node related to MatMul output: bert/encoder/layer_10/attention/self/MatMul_1. Executing regular MatMul.
MatMul Node: bert/encoder/layer_10/attention/self/MatMul_1, Execution Time: 0.000694 seconds

Node: bert/encoder/layer_10/attention/self/transpose_3, Execution Time: 0.000443 seconds

Node: bert/encoder/layer_10/attention/self/Reshape_3, Execution Time: 0.000048 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_10/attention/output/dense/MatMul
torch.Size([256, 768]) ,  torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_10/attention/output/dense/MatMul, Execution Time: 0.000693 seconds

Skipping already processed Node: bert/encoder/layer_10/attention/output/dense/BiasAdd

Skipping already processed Node: bert/encoder/layer_10/attention/output/add

Node: bert/encoder/layer_10/attention/output/LayerNorm/moments/mean, Execution Time: 0.000084 seconds

Node: bert/encoder/layer_10/attention/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000475 seconds

Node: bert/encoder/layer_10/attention/output/LayerNorm/moments/SquaredDifference__449, Execution Time: 0.000465 seconds

Node: bert/encoder/layer_10/attention/output/LayerNorm/moments/variance, Execution Time: 0.000054 seconds

Add Node: bert/encoder/layer_10/attention/output/LayerNorm/batchnorm/add, Execution Time: 0.000047 seconds

Node: bert/encoder/layer_10/attention/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000047 seconds

Node: bert/encoder/layer_10/attention/output/LayerNorm/batchnorm/Rsqrt__451, Execution Time: 0.000067 seconds

Node: bert/encoder/layer_10/attention/output/LayerNorm/batchnorm/mul, Execution Time: 0.000057 seconds

Node: bert/encoder/layer_10/attention/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000052 seconds

Node: bert/encoder/layer_10/attention/output/LayerNorm/batchnorm/sub, Execution Time: 0.000057 seconds

Node: bert/encoder/layer_10/attention/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000531 seconds

Add Node: bert/encoder/layer_10/attention/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000460 seconds

Input size: (None, 256, 768, 3072)
Fusing MatMul with Add for node: bert/encoder/layer_10/intermediate/dense/MatMul
torch.Size([256, 3072])
MatMul Fuse node: bert/encoder/layer_10/intermediate/dense/MatMul, Execution Time: 0.000681 seconds

Skipping already processed Node: bert/encoder/layer_10/intermediate/dense/BiasAdd

Node: bert/encoder/layer_10/intermediate/dense/Pow, Execution Time: 0.001327 seconds

Node: bert/encoder/layer_10/intermediate/dense/mul, Execution Time: 0.001411 seconds

Add Node: bert/encoder/layer_10/intermediate/dense/add, Execution Time: 0.001332 seconds

Node: bert/encoder/layer_10/intermediate/dense/mul_1, Execution Time: 0.001390 seconds

Node: bert/encoder/layer_10/intermediate/dense/Tanh, Execution Time: 0.001319 seconds

Add Node: bert/encoder/layer_10/intermediate/dense/add_1, Execution Time: 0.001312 seconds

Node: bert/encoder/layer_10/intermediate/dense/mul_2, Execution Time: 0.001759 seconds

Node: bert/encoder/layer_10/intermediate/dense/mul_3, Execution Time: 0.001331 seconds

Input size: (None, 256, 3072, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_10/output/dense/MatMul
torch.Size([256, 768]) ,  torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_10/output/dense/MatMul, Execution Time: 0.000994 seconds

Skipping already processed Node: bert/encoder/layer_10/output/dense/BiasAdd

Skipping already processed Node: bert/encoder/layer_10/output/add

Node: bert/encoder/layer_10/output/LayerNorm/moments/mean, Execution Time: 0.000082 seconds

Node: bert/encoder/layer_10/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000477 seconds

Node: bert/encoder/layer_10/output/LayerNorm/moments/SquaredDifference__453, Execution Time: 0.000459 seconds

Node: bert/encoder/layer_10/output/LayerNorm/moments/variance, Execution Time: 0.000053 seconds

Add Node: bert/encoder/layer_10/output/LayerNorm/batchnorm/add, Execution Time: 0.000064 seconds

Node: bert/encoder/layer_10/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000046 seconds

Node: bert/encoder/layer_10/output/LayerNorm/batchnorm/Rsqrt__455, Execution Time: 0.000067 seconds

Node: bert/encoder/layer_10/output/LayerNorm/batchnorm/mul, Execution Time: 0.000057 seconds

Node: bert/encoder/layer_10/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000053 seconds

Node: bert/encoder/layer_10/output/LayerNorm/batchnorm/sub, Execution Time: 0.000059 seconds

Node: bert/encoder/layer_10/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000454 seconds

Add Node: bert/encoder/layer_10/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000557 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_11/attention/self/value/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_11/attention/self/value/MatMul, Execution Time: 0.000667 seconds

Skipping already processed Node: bert/encoder/layer_11/attention/self/value/BiasAdd

Node: bert/encoder/layer_11/attention/self/Reshape_2, Execution Time: 0.000020 seconds

Node: bert/encoder/layer_11/attention/self/transpose_2, Execution Time: 0.000451 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_11/attention/self/query/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_11/attention/self/query/MatMul, Execution Time: 0.000632 seconds

Skipping already processed Node: bert/encoder/layer_11/attention/self/query/BiasAdd

Node: bert/encoder/layer_11/attention/self/Reshape, Execution Time: 0.000020 seconds

Node: bert/encoder/layer_11/attention/self/transpose, Execution Time: 0.000466 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_11/attention/self/key/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_11/attention/self/key/MatMul, Execution Time: 0.000609 seconds

Skipping already processed Node: bert/encoder/layer_11/attention/self/key/BiasAdd

Node: bert/encoder/layer_11/attention/self/Reshape_1, Execution Time: 0.000007 seconds

Node: bert/encoder/layer_11/attention/self/MatMul__460, Execution Time: 0.000451 seconds

Input size: (12, 256, 64, 256)
No Add node related to MatMul output: bert/encoder/layer_11/attention/self/MatMul. Executing regular MatMul.
MatMul Node: bert/encoder/layer_11/attention/self/MatMul, Execution Time: 0.000494 seconds

Node: bert/encoder/layer_11/attention/self/Mul, Execution Time: 0.001331 seconds

Add Node: bert/encoder/layer_11/attention/self/add, Execution Time: 0.001391 seconds

Node: bert/encoder/layer_11/attention/self/Softmax, Execution Time: 0.001305 seconds

Input size: (12, 256, 256, 64)
No Add node related to MatMul output: bert/encoder/layer_11/attention/self/MatMul_1. Executing regular MatMul.
MatMul Node: bert/encoder/layer_11/attention/self/MatMul_1, Execution Time: 0.000559 seconds

Node: bert/encoder/layer_11/attention/self/transpose_3, Execution Time: 0.000445 seconds

Node: bert/encoder/layer_11/attention/self/Reshape_3, Execution Time: 0.000047 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_11/attention/output/dense/MatMul
torch.Size([256, 768]) ,  torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_11/attention/output/dense/MatMul, Execution Time: 0.000668 seconds

Skipping already processed Node: bert/encoder/layer_11/attention/output/dense/BiasAdd

Skipping already processed Node: bert/encoder/layer_11/attention/output/add

Node: bert/encoder/layer_11/attention/output/LayerNorm/moments/mean, Execution Time: 0.000082 seconds

Node: bert/encoder/layer_11/attention/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000474 seconds

Node: bert/encoder/layer_11/attention/output/LayerNorm/moments/SquaredDifference__463, Execution Time: 0.000541 seconds

Node: bert/encoder/layer_11/attention/output/LayerNorm/moments/variance, Execution Time: 0.000054 seconds

Add Node: bert/encoder/layer_11/attention/output/LayerNorm/batchnorm/add, Execution Time: 0.000048 seconds

Node: bert/encoder/layer_11/attention/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000048 seconds

Node: bert/encoder/layer_11/attention/output/LayerNorm/batchnorm/Rsqrt__465, Execution Time: 0.000071 seconds

Node: bert/encoder/layer_11/attention/output/LayerNorm/batchnorm/mul, Execution Time: 0.000075 seconds

Node: bert/encoder/layer_11/attention/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000053 seconds

Node: bert/encoder/layer_11/attention/output/LayerNorm/batchnorm/sub, Execution Time: 0.000052 seconds

Node: bert/encoder/layer_11/attention/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000450 seconds

Add Node: bert/encoder/layer_11/attention/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000453 seconds

Input size: (None, 256, 768, 3072)
Fusing MatMul with Add for node: bert/encoder/layer_11/intermediate/dense/MatMul
torch.Size([256, 3072])
MatMul Fuse node: bert/encoder/layer_11/intermediate/dense/MatMul, Execution Time: 0.000818 seconds

Skipping already processed Node: bert/encoder/layer_11/intermediate/dense/BiasAdd

Node: bert/encoder/layer_11/intermediate/dense/Pow, Execution Time: 0.002038 seconds

Node: bert/encoder/layer_11/intermediate/dense/mul, Execution Time: 0.001370 seconds

Add Node: bert/encoder/layer_11/intermediate/dense/add, Execution Time: 0.001295 seconds

Node: bert/encoder/layer_11/intermediate/dense/mul_1, Execution Time: 0.001367 seconds

Node: bert/encoder/layer_11/intermediate/dense/Tanh, Execution Time: 0.001366 seconds

Add Node: bert/encoder/layer_11/intermediate/dense/add_1, Execution Time: 0.001344 seconds

Node: bert/encoder/layer_11/intermediate/dense/mul_2, Execution Time: 0.001409 seconds

Node: bert/encoder/layer_11/intermediate/dense/mul_3, Execution Time: 0.001320 seconds

Input size: (None, 256, 3072, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_11/output/dense/MatMul
torch.Size([256, 768]) ,  torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_11/output/dense/MatMul, Execution Time: 0.000977 seconds

Skipping already processed Node: bert/encoder/layer_11/output/dense/BiasAdd

Skipping already processed Node: bert/encoder/layer_11/output/add

Node: bert/encoder/layer_11/output/LayerNorm/moments/mean, Execution Time: 0.000082 seconds

Node: bert/encoder/layer_11/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000461 seconds

Node: bert/encoder/layer_11/output/LayerNorm/moments/SquaredDifference__467, Execution Time: 0.000485 seconds

Node: bert/encoder/layer_11/output/LayerNorm/moments/variance, Execution Time: 0.000055 seconds

Add Node: bert/encoder/layer_11/output/LayerNorm/batchnorm/add, Execution Time: 0.000049 seconds

Node: bert/encoder/layer_11/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000048 seconds

Node: bert/encoder/layer_11/output/LayerNorm/batchnorm/Rsqrt__469, Execution Time: 0.000070 seconds

Node: bert/encoder/layer_11/output/LayerNorm/batchnorm/mul, Execution Time: 0.000045 seconds

Node: bert/encoder/layer_11/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000052 seconds

Node: bert/encoder/layer_11/output/LayerNorm/batchnorm/sub, Execution Time: 0.000053 seconds

Node: bert/encoder/layer_11/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000533 seconds

Add Node: bert/encoder/layer_11/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000473 seconds

Input size: (None, 256, 768, 2)
Fusing MatMul with Add for node: MatMul
torch.Size([256, 2])
MatMul Fuse node: MatMul, Execution Time: 0.001725 seconds

Skipping already processed Node: BiasAdd

Node: Reshape_1, Execution Time: 0.000026 seconds

Node: transpose, Execution Time: 0.000045 seconds

Node: unstack, Execution Time: 0.000050 seconds

Node: unstack__490, Execution Time: 0.000020 seconds

Node: unstack__488, Execution Time: 0.000007 seconds


Node Execution Times:

Total Execution Time: 0.436412 seconds

Total Matmul + Add Execution Time: 0.163752 seconds
Execution complete.
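
The per-node and total times reported above can be collected by wrapping each node's execute call with a wall-clock timer. The runner's real hooks are not shown in this log, so the names in this sketch are illustrative:

```python
import time

def timed(fn, name, totals):
    # Wraps a node's execution and accumulates per-node wall time,
    # matching the "Node: ..., Execution Time: ..." lines above.
    def run(*args, **kwargs):
        t0 = time.perf_counter()
        out = fn(*args, **kwargs)
        totals[name] = totals.get(name, 0.0) + (time.perf_counter() - t0)
        return out
    return run

totals = {}
square = timed(lambda x: x * x, "square", totals)
result = square(3)
```

Summing `totals.values()` over all nodes gives the "Total Execution Time"; summing only the fused-MatMul entries gives the "Total Matmul + Add Execution Time".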
Model outputs: {'unstack:1': array([[-4.9148726, -4.6251225, -4.132886 , -4.1499195, -4.7828836,
        -4.250844 , -4.77094  , -4.348463 , -2.7006364, -4.424177 ,
        -4.510866 , -4.39433  , -4.773833 , -4.480716 , -4.7714205,
        -4.6485815, -3.1330094, -4.7139587, -4.7148943, -4.7223635,
        -4.7008233, -4.6960616, -4.7121487, -4.708615 , -4.703374 ,
        -4.7024655, -4.687359 , -4.693113 , -4.698162 , -4.692563 ,
        -4.711712 , -4.7003703, -4.7027717, -4.7279253, -4.709934 ,
        -4.715551 , -4.7324576, -4.7294855, -4.7329216, -4.7218866,
        -4.7014203, -4.694692 , -4.6925716, -4.700892 , -4.7044754,
        -4.68252  , -4.679993 , -4.6824126, -4.6833754, -4.690988 ,
        -4.695919 , -4.6797957, -4.683871 , -4.6834297, -4.680781 ,
        -4.686977 , -4.681429 , -4.680897 , -4.694978 , -4.685382 ,
        -4.70324  , -4.7010674, -4.693331 , -4.7089696, -4.71908  ,
        -4.7188516, -4.70435  , -4.685466 , -4.6962924, -4.6972375,
        -4.691828 , -4.688009 , -4.691449 , -4.693622 , -4.6890097,
        -4.6876435, -4.684474 , -4.7056074, -4.6984677, -4.7068577,
        -4.689911 , -4.687499 , -4.6927333, -4.693831 , -4.6965637,
        -4.693646 , -4.693519 , -4.71067  , -4.722037 , -4.718479 ,
        -4.729904 , -4.721483 , -4.739112 , -4.7325935, -4.7295456,
        -4.712435 , -4.712704 , -4.7114053, -4.712399 , -4.704262 ,
        -4.6972833, -4.6926665, -4.717176 , -4.6937675, -4.694539 ,
        -4.711683 , -4.685275 , -4.6935816, -4.701117 , -4.6866083,
        -4.6843753, -4.6876745, -4.684178 , -4.694061 , -4.6890798,
        -4.6861553, -4.7003927, -4.7103863, -4.710601 , -4.7194986,
        -4.7016277, -4.718649 , -4.743214 , -4.7109504, -4.711556 ,
        -4.7007613, -4.7009783, -4.6995244, -4.7007017, -4.7026825,
        -4.706376 , -4.7061615, -4.7284904, -4.724841 , -4.7082043,
        -4.7080393, -4.7098503, -4.7207146, -4.733838 , -4.7125974,
        -4.7276387, -4.721991 , -4.7300687, -4.7229652, -4.7133346,
        -4.7109923, -4.71963  , -4.7312083, -4.733224 , -4.7362647,
        -4.739877 , -4.74243  , -4.727128 , -4.737834 , -4.74598  ,
        -4.738839 , -4.744508 , -4.728359 , -4.726734 , -4.7255516,
        -4.7363386, -4.73214  , -4.7196693, -4.721826 , -4.7047076,
        -4.7190104, -4.7156587, -4.706273 , -4.7116737, -4.701518 ,
        -4.6943965, -4.6903934, -4.6890545, -4.6862764, -4.6875463,
        -4.684304 , -4.688264 , -4.691186 , -4.7027955, -4.6910152,
        -4.6985803, -4.7152886, -4.723945 , -4.7293673, -4.7427354,
        -4.73977  , -4.7290154, -4.7378254, -4.7355986, -4.731869 ,
        -4.724579 , -4.7262163, -4.71887  , -4.7058587, -4.7122684,
        -4.7009015, -4.696829 , -4.7094407, -4.703914 , -4.703702 ,
        -4.7195215, -4.7118044, -4.709847 , -4.721358 , -4.723019 ,
        -4.71298  , -4.7218485, -4.724691 , -4.725982 , -4.726673 ,
        -4.7187834, -4.709004 , -4.7109466, -4.737439 , -4.7246385,
        -4.73252  , -4.7404885, -4.7261868, -4.734698 , -4.732445 ,
        -4.736647 , -4.724646 , -4.73208  , -4.7321663, -4.7037077,
        -4.718028 , -4.726786 , -4.7345347, -4.7328334, -4.7220054,
        -4.7327023, -4.7200413, -4.7459936, -4.728972 , -4.7290406,
        -4.7259574, -4.730495 , -4.723769 , -4.7380366, -4.7268267,
        -4.692981 , -4.718449 , -4.6935935, -4.6961823, -4.713647 ,
        -4.6950507, -4.700345 , -4.7232556, -4.708386 , -4.737004 ,
        -4.7273254, -4.716681 , -4.7106347, -4.714922 , -4.7030454,
        -4.7468524]], dtype=float32), 'unstack:0': array([[-5.339778 , -4.878685 , -4.312428 , -4.3309417, -5.125337 ,
        -4.442749 , -5.1271124, -4.5656004, -4.683339 , -4.6350813,
        -4.8042274, -4.6028423, -5.1304255, -4.7185884, -5.0999007,
        -4.9003377, -5.1724668, -5.1058035, -5.1073008, -5.1120396,
        -5.0958624, -5.092071 , -5.104314 , -5.1013465, -5.0973773,
        -5.0955014, -5.086265 , -5.089708 , -5.093198 , -5.089909 ,
        -5.1028776, -5.0938663, -5.0976443, -5.1154556, -5.102868 ,
        -5.1068664, -5.1185074, -5.1169963, -5.118672 , -5.1110716,
        -5.0957775, -5.0914636, -5.089892 , -5.096351 , -5.099577 ,
        -5.084194 , -5.082636 , -5.0841656, -5.0848293, -5.089616 ,
        -5.0918293, -5.083179 , -5.084272 , -5.0856056, -5.0826926,
        -5.087329 , -5.0841713, -5.0831146, -5.092702 , -5.084974 ,
        -5.0978565, -5.0952926, -5.090936 , -5.102818 , -5.110067 ,
        -5.1097775, -5.0976253, -5.0851665, -5.0931044, -5.093152 ,
        -5.089941 , -5.0872903, -5.0898356, -5.0923924, -5.0875926,
        -5.086853 , -5.085301 , -5.100186 , -5.094749 , -5.099969 ,
        -5.0874996, -5.0855126, -5.0895004, -5.09137  , -5.0918326,
        -5.0898056, -5.090782 , -5.1034665, -5.112412 , -5.109096 ,
        -5.1174197, -5.1111536, -5.1241746, -5.1188   , -5.116848 ,
        -5.1029363, -5.1041894, -5.103745 , -5.105212 , -5.098095 ,
        -5.093282 , -5.090341 , -5.1087084, -5.0905395, -5.0906925,
        -5.1039257, -5.084995 , -5.090868 , -5.0939407, -5.0842586,
        -5.0840406, -5.0855136, -5.08409  , -5.089621 , -5.0858765,
        -5.0852404, -5.09481  , -5.1036887, -5.1036325, -5.1107006,
        -5.0964427, -5.109834 , -5.128194 , -5.104343 , -5.10455  ,
        -5.0965843, -5.0981956, -5.0968714, -5.0971923, -5.096769 ,
        -5.1019425, -5.1022315, -5.119105 , -5.116201 , -5.102627 ,
        -5.102922 , -5.1034007, -5.111492 , -5.121706 , -5.1049304,
        -5.116994 , -5.111964 , -5.1179514, -5.1140733, -5.1069007,
        -5.1045523, -5.1113954, -5.119346 , -5.1202354, -5.1230803,
        -5.1247115, -5.125494 , -5.1167865, -5.1235557, -5.127506 ,
        -5.1223035, -5.124693 , -5.116798 , -5.1166444, -5.1148844,
        -5.1223955, -5.1191473, -5.111838 , -5.112754 , -5.1008034,
        -5.1111383, -5.1085505, -5.100999 , -5.1052284, -5.0974274,
        -5.0922704, -5.0895066, -5.089077 , -5.086511 , -5.0866723,
        -5.0855794, -5.0879817, -5.0893273, -5.0967927, -5.08802  ,
        -5.093814 , -5.1059337, -5.112577 , -5.1154685, -5.121607 ,
        -5.12036  , -5.114813 , -5.1212907, -5.1178846, -5.117335 ,
        -5.1129055, -5.1143084, -5.109348 , -5.100045 , -5.1053514,
        -5.0964003, -5.0934987, -5.102238 , -5.0983605, -5.0989766,
        -5.1099577, -5.10423  , -5.1023245, -5.1104093, -5.111489 ,
        -5.1045485, -5.110909 , -5.112187 , -5.1123652, -5.113932 ,
        -5.10867  , -5.0995913, -5.101586 , -5.1216726, -5.111117 ,
        -5.116669 , -5.12195  , -5.112778 , -5.1199346, -5.117032 ,
        -5.120798 , -5.11272  , -5.117168 , -5.1175523, -5.09827  ,
        -5.1082807, -5.1146145, -5.1200075, -5.1190424, -5.112625 ,
        -5.1200185, -5.1110024, -5.126168 , -5.1168666, -5.11615  ,
        -5.113571 , -5.118028 , -5.1132293, -5.122775 , -5.1154203,
        -5.091564 , -5.1100745, -5.0914884, -5.0932784, -5.105365 ,
        -5.092105 , -5.0959387, -5.1119223, -5.101221 , -5.1215677,
        -5.114091 , -5.10658  , -5.101732 , -5.105737 , -5.0961223,
        -5.1260395]], dtype=float32), 'unique_ids:0': array([0])}

Question: What is the capital of France?
Context: The capital of France is Paris.
Answer: 
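
The `unstack:0` and `unstack:1` arrays above are (typically) the start- and end-position logits for extractive QA; the blank answer line would normally be filled by picking the best (start, end) pair over the context tokens and decoding them. A sketch of the span-selection step (token decoding omitted; parameter names are illustrative):

```python
import numpy as np

def best_span(start_logits, end_logits, max_answer_len=30):
    # Pick (i, j) maximizing start_logits[i] + end_logits[j]
    # subject to i <= j < i + max_answer_len.
    best = (-np.inf, 0, 0)
    for i in np.argsort(start_logits)[::-1][:20]:      # top-20 start candidates
        for j in np.argsort(end_logits)[::-1][:20]:    # top-20 end candidates
            if i <= j < i + max_answer_len:
                score = start_logits[i] + end_logits[j]
                if score > best[0]:
                    best = (score, int(i), int(j))
    return best[1], best[2]

start = np.full(16, -10.0, dtype=np.float32); start[3] = 5.0
end = np.full(16, -10.0, dtype=np.float32); end[7] = 4.0
span = best_span(start, end)
```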
Generating '/tmp/nsys-report-b145.qdstrm'

[1/8] [========================100%] nsys-report-048e.nsys-rep

[2/8] [========================100%] nsys-report-b910.sqlite
[3/8] Executing 'nvtx_sum' stats report
[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)       Med (ns)      Min (ns)     Max (ns)    StdDev (ns)            Name         
 --------  ---------------  ---------  -------------  -------------  -----------  -----------  ------------  ----------------------
     53.8    5,534,228,333         66   83,851,944.4  100,143,002.5        1,170  545,269,062  71,605,661.3  poll                  
     43.7    4,500,777,207          9  500,086,356.3  500,086,682.0  500,079,031  500,089,732       3,160.8  pthread_cond_timedwait
      1.6      169,470,404      5,645       30,021.3          800.0          290  156,067,077   2,077,185.0  read                  
      0.6       65,433,966      3,057       21,404.6        7,290.0          210   10,657,324     253,752.7  ioctl                 
      0.1        9,565,910      3,192        2,996.8        2,730.0        1,150       37,190       1,560.3  open64                
      0.0        5,062,319          1    5,062,319.0    5,062,319.0    5,062,319    5,062,319           0.0  nanosleep             
      0.0        3,515,399    133,713           26.3           20.0           20        7,690          37.9  pthread_cond_signal   
      0.0        3,019,548        138       21,880.8        5,050.0        2,120    1,585,212     135,490.4  mmap64                
      0.0          888,370         10       88,837.0       61,496.0       16,131      321,794      89,799.7  sem_timedwait         
      0.0          875,984         13       67,383.4       60,021.0       54,961       81,122      11,142.0  sleep                 
      0.0          507,661        583          870.8           50.0           20       57,101       5,351.7  fgets                 
      0.0          344,517         32       10,766.2        5,985.0          430       48,080      13,305.5  write                 
      0.0          339,116          8       42,389.5       38,491.0       23,730       62,011      14,666.4  pthread_create        
      0.0          303,824         27       11,252.7        7,160.0        1,910       78,201      14,616.3  mmap                  
      0.0          211,907         44        4,816.1        2,895.0        1,130       23,071       4,821.3  fopen                 
      0.0          187,553          9       20,839.2        4,420.0        2,370       83,491      31,534.9  munmap                
      0.0          167,402        173          967.6          820.0          500        3,971         515.7  pread64               
      0.0          124,571          1      124,571.0      124,571.0      124,571      124,571           0.0  pthread_cond_wait     
      0.0          100,471          1      100,471.0      100,471.0      100,471      100,471           0.0  waitpid               
      0.0           61,040      1,622           37.6           30.0           20        4,320         147.5  pthread_cond_broadcast
      0.0           57,899         41        1,412.2        1,150.0          660        4,790         867.4  fclose                
      0.0           54,840         15        3,656.0        3,270.0        1,820        6,590       1,615.8  open                  
      0.0           38,309          6        6,384.8        4,239.5        2,220       18,640       6,173.0  pipe2                 
      0.0           32,631          2       16,315.5       16,315.5        9,130       23,501      10,161.8  connect               
      0.0           31,867        133          239.6          250.0           20        1,480         163.8  sigaction             
      0.0           29,977      1,211           24.8           20.0           20          151           6.3  flockfile             
      0.0           29,391          4        7,347.8        7,470.0        3,370       11,081       4,026.6  socket                
      0.0           22,437         68          330.0          300.0          180        1,160         173.5  fcntl                 
      0.0           20,210          6        3,368.3        2,620.0        1,360        7,370       2,188.4  fopen64               
      0.0           16,430        192           85.6          100.0           20          550          66.3  pthread_mutex_trylock 
      0.0           15,540          3        5,180.0        5,620.0        1,600        8,320       3,381.5  fread                 
      0.0            8,140          2        4,070.0        4,070.0        2,350        5,790       2,432.4  bind                  
      0.0            3,480          2        1,740.0        1,740.0          800        2,680       1,329.4  fwrite                
      0.0            2,629         10          262.9          260.0          189          360          49.6  dup                   
      0.0            2,602         30           86.7           30.0           20          900         182.5  fflush                
      0.0            2,250          2        1,125.0        1,125.0          660        1,590         657.6  dup2                  
      0.0              769          1          769.0          769.0          769          769           0.0  getc                  
      0.0              680          1          680.0          680.0          680          680           0.0  listen                

[5/8] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls  Avg (ns)   Med (ns)   Min (ns)   Max (ns)   StdDev (ns)                Name               
 --------  ---------------  ---------  ---------  ---------  --------  ----------  -----------  ---------------------------------
     66.8      458,889,319      1,804  254,373.2   53,460.5     2,211   2,177,000    394,198.6  cudaMemcpyAsync                  
     16.4      112,515,093      1,804   62,369.8   11,100.0       650     257,654     79,474.5  cudaStreamSynchronize            
     10.9       75,217,217        707  106,389.3    7,460.0     2,850  16,497,323    927,843.2  cudaLaunchKernel                 
      1.5       10,141,562         98  103,485.3   91,441.5     5,390     327,454     88,149.1  cuCtxSynchronize                 
      1.4        9,551,675      2,624    3,640.1    3,085.0       490      20,001      2,831.1  cudaDeviceSynchronize            
      1.0        6,839,815      2,624    2,606.6    1,560.0     1,190      32,571      2,225.5  cudaEventRecord                  
      0.9        6,327,816         26  243,377.5      715.0       290   6,308,675  1,237,082.8  cudaStreamIsCapturing_v10000     
      0.4        2,729,205         23  118,661.1  126,411.0    73,641     167,492     30,706.1  cudaMalloc                       
      0.3        1,776,952      2,624      677.2      600.0       240      18,670        548.3  cudaEventCreateWithFlags         
      0.2        1,274,525         98   13,005.4   12,935.0     7,760      27,621      1,979.2  cuLaunchKernel                   
      0.1          922,031      2,624      351.4      300.0       180       7,720        263.5  cudaEventDestroy                 
      0.1          361,385          5   72,277.0   70,091.0    56,771      89,731     12,660.4  cuModuleLoadData                 
      0.0          326,636      1,149      284.3      200.0        50       7,880        367.3  cuGetProcAddress_v2              
      0.0          262,753         50    5,255.1    5,465.0     3,130       9,450      1,868.7  cudaMemsetAsync                  
      0.0          171,663          1  171,663.0  171,663.0   171,663     171,663          0.0  cudaGetDeviceProperties_v2_v12000
      0.0            3,930          3    1,310.0    1,300.0       510       2,120        805.0  cuInit                           
      0.0            3,530          1    3,530.0    3,530.0     3,530       3,530          0.0  cuMemFree_v2                     
      0.0              950          3      316.7      240.0        60         650        302.4  cuModuleGetLoadingMode           
      0.0              840          1      840.0      840.0       840         840          0.0  cuCtxSetCurrent                  
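One thing worth pulling out of the 'cuda_api_sum' table above: cudaMemcpyAsync alone accounts for 66.8% of API time, with cudaStreamSynchronize (16.4%) and cudaLaunchKernel (10.9%) well behind. A minimal sketch for ranking these rows programmatically — the parser is a hypothetical helper of mine (not part of nsys), and it relies on the assumption that each API name in this table is a single token, which holds for the rows shown:

```python
# Sample rows copied verbatim from the 'cuda_api_sum' report above.
ROWS = """\
    66.8      458,889,319      1,804  254,373.2   53,460.5     2,211   2,177,000    394,198.6  cudaMemcpyAsync
    16.4      112,515,093      1,804   62,369.8   11,100.0       650     257,654     79,474.5  cudaStreamSynchronize
    10.9       75,217,217        707  106,389.3    7,460.0     2,850  16,497,323    927,843.2  cudaLaunchKernel
"""

def parse_api_rows(text):
    """Return [(time_pct, total_ns, name)] for each whitespace-separated data row."""
    out = []
    for line in text.strip().splitlines():
        fields = line.split()
        time_pct = float(fields[0])
        total_ns = int(fields[1].replace(",", ""))  # strip thousands separators
        out.append((time_pct, total_ns, fields[-1]))
    return out

rows = parse_api_rows(ROWS)
top = max(rows, key=lambda r: r[0])
print(top[2], top[0])  # → cudaMemcpyAsync 66.8
```

The dominance of cudaMemcpyAsync plus the large cudaStreamSynchronize share suggests this run is transfer-bound rather than kernel-bound, which is consistent with the memcpy tables further down.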

[6/8] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name                                                
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     84.2        9,532,306         97  98,271.2  84,480.0    11,072   322,784     89,061.0  cutlass_tensorop_s1688tf32gemm_256x128_16x3_tt_align4                                               
      3.1          345,470        125   2,763.8   2,368.0     1,343     6,016      1,403.3  void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl<at::native::…
      2.8          315,425        121   2,606.8   2,304.0     1,280     4,288        724.6  void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl<at::native::…
      2.0          225,953         75   3,012.7   2,368.0     1,600     4,993      1,136.4  void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl<at::native::…
      1.9          217,087        123   1,764.9   1,280.0       800     3,136        758.5  void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor<float>, at::detail::…
      1.6          182,945         50   3,658.9   3,712.0     3,488     4,000        149.1  void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp<float, at::native::MeanOps<fl…
      0.9          104,833         12   8,736.1   8,688.0     8,608     9,248        171.0  void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl<at::native::…
      0.9          103,266         64   1,613.5     960.0       864     4,544      1,348.0  void at::native::vectorized_elementwise_kernel<(int)4, at::native::CUDAFunctor_add<float>, at::deta…
      0.9           96,288         37   2,602.4   1,824.0     1,728     4,384      1,188.8  void at::native::vectorized_elementwise_kernel<(int)4, at::native::BinaryFunctor<float, float, floa…
      0.6           71,392         12   5,949.3   5,952.0     5,856     6,048         66.1  void <unnamed>::softmax_warp_forward<float, float, float, (int)8, (bool)0, (bool)0>(T2 *, const T1 …
      0.4           45,536         12   3,794.7   3,792.0     3,712     3,872         53.6  void at::native::vectorized_elementwise_kernel<(int)4, at::native::tanh_kernel_cuda(at::TensorItera…
      0.2           25,922         25   1,036.9   1,024.0     1,024     1,057         16.1  void at::native::vectorized_elementwise_kernel<(int)4, at::native::reciprocal_kernel_cuda(at::Tenso…
      0.2           25,664         25   1,026.6   1,024.0       992     1,056         12.8  void at::native::vectorized_elementwise_kernel<(int)4, at::native::sqrt_kernel_cuda(at::TensorItera…
      0.2           22,911         25     916.4     928.0       895       928         15.7  void at::native::vectorized_elementwise_kernel<(int)4, at::native::AUnaryFunctor<float, float, floa…
      0.1            5,760          1   5,760.0   5,760.0     5,760     5,760          0.0  cutlass_tensorop_s1688tf32gemm_256x128_16x3_tt_align2                                               
      0.0            1,600          1   1,600.0   1,600.0     1,600     1,600          0.0  void at::native::<unnamed>::CatArrayBatchedCopy_aligned16_contig<int, unsigned int, (int)1, (int)12…

[7/8] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count  Avg (ns)   Med (ns)   Min (ns)  Max (ns)   StdDev (ns)           Operation          
 --------  ---------------  -----  ---------  ---------  --------  ---------  -----------  ----------------------------
     56.6      187,729,069  1,157  162,255.0  119,617.0       287  2,133,603    254,865.0  [CUDA memcpy Host-to-Device]
     43.4      143,824,334    647  222,294.2  117,216.0     1,056  1,011,362    282,104.0  [CUDA memcpy Device-to-Host]
      0.0           24,355     50      487.1      320.0       288      1,088        294.8  [CUDA memset]               

[8/8] Executing 'cuda_gpu_mem_size_sum' stats report

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)           Operation          
 ----------  -----  --------  --------  --------  --------  -----------  ----------------------------
  1,224.510  1,157     1.058     0.786     0.000     9.437        1.648  [CUDA memcpy Host-to-Device]
    707.510    647     1.094     0.786     0.000     3.146        1.206  [CUDA memcpy Device-to-Host]
      0.000     50     0.000     0.000     0.000     0.000        0.000  [CUDA memset]               
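Combining the 'cuda_gpu_mem_time_sum' and 'cuda_gpu_mem_size_sum' tables gives the effective memcpy throughput for this run — a quick arithmetic check using only the totals reported above:

```python
# Totals taken from the two CUDA memcpy summary tables above.
h2d_mb, h2d_ns = 1224.510, 187_729_069  # Host-to-Device
d2h_mb, d2h_ns = 707.510, 143_824_334   # Device-to-Host

def gb_per_s(mb, ns):
    """Effective throughput in GB/s given total MB moved and total time in ns."""
    return (mb / 1024) / (ns / 1e9)

print(f"H2D: {gb_per_s(h2d_mb, h2d_ns):.2f} GB/s")  # ~6.37 GB/s
print(f"D2H: {gb_per_s(d2h_mb, d2h_ns):.2f} GB/s")  # ~4.80 GB/s
```

Both figures are well below typical PCIe 3.0/4.0 peak rates for pinned transfers, which would fit the many small copies implied by the ~1 MB average transfer size in the size table.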

Generated:
    /tmp/nsys-report-048e.nsys-rep
    /tmp/nsys-report-b910.sqlite
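The generated .sqlite export can be queried directly for custom aggregations beyond the built-in stats reports. The table and column names below (CUPTI_ACTIVITY_KIND_KERNEL, StringIds, demangledName) follow the nsys SQLite export schema but can differ between nsys versions, so treat this as a sketch; to keep it runnable as-is, it builds a tiny in-memory stand-in instead of opening the real report file:

```python
import sqlite3

# In-memory stand-in for the nsys export; with a real report you would
# instead connect to the generated file (e.g. the .sqlite path listed above).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE StringIds (id INTEGER PRIMARY KEY, value TEXT);
CREATE TABLE CUPTI_ACTIVITY_KIND_KERNEL (
    start INTEGER, end INTEGER, demangledName INTEGER);
INSERT INTO StringIds VALUES
    (1, 'cutlass_tensorop_s1688tf32gemm_256x128_16x3_tt_align4');
INSERT INTO CUPTI_ACTIVITY_KIND_KERNEL VALUES
    (0, 84480, 1), (100000, 184480, 1);
""")

# Per-kernel totals, analogous to the 'cuda_gpu_kern_sum' report.
query = """
SELECT s.value AS kernel,
       COUNT(*) AS instances,
       SUM(k.end - k.start) AS total_ns
FROM CUPTI_ACTIVITY_KIND_KERNEL k
JOIN StringIds s ON s.id = k.demangledName
GROUP BY s.value
ORDER BY total_ns DESC;
"""
for kernel, instances, total_ns in con.execute(query):
    print(f"{kernel}: {instances} instances, {total_ns} ns")
```

Against the real export this kind of query can reproduce (and extend) the built-in summaries, e.g. filtering kernels by time window or correlating them with the memcpy activity tables.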