BERT ONNX model: node-by-node execution log
Model Input Name: unique_ids_raw_output___9:0, Shape: [0]
Model Input Name: segment_ids:0, Shape: [0, 256]
Model Input Name: input_mask:0, Shape: [0, 256]
Model Input Name: input_ids:0, Shape: [0, 256]

Starting model execution...

Inputs Details:
Input Name: input_ids:0
Shape: (1, 256)
Data (first 10 values): [ 101 2054 2003 1996 3007 1997 2605 1029 102 1996]...
--------------------------------------------------
Input Name: segment_ids:0
Shape: (1, 256)
Data (first 10 values): [0 0 0 0 0 0 0 0 0 1]...
--------------------------------------------------
Input Name: input_mask:0
Shape: (1, 256)
Data (first 10 values): [1 1 1 1 1 1 1 1 1 1]...
--------------------------------------------------
Input Name: unique_ids_raw_output___9:0
Shape: (1,)
Data (first 10 values): [0]...
--------------------------------------------------

Node: unique_ids_graph_outputs_Identity__10, Execution Time: 0.000511 seconds
Node: bert/encoder/Shape, Execution Time: 0.000030 seconds
Node: bert/encoder/Shape__12, Execution Time: 0.000038 seconds
Node: bert/encoder/strided_slice, Execution Time: 0.000173 seconds
Node: bert/encoder/strided_slice__16, Execution Time: 0.000029 seconds
Node: bert/encoder/strided_slice__17, Execution Time: 0.000020 seconds
Node: bert/encoder/ones/packed_Unsqueeze__18, Execution Time: 0.000035 seconds
Node: bert/encoder/ones/packed_Concat__21, Execution Time: 0.004840 seconds
Node: bert/encoder/ones__22, Execution Time: 0.000027 seconds
Node: bert/encoder/ones, Execution Time: 0.000075 seconds
Node: bert/encoder/Reshape, Execution Time: 0.000039 seconds
Node: bert/encoder/Cast, Execution Time: 0.000020 seconds
Node: bert/encoder/mul, Execution Time: 0.007645 seconds
Node: bert/encoder/layer_9/attention/self/ExpandDims, Execution Time: 0.000020 seconds
Node: bert/encoder/layer_9/attention/self/sub, Execution Time: 0.006671 seconds
Node: bert/encoder/layer_9/attention/self/mul_1, Execution Time: 0.000213 seconds
Node: bert/embeddings/Reshape_2, Execution Time: 0.000020 seconds
Node: bert/embeddings/Reshape, Execution Time: 0.000005 seconds
Node: bert/embeddings/GatherV2, Execution Time: 0.000162 seconds
Node: bert/embeddings/Reshape_1, Execution Time: 0.000020 seconds
Node: bert/embeddings/one_hot, Execution Time: 0.000219 seconds
Input size: (None, 256, 2, 768)
No Add node related to MatMul output: bert/embeddings/MatMul. Executing regular MatMul.
MatMul Node: bert/embeddings/MatMul, Execution Time: 0.027465 seconds
Node: bert/embeddings/Reshape_3, Execution Time: 0.000025 seconds
Add Node: bert/embeddings/add, Execution Time: 0.000611 seconds
Add Node: bert/embeddings/add_1, Execution Time: 0.000467 seconds
Node: bert/embeddings/LayerNorm/moments/mean, Execution Time: 0.005089 seconds
Node: bert/embeddings/LayerNorm/moments/SquaredDifference, Execution Time: 0.000502 seconds
Node: bert/embeddings/LayerNorm/moments/SquaredDifference__72, Execution Time: 0.000517 seconds
Node: bert/embeddings/LayerNorm/moments/variance, Execution Time: 0.000074 seconds
Add Node: bert/embeddings/LayerNorm/batchnorm/add, Execution Time: 0.000064 seconds
Node: bert/embeddings/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.010280 seconds
Node: bert/embeddings/LayerNorm/batchnorm/Rsqrt__74, Execution Time: 0.005450 seconds
Node: bert/embeddings/LayerNorm/batchnorm/mul, Execution Time: 0.000053 seconds
Node: bert/embeddings/LayerNorm/batchnorm/mul_2, Execution Time: 0.000053 seconds
Node: bert/embeddings/LayerNorm/batchnorm/sub, Execution Time: 0.000069 seconds
Node: bert/embeddings/LayerNorm/batchnorm/mul_1, Execution Time: 0.000455 seconds
Add Node: bert/embeddings/LayerNorm/batchnorm/add_1, Execution Time: 0.000453 seconds
Node: bert/encoder/Reshape_1, Execution Time: 0.000024 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_0/attention/self/value/MatMul torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_0/attention/self/value/MatMul, Execution Time: 0.001809 seconds
Skipping already processed Node: bert/encoder/layer_0/attention/self/value/BiasAdd
Node: bert/encoder/layer_0/attention/self/Reshape_2, Execution Time: 0.000020 seconds
Node: bert/encoder/layer_0/attention/self/transpose_2, Execution Time: 0.000505 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_0/attention/self/query/MatMul torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_0/attention/self/query/MatMul, Execution Time: 0.000672 seconds
Skipping already processed Node: bert/encoder/layer_0/attention/self/query/BiasAdd
Node: bert/encoder/layer_0/attention/self/Reshape, Execution Time: 0.000020 seconds
Node: bert/encoder/layer_0/attention/self/transpose, Execution Time: 0.000450 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_0/attention/self/key/MatMul torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_0/attention/self/key/MatMul, Execution Time: 0.000619 seconds
Skipping already processed Node: bert/encoder/layer_0/attention/self/key/BiasAdd
Node: bert/encoder/layer_0/attention/self/Reshape_1, Execution Time: 0.000009 seconds
Node: bert/encoder/layer_0/attention/self/MatMul__306, Execution Time: 0.000444 seconds
Input size: (12, 256, 64, 256)
No Add node related to MatMul output: bert/encoder/layer_0/attention/self/MatMul. Executing regular MatMul.
MatMul Node: bert/encoder/layer_0/attention/self/MatMul, Execution Time: 0.001491 seconds
Node: bert/encoder/layer_0/attention/self/Mul, Execution Time: 0.001327 seconds
Add Node: bert/encoder/layer_0/attention/self/add, Execution Time: 0.001349 seconds
Node: bert/encoder/layer_0/attention/self/Softmax, Execution Time: 0.009065 seconds
Input size: (12, 256, 256, 64)
No Add node related to MatMul output: bert/encoder/layer_0/attention/self/MatMul_1. Executing regular MatMul.
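The "Fusing MatMul with Add", "No Add node related to MatMul output", and "Skipping already processed Node" lines suggest the runner looks for a bias Add downstream of each MatMul, folds it into a single fused call, and marks the Add so it is not executed a second time. A minimal numpy sketch of that pattern, assuming a simple dict-based graph representation (the names `run_matmul_maybe_fused`, `processed`, and the graph layout are hypothetical, not the actual runner's API):

```python
import numpy as np

def run_matmul_maybe_fused(graph, node, tensors, processed):
    """Run a MatMul node; if its output feeds exactly one Add,
    fold the bias into the same call and mark the Add as processed."""
    a = tensors[node["inputs"][0]]
    w = tensors[node["inputs"][1]]
    # Find a downstream Add consuming this MatMul's output (if any).
    consumers = [n for n in graph
                 if node["output"] in n.get("inputs", []) and n["op"] == "Add"]
    if len(consumers) == 1:
        add = consumers[0]
        bias_name = next(i for i in add["inputs"] if i != node["output"])
        out = a @ w + tensors[bias_name]   # fused MatMul + bias Add
        tensors[add["output"]] = out
        processed.add(add["name"])         # later: "Skipping already processed"
    else:
        out = a @ w                        # "Executing regular MatMul."
    tensors[node["output"]] = out
    return out
```

When the Add is consumed this way, the separate BiasAdd node is skipped on its turn, which is exactly what the "Skipping already processed Node: .../BiasAdd" lines record.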
MatMul Node: bert/encoder/layer_0/attention/self/MatMul_1, Execution Time: 0.000635 seconds
Node: bert/encoder/layer_0/attention/self/transpose_3, Execution Time: 0.000550 seconds
Node: bert/encoder/layer_0/attention/self/Reshape_3, Execution Time: 0.000058 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_0/attention/output/dense/MatMul torch.Size([256, 768]) , torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_0/attention/output/dense/MatMul, Execution Time: 0.001760 seconds
Skipping already processed Node: bert/encoder/layer_0/attention/output/dense/BiasAdd
Skipping already processed Node: bert/encoder/layer_0/attention/output/add
Node: bert/encoder/layer_0/attention/output/LayerNorm/moments/mean, Execution Time: 0.000082 seconds
Node: bert/encoder/layer_0/attention/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000634 seconds
Node: bert/encoder/layer_0/attention/output/LayerNorm/moments/SquaredDifference__309, Execution Time: 0.000473 seconds
Node: bert/encoder/layer_0/attention/output/LayerNorm/moments/variance, Execution Time: 0.000054 seconds
Add Node: bert/encoder/layer_0/attention/output/LayerNorm/batchnorm/add, Execution Time: 0.000064 seconds
Node: bert/encoder/layer_0/attention/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000047 seconds
Node: bert/encoder/layer_0/attention/output/LayerNorm/batchnorm/Rsqrt__311, Execution Time: 0.000068 seconds
Node: bert/encoder/layer_0/attention/output/LayerNorm/batchnorm/mul, Execution Time: 0.000055 seconds
Node: bert/encoder/layer_0/attention/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000041 seconds
Node: bert/encoder/layer_0/attention/output/LayerNorm/batchnorm/sub, Execution Time: 0.000046 seconds
Node: bert/encoder/layer_0/attention/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000464 seconds
Add Node: bert/encoder/layer_0/attention/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000457 seconds

Input size: (None, 256, 768, 3072)
Fusing MatMul with Add for node: bert/encoder/layer_0/intermediate/dense/MatMul torch.Size([256, 3072])
MatMul Fuse node: bert/encoder/layer_0/intermediate/dense/MatMul, Execution Time: 0.000690 seconds
Skipping already processed Node: bert/encoder/layer_0/intermediate/dense/BiasAdd
Node: bert/encoder/layer_0/intermediate/dense/Pow, Execution Time: 0.018049 seconds
Node: bert/encoder/layer_0/intermediate/dense/mul, Execution Time: 0.001407 seconds
Add Node: bert/encoder/layer_0/intermediate/dense/add, Execution Time: 0.001314 seconds
Node: bert/encoder/layer_0/intermediate/dense/mul_1, Execution Time: 0.001507 seconds
Node: bert/encoder/layer_0/intermediate/dense/Tanh, Execution Time: 0.003959 seconds
Add Node: bert/encoder/layer_0/intermediate/dense/add_1, Execution Time: 0.001380 seconds
Node: bert/encoder/layer_0/intermediate/dense/mul_2, Execution Time: 0.001314 seconds
Node: bert/encoder/layer_0/intermediate/dense/mul_3, Execution Time: 0.001374 seconds

Input size: (None, 256, 3072, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_0/output/dense/MatMul torch.Size([256, 768]) , torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_0/output/dense/MatMul, Execution Time: 0.001047 seconds
Skipping already processed Node: bert/encoder/layer_0/output/dense/BiasAdd
Skipping already processed Node: bert/encoder/layer_0/output/add
Node: bert/encoder/layer_0/output/LayerNorm/moments/mean, Execution Time: 0.000100 seconds
Node: bert/encoder/layer_0/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000494 seconds
Node: bert/encoder/layer_0/output/LayerNorm/moments/SquaredDifference__313, Execution Time: 0.000547 seconds
Node: bert/encoder/layer_0/output/LayerNorm/moments/variance, Execution Time: 0.000057 seconds
Add Node: bert/encoder/layer_0/output/LayerNorm/batchnorm/add, Execution Time: 0.000063 seconds
Node: bert/encoder/layer_0/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000046 seconds
Node: bert/encoder/layer_0/output/LayerNorm/batchnorm/Rsqrt__315, Execution Time: 0.000076 seconds
Node: bert/encoder/layer_0/output/LayerNorm/batchnorm/mul, Execution Time: 0.000056 seconds
Node: bert/encoder/layer_0/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000051 seconds
Node: bert/encoder/layer_0/output/LayerNorm/batchnorm/sub, Execution Time: 0.000053 seconds
Node: bert/encoder/layer_0/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000486 seconds
Add Node: bert/encoder/layer_0/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000471 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_1/attention/self/value/MatMul torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_1/attention/self/value/MatMul, Execution Time: 0.000654 seconds
Skipping already processed Node: bert/encoder/layer_1/attention/self/value/BiasAdd
Node: bert/encoder/layer_1/attention/self/Reshape_2, Execution Time: 0.000020 seconds
Node: bert/encoder/layer_1/attention/self/transpose_2, Execution Time: 0.000449 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_1/attention/self/query/MatMul torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_1/attention/self/query/MatMul, Execution Time: 0.000632 seconds
Skipping already processed Node: bert/encoder/layer_1/attention/self/query/BiasAdd
Node: bert/encoder/layer_1/attention/self/Reshape, Execution Time: 0.000020 seconds
Node: bert/encoder/layer_1/attention/self/transpose, Execution Time: 0.000474 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_1/attention/self/key/MatMul torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_1/attention/self/key/MatMul, Execution Time: 0.000604 seconds
Skipping already processed Node: bert/encoder/layer_1/attention/self/key/BiasAdd
Node: bert/encoder/layer_1/attention/self/Reshape_1, Execution Time: 0.000009 seconds
Node: bert/encoder/layer_1/attention/self/MatMul__320, Execution Time: 0.000483 seconds
Input size: (12, 256, 64, 256)
No Add node related to MatMul output: bert/encoder/layer_1/attention/self/MatMul. Executing regular MatMul.
MatMul Node: bert/encoder/layer_1/attention/self/MatMul, Execution Time: 0.000508 seconds
Node: bert/encoder/layer_1/attention/self/Mul, Execution Time: 0.001349 seconds
Add Node: bert/encoder/layer_1/attention/self/add, Execution Time: 0.001579 seconds
Node: bert/encoder/layer_1/attention/self/Softmax, Execution Time: 0.001335 seconds
Input size: (12, 256, 256, 64)
No Add node related to MatMul output: bert/encoder/layer_1/attention/self/MatMul_1. Executing regular MatMul.
MatMul Node: bert/encoder/layer_1/attention/self/MatMul_1, Execution Time: 0.000563 seconds
Node: bert/encoder/layer_1/attention/self/transpose_3, Execution Time: 0.000447 seconds
Node: bert/encoder/layer_1/attention/self/Reshape_3, Execution Time: 0.000047 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_1/attention/output/dense/MatMul torch.Size([256, 768]) , torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_1/attention/output/dense/MatMul, Execution Time: 0.000678 seconds
Skipping already processed Node: bert/encoder/layer_1/attention/output/dense/BiasAdd
Skipping already processed Node: bert/encoder/layer_1/attention/output/add
Node: bert/encoder/layer_1/attention/output/LayerNorm/moments/mean, Execution Time: 0.000081 seconds
Node: bert/encoder/layer_1/attention/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000606 seconds
Node: bert/encoder/layer_1/attention/output/LayerNorm/moments/SquaredDifference__323, Execution Time: 0.000474 seconds
Node: bert/encoder/layer_1/attention/output/LayerNorm/moments/variance, Execution Time: 0.000053 seconds
Add Node: bert/encoder/layer_1/attention/output/LayerNorm/batchnorm/add, Execution Time: 0.000064 seconds
Node: bert/encoder/layer_1/attention/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000050 seconds
Node: bert/encoder/layer_1/attention/output/LayerNorm/batchnorm/Rsqrt__325, Execution Time: 0.000074 seconds
Node: bert/encoder/layer_1/attention/output/LayerNorm/batchnorm/mul, Execution Time: 0.000055 seconds
Node: bert/encoder/layer_1/attention/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000053 seconds
Node: bert/encoder/layer_1/attention/output/LayerNorm/batchnorm/sub, Execution Time: 0.000041 seconds
Node: bert/encoder/layer_1/attention/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000466 seconds
Add Node: bert/encoder/layer_1/attention/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000446 seconds

Input size: (None, 256, 768, 3072)
Fusing MatMul with Add for node: bert/encoder/layer_1/intermediate/dense/MatMul torch.Size([256, 3072])
MatMul Fuse node: bert/encoder/layer_1/intermediate/dense/MatMul, Execution Time: 0.000661 seconds
Skipping already processed Node: bert/encoder/layer_1/intermediate/dense/BiasAdd
Node: bert/encoder/layer_1/intermediate/dense/Pow, Execution Time: 0.001371 seconds
Node: bert/encoder/layer_1/intermediate/dense/mul, Execution Time: 0.001382 seconds
Add Node: bert/encoder/layer_1/intermediate/dense/add, Execution Time: 0.001623 seconds
Node: bert/encoder/layer_1/intermediate/dense/mul_1, Execution Time: 0.001303 seconds
Node: bert/encoder/layer_1/intermediate/dense/Tanh, Execution Time: 0.001375 seconds
Add Node: bert/encoder/layer_1/intermediate/dense/add_1, Execution Time: 0.001320 seconds
Node: bert/encoder/layer_1/intermediate/dense/mul_2, Execution Time: 0.001378 seconds
Node: bert/encoder/layer_1/intermediate/dense/mul_3, Execution Time: 0.001307 seconds

Input size: (None, 256, 3072, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_1/output/dense/MatMul torch.Size([256, 768]) , torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_1/output/dense/MatMul, Execution Time: 0.001064 seconds
Skipping already processed Node: bert/encoder/layer_1/output/dense/BiasAdd
Skipping already processed Node: bert/encoder/layer_1/output/add
Node: bert/encoder/layer_1/output/LayerNorm/moments/mean, Execution Time: 0.000084 seconds
Node: bert/encoder/layer_1/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000484 seconds
Node: bert/encoder/layer_1/output/LayerNorm/moments/SquaredDifference__327, Execution Time: 0.000571 seconds
Node: bert/encoder/layer_1/output/LayerNorm/moments/variance, Execution Time: 0.000056 seconds
Add Node: bert/encoder/layer_1/output/LayerNorm/batchnorm/add, Execution Time: 0.000055 seconds
Node: bert/encoder/layer_1/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000047 seconds
Node: bert/encoder/layer_1/output/LayerNorm/batchnorm/Rsqrt__329, Execution Time: 0.000080 seconds
Node: bert/encoder/layer_1/output/LayerNorm/batchnorm/mul, Execution Time: 0.000052 seconds
Node: bert/encoder/layer_1/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000042 seconds
Node: bert/encoder/layer_1/output/LayerNorm/batchnorm/sub, Execution Time: 0.000051 seconds
Node: bert/encoder/layer_1/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000450 seconds
Add Node: bert/encoder/layer_1/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000466 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_2/attention/self/value/MatMul torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_2/attention/self/value/MatMul, Execution Time: 0.000678 seconds
Skipping already processed Node: bert/encoder/layer_2/attention/self/value/BiasAdd
Node: bert/encoder/layer_2/attention/self/Reshape_2, Execution Time: 0.000020 seconds
Node: bert/encoder/layer_2/attention/self/transpose_2, Execution Time: 0.000461 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_2/attention/self/query/MatMul torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_2/attention/self/query/MatMul, Execution Time: 0.000645 seconds
Skipping already processed Node: bert/encoder/layer_2/attention/self/query/BiasAdd
Node: bert/encoder/layer_2/attention/self/Reshape, Execution Time: 0.000010 seconds
Node: bert/encoder/layer_2/attention/self/transpose, Execution Time: 0.000476 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_2/attention/self/key/MatMul torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_2/attention/self/key/MatMul, Execution Time: 0.000615 seconds
Skipping already processed Node: bert/encoder/layer_2/attention/self/key/BiasAdd
Node: bert/encoder/layer_2/attention/self/Reshape_1, Execution Time: 0.000008 seconds
Node: bert/encoder/layer_2/attention/self/MatMul__334, Execution Time: 0.000464 seconds
Input size: (12, 256, 64, 256)
No Add node related to MatMul output: bert/encoder/layer_2/attention/self/MatMul. Executing regular MatMul.
MatMul Node: bert/encoder/layer_2/attention/self/MatMul, Execution Time: 0.000499 seconds
Node: bert/encoder/layer_2/attention/self/Mul, Execution Time: 0.001384 seconds
Add Node: bert/encoder/layer_2/attention/self/add, Execution Time: 0.001380 seconds
Node: bert/encoder/layer_2/attention/self/Softmax, Execution Time: 0.001305 seconds
Input size: (12, 256, 256, 64)
No Add node related to MatMul output: bert/encoder/layer_2/attention/self/MatMul_1. Executing regular MatMul.
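Every layer's self-attention block repeats the same node chain: a scores MatMul over the transposed Q/K heads, a Mul by 1/sqrt(head_dim) (64 in this log, given the (12, 256, 64, 256) input), an Add of the precomputed mask bias (the `layer_9/attention/self/sub`/`mul_1` output near the top of the log), a Softmax, and a final MatMul_1 with V. A numpy sketch of that chain, mapped to the node names as comments (a minimal illustration, not the runner's actual kernels):

```python
import numpy as np

def self_attention(q, k, v, mask_bias):
    """Mirror the per-layer node chain from the log:
    MatMul -> Mul (1/sqrt(head_dim)) -> Add (mask) -> Softmax -> MatMul_1."""
    head_dim = q.shape[-1]                        # 64 per head in the log
    scores = q @ k.swapaxes(-1, -2)               # .../attention/self/MatMul
    scores = scores * (1.0 / np.sqrt(head_dim))   # .../attention/self/Mul
    scores = scores + mask_bias                   # .../attention/self/add
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)    # .../attention/self/Softmax
    return probs @ v                              # .../attention/self/MatMul_1
```

The mask bias is additive: positions to be ignored carry a large negative value so their post-softmax weight is effectively zero, which is why the mask enters as an Add before the Softmax rather than a multiply after it.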
MatMul Node: bert/encoder/layer_2/attention/self/MatMul_1, Execution Time: 0.000562 seconds
Node: bert/encoder/layer_2/attention/self/transpose_3, Execution Time: 0.000456 seconds
Node: bert/encoder/layer_2/attention/self/Reshape_3, Execution Time: 0.000037 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_2/attention/output/dense/MatMul torch.Size([256, 768]) , torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_2/attention/output/dense/MatMul, Execution Time: 0.000755 seconds
Skipping already processed Node: bert/encoder/layer_2/attention/output/dense/BiasAdd
Skipping already processed Node: bert/encoder/layer_2/attention/output/add
Node: bert/encoder/layer_2/attention/output/LayerNorm/moments/mean, Execution Time: 0.000100 seconds
Node: bert/encoder/layer_2/attention/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000583 seconds
Node: bert/encoder/layer_2/attention/output/LayerNorm/moments/SquaredDifference__337, Execution Time: 0.000602 seconds
Node: bert/encoder/layer_2/attention/output/LayerNorm/moments/variance, Execution Time: 0.000071 seconds
Add Node: bert/encoder/layer_2/attention/output/LayerNorm/batchnorm/add, Execution Time: 0.000054 seconds
Node: bert/encoder/layer_2/attention/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000078 seconds
Node: bert/encoder/layer_2/attention/output/LayerNorm/batchnorm/Rsqrt__339, Execution Time: 0.000089 seconds
Node: bert/encoder/layer_2/attention/output/LayerNorm/batchnorm/mul, Execution Time: 0.000055 seconds
Node: bert/encoder/layer_2/attention/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000042 seconds
Node: bert/encoder/layer_2/attention/output/LayerNorm/batchnorm/sub, Execution Time: 0.000053 seconds
Node: bert/encoder/layer_2/attention/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000518 seconds
Add Node: bert/encoder/layer_2/attention/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000451 seconds

Input size: (None, 256, 768, 3072)
Fusing MatMul with Add for node: bert/encoder/layer_2/intermediate/dense/MatMul torch.Size([256, 3072])
MatMul Fuse node: bert/encoder/layer_2/intermediate/dense/MatMul, Execution Time: 0.000782 seconds
Skipping already processed Node: bert/encoder/layer_2/intermediate/dense/BiasAdd
Node: bert/encoder/layer_2/intermediate/dense/Pow, Execution Time: 0.001319 seconds
Node: bert/encoder/layer_2/intermediate/dense/mul, Execution Time: 0.001400 seconds
Add Node: bert/encoder/layer_2/intermediate/dense/add, Execution Time: 0.001352 seconds
Node: bert/encoder/layer_2/intermediate/dense/mul_1, Execution Time: 0.001411 seconds
Node: bert/encoder/layer_2/intermediate/dense/Tanh, Execution Time: 0.001316 seconds
Add Node: bert/encoder/layer_2/intermediate/dense/add_1, Execution Time: 0.001329 seconds
Node: bert/encoder/layer_2/intermediate/dense/mul_2, Execution Time: 0.001370 seconds
Node: bert/encoder/layer_2/intermediate/dense/mul_3, Execution Time: 0.001295 seconds

Input size: (None, 256, 3072, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_2/output/dense/MatMul torch.Size([256, 768]) , torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_2/output/dense/MatMul, Execution Time: 0.000986 seconds
Skipping already processed Node: bert/encoder/layer_2/output/dense/BiasAdd
Skipping already processed Node: bert/encoder/layer_2/output/add
Node: bert/encoder/layer_2/output/LayerNorm/moments/mean, Execution Time: 0.000085 seconds
Node: bert/encoder/layer_2/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000505 seconds
Node: bert/encoder/layer_2/output/LayerNorm/moments/SquaredDifference__341, Execution Time: 0.000457 seconds
Node: bert/encoder/layer_2/output/LayerNorm/moments/variance, Execution Time: 0.000055 seconds
Add Node: bert/encoder/layer_2/output/LayerNorm/batchnorm/add, Execution Time: 0.000064 seconds
Node: bert/encoder/layer_2/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000070 seconds
Node: bert/encoder/layer_2/output/LayerNorm/batchnorm/Rsqrt__343, Execution Time: 0.000066 seconds
Node: bert/encoder/layer_2/output/LayerNorm/batchnorm/mul, Execution Time: 0.000054 seconds
Node: bert/encoder/layer_2/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000052 seconds
Node: bert/encoder/layer_2/output/LayerNorm/batchnorm/sub, Execution Time: 0.000056 seconds
Node: bert/encoder/layer_2/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000513 seconds
Add Node: bert/encoder/layer_2/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000452 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_3/attention/self/value/MatMul torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_3/attention/self/value/MatMul, Execution Time: 0.000684 seconds
Skipping already processed Node: bert/encoder/layer_3/attention/self/value/BiasAdd
Node: bert/encoder/layer_3/attention/self/Reshape_2, Execution Time: 0.000020 seconds
Node: bert/encoder/layer_3/attention/self/transpose_2, Execution Time: 0.000478 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_3/attention/self/query/MatMul torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_3/attention/self/query/MatMul, Execution Time: 0.000721 seconds
Skipping already processed Node: bert/encoder/layer_3/attention/self/query/BiasAdd
Node: bert/encoder/layer_3/attention/self/Reshape, Execution Time: 0.000010 seconds
Node: bert/encoder/layer_3/attention/self/transpose, Execution Time: 0.000443 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_3/attention/self/key/MatMul torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_3/attention/self/key/MatMul, Execution Time: 0.000608 seconds
Skipping already processed Node: bert/encoder/layer_3/attention/self/key/BiasAdd
Node: bert/encoder/layer_3/attention/self/Reshape_1, Execution Time: 0.000007 seconds
Node: bert/encoder/layer_3/attention/self/MatMul__348, Execution Time: 0.000437 seconds
Input size: (12, 256, 64, 256)
No Add node related to MatMul output: bert/encoder/layer_3/attention/self/MatMul. Executing regular MatMul.
MatMul Node: bert/encoder/layer_3/attention/self/MatMul, Execution Time: 0.000544 seconds
Node: bert/encoder/layer_3/attention/self/Mul, Execution Time: 0.001320 seconds
Add Node: bert/encoder/layer_3/attention/self/add, Execution Time: 0.001428 seconds
Node: bert/encoder/layer_3/attention/self/Softmax, Execution Time: 0.001303 seconds
Input size: (12, 256, 256, 64)
No Add node related to MatMul output: bert/encoder/layer_3/attention/self/MatMul_1. Executing regular MatMul.
MatMul Node: bert/encoder/layer_3/attention/self/MatMul_1, Execution Time: 0.000561 seconds
Node: bert/encoder/layer_3/attention/self/transpose_3, Execution Time: 0.000469 seconds
Node: bert/encoder/layer_3/attention/self/Reshape_3, Execution Time: 0.000038 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_3/attention/output/dense/MatMul torch.Size([256, 768]) , torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_3/attention/output/dense/MatMul, Execution Time: 0.000677 seconds
Skipping already processed Node: bert/encoder/layer_3/attention/output/dense/BiasAdd
Skipping already processed Node: bert/encoder/layer_3/attention/output/add
Node: bert/encoder/layer_3/attention/output/LayerNorm/moments/mean, Execution Time: 0.000088 seconds
Node: bert/encoder/layer_3/attention/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000476 seconds
Node: bert/encoder/layer_3/attention/output/LayerNorm/moments/SquaredDifference__351, Execution Time: 0.000554 seconds
Node: bert/encoder/layer_3/attention/output/LayerNorm/moments/variance, Execution Time: 0.000055 seconds
Add Node: bert/encoder/layer_3/attention/output/LayerNorm/batchnorm/add, Execution Time: 0.000055 seconds
Node: bert/encoder/layer_3/attention/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000047 seconds
Node: bert/encoder/layer_3/attention/output/LayerNorm/batchnorm/Rsqrt__353, Execution Time: 0.000072 seconds
Node: bert/encoder/layer_3/attention/output/LayerNorm/batchnorm/mul, Execution Time: 0.000055 seconds
Node: bert/encoder/layer_3/attention/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000056 seconds
Node: bert/encoder/layer_3/attention/output/LayerNorm/batchnorm/sub, Execution Time: 0.000053 seconds
Node: bert/encoder/layer_3/attention/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000458 seconds
Add Node: bert/encoder/layer_3/attention/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000449 seconds

Input size: (None, 256, 768, 3072)
Fusing MatMul with Add for node: bert/encoder/layer_3/intermediate/dense/MatMul torch.Size([256, 3072])
MatMul Fuse node: bert/encoder/layer_3/intermediate/dense/MatMul, Execution Time: 0.000654 seconds
Skipping already processed Node: bert/encoder/layer_3/intermediate/dense/BiasAdd
Node: bert/encoder/layer_3/intermediate/dense/Pow, Execution Time: 0.001374 seconds
Node: bert/encoder/layer_3/intermediate/dense/mul, Execution Time: 0.001344 seconds
Add Node: bert/encoder/layer_3/intermediate/dense/add, Execution Time: 0.001312 seconds
Node: bert/encoder/layer_3/intermediate/dense/mul_1, Execution Time: 0.001383 seconds
Node: bert/encoder/layer_3/intermediate/dense/Tanh, Execution Time: 0.001316 seconds
Add Node: bert/encoder/layer_3/intermediate/dense/add_1, Execution Time: 0.001338 seconds
Node: bert/encoder/layer_3/intermediate/dense/mul_2, Execution Time: 0.001379 seconds
Node: bert/encoder/layer_3/intermediate/dense/mul_3, Execution Time: 0.001310 seconds

Input size: (None, 256, 3072, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_3/output/dense/MatMul torch.Size([256, 768]) , torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_3/output/dense/MatMul, Execution Time: 0.000992 seconds
Skipping already processed Node: bert/encoder/layer_3/output/dense/BiasAdd
Skipping already processed Node: bert/encoder/layer_3/output/add
Node: bert/encoder/layer_3/output/LayerNorm/moments/mean, Execution Time: 0.000085 seconds
Node: bert/encoder/layer_3/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000485 seconds
Node: bert/encoder/layer_3/output/LayerNorm/moments/SquaredDifference__355, Execution Time: 0.000449 seconds
Node: bert/encoder/layer_3/output/LayerNorm/moments/variance, Execution Time: 0.000054 seconds
Add Node: bert/encoder/layer_3/output/LayerNorm/batchnorm/add, Execution Time: 0.000064 seconds
Node: bert/encoder/layer_3/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000046 seconds
Node: bert/encoder/layer_3/output/LayerNorm/batchnorm/Rsqrt__357, Execution Time: 0.000070 seconds
Node: bert/encoder/layer_3/output/LayerNorm/batchnorm/mul, Execution Time: 0.000061 seconds
Node: bert/encoder/layer_3/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000054 seconds
Node: bert/encoder/layer_3/output/LayerNorm/batchnorm/sub, Execution Time: 0.000052 seconds
Node: bert/encoder/layer_3/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000545 seconds
Add Node: bert/encoder/layer_3/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000445 seconds

Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_4/attention/self/value/MatMul torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_4/attention/self/value/MatMul, Execution Time: 0.000668 seconds
Skipping already processed Node: bert/encoder/layer_4/attention/self/value/BiasAdd
Node: bert/encoder/layer_4/attention/self/Reshape_2, Execution Time: 0.000020 seconds
Node: bert/encoder/layer_4/attention/self/transpose_2, Execution Time: 0.000548 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_4/attention/self/query/MatMul torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_4/attention/self/query/MatMul, Execution Time: 0.000681 seconds
Skipping already processed Node: bert/encoder/layer_4/attention/self/query/BiasAdd
Node: bert/encoder/layer_4/attention/self/Reshape, Execution Time: 0.000009 seconds
Node: bert/encoder/layer_4/attention/self/transpose, Execution Time: 0.000567 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_4/attention/self/key/MatMul torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_4/attention/self/key/MatMul, Execution Time: 0.000655 seconds
Skipping already processed Node: bert/encoder/layer_4/attention/self/key/BiasAdd
Node: bert/encoder/layer_4/attention/self/Reshape_1, Execution Time: 0.000007 seconds
Node: bert/encoder/layer_4/attention/self/MatMul__362, Execution Time: 0.000541 seconds
Input size: (12, 256, 64, 256)
No Add node related to MatMul output: bert/encoder/layer_4/attention/self/MatMul. Executing regular MatMul.
MatMul Node: bert/encoder/layer_4/attention/self/MatMul, Execution Time: 0.000483 seconds
Node: bert/encoder/layer_4/attention/self/Mul, Execution Time: 0.001326 seconds
Add Node: bert/encoder/layer_4/attention/self/add, Execution Time: 0.001472 seconds
Node: bert/encoder/layer_4/attention/self/Softmax, Execution Time: 0.001326 seconds
Input size: (12, 256, 256, 64)
No Add node related to MatMul output: bert/encoder/layer_4/attention/self/MatMul_1. Executing regular MatMul.
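The Pow → mul → add → mul_1 → Tanh → add_1 → mul_2 → mul_3 chain that appears in every intermediate/dense block is BERT's tanh approximation of GELU: 0.5·x·(1 + tanh(√(2/π)·(x + 0.044715·x³))). A numpy sketch, with each step mapped to the log's node names in comments (an illustration of the math, not the runner's kernels):

```python
import numpy as np

def gelu_tanh(x):
    """GELU via the tanh approximation, following the log's node chain:
    Pow -> mul -> add -> mul_1 -> Tanh -> add_1 -> mul_2 -> mul_3."""
    x = np.asarray(x, dtype=float)
    x3 = np.power(x, 3)                   # .../intermediate/dense/Pow
    inner = x + 0.044715 * x3             # .../dense/mul, .../dense/add
    inner = np.sqrt(2.0 / np.pi) * inner  # .../dense/mul_1
    t = np.tanh(inner)                    # .../dense/Tanh
    return 0.5 * x * (1.0 + t)            # .../dense/add_1, mul_2, mul_3
```

Because the activation is expressed as eight elementwise primitive ops instead of one fused GELU kernel, each layer pays roughly eight separate ~1.3 ms elementwise passes over the (256, 3072) intermediate tensor in this log.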
MatMul Node: bert/encoder/layer_4/attention/self/MatMul_1, Execution Time: 0.000573 seconds
Node: bert/encoder/layer_4/attention/self/transpose_3, Execution Time: 0.000484 seconds
Node: bert/encoder/layer_4/attention/self/Reshape_3, Execution Time: 0.000037 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_4/attention/output/dense/MatMul
torch.Size([256, 768]) , torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_4/attention/output/dense/MatMul, Execution Time: 0.000743 seconds
Skipping already processed Node: bert/encoder/layer_4/attention/output/dense/BiasAdd
Skipping already processed Node: bert/encoder/layer_4/attention/output/add
Node: bert/encoder/layer_4/attention/output/LayerNorm/moments/mean, Execution Time: 0.000082 seconds
Node: bert/encoder/layer_4/attention/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000565 seconds
Node: bert/encoder/layer_4/attention/output/LayerNorm/moments/SquaredDifference__365, Execution Time: 0.000463 seconds
Node: bert/encoder/layer_4/attention/output/LayerNorm/moments/variance, Execution Time: 0.000060 seconds
Add Node: bert/encoder/layer_4/attention/output/LayerNorm/batchnorm/add, Execution Time: 0.000051 seconds
Node: bert/encoder/layer_4/attention/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000048 seconds
Node: bert/encoder/layer_4/attention/output/LayerNorm/batchnorm/Rsqrt__367, Execution Time: 0.000067 seconds
Node: bert/encoder/layer_4/attention/output/LayerNorm/batchnorm/mul, Execution Time: 0.000054 seconds
Node: bert/encoder/layer_4/attention/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000053 seconds
Node: bert/encoder/layer_4/attention/output/LayerNorm/batchnorm/sub, Execution Time: 0.000053 seconds
Node: bert/encoder/layer_4/attention/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000457 seconds
Add Node: bert/encoder/layer_4/attention/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000459 seconds
Input size: (None, 256, 768, 3072)
Fusing MatMul with Add for node: bert/encoder/layer_4/intermediate/dense/MatMul
torch.Size([256, 3072])
MatMul Fuse node: bert/encoder/layer_4/intermediate/dense/MatMul, Execution Time: 0.000646 seconds
Skipping already processed Node: bert/encoder/layer_4/intermediate/dense/BiasAdd
Node: bert/encoder/layer_4/intermediate/dense/Pow, Execution Time: 0.001339 seconds
Node: bert/encoder/layer_4/intermediate/dense/mul, Execution Time: 0.001356 seconds
Add Node: bert/encoder/layer_4/intermediate/dense/add, Execution Time: 0.001398 seconds
Node: bert/encoder/layer_4/intermediate/dense/mul_1, Execution Time: 0.001317 seconds
Node: bert/encoder/layer_4/intermediate/dense/Tanh, Execution Time: 0.001311 seconds
Add Node: bert/encoder/layer_4/intermediate/dense/add_1, Execution Time: 0.001370 seconds
Node: bert/encoder/layer_4/intermediate/dense/mul_2, Execution Time: 0.001508 seconds
Node: bert/encoder/layer_4/intermediate/dense/mul_3, Execution Time: 0.001303 seconds
Input size: (None, 256, 3072, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_4/output/dense/MatMul
torch.Size([256, 768]) , torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_4/output/dense/MatMul, Execution Time: 0.000987 seconds
Skipping already processed Node: bert/encoder/layer_4/output/dense/BiasAdd
Skipping already processed Node: bert/encoder/layer_4/output/add
Node: bert/encoder/layer_4/output/LayerNorm/moments/mean, Execution Time: 0.000072 seconds
Node: bert/encoder/layer_4/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000470 seconds
Node: bert/encoder/layer_4/output/LayerNorm/moments/SquaredDifference__369, Execution Time: 0.000466 seconds
Node: bert/encoder/layer_4/output/LayerNorm/moments/variance, Execution Time: 0.000052 seconds
Add Node: bert/encoder/layer_4/output/LayerNorm/batchnorm/add, Execution Time: 0.000048 seconds
Node: bert/encoder/layer_4/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000052 seconds
Node: bert/encoder/layer_4/output/LayerNorm/batchnorm/Rsqrt__371, Execution Time: 0.000066 seconds
Node: bert/encoder/layer_4/output/LayerNorm/batchnorm/mul, Execution Time: 0.000055 seconds
Node: bert/encoder/layer_4/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000052 seconds
Node: bert/encoder/layer_4/output/LayerNorm/batchnorm/sub, Execution Time: 0.000052 seconds
Node: bert/encoder/layer_4/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000466 seconds
Add Node: bert/encoder/layer_4/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000463 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_5/attention/self/value/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_5/attention/self/value/MatMul, Execution Time: 0.001840 seconds
Skipping already processed Node: bert/encoder/layer_5/attention/self/value/BiasAdd
Node: bert/encoder/layer_5/attention/self/Reshape_2, Execution Time: 0.000020 seconds
Node: bert/encoder/layer_5/attention/self/transpose_2, Execution Time: 0.000459 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_5/attention/self/query/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_5/attention/self/query/MatMul, Execution Time: 0.000622 seconds
Skipping already processed Node: bert/encoder/layer_5/attention/self/query/BiasAdd
Node: bert/encoder/layer_5/attention/self/Reshape, Execution Time: 0.000009 seconds
Node: bert/encoder/layer_5/attention/self/transpose, Execution Time: 0.000436 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_5/attention/self/key/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_5/attention/self/key/MatMul, Execution Time: 0.000607 seconds
Skipping already processed Node: bert/encoder/layer_5/attention/self/key/BiasAdd
Node: bert/encoder/layer_5/attention/self/Reshape_1, Execution Time: 0.000009 seconds
Node: bert/encoder/layer_5/attention/self/MatMul__376, Execution Time: 0.000448 seconds
Input size: (12, 256, 64, 256)
No Add node related to MatMul output: bert/encoder/layer_5/attention/self/MatMul. Executing regular MatMul.
MatMul Node: bert/encoder/layer_5/attention/self/MatMul, Execution Time: 0.000485 seconds
Node: bert/encoder/layer_5/attention/self/Mul, Execution Time: 0.001392 seconds
Add Node: bert/encoder/layer_5/attention/self/add, Execution Time: 0.001310 seconds
Node: bert/encoder/layer_5/attention/self/Softmax, Execution Time: 0.001333 seconds
Input size: (12, 256, 256, 64)
No Add node related to MatMul output: bert/encoder/layer_5/attention/self/MatMul_1. Executing regular MatMul.
MatMul Node: bert/encoder/layer_5/attention/self/MatMul_1, Execution Time: 0.000640 seconds
Node: bert/encoder/layer_5/attention/self/transpose_3, Execution Time: 0.000455 seconds
Node: bert/encoder/layer_5/attention/self/Reshape_3, Execution Time: 0.000037 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_5/attention/output/dense/MatMul
torch.Size([256, 768]) , torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_5/attention/output/dense/MatMul, Execution Time: 0.000660 seconds
Skipping already processed Node: bert/encoder/layer_5/attention/output/dense/BiasAdd
Skipping already processed Node: bert/encoder/layer_5/attention/output/add
Node: bert/encoder/layer_5/attention/output/LayerNorm/moments/mean, Execution Time: 0.000081 seconds
Node: bert/encoder/layer_5/attention/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000477 seconds
Node: bert/encoder/layer_5/attention/output/LayerNorm/moments/SquaredDifference__379, Execution Time: 0.000461 seconds
Node: bert/encoder/layer_5/attention/output/LayerNorm/moments/variance, Execution Time: 0.000053 seconds
Add Node: bert/encoder/layer_5/attention/output/LayerNorm/batchnorm/add, Execution Time: 0.000047 seconds
Node: bert/encoder/layer_5/attention/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000048 seconds
Node: bert/encoder/layer_5/attention/output/LayerNorm/batchnorm/Rsqrt__381, Execution Time: 0.000068 seconds
Node: bert/encoder/layer_5/attention/output/LayerNorm/batchnorm/mul, Execution Time: 0.000063 seconds
Node: bert/encoder/layer_5/attention/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000046 seconds
Node: bert/encoder/layer_5/attention/output/LayerNorm/batchnorm/sub, Execution Time: 0.000055 seconds
Node: bert/encoder/layer_5/attention/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000468 seconds
Add Node: bert/encoder/layer_5/attention/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000451 seconds
Input size: (None, 256, 768, 3072)
Fusing MatMul with Add for node: bert/encoder/layer_5/intermediate/dense/MatMul
torch.Size([256, 3072])
MatMul Fuse node: bert/encoder/layer_5/intermediate/dense/MatMul, Execution Time: 0.000666 seconds
Skipping already processed Node: bert/encoder/layer_5/intermediate/dense/BiasAdd
Node: bert/encoder/layer_5/intermediate/dense/Pow, Execution Time: 0.001391 seconds
Node: bert/encoder/layer_5/intermediate/dense/mul, Execution Time: 0.001312 seconds
Add Node: bert/encoder/layer_5/intermediate/dense/add, Execution Time: 0.001391 seconds
Node: bert/encoder/layer_5/intermediate/dense/mul_1, Execution Time: 0.001297 seconds
Node: bert/encoder/layer_5/intermediate/dense/Tanh, Execution Time: 0.001306 seconds
Add Node: bert/encoder/layer_5/intermediate/dense/add_1, Execution Time: 0.001386 seconds
Node: bert/encoder/layer_5/intermediate/dense/mul_2, Execution Time: 0.001291 seconds
Node: bert/encoder/layer_5/intermediate/dense/mul_3, Execution Time: 0.001279 seconds
Input size: (None, 256, 3072, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_5/output/dense/MatMul
torch.Size([256, 768]) , torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_5/output/dense/MatMul, Execution Time: 0.001012 seconds
Skipping already processed Node: bert/encoder/layer_5/output/dense/BiasAdd
Skipping already processed Node: bert/encoder/layer_5/output/add
Node: bert/encoder/layer_5/output/LayerNorm/moments/mean, Execution Time: 0.000083 seconds
Node: bert/encoder/layer_5/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000461 seconds
Node: bert/encoder/layer_5/output/LayerNorm/moments/SquaredDifference__383, Execution Time: 0.000457 seconds
Node: bert/encoder/layer_5/output/LayerNorm/moments/variance, Execution Time: 0.000056 seconds
Add Node: bert/encoder/layer_5/output/LayerNorm/batchnorm/add, Execution Time: 0.000053 seconds
Node: bert/encoder/layer_5/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000049 seconds
Node: bert/encoder/layer_5/output/LayerNorm/batchnorm/Rsqrt__385, Execution Time: 0.000066 seconds
Node: bert/encoder/layer_5/output/LayerNorm/batchnorm/mul, Execution Time: 0.000054 seconds
Node: bert/encoder/layer_5/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000052 seconds
Node: bert/encoder/layer_5/output/LayerNorm/batchnorm/sub, Execution Time: 0.000052 seconds
Node: bert/encoder/layer_5/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000465 seconds
Add Node: bert/encoder/layer_5/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000463 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_6/attention/self/value/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_6/attention/self/value/MatMul, Execution Time: 0.000639 seconds
Skipping already processed Node: bert/encoder/layer_6/attention/self/value/BiasAdd
Node: bert/encoder/layer_6/attention/self/Reshape_2, Execution Time: 0.000020 seconds
Node: bert/encoder/layer_6/attention/self/transpose_2, Execution Time: 0.000466 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_6/attention/self/query/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_6/attention/self/query/MatMul, Execution Time: 0.000643 seconds
Skipping already processed Node: bert/encoder/layer_6/attention/self/query/BiasAdd
Node: bert/encoder/layer_6/attention/self/Reshape, Execution Time: 0.000009 seconds
Node: bert/encoder/layer_6/attention/self/transpose, Execution Time: 0.000510 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_6/attention/self/key/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_6/attention/self/key/MatMul, Execution Time: 0.000669 seconds
Skipping already processed Node: bert/encoder/layer_6/attention/self/key/BiasAdd
Node: bert/encoder/layer_6/attention/self/Reshape_1, Execution Time: 0.000008 seconds
Node: bert/encoder/layer_6/attention/self/MatMul__390, Execution Time: 0.000553 seconds
Input size: (12, 256, 64, 256)
No Add node related to MatMul output: bert/encoder/layer_6/attention/self/MatMul. Executing regular MatMul.
MatMul Node: bert/encoder/layer_6/attention/self/MatMul, Execution Time: 0.000546 seconds
Node: bert/encoder/layer_6/attention/self/Mul, Execution Time: 0.002146 seconds
Add Node: bert/encoder/layer_6/attention/self/add, Execution Time: 0.001294 seconds
Node: bert/encoder/layer_6/attention/self/Softmax, Execution Time: 0.001295 seconds
Input size: (12, 256, 256, 64)
No Add node related to MatMul output: bert/encoder/layer_6/attention/self/MatMul_1. Executing regular MatMul.
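Each intermediate/dense block in the trace runs the same eight-node chain: Pow, mul, add, mul_1, Tanh, add_1, mul_2, mul_3. That chain is the tanh approximation of GELU used by BERT, left unfused in this graph so every elementwise node is timed separately. A NumPy sketch of what those nodes compute together (the per-node mapping in the comments is an interpretation of the node names, not taken from the runner's code):

```python
import numpy as np

def gelu_tanh(x):
    """Tanh approximation of GELU: 0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3)))."""
    # Pow, mul, add, mul_1: the polynomial argument of the tanh
    inner = np.sqrt(2.0 / np.pi) * (x + 0.044715 * np.power(x, 3))
    # Tanh, add_1, mul_2, mul_3: gate the input by the smoothed step
    return 0.5 * x * (1.0 + np.tanh(inner))

x = np.linspace(-3.0, 3.0, 7)
print(gelu_tanh(x))  # near 0 for negative inputs, near x for large positive inputs
```

Because the chain is eight separate elementwise ops over a (256, 3072) activation, it accounts for roughly 0.011 s per layer here, an order of magnitude more than the fused MatMul that feeds it.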
MatMul Node: bert/encoder/layer_6/attention/self/MatMul_1, Execution Time: 0.000554 seconds
Node: bert/encoder/layer_6/attention/self/transpose_3, Execution Time: 0.000507 seconds
Node: bert/encoder/layer_6/attention/self/Reshape_3, Execution Time: 0.000047 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_6/attention/output/dense/MatMul
torch.Size([256, 768]) , torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_6/attention/output/dense/MatMul, Execution Time: 0.000683 seconds
Skipping already processed Node: bert/encoder/layer_6/attention/output/dense/BiasAdd
Skipping already processed Node: bert/encoder/layer_6/attention/output/add
Node: bert/encoder/layer_6/attention/output/LayerNorm/moments/mean, Execution Time: 0.000087 seconds
Node: bert/encoder/layer_6/attention/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000460 seconds
Node: bert/encoder/layer_6/attention/output/LayerNorm/moments/SquaredDifference__393, Execution Time: 0.000455 seconds
Node: bert/encoder/layer_6/attention/output/LayerNorm/moments/variance, Execution Time: 0.000062 seconds
Add Node: bert/encoder/layer_6/attention/output/LayerNorm/batchnorm/add, Execution Time: 0.000064 seconds
Node: bert/encoder/layer_6/attention/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000047 seconds
Node: bert/encoder/layer_6/attention/output/LayerNorm/batchnorm/Rsqrt__395, Execution Time: 0.000072 seconds
Node: bert/encoder/layer_6/attention/output/LayerNorm/batchnorm/mul, Execution Time: 0.000054 seconds
Node: bert/encoder/layer_6/attention/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000053 seconds
Node: bert/encoder/layer_6/attention/output/LayerNorm/batchnorm/sub, Execution Time: 0.000057 seconds
Node: bert/encoder/layer_6/attention/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000443 seconds
Add Node: bert/encoder/layer_6/attention/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000454 seconds
Input size: (None, 256, 768, 3072)
Fusing MatMul with Add for node: bert/encoder/layer_6/intermediate/dense/MatMul
torch.Size([256, 3072])
MatMul Fuse node: bert/encoder/layer_6/intermediate/dense/MatMul, Execution Time: 0.000655 seconds
Skipping already processed Node: bert/encoder/layer_6/intermediate/dense/BiasAdd
Node: bert/encoder/layer_6/intermediate/dense/Pow, Execution Time: 0.001311 seconds
Node: bert/encoder/layer_6/intermediate/dense/mul, Execution Time: 0.001315 seconds
Add Node: bert/encoder/layer_6/intermediate/dense/add, Execution Time: 0.001377 seconds
Node: bert/encoder/layer_6/intermediate/dense/mul_1, Execution Time: 0.001305 seconds
Node: bert/encoder/layer_6/intermediate/dense/Tanh, Execution Time: 0.001307 seconds
Add Node: bert/encoder/layer_6/intermediate/dense/add_1, Execution Time: 0.001387 seconds
Node: bert/encoder/layer_6/intermediate/dense/mul_2, Execution Time: 0.001303 seconds
Node: bert/encoder/layer_6/intermediate/dense/mul_3, Execution Time: 0.001365 seconds
Input size: (None, 256, 3072, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_6/output/dense/MatMul
torch.Size([256, 768]) , torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_6/output/dense/MatMul, Execution Time: 0.000988 seconds
Skipping already processed Node: bert/encoder/layer_6/output/dense/BiasAdd
Skipping already processed Node: bert/encoder/layer_6/output/add
Node: bert/encoder/layer_6/output/LayerNorm/moments/mean, Execution Time: 0.000092 seconds
Node: bert/encoder/layer_6/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000490 seconds
Node: bert/encoder/layer_6/output/LayerNorm/moments/SquaredDifference__397, Execution Time: 0.000460 seconds
Node: bert/encoder/layer_6/output/LayerNorm/moments/variance, Execution Time: 0.000055 seconds
Add Node: bert/encoder/layer_6/output/LayerNorm/batchnorm/add, Execution Time: 0.000063 seconds
Node: bert/encoder/layer_6/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000051 seconds
Node: bert/encoder/layer_6/output/LayerNorm/batchnorm/Rsqrt__399, Execution Time: 0.000071 seconds
Node: bert/encoder/layer_6/output/LayerNorm/batchnorm/mul, Execution Time: 0.000063 seconds
Node: bert/encoder/layer_6/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000045 seconds
Node: bert/encoder/layer_6/output/LayerNorm/batchnorm/sub, Execution Time: 0.000052 seconds
Node: bert/encoder/layer_6/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000481 seconds
Add Node: bert/encoder/layer_6/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000447 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_7/attention/self/value/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_7/attention/self/value/MatMul, Execution Time: 0.000656 seconds
Skipping already processed Node: bert/encoder/layer_7/attention/self/value/BiasAdd
Node: bert/encoder/layer_7/attention/self/Reshape_2, Execution Time: 0.000020 seconds
Node: bert/encoder/layer_7/attention/self/transpose_2, Execution Time: 0.000444 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_7/attention/self/query/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_7/attention/self/query/MatMul, Execution Time: 0.000674 seconds
Skipping already processed Node: bert/encoder/layer_7/attention/self/query/BiasAdd
Node: bert/encoder/layer_7/attention/self/Reshape, Execution Time: 0.000009 seconds
Node: bert/encoder/layer_7/attention/self/transpose, Execution Time: 0.000441 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_7/attention/self/key/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_7/attention/self/key/MatMul, Execution Time: 0.000600 seconds
Skipping already processed Node: bert/encoder/layer_7/attention/self/key/BiasAdd
Node: bert/encoder/layer_7/attention/self/Reshape_1, Execution Time: 0.000008 seconds
Node: bert/encoder/layer_7/attention/self/MatMul__404, Execution Time: 0.000440 seconds
Input size: (12, 256, 64, 256)
No Add node related to MatMul output: bert/encoder/layer_7/attention/self/MatMul. Executing regular MatMul.
MatMul Node: bert/encoder/layer_7/attention/self/MatMul, Execution Time: 0.000509 seconds
Node: bert/encoder/layer_7/attention/self/Mul, Execution Time: 0.001363 seconds
Add Node: bert/encoder/layer_7/attention/self/add, Execution Time: 0.001514 seconds
Node: bert/encoder/layer_7/attention/self/Softmax, Execution Time: 0.001384 seconds
Input size: (12, 256, 256, 64)
No Add node related to MatMul output: bert/encoder/layer_7/attention/self/MatMul_1. Executing regular MatMul.
MatMul Node: bert/encoder/layer_7/attention/self/MatMul_1, Execution Time: 0.000567 seconds
Node: bert/encoder/layer_7/attention/self/transpose_3, Execution Time: 0.000458 seconds
Node: bert/encoder/layer_7/attention/self/Reshape_3, Execution Time: 0.000047 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_7/attention/output/dense/MatMul
torch.Size([256, 768]) , torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_7/attention/output/dense/MatMul, Execution Time: 0.000650 seconds
Skipping already processed Node: bert/encoder/layer_7/attention/output/dense/BiasAdd
Skipping already processed Node: bert/encoder/layer_7/attention/output/add
Node: bert/encoder/layer_7/attention/output/LayerNorm/moments/mean, Execution Time: 0.000081 seconds
Node: bert/encoder/layer_7/attention/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000473 seconds
Node: bert/encoder/layer_7/attention/output/LayerNorm/moments/SquaredDifference__407, Execution Time: 0.000465 seconds
Node: bert/encoder/layer_7/attention/output/LayerNorm/moments/variance, Execution Time: 0.000053 seconds
Add Node: bert/encoder/layer_7/attention/output/LayerNorm/batchnorm/add, Execution Time: 0.000047 seconds
Node: bert/encoder/layer_7/attention/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000045 seconds
Node: bert/encoder/layer_7/attention/output/LayerNorm/batchnorm/Rsqrt__409, Execution Time: 0.000066 seconds
Node: bert/encoder/layer_7/attention/output/LayerNorm/batchnorm/mul, Execution Time: 0.000054 seconds
Node: bert/encoder/layer_7/attention/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000053 seconds
Node: bert/encoder/layer_7/attention/output/LayerNorm/batchnorm/sub, Execution Time: 0.000051 seconds
Node: bert/encoder/layer_7/attention/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000451 seconds
Add Node: bert/encoder/layer_7/attention/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000458 seconds
Input size: (None, 256, 768, 3072)
Fusing MatMul with Add for node: bert/encoder/layer_7/intermediate/dense/MatMul
torch.Size([256, 3072])
MatMul Fuse node: bert/encoder/layer_7/intermediate/dense/MatMul, Execution Time: 0.000650 seconds
Skipping already processed Node: bert/encoder/layer_7/intermediate/dense/BiasAdd
Node: bert/encoder/layer_7/intermediate/dense/Pow, Execution Time: 0.001369 seconds
Node: bert/encoder/layer_7/intermediate/dense/mul, Execution Time: 0.001377 seconds
Add Node: bert/encoder/layer_7/intermediate/dense/add, Execution Time: 0.001498 seconds
Node: bert/encoder/layer_7/intermediate/dense/mul_1, Execution Time: 0.001320 seconds
Node: bert/encoder/layer_7/intermediate/dense/Tanh, Execution Time: 0.001377 seconds
Add Node: bert/encoder/layer_7/intermediate/dense/add_1, Execution Time: 0.001314 seconds
Node: bert/encoder/layer_7/intermediate/dense/mul_2, Execution Time: 0.001305 seconds
Node: bert/encoder/layer_7/intermediate/dense/mul_3, Execution Time: 0.002071 seconds
Input size: (None, 256, 3072, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_7/output/dense/MatMul
torch.Size([256, 768]) , torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_7/output/dense/MatMul, Execution Time: 0.001035 seconds
Skipping already processed Node: bert/encoder/layer_7/output/dense/BiasAdd
Skipping already processed Node: bert/encoder/layer_7/output/add
Node: bert/encoder/layer_7/output/LayerNorm/moments/mean, Execution Time: 0.000083 seconds
Node: bert/encoder/layer_7/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000452 seconds
Node: bert/encoder/layer_7/output/LayerNorm/moments/SquaredDifference__411, Execution Time: 0.000452 seconds
Node: bert/encoder/layer_7/output/LayerNorm/moments/variance, Execution Time: 0.000056 seconds
Add Node: bert/encoder/layer_7/output/LayerNorm/batchnorm/add, Execution Time: 0.000051 seconds
Node: bert/encoder/layer_7/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000045 seconds
Node: bert/encoder/layer_7/output/LayerNorm/batchnorm/Rsqrt__413, Execution Time: 0.000071 seconds
Node: bert/encoder/layer_7/output/LayerNorm/batchnorm/mul, Execution Time: 0.000053 seconds
Node: bert/encoder/layer_7/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000052 seconds
Node: bert/encoder/layer_7/output/LayerNorm/batchnorm/sub, Execution Time: 0.000052 seconds
Node: bert/encoder/layer_7/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000450 seconds
Add Node: bert/encoder/layer_7/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000447 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_8/attention/self/value/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_8/attention/self/value/MatMul, Execution Time: 0.000658 seconds
Skipping already processed Node: bert/encoder/layer_8/attention/self/value/BiasAdd
Node: bert/encoder/layer_8/attention/self/Reshape_2, Execution Time: 0.000020 seconds
Node: bert/encoder/layer_8/attention/self/transpose_2, Execution Time: 0.000448 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_8/attention/self/query/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_8/attention/self/query/MatMul, Execution Time: 0.000630 seconds
Skipping already processed Node: bert/encoder/layer_8/attention/self/query/BiasAdd
Node: bert/encoder/layer_8/attention/self/Reshape, Execution Time: 0.000020 seconds
Node: bert/encoder/layer_8/attention/self/transpose, Execution Time: 0.000449 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_8/attention/self/key/MatMul
torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_8/attention/self/key/MatMul, Execution Time: 0.000614 seconds
Skipping already processed Node: bert/encoder/layer_8/attention/self/key/BiasAdd
Node: bert/encoder/layer_8/attention/self/Reshape_1, Execution Time: 0.000008 seconds
Node: bert/encoder/layer_8/attention/self/MatMul__418, Execution Time: 0.000443 seconds
Input size: (12, 256, 64, 256)
No Add node related to MatMul output: bert/encoder/layer_8/attention/self/MatMul. Executing regular MatMul.
MatMul Node: bert/encoder/layer_8/attention/self/MatMul, Execution Time: 0.000495 seconds
Node: bert/encoder/layer_8/attention/self/Mul, Execution Time: 0.001312 seconds
Add Node: bert/encoder/layer_8/attention/self/add, Execution Time: 0.001359 seconds
Node: bert/encoder/layer_8/attention/self/Softmax, Execution Time: 0.001416 seconds
Input size: (12, 256, 256, 64)
No Add node related to MatMul output: bert/encoder/layer_8/attention/self/MatMul_1. Executing regular MatMul.
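The recurring per-layer sequence MatMul, Mul, add, Softmax, MatMul_1, with input sizes (12, 256, 64, 256) and (12, 256, 256, 64), is scaled dot-product attention over 12 heads of size 64 at sequence length 256. A minimal NumPy sketch of what that node chain computes (random tensors and a zero mask are placeholders; in the real graph the mask comes from bert/encoder/mul and the layer's sub node):

```python
import numpy as np

heads, seq, dim = 12, 256, 64
rng = np.random.default_rng(0)
q = rng.standard_normal((heads, seq, dim))
k = rng.standard_normal((heads, seq, dim))
v = rng.standard_normal((heads, seq, dim))
mask = np.zeros((1, 1, seq))                     # 0 where attended, large negative where padded

scores = q @ k.transpose(0, 2, 1)                # MatMul: (12, 256, 256) attention scores
scores = scores / np.sqrt(dim)                   # Mul: scale by 1/sqrt(64)
scores = scores + mask                           # add: apply the attention mask
scores -= scores.max(axis=-1, keepdims=True)     # stabilize before exponentiating
probs = np.exp(scores)
probs /= probs.sum(axis=-1, keepdims=True)       # Softmax over the key axis
context = probs @ v                              # MatMul_1: (12, 256, 64) context vectors
```

The two "Executing regular MatMul" messages correspond to the score MatMul and the context MatMul_1: neither has a trailing Add to fuse, so both run as plain matrix multiplies.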
MatMul Node: bert/encoder/layer_8/attention/self/MatMul_1, Execution Time: 0.000587 seconds Node: bert/encoder/layer_8/attention/self/transpose_3, Execution Time: 0.000445 seconds Node: bert/encoder/layer_8/attention/self/Reshape_3, Execution Time: 0.000051 seconds Input size: (None, 256, 768, 768) Fusing MatMul with 2Add for node: bert/encoder/layer_8/attention/output/dense/MatMul torch.Size([256, 768]) , torch.Size([256, 768]) MatMul Fuse node: bert/encoder/layer_8/attention/output/dense/MatMul, Execution Time: 0.000746 seconds Skipping already processed Node: bert/encoder/layer_8/attention/output/dense/BiasAdd Skipping already processed Node: bert/encoder/layer_8/attention/output/add Node: bert/encoder/layer_8/attention/output/LayerNorm/moments/mean, Execution Time: 0.000085 seconds Node: bert/encoder/layer_8/attention/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000469 seconds Node: bert/encoder/layer_8/attention/output/LayerNorm/moments/SquaredDifference__421, Execution Time: 0.000466 seconds Node: bert/encoder/layer_8/attention/output/LayerNorm/moments/variance, Execution Time: 0.000054 seconds Add Node: bert/encoder/layer_8/attention/output/LayerNorm/batchnorm/add, Execution Time: 0.000063 seconds Node: bert/encoder/layer_8/attention/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000047 seconds Node: bert/encoder/layer_8/attention/output/LayerNorm/batchnorm/Rsqrt__423, Execution Time: 0.000066 seconds Node: bert/encoder/layer_8/attention/output/LayerNorm/batchnorm/mul, Execution Time: 0.000059 seconds Node: bert/encoder/layer_8/attention/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000054 seconds Node: bert/encoder/layer_8/attention/output/LayerNorm/batchnorm/sub, Execution Time: 0.000055 seconds Node: bert/encoder/layer_8/attention/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000446 seconds Add Node: bert/encoder/layer_8/attention/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000448 seconds Input size: (None, 256, 
768, 3072) Fusing MatMul with Add for node: bert/encoder/layer_8/intermediate/dense/MatMul torch.Size([256, 3072]) MatMul Fuse node: bert/encoder/layer_8/intermediate/dense/MatMul, Execution Time: 0.000650 seconds Skipping already processed Node: bert/encoder/layer_8/intermediate/dense/BiasAdd Node: bert/encoder/layer_8/intermediate/dense/Pow, Execution Time: 0.001652 seconds Node: bert/encoder/layer_8/intermediate/dense/mul, Execution Time: 0.001383 seconds Add Node: bert/encoder/layer_8/intermediate/dense/add, Execution Time: 0.001327 seconds Node: bert/encoder/layer_8/intermediate/dense/mul_1, Execution Time: 0.001308 seconds Node: bert/encoder/layer_8/intermediate/dense/Tanh, Execution Time: 0.001390 seconds Add Node: bert/encoder/layer_8/intermediate/dense/add_1, Execution Time: 0.001313 seconds Node: bert/encoder/layer_8/intermediate/dense/mul_2, Execution Time: 0.001375 seconds Node: bert/encoder/layer_8/intermediate/dense/mul_3, Execution Time: 0.001365 seconds Input size: (None, 256, 3072, 768) Fusing MatMul with 2Add for node: bert/encoder/layer_8/output/dense/MatMul torch.Size([256, 768]) , torch.Size([256, 768]) MatMul Fuse node: bert/encoder/layer_8/output/dense/MatMul, Execution Time: 0.000986 seconds Skipping already processed Node: bert/encoder/layer_8/output/dense/BiasAdd Skipping already processed Node: bert/encoder/layer_8/output/add Node: bert/encoder/layer_8/output/LayerNorm/moments/mean, Execution Time: 0.000085 seconds Node: bert/encoder/layer_8/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000489 seconds Node: bert/encoder/layer_8/output/LayerNorm/moments/SquaredDifference__425, Execution Time: 0.000483 seconds Node: bert/encoder/layer_8/output/LayerNorm/moments/variance, Execution Time: 0.000054 seconds Add Node: bert/encoder/layer_8/output/LayerNorm/batchnorm/add, Execution Time: 0.000064 seconds Node: bert/encoder/layer_8/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000046 seconds Node: 
bert/encoder/layer_8/output/LayerNorm/batchnorm/Rsqrt__427, Execution Time: 0.000073 seconds Node: bert/encoder/layer_8/output/LayerNorm/batchnorm/mul, Execution Time: 0.000057 seconds Node: bert/encoder/layer_8/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000053 seconds Node: bert/encoder/layer_8/output/LayerNorm/batchnorm/sub, Execution Time: 0.000053 seconds Node: bert/encoder/layer_8/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000444 seconds Add Node: bert/encoder/layer_8/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000456 seconds Input size: (None, 256, 768, 768) Fusing MatMul with Add for node: bert/encoder/layer_9/attention/self/value/MatMul torch.Size([256, 768]) MatMul Fuse node: bert/encoder/layer_9/attention/self/value/MatMul, Execution Time: 0.000708 seconds Skipping already processed Node: bert/encoder/layer_9/attention/self/value/BiasAdd Node: bert/encoder/layer_9/attention/self/Reshape_2, Execution Time: 0.000020 seconds Node: bert/encoder/layer_9/attention/self/transpose_2, Execution Time: 0.000458 seconds Input size: (None, 256, 768, 768) Fusing MatMul with Add for node: bert/encoder/layer_9/attention/self/query/MatMul torch.Size([256, 768]) MatMul Fuse node: bert/encoder/layer_9/attention/self/query/MatMul, Execution Time: 0.000642 seconds Skipping already processed Node: bert/encoder/layer_9/attention/self/query/BiasAdd Node: bert/encoder/layer_9/attention/self/Reshape, Execution Time: 0.000010 seconds Node: bert/encoder/layer_9/attention/self/transpose, Execution Time: 0.000452 seconds Input size: (None, 256, 768, 768) Fusing MatMul with Add for node: bert/encoder/layer_9/attention/self/key/MatMul torch.Size([256, 768]) MatMul Fuse node: bert/encoder/layer_9/attention/self/key/MatMul, Execution Time: 0.000621 seconds Skipping already processed Node: bert/encoder/layer_9/attention/self/key/BiasAdd Node: bert/encoder/layer_9/attention/self/Reshape_1, Execution Time: 0.000010 seconds Node: 
bert/encoder/layer_9/attention/self/MatMul__432, Execution Time: 0.000462 seconds
Input size: (12, 256, 64, 256)
No Add node related to MatMul output: bert/encoder/layer_9/attention/self/MatMul. Executing regular MatMul.
MatMul Node: bert/encoder/layer_9/attention/self/MatMul, Execution Time: 0.000492 seconds
Node: bert/encoder/layer_9/attention/self/Mul, Execution Time: 0.001414 seconds
Add Node: bert/encoder/layer_9/attention/self/add, Execution Time: 0.001318 seconds
Node: bert/encoder/layer_9/attention/self/Softmax, Execution Time: 0.001571 seconds
Input size: (12, 256, 256, 64)
No Add node related to MatMul output: bert/encoder/layer_9/attention/self/MatMul_1. Executing regular MatMul.
MatMul Node: bert/encoder/layer_9/attention/self/MatMul_1, Execution Time: 0.000562 seconds
Node: bert/encoder/layer_9/attention/self/transpose_3, Execution Time: 0.000447 seconds
Node: bert/encoder/layer_9/attention/self/Reshape_3, Execution Time: 0.000038 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_9/attention/output/dense/MatMul torch.Size([256, 768]) , torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_9/attention/output/dense/MatMul, Execution Time: 0.000661 seconds
Skipping already processed Node: bert/encoder/layer_9/attention/output/dense/BiasAdd
Skipping already processed Node: bert/encoder/layer_9/attention/output/add
Node: bert/encoder/layer_9/attention/output/LayerNorm/moments/mean, Execution Time: 0.000082 seconds
Node: bert/encoder/layer_9/attention/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000456 seconds
Node: bert/encoder/layer_9/attention/output/LayerNorm/moments/SquaredDifference__435, Execution Time: 0.000499 seconds
Node: bert/encoder/layer_9/attention/output/LayerNorm/moments/variance, Execution Time: 0.000067 seconds
Add Node: bert/encoder/layer_9/attention/output/LayerNorm/batchnorm/add, Execution Time: 0.000051 seconds
Node:
bert/encoder/layer_9/attention/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000047 seconds
Node: bert/encoder/layer_9/attention/output/LayerNorm/batchnorm/Rsqrt__437, Execution Time: 0.000076 seconds
Node: bert/encoder/layer_9/attention/output/LayerNorm/batchnorm/mul, Execution Time: 0.000052 seconds
Node: bert/encoder/layer_9/attention/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000051 seconds
Node: bert/encoder/layer_9/attention/output/LayerNorm/batchnorm/sub, Execution Time: 0.000052 seconds
Node: bert/encoder/layer_9/attention/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000524 seconds
Add Node: bert/encoder/layer_9/attention/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000565 seconds
Input size: (None, 256, 768, 3072)
Fusing MatMul with Add for node: bert/encoder/layer_9/intermediate/dense/MatMul torch.Size([256, 3072])
MatMul Fuse node: bert/encoder/layer_9/intermediate/dense/MatMul, Execution Time: 0.000738 seconds
Skipping already processed Node: bert/encoder/layer_9/intermediate/dense/BiasAdd
Node: bert/encoder/layer_9/intermediate/dense/Pow, Execution Time: 0.001530 seconds
Node: bert/encoder/layer_9/intermediate/dense/mul, Execution Time: 0.001426 seconds
Add Node: bert/encoder/layer_9/intermediate/dense/add, Execution Time: 0.001411 seconds
Node: bert/encoder/layer_9/intermediate/dense/mul_1, Execution Time: 0.001332 seconds
Node: bert/encoder/layer_9/intermediate/dense/Tanh, Execution Time: 0.001435 seconds
Add Node: bert/encoder/layer_9/intermediate/dense/add_1, Execution Time: 0.001343 seconds
Node: bert/encoder/layer_9/intermediate/dense/mul_2, Execution Time: 0.001372 seconds
Node: bert/encoder/layer_9/intermediate/dense/mul_3, Execution Time: 0.001386 seconds
Input size: (None, 256, 3072, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_9/output/dense/MatMul torch.Size([256, 768]) , torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_9/output/dense/MatMul, Execution Time: 0.001089 seconds
Skipping
already processed Node: bert/encoder/layer_9/output/dense/BiasAdd
Skipping already processed Node: bert/encoder/layer_9/output/add
Node: bert/encoder/layer_9/output/LayerNorm/moments/mean, Execution Time: 0.000101 seconds
Node: bert/encoder/layer_9/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000596 seconds
Node: bert/encoder/layer_9/output/LayerNorm/moments/SquaredDifference__439, Execution Time: 0.000592 seconds
Node: bert/encoder/layer_9/output/LayerNorm/moments/variance, Execution Time: 0.000066 seconds
Add Node: bert/encoder/layer_9/output/LayerNorm/batchnorm/add, Execution Time: 0.000058 seconds
Node: bert/encoder/layer_9/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000059 seconds
Node: bert/encoder/layer_9/output/LayerNorm/batchnorm/Rsqrt__441, Execution Time: 0.000091 seconds
Node: bert/encoder/layer_9/output/LayerNorm/batchnorm/mul, Execution Time: 0.000063 seconds
Node: bert/encoder/layer_9/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000061 seconds
Node: bert/encoder/layer_9/output/LayerNorm/batchnorm/sub, Execution Time: 0.000057 seconds
Node: bert/encoder/layer_9/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000564 seconds
Add Node: bert/encoder/layer_9/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000584 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_10/attention/self/value/MatMul torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_10/attention/self/value/MatMul, Execution Time: 0.001988 seconds
Skipping already processed Node: bert/encoder/layer_10/attention/self/value/BiasAdd
Node: bert/encoder/layer_10/attention/self/Reshape_2, Execution Time: 0.000020 seconds
Node: bert/encoder/layer_10/attention/self/transpose_2, Execution Time: 0.000438 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_10/attention/self/query/MatMul torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_10/attention/self/query/MatMul,
Execution Time: 0.000623 seconds
Skipping already processed Node: bert/encoder/layer_10/attention/self/query/BiasAdd
Node: bert/encoder/layer_10/attention/self/Reshape, Execution Time: 0.000009 seconds
Node: bert/encoder/layer_10/attention/self/transpose, Execution Time: 0.000460 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_10/attention/self/key/MatMul torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_10/attention/self/key/MatMul, Execution Time: 0.000663 seconds
Skipping already processed Node: bert/encoder/layer_10/attention/self/key/BiasAdd
Node: bert/encoder/layer_10/attention/self/Reshape_1, Execution Time: 0.000009 seconds
Node: bert/encoder/layer_10/attention/self/MatMul__446, Execution Time: 0.000453 seconds
Input size: (12, 256, 64, 256)
No Add node related to MatMul output: bert/encoder/layer_10/attention/self/MatMul. Executing regular MatMul.
MatMul Node: bert/encoder/layer_10/attention/self/MatMul, Execution Time: 0.000487 seconds
Node: bert/encoder/layer_10/attention/self/Mul, Execution Time: 0.001345 seconds
Add Node: bert/encoder/layer_10/attention/self/add, Execution Time: 0.001318 seconds
Node: bert/encoder/layer_10/attention/self/Softmax, Execution Time: 0.001414 seconds
Input size: (12, 256, 256, 64)
No Add node related to MatMul output: bert/encoder/layer_10/attention/self/MatMul_1. Executing regular MatMul.
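The recurring MatMul → Mul → add → Softmax → MatMul_1 node sequence in each attention block is plain scaled dot-product attention with an additive padding mask. A minimal NumPy sketch of what those nodes compute (the (12, 256, 64) head layout follows the shapes in this log; the tensors and the function name are stand-ins, not the runner's actual code):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v, mask_bias, scale):
    # MatMul: raw attention scores, shape (heads, seq, seq)
    scores = q @ k.transpose(0, 2, 1)
    # Mul: scale by 1/sqrt(head_dim); add: additive attention-mask bias
    scores = scores * scale + mask_bias
    # Softmax over the key axis (numerically stabilized)
    scores -= scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    # MatMul_1: weighted sum of the value vectors, shape (heads, seq, head_dim)
    return probs @ v

rng = np.random.default_rng(0)
heads, seq, head_dim = 12, 256, 64
q, k, v = (rng.standard_normal((heads, seq, head_dim)) for _ in range(3))
mask_bias = np.zeros(seq)  # 0 for real tokens, a large negative value for padding
out = scaled_dot_product_attention(q, k, v, mask_bias, head_dim ** -0.5)
print(out.shape)  # (12, 256, 64)
```

The "No Add node related to MatMul output" messages above fit this picture: the two batched attention MatMuls have no trailing bias Add, so they cannot be fused and run as regular MatMuls.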
MatMul Node: bert/encoder/layer_10/attention/self/MatMul_1, Execution Time: 0.000694 seconds
Node: bert/encoder/layer_10/attention/self/transpose_3, Execution Time: 0.000443 seconds
Node: bert/encoder/layer_10/attention/self/Reshape_3, Execution Time: 0.000048 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_10/attention/output/dense/MatMul torch.Size([256, 768]) , torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_10/attention/output/dense/MatMul, Execution Time: 0.000693 seconds
Skipping already processed Node: bert/encoder/layer_10/attention/output/dense/BiasAdd
Skipping already processed Node: bert/encoder/layer_10/attention/output/add
Node: bert/encoder/layer_10/attention/output/LayerNorm/moments/mean, Execution Time: 0.000084 seconds
Node: bert/encoder/layer_10/attention/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000475 seconds
Node: bert/encoder/layer_10/attention/output/LayerNorm/moments/SquaredDifference__449, Execution Time: 0.000465 seconds
Node: bert/encoder/layer_10/attention/output/LayerNorm/moments/variance, Execution Time: 0.000054 seconds
Add Node: bert/encoder/layer_10/attention/output/LayerNorm/batchnorm/add, Execution Time: 0.000047 seconds
Node: bert/encoder/layer_10/attention/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000047 seconds
Node: bert/encoder/layer_10/attention/output/LayerNorm/batchnorm/Rsqrt__451, Execution Time: 0.000067 seconds
Node: bert/encoder/layer_10/attention/output/LayerNorm/batchnorm/mul, Execution Time: 0.000057 seconds
Node: bert/encoder/layer_10/attention/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000052 seconds
Node: bert/encoder/layer_10/attention/output/LayerNorm/batchnorm/sub, Execution Time: 0.000057 seconds
Node: bert/encoder/layer_10/attention/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000531 seconds
Add Node: bert/encoder/layer_10/attention/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000460 seconds
Input
size: (None, 256, 768, 3072)
Fusing MatMul with Add for node: bert/encoder/layer_10/intermediate/dense/MatMul torch.Size([256, 3072])
MatMul Fuse node: bert/encoder/layer_10/intermediate/dense/MatMul, Execution Time: 0.000681 seconds
Skipping already processed Node: bert/encoder/layer_10/intermediate/dense/BiasAdd
Node: bert/encoder/layer_10/intermediate/dense/Pow, Execution Time: 0.001327 seconds
Node: bert/encoder/layer_10/intermediate/dense/mul, Execution Time: 0.001411 seconds
Add Node: bert/encoder/layer_10/intermediate/dense/add, Execution Time: 0.001332 seconds
Node: bert/encoder/layer_10/intermediate/dense/mul_1, Execution Time: 0.001390 seconds
Node: bert/encoder/layer_10/intermediate/dense/Tanh, Execution Time: 0.001319 seconds
Add Node: bert/encoder/layer_10/intermediate/dense/add_1, Execution Time: 0.001312 seconds
Node: bert/encoder/layer_10/intermediate/dense/mul_2, Execution Time: 0.001759 seconds
Node: bert/encoder/layer_10/intermediate/dense/mul_3, Execution Time: 0.001331 seconds
Input size: (None, 256, 3072, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_10/output/dense/MatMul torch.Size([256, 768]) , torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_10/output/dense/MatMul, Execution Time: 0.000994 seconds
Skipping already processed Node: bert/encoder/layer_10/output/dense/BiasAdd
Skipping already processed Node: bert/encoder/layer_10/output/add
Node: bert/encoder/layer_10/output/LayerNorm/moments/mean, Execution Time: 0.000082 seconds
Node: bert/encoder/layer_10/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000477 seconds
Node: bert/encoder/layer_10/output/LayerNorm/moments/SquaredDifference__453, Execution Time: 0.000459 seconds
Node: bert/encoder/layer_10/output/LayerNorm/moments/variance, Execution Time: 0.000053 seconds
Add Node: bert/encoder/layer_10/output/LayerNorm/batchnorm/add, Execution Time: 0.000064 seconds
Node: bert/encoder/layer_10/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000046
seconds
Node: bert/encoder/layer_10/output/LayerNorm/batchnorm/Rsqrt__455, Execution Time: 0.000067 seconds
Node: bert/encoder/layer_10/output/LayerNorm/batchnorm/mul, Execution Time: 0.000057 seconds
Node: bert/encoder/layer_10/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000053 seconds
Node: bert/encoder/layer_10/output/LayerNorm/batchnorm/sub, Execution Time: 0.000059 seconds
Node: bert/encoder/layer_10/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000454 seconds
Add Node: bert/encoder/layer_10/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000557 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_11/attention/self/value/MatMul torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_11/attention/self/value/MatMul, Execution Time: 0.000667 seconds
Skipping already processed Node: bert/encoder/layer_11/attention/self/value/BiasAdd
Node: bert/encoder/layer_11/attention/self/Reshape_2, Execution Time: 0.000020 seconds
Node: bert/encoder/layer_11/attention/self/transpose_2, Execution Time: 0.000451 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_11/attention/self/query/MatMul torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_11/attention/self/query/MatMul, Execution Time: 0.000632 seconds
Skipping already processed Node: bert/encoder/layer_11/attention/self/query/BiasAdd
Node: bert/encoder/layer_11/attention/self/Reshape, Execution Time: 0.000020 seconds
Node: bert/encoder/layer_11/attention/self/transpose, Execution Time: 0.000466 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with Add for node: bert/encoder/layer_11/attention/self/key/MatMul torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_11/attention/self/key/MatMul, Execution Time: 0.000609 seconds
Skipping already processed Node: bert/encoder/layer_11/attention/self/key/BiasAdd
Node: bert/encoder/layer_11/attention/self/Reshape_1, Execution Time: 0.000007 seconds
Node:
bert/encoder/layer_11/attention/self/MatMul__460, Execution Time: 0.000451 seconds
Input size: (12, 256, 64, 256)
No Add node related to MatMul output: bert/encoder/layer_11/attention/self/MatMul. Executing regular MatMul.
MatMul Node: bert/encoder/layer_11/attention/self/MatMul, Execution Time: 0.000494 seconds
Node: bert/encoder/layer_11/attention/self/Mul, Execution Time: 0.001331 seconds
Add Node: bert/encoder/layer_11/attention/self/add, Execution Time: 0.001391 seconds
Node: bert/encoder/layer_11/attention/self/Softmax, Execution Time: 0.001305 seconds
Input size: (12, 256, 256, 64)
No Add node related to MatMul output: bert/encoder/layer_11/attention/self/MatMul_1. Executing regular MatMul.
MatMul Node: bert/encoder/layer_11/attention/self/MatMul_1, Execution Time: 0.000559 seconds
Node: bert/encoder/layer_11/attention/self/transpose_3, Execution Time: 0.000445 seconds
Node: bert/encoder/layer_11/attention/self/Reshape_3, Execution Time: 0.000047 seconds
Input size: (None, 256, 768, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_11/attention/output/dense/MatMul torch.Size([256, 768]) , torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_11/attention/output/dense/MatMul, Execution Time: 0.000668 seconds
Skipping already processed Node: bert/encoder/layer_11/attention/output/dense/BiasAdd
Skipping already processed Node: bert/encoder/layer_11/attention/output/add
Node: bert/encoder/layer_11/attention/output/LayerNorm/moments/mean, Execution Time: 0.000082 seconds
Node: bert/encoder/layer_11/attention/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000474 seconds
Node: bert/encoder/layer_11/attention/output/LayerNorm/moments/SquaredDifference__463, Execution Time: 0.000541 seconds
Node: bert/encoder/layer_11/attention/output/LayerNorm/moments/variance, Execution Time: 0.000054 seconds
Add Node: bert/encoder/layer_11/attention/output/LayerNorm/batchnorm/add, Execution Time: 0.000048 seconds
Node:
bert/encoder/layer_11/attention/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000048 seconds
Node: bert/encoder/layer_11/attention/output/LayerNorm/batchnorm/Rsqrt__465, Execution Time: 0.000071 seconds
Node: bert/encoder/layer_11/attention/output/LayerNorm/batchnorm/mul, Execution Time: 0.000075 seconds
Node: bert/encoder/layer_11/attention/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000053 seconds
Node: bert/encoder/layer_11/attention/output/LayerNorm/batchnorm/sub, Execution Time: 0.000052 seconds
Node: bert/encoder/layer_11/attention/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000450 seconds
Add Node: bert/encoder/layer_11/attention/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000453 seconds
Input size: (None, 256, 768, 3072)
Fusing MatMul with Add for node: bert/encoder/layer_11/intermediate/dense/MatMul torch.Size([256, 3072])
MatMul Fuse node: bert/encoder/layer_11/intermediate/dense/MatMul, Execution Time: 0.000818 seconds
Skipping already processed Node: bert/encoder/layer_11/intermediate/dense/BiasAdd
Node: bert/encoder/layer_11/intermediate/dense/Pow, Execution Time: 0.002038 seconds
Node: bert/encoder/layer_11/intermediate/dense/mul, Execution Time: 0.001370 seconds
Add Node: bert/encoder/layer_11/intermediate/dense/add, Execution Time: 0.001295 seconds
Node: bert/encoder/layer_11/intermediate/dense/mul_1, Execution Time: 0.001367 seconds
Node: bert/encoder/layer_11/intermediate/dense/Tanh, Execution Time: 0.001366 seconds
Add Node: bert/encoder/layer_11/intermediate/dense/add_1, Execution Time: 0.001344 seconds
Node: bert/encoder/layer_11/intermediate/dense/mul_2, Execution Time: 0.001409 seconds
Node: bert/encoder/layer_11/intermediate/dense/mul_3, Execution Time: 0.001320 seconds
Input size: (None, 256, 3072, 768)
Fusing MatMul with 2Add for node: bert/encoder/layer_11/output/dense/MatMul torch.Size([256, 768]) , torch.Size([256, 768])
MatMul Fuse node: bert/encoder/layer_11/output/dense/MatMul, Execution Time: 0.000977
seconds
Skipping already processed Node: bert/encoder/layer_11/output/dense/BiasAdd
Skipping already processed Node: bert/encoder/layer_11/output/add
Node: bert/encoder/layer_11/output/LayerNorm/moments/mean, Execution Time: 0.000082 seconds
Node: bert/encoder/layer_11/output/LayerNorm/moments/SquaredDifference, Execution Time: 0.000461 seconds
Node: bert/encoder/layer_11/output/LayerNorm/moments/SquaredDifference__467, Execution Time: 0.000485 seconds
Node: bert/encoder/layer_11/output/LayerNorm/moments/variance, Execution Time: 0.000055 seconds
Add Node: bert/encoder/layer_11/output/LayerNorm/batchnorm/add, Execution Time: 0.000049 seconds
Node: bert/encoder/layer_11/output/LayerNorm/batchnorm/Rsqrt, Execution Time: 0.000048 seconds
Node: bert/encoder/layer_11/output/LayerNorm/batchnorm/Rsqrt__469, Execution Time: 0.000070 seconds
Node: bert/encoder/layer_11/output/LayerNorm/batchnorm/mul, Execution Time: 0.000045 seconds
Node: bert/encoder/layer_11/output/LayerNorm/batchnorm/mul_2, Execution Time: 0.000052 seconds
Node: bert/encoder/layer_11/output/LayerNorm/batchnorm/sub, Execution Time: 0.000053 seconds
Node: bert/encoder/layer_11/output/LayerNorm/batchnorm/mul_1, Execution Time: 0.000533 seconds
Add Node: bert/encoder/layer_11/output/LayerNorm/batchnorm/add_1, Execution Time: 0.000473 seconds
Input size: (None, 256, 768, 2)
Fusing MatMul with Add for node: MatMul torch.Size([256, 2])
MatMul Fuse node: MatMul, Execution Time: 0.001725 seconds
Skipping already processed Node: BiasAdd
Node: Reshape_1, Execution Time: 0.000026 seconds
Node: transpose, Execution Time: 0.000045 seconds
Node: unstack, Execution Time: 0.000050 seconds
Node: unstack__490, Execution Time: 0.000020 seconds
Node: unstack__488, Execution Time: 0.000007 seconds
Node Execution Times:
Total Execution Time: 0.436412 seconds
Total Matmul + Add Execution Time: 0.163752 seconds
Execution complete.
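The per-node "Execution Time" lines and the two totals above suggest a runner that wraps each node's execute call with a wall-clock timer and keeps a separate accumulator for the fused MatMul+Add nodes. A hypothetical sketch of that bookkeeping (the real runner's API is not shown in this log; `NodeTimer` and its method names are invented for illustration):

```python
import time
from collections import defaultdict

class NodeTimer:
    """Accumulates per-node wall-clock times, mirroring the log's summary lines."""
    def __init__(self):
        self.totals = defaultdict(float)

    def run(self, name, fn, *args, fused_matmul_add=False):
        # Time one node's execution and fold it into the running totals.
        start = time.perf_counter()
        result = fn(*args)
        elapsed = time.perf_counter() - start
        self.totals["all"] += elapsed
        if fused_matmul_add:
            self.totals["matmul_add"] += elapsed
        print(f"Node: {name}, Execution Time: {elapsed:.6f} seconds")
        return result

timer = NodeTimer()
# Stand-in workload in place of a real fused MatMul+Add kernel.
timer.run("bert/encoder/layer_11/output/dense/MatMul",
          lambda: sum(i * i for i in range(10000)), fused_matmul_add=True)
print(f"Total Execution Time: {timer.totals['all']:.6f} seconds")
print(f"Total Matmul + Add Execution Time: {timer.totals['matmul_add']:.6f} seconds")
```

Under this scheme the fused MatMul+Add work accounts for roughly 0.164 s of the 0.436 s total reported above, i.e. a bit over a third of the run.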
Model outputs: {'unstack:1': array([[-4.9148726, -4.6251225, -4.132886 , -4.1499195, -4.7828836, -4.250844 , -4.77094 , -4.348463 , -2.7006364, -4.424177 , -4.510866 , -4.39433 , -4.773833 , -4.480716 , -4.7714205, -4.6485815, -3.1330094, -4.7139587, -4.7148943, -4.7223635, -4.7008233, -4.6960616, -4.7121487, -4.708615 , -4.703374 , -4.7024655, -4.687359 , -4.693113 , -4.698162 , -4.692563 , -4.711712 , -4.7003703, -4.7027717, -4.7279253, -4.709934 , -4.715551 , -4.7324576, -4.7294855, -4.7329216, -4.7218866, -4.7014203, -4.694692 , -4.6925716, -4.700892 , -4.7044754, -4.68252 , -4.679993 , -4.6824126, -4.6833754, -4.690988 , -4.695919 , -4.6797957, -4.683871 , -4.6834297, -4.680781 , -4.686977 , -4.681429 , -4.680897 , -4.694978 , -4.685382 , -4.70324 , -4.7010674, -4.693331 , -4.7089696, -4.71908 , -4.7188516, -4.70435 , -4.685466 , -4.6962924, -4.6972375, -4.691828 , -4.688009 , -4.691449 , -4.693622 , -4.6890097, -4.6876435, -4.684474 , -4.7056074, -4.6984677, -4.7068577, -4.689911 , -4.687499 , -4.6927333, -4.693831 , -4.6965637, -4.693646 , -4.693519 , -4.71067 , -4.722037 , -4.718479 , -4.729904 , -4.721483 , -4.739112 , -4.7325935, -4.7295456, -4.712435 , -4.712704 , -4.7114053, -4.712399 , -4.704262 , -4.6972833, -4.6926665, -4.717176 , -4.6937675, -4.694539 , -4.711683 , -4.685275 , -4.6935816, -4.701117 , -4.6866083, -4.6843753, -4.6876745, -4.684178 , -4.694061 , -4.6890798, -4.6861553, -4.7003927, -4.7103863, -4.710601 , -4.7194986, -4.7016277, -4.718649 , -4.743214 , -4.7109504, -4.711556 , -4.7007613, -4.7009783, -4.6995244, -4.7007017, -4.7026825, -4.706376 , -4.7061615, -4.7284904, -4.724841 , -4.7082043, -4.7080393, -4.7098503, -4.7207146, -4.733838 , -4.7125974, -4.7276387, -4.721991 , -4.7300687, -4.7229652, -4.7133346, -4.7109923, -4.71963 , -4.7312083, -4.733224 , -4.7362647, -4.739877 , -4.74243 , -4.727128 , -4.737834 , -4.74598 , -4.738839 , -4.744508 , -4.728359 , -4.726734 , -4.7255516, -4.7363386, -4.73214 , -4.7196693, -4.721826 , 
-4.7047076, -4.7190104, -4.7156587, -4.706273 , -4.7116737, -4.701518 , -4.6943965, -4.6903934, -4.6890545, -4.6862764, -4.6875463, -4.684304 , -4.688264 , -4.691186 , -4.7027955, -4.6910152, -4.6985803, -4.7152886, -4.723945 , -4.7293673, -4.7427354, -4.73977 , -4.7290154, -4.7378254, -4.7355986, -4.731869 , -4.724579 , -4.7262163, -4.71887 , -4.7058587, -4.7122684, -4.7009015, -4.696829 , -4.7094407, -4.703914 , -4.703702 , -4.7195215, -4.7118044, -4.709847 , -4.721358 , -4.723019 , -4.71298 , -4.7218485, -4.724691 , -4.725982 , -4.726673 , -4.7187834, -4.709004 , -4.7109466, -4.737439 , -4.7246385, -4.73252 , -4.7404885, -4.7261868, -4.734698 , -4.732445 , -4.736647 , -4.724646 , -4.73208 , -4.7321663, -4.7037077, -4.718028 , -4.726786 , -4.7345347, -4.7328334, -4.7220054, -4.7327023, -4.7200413, -4.7459936, -4.728972 , -4.7290406, -4.7259574, -4.730495 , -4.723769 , -4.7380366, -4.7268267, -4.692981 , -4.718449 , -4.6935935, -4.6961823, -4.713647 , -4.6950507, -4.700345 , -4.7232556, -4.708386 , -4.737004 , -4.7273254, -4.716681 , -4.7106347, -4.714922 , -4.7030454, -4.7468524]], dtype=float32), 'unstack:0': array([[-5.339778 , -4.878685 , -4.312428 , -4.3309417, -5.125337 , -4.442749 , -5.1271124, -4.5656004, -4.683339 , -4.6350813, -4.8042274, -4.6028423, -5.1304255, -4.7185884, -5.0999007, -4.9003377, -5.1724668, -5.1058035, -5.1073008, -5.1120396, -5.0958624, -5.092071 , -5.104314 , -5.1013465, -5.0973773, -5.0955014, -5.086265 , -5.089708 , -5.093198 , -5.089909 , -5.1028776, -5.0938663, -5.0976443, -5.1154556, -5.102868 , -5.1068664, -5.1185074, -5.1169963, -5.118672 , -5.1110716, -5.0957775, -5.0914636, -5.089892 , -5.096351 , -5.099577 , -5.084194 , -5.082636 , -5.0841656, -5.0848293, -5.089616 , -5.0918293, -5.083179 , -5.084272 , -5.0856056, -5.0826926, -5.087329 , -5.0841713, -5.0831146, -5.092702 , -5.084974 , -5.0978565, -5.0952926, -5.090936 , -5.102818 , -5.110067 , -5.1097775, -5.0976253, -5.0851665, -5.0931044, -5.093152 , -5.089941 , 
-5.0872903, -5.0898356, -5.0923924, -5.0875926, -5.086853 , -5.085301 , -5.100186 , -5.094749 , -5.099969 , -5.0874996, -5.0855126, -5.0895004, -5.09137 , -5.0918326, -5.0898056, -5.090782 , -5.1034665, -5.112412 , -5.109096 , -5.1174197, -5.1111536, -5.1241746, -5.1188 , -5.116848 , -5.1029363, -5.1041894, -5.103745 , -5.105212 , -5.098095 , -5.093282 , -5.090341 , -5.1087084, -5.0905395, -5.0906925, -5.1039257, -5.084995 , -5.090868 , -5.0939407, -5.0842586, -5.0840406, -5.0855136, -5.08409 , -5.089621 , -5.0858765, -5.0852404, -5.09481 , -5.1036887, -5.1036325, -5.1107006, -5.0964427, -5.109834 , -5.128194 , -5.104343 , -5.10455 , -5.0965843, -5.0981956, -5.0968714, -5.0971923, -5.096769 , -5.1019425, -5.1022315, -5.119105 , -5.116201 , -5.102627 , -5.102922 , -5.1034007, -5.111492 , -5.121706 , -5.1049304, -5.116994 , -5.111964 , -5.1179514, -5.1140733, -5.1069007, -5.1045523, -5.1113954, -5.119346 , -5.1202354, -5.1230803, -5.1247115, -5.125494 , -5.1167865, -5.1235557, -5.127506 , -5.1223035, -5.124693 , -5.116798 , -5.1166444, -5.1148844, -5.1223955, -5.1191473, -5.111838 , -5.112754 , -5.1008034, -5.1111383, -5.1085505, -5.100999 , -5.1052284, -5.0974274, -5.0922704, -5.0895066, -5.089077 , -5.086511 , -5.0866723, -5.0855794, -5.0879817, -5.0893273, -5.0967927, -5.08802 , -5.093814 , -5.1059337, -5.112577 , -5.1154685, -5.121607 , -5.12036 , -5.114813 , -5.1212907, -5.1178846, -5.117335 , -5.1129055, -5.1143084, -5.109348 , -5.100045 , -5.1053514, -5.0964003, -5.0934987, -5.102238 , -5.0983605, -5.0989766, -5.1099577, -5.10423 , -5.1023245, -5.1104093, -5.111489 , -5.1045485, -5.110909 , -5.112187 , -5.1123652, -5.113932 , -5.10867 , -5.0995913, -5.101586 , -5.1216726, -5.111117 , -5.116669 , -5.12195 , -5.112778 , -5.1199346, -5.117032 , -5.120798 , -5.11272 , -5.117168 , -5.1175523, -5.09827 , -5.1082807, -5.1146145, -5.1200075, -5.1190424, -5.112625 , -5.1200185, -5.1110024, -5.126168 , -5.1168666, -5.11615 , -5.113571 , -5.118028 , -5.1132293, -5.122775 
, -5.1154203, -5.091564 , -5.1100745, -5.0914884, -5.0932784, -5.105365 , -5.092105 , -5.0959387, -5.1119223, -5.101221 , -5.1215677, -5.114091 , -5.10658 , -5.101732 , -5.105737 , -5.0961223, -5.1260395]], dtype=float32), 'unique_ids:0': array([0])}
Question: What is the capital of France? Context: The capital of France is Paris. Answer:
Generating '/tmp/nsys-report-b145.qdstrm'
[1/8] [========================100%] nsys-report-048e.nsys-rep
[2/8] [========================100%] nsys-report-b910.sqlite
[3/8] Executing 'nvtx_sum' stats report
[4/8] Executing 'osrt_sum' stats report

Time (%)  Total Time (ns)  Num Calls       Avg (ns)       Med (ns)     Min (ns)     Max (ns)   StdDev (ns)  Name
--------  ---------------  ---------  -------------  -------------  -----------  -----------  ------------  ----------------------
    53.8    5,534,228,333         66   83,851,944.4  100,143,002.5        1,170  545,269,062  71,605,661.3  poll
    43.7    4,500,777,207          9  500,086,356.3  500,086,682.0  500,079,031  500,089,732       3,160.8  pthread_cond_timedwait
     1.6      169,470,404      5,645       30,021.3          800.0          290  156,067,077   2,077,185.0  read
     0.6       65,433,966      3,057       21,404.6        7,290.0          210   10,657,324     253,752.7  ioctl
     0.1        9,565,910      3,192        2,996.8        2,730.0        1,150       37,190       1,560.3  open64
     0.0        5,062,319          1    5,062,319.0    5,062,319.0    5,062,319    5,062,319           0.0  nanosleep
     0.0        3,515,399    133,713           26.3           20.0           20        7,690          37.9  pthread_cond_signal
     0.0        3,019,548        138       21,880.8        5,050.0        2,120    1,585,212     135,490.4  mmap64
     0.0          888,370         10       88,837.0       61,496.0       16,131      321,794      89,799.7  sem_timedwait
     0.0          875,984         13       67,383.4       60,021.0       54,961       81,122      11,142.0  sleep
     0.0          507,661        583          870.8           50.0           20       57,101       5,351.7  fgets
     0.0          344,517         32       10,766.2        5,985.0          430       48,080      13,305.5  write
     0.0          339,116          8       42,389.5       38,491.0       23,730       62,011      14,666.4  pthread_create
     0.0          303,824         27       11,252.7        7,160.0        1,910       78,201      14,616.3  mmap
     0.0          211,907         44        4,816.1        2,895.0        1,130       23,071       4,821.3  fopen
     0.0          187,553          9       20,839.2        4,420.0        2,370       83,491      31,534.9  munmap
     0.0          167,402        173          967.6          820.0          500        3,971         515.7  pread64
     0.0          124,571          1      124,571.0      124,571.0      124,571      124,571           0.0  pthread_cond_wait
     0.0          100,471          1      100,471.0      100,471.0      100,471      100,471           0.0  waitpid
     0.0           61,040      1,622           37.6           30.0           20        4,320         147.5  pthread_cond_broadcast
     0.0           57,899         41        1,412.2        1,150.0          660        4,790         867.4  fclose
     0.0           54,840         15        3,656.0        3,270.0        1,820        6,590       1,615.8  open
     0.0           38,309          6        6,384.8        4,239.5        2,220       18,640       6,173.0  pipe2
     0.0           32,631          2       16,315.5       16,315.5        9,130       23,501      10,161.8  connect
     0.0           31,867        133          239.6          250.0           20        1,480         163.8  sigaction
     0.0           29,977      1,211           24.8           20.0           20          151           6.3  flockfile
     0.0           29,391          4        7,347.8        7,470.0        3,370       11,081       4,026.6  socket
     0.0           22,437         68          330.0          300.0          180        1,160         173.5  fcntl
     0.0           20,210          6        3,368.3        2,620.0        1,360        7,370       2,188.4  fopen64
     0.0           16,430        192           85.6          100.0           20          550          66.3  pthread_mutex_trylock
     0.0           15,540          3        5,180.0        5,620.0        1,600        8,320       3,381.5  fread
     0.0            8,140          2        4,070.0        4,070.0        2,350        5,790       2,432.4  bind
     0.0            3,480          2        1,740.0        1,740.0          800        2,680       1,329.4  fwrite
     0.0            2,629         10          262.9          260.0          189          360          49.6  dup
     0.0            2,602         30           86.7           30.0           20          900         182.5  fflush
     0.0            2,250          2        1,125.0        1,125.0          660        1,590         657.6  dup2
     0.0              769          1          769.0          769.0          769          769           0.0  getc
     0.0              680          1          680.0          680.0          680          680           0.0  listen

[5/8] Executing 'cuda_api_sum' stats report

Time (%)  Total Time (ns)  Num Calls  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)  Name
--------  ---------------  ---------  ---------  ---------  --------  ----------
----------- --------------------------------- 66.8 458,889,319 1,804 254,373.2 53,460.5 2,211 2,177,000 394,198.6 cudaMemcpyAsync 16.4 112,515,093 1,804 62,369.8 11,100.0 650 257,654 79,474.5 cudaStreamSynchronize 10.9 75,217,217 707 106,389.3 7,460.0 2,850 16,497,323 927,843.2 cudaLaunchKernel 1.5 10,141,562 98 103,485.3 91,441.5 5,390 327,454 88,149.1 cuCtxSynchronize 1.4 9,551,675 2,624 3,640.1 3,085.0 490 20,001 2,831.1 cudaDeviceSynchronize 1.0 6,839,815 2,624 2,606.6 1,560.0 1,190 32,571 2,225.5 cudaEventRecord 0.9 6,327,816 26 243,377.5 715.0 290 6,308,675 1,237,082.8 cudaStreamIsCapturing_v10000 0.4 2,729,205 23 118,661.1 126,411.0 73,641 167,492 30,706.1 cudaMalloc 0.3 1,776,952 2,624 677.2 600.0 240 18,670 548.3 cudaEventCreateWithFlags 0.2 1,274,525 98 13,005.4 12,935.0 7,760 27,621 1,979.2 cuLaunchKernel 0.1 922,031 2,624 351.4 300.0 180 7,720 263.5 cudaEventDestroy 0.1 361,385 5 72,277.0 70,091.0 56,771 89,731 12,660.4 cuModuleLoadData 0.0 326,636 1,149 284.3 200.0 50 7,880 367.3 cuGetProcAddress_v2 0.0 262,753 50 5,255.1 5,465.0 3,130 9,450 1,868.7 cudaMemsetAsync 0.0 171,663 1 171,663.0 171,663.0 171,663 171,663 0.0 cudaGetDeviceProperties_v2_v12000 0.0 3,930 3 1,310.0 1,300.0 510 2,120 805.0 cuInit 0.0 3,530 1 3,530.0 3,530.0 3,530 3,530 0.0 cuMemFree_v2 0.0 950 3 316.7 240.0 60 650 302.4 cuModuleGetLoadingMode 0.0 840 1 840.0 840.0 840 840 0.0 cuCtxSetCurrent [6/8] Executing 'cuda_gpu_kern_sum' stats report Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name -------- --------------- --------- -------- -------- -------- -------- ----------- ---------------------------------------------------------------------------------------------------- 84.2 9,532,306 97 98,271.2 84,480.0 11,072 322,784 89,061.0 cutlass_tensorop_s1688tf32gemm_256x128_16x3_tt_align4 3.1 345,470 125 2,763.8 2,368.0 1,343 6,016 1,403.3 void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl<at::native::… 2.8 315,425 
121 2,606.8 2,304.0 1,280 4,288 724.6 void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl<at::native::… 2.0 225,953 75 3,012.7 2,368.0 1,600 4,993 1,136.4 void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl<at::native::… 1.9 217,087 123 1,764.9 1,280.0 800 3,136 758.5 void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor<float>, at::detail::… 1.6 182,945 50 3,658.9 3,712.0 3,488 4,000 149.1 void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp<float, at::native::MeanOps<fl… 0.9 104,833 12 8,736.1 8,688.0 8,608 9,248 171.0 void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl<at::native::… 0.9 103,266 64 1,613.5 960.0 864 4,544 1,348.0 void at::native::vectorized_elementwise_kernel<(int)4, at::native::CUDAFunctor_add<float>, at::deta… 0.9 96,288 37 2,602.4 1,824.0 1,728 4,384 1,188.8 void at::native::vectorized_elementwise_kernel<(int)4, at::native::BinaryFunctor<float, float, floa… 0.6 71,392 12 5,949.3 5,952.0 5,856 6,048 66.1 void <unnamed>::softmax_warp_forward<float, float, float, (int)8, (bool)0, (bool)0>(T2 *, const T1 … 0.4 45,536 12 3,794.7 3,792.0 3,712 3,872 53.6 void at::native::vectorized_elementwise_kernel<(int)4, at::native::tanh_kernel_cuda(at::TensorItera… 0.2 25,922 25 1,036.9 1,024.0 1,024 1,057 16.1 void at::native::vectorized_elementwise_kernel<(int)4, at::native::reciprocal_kernel_cuda(at::Tenso… 0.2 25,664 25 1,026.6 1,024.0 992 1,056 12.8 void at::native::vectorized_elementwise_kernel<(int)4, at::native::sqrt_kernel_cuda(at::TensorItera… 0.2 22,911 25 916.4 928.0 895 928 15.7 void at::native::vectorized_elementwise_kernel<(int)4, at::native::AUnaryFunctor<float, float, floa… 0.1 5,760 1 5,760.0 5,760.0 5,760 5,760 0.0 cutlass_tensorop_s1688tf32gemm_256x128_16x3_tt_align2 0.0 1,600 1 1,600.0 1,600.0 1,600 1,600 0.0 void at::native::<unnamed>::CatArrayBatchedCopy_aligned16_contig<int, unsigned int, 
(int)1, (int)12… [7/8] Executing 'cuda_gpu_mem_time_sum' stats report Time (%) Total Time (ns) Count Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Operation -------- --------------- ----- --------- --------- -------- --------- ----------- ---------------------------- 56.6 187,729,069 1,157 162,255.0 119,617.0 287 2,133,603 254,865.0 [CUDA memcpy Host-to-Device] 43.4 143,824,334 647 222,294.2 117,216.0 1,056 1,011,362 282,104.0 [CUDA memcpy Device-to-Host] 0.0 24,355 50 487.1 320.0 288 1,088 294.8 [CUDA memset] [8/8] Executing 'cuda_gpu_mem_size_sum' stats report Total (MB) Count Avg (MB) Med (MB) Min (MB) Max (MB) StdDev (MB) Operation ---------- ----- -------- -------- -------- -------- ----------- ---------------------------- 1,224.510 1,157 1.058 0.786 0.000 9.437 1.648 [CUDA memcpy Host-to-Device] 707.510 647 1.094 0.786 0.000 3.146 1.206 [CUDA memcpy Device-to-Host] 0.000 50 0.000 0.000 0.000 0.000 0.000 [CUDA memset] Generated: /tmp/nsys-report-048e.nsys-rep /tmp/nsys-report-b910.sqlite
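Each stats table above reduces a set of per-call durations to the same columns: Time (%), Total Time, Num Calls/Instances, Avg, Med, Min, Max, StdDev (individual reports can also be regenerated later from the saved .nsys-rep/.sqlite files with `nsys stats`). As a minimal sketch of how one such row is derived, the snippet below aggregates a hypothetical list of call durations; the sample values and the choice of sample standard deviation are illustrative assumptions, not taken from the report.

```python
import statistics

def summarize(durations_ns, total_all_ns):
    """Collapse per-call durations (ns) into one nsys-style summary row.

    total_all_ns is the summed time of all calls in the report, used for
    the Time (%) column. Values here are hypothetical sample data.
    """
    total = sum(durations_ns)
    return {
        "time_pct": round(100.0 * total / total_all_ns, 1),
        "total_ns": total,
        "num_calls": len(durations_ns),
        "avg_ns": total / len(durations_ns),
        "med_ns": statistics.median(durations_ns),
        "min_ns": min(durations_ns),
        "max_ns": max(durations_ns),
        # Sample standard deviation (assumption: nsys may use a different
        # estimator); degenerate single-call rows get 0.0, as in the report.
        "stddev_ns": statistics.stdev(durations_ns) if len(durations_ns) > 1 else 0.0,
    }

# Hypothetical durations for one API, e.g. a few cudaMemcpyAsync calls.
row = summarize([2_211, 53_460, 254_373, 2_177_000], total_all_ns=5_000_000)
print(row)
```

This also explains the skewed rows above: a large gap between Med and Avg (as for cudaMemcpyAsync, median ~53 µs vs mean ~254 µs) indicates a few very slow outlier calls dominating the total.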