Optimizing Inference Performance on Inferentia

Building on the results obtained in Section 5.3, we now try to optimize inference processing on the Neuron cores.

Step 1. Modify the inference script

Create an inference Python script named infer_bert_perf2.py with the following contents.

To make efficient use of the four Neuron cores on the Inferentia inference chip, this script uses Python's ThreadPoolExecutor to run inference threads in parallel and drive the Neuron cores to maximum load.

import os
import time
import numpy as np
import torch
import torch_neuron
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from concurrent import futures

model_path = 'bert_neuron.pt'

num_neuroncore = 4  # number of NeuronCores on the Inferentia chip
num_thread = 2      # worker threads per loaded model copy
batch_size = 1

throughput_time = 90        # total measurement duration in seconds
throughput_interval = 10    # report throughput every 10 seconds
latency_window_size = 1000  # compute percentiles over the most recent 1000 requests

# added for utilizing 4 neuron cores
os.environ['NEURON_RT_VISIBLE_CORES'] = '0-3'

# Build tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")

# Load the compiled TorchScript model num_neuroncore times,
# then hand each loaded copy to num_thread worker threads
model_list = [torch.jit.load(model_path) for _ in range(num_neuroncore)]
model_mp_list = []
for model in model_list:
    model_mp_list.extend(model for _ in range(num_thread))

# Setup some example inputs
sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"

paraphrase = tokenizer.encode_plus(sequence_0, sequence_2, max_length=128, padding='max_length', truncation=True, return_tensors="pt")

# Convert example inputs to a format that is compatible with TorchScript tracing
example_inputs_paraphrase_tuple = (
    torch.cat([paraphrase['input_ids']] * batch_size, 0),
    torch.cat([paraphrase['attention_mask']] * batch_size, 0),
    torch.cat([paraphrase['token_type_ids']] * batch_size, 0)
)


live = True
num_infer = 0
latency_list = []

def one_thread(model, example_inputs_paraphrase_tuple):
    global latency_list
    global num_infer
    global live

    while True:
        start = time.time()
        paraphrase_classification_logits = model(*example_inputs_paraphrase_tuple)
        latency = time.time() - start
        latency_list.append(latency)
        num_infer += batch_size
        if not live:
            break

def current_performance():
    last_num_infer = num_infer
    for _ in range(throughput_time // throughput_interval):
        current_num_infer = num_infer
        throughput = (current_num_infer - last_num_infer) / throughput_interval
        p50 = 0.0
        p90 = 0.0
        p95 = 0.0
        if latency_list:
            p50 = np.percentile(latency_list[-latency_window_size:], 50)
            p90 = np.percentile(latency_list[-latency_window_size:], 90)
            p95 = np.percentile(latency_list[-latency_window_size:], 95)
        print('current throughput {}, latency p50={:.5f} p90={:.5f} p95={:.5f}'.format(throughput, p50, p90, p95))
        last_num_infer = current_num_infer
        time.sleep(throughput_interval)
    global live
    live = False

executor = futures.ThreadPoolExecutor(max_workers=len(model_mp_list)+1)
executor.submit(current_performance)

for model in model_mp_list:
    executor.submit(one_thread, model, example_inputs_paraphrase_tuple)

Step 2. Run the modified inference script

Run the modified inference script infer_bert_perf2.py.

python infer_bert_perf2.py

The following results are obtained. You can confirm that processing performance on the Neuron cores has improved substantially.

current throughput 0.0, latency p50=0.00000 p90=0.00000 p95=0.00000
current throughput 256.8, latency p50=0.03100 p90=0.03108 p95=0.03111
current throughput 257.9, latency p50=0.03101 p90=0.03105 p95=0.03106
current throughput 258.0, latency p50=0.03101 p90=0.03105 p95=0.03107
current throughput 258.1, latency p50=0.03101 p90=0.03106 p95=0.03107
current throughput 258.0, latency p50=0.03100 p90=0.03105 p95=0.03106
current throughput 258.0, latency p50=0.03100 p90=0.03105 p95=0.03107
current throughput 258.0, latency p50=0.03101 p90=0.03105 p95=0.03107
current throughput 258.4, latency p50=0.03101 p90=0.03105 p95=0.03107
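
As a rough sanity check: with num_neuroncore = 4 and num_thread = 2 there are 8 inference threads in flight, so the expected throughput is approximately 8 / p50 latency ≈ 8 / 0.031 s ≈ 258 requests per second, which is consistent with the measured values above.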

Step 3. Measure Neuron core load

As in Section 5.3 Step 3, check the Neuron core utilization.

neuron-top

The output shows that each of the four Neuron cores is being used at close to 100% load.

 NeuronCore Utilization
                  NC0                          NC1                         NC2                         NC3
 ND0  |||||||||||||||[99.09%] ||||||||||||||||||||[99.05%] ||||||||||||||||||||[99.12%] ||||||||||||||||||||[98.98%]

 vCPU and Memory Info
 System vCPU Usage   [12.92%,  9.27%]    Runtime vCPU Usage     [ 8.29%, 11.52%]
 Runtime Memory Host [  3.1MB/  15.2GB]  Runtime Memory Device  718.0MB

 Loaded Models
                                                                                                    Model ID
  [+] ND 0

Step 4. Run on the CPU

Modify the following part of infer_bert_perf2.py and run it again to check performance on the CPU.

model_path = 'bert_cpu.pt'

num_neuroncore = 1
num_thread = 8
batch_size = 1

The following results are obtained. Compared with the results in Section 5.3, there is no performance improvement.

current throughput 0.0, latency p50=0.00000 p90=0.00000 p95=0.00000
current throughput 11.5, latency p50=0.62904 p90=0.93557 p95=0.97669
current throughput 12.9, latency p50=0.62257 p90=0.77736 p95=0.93456
current throughput 13.1, latency p50=0.60946 p90=0.74768 p95=0.81387
current throughput 13.0, latency p50=0.60979 p90=0.73267 p95=0.79491
current throughput 13.5, latency p50=0.60466 p90=0.72485 p95=0.77868
current throughput 13.7, latency p50=0.60152 p90=0.71854 p95=0.76954
current throughput 13.1, latency p50=0.60238 p90=0.71630 p95=0.75955
current throughput 13.0, latency p50=0.60148 p90=0.70095 p95=0.72949
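
By the same reasoning as above, 8 threads at a p50 latency of roughly 0.6 s works out to about 8 / 0.6 ≈ 13 requests per second, roughly 20 times lower than the Inferentia result.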

Although it is not used in this workshop, Neuron 1.16.0 introduces a new API, torch.neuron.DataParallel(), for data-parallel inference on Inferentia. For details, see Data Parallel Inference on Torch Neuron in the AWS Neuron documentation.
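
As a minimal sketch of how this API could be combined with the model compiled earlier (the exact sharding and dynamic-batching behavior depends on the Neuron SDK version, and the batch size of 4 below is an assumption for illustration only):

import torch
import torch_neuron
from transformers import AutoTokenizer

# Sketch only: torch.neuron.DataParallel replicates the compiled model across the
# visible NeuronCores and splits the input batch along dim 0 at inference time.
model = torch.jit.load('bert_neuron.pt')
model_parallel = torch.neuron.DataParallel(model)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
paraphrase = tokenizer.encode_plus(
    "The company HuggingFace is based in New York City",
    "HuggingFace's headquarters are situated in Manhattan",
    max_length=128, padding='max_length', truncation=True, return_tensors="pt")

# Build a batch of 4 identical examples (illustrative batch size);
# DataParallel shards the batch across the NeuronCores.
batch = tuple(torch.cat([t] * 4, 0) for t in
              (paraphrase['input_ids'], paraphrase['attention_mask'], paraphrase['token_type_ids']))
outputs = model_parallel(*batch)

With this approach a single loaded model can use all NeuronCores without the hand-rolled per-core model copies and threads used in infer_bert_perf2.py above.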