We now verify the results obtained in Section 5.3 and try to optimize inference on the Neuron cores.
Create an inference Python script named infer_bert_perf2.py with the following contents.
To make efficient use of the four Neuron cores on the Inferentia chip, the script uses Python's ThreadPoolExecutor to run inference threads in parallel and drive the Neuron cores to maximum load.
import os
import time
import numpy as np
import torch
import torch_neuron
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from concurrent import futures

model_path = 'bert_neuron.pt'
num_neuroncore = 4
num_thread = 2            # worker threads per model copy
batch_size = 1
throughput_time = 90      # total measurement time in seconds
throughput_interval = 10  # reporting interval in seconds
latency_window_size = 1000

# added for utilizing 4 neuron cores
os.environ['NEURON_RT_VISIBLE_CORES'] = '0-3'

# Build tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")

# Load the TorchScript model back; loading num_neuroncore copies lets the
# Neuron runtime place one copy on each visible NeuronCore
model_list = [torch.jit.load(model_path) for _ in range(num_neuroncore)]

# Each model copy is shared by num_thread worker threads
model_mp_list = []
for model in model_list:
    model_mp_list.extend(model for _ in range(num_thread))

# Setup some example inputs
sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
paraphrase = tokenizer.encode_plus(sequence_0, sequence_2, max_length=128, padding='max_length', truncation=True, return_tensors="pt")

# Convert example inputs to a format that is compatible with TorchScript tracing
example_inputs_paraphrase_tuple = (
    torch.cat([paraphrase['input_ids']] * batch_size, 0),
    torch.cat([paraphrase['attention_mask']] * batch_size, 0),
    torch.cat([paraphrase['token_type_ids']] * batch_size, 0)
)

live = True
num_infer = 0
latency_list = []

def one_thread(model, example_inputs_paraphrase_tuple):
    # Run inference in a tight loop until the measurement window ends
    global latency_list
    global num_infer
    global live
    while True:
        start = time.time()
        paraphrase_classification_logits = model(*example_inputs_paraphrase_tuple)
        latency = time.time() - start
        latency_list.append(latency)
        num_infer += batch_size
        if not live:
            break

def current_performance():
    # Report throughput and latency percentiles every throughput_interval seconds
    global live
    last_num_infer = num_infer
    for _ in range(throughput_time // throughput_interval):
        current_num_infer = num_infer
        throughput = (current_num_infer - last_num_infer) / throughput_interval
        p50 = 0.0
        p90 = 0.0
        p95 = 0.0
        if latency_list:
            p50 = np.percentile(latency_list[-latency_window_size:], 50)
            p90 = np.percentile(latency_list[-latency_window_size:], 90)
            p95 = np.percentile(latency_list[-latency_window_size:], 95)
        print('current throughput {}, latency p50={:.5f} p90={:.5f} p95={:.5f}'.format(throughput, p50, p90, p95))
        last_num_infer = current_num_infer
        time.sleep(throughput_interval)
    live = False

# One worker per (model copy, thread) pair plus one reporter thread;
# the program keeps running until current_performance sets live = False
executor = futures.ThreadPoolExecutor(max_workers=len(model_mp_list) + 1)
executor.submit(current_performance)
for model in model_mp_list:
    executor.submit(one_thread, model, example_inputs_paraphrase_tuple)
Run the updated inference script infer_bert_perf2.py:
python infer_bert_perf2.py
The following results are obtained. You can see that inference performance on the Neuron cores has improved substantially compared with Section 5.3.
current throughput 0.0, latency p50=0.00000 p90=0.00000 p95=0.00000
current throughput 256.8, latency p50=0.03100 p90=0.03108 p95=0.03111
current throughput 257.9, latency p50=0.03101 p90=0.03105 p95=0.03106
current throughput 258.0, latency p50=0.03101 p90=0.03105 p95=0.03107
current throughput 258.1, latency p50=0.03101 p90=0.03106 p95=0.03107
current throughput 258.0, latency p50=0.03100 p90=0.03105 p95=0.03106
current throughput 258.0, latency p50=0.03100 p90=0.03105 p95=0.03107
current throughput 258.0, latency p50=0.03101 p90=0.03105 p95=0.03107
current throughput 258.4, latency p50=0.03101 p90=0.03105 p95=0.03107
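A quick back-of-the-envelope check (our own illustration, not part of the workshop output) shows why roughly 258 inferences per second is plausible: with 4 NeuronCores and 2 threads per core there are 8 requests in flight at any time, each completing in about the p50 latency of 0.031 seconds.
# Rough sanity check (assumption: throughput ~ in-flight requests / per-request latency)
num_neuroncore = 4
num_thread = 2
p50_latency = 0.031                                  # seconds, taken from the run above
print(num_neuroncore * num_thread / p50_latency)     # ~258 inferences/sec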
As in Section 5.3, Step 3, check the Neuron core utilization.
neuron-top
The output shows that each of the four Neuron cores is being driven at close to 100% utilization.
NeuronCore Utilization
NC0 NC1 NC2 NC3
ND0 |||||||||||||||[99.09%] ||||||||||||||||||||[99.05%] ||||||||||||||||||||[99.12%] ||||||||||||||||||||[98.98%]
vCPU and Memory Info
System vCPU Usage ||||||||||||||||||||||||[12.92%, 9.27%] Runtime vCPU Usage |||||||||||||||||||||||[ 8.29%,11.52%]|
Runtime Memory Host ||||||||||||||||||[ 3.1MB/ 15.2GB] Runtime Memory Device 718.0MB
Loaded Models
Model ID
[+] ND 0
Next, change the following part of infer_bert_perf2.py and run it again to check the performance on the CPU.
model_path = 'bert_cpu.pt'
num_neuroncore = 1
num_thread = 8
batch_size = 1
The following results are obtained. Compared with the results in Section 5.3, no performance improvement is observed.
current throughput 0.0, latency p50=0.00000 p90=0.00000 p95=0.00000
current throughput 11.5, latency p50=0.62904 p90=0.93557 p95=0.97669
current throughput 12.9, latency p50=0.62257 p90=0.77736 p95=0.93456
current throughput 13.1, latency p50=0.60946 p90=0.74768 p95=0.81387
current throughput 13.0, latency p50=0.60979 p90=0.73267 p95=0.79491
current throughput 13.5, latency p50=0.60466 p90=0.72485 p95=0.77868
current throughput 13.7, latency p50=0.60152 p90=0.71854 p95=0.76954
current throughput 13.1, latency p50=0.60238 p90=0.71630 p95=0.75955
current throughput 13.0, latency p50=0.60148 p90=0.70095 p95=0.72949
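The same estimate applied to the CPU run (again, our own illustration): 8 threads each waiting roughly 0.60 seconds per request gives about 8 / 0.60 ≈ 13 inferences per second, around 20 times slower than the Neuron run above.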
Although it is not used in this workshop, Neuron 1.16.0 introduces a new API, torch.neuron.DataParallel(), for data-parallel inference on Inferentia.
For details, see the AWS Neuron documentation, Data Parallel Inference on Torch Neuron.
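As a rough sketch only (not taken from the workshop, and assuming the same bert_neuron.pt model and tokenizer used in this section), torch.neuron.DataParallel() wraps a Neuron-compiled TorchScript model and splits a batched input across the visible NeuronCores, so the explicit thread pool above is no longer needed:
import torch
import torch_neuron
from transformers import AutoTokenizer

# Load the Neuron-compiled TorchScript model saved earlier in this chapter
model = torch.jit.load('bert_neuron.pt')

# torch.neuron.DataParallel replicates the model across the available NeuronCores
# and splits the batch dimension of the inputs among them
model_parallel = torch.neuron.DataParallel(model)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
paraphrase = tokenizer.encode_plus(
    "The company HuggingFace is based in New York City",
    "HuggingFace's headquarters are situated in Manhattan",
    max_length=128, padding='max_length', truncation=True, return_tensors="pt")

# Build a batch of 4 identical examples; DataParallel dispatches them across the cores
batch = tuple(torch.cat([t] * 4, 0)
              for t in (paraphrase['input_ids'],
                        paraphrase['attention_mask'],
                        paraphrase['token_type_ids']))
logits = model_parallel(*batch)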