2021-03-27

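PyTorch execution speed: CUDA on Windows 10 vs WSL2

The Windows 10 Insider Program reportedly makes CUDA usable from WSL2, so I installed it following the guide below. qiita.com

I then measured how much the execution time differs between calling CUDA from Windows and calling it from WSL2.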

Environment

Windows10 Pro Insider Preview
- Dev Channel: Build 21343.rs_prerelease.210320-1757
32GB RAM
Intel Core i9-10900K
NVIDIA GeForce RTX 3070
PyTorch 1.7.1+cu110

Code

torch_test.py

#!python3
import torch
import torchvision.models as models
import argparse
import time

def run(use_gpu: bool, dtype: torch.dtype):
  # Prepare model and data
  model = models.resnet18()
  tensor = torch.rand((40, 3, 480, 640), dtype = dtype)

  # To GPU
  start_time_total = time.time()
  if use_gpu:
    start_time_gpu = time.time()
    model = model.cuda()
    tensor = tensor.cuda()
    elapsed_time_gpu = time.time() - start_time_gpu
    print('toGPU time = {:3f}[s] ({})'.format(
      elapsed_time_gpu,
      dtype
      )
    )

  # Process
  start_time_proc = time.time()
  _ = model(tensor)
  finish_time = time.time()
  elapsed_time_proc = finish_time - start_time_proc
  elapsed_time_total = finish_time - start_time_total
  
  print('{} proc time = {:3f}[s] ({}), total time = {:3f}[s]'.format(
    'GPU' if use_gpu else 'CPU',
    elapsed_time_proc,
    dtype,
    elapsed_time_total
    )
  )

if __name__ == '__main__':
  print('PyTorch %s' % torch.__version__)

  device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
  print('device: %s' % device)

  run(False, torch.float32)
  run(True, torch.float32)

Results

I ran each configuration three times.

Windows

>python torch_test.py
PyTorch 1.7.1+cu110
device: cuda:0
CPU proc time = 2.412968[s] (torch.float32), total time = 2.412968[s]
toGPU time = 1.028965[s] (torch.float32)
GPU proc time = 1.586547[s] (torch.float32), total time = 2.615512[s]

>python torch_test.py
PyTorch 1.7.1+cu110
device: cuda:0
CPU proc time = 2.465967[s] (torch.float32), total time = 2.465967[s]
toGPU time = 1.031965[s] (torch.float32)
GPU proc time = 1.586000[s] (torch.float32), total time = 2.617966[s]

>python torch_test.py
PyTorch 1.7.1+cu110
device: cuda:0
CPU proc time = 2.976967[s] (torch.float32), total time = 2.976967[s]
toGPU time = 1.005968[s] (torch.float32)
GPU proc time = 1.590999[s] (torch.float32), total time = 2.596968[s]

WSL2

$ python torch_test.py
PyTorch 1.7.1+cu110
device: cuda:0
CPU proc time = 2.276961[s] (torch.float32), total time = 2.276962[s]
toGPU time = 2.526399[s] (torch.float32)
GPU proc time = 0.220261[s] (torch.float32), total time = 2.746719[s]

$ python torch_test.py
PyTorch 1.7.1+cu110
device: cuda:0
CPU proc time = 2.329590[s] (torch.float32), total time = 2.329591[s]
toGPU time = 2.428853[s] (torch.float32)
GPU proc time = 0.219387[s] (torch.float32), total time = 2.648300[s]

$ python torch_test.py
PyTorch 1.7.1+cu110
device: cuda:0
CPU proc time = 2.325431[s] (torch.float32), total time = 2.325431[s]
toGPU time = 2.907129[s] (torch.float32)
GPU proc time = 0.219199[s] (torch.float32), total time = 3.126382[s]

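Conclusion

Overall, the total time was roughly the same.
Looking more closely, though, when using CUDA the time taken

when calling cuda()
when running inference

differed between the two environments (the sum of the two is nearly identical).
So one has to keep in mind that between Windows and WSL2 (or Linux in general?), the point at which the time is spent may differ depending on how the code is written...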
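
One caveat when reading the split between the two phases: CUDA work is queued asynchronously, and the first CUDA call in a process also pays for context initialization, so timestamps taken without synchronization do not necessarily attribute the time to the phase that caused it. Below is a minimal sketch of a timing helper with an optional synchronization hook. The names `timed`, `sync`, `warmup` and `repeats` are made up for this illustration and are not part of the script above.

```python
import time
from typing import Callable, List, Optional


def timed(fn: Callable[[], object],
          sync: Optional[Callable[[], None]] = None,
          warmup: int = 1,
          repeats: int = 3) -> List[float]:
    """Return wall-clock durations [s] of `repeats` calls to `fn`.

    `sync` is called before starting and before stopping the clock so that
    queued asynchronous work is included in the measurement.
    """
    def _sync() -> None:
        if sync is not None:
            sync()

    for _ in range(warmup):      # untimed runs absorb one-time setup costs
        fn()
    durations = []
    for _ in range(repeats):
        _sync()                  # wait for earlier work before starting
        start = time.perf_counter()
        fn()
        _sync()                  # wait for the work launched by fn()
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # CPU-only demonstration; no GPU or PyTorch is needed to run this sketch.
    print(timed(lambda: sum(range(100_000))))
```

With the script above, one would presumably pass `sync=torch.cuda.synchronize` and `fn=lambda: model(tensor)`. The measurements in this post were not taken this way.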

2020-05-09

OpenCL (VC4CL) on the Raspberry Pi 4B GPU

Raspberry Pi OpenCL VC4CL

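I learned that the Raspberry Pi's video-output GPU can be used from OpenCL, so I tried it out of curiosity.
To give away the result: when I ran a matrix multiplication sample, the CPU was slightly faster than the GPU...

Materials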

Raspberry Pi 4B (4GB)
- 512[MB] of GPU memory already allocated, following the page below
  - Raspberry Piの動画や３Dなどのグラフィック表示をスムーズにするOpenGLとGPUメモリ│FABSHOP.JP -デジタルでものづくり！ファブショップ！

- I meant to use a 64-bit OS but had installed 32-bit Raspbian

$ getconf LONG_BIT
32

$ lsb_release -a
No LSB modules are available.
Distributor ID: Raspbian
Description:    Raspbian GNU/Linux 10 (buster)
Release:        10
Codename:       buster

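Reference sites

Raspberry PiのGPUをOpenCLで使う方法
Raspberry PiでOpenCLを触ってみた
- These sites use a Raspberry Pi Zero W or a Raspberry Pi 2B
- They also install a deb package, but in my environment that did not work at all, even after running the usual incantation sudo apt --fix-broken install
- They introduce a repository for test-running OpenCL, which I also try in this post.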

Build & install

So I build what is needed locally.
Specifically, I follow the "VC4CL installation" section of the site below.

libllvm could not be found, so I installed libllvm7 instead
VC4CL: Raspberry Pi OpenCL Implementation - AbhiTronix-Verse

Verification

$ sudo clinfo

Number of platforms                               1
  Platform Name                                   OpenCL for the Raspberry Pi VideoCore IV GPU
  Platform Vendor                                 doe300
  Platform Version                                OpenCL 1.2 VC4CL 0.4
  Platform Profile                                EMBEDDED_PROFILE
  Platform Extensions                             cl_khr_il_program cl_khr_spir cl_altera_device_temperature cl_altera_live_object_tracking cl_khr_icd cl_vc4cl_performance_counters
  Platform Extensions function suffix             VC4CL

  Platform Name                                   OpenCL for the Raspberry Pi VideoCore IV GPU
Number of devices                                 1
  Device Name                                     VideoCore IV GPU
  Device Vendor                                   Broadcom
  Device Vendor ID                                0xa5c
  Device Version                                  OpenCL 1.2 VC4CL 0.4
  Driver Version                                  0.4
  Device OpenCL C Version                         OpenCL C 1.2
  Device Type                                     GPU
  Device Profile                                  EMBEDDED_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               1
  Max clock frequency                             500MHz
  Device Partition                                (core)
    Max number of sub-devices                     0
    Supported partition types                     None
    Supported affinity domains                    (n/a)
  Max work item dimensions                        3
  Max work item sizes                             12x12x12
  Max work group size                             12
  Preferred work group size multiple              1
  Preferred / native vector sizes
    char                                                16 / 16
    short                                               16 / 16
    int                                                 16 / 16
    long                                                 0 / 0
    half                                                 0 / 0        (n/a)
    float                                               16 / 16
    double                                               0 / 0        (n/a)
(rest omitted)

CPU vs GPU speed comparison

sudo is required when using VC4CL

$ git clone https://github.com/HandsOnOpenCL/Exercises-Solutions.git
$ cd Exercises-Solutions/Solutions/Exercise06/C
$ make
$ sudo ./mult
Using OpenCL device: VideoCore IV GPU

===== Sequential, matrix mult (dot prod), order 1024 on host CPU ======
 25.80 seconds at 83.3 MFLOPS

===== OpenCL, matrix mult, C(i,j) per work item, order 1024 ======
 30.00 seconds at 71.6 MFLOPS

 Errors in multiplication: 246368803225600.000000
terminate called without an active exception
中止

The CPU was slightly faster.
The program also aborted partway through; I have not tracked down the cause yet.
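
As a sanity check on the reported throughput, the figures are consistent with counting 2·N³ floating-point operations for an N×N matrix product (one multiply and one add per inner-loop step). That operation count is my assumption about how the sample derives MFLOPS, and `mflops` is a helper named only for this sketch.

```python
def mflops(order: int, seconds: float) -> float:
    """MFLOPS for an order x order matrix product, assuming 2*N^3 operations."""
    return 2.0 * order ** 3 / seconds / 1e6


# Times taken from the log above.
for label, seconds in (("CPU", 25.80), ("GPU", 30.00)):
    print(f"{label}: {mflops(1024, seconds):.1f} MFLOPS")
```

The GPU figure reproduces the logged 71.6 MFLOPS. The CPU figure lands within 0.1 of the logged 83.3, presumably because the printed time is rounded.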

Digging further

I found an article that discusses the Raspberry Pi 4 GPU.

Idein Ideas — GPGPUの観点から見る VideoCore VI と VideoVore IV の違い

The GPU apparently changed with the RPi 4 (VideoCore VI, VC6), while the library installed in this post targets the RPi 3 and earlier (VideoCore IV, VC4).
A library for VC6 is awaited...

2020-05-08

Getting started with Keras on Google Colab (implementing, training, and running inference with SegNet-Basic)

Deep Learning Keras SegNet Python

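I got started with deep learning about four years ago and then did not touch it for a while, but with time on my hands during the COVID-19 stay-at-home period, I gave it another try.
These are my notes on the procedure in Google Colab. What I tried is SegNet-Basic, a lightweight version of SegNet.

(1) links to (2) as its source of information on SegNet-Basic, but the how-to in (2) was left unfinished, so I implemented it in Keras with the steps below while referring to the page in (2), and it worked.

As for why SegNet-Basic rather than SegNet: I did try SegNet as well, but on Google Colab it kept crashing because of resource limits, so I simply gave up.

Checking the environment

This post was run in the following environment.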

import tensorflow as tf
import keras

print("tf.__version__ is", tf.__version__)
print("tf.keras.__version__ is ", tf.keras.__version__)
print("keras.__version__ is ", keras.__version__)

Using TensorFlow backend.
tf.__version__ is 2.2.0-rc4
tf.keras.__version__ is  2.3.0-tf
keras.__version__ is  2.3.1

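Preparation

Creating a new notebook

"File" > "New notebook"
Here I named it SegNetBasic.ipynb

Mounting Google Drive

Mount Google Drive in Google Colab
- This can be done from "Files" at the left edge
- It should end up looking like this

f:id:presan:20200508144101p:plain — Google Drive mounted

Copy the data in the SegNet repository published at the link below (everything under the CamVid folder) to Google Drive.
- Here I created a folder named SegNet in drive/My Drive/Colab Notebooks/ and copied CamVid into it.
Create the following folders under the SegNet folder
- data
- models
- weights
- valid (bonus)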
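
The folder layout above can also be created from a notebook cell. A minimal sketch follows. `make_segnet_dirs` is a hypothetical helper, and the root path in the usage comment assumes the Drive is mounted at /content/drive, as in the paths used later in this post.

```python
import os
from typing import List

SUBDIRS = ("data", "models", "weights", "valid")


def make_segnet_dirs(root: str) -> List[str]:
    """Create the SegNet working folders under `root` and return their paths."""
    created = []
    for name in SUBDIRS:
        path = os.path.join(root, name)
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
        created.append(path)
    return created


# In Colab, after mounting Drive (for example with google.colab's drive.mount):
# make_segnet_dirs('/content/drive/My Drive/Colab Notebooks/SegNet')
```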

Preparing the model

Save the following as model.py under the SegNet folder.

SegNetBasic is implemented with reference to the SegNet implementation on the page below.
- segnetをKerasで実装してみた – 情報学生のプログラム・趣味日記

import keras.models as models
from keras.models import Model
from keras.layers import Input
from keras.layers.core import Layer, Dense, Dropout, Activation, Flatten, Reshape, Permute
from keras.layers.convolutional import Convolution2D, MaxPooling2D, UpSampling2D, ZeroPadding2D
# BatchNormalization is used below, so model.py has to import it as well
from keras.layers.normalization import BatchNormalization
from keras.callbacks import ModelCheckpoint

def buildSegnetBasicModel(input_shape, n_labels, kernel=3, pool_size=(2, 2), pad=1, output_mode="softmax"):
    # encoder
    inputs = Input(shape=input_shape)

    conv_1 = ZeroPadding2D(padding=(pad,pad))(inputs)
    conv_1 = Convolution2D(64, (kernel, kernel), padding="valid")(conv_1)
    conv_1 = BatchNormalization()(conv_1)
    conv_1 = Activation("relu")(conv_1)

    pool_1 = MaxPooling2D(pool_size)(conv_1)

    conv_2 = ZeroPadding2D(padding=(pad,pad))(pool_1)
    conv_2 = Convolution2D(128, (kernel, kernel), padding="valid")(conv_2)
    conv_2 = BatchNormalization()(conv_2)
    conv_2 = Activation("relu")(conv_2)

    pool_2 = MaxPooling2D(pool_size)(conv_2)

    conv_3 = ZeroPadding2D(padding=(pad,pad))(pool_2)
    conv_3 = Convolution2D(256, (kernel, kernel), padding="valid")(conv_3)
    conv_3 = BatchNormalization()(conv_3)
    conv_3 = Activation("relu")(conv_3)
    
    pool_3 = MaxPooling2D(pool_size)(conv_3)

    conv_4 = ZeroPadding2D(padding=(pad,pad))(pool_3)
    conv_4 = Convolution2D(512, (kernel, kernel), padding="valid")(conv_4)
    conv_4 = BatchNormalization()(conv_4)
    conv_4 = Activation("relu")(conv_4)

    print("Build SegNet-Basic encoder done..")

    # decoder
    conv_5 = ZeroPadding2D(padding=(pad,pad))(conv_4)
    conv_5 = Convolution2D(512, (kernel, kernel), padding="valid")(conv_5)
    conv_5 = BatchNormalization()(conv_5)

    unpool_1 = UpSampling2D(pool_size)(conv_5)

    conv_6 = ZeroPadding2D(padding=(pad,pad))(unpool_1)
    conv_6 = Convolution2D(256, (kernel, kernel), padding="valid")(conv_6)
    conv_6 = BatchNormalization()(conv_6)
    
    unpool_2 = UpSampling2D(pool_size)(conv_6)

    conv_7 = ZeroPadding2D(padding=(pad,pad))(unpool_2)
    conv_7 = Convolution2D(128, (kernel, kernel), padding="valid")(conv_7)
    conv_7 = BatchNormalization()(conv_7)

    unpool_3 = UpSampling2D(pool_size)(conv_7)

    conv_8 = ZeroPadding2D(padding=(pad,pad))(unpool_3)
    conv_8 = Convolution2D(64, (kernel, kernel), padding="valid")(conv_8)
    conv_8 = BatchNormalization()(conv_8)

    conv_9 = Convolution2D(n_labels, (1, 1), padding="valid")(conv_8)
    conv_9 = Reshape(
        (input_shape[0] * input_shape[1], n_labels),
        input_shape=(input_shape[0], input_shape[1], n_labels),
    )(conv_9)

    outputs = Activation(output_mode)(conv_9)
    
    print("Build SegNet-Basic decoder done..")

    model = Model(inputs=inputs, outputs=outputs, name="SegNetBasic")

    return model

Preparing the data

Run the following to convert the image data to NumPy binary format (.npy). Six npy files, about 4.5GB in total, should be generated in SegNet/data.

import cv2
import numpy as np

from helper import *

import os
import gc

# Copy the data to this dir here in the SegNet project /CamVid from here:
# https://github.com/alexgkendall/SegNet-Tutorial
RootPath = 'drive/My Drive/Colab Notebooks/SegNet'
DataPath = 'drive/My Drive/Colab Notebooks/SegNet/CamVid/'
data_shape = 360*480

def normalized(rgb):
    return rgb / 255.

def one_hot_it(labels):
    w, h = labels.shape[:2]
    x = np.zeros([w,h,12], dtype=np.uint8)
    for i in range(w):
        for j in range(h):
            x[i,j,labels[i][j]]=1
    return x

def load_data(mode):
    data = []
    label = []
    with open(DataPath + mode +'.txt') as f:
        txt = f.readlines()
        txt = [line.split(' ') for line in txt]
    for i in range(len(txt)):
        datapath = RootPath + txt[i][0][7:]
        print('(', i, '/', len(txt), ') Loading data: ', datapath)
        img = cv2.imread(datapath)
        data.append(np.rollaxis(normalized(img),2))

        labelpath = RootPath + txt[i][1][7:][:-1]
        print('(', i, '/', len(txt), ') Loading label: ', labelpath)
        img = cv2.imread(labelpath)
        label.append(one_hot_it(img[:,:,0]))
    return np.array(data), np.array(label)



train_data, train_label = load_data("train")
train_label = np.reshape(train_label,(367,data_shape,12))
np.save(RootPath + "/data/train_data", train_data)
np.save(RootPath + "/data/train_label", train_label)
del train_data
del train_label
gc.collect()

test_data, test_label = load_data("test")
test_label = np.reshape(test_label,(233,data_shape,12))
np.save(RootPath + "/data/test_data", test_data)
np.save(RootPath + "/data/test_label", test_label)
del test_data
del test_label
gc.collect()

val_data, val_label = load_data("val")
val_label = np.reshape(val_label,(101,data_shape,12))
np.save(RootPath + "/data/val_data", val_data)
np.save(RootPath + "/data/val_label", val_label)
del val_data
del val_label
gc.collect()
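
The double loop in one_hot_it visits every pixel in Python, which is slow for 360x480 labels. A vectorized alternative is sketched below, under the assumption that every label value lies in 0..11: it indexes an identity matrix with the label image. `one_hot_vectorized` is a name introduced here for illustration.

```python
import numpy as np


def one_hot_vectorized(labels: np.ndarray, n_classes: int = 12) -> np.ndarray:
    """Return an (H, W, n_classes) uint8 one-hot encoding of an (H, W) label image."""
    # Row k of the identity matrix is the one-hot vector for class k,
    # so fancy indexing with the label image builds the whole array at once.
    return np.eye(n_classes, dtype=np.uint8)[labels]


labels = np.array([[0, 11], [3, 3]])
encoded = one_hot_vectorized(labels)
print(encoded.shape)  # (2, 2, 12)
```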

Training

Building the model

Run the following

# Reference:
# https://qiita.com/cyberailab/items/d11862852eccc17585e8
# https://github.com/0bserver07/Keras-SegNet-Basic

import keras.models as models
from keras.models import Model
from keras.layers import Input
from keras.layers.core import Layer, Dense, Dropout, Activation, Flatten, Reshape, Permute
from keras.layers.convolutional import Convolution2D, MaxPooling2D, UpSampling2D, ZeroPadding2D
from keras.layers.normalization import BatchNormalization
from keras.callbacks import ModelCheckpoint

import cv2
import numpy as np
import matplotlib.pyplot as plt
import time

# Import SegNet/SegNetBasic modules
import sys
RootPath = '/content/drive/My Drive/Colab Notebooks/SegNet/'
sys.path.append(RootPath)
from model import buildSegnetBasicModel  # the model.py shown above defines only this builder

# Start time
start_time = time.time()

# Fix seed for reproducibility
np.random.seed(0)

# Parameters
class_weighting= [0.2595, 0.1826, 4.5640, 0.1417, 0.5051, 0.3826, 9.6446, 1.8418, 6.6823, 6.2478, 3.0, 7.3614]
input_shape = (360, 480, 3)
n_labels =  12
kernel = 3
pool_size = 2
pad = 1
output_mode = 'softmax'
data_shape = input_shape[0] * input_shape[1]

# load the model:
print("Building model...")
model_segnet = buildSegnetBasicModel(input_shape, n_labels, kernel, pool_size, pad, output_mode)
print("done.")

print("Compiling model...")
model_segnet.compile(loss="categorical_crossentropy", optimizer='adadelta', metrics=["accuracy"])
print("done.")

# Visualize model
model_segnet.summary()

# Calculate elapsed time
model_load_time = time.time()
print('Elapsed time: ', (model_load_time - start_time), '[s]')

Loading the npy data

Run the following. It takes about 5 to 10 minutes.

# load the data
print("Loading data...")
print("- Loading train_data...")
train_data = np.load(RootPath + 'data/train_data.npy').transpose((0, 2, 3, 1)) # NCHW to NHWC
print("  -> shape", train_data.shape)
print("- Loading train_label...")
train_label = np.load(RootPath + 'data/train_label.npy')
print("  -> shape", train_label.shape)
print("- Loading test_data...")
test_data = np.load(RootPath + 'data/test_data.npy').transpose((0, 2, 3, 1)) # NCHW to NHWC
print("  -> shape", test_data.shape)
print("- Loading test_label...")
test_label = np.load(RootPath + 'data/test_label.npy')
print("  -> shape", test_label.shape)
print("done.")

# Calculate elapsed time
data_load_time = time.time()
print('Elapsed time: ', (data_load_time - model_load_time), '[s]')

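Starting training

Selecting the runtime

Choose GPU or TPU under "Runtime" > "Change runtime type"

Running the training

Run the following.

Along the way, it may crash for lack of memory with the standard memory allocation (12.5[GB]).
- In that case, gratefully accept the doubled 25[GB] allocation from Google and run it again.
- 25[GB] is enough
It depends on the resources allocated at the time, but on one day with the GPU runtime selected, one epoch took about 110[s].
- That is, about 3[h] to train 100 epochs (110[s] x 100 = 11,000[s])
- With batch_size doubled to 12, sometimes 100 epochs succeeded at 30[s] per epoch, and sometimes it crashed with ResourceExhaustedError
Each time the best result so far is reached, the weights are overwritten in the SegNet/weights folder. (checkpoint)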

# Parameter
nb_epoch = 100
batch_size = 6

# checkpoint
print("Defining callbacks...")
filepath = RootPath + "weights/segnet_weights.best.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')
callbacks_list = [checkpoint]
print("done.")

# Calculate elapsed time
def_cb_time = time.time()
print('Elapsed time: ', (def_cb_time - data_load_time), '[s]')
print('-----------------')

# Fit the model
print("Fitting model...")
hist = model_segnet.fit(train_data, train_label, callbacks=callbacks_list, batch_size=batch_size, epochs=nb_epoch,
                    verbose=2, class_weight=class_weighting, validation_data=(test_data, test_label), shuffle=True) # validation_split=0.33
print("done.")

# Calculate elapsed time
fit_model_time = time.time()
print('Elapsed time: ', (fit_model_time - def_cb_time), '[s]')
print('-----------------')

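Saving the model and weights

Run the following to save the model and the weights from epoch 100 to files for use at inference time. They are saved in the models and weights folders under the SegNet folder, respectively.
That said, rather than the epoch-100 weights, the best result so far (checkpoint) should be used
- Why do the checkpoint file and the epoch-100 weight file differ so much in size? A likely reason is that ModelCheckpoint, with its default save_weights_only=False, saves the whole model including the optimizer state, whereas save_weights stores only the weights.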

# Save the trained model, and its weights to a file named after the number of epochs
print("Saving model and weights...")
model_segnet.save(RootPath + 'models/segnet_model.hdf5')
model_segnet.save_weights(RootPath + 'weights/segnet_weight_{}.hdf5'.format(nb_epoch))
print("done.")

# Calculate elapsed time
fit_model_time = time.time()
print('Elapsed time: ', (fit_model_time - data_load_time), '[s]')
print('-----------------')

Visualizing loss/accuracy

Run the following to plot how loss and accuracy change over the epochs.

# Visualize
epochs = range(1, len(hist.history['accuracy']) + 1)

plt.plot(epochs, hist.history['loss'], label='Training loss', ls='-')
plt.plot(epochs, hist.history['val_loss'], label='Validation loss')
plt.title('Training and Validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

plt.plot(epochs, hist.history['accuracy'],  label='Training acc')
plt.plot(epochs, hist.history['val_accuracy'], label="Validation acc")
plt.title('Training and Validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

Training results

Validation accuracy reached about 80% at its maximum

f:id:presan:20200508141014p:plain — Loss

f:id:presan:20200508141032p:plain — Accuracy

Inference

Loading the model and weights

Run the following

from keras.models import load_model
from google.colab import files
from PIL import Image
import cv2
import numpy as np
import time

RootPath = '/content/drive/My Drive/Colab Notebooks/SegNet/'

# Parameter
input_shape = (360, 480, 3)

# Start time
start_time = time.time()

# Load model and weights
print('Loading model and weights...')
model_segnet = load_model(RootPath + 'models/segnet_model.hdf5')
model_segnet.load_weights(RootPath + 'weights/segnet_weights.best.hdf5')
print('done')

# Calculate elapsed time
model_load_time = time.time()
print('Elapsed time: ', (model_load_time - start_time), '[s]')

Loading the validation data

Run the following

val_data = np.load(RootPath + '/data/val_data.npy').transpose((0, 2, 3, 1))  # NCHW to NHWC
val_label = np.load(RootPath + '/data/val_label.npy')
print('done')

Evaluation

Run the following

batch_size = 12

# estimate accuracy on whole dataset using loaded weights
scores = model_segnet.evaluate(val_data, val_label, verbose=0, batch_size=batch_size)
print("%s: %.2f%%" % (model_segnet.metrics_names[1], scores[1]*100))

Evaluation result

accuracy: 87.45%

Visualization

Run the following.
Doing this for all of the validation data would be too much, so at most 10 samples are visualized.

import matplotlib.pyplot as plt

Sky = [128,128,128]
Building = [128,0,0]
Pole = [192,192,128]
Road_marking = [255,69,0]
Road = [128,64,128]
Pavement = [60,40,222]
Tree = [128,128,0]
SignSymbol = [192,128,128]
Fence = [64,64,128]
Car = [64,0,128]
Pedestrian = [64,64,0]
Bicyclist = [0,128,192]
Unlabelled = [0,0,0]

label_colours = np.array([Sky, Building, Pole, Road, Pavement,
                          Tree, SignSymbol, Fence, Car, Pedestrian, Bicyclist, Unlabelled])

def visualize(temp, plot=True):
    r = temp.copy()
    g = temp.copy()
    b = temp.copy()
    for l in range(len(label_colours)):  # cover every class, including Unlabelled
        r[temp==l]=label_colours[l,0]
        g[temp==l]=label_colours[l,1]
        b[temp==l]=label_colours[l,2]

    rgb = np.zeros((temp.shape[0], temp.shape[1], 3))
    rgb[:,:,0] = (r/255.0)#[:,:,0]
    rgb[:,:,1] = (g/255.0)#[:,:,1]
    rgb[:,:,2] = (b/255.0)#[:,:,2]
    if plot:
        plt.imshow(rgb)
    else:
        return rgb

img_size = (input_shape[0], input_shape[1])

# Start time
start_time = time.time()

# Predict
print('Predicting...')
output = model_segnet.predict(val_data)
print('done')

# Calculate elapsed time
end_time = time.time()
print('Elapsed time: ', (end_time - start_time), '[s]')
print('-----------------')

# Visualize
count = min([10, len(output)])
for i in range(count):
  pred_class = np.argmax(output[i], axis=1).reshape(img_size)
  img_ret = visualize(pred_class, False)
  plt.figure(i * 2)
  plt.imshow(val_data[i][:, :, ::-1])  # cv2.imread gives BGR; reverse the channels for display
  plt.figure(i * 2 + 1)
  plt.imshow(img_ret)
plt.show()
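
The per-class loop in visualize can also be written as a single lookup, because indexing the colour table with the class map returns one RGB triple per pixel. A sketch follows, assuming the class indices are valid rows of the colour table. `colorize` is a hypothetical name, and the three-entry palette is a toy stand-in for label_colours.

```python
import numpy as np


def colorize(class_map: np.ndarray, palette: np.ndarray) -> np.ndarray:
    """Map an (H, W) array of class indices to an (H, W, 3) float RGB image in [0, 1]."""
    return palette[class_map] / 255.0


palette = np.array([[128, 128, 128], [128, 0, 0], [0, 0, 0]])  # toy 3-class table
rgb = colorize(np.array([[0, 1], [2, 0]]), palette)
print(rgb.shape)  # (2, 2, 3)
```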

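Visualization results

The segmentation results look plausible

f:id:presan:20200508142440p:plain — SegNetBasic result

(Bonus) Visualizing my own images

Save 480x360 RGB images named test0.png through test8.png in the SegNet/valid folder
Run the following
- This loads my own images into the variable val_data, which until now held the validation data from the public SegNet repository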

val_data = []
for i in range(9):
  # Load image
  img_path = RootPath + 'valid/test{}.png'.format(i)
  print('Loading ', img_path)
  img = cv2.imread(img_path)
  if img is None:
    print('Failed to load ', img_path)
    continue
  #img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
  print('done')
  val_data.append(img)

val_data = np.array(val_data)

Run the visualization code shown above

Visualization results

Hmm...

f:id:presan:20200508143234p:plain — SegNetBasic result
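
One difference from the training pipeline may be worth checking here: the training images were scaled by 1/255 in normalized() before being saved, whereas the loading loop for my own images appends the raw uint8 pixels. A sketch of applying the same scaling follows. `preprocess` is a hypothetical helper, and whether this improves the result shown here has not been verified.

```python
import numpy as np


def preprocess(img: np.ndarray) -> np.ndarray:
    """Scale an 8-bit image to [0, 1] floats, as normalized() does for the training data."""
    return img.astype(np.float64) / 255.0


dummy = np.full((360, 480, 3), 255, dtype=np.uint8)  # stand-in for a cv2.imread result
batch = np.array([preprocess(dummy)])
print(batch.shape, batch.max())
```

In the loading loop this would correspond to val_data.append(preprocess(img)). The channel order stays BGR, matching what cv2.imread produced for the training data.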

2019-09-22

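Forcing the mouse cursor to a fixed position when you lose track of it

Windows

With multiple displays I often lose track of the mouse cursor, and wiggling the mouse left and right every time to find where it is on screen is a waste of time.

So I wrote a bat file that moves the mouse cursor to a fixed position on the screen. Put it somewhere on the PATH, and pressing the Win key and typing the bat name gets the cursor back immediately.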

@powershell -NoProfile -ExecutionPolicy Unrestricted "$s=[scriptblock]::create((gc \"%~f0\"|?{$_.readcount -gt 1})-join\"`n\");&$s" %*&goto:eof

add-type -AssemblyName System.Windows.Forms
[System.Windows.Forms.Cursor]::Position = new-object System.Drawing.Point(960, 540)

References
- PowerShell - PowerShellでマウス操作は実現できるのでしょうか？｜teratail
- バッチファイルから PowerShell を呼び出す方法 - Qiita

2019-07-20

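Details of the cv::cornerSubPix (sub-pixel estimation) algorithm

OpenCV Image Processing

OpenCV has a function called cv::cornerSubPix, but the explanation of its principle is very terse and hard to follow.

特徴検出 — opencv 2.2 documentation

This post summarizes the meaning of the three equations given on the page above.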

$\displaystyle{\epsilon_{i}=DI^{T}_{p_{i}}\cdot(q-p_{i})}$
$\displaystyle{\sum_{i} (DI_{p_{i}}\cdot DI^{T}_{p_{i}})-\sum_{i} (DI_{p_{i}}\cdot DI^{T}_{p_{i}}\cdot p_{i})}$
$\displaystyle{q=G^{-1}\cdot b}$

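About Equation 1

This corresponds to the statement on the OpenCV page above that every vector from $q$ to a point $p$ is orthogonal to the image gradient at $p$. The article below explained this well.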

Fundamentals of Features and Corners: Subpixel Corners: Increasing accuracy - AI Shack

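About Equation 2

The jump from Equation 1 to Equation 2 is too abrupt, and I think several things are wrong.

I think "by setting $\epsilon_{i}$ to zero" (in the English version, "A system of equations may be set up with $\epsilon_{i}$ set to zero") is not right
$q$ has suddenly vanished from Equation 2, which I think is a mistake

I think the following reasoning is omitted between Equation 1 and Equation 2. In a word, it is "minimization of the squared error".

Details

Let $E$ be the sum of the squares of $\epsilon_{i}$, defined by Equation 1, over all points in the search window.

$\displaystyle{ E=\sum_{i} \epsilon_{i}^{2} }$

Substituting Equation 1 into this gives

$\displaystyle{ E=\sum_{i} \left(DI^{T}_{p_{i}}\cdot(q-p_{i})\right)^{2} }$

We want the $q$ that minimizes this sum $E$, that is, the $q$ at which the derivative $dE/dq$ becomes zero. Hence,

$\displaystyle{ \frac{dE}{dq}=2\sum_{i} \left(DI^{T}_{p_{i}}\cdot(q-p_{i})\right)DI^{T}_{p_{i}}=0 }$

Dropping the constant factor 2, transposing, and using $(a\cdot b)a=(aa^T)b$ with $a=DI_{p_{i}}$ and $b=q-p_{i}$, this can be rewritten as

$\displaystyle{ \sum_{i} DI_{p_{i}}\cdot DI^{T}_{p_{i}}\cdot(q-p_{i})=0 }$

Expanding this gives

$\displaystyle{ \sum_{i} DI_{p_{i}}\cdot DI^{T}_{p_{i}}\cdot q - \sum_{i} DI_{p_{i}}\cdot DI^{T}_{p_{i}}\cdot p_{i}=0 }$ 　(★)

The left-hand side of this is Equation 2. To repeat, I think $q$ ought to be written there.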

About Equation 3

The terms "first gradient term" $G$ and "second gradient term" $b$ appear out of nowhere, but the following reasoning is omitted.

Details

From Equation (★) above,

$\displaystyle{ \sum_{i} DI_{p_{i}}\cdot DI^{T}_{p_{i}}\cdot q = \sum_{i} DI_{p_{i}}\cdot DI^{T}_{p_{i}}\cdot p_{i} }$

Since $q$ does not depend on the summation index $i$, it can be pulled out of the sum:

$\displaystyle{ \left(\sum_{i} DI_{p_{i}}\cdot DI^{T}_{p_{i}}\right) \cdot q = \sum_{i} DI_{p_{i}}\cdot DI^{T}_{p_{i}}\cdot p_{i} }$

Here, letting

$\displaystyle{ G=\sum_{i} DI_{p_{i}}\cdot DI^{T}_{p_{i}} }$

$\displaystyle{ b=\sum_{i} DI_{p_{i}}\cdot DI^{T}_{p_{i}}\cdot p_{i} }$

the equation above can be rewritten as

$\displaystyle{ G\cdot q=b }$

Hence the $q$ we are after is

$\displaystyle{ q=G^{-1}\cdot b }$

which is Equation 3.
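
As a numerical check of the derivation, the sketch below builds $G$ and $b$ from synthetic points whose gradients are orthogonal to $q-p_{i}$ for a known $q$, then solves $G\cdot q=b$ with the closed-form 2x2 inverse. `refine_corner` and the sample data are made up for this illustration. It is only the closed-form step derived above, not OpenCV's implementation.

```python
from typing import Sequence, Tuple

Vec2 = Tuple[float, float]


def refine_corner(points: Sequence[Vec2], gradients: Sequence[Vec2]) -> Vec2:
    """Solve G q = b, with G = sum(g g^T) and b = sum(g g^T p), in 2-D."""
    g11 = g12 = g22 = 0.0
    b1 = b2 = 0.0
    for (px, py), (gx, gy) in zip(points, gradients):
        g11 += gx * gx
        g12 += gx * gy
        g22 += gy * gy
        b1 += gx * gx * px + gx * gy * py
        b2 += gx * gy * px + gy * gy * py
    det = g11 * g22 - g12 * g12
    if abs(det) < 1e-12:
        raise ValueError("G is singular; the gradients do not pin down q")
    # Closed-form inverse of the symmetric 2x2 matrix G applied to b
    qx = (g22 * b1 - g12 * b2) / det
    qy = (g11 * b2 - g12 * b1) / det
    return qx, qy


# Two edges meeting at q = (3.25, 4.5): points on the horizontal edge have a
# vertical gradient, points on the vertical edge a horizontal one, and a
# flat-region point has zero gradient and therefore contributes nothing.
points = [(1.0, 4.5), (5.0, 4.5), (3.25, 2.0), (3.25, 7.0), (0.0, 0.0)]
gradients = [(0.0, 2.0), (0.0, -1.0), (1.5, 0.0), (-3.0, 0.0), (0.0, 0.0)]
print(refine_corner(points, gradients))
```

Every sample satisfies $DI^{T}_{p_{i}}\cdot(q-p_{i})=0$ for the chosen $q$, so the solver should return that $q$ up to floating-point error.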

Bonus

Incidentally, when computed, $G=\sum_{i} DI_{p_{i}}\cdot DI^{T}_{p_{i}}$ works out approximately as follows, and this is called the structure tensor matrix.

Letting $DI_{p_{i}}=(I_{x_{p_{i}}}, I_{y_{p_{i}}})^{T}$,

$\displaystyle{ G=\sum_{i} \begin{pmatrix} I_{x_{p_{i}}}I_{x_{p_{i}}} & I_{x_{p_{i}}}I_{y_{p_{i}}} \\ I_{x_{p_{i}}}I_{y_{p_{i}}} & I_{y_{p_{i}}}I_{y_{p_{i}}} \\ \end{pmatrix} \approx \sum_{i} \begin{pmatrix} I_{xx_{p_{i}}} & I_{x_{p_{i}}}I_{y_{p_{i}}} \\ I_{x_{p_{i}}}I_{y_{p_{i}}} & I_{yy_{p_{i}}} \\ \end{pmatrix} }$

2019-06-22

Fixing the problem where the C, V, and H keys cannot be typed (When C, V, H keys don't work...)

Windows

TL;DR (English)

The cause may differ from one environment to another, but in my case the "Right Windows" key was being treated as held down.
- Of course, my keyboard has no such key, so I cannot have pressed it.
To solve this problem, simulate releasing the "R Win" key (VK_RWIN) with some tool.
- You can write one yourself using the Win32 API keybd_event()
My guess is that the problem is caused by a keyboard malfunction.

TL;DR (Japanese)

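The cause of this problem may differ from person to person, but in my case it was that the "Right Windows" key was being detected as held down.
- Incidentally, the keyboard I use has only a "Left Windows" key, and naturally I have no recollection of pressing a "Right Windows" key.
  - These days, keyboards with a "Right Windows" key are probably the rarer kind
- The software keyboard that ships with Windows also has only a "Left Windows" key, so it cannot fix this.
I dealt with it by using a tool that simulates keyboard input to make the "Right Windows" key count as released.
- For example, have a keyboard simulator send WM_KEYUP for VK_RWIN (0x5C)
- PowerShell
(Probably) the cause is a keyboard malfunction. I use a wireless keyboard, so perhaps it transmitted the wrong signal... this is pure speculation.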

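Details

For quite a long time I was troubled by the following problems:

The C, V, and H keys cannot be typed into text input fields
Pressing the C, V, H, or A keys triggers strange behaviour
etc.

Searching the net, it seems people all over the world have similar symptoms.
Typing "c v h" into the search box suggests that other people are searching for it too.

f:id:presan:20190622222311p:plain — c v h suggestion

There are in fact Q&As on several question sites, but the cause seems to differ from person to person, and all sorts of remedies are listed.

For what it's worth, none of the remedies that turned up in searches improved things for me.
Until now, if I put up with it for a while, it would somehow fix itself.
(When I could not bear to wait for that, I would log off or reboot.)

I decided to look into the cause a bit more seriously, hypothesized that some special key was stuck in the pressed state, and checked the key input state; the hypothesis was right.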
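
A stuck Windows key would also explain the symptoms, since Win+C, Win+V, Win+H and Win+A are Windows shortcuts. The check and the fix could be sketched as follows, using the Win32 functions GetAsyncKeyState and keybd_event through ctypes. The function names `is_key_down`, `release_key` and `fix_stuck_rwin` are invented for this example, and this is not necessarily the tool I actually used.

```python
import ctypes
import sys

VK_RWIN = 0x5C            # virtual-key code of the right Windows key
KEYEVENTF_KEYUP = 0x0002  # keybd_event flag meaning "key released"


def is_key_down(vk: int) -> bool:
    """Return True if Windows currently reports the key as held down."""
    get_state = ctypes.windll.user32.GetAsyncKeyState
    get_state.restype = ctypes.c_short       # the API returns a 16-bit SHORT
    return bool(get_state(vk) & 0x8000)      # the top bit means "currently down"


def release_key(vk: int) -> None:
    """Inject a key-up event for the given virtual-key code."""
    ctypes.windll.user32.keybd_event(vk, 0, KEYEVENTF_KEYUP, 0)


def fix_stuck_rwin() -> bool:
    """Release VK_RWIN if it is stuck; return True if a key-up was sent."""
    if sys.platform != "win32":
        return False  # the Win32 calls above exist only on Windows
    if is_key_down(VK_RWIN):
        release_key(VK_RWIN)
        return True
    return False


if __name__ == "__main__":
    print("key-up sent" if fix_stuck_rwin() else "nothing to do")
```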

2019-06-18

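Key points of Direct Sparse Odometry (DSO)

C++ DSO SLAM Image Processing SFM

A past entry covered how to install it on Windows; this time I summarize what I understood from the paper.
I keep the fine mathematical detail to a minimum and just summarize the key points as I see them.
# There may be mistakes, so I intend to correct this as I go.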

Paper

[1607.02565] Direct Sparse Odometry

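Conventional approach

Conventional SFM/SLAM consists roughly of the following three steps:

Extract corresponding feature points from multiple images
- Optical-flow approach / matching approach (see the previous entry)
Triangulate each pair of corresponding feature points to reconstruct them in 3D
Run bundle adjustment (BA) on the reconstructed 3D points and the camera poses
- Repeatedly apply small corrections to the camera poses and 3D point coordinates so that the geometric error, i.e. the x,y offset between each 3D point's reprojection onto the image and the original 2D feature point (the reprojection error), is minimized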

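The DSO approach

In contrast, DSO can be pictured as running bundle adjustment (BA) directly on the input images from the start. It does not compute feature points. Roughly the following three steps:

Set up an equation stating that the pixel value (A) at some coordinate (with a reasonably large gradient) in one image should correspond to the pixel value (B) at some coordinate in the other image
From that equation, set up an expression for the difference between pixel values (A) and (B)
Repeatedly apply small corrections to the camera poses and 3D points (*1) so that this pixel-value error (photometric error) is minimized
- Here the idea of so-called sparse bundle adjustment (Sparse BA : SBA) is introduced, which is why the processing becomes fast
  - Minimizing the error with the Gauss-Newton method requires the Hessian matrix (*2), but because the Hessian is sparse (*3), computation can be skipped (computing the Schur complement suffices)
  - *1: In conventional methods each 3D point has three variables, the coordinates (x,y,z), whereas DSO represents it with a single variable, the inverse depth, which makes the computation efficient
  - *2: Reference: CVIM#11 3. 最小化のための数値計算
  - *3: Reference: バンドルアジャストメント（情報処理学会電子図書館）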
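
To make the Schur complement remark concrete, here is a toy system with a single camera unknown and one unknown per point (for example an inverse depth). Because the point block is diagonal, it can be eliminated entry by entry and only a reduced system in the camera unknown remains. The function name `solve_with_schur` and all numbers are invented for this sketch. Real DSO has many camera parameters and a windowed optimization, none of which is modelled here.

```python
from typing import List, Tuple


def solve_with_schur(a: float, b: List[float], c: List[float],
                     rhs_cam: float, rhs_pts: List[float]) -> Tuple[float, List[float]]:
    """Solve [[a, b^T], [b, diag(c)]] [x_cam, x_pts] = [rhs_cam, rhs_pts].

    The point block diag(c) is inverted entry by entry, so only a reduced
    system in the camera unknown has to be solved.
    """
    # Schur complement of the point block: a - b^T diag(c)^-1 b
    s = a - sum(bi * bi / ci for bi, ci in zip(b, c))
    reduced_rhs = rhs_cam - sum(bi * ri / ci for bi, ci, ri in zip(b, c, rhs_pts))
    x_cam = reduced_rhs / s
    # Back-substitute to recover each point unknown independently
    x_pts = [(ri - bi * x_cam) / ci for bi, ci, ri in zip(b, c, rhs_pts)]
    return x_cam, x_pts


x_cam, x_pts = solve_with_schur(4.0, [1.0, 2.0], [2.0, 5.0], 3.0, [1.0, 2.0])
print(x_cam, x_pts)
```

The cost of the elimination grows linearly with the number of points, which is the saving that the sparsity of the Hessian makes possible.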

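Incidentally, there also seems to be technique in how the images to be processed, i.e. the keyframes (KF), and the 3D points are managed well, for example by deleting them as soon as they are no longer needed.

* If you think with the instincts of an image-processing or image-quality engineer, it may be hard to understand. DSO feels like it is built out of pure mathematics.
Taking the previous entry as an example, my impression (purely a personal opinion) is:

Matching approach ... image-processing engineer's instincts
Optical-flow approach ... mathematician's instincts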

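Prerequisites

The input should be not "images that look beautiful to the human eye", as from a webcam or smartphone camera, but "images (and a camera) best suited to machine vision". Specifically:

Global shutter
- The paper does mention, in a single remark, that there are also techniques that mitigate the rolling-shutter effect to some extent, but...
Calibration 1 (geometric calibration) (i.e. the so-called intrinsic parameters)
- Image centre
- Lens distortion
- Focal length
Calibration 2 (photometric calibration)
- Exposure time
- Lens attenuation / vignetting
- Inverse function of the irradiance

What is nice about DSO

Conventional feature-point-based SFM/SLAM only goes as far as reconstructing so-called "corners" in 3D, whereas DSO, owing to its different processing approach, readily reconstructs "edges" as well
In other words, it tends to produce more 3D points than conventional approaches

I plan to look into the technical details more closely in the future.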
