MLC quick notes – Week 2

Other notes for MLC

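This week dug further into what an abstract representation needs, at minimum, when the problem is framed around machine learning models. The most intuitive answer is three things: Input / Output buffer representations (placeholders), Loop nests, and Computation statements.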
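To make the three pieces concrete, here is a minimal sketch in C that labels each of them on a one-dimensional relu. The function name `relu_1d` and its signature are assumptions made for illustration; they do not come from the course material.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: a 1-D relu written so each abstraction is visible.
 * Buffers: `in` and `out` are caller-owned arrays standing in for the
 * input / output placeholders. */
void relu_1d(const float *in, float *out, size_t n) {
  assert(in != NULL && out != NULL);
  /* Loop nest: a single loop over all n elements. */
  for (size_t i = 0; i < n; ++i) {
    /* Computation statement: one output element from one input element. */
    out[i] = in[i] > 0.0f ? in[i] : 0.0f;
  }
}
```

A real operator such as linear would have a deeper loop nest, but it splits into the same three parts.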

A primitive tensor function is one of the operators you see directly in a model, such as linear, relu, or softmax. There are two ways to compile these operators:

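The simplest approach is for the hardware to do everything for you: it exposes a coarse-grained API, and whatever operator the model contains is handed straight to the hardware.
The finer-grained approach is to optimize the code (the loops) that executes the operator.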

The most direct example starts from the SIMD programming scheme everyone knows best; NEON and AVX512 both serve as concrete examples. Turning a simple for loop into a SIMD-friendly program requires Loop Splitting:

// From
for (int i=0; i<128; ++i) {
  c[i] = a[i] + b[i];
}
// To (exploit parallelism with SIMD)
for (int i=0; i<32; ++i) {
  for (int j=0; j<4; ++j) { // deal with 4 computations at a time
    c[i * 4 + j] = a[i * 4 + j] + b[i * 4 + j];
  }
}
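The split above works out evenly because 128 is a multiple of 4. As a rough sketch of the general case, the version below adds a remainder loop for lengths that are not a multiple of the vector width. The name `add_split` and the constant `LANES` are assumptions for illustration, and the inner loop is plain scalar C that a vectorizing compiler may or may not turn into SIMD instructions.

```c
#include <assert.h>
#include <stddef.h>

enum { LANES = 4 }; /* assumed SIMD width, in elements */

/* Hypothetical helper: element-wise add with the loop split by LANES. */
void add_split(const float *a, const float *b, float *c, size_t n) {
  assert(a != NULL && b != NULL && c != NULL);
  size_t full_chunks = n / LANES;
  /* Outer loop walks whole chunks; inner loop is the SIMD-shaped body. */
  for (size_t i = 0; i < full_chunks; ++i) {
    for (size_t j = 0; j < LANES; ++j) {
      c[i * LANES + j] = a[i * LANES + j] + b[i * LANES + j];
    }
  }
  /* Remainder loop: the last n % LANES elements that do not fill a chunk. */
  for (size_t k = full_chunks * LANES; k < n; ++k) {
    c[k] = a[k] + b[k];
  }
}
```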

When you also need parallelism across cores, you can apply Loop Interchange to get:

// To (exploit parallelism through multiple cores)
for (int j=0; j<4; ++j) { // 4 cores
  for (int i=0; i<32; ++i) {
    c[i * 4 + j] = a[i * 4 + j] + b[i * 4 + j];
  }
}
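One way to see why the interchanged form suits multiple cores is to pull the body of the outer loop into a function that computes a single core's share. This is only a sketch: `add_worker_share` is a hypothetical name, and the code does not start any threads. It only shows that each share touches a disjoint set of indices, so the shares could run in any order or concurrently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: the work of one "core" in the interchanged loop.
 * Worker `worker` handles indices worker, worker + num_workers, ... */
void add_worker_share(const float *a, const float *b, float *c,
                      size_t iters, size_t num_workers, size_t worker) {
  assert(worker < num_workers);
  for (size_t i = 0; i < iters; ++i) {
    size_t idx = i * num_workers + worker; /* same index as c[i * 4 + j] */
    c[idx] = a[idx] + b[idx];
  }
}
```

Calling it with `iters = 32`, `num_workers = 4`, and `worker` running over 0 to 3 reproduces the loop above, one value of `j` per call.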

Combine this kind of transformation with a mapping onto CUDA, and in abstract form it can be written like this:

// From p.16 of <https://mlc.ai/summer22/slides/2-TensorProgram.pdf>
x = get_loop("x")
xo, xi = split(x, 4)
reorder(xi, xo)
bind_thread(xo, "threadIdx.x")
bind_thread(xi, "blockIdx.x")
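To check how the indices work out, the sketch below emulates that schedule in plain C, with ordinary loops standing in for the GPU launch. It rests on two assumptions not stated in the pseudocode: the loop extent is 128, carried over from the earlier example, and `split(x, 4)` means `x = xo * 4 + xi`. All names here (`add_body`, `add_emulated_launch`, `FACTOR`, `EXTENT`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { FACTOR = 4, EXTENT = 128 }; /* assumed split factor and loop extent */

/* What a single (block, thread) pair would compute. */
static void add_body(const float *a, const float *b, float *c,
                     size_t block_idx, size_t thread_idx) {
  /* xo is bound to threadIdx.x and xi to blockIdx.x, so x = xo * 4 + xi. */
  size_t x = thread_idx * FACTOR + block_idx;
  c[x] = a[x] + b[x];
}

/* Emulated launch: 4 blocks of 32 threads, run sequentially on the CPU. */
void add_emulated_launch(const float *a, const float *b, float *c) {
  assert(a != NULL && b != NULL && c != NULL);
  for (size_t block_idx = 0; block_idx < FACTOR; ++block_idx) {
    for (size_t thread_idx = 0; thread_idx < EXTENT / FACTOR; ++thread_idx) {
      add_body(a, b, c, block_idx, thread_idx);
    }
  }
}
```

On a real GPU the two loops disappear and only the body remains, with the two indices supplied by the hardware.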

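From the compiler's point of view, the more information it is given about each loop, the better. IBM and Intel, for example, each have their own pragmas. In the ML domain we want the same kind of information, such as whether a loop has a loop-carried dependency, or, as on p.18, a direct statement that all elements of the tensor are spatially parallelizable.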
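As a small illustration of what "loop-carried dependency" means, the sketch below puts two loops side by side. Both function names are hypothetical. In the first, every iteration is independent, so it is spatially parallelizable. In the second, iteration `i` reads the value written by iteration `i - 1`, so the iterations cannot simply be reordered or run in parallel as written.

```c
#include <assert.h>
#include <stddef.h>

/* No loop-carried dependency: out[i] depends only on in[i]. */
void scale(const float *in, float *out, size_t n, float k) {
  assert(in != NULL && out != NULL);
  for (size_t i = 0; i < n; ++i) {
    out[i] = k * in[i];
  }
}

/* Loop-carried dependency: out[i] depends on out[i - 1]. */
void prefix_sum(const float *in, float *out, size_t n) {
  assert(in != NULL && out != NULL);
  if (n == 0) {
    return;
  }
  out[0] = in[0];
  for (size_t i = 1; i < n; ++i) {
    out[i] = out[i - 1] + in[i];
  }
}
```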

In summary, this week showed the optimization space underneath operators.
The last 20 minutes were the lecturer plugging their own TVM XD

Extra brainstorming: MLIR and TVM? Like compilers in the old days, is ML compilation about to enter its own Warring States period?

Author: eopXD

Hi, I'm eopXD, and I hope to live a life that is fun and challenging. If you have any questions, feel free to email eopxdd[at]gmail.com.
