MLC quick notes – Week 4

過去兩週解釋了對 Tensor Primitive Function 的實作可能。這週退一步來看,給出了如何抵達要去實作 primitive function 的 high level workflow overview。

Continue reading “MLC quick notes – Week 4”

MLC quick notes – Week 3

這週用 TVM 中的 TensorIR 來實際展示 Tensor Computation 的必要元素——如何從較粗糙的 ML Operator 轉換成可以實際編譯執行(Deployable)的 Primitive Function。

課程中舉例。現在 ML Model 是簡單的矩陣相乘(Matrix Multiplication)之後接一個整流函數(ReLU)。表示成數學式為:

From notes under mlc.ai

這裡 TQ 直接切入 Python abstraction 底下,開始介紹 TVM 所提供的 API — axis_type, axis_range and mapped_value。並且展示了 TVM 提供 Loop Splitting, Loop Interchange 的 API 來做出 Tiling 的例子。

不過大家真正想學的應該也不是 TVM,整場聽你安麗就飽了啊 (☉д⊙)(?)。這裡的 take-away 應該是我們需要解決的問題——在高階語言後關心實際實作(如何分配 multithreading)與執行(環境下的 locality)的最佳化。

在課程中 TQ 用 TVM 所示範的 Tiling 例子,其實追根究底是複用了 Halide [1] 的最核心概念—— Decoupling Algorithm from Schedules [2]。 TVM 所展現的 modularity 就是透過這樣的 decoupling 產生的。經過解耦,可以展開出 multithreading 與 locality 的解平面來做最佳化。ML operator 的確也是這一概念很好的應用對象!

個人 OS:要是乖乖上課不查資料就真的只是 TVM 新生訓練而已了⋯⋯

[1] Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines
[2] Decoupling Algorithms from Schedules for Easy Optimization of Image Processing Pipelines

RVV vector elements and its policy behaviors

The RISC-V Vector Extension (RVV) has been ratified lately and new to the world. It is a VLA (Vector Length Agnostic) vector extension and its behavior relies on the setting of vtype register. Two of the fields, vta and vma, determines the behaviors for tail elements and inactive elements respectively.

Definition of the extension is established by the riscv-v-spec. Even though the spec gives me a good top-down overview of the extension, several instructions behave differently than the generic behavior. This sometime leaves me confused and slow down my development.

This article wants to gather information regarding the policy configuration for RVV and hope to help people understand more of RVV.

The “agnostic” behavior

When a set is marked agnostic, the corresponding set of destination elements in any vector destination operand can either retain the value they previously held, or are overwritten with 1s. Within a single vector instruction, each destination element can be either left undisturbed or overwritten with 1s, in any combination, and the pattern of undisturbed or overwritten with 1s is not required to be deterministic when the instruction is executed with the same inputs.

risc-v v-spec

Tail policy behavior

Remaining spaces of destination register are treated as tail elements

Fractional LMUL may occur on generic vector instructions. In this case, the rest of the vector space is treated as tail elements and respects vta setting.

For vector segment instructions, since EMUL = (EEW / SEW) * LMUL, we may obtain fractional EMUL and the rest of the vector space is treated as tail elements and respects vta setting.

Instructions with mask destination registers are always tail-agnostic

Be noted of how the v-spec stated about the mask register layout:

Each element is allocated a single mask bit in a mask vector register. The mask bit for element i  is located in bit i  of the mask register, independent of SEW or LMUL.

Regarding an instruction that has a mask destination register, the tail elements ranges from the vl-th bit to the VLEN-1-th bit.

Store instructions are not affected by policy settings

The destination of store instructions is the memory. No vector register is involved, therefore the policy settings don’t affect them.

Mask policy behavior

Unmasked instructions don’t care about mask policy

Some instructions are always unmasked, meaning that the instructions have no inactive elements. They are not affected by mask policy.

  • Vector add-with-carry and subtract-with-borrow instructions
  • Vector merge and move instructions
  • Vector mask-register logical instructions
  • vcompress

Reduction instructions don’t care about mask policy

Additionally, vector reduction instructions don’t care about mask policy too because the inactive elements are excluded from reduction. The 0th element of the destination register will hold the result of the reduction and other elements in the destination vector register will respect the tail policy.

Ending

Hope this post saved someone’s time in the world from tangling details of RVV policies 😉

The RISC-V Vector Extension (RVV) has been ratified lately and new to the world. It is a VLA (Vector Length Agnostic) vector extension and its behavior relies on the setting of vtype register. Two of the fields, vta and vma, determines the behaviors for tail elements and inactive elements respectively.

Definition of the extension is established by the riscv-v-spec. Even though the spec gives me a good top-down overview of the extension, several instructions behave differently than the generic behavior. This sometime leaves me confused and slow down my development.

This article wants to gather information regarding the policy configuration for RVV and hope to help people understand more of RVV.

The “agnostic” behavior

When a set is marked agnostic, the corresponding set of destination elements in any vector destination operand can either retain the value they previously held, or are overwritten with 1s. Within a single vector instruction, each destination element can be either left undisturbed or overwritten with 1s, in any combination, and the pattern of undisturbed or overwritten with 1s is not required to be deterministic when the instruction is executed with the same inputs.

Tail policy behavior

Remaining spaces of destination register are treated as tail elements

Fractional LMUL may occur on generic vector instructions. In this case, the rest of the vector space is treated as tail elements and respects vta setting.

For vector segment instructions, since EMUL = (EEW / SEW) * LMUL, we may obtain fractional EMUL and the rest of the vector space is treated as tail elements and respects vta setting.

Instructions with mask destination registers are always tail-agnostic

Be noted of how the v-spec stated about the mask register layout:

Each element is allocated a single mask bit in a mask vector register. The mask bit for element i  is located in bit i  of the mask register, independent of SEW or LMUL.

Regarding an instruction that has a mask destination register, the tail elements ranges from the vl-th bit to the VLEN-1-th bit.

Store instructions are not affected by policy settings

The destination of store instructions is the memory. No vector register is involved, therefore the policy settings don’t affect them.

Mask policy behavior

Unmasked instructions don’t care about mask policy

Some instructions are always unmasked, meaning that the instructions have no inactive elements. They are not affected by mask policy.

  • Vector add-with-carry and subtract-with-borrow instructions
  • Vector merge and move instructions
  • Vector mask-register logical instructions
  • vcompress
Reduction instructions don’t care about mask policy

Additionally, vector reduction instructions don’t care about mask policy too because the inactive elements are excluded from reduction. The 0th element of the destination register will hold the result of the reduction and other elements in the destination vector register will respect the tail policy.

Ending

Hope this post saved someone’s time from the tangling details of RVV policies 😉

MLC quick notes – Week 2

Other notes for MLC

本週進一步探討了在機器學習模型這樣的問題框架下,抽象化的表示至少需要哪些。最直觀來說是 Input / Output buffer representations (placeholders), Loop nests and Computation statements.

Primitive tensor function 就是在模型上最直接的那些 operator,諸如 linear, relu, softmax。而要編譯這些 operator:

  • 一種最簡單的方式就是硬體都幫你做好好,直接切處這樣 coarse grain 的 API,model 來哪種 operator 就直接送給硬體做
  • 更 fine grain 來說,對 operator 執行的程式碼(迴圈)做優化

最直接的例子,就是從大家最熟悉的 SIMD programming scheme 開始。像是 NEON 或是 AVX512 都可以作為實際例子。而從一個簡單的 for loop 要轉成較適用 SIMD 的程式,需要 Loop Splitting:

// From
for (int i=0; i<128; ++i) {
  c[i] = a[i] + b[i];
}
// To (Exploit parallelism with SIMD
for (int i=0; i<32; ++i) {
  for (int j=0; j<4; ++j) { // deal with 4 computations at a time
    c[i * 4 + j] = a[i * 4 + j] + b[i * 4 + j];
  }
}

更甚至需要 core 上的平行話時,可以 Loop Interchange 成:

// To (Exploit parallelism through multiple cores
for (int j=0; j<4; ++j) { // 4 cores
  for (int i=0; i<32; ++i) {
    c[i * 4 + j] = a[i * 4 + j] + b[i * 4 + j];
  }
}

這樣的 transformation,加上需要轉換到 CUDA 上的話,在抽象化來講可以寫成以下這樣:

// From p.16 of <https://mlc.ai/summer22/slides/2-TensorProgram.pdf>
x = get_loop("x")
xo, xi = split(x, 4)
reorder(xi, xo)
bind_thread(xo, "threadIdx.x")
bind_thread(xi, "blockIdx.x")

在 compiler 來說,對於各式各樣的 loop 當然是希望能夠被提供越多資訊越好,諸如 IBM / Intel 都有一些自家 pragma,在 ML 領域內我們也希望能夠被提供這樣的資訊,像是「有沒有 loop carried dependency」或是直接像 p.18 裡直接指名該 tensor 所有元素是 spatially parallelize-able。

總的來說,這週展示了 operator 底下的優化空間。
最後 20 分鐘就是講師在安立自家的 TVM XD

額外 brain storming:MLIR 跟 TVM 都幾?難道又像古時候的編譯器一樣,ML compilation 是否也要進入戰國時代了呢?

MLC quick notes – Week 1

Other notes for MLC

基本上什麼也沒說。在這裡把問題情境說出來——大目標就是想要把機器學習模型放到各式各樣的硬體上。這應該就是各家 AI Compiler 公司都在做的事。

Key Questions to answer:

  • What level of abstraction to have?
    • Too high: lack of reuse, will have to rebuild different operator types if so coarse grained
    • Too low: too much verbosity, harder to identify high level informations (e.g. control flow)
  • How to address the process from “Development” to “Deployment”?
    • When developing… ML engineers train model in language that is more easier to configure — Python. Machine learning frameworks are mainly based on Python, like Tensorflow, PyTorch and JAX.
    • When deploying… the environment varies from end devices like holdable devices, tiny cameras, tiny microphones to individual GPU, or even large scale computing farms. The environment and hardwares are different in multiple senses. The problem of deployment is the next big question for this machine learning era.

Machine learning compilation goals

  • Integration and dependency minimization: 在 end device 上用最少的資源達到 deployment
  • Leverage hardware native acceleration: 善用硬體特性做加速,舉凡 SIMD lane, multiple core, cache friendly tiling 都是這階段會做的事。常數優化的累積不可小覷。

MLC is all about bridging the gap. 編譯器就是去追逐人類與 0/1 之間的巴別塔。

If your perf is not working…

I was trying to perf LLVM opt this week and got confused because the perf report showed mangled function calls on the stack traces. So I might as well write an article of it and hope it would get some Google juice and save others from confusion.

  • To build LLVM with complete stack traces, build with “CMAKE_BUILD_TYPE=Debug
  • To allow perf to work on LLVM, build with “LLVM_USE_PERF:=ON

At the very first place I thought the mangle-ed function call names exist because of perf not recognizing the symbols. It turned out I was wrong. If perf does not recognize the symbol it would simply show Unknown. The mixed-up symbol names are name mangling for function calls. (see more here)

You can remove the mangling simply with llvm-cxxfilt (see more here). For the exact correct mapping you may need to use llvm-cxxmap (see from here).

With the correct build I still get mangled-output. Turned out that the perf I apt-get-ed from is compiled with the de-mangling function turned off. It was a bug filed in linux since 2014 November and I’m still falling into this error on 2021 June 🤢. (on Ubuntu 16.04 linux 4.4.0-131 generic x86_64)

Mangled Output, visualization by FlameGraph

There’s 3 links in the bug thread above that may lead you to the solution. Personally I’ve download the perf tool in mirrors.edge.kernel.org, followed the instructions here to build the correct perf I needed.

Clean Output, visualization by FlameGraph

NOTE: you may need to apt-get some dependencies to enable some feature in perf. Be sure to checkout the compile messages when you build from source. You can checkout features enabled with perf version --build-options.

PS: Just in case you came in with your perf not working on your self-compiled code with clang or gcc, you may want to look at this stack-overflow answer.