MLC quick notes – Week 2

Other notes for MLC

This week digs further into the question: under the problem framing of machine-learning models, what does an abstract representation minimally need to express? The most intuitive answer is input/output buffer representations (placeholders), loop nests, and computation statements.
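
To make that concrete, here is a minimal sketch of my own (C-style, not taken from the course material): even a plain vector add already contains all three elements.

// A minimal sketch (my own rendering, not from the lecture): a vector add
// already exhibits buffers, a loop nest, and a computation statement.
void vector_add(const float* a, const float* b, float* c) {  // input/output buffers (placeholders)
  for (int i = 0; i < 128; ++i) {                            // loop nest
    c[i] = a[i] + b[i];                                      // computation statement
  }
}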

Primitive tensor functions are the most direct operators in the model, such as linear, relu, and softmax. To compile these operators:

  • The simplest approach is to let the hardware vendor do all the work: expose a coarse-grained API at this level, and whatever operator the model brings is handed straight to the hardware
  • At a finer grain, optimize the code (the loops) that executes each operator (see the sketch below)
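
For concreteness, a small sketch of my own (not from the lecture): relu written out as the kind of loop that the fine-grained approach would then transform and optimize.

// Sketch only: relu as a primitive tensor function, i.e. the loop body
// a fine-grained compiler would split, reorder, vectorize, etc.
void relu(const float* x, float* y, int n) {
  for (int i = 0; i < n; ++i) {
    y[i] = x[i] > 0.0f ? x[i] : 0.0f;
  }
}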

The most direct example starts from the SIMD programming scheme everyone knows; NEON or AVX512 are concrete instances. Turning a simple for loop into a program that maps better onto SIMD requires loop splitting:

// From
for (int i=0; i<128; ++i) {
  c[i] = a[i] + b[i];
}
// To (Exploit parallelism with SIMD)
for (int i=0; i<32; ++i) {
  for (int j=0; j<4; ++j) { // deal with 4 computations at a time
    c[i * 4 + j] = a[i * 4 + j] + b[i * 4 + j];
  }
}

When parallelism across cores is also needed, loop interchange turns it into:

// To (Exploit parallelism through multiple cores)
for (int j=0; j<4; ++j) { // 4 cores
  for (int i=0; i<32; ++i) {
    c[i * 4 + j] = a[i * 4 + j] + b[i * 4 + j];
  }
}

This kind of transformation, plus what is needed to retarget onto CUDA, can be written abstractly like this:

// From p.16 of <https://mlc.ai/summer22/slides/2-TensorProgram.pdf>
x = get_loop("x")
xo, xi = split(x, 4)
reorder(xi, xo)
bind_thread(xo, "threadIdx.x")
bind_thread(xi, "blockIdx.x")

From the compiler's point of view, the more information it is given about every kind of loop, the better; IBM and Intel each ship their own pragmas for this. In the ML domain we also want such information to be provided, e.g. "is there a loop-carried dependency?" or, as on p.18, declaring outright that all elements of the tensor are spatially parallelizable.
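
As a hedged illustration in plain C/C++ (the pragmas below are compiler/vendor specific, listed only as examples I know of), this is what such hints look like at the source level:

// Illustration only: telling the compiler there is no loop-carried dependency.
// __restrict__ is a GCC/Clang extension; #pragma omp simd needs OpenMP;
// Intel has #pragma ivdep and GCC has #pragma GCC ivdep for the same purpose.
void scale(float* __restrict__ dst, const float* __restrict__ src, int n) {
  #pragma omp simd
  for (int i = 0; i < n; ++i) {
    dst[i] = 2.0f * src[i];  // every element independent: spatially parallelizable
  }
}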

All in all, this week showcased the optimization space that sits underneath an operator.
The last 20 minutes were the lecturer plugging his own TVM XD

Extra brainstorming: MLIR or TVM, which one do you pick? Just like compilers in the old days, is ML compilation also about to enter its own warring-states era?

MLC quick notes – Week 1

Other notes for MLC

Basically nothing much was said. This lecture lays out the problem setting: the big goal is to get machine-learning models onto all kinds of hardware, which should be exactly what every AI compiler company is working on.

Key Questions to answer:

  • What level of abstraction to have?
    • Too high: lack of reuse; with such coarse granularity, every new operator type has to be rebuilt
    • Too low: too verbose, and high-level information (e.g. control flow) becomes harder to recover
  • How to address the process from “Development” to “Deployment”?
    • When developing… ML engineers train models in a language that is easier to configure: Python. Mainstream machine-learning frameworks such as TensorFlow, PyTorch, and JAX are all Python-based.
    • When deploying… the environment ranges from end devices (handheld devices, tiny cameras, tiny microphones) to a single GPU, or even large-scale computing farms. The environments and hardware differ in many senses. Deployment is the next big question of this machine-learning era.

Machine learning compilation goals

  • Integration and dependency minimization: achieve deployment on the end device with the fewest possible resources
  • Leverage hardware native acceleration: exploit hardware characteristics for speed; SIMD lanes, multiple cores, and cache-friendly tiling all belong to this stage (see the sketch below). The accumulation of constant-factor optimizations is not to be underestimated.
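
As an example of the tiling mentioned above, here is a minimal sketch of my own (sizes and names are made up, not from the course): a blocked matrix multiply whose inner loops reuse a tile of B while it is still hot in cache.

// Cache-friendly tiling sketch (hypothetical sizes; assumes C starts zeroed):
// the jj/kk blocking keeps a T x T tile of B resident in cache across the i loop.
const int N = 256;  // matrix size, made up for illustration
const int T = 32;   // tile size, made up for illustration
void matmul_tiled(const float A[N][N], const float B[N][N], float C[N][N]) {
  for (int jj = 0; jj < N; jj += T)
    for (int kk = 0; kk < N; kk += T)
      for (int i = 0; i < N; ++i)
        for (int j = jj; j < jj + T; ++j) {
          float sum = C[i][j];
          for (int k = kk; k < kk + T; ++k)
            sum += A[i][k] * B[k][j];
          C[i][j] = sum;
        }
}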

MLC is all about bridging the gap. A compiler is the chase after the Tower of Babel between humans and the 0s and 1s.

Graduation

I have graduated. To me, graduating doesn't really feel like leaving NTU; NTU is such a vague, abstract concept. It feels more like leaving Xinhai, Wenzhou, Taishun, Tingzhou, Da'an.

I went through plenty of absurd things here, and it was precisely this convenience of place that could hold so many youths straining to sound melancholy. Wenzhou Street at midnight still gives off an exuberant vitality.

Leaving should mark the start of a new phase, but my thoughts still keep carrying me back here. Perhaps it is exactly because this place stays steady and unchanged that the déjà vu is so strong every time I pass by. Confessions, rejections, holding hands, skateboarding, all-nighters, throwing up, smelling the stink of the throw-up together, being scared witless by rats, getting so drunk I forgot how to get back to Qingtian Street.

The friends I made back then are no longer in touch either. I should examine why my life so far has been full of goodbyes and short on the effort of keeping things going, and why these traces could not persist anywhere other than in our heads.

I should write more, or this farce of believing a moment to be eternity will happen again.

People do come and go, but what counts as long-term, what counts as planning for the long run: graduating doesn't seem to have made any of that clearer. There is still no stable state in my heart, and what follows will surely not be a stable life or circle of friends either, but it doesn't seem to be the time yet to genuinely worry for myself. Fortunately, what lies ahead is still fog.

This piece does need to exist. It exists for hope. Hope comes from the future, the future needs the present, and the present needs what this piece does with the past. In the end, even if these outward events are not kept, it is still thanks to them that a key to the inner self can be ground out.

Pandemic notes

Estrangement between people is a process: first a difference in information, and the difference composes different narratives. Different narratives lead to different value judgments, which accumulate day by day into wildly different beliefs. What is it that lets us stand inside a life, so that life has this shape and this context? In any case, community and society are always there as an inexhaustible source of nourishment. Once detached from the communities and society made up of people, losing the stimulation and awareness of life's vivid details, it naturally becomes hard to stand in the same context again.

After all, abstraction and inference seem to be instincts rooted deep inside the brain. After abstraction, the street scenes, the signs, all the various details along the road are merely a small black dot in a mathematical model. Cutting away redundant detail helps our brains think, but with excessive omission comes the loss of information and detail.

The forward motion of society has its own rhythm, one that is in any case too large, forever beyond what an individual's senses can perceive. Even before today's feelings have been digested, tomorrow one must still head out without a moment's pause, hardly different from a moth flying into the flame. Money, as a medium, abstracts away nearly every act of exchange; the details of life are converted into daily expenses and then further extrapolated into an annual budget. Reason sketches out in advance the range the senses are allowed to perceive, and never steps an inch beyond it.

Then the pandemic came. The machine ground to a halt, and the mechanism had to change. Everyone was pushed back into their own cramped dwellings, and the fast rhythm that pushed each day forward stopped, was made to stop. With the field of view narrowed, perhaps life can be refocused. The one-way nature of action no longer runs far beyond one's own reasoning only to be turned into conclusions. An excess of material things will no doubt flood back into life again. Still, I feel this pandemic is a very special process for the individual spirit. The pleasure that material things bring might no longer be just the rise and fall of money. Walking through the process of abstraction one more time, perhaps we can also place the rhythm of life back into our own hands, and hold it tight.

If your perf is not working…

I was trying to perf LLVM's opt this week and got confused because the perf report showed mangled function names in the stack traces. So I might as well write an article about it, hope it gets some Google juice, and save others from the same confusion.

  • To build LLVM with complete stack traces, build with “CMAKE_BUILD_TYPE=Debug”
  • To let perf work with LLVM, build with “LLVM_USE_PERF=ON”

At first I thought the mangled function names showed up because perf did not recognize the symbols. It turned out I was wrong: if perf does not recognize a symbol, it simply shows Unknown. The scrambled-looking symbol names are just C++ name mangling of the function calls. (see more here)
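
To make that concrete, a tiny made-up example (the names are hypothetical, not from my actual profile):

// Hypothetical C++ example of name mangling (Itanium ABI, as used by clang/gcc):
// in a raw perf report this function shows up as "_ZN4demo3addEii",
// which demangles back to "demo::add(int, int)".
namespace demo {
int add(int a, int b) { return a + b; }
}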

You can strip the mangling simply with llvm-cxxfilt (see more here). For an exactly correct mapping you may need llvm-cxxmap (see here).

Even with the correct build I still got mangled output. It turned out the perf I had apt-get-ed was compiled with demangling turned off. The bug was filed against Linux back in November 2014, and I was still running into it in June 2021 🤢. (on Ubuntu 16.04, linux 4.4.0-131-generic x86_64)

Mangled Output, visualization by FlameGraph

There are three links in the bug thread above that may lead you to a solution. Personally, I downloaded the perf source from mirrors.edge.kernel.org and followed the instructions here to build the perf I needed.

Clean Output, visualization by FlameGraph

NOTE: you may need to apt-get some dependencies to enable certain features in perf. Be sure to check the compile messages when you build from source. You can check which features are enabled with perf version --build-options.

PS: In case you came here because perf is not working on your own code compiled with clang or gcc, you may want to look at this Stack Overflow answer.