Introduction :
Finite Field Assembly is a programming language that lets you emulate GPUs on CPUs
It's a CUDA alternative that uses finite field theory to convert GPU kernels to prime number fields.
Finite Field is the primary data structure : FF-asm is a CUDA alternative designed for computations over finite fields.
Recursive computing support : not cache-aware vectorization, not parallelization, but performing a calculation inside a calculation inside another calculation.
Extension of C89 - runs everywhere gcc is available.
Context : I'm getting my math PhD and I built this language around my area of expertise, Number Theory and Finite Fields.
This is phrased in a kind of demanding way to an author who has been kind enough to share their novel work with us. Are you sure you spent enough time trying to understand?
It seems that pretty much everybody here is confused by this article. One user even accused it of LLM plagiarism, which is pretty telling in my opinion.
I for one have no clue what anything I read in there is supposed to mean. Emulating a GPU's semantics on a CPU is a topic which I thought I had a decent grasp on, but everything from the stated goals at the top of this article to the example code makes no sense to me.
I guess it's not clear to me why it's even interesting to talk about their LinkedIn or their PhD in the first place? It's not like not having a PhD will make the work any more true or not. Wouldn't it be more interesting to discuss the merits of the post? There's really little point in trying to say that their LinkedIn has different info than the comment therefore the submission is invalid.
But, suppose I did actually hold that belief for some reason, then it would seem fairly intellectually dishonest to withhold relevant info in my pointed inquisition wherein I just characterize them as someone lacking mathematical experience at all, let alone from a world class university. But maybe that's just me!
I think they're pointing it out because stating they're working towards a PhD in something when they've just graduated and don't seem to be (as far as we can tell) enrolled in a PhD program is misleading. Note that the parent isn't the one that brought up their PhD, the author is, presumably to head off the big question marks everyone reading this got as to what it is.
It's unclear whether this page is something that could be useful, and deserves attention. The fact that the author is at best making misleading statements is useful in determining whether you should take their claims at face value.
They claim "Finite Field Assembly is a programming language that lets you emulate GPUs on CPUs".
It's not a programming language, it's a handful of C macros, and it doesn't in any way emulate a GPU on the CPU. I'll be honest, I think the author is trying to fake it till they make it, they seem interested in mathematics but their claims are far beyond what they've demonstrated, and their post history reveals a series of similar submissions. In so far as they're curious and want to experiment I think it's reasonable to encourage, but they're also asking for money and don't seem to be delivering much.
Why would they post the 4th article in a series where the previous ones require you to pay?
> I guess it's not clear to me why it's even interesting to talk about their LinkedIn or their PhD in the first place?
Am I taking crazy pills? I didn't bring it up, the guy himself, here at the top of this very thread branch, wrote specifically explicitly that he's a PhD student working on number theory.
> Wouldn't it be more interesting to discuss the merits of the post?
There is no merit, nothing to discuss. I linked the corresponding GitHub below so you can judge for yourself.
Depending on what properties they sold, they certainly could have gotten valuable real-world expertise with finite fields. It's certainly easier to sell them than infinite ones!
I think I get it. You're using the Ring isomorphism from the Chinese Remainder Theorem to do "parallel computation". This is the same principle as how boolean algebra on binary strings computes the pairwise results of each bit in parallel. Unfortunately, there's no free lunch - if you want to perform K operations on N-bit integers in parallel, you still need to work with (K * N)-bit-wide vectors, which is essentially what SIMD does anyway.
I’m also unsure where finite fields are coming into play. Finite fields have orders that are prime powers, and the author is talking about a “finite field” of order 7x9x11. But if we aren’t dealing with fields, why is the author mentioning plans for implementing division? It definitely needs more explanation but I’m not sure if the idea is coherent.
One of the more subtle aspects of retargeting GPU code to run on the CPU is the presence of fine grained(read - block level and warp level) explicit synchronization mechanisms being available in the GPU. However, this is not the same in CPU land, so additional care has to be taken to handle this. One example of work which tries this is https://arxiv.org/pdf/2207.00257 .
Interestingly, in the same work, contrary to what you’d expect, transpiling GPU code to run on CPU gives ~76% speedups in HPC workloads compared to a hand optimized multi-core CPU implementation on Fugaku(a CPU only supercomputer), after accounting for these differences in synchronization.
A single CPU thread should be treated as basically a warp executing 4 simd vectors in parallel.
The naïve implementation of __syncthreads() would be an atomic mechanism shared across all threads that contribute to what is GPU workgroup.
Looks like this entire paper is just about how to move/remove these barriers.
yes, but in practice, I believe people spam __syncthreads() in GPU kernels just to ensure correctness. There is value in statically proving that you don't need a synchronization instruction at a certain point. Doubly more so in the transpilation case, when you now find your naive __syncthreads() being called multiple times due to it being present in CUDA code(or MLIR in this case).
An interesting add on to me would be the handling of conditionals. Because newer GPUs have independent thread scheduling which is not present in the older ones, you have to wonder what is the desired behaviour if you are using CPU execution as a debugger of sorts(or are just GPU poor). It'd be super cool to expose those semantics as a compiler flag for your transpiler, allowing me to potentially debug some code as if it ran on an ancient GPU like a K80 for some fast local debugging.
But the ambitious question here is this - if you take existing GPU code, run it through a transpiler and generate better code than handwritten OpenMP, do you need to maintain an OpenMP backend for the CPU in the first place? It'd be better to express everything in a more richer parallel model with support for nested synchronization right? And let the compiler handle the job of inter-converting between parallelism models. It's like saying if Pytorch 2.0 generates good Triton code, we could just transpile that to CPUs and get rid of the CPU backend. (of course triton doesn't support all patterns so you would fall back to aten, and this kind of goes for a toss)
It's a bit hard for me to tell the intention here. Is the idea that finite fields can take better advantage of CPU architecture than something like SIMD for parallel computation? Or is this just for experimentation?
Edit: this tickles my brain about some similar seeming sort of programming language experiment, where they were also trying to express concurrency (not inherently the same as parallelism) using some fancy math. I can't remember what it was though?
If matrix multiplication does get added to this, I imagine that there is some utility for game development. At that point, I'd be curious what the comparison would be from CPU to GPU. Like, given a clock speed of x, what would a comparable GPU (or set of GPU features) look like?
I know that's pretty abstract, but without that kind of "apples to apples" comparison, I have trouble contextualizing what kind of output is bring targeted with this kind of work.
Introduction : Finite Field Assembly is a programming language that lets you emulate GPUs on CPUs
It's a CUDA alternative that uses finite field theory to convert GPU kernels to prime number fields.
Finite Field is the primary data structure : FF-asm is a CUDA alternative designed for computations over finite fields.
Recursive computing support : not cache-aware vectorization, not parallelization, but performing a calculation inside a calculation inside another calculation.
Extension of C89 - runs everywhere gcc is available. Context : I'm getting my math PhD and I built this language around my area of expertise, Number Theory and Finite Fields.
I've read this and I've seen the site, and I still have no idea what it is, what's the application and why should I be interested.
Additionally I've tried earlier chapters and they are behind a paywall.
You need a better introduction.
This is phrased in a kind of demanding way to an author who has been kind enough to share their novel work with us. Are you sure you spent enough time trying to understand?
It seems that pretty much everybody here is confused by this article. One user even accused it of LLM plagiarism, which is pretty telling in my opinion.
I for one have no clue what anything I read in there is supposed to mean. Emulating a GPU's semantics on a CPU is a topic which I thought I had a decent grasp on, but everything from the stated goals at the top of this article to the example code makes no sense to me.
It just seems like residue numbering systems computation, which I'm already working with.
> I'm getting my math PhD and I built this language around my area of expertise, Number Theory and Finite Fields.
Your LinkedIn says you're an undergrad that took a gap year 10 months ago (before completing your senior year) to do sales for a real estate company.
Why bother doing a witch hunt and leaving out that they did Stats at Yale..
Because why does it matter? Are you suggesting undergrad stats at Yale is comparable to a PhD in number theory?
I guess it's not clear to me why it's even interesting to talk about their LinkedIn or their PhD in the first place? It's not like not having a PhD will make the work any more true or not. Wouldn't it be more interesting to discuss the merits of the post? There's really little point in trying to say that their LinkedIn has different info than the comment therefore the submission is invalid.
But, suppose I did actually hold that belief for some reason, then it would seem fairly intellectually dishonest to withhold relevant info in my pointed inquisition wherein I just characterize them as someone lacking mathematical experience at all, let alone from a world class university. But maybe that's just me!
I think they're pointing it out because stating they're working towards a PhD in something when they've just graduated and don't seem to be (as far as we can tell) enrolled in a PhD program is misleading. Note that the parent isn't the one that brought up their PhD, the author is, presumably to head off the big question marks everyone reading this got as to what it is.
It's unclear whether this page is something that could be useful, and deserves attention. The fact that the author is at best making misleading statements is useful in determining whether you should take their claims at face value.
They claim "Finite Field Assembly is a programming language that lets you emulate GPUs on CPUs".
It's not a programming language, it's a handful of C macros, and it doesn't in any way emulate a GPU on the CPU. I'll be honest, I think the author is trying to fake it till they make it, they seem interested in mathematics but their claims are far beyond what they've demonstrated, and their post history reveals a series of similar submissions. In so far as they're curious and want to experiment I think it's reasonable to encourage, but they're also asking for money and don't seem to be delivering much.
Why would they post the 4th article in a series where the previous ones require you to pay?
> I guess it's not clear to me why it's even interesting to talk about their LinkedIn or their PhD in the first place?
Am I taking crazy pills? I didn't bring it up, the guy himself, here at the top of this very thread branch, wrote specifically explicitly that he's a PhD student working on number theory.
> Wouldn't it be more interesting to discuss the merits of the post?
There is no merit, nothing to discuss. I linked the corresponding GitHub below so you can judge for yourself.
Depending on what properties they sold, they certainly could have gotten valuable real-world expertise with finite fields. It's certainly easier to sell them than infinite ones!
Are you sure that’s their LinkedIn?
Why wouldn't it be? All of the pics, names and details line up between GitHub, here, Reddit, and substack.
I think I get it. You're using the Ring isomorphism from the Chinese Remainder Theorem to do "parallel computation". This is the same principle as how boolean algebra on binary strings computes the pairwise results of each bit in parallel. Unfortunately, there's no free lunch - if you want to perform K operations on N-bit integers in parallel, you still need to work with (K * N)-bit-wide vectors, which is essentially what SIMD does anyway.
I’m also unsure where finite fields are coming into play. Finite fields have orders that are prime powers, and the author is talking about a “finite field” of order 7x9x11. But if we aren’t dealing with fields, why is the author mentioning plans for implementing division? It definitely needs more explanation but I’m not sure if the idea is coherent.
Yup that's exactly what this is and thus, notably, it is not actually about finite fields.
One of the more subtle aspects of retargeting GPU code to run on the CPU is the presence of fine grained(read - block level and warp level) explicit synchronization mechanisms being available in the GPU. However, this is not the same in CPU land, so additional care has to be taken to handle this. One example of work which tries this is https://arxiv.org/pdf/2207.00257 .
Interestingly, in the same work, contrary to what you’d expect, transpiling GPU code to run on CPU gives ~76% speedups in HPC workloads compared to a hand optimized multi-core CPU implementation on Fugaku(a CPU only supercomputer), after accounting for these differences in synchronization.
A single CPU thread should be treated as basically a warp executing 4 simd vectors in parallel. The naïve implementation of __syncthreads() would be an atomic mechanism shared across all threads that contribute to what is GPU workgroup.
Looks like this entire paper is just about how to move/remove these barriers.
yes, but in practice, I believe people spam __syncthreads() in GPU kernels just to ensure correctness. There is value in statically proving that you don't need a synchronization instruction at a certain point. Doubly more so in the transpilation case, when you now find your naive __syncthreads() being called multiple times due to it being present in CUDA code(or MLIR in this case).
An interesting add on to me would be the handling of conditionals. Because newer GPUs have independent thread scheduling which is not present in the older ones, you have to wonder what is the desired behaviour if you are using CPU execution as a debugger of sorts(or are just GPU poor). It'd be super cool to expose those semantics as a compiler flag for your transpiler, allowing me to potentially debug some code as if it ran on an ancient GPU like a K80 for some fast local debugging.
But the ambitious question here is this - if you take existing GPU code, run it through a transpiler and generate better code than handwritten OpenMP, do you need to maintain an OpenMP backend for the CPU in the first place? It'd be better to express everything in a more richer parallel model with support for nested synchronization right? And let the compiler handle the job of inter-converting between parallelism models. It's like saying if Pytorch 2.0 generates good Triton code, we could just transpile that to CPUs and get rid of the CPU backend. (of course triton doesn't support all patterns so you would fall back to aten, and this kind of goes for a toss)
Pretty sure this is just vectorization. You can pack some 8bit ints into a machine-length 32bit int and add them together, that is vectorization.
I don't think that's true when the add overflows. You wouldn't want a lane's overflow to carry into an adjacent lane.
It's a bit hard for me to tell the intention here. Is the idea that finite fields can take better advantage of CPU architecture than something like SIMD for parallel computation? Or is this just for experimentation?
Edit: this tickles my brain about some similar seeming sort of programming language experiment, where they were also trying to express concurrency (not inherently the same as parallelism) using some fancy math. I can't remember what it was though?
Well too late to edit, but I think I was thinking of this: https://news.ycombinator.com/item?id=40390287 the programming language Bend.
> Field order (the number of elements your field can hold). i.e you can store 8 * 9 * 11 elements in the field
I thought a finite field's order has to be a prime power.
Yes you are right. According to Wikipedia it was proven in 1893. https://en.wikipedia.org/wiki/Finite_field
I’m dubious of this project.
If matrix multiplication does get added to this, I imagine that there is some utility for game development. At that point, I'd be curious what the comparison would be from CPU to GPU. Like, given a clock speed of x, what would a comparable GPU (or set of GPU features) look like?
I know that's pretty abstract, but without that kind of "apples to apples" comparison, I have trouble contextualizing what kind of output is bring targeted with this kind of work.
I suspect this is just AI slop.
What are some problems where this approach has advantages?
It's hilarious how gullible hn is. All you gotta do is put GPU and math buzzwords in your README and you'll automatically be upvoted.
This was discussed on Reddit - this is not actually finite field arithmetic.
Also you can go to this dudes GitHub and see exactly how serious this project is.
https://github.com/LeetArxiv/Finite-Field-Assembly
Lol
This is nonsense.