Hacker News — vinext + Cloudflare Workers

▲AMD Strix Halo RDMA Cluster Setup Guide (github.com)

87 points by jakogut 5 hours ago | 13 comments

pixelpoet 3 hours ago [-]

I have two 128GB Strix Halos and have been extremely excited about Antirez's (Redis author) work on DS4, especially with 4-bit quant using two machines: https://github.com/antirez/ds4

Right now the speed isn't good for GLM 5.2, but Deepseek V4 Flash speed is okay for me (actually reading the output) and quite usable. See kyuz0's great recent video here: https://www.youtube.com/watch?v=PkKXm_mKCCM

With a bit more speed and model improvements, local AI becomes a reasonable practical thing! The biggest problem is all the tech companies making consumer hardware completely unaffordable, and I don't think this is accidental. Look at Micron's profits and share price lately...

I got my Strix machines for ~2k eur each, best computers this 90s kid has ever owned, but those days are gone :(

rnewme 2 hours ago [-]

What's the advantage of ds4 over llama.cpp, esp if down the line they upstream his forked kernels?

gruez 2 hours ago [-]

>The biggest problem is all the tech companies making consumer hardware completely unaffordable, and I don't think this is accidental. Look at Micron's profits and share price lately...

You realize "tech companies" isn't a monolith? Micron charging inflated prices doesn't magically benefit OpenAI. The "high prices keep out competitors" theory doesn't make much sense either. It's like saying Denny's benefits from higher egg prices because it makes cooking eggs at home more expensive.

sdf4j 49 minutes ago [-]

You got it wrong. Use appliances instead of eggs. If getting an oven gets more expensive, I'd rather keep going to Denny's.

It's classic capex vs opex. I'd keep paying my OpenAI subscription instead of dropping $3k to run a subpar model. If the thing cost $1k I would consider it.

mkj 46 minutes ago [-]

OpenAI etc. are going to have higher utilisation of the hardware, so they can afford it more than small companies/people. Efficient resource use matters more when the resources are expensive.

jcastro 4 hours ago [-]

This is amazing!

I'm working on a three-node Strix Halo agentic OS factory designed to be maintained by local agents: https://github.com/projectbluefin/testing-lab

This memory bandwidth combo is amazing for homelabbers. kyuz0's work on these containers has made the investment in this kit so valuable. I hope Framework is sending you hardware!

https://projectbluefin.io/server/ is what I'm hoping to ship, designed to ship setups like this out of the box. Things like this would be so much harder without kyuz0!

(Note: The 64GB ones are going for $1700-ish empty; the prices on the 128s are outrageous. We can just keep making the labs more deterministic over time!)

mestadler 58 minutes ago [-]

Yep, nice write-up, seems we are all doing this. It's as close as you can get to provider level on essentially prosumer hardware. I'll share what I've got with this running under k0s and the NPU work.

mestadler 55 minutes ago [-]

This is exactly the type of technical depth that makes a difference. I've been following all the work you have been doing.

jmyeet 1 hour ago [-]

So this is kind of fascinating. The main hardware costs here seem to be:

- 2x Framework Desktop AI Mainboards with 128GB of RAM for $3150 each

- 2x 100G Ethernet controllers for ~$500 each

So the Framework board has a single PCI-e 4.0 x4 slot, which amounts to 8GB/s or 64Gbps theoretical, so you're not getting 100G. Also, the 100G cards all seem to be PCI-e x16 cards for obvious reasons, so you need a riser or an adapter or something to even get them to work.

I don't know how hot a 100GbE copper NIC runs but, from experience, 10GbE NICs have basically been giant heatsinks. So fiber might be advisable, and I expect short fiber cables here probably aren't cost-prohibitive given everything else.
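The slot arithmetic above can be checked with a short sketch. It assumes the standard PCIe 4.0 figures of 16 GT/s per lane and 128b/130b encoding, and it ignores packet and protocol overhead, so real throughput would be lower still. The function and constant names are illustrative only.

```python
# Back-of-the-envelope link ceiling for a 100G NIC in a PCIe 4.0 x4 slot.
# Assumptions: 16 GT/s per lane and 128b/130b encoding (standard PCIe 4.0
# figures); packet/protocol overhead is ignored.
PCIE4_GT_PER_LANE = 16.0      # signalling rate per lane, in GT/s
PCIE4_ENCODING = 128 / 130    # fraction of transferred bits that carry payload


def pcie4_gbps(lanes: int) -> float:
    """Theoretical PCIe 4.0 bandwidth in Gbit/s for the given lane count."""
    return PCIE4_GT_PER_LANE * lanes * PCIE4_ENCODING


def gbps_to_gbytes(gbps: float) -> float:
    """Convert Gbit/s to GB/s."""
    return gbps / 8


slot_gbps = pcie4_gbps(4)                  # the x4 slot on the board
nic_gbps = 100.0                           # nominal 100GbE line rate
ceiling_gbps = min(slot_gbps, nic_gbps)    # the slower side sets the limit

print(f"PCIe 4.0 x4: {slot_gbps:.1f} Gbit/s ({gbps_to_gbytes(slot_gbps):.2f} GB/s)")
print(f"Ceiling for a 100G NIC in that slot: {ceiling_gbps:.1f} Gbit/s")
```

The 64Gbps figure is the raw signalling rate (4 lanes at 16 GT/s); after encoding the slot carries roughly 63Gbit/s, so the NIC is limited by the slot rather than by its own line rate.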

As an aside, if you are using Ethernet for clustering and you're clustering 2 devices, in an ideal world you'd be using simplex Ethernet but that's not an option here.

I wonder if the author considered USB4 for clustering? I ask because I know people who have clustered Mac Studios over TB5 and that bandwidth is up to 120Gbps. The version of USB4 on the Ryzen AI 395 seems to be 40Gbps, which isn't that far off 8GB/s over PCI-e 4.0 x4.

But the limiting factor with Strix Halo (and DGX Spark for that matter) is memory bandwidth, both under 300GB/s. The obvious comparison is to the Mac Studio. Unfortunately the largest spec they currently sell is 96GB. It had been as high as 512GB. And 96GB is $6700+, but you're also getting way better performance AFAICT, e.g. [1]. The M3 Ultra has ~900GB/s memory bandwidth.

You can alternatively buy a MacBook Pro with M5 Max and 128GB of RAM (now $8000, was $5500-6000 a few days ago) but that tops out at ~600GB/s, which is still double these mini AI boxes.
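To see why memory bandwidth dominates, here is a rough sketch using the common rule of thumb that token generation is bounded by how fast the weights can be streamed from memory. The bandwidth figures are the ones quoted above (300GB/s is used as an upper limit for the "under 300GB/s" boxes). The 20GB read per token is a purely hypothetical number chosen for illustration, not a measurement of any model mentioned in this thread.

```python
# Rough ceiling on decode speed: bandwidth divided by bytes streamed per token.
# This ignores compute, caches, and interconnect, so real speeds are lower.


def decode_ceiling_tokens_per_s(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on tokens/s if each token streams gb_read_per_token from memory."""
    if gb_read_per_token <= 0:
        raise ValueError("gb_read_per_token must be positive")
    return bandwidth_gb_s / gb_read_per_token


# ASSUMPTION: hypothetical amount of weight data read per generated token.
GB_READ_PER_TOKEN = 20.0

# Bandwidth figures as quoted in the comment above, in GB/s.
BANDWIDTH_GB_S = {
    "Strix Halo / DGX Spark (treated as 300, an upper limit)": 300.0,
    "MacBook Pro M5 Max": 600.0,
    "Mac Studio M3 Ultra": 900.0,
}

for name, bandwidth in BANDWIDTH_GB_S.items():
    ceiling = decode_ceiling_tokens_per_s(bandwidth, GB_READ_PER_TOKEN)
    print(f"{name}: at most {ceiling:.0f} tokens/s")
```

Whatever the per-token figure is, the ratios stay the same: the ceiling scales linearly with bandwidth, which is where the "double" comparison comes from.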

Oh and if you don't want to go the way of these Framework motherboards, you can buy a whole 128GB Strix Halo PC for $3k or less.

I think the main point here though is we're only a few years away from running 300B+ (or even 1T+) param models at useful speeds on enthusiast hardware.

[1]: https://www.reddit.com/r/LocalLLaMA/comments/1u5mfaq/you_can...

kcb 1 hour ago [-]

No reason to use fiber on short runs like that. DAC cables are cheap and better in pretty much every way over short distances. You're probably thinking of RJ-45 NICs and SFP modules which are known to run pretty hot.

layla5alive 21 minutes ago [-]

+1 fiber over short distance just adds power/heat and latency compared to DAC - fiber is nice for ease of cabling and airflow, but not performance or cost when below a few meters.

mestadler 52 minutes ago [-]

He did cover the TB/USB4 ;)

gregoryl 32 minutes ago [-]

Indeed, here: https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/...
