Hot Chips 2021 Live Blog: Machine Learning (Graphcore, Cerebras, SambaNova, Anton)
by Dr. Ian Cutress on August 24, 2021 2:25 PM EST
02:28PM EDT - Welcome to Hot Chips! This is the annual conference all about the latest, greatest, and upcoming big silicon that gets us all excited. Stay tuned during Monday and Tuesday for our regular AnandTech Live Blogs.
02:30PM EDT - Starting here in a couple of minutes
02:30PM EDT - Friend of AT, David Kanter, is chair for this session
02:32PM EDT - 'ML is not the only game in town'
02:33PM EDT - First talk is from Simon Knowles, co-founder and CTO of Graphcore, on the Colossus MK2
02:34PM EDT - Designed for AI
02:34PM EDT - New structural type of processor - the IPU
02:34PM EDT - 'Why do we need new silicon for AI'
02:35PM EDT - Embracing graph data through AI
02:36PM EDT - Classic scaling has ended
02:36PM EDT - Creating hardware to solve graphs
02:37PM EDT - Control program can control the graph compute in the best way to run on specialized hardware
02:37PM EDT - Hardware abstraction - tiles with processors and memory, with an IO interconnect
02:37PM EDT - bulk synchronous parallel compute
02:38PM EDT - thread fences for communication
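(For readers new to the term: bulk synchronous parallel execution alternates a local compute phase, an exchange phase, and a barrier sync. A minimal sketch of that superstep structure is below; the names, data layout, and exchange plan are our own illustration, not Graphcore's Poplar API.)

```python
# Minimal sketch of a bulk synchronous parallel (BSP) superstep:
# compute locally, exchange results, then sync before the next step.
# Tile contents and the exchange plan are invented for illustration.
from concurrent.futures import ThreadPoolExecutor

def bsp_superstep(tiles, exchange_plan):
    # Compute phase: every tile works only on its own local memory.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(sum, tiles))
    # Exchange phase: data moves between tiles per a compiled plan.
    for src, dst in exchange_plan:
        tiles[dst].append(partials[src])
    # Sync phase: the barrier is implicit here, since we only return
    # (and the next superstep only starts) once all transfers are done.
    return tiles

tiles = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(bsp_superstep(tiles, exchange_plan=[(0, 1), (1, 2), (2, 0)]))
```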
02:38PM EDT - 'record for real transistors on a chip'
02:38PM EDT - This chip has more transistors on it than any other N7 chip from TSMC
02:38PM EDT - within one reticle
02:39PM EDT - 896 MiB of SRAM on N7
02:40PM EDT - 4 IPUs in a 1U
02:40PM EDT - Lightweight proxy host
02:41PM EDT - 1.2 Tb/s off-chassis IO
02:41PM EDT - 800-1200 W typical, 1500W peak
02:41PM EDT - Can use PyTorch, TensorFlow, ONNX, but its own Poplar software stack is preferred
02:43PM EDT - Half the die is memory
02:43PM EDT - 24 tiles, with 23 used and one spare for redundancy
02:43PM EDT - 1.325 GHz global clock
02:43PM EDT - 823 mm2, TSMC N7
02:44PM EDT - 32 bit instructions, single or dual issue
02:44PM EDT - 6 execution threads, launch worker threads to do the heavy lifting
02:45PM EDT - Aim for load balancing
02:45PM EDT - 1.325 GHz global clock
02:46PM EDT - 47 TB/s data-side SRAM access
02:46PM EDT - FP16 and FP32 MatMul and convolutions
02:47PM EDT - TPU relies too much on large matrices for high performance
02:48PM EDT - Each tile can generate 128 random bits per cycle
02:48PM EDT - can round down stochastically
02:48PM EDT - at full speed
02:48PM EDT - Avoid FP32 data with stochastic rounding. Helps minimize rounding error and energy use
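(As an aside, stochastic rounding picks the lower or upper representable value with probability proportional to proximity, so rounding is unbiased on average. A toy sketch on a generic quantization grid, not Graphcore's FP16 hardware path:)

```python
# Toy stochastic rounding to a fixed quantization grid: round down or up
# with probability proportional to the distance to each neighbour, so the
# expected value is preserved. Not Graphcore's FP16 path, just the idea.
import numpy as np

def stochastic_round(x, step, rng):
    scaled = np.asarray(x, dtype=np.float64) / step
    lower = np.floor(scaled)
    frac = scaled - lower                        # in [0, 1)
    round_up = rng.random(scaled.shape) < frac   # P(round up) == frac
    return (lower + round_up) * step

rng = np.random.default_rng(0)   # fixed seed keeps results deterministic, as in the Q&A
vals = np.array([0.30, 1.26, 2.50])
# Averaging many stochastic roundings recovers the originals (unbiased):
print(np.mean([stochastic_round(vals, 0.5, rng) for _ in range(20000)], axis=0))
```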
02:49PM EDT - Trace for program
02:49PM EDT - 60% cycles in compute, 30% in exchange, 10% in sync. Depends on the algorithm
02:50PM EDT - Compiler load balances the processors
02:50PM EDT - Exchange spine
02:50PM EDT - 3 cycle drift across chip
02:51PM EDT - Chip power
02:51PM EDT - pJ/flop
02:52PM EDT - 60/30/10 in the pie chart
02:52PM EDT - arithmetic energy dominates
02:52PM EDT - IPU more efficient in TFLOP/Watt
02:53PM EDT - Not using HBM - on die SRAM, low bandwidth DRAM
02:53PM EDT - DDR for model capacity
02:53PM EDT - HBM has a cost problem - IPU allows for DRAM
02:54PM EDT - 40 GB HBM triples the cost of a processor
02:54PM EDT - Added cost of CoWoS
02:54PM EDT - Vendor adds margin with CoWoS
02:54PM EDT - No such overhead with DDR
02:55PM EDT - Off-chip DDR bandwidth suffices for streaming weight states for large models
02:56PM EDT - More SRAM on chip means less DRAM bandwidth needed
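(A rough back-of-envelope, with placeholder numbers that are ours rather than Graphcore's, shows why DDR streaming bandwidth can suffice when weights are reused enough from on-chip SRAM:)

```python
# Back-of-envelope: time to stream a model's weights once from off-chip DDR.
# Every number below is an assumption for illustration, not a Graphcore figure.
params = 10e9               # assumed 10B-parameter model
bytes_per_param = 2         # assumed FP16 weights
ddr_bw = 100e9              # assumed ~100 GB/s aggregate DDR bandwidth
stream_time = params * bytes_per_param / ddr_bw
print(f"One full weight pass streams in ~{stream_time:.1f} s")
# The argument in the talk: with large on-die SRAM (896 MiB here) holding
# working state, each streamed weight can be reused many times, so the
# streaming time overlaps with compute rather than stalling it.
```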
02:58PM EDT - Q&A
03:00PM EDT - Q: Clocking is mesochronous but static mesh - assume worst case clocking delays, or something else? A: Behaves as if synchronous. In practice, clocks and data chase each other. Fishbone layout of the exchange is to make it straightforward
03:00PM EDT - Q: Are results deterministic? A: Yes because each thread and each tile has its own seed. Can manually set seeds
03:05PM EDT - Next Talk is Cerebras
03:05PM EDT - WSE-2 new system configurations
03:06PM EDT - Started in 2016, WSE-1 in 2019
03:06PM EDT - 2.6 trillion transistors
03:06PM EDT - 850k cores
03:07PM EDT - CS-2 system on sale today
03:07PM EDT - it costs a few million
03:07PM EDT - Traditional approaches can't keep up
03:08PM EDT - Upcoming multi-trillion parameter models
03:08PM EDT - something has to change in silicon - need a better approach
03:08PM EDT - but large models are hard to support
03:08PM EDT - Massive memory, massive compute, massive IO
03:09PM EDT - More partitioning of model across more chips
03:09PM EDT - More sync
03:09PM EDT - Becomes a distribution complexity problem, rather than a NN problem
03:09PM EDT - How to solve this problem Cerebras style
03:09PM EDT - Cerebras for Extreme Scale - new style of execution, supports up to 120 trillion parameters
03:09PM EDT - roughly the same as the number of synapses in the brain
03:10PM EDT - also needs to run fast
03:10PM EDT - up to 192 WSE-2 with near linear perf scaling
03:10PM EDT - 10x weight sparsity speedup
03:10PM EDT - scales easily with push of button
03:10PM EDT - Use weight streaming, rather than data streaming
03:11PM EDT - disaggregate model memory from compute from dataset
03:11PM EDT - can scale memory or compute as needed
03:12PM EDT - Base compute unit is a single CS-2, 850k cores, 14 kW, 1.6 TB/s bandwidth
03:12PM EDT - Add memory store to hold parameters, weights
03:12PM EDT - MemoryX Technology
03:12PM EDT - custom memory store for weights
03:12PM EDT - independent of that, SwarmX interconnect to control
03:13PM EDT - Designed to scale NN training with near linear scaling
03:13PM EDT - Simple execution flow with Cerebras software stack
03:13PM EDT - Program cluster same way as a single system
03:13PM EDT - 'Easy as Pie'
03:13PM EDT - Rethink the execution model
03:14PM EDT - All model weights stored externally, streamed onto the CS2 system as needed
03:14PM EDT - As they stream through, the CS-2 performs the calculation
03:14PM EDT - On the backward pass, gradients are streamed. Weight update occurs on MemoryX, though SwarmX can help
03:16PM EDT - Solving the latency problem
03:16PM EDT - weight streaming has no back-to-back dependencies
03:16PM EDT - ensure weight memory is not latency sensitive
03:16PM EDT - coarse grained pipeline - a pipeline of layers
03:17PM EDT - Stream out weights as the next stream comes in
03:17PM EDT - hide the extra latency from extra weights
03:17PM EDT - same performance as if weights were local
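(The latency-hiding argument is essentially double-buffered prefetch at layer granularity: fetch the next layer's weights while the current layer computes. A hedged sketch with invented names and timings, not Cerebras software:)

```python
# Coarse-grained pipelining sketch: fetch layer N+1's weights while layer N
# computes, hiding off-chip weight latency. Names and timings are invented.
import threading, queue, time

def fetch_weights(layer_id):
    time.sleep(0.01)                  # stands in for the MemoryX/SwarmX transfer
    return f"weights[{layer_id}]"

def run_layer(layer_id, weights, acts):
    time.sleep(0.01)                  # stands in for on-wafer compute
    return acts + [layer_id]

def forward_pass(num_layers):
    ready = queue.Queue(maxsize=1)    # one layer of weights prefetched ahead
    def prefetcher():
        for i in range(num_layers):
            ready.put((i, fetch_weights(i)))
    threading.Thread(target=prefetcher, daemon=True).start()

    acts = []
    for _ in range(num_layers):
        layer_id, w = ready.get()     # usually already there: fetch overlapped compute
        acts = run_layer(layer_id, w, acts)
    return acts

print(forward_pass(4))                # [0, 1, 2, 3]
```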
03:17PM EDT - now capacity
03:18PM EDT - two main capacity problems
03:18PM EDT - store the giant model
03:18PM EDT - All parameters in MemoryX up to 2.4 PB capacity
03:18PM EDT - 120 trillion weights. DRAM and flash hybrid storage
03:18PM EDT - Internal compute for weight update/optimizer
03:19PM EDT - MemoryX does the intelligent pipeline scaling
03:19PM EDT - Flexible capacity with MemoryX
03:20PM EDT - No need for partitioning with WSE2
03:21PM EDT - Support for 100kx100k MatMuls
03:22PM EDT - Cluster multiple CS-2 through SwarmX
03:22PM EDT - SwarmX is independent of CS-2 and MemoryX
03:23PM EDT - Gradients are reduced on the way back, weights are broadcast on the way forward
03:23PM EDT - modular and disaggregated
03:23PM EDT - Project near-linear to 192 CS-2 systems
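(The broadcast/reduce description matches the classic data-parallel pattern. A minimal sketch of that flow with assumed names, purely to illustrate what SwarmX is said to do in hardware:)

```python
# Data-parallel flow the SwarmX description implies: identical weights are
# broadcast to every CS-2, each computes gradients on its own data shard,
# and the gradients are summed (reduced) on the way back. Illustrative only.
import numpy as np

def broadcast(weights, n):
    return [weights.copy() for _ in range(n)]

def local_gradient(weights, shard):
    return weights - shard            # stand-in for a real backward pass

weights = np.zeros(4)
shards = [np.full(4, v) for v in (1.0, 2.0, 3.0)]      # one shard per CS-2
grads = [local_gradient(w, s) for w, s in zip(broadcast(weights, 3), shards)]
reduced = np.sum(grads, axis=0)                        # reduction through "SwarmX"
weights -= 0.1 * reduced                               # update held on "MemoryX"
print(weights)
```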
03:25PM EDT - 'Is this enough?' No, need smarter models
03:25PM EDT - Outpacing Moore's law by an order of magnitude
03:25PM EDT - Will need a football field of silicon to run a model
03:25PM EDT - Need sparse models to get same answers with less compute
03:25PM EDT - Creating sparsity in dense models
03:26PM EDT - No hardware to solve for this sparsity, except Cerebras
03:26PM EDT - Hardware data control for non-zero data compute
03:26PM EDT - accelerates all types of sparsity
03:26PM EDT - Full performance across all BLAS levels
03:28PM EDT - Sparsity is introduced in the MemoryX unit. Sparse weights are streamed, SwarmX broadcasts to CS-2, and the CS-2 does the compute. Sparse gradients are produced, streamed back out, reduced through SwarmX, and updated on MemoryX. All happens natively, same flow as for dense compute
03:29PM EDT - near linear speedup with sparsity
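(To see why skipping zeros helps: if the hardware only streams and multiplies non-zero weights, the multiply-accumulate count drops roughly in proportion to sparsity. A toy numpy illustration, not Cerebras code:)

```python
# Toy illustration of why weight sparsity gives speedup when only non-zero
# weights are streamed and multiplied. Not Cerebras code.
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512))
w[rng.random(w.shape) < 0.9] = 0.0            # assumed ~90% weight sparsity
x = rng.standard_normal(512)

rows, cols = np.nonzero(w)                    # only non-zero weights kept
y = np.zeros(512)
np.add.at(y, rows, w[rows, cols] * x[cols])   # one MAC per non-zero weight

assert np.allclose(y, w @ x)                  # same result as the dense product
print(f"MACs: dense {w.size}, sparse {rows.size} "
      f"(~{w.size / rows.size:.1f}x fewer)")
```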
03:31PM EDT - enormous layer support for up to 100k hidden dimensions
03:32PM EDT - Don't need different software to go from 1 device to 192 - execution model is always the same
03:34PM EDT - Q&A
03:35PM EDT - Q: Bandwidth from MemoryX to CS-2? A: MemoryX doesn't have to be in the same rack, can be cabled. BW is over 1 Tbit/s - not just through MemoryX but also SwarmX
03:35PM EDT - Q: Interconnect is custom? A: Standard, but not disclosing, and not directly exposed to the user. Intended to be integrated into the system and seamless from the user's point of view
03:36PM EDT - Q: How are activations handled for skip connections? A: All activations are kept on wafer and get picked up a few layers later when needed
03:36PM EDT - Next talk is SambaNova
03:39PM EDT - Cardinal SN10 RDU
03:39PM EDT - TSMC N7, 40B transistors
03:39PM EDT - BF16 focused AI chips for training
03:40PM EDT - base unit of compute is a 12 TB memory system in a quarter rack with 8 SN10 chips
03:40PM EDT - standard rack form factor
03:40PM EDT - PyTorch, TensorFlow, UserGraph, or User Kernel
03:41PM EDT - Dataflow pipe is SambaNova software stack
03:42PM EDT - graphs are rewriting the way we think about software
03:43PM EDT - current systems not suited for dataflow - goldilocks zone
03:43PM EDT - using dataflow to the max
03:44PM EDT - orange boxes here are compute
03:44PM EDT - high-level architecture
03:45PM EDT - four tiles of reconfigurable compute and memory
03:45PM EDT - resources can be managed or combined
03:45PM EDT - Direct access to TBs of DDR4 off-chip memory
03:45PM EDT - Pattern Memory Units, Pattern Compute Units, Switches
03:46PM EDT - AGUs are before compute and memory
03:46PM EDT - architecture allows for scale out
03:46PM EDT - supports systolic modes of execution
03:47PM EDT - Feed the PCUs
03:47PM EDT - support arbitrary memory access patterns
03:47PM EDT - Data align units
03:48PM EDT - Router is not just nearest neighbor - compiler can construct arbitrary routes
03:48PM EDT - Allows for transfer and transparent scaleout
03:49PM EDT - Here's how to map an operation
03:49PM EDT - and communications
03:49PM EDT - Fully pipelined softmax operation
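(One way to picture a 'fully pipelined' softmax: break it into stages that rows stream through back-to-back, with each stage mapped to its own compute and memory units. The stage decomposition below is our own illustration, not SambaNova's actual PCU/PMU mapping:)

```python
# Softmax decomposed into pipeline stages that rows stream through, as one
# way to picture a spatial/dataflow mapping. Our own decomposition, not
# SambaNova's actual placement of compute and memory units.
import numpy as np

def stage_max(rows):
    for r in rows:
        yield r, r.max()                  # stage 1: max per row (for stability)

def stage_exp(items):
    for r, m in items:
        yield np.exp(r - m)               # stage 2: shifted exponentials

def stage_normalize(items):
    for e in items:
        yield e / e.sum()                 # stage 3: normalize to probabilities

rows = (np.random.default_rng(i).standard_normal(8) for i in range(4))
for out in stage_normalize(stage_exp(stage_max(rows))):
    print(round(out.sum(), 6))            # each softmax row sums to 1.0
```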
03:49PM EDT - Here's something more complex - the LayerNorm
03:50PM EDT - Can also repurpose resources to trade off space vs time
03:50PM EDT - Compiler takes advantage
03:51PM EDT - Kernel-by-kernel vs spatial execution
03:51PM EDT - Automatic kernel fusion - no need to manually hand fuse operations
03:52PM EDT - Use IO bandwidth more efficiently
03:52PM EDT - High performance high utilization
03:53PM EDT - Compiler can group sparse and dense multiplies to be executed on chip
03:54PM EDT - 1.5 TB of DDR4 per chip
03:55PM EDT - 12 TB of DRAM for 8 chips per quarter rack - smallest compute unit for sale
03:55PM EDT - Interleaving in a fine-grained manner to be used in proportion
03:56PM EDT - Schedule compiler optimizations
03:58PM EDT - Run multiple applications on each node
03:58PM EDT - Simple scale out
03:59PM EDT - one quarter rack replaces 416 GPUs with 32 TB of HBM in 8 racks
03:59PM EDT - 1 trillion parameter NLP (Natural Language Processing) training
04:00PM EDT - Scale up to 50k x 50k medical imaging, support any size model
04:01PM EDT - Direct analysis on SambaNova
04:01PM EDT - First models in late 2019
04:02PM EDT - full resolution with RDU
04:02PM EDT - raising performance across the board
04:04PM EDT - Q&A time
04:05PM EDT - Q: Bandwidth at switches? A: Enough to sustain high streaming throughput - more than what you think, '150+ TB/s' - 50 km of wire just for that
04:05PM EDT - Q: How long does it take to compile?
04:06PM EDT - A: Quick. BERT-Large takes a minute or two. GPT 175B compiles one segment and replicates it, so it's getting to around the same time
04:07PM EDT - Q: Mem bandwidth A: six channels per RDU, DDR4-2666 to DDR4-3200 - 48 channels total in a quarter rack
04:08PM EDT - Q: Train a 1T model estimate? A: Depends on the dataset. What matters to us is efficiency.
04:08PM EDT - Now time for the Anton 3 ASIC
04:09PM EDT - D. E. Shaw Research
04:10PM EDT - Fire breathing monster
04:10PM EDT - Molecular dynamics simulation
04:11PM EDT - Almost static snapshots - but atoms move
04:11PM EDT - molecules move!
04:11PM EDT - MD allows modelling
04:12PM EDT - Requires knowing the positions of the atoms in the molecule and in the surrounding matrix
04:12PM EDT - discrete timesteps of a few femtoseconds
04:12PM EDT - Force computation described by a model
04:13PM EDT - Forces = bonds + vdW + electrostatics
04:13PM EDT - Intractable without ridiculous amounts of compute
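(For readers new to MD: the per-timestep loop itself is simple; it's the force model and the sheer number of femtosecond steps that make it expensive. A minimal velocity-Verlet sketch with a toy force standing in for bonds + vdW + electrostatics, not Anton's force field:)

```python
# Minimal velocity-Verlet MD loop: compute forces, advance positions and
# velocities by one small timestep, repeat. The toy force (a spring to the
# origin) stands in for bonds + van der Waals + electrostatics.
import numpy as np

def toy_forces(pos, k=1.0):
    return -k * pos                       # placeholder for the real force model

def velocity_verlet(pos, vel, mass, dt, steps):
    f = toy_forces(pos)
    for _ in range(steps):
        vel += 0.5 * dt * f / mass
        pos += dt * vel
        f = toy_forces(pos)
        vel += 0.5 * dt * f / mass
    return pos, vel

rng = np.random.default_rng(0)
pos = rng.standard_normal((100, 3))       # 100 "atoms"
vel = np.zeros_like(pos)
pos, vel = velocity_verlet(pos, vel, mass=1.0, dt=1e-3, steps=1000)
print(pos.shape)
```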
04:14PM EDT - millisecond scale protein simulation from Anton 1
04:14PM EDT - Anton 2 vastly increased performance
04:14PM EDT - It's all about the color of the logo
04:14PM EDT - Here's Anton 2, made on Samsung
04:14PM EDT - Custom ASIC
04:14PM EDT - Two kinds of computational tile
04:15PM EDT - Flex tile, high-throughput interaction subsystem
04:15PM EDT - PPIMs have unrolled arithmetic pipelines
04:16PM EDT - Periphery is the SerDes to connect multiple chips together
04:16PM EDT - To make it better, need to scale PPIMs and Geometry cores
04:16PM EDT - Also address performance bottlenecks - such as scaling off chip bandwidth
04:16PM EDT - Also increasing simulation size support
04:17PM EDT - Control the design and implementation
04:17PM EDT - Anton 3 core tile
04:17PM EDT - Central router
04:17PM EDT - Same GC and PPIM as Anton 2 but with evolutions
04:18PM EDT - co-location of specialized compute resources
04:18PM EDT - Sync functionality is distributed
04:19PM EDT - Bond length and angle calculators
04:19PM EDT - Dedicated hardware
04:20PM EDT - Anton 3 keeps bond calculation off the critical path
04:20PM EDT - Force calculations get dedicated hardware scaled with size
04:21PM EDT - Solution is to partition near and far calculations
04:21PM EDT - dedicated hardware for both
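(The near/far split is the standard idea of computing short-range pairs directly and handling long-range electrostatics with a cheaper method. A toy cutoff-based partition, just to illustrate the idea rather than Anton 3's pipelines:)

```python
# Toy cutoff-based partition of pairwise interactions into "near" pairs
# (computed directly) and "far" pairs (left to a cheaper long-range method).
# Illustrates the idea only, not Anton 3's dedicated pipelines.
import numpy as np

def partition_pairs(pos, cutoff):
    near, far = [], []
    for i in range(len(pos)):
        for j in range(i + 1, len(pos)):
            d = np.linalg.norm(pos[i] - pos[j])
            (near if d < cutoff else far).append((i, j))
    return near, far

pos = np.random.default_rng(0).standard_normal((50, 3))
near, far = partition_pairs(pos, cutoff=1.0)
print(f"{len(near)} near pairs for the pairwise hardware, "
      f"{len(far)} far pairs for the long-range path")
```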
04:22PM EDT - Also Edge tile
04:22PM EDT - simplifying communications
04:22PM EDT - separate edge network
04:22PM EDT - MD-specific compression
04:25PM EDT - global low skew clock mesh - engineered global routing
04:25PM EDT - Column level redundancy
04:25PM EDT - Robust power delivery
04:25PM EDT - MIMCAP
04:25PM EDT - Top layers almost exclusively for power
04:26PM EDT - 360 W for 451 mm2 on TSMC 7nm at 2.8 GHz
04:26PM EDT - 110k atoms per node, 528 cores
04:26PM EDT - 31.8 billion transistors
04:27PM EDT - Running simulations within 9 hours of first silicon
04:27PM EDT - Node boards
04:28PM EDT - 32 node boards in cages
04:29PM EDT - 128 nodes in a rack
04:29PM EDT - 512 nodes in 4 racks
04:29PM EDT - unique backplane
04:29PM EDT - 16 bidirectional links
04:29PM EDT - X dimension in the backplane
04:29PM EDT - 3D torus
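(In a basic 3D torus each node connects to neighbours in +/- x, y and z with wraparound in every dimension. The sketch below shows generic neighbour indexing for an assumed 8x8x8 layout, which would give the 512 nodes mentioned; it ignores that Anton 3 has more physical links per node than this simple picture, and is not its actual routing.)

```python
# Neighbour addressing in a 3D torus: links in +/- x, y, z with wraparound
# in every dimension. Assumed 8x8x8 grid (512 nodes); generic formula only,
# not Anton's routing tables.
def torus_neighbors(x, y, z, dims=(8, 8, 8)):
    dx, dy, dz = dims
    return [((x + 1) % dx, y, z), ((x - 1) % dx, y, z),
            (x, (y + 1) % dy, z), (x, (y - 1) % dy, z),
            (x, y, (z + 1) % dz), (x, y, (z - 1) % dz)]

print(torus_neighbors(7, 0, 3))   # wraps around: the +x neighbour is at x = 0
```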
04:30PM EDT - Liquid cooling with CDU and quick release fittings
04:30PM EDT - 100 kW per rack
04:31PM EDT - ASIC power of 500W
04:31PM EDT - Anton 3 is 20x faster than A100 using hand optimized NVIDIA code for the same simulation
04:32PM EDT - one Anton 3 beats a 512-node Anton 1
04:32PM EDT - 100 microseconds per day
04:33PM EDT - Multiple GPUs don't help, perf is lower!
04:33PM EDT - This is insane performance
04:35PM EDT - Q&A
04:36PM EDT - Q: Can you apply the hardware to other workloads? A: Broad set of MD related workloads. Haven't put much energy beyond that, some interesting internal projects though
04:37PM EDT - Q: What numerical formats are used in Anton 3? A: 32-bit fixed point in the general pipeline - specialised formats vary, some areas use a 14-bit mantissa/5-bit exponent, some use log formats
04:38PM EDT - Q: Power management. All of the pins are available for power and ground - DVFS control methods? A: No DVFS, do some ad-hoc dynamic frequency scaling through ramp limiters. Haven't needed to use them
04:39PM EDT - Q: Is the mesh high power? A: Difficult trade-off vs mesochronous - we chose to do a unified common predriver clock tree onto a shared mesh for low latency. Out of a 360W chip, 40-50W is mesh power
04:40PM EDT - Q: Advanced packaging? A: Ecosystem wasn't there yet when we were architecting - this took 8 years to build. Next time around, we're looking into it
04:41PM EDT - Q: Can you scale beyond 512 nodes? A: Hardware can scale more than 512 at network and link layer. Machine is designed to run at most 512 nodes. Larger installs could run multiple simulations and shared data.
04:42PM EDT - Q: What interconnect speeds? A: NRZ SerDes - dual unidirectional - 29-30 gigabit/sec, still being tuned. No need for FEC
04:45PM EDT - That's a wrap
3 Comments
WaltC - Tuesday, August 24, 2021 - link
Cerebras WSE-2: I can only wonder about the yields...;) And what you'd plug this behemoth into! 2.6 trillion transistors. Also, I would suggest the CEO might want to give serious thought to doing something about his last name in terms of his company, as "Lie" is not exactly a word which inspires trust, imo.

Ian Cutress - Thursday, August 26, 2021 - link
Yield is 100%, they have failover built into the silicon to absorb defects. Also don't be so culturally ignorant on names.

Speedfriend - Monday, August 30, 2021 - link
How do these all compare to the Tesla chip?