Semiconductors
Much has been said about repatriating parts of the semiconductor industry to the United States.
Copyright © Schmied Enterprises LLC, 2025.
There are three primary use cases for semiconductor chips. Sensors collect data or capture images. Compute integrated circuits cover the other two: some are optimized for very high bandwidth processing, while others are designed to achieve extremely low latencies for specific applications.
Historically, server processors and operating systems have been optimized for bandwidth, whereas laptop and handheld processors have prioritized low-latency performance.
Artificial intelligence (AI) has shifted this paradigm by significantly reducing design costs. Applications such as self-driving systems, user interfaces, and inference workloads demand low latencies. This calls for larger chips capable of fast parallel processing, which often means higher costs, more expensive memory, larger die area, and higher power consumption.
AI training, video streaming, and data analytics workloads, however, are bandwidth-intensive over time. In these cases, overall bandwidth becomes the critical factor, especially when cost trade-offs are considered. Systems optimized for bandwidth typically operate at lower clock speeds, utilize larger and more unified chips, and are designed for efficient power usage per megabyte.
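As a rough sketch of this trade-off, the example below compares a hypothetical latency-optimized design (a few fast cores) with a hypothetical bandwidth-optimized design (many slower cores). Every figure in it, core counts, clock speeds, operations per cycle, and work per task, is an assumption chosen for illustration rather than a specification of any real chip.

```python
# Hypothetical comparison of a latency-optimized design (few fast cores) and
# a bandwidth-optimized design (many slower cores). All figures are
# illustrative assumptions, not specifications of any real chip.

def aggregate_tops(cores, clock_ghz, ops_per_cycle):
    """Peak tera-operations per second across all cores."""
    return cores * clock_ghz * ops_per_cycle / 1e3

def task_latency_ms(task_mops, clock_ghz, ops_per_cycle):
    """Milliseconds for one core to finish a task of task_mops million ops."""
    return task_mops / (clock_ghz * ops_per_cycle)

TASK_MOPS = 50  # assumed work per interactive request, in millions of ops

designs = {
    "latency-optimized":   dict(cores=8,  clock_ghz=3.5, ops_per_cycle=8),
    "bandwidth-optimized": dict(cores=64, clock_ghz=1.0, ops_per_cycle=8),
}

for name, d in designs.items():
    print(f"{name}: {aggregate_tops(**d):.2f} Tops/s aggregate, "
          f"{task_latency_ms(TASK_MOPS, d['clock_ghz'], d['ops_per_cycle']):.2f} ms per task")
```

With these assumptions, the wide, slower design wins on aggregate throughput while the narrow, faster design finishes an individual task sooner, which is exactly the tension between the two workload classes.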
The topology and architecture of semiconductor chips can vary significantly. Some designers purchase pre-designed hardware blocks, known as intellectual property (IP), to assemble their chips. While IP blocks can be costly, they are thoroughly tested and reliable. Alternatively, some companies opt for chiplets, smaller modular dies assembled into a single package. Chiplets, though potentially more expensive, enable high-bandwidth die-to-die interconnects, often in the hundreds of gigabits per second range, making them particularly advantageous for AI training applications.
Consumer chips are often designed with a one-size-fits-all approach, meaning companies aim to include as many features as possible. Once a customer purchases a phone with a processor or system-on-a-chip (SoC), they expect it to handle a wide range of tasks. However, many such processors are not well suited for datacenter use. Expensive chip and datacenter space may be allocated to instructions or features that are rarely, if ever, used. For example, binary-coded decimal instructions are largely obsolete, yet they still occupy space in Intel and AMD processors that could otherwise hold additional or faster cores.
This is one of the reasons ARM processors have gained significant traction. Their reduced instruction set architecture is more modern and better aligned with real-world use cases. ARM processors deliver greater efficiency, requiring less chip and datacenter space while providing excellent value. Even when manufactured using older technologies, they remain cost-effective for the performance they offer.
A similar scenario applies to NVIDIA processors. High-end GPUs like the H100 include CUDA and tensor cores alongside traditional shader and graphics hardware that goes unused in AI training workloads. As much as 10-20% of the die may be devoted to such features. If that area were repurposed for additional CUDA or tensor cores, the cost of a 10-GPU H100 cluster could be significantly reduced.
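As a back-of-the-envelope sketch, the arithmetic below assumes 15% of the die could be reclaimed and that training throughput scales roughly linearly with that extra compute area; both assumptions are illustrative, not NVIDIA figures.

```python
# Back-of-the-envelope estimate of cluster savings if unused die area were
# repurposed for compute. The 15% reclaimed-area figure and the assumption
# that throughput scales linearly with compute area are illustrative only.

import math

reclaimed_area = 0.15                 # assumed fraction of the die reclaimed
speedup_per_gpu = 1 + reclaimed_area  # assumed throughput gain per GPU
baseline_gpus = 10                    # the 10-GPU cluster from the text

gpus_needed = math.ceil(baseline_gpus / speedup_per_gpu)
print(f"GPUs needed for the same training throughput: {gpus_needed} instead of {baseline_gpus}")
```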
Major cloud providers have responded by designing their own motherboards, processors, and even tensor cores. Amazon Web Services, for example, tunes its Graviton processors and Trainium accelerators for better cost efficiency.
It is logical to tailor chip architectures to specific use cases. A typical datacenter web server, for instance, may handle a workload of 1 Gbps, requiring a combined memory-compute bandwidth of 5-8 Gbps from server processors. Such a core might also need data storage or cache sufficient for approximately ten seconds of video buffering. These cores are often paired with 1 GB of memory, which is ideal for buffered streaming workloads where multiple clients access the stream from different positions. Netflix and Roku are prime examples of this use case.
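The sketch below just turns those figures into arithmetic; the 6x factor stands in for the 5-8x memory-to-network ratio quoted above, and everything else is unit conversion.

```python
# Rough sizing for the streaming web-server core described above. The 1 Gbps
# workload, the 5-8x memory-compute ratio, and the 10-second buffer come from
# the text; the arithmetic is just unit conversion.

link_gbps = 1.0            # aggregate traffic served by one core
bandwidth_factor = 6       # assumed point inside the 5-8x range
buffer_seconds = 10        # how much of the stream is kept in memory

memory_compute_gbps = link_gbps * bandwidth_factor
buffer_gb = link_gbps * buffer_seconds / 8   # bits to bytes

print(f"memory-compute bandwidth per core: ~{memory_compute_gbps:.0f} Gbps")
print(f"stream buffer per core: ~{buffer_gb:.2f} GB, close to the 1 GB pairing above")
```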
Database servers handling SQL or Spark workloads operate differently, often returning only 10% of the data they process. These workloads require a core with 1 GB of RAM paired with 10 GB of solid-state disk space. This storage is frequently part of a shared block disk on a storage array within the same physical rack, connected via Ethernet.
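A quick per-core sizing sketch of that setup follows; the 1 GB of RAM, 10 GB of block storage, and 10% selectivity come from the paragraph above, while the 2 Gbps scan rate over the in-rack Ethernet is an assumed figure.

```python
# Illustrative per-core sizing for the SQL/Spark scenario above. The 10%
# selectivity, 1 GB of RAM, and 10 GB of shared block storage per core come
# from the text; the scan rate is an assumption.

storage_gb_per_core = 10     # slice of the shared block disk on the in-rack array
ram_gb_per_core = 1          # working set and cache per core
selectivity = 0.10           # fraction of scanned data returned to clients

scan_gbps = 2.0              # assumed sequential scan rate per core over Ethernet
returned_gbps = scan_gbps * selectivity
full_scan_seconds = storage_gb_per_core * 8 / scan_gbps

print(f"scanned from storage: {scan_gbps:.1f} Gbps per core")
print(f"returned to clients:  {returned_gbps:.1f} Gbps per core")
print(f"time to scan the full {storage_gb_per_core} GB slice: {full_scan_seconds:.0f} s")
```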
Artificial intelligence training presents unique challenges. It involves processing unified datasets, often sparse in nature, where most data points lack connections. Despite this, large volumes of data must be processed to identify weights or biases that need adjustment. GPU vector processors are particularly effective for handling such datasets. In contrast, traditional single-instruction single-data cores from ARM, AMD, or Intel are better suited for stream processing tasks.
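The toy example below contrasts the two styles, using NumPy on a CPU as a stand-in for a vector unit: a sparse weight update is applied once as a single wide vector operation and once element by element, the way a single-instruction single-data core would stream through it. The array size, 1% sparsity, and learning rate are arbitrary choices for illustration.

```python
# Toy contrast between vector (SIMD/GPU-style) processing and
# single-instruction single-data stream processing, using NumPy on a CPU as
# a stand-in for a vector unit. All sizes and rates are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
lr = 0.01                                   # arbitrary learning rate
weights = rng.standard_normal(n).astype(np.float32)
grads = rng.standard_normal(n).astype(np.float32)
mask = rng.random(n) < 0.01                 # sparse: ~1% of weights change

# Vector style: one wide operation over the whole array, bandwidth-bound.
weights_vec = weights - lr * grads * mask

# Stream (SISD) style: one element at a time, latency-bound per element.
weights_stream = weights.copy()
for i in np.flatnonzero(mask):
    weights_stream[i] -= lr * grads[i]

assert np.allclose(weights_vec, weights_stream)
```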
Video compression applications typically require vector processor cores to identify hidden correlations within datasets, enabling better compression. This demands significant cross-core memory bandwidth.
Decoding video streams involves expanding and storing packets of streamed data. A single core was sufficient for this task a decade ago, especially for a single HD screen. However, with advancements in network speeds, decoding now often requires larger processors. For example, a 10 Mbps video stream transmitted over a 100 Gbps channel may seem underutilized, but the physical speed of light becomes the limiting factor for frame presentation. Such applications may require multiple cores to decode memory-stored chunks rapidly, reducing latency for professional gaming or remote driving scenarios. Frames must be decoded within 10 milliseconds to accommodate jitter and ensure smooth performance. While a highly clocked single core could achieve this, cost considerations often favor parallel decoding using 10-100 cores operating at lower clock speeds (e.g., 100 MHz) with vector instruction support.
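A rough budget check for that scenario: the 10 ms frame budget, the 100 MHz clock, and the 10-100 core range come from the paragraph above, while the cycles-per-frame figure is an assumption chosen only to make the arithmetic concrete, and parallelization overhead is ignored.

```python
# Latency-budget check for the decode scenario above. The 10 ms budget and
# the 100 MHz many-core option come from the text; the cycles-per-frame
# figure is an assumption, and parallelization overhead is ignored.

FRAME_BUDGET_MS = 10.0
CYCLES_PER_FRAME = 6e6      # assumed decode work for one frame

def decode_time_ms(cores, clock_mhz):
    """Time to decode one frame when its chunks are split evenly across cores."""
    return CYCLES_PER_FRAME / (cores * clock_mhz * 1e6) * 1e3

for cores, clock_mhz in [(1, 3000), (10, 100), (100, 100)]:
    t = decode_time_ms(cores, clock_mhz)
    verdict = "within" if t <= FRAME_BUDGET_MS else "over"
    print(f"{cores:>3} cores @ {clock_mhz:>4} MHz: {t:5.2f} ms ({verdict} the 10 ms budget)")
```

Under these assumptions both options meet the budget; the choice between them comes down to the cost and power arguments above.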
Semiconductor design is a fascinating and complex field, offering ample opportunities to optimize for cost, latency, and power efficiency.