Computing in the era of Doom: What were PCs like in 1993?

Game Engine Black Book: Doom. By Fabien Sanglard. 429 pages.

Doom. The game that popularized the term “deathmatch”. The game that made modding accessible to millions of players. The game that killed productivity in thousands of offices around the world. It wasn’t the first first-person shooter (id Software had released Wolfenstein 3D just a year prior), but it was the game that put the genre on the map. Fabien Sanglard’s excellent book dives deep into the internals of the game, detailing the compromises made to render a 3D world on consumer hardware of the era. The result is not just a technical reference but a time capsule of what it was like to develop and play games on a PC in 1993, when every byte of memory mattered and every CPU cycle had to be fought for.

Playing Doom in 1993 required an expensive machine, and even then it often ran below 20 fps. How much would such a machine cost? What was the state of the art back then, and how did it all work?

The CPU

The state-of-the-art CPU of the era was the Intel 486 DX2, roughly 2.5x as fast as its predecessor, the 386, at the same clock speed. How did Intel achieve such a feat? By improving the instruction pipeline and adding a brand-new on-CPU cache.

The Pipeline

What is pipelining? Imagine you have multiple roommates sharing a laundry room, all wanting to do one load of laundry at the same time. It would be pretty silly for each roommate to reserve the whole room for their entire load. Instead, once one roommate's washing is done and in the dryer, the next roommate can start their washing.

Pipelining illustrated with a laundry metaphor

A similar idea applies to CPUs. Each instruction goes through several phases during its execution. A processor might have the following phases:
  • Prefetch - grab the instruction from memory before it is needed
  • Decode - Figure out what the instruction does
  • Execute - Actually execute the instruction (Add, subtract, jump)
  • Write-Back - Write the results of the executed instruction to the registers and/or cache

In the 386, there was a 3-stage pipeline: Prefetch, Decode, and Execute. However, decoding took two clock cycles to complete, meaning an instruction could only be completed every two cycles. The 486 introduced a 5-stage pipeline with two decode steps, allowing an instruction to complete every cycle. This alone doubled the throughput of the processor.

386 vs 486 pipeline comparison

However, there was a tradeoff: a deeper pipeline is more likely to starve. If the pipeline is forced to clear (for example, by a branch jumping to another memory address), it takes five cycles to restart the flow of instructions rather than the 386's four. If the 486's pipeline stalled frequently, it would actually be slower than the 386.

The core problem was getting instructions and data from RAM to the CPU. While CPUs were getting faster, memory access times were not. Fetching from RAM took at minimum two cycles, usually more. If the CPU was constantly waiting for information from RAM, that was time not spent computing. What Intel needed was a way to bring instructions and data to the CPU before they were needed, ensuring an always full pipeline.

What Intel needed was an on-CPU memory cache.

The Cache

When designing the 386, Intel engineers originally planned to add a cache, but it would not fit in the lithography machine that made the chip, so it was abandoned in favor of an optional off-CPU cache. Over the next four years, semiconductor manufacturing processes improved enough for Intel to add an 8 KiB cache to the 486. It’s a very small cache: for a computer with 4 MiB of RAM, 8 KiB covers only 0.2% of the available memory space. But it didn’t need to be large – it just needed to hold the right information at the right time.

Unlike later processors such as the Pentium 4, the cache did not attempt to predict what it would need in advance. When the processor tried to access memory not in the cache (known as a cache miss), it would instruct the memory controller to fetch the missing information. But it would not cache just that one byte: it would pull in the surrounding 15 bytes as well, a 16-byte unit referred to as a cache line. This exploited spatial locality: if your code accesses one byte, the surrounding bytes are likely needed soon too. Sequential code execution or iterating through an array naturally filled the cache with useful nearby data. This worked well in practice for the tight inner loops typical of game engines like Doom’s renderer, where the same instructions and data structures were accessed over and over.

Sounds great in theory, but there were some problems: because the 486’s cache was unified (shared between data and instructions), a data-heavy operation could evict instructions from the cache, and vice versa. Imagine Doom’s renderer is running a loop — the instructions for that loop are in the cache, running at 1 cycle per hit. Then the loop reads a texture or a lookup table, and that data load evicts some of the loop’s own instructions. On the next iteration, those instructions have to be fetched from RAM again, stalling the pipeline. The code is fighting with its own data for 8 KiB of space. These evictions are known as conflict misses (or collision misses).

Later processors like the Pentium solved this by splitting the cache into a separate instruction cache and data cache (8 KiB each), so the two could never interfere with each other. For the 486, however, Intel had to compromise with a unified cache. Engineers instead used a clever trick to reduce conflict misses: set up the cache so that 4 different cache lines mapping to the same location can coexist. This is known as a 4-way set associative cache.

To understand what this means, consider the simplest alternative: a direct-mapped cache. In a direct-mapped cache, each memory address maps to exactly one slot (known as a set) in the cache. If two addresses happen to map to the same slot (and in 8 KiB of cache with megabytes of RAM, collisions are frequent), they evict each other every time — even if the rest of the cache is completely empty.

In the following image, we have a cache with 4 cached memory blocks, A, B, C, and D. A and C map to Set 0, and B and D map to Set 1. In a direct-mapped cache, A and B have to be evicted to free up room for C and D, even if Sets 2–7 are empty. Why can’t those slots be used for A and B? If any slot could hold any cache line, the cache would need a lookup table linking each slot to each address. That table would take up space that could otherwise hold cached data, and would require a time-consuming search through the entire table before finding the cache line – defeating the purpose of the cache.

Direct-mapped vs 4-way set associative cache

The 486’s 4-way set associative design instead divided the cache into 128 sets of 4 ways (slots) each, rather than 512 direct-mapped sets. A given memory address still maps to a specific set, but within that set it can occupy any of the 4 ways. This means 4 different addresses that would have collided in a direct-mapped cache can now coexist. Only when a 5th competing address arrives does something need to be evicted. With this design, the 486 found what it needed in the cache 92% of the time during normal operation (a 92% cache hit rate).

The improved throughput of the 486 would allow Doom to run at a decent framerate. But Doom’s designers would have to contend with an even bigger limitation: The dreaded DOS memory limit.

RAM and DOS Extenders

640K ought to be enough for anybody

While Bill Gates never actually said the oft-repeated quote above, it was true that DOS was limited to 1 MiB of RAM, with 384 KiB reserved for system use, leaving only 640 KiB for applications.

Why did DOS have this limit? The original processor of the IBM PC, the Intel 8088 (an 8086 variant), had a 20-bit address space (1 MiB), but its registers were only 16 bits wide, enough to address just 64 KiB at a time. To form 20-bit memory addresses, Intel engineers came up with a bizarre trick called segmented addressing, where two 16-bit values (a segment and an offset) were combined into a 20-bit address.

8088 segmented addressing

This design had a lot of problems: memory manipulation was error-prone, since different segment/offset combinations could point to the same physical memory address. The situation only became more complicated with the 24-bit 286 and the 32-bit 386. Intel’s solution was to let post-8088 processors run in two modes: Real Mode, where the processor functioned as a very fast 8088, and Protected Mode, which let you use the full 32 bits of a 386 or 486.

Problem solved! Well, it would have been if Microsoft’s MS-DOS supported Protected Mode. Unfortunately, to keep applications backward-compatible, the OS ran only in Real Mode, forcing developers into the 640 KiB limit. This created a market for DOS extenders that let DOS applications use the extra memory.

How did they work? An application (DOS could only run one thing at a time, after all) would start in Protected Mode. When it needed to make a system call to DOS, the extender would intercept the call, switch the processor to Real Mode, make the system call, translate the result from 16 bits to 32 bits, and finally switch back to Protected Mode.

They were complicated to set up. You had to locate the DOS extender file, load it and the application, and configure the application to use it – around 100 lines of C code. Fortunately, enterprising developers from Canada solved the problem by creating a compiler that bundled the extender into your program. Watcom’s C compiler retailed for a mere $639 (about $1,460 today) – hard to believe today, when gcc and clang are free, world-class compilers. But for the price, Watcom freed you from the hated 16-bit limitations of MS-DOS.

Like many games and applications of the era, Doom did not use the malloc/free memory allocators provided by libc, opting instead for its own memory manager. This was to prevent memory fragmentation: as objects are allocated and freed, memory ends up riddled with small holes, making it impossible to allocate a larger object even when enough total space is free.

Memory fragmentation

In this example, there is enough total room for F, but no single hole is big enough for it to fit in, so F cannot be allocated. Memory could be defragmented, but the game would have to pause while it consolidated memory. That would be unacceptable.

To stop memory fragmentation from crashing the game, Doom used a zone-based memory manager: all assets for a level were grouped into the same zone, so when you finished the level, the entire zone could be freed in one sweep, leaving a large contiguous block for the next level to use.

Graphics

Having a fast CPU and enough RAM to render Doom’s semi-3D environments was important, but the pixels still had to get to your monitor somehow. Once the CPU rendered a frame of Doom, it had to be sent over the bus to the VGA (Video Graphics Array) controller. VGA and the ISA bus were never designed for fast action games, but for rendering graphs, text, and spreadsheets. If you ran VGA at 640 x 480, as Windows 3.1 did, you were limited to 16 colors. You try animating demon gore with only one shade of red! If you wanted more colors, you could use “Mode Y”, which supported 256 colors at 320 x 200. Doom’s developers opted for that mode.

What is the ISA bus and why was it a performance killer? In addition to the difficulty of programming for VGA, actually getting the frame from RAM to VRAM was a very slow operation. On PCs of the era, if you wanted to move data between the CPU/RAM and a device like a hard drive, graphics controller, or modem, the data had to transit the ISA bus. This part of the PC had not been updated since 1984, and while its theoretical throughput was 8 MiB/s, in practice it delivered 1-2 MiB/s and could dip as low as 500 KiB/s.

Doom’s game logic ran in “tics” at 35 tics/s, so it could display at most 35 fps. But to actually reach that, you had to copy 35 frames a second, which required transferring about 2.1 MiB/s – more than the bus’s real-world bandwidth.

Even simple desktop GUIs like Windows 3.1 had to resort to dragging a window by its outline to get acceptable performance from PCs of the era. If it showed the contents of the window while moving it, the graphics controller would be unable to keep up.

Windows 3.1
Look at me ye mighty PCs and despair!
Hardware manufacturers were fed up with an ISA standard that hadn't been refreshed in almost 10 years. They introduced a new standard called VESA Local Bus (VLB) that would be much faster.

VLB was about 10x faster than ISA, and it was very simple: it adopted the same protocol as the 486’s bus unit. The bus is “local” in that it is hooked up directly to the CPU, with no chipset mediating between the CPU and the hardware. This simplicity sped up adoption.

There were, however, some downsides:

The bus ran at the same speed as the CPU (it was synchronous with it) and became unstable past 40 MHz. The electrical load the CPU could drive also fell as the clock speed rose, so fewer slots were available at higher speeds: three at 33 MHz, two at 40 MHz, and just one at 50 MHz.

Hardware manufacturers had to make sure their peripherals worked across this range of bus speeds, leading to a lot of compatibility problems and frustration for consumers who just wanted a computer that worked.

Then in 1993, Intel introduced the Pentium’s bus protocol, on which PCI was based, and it was totally incompatible with VLB. VLB quickly faded away, but it was around long enough to make Doom playable.

VGA was difficult to program for: the CPU did the actual work of rendering, but the scene it rendered in RAM then had to be copied to the VGA controller’s Video RAM (VRAM). To fit 256 KiB of video memory into a 64 KiB address window, the VRAM was divided into 4 memory banks:

  • Bank 0: Pixels 0, 4, 8…
  • Bank 1: Pixels 1, 5, 9…
  • Bank 2: Pixels 2, 6, 10…
  • Bank 3: Pixels 3, 7, 11…

This design also allowed the VGA controller to read from all 4 banks in parallel, making it fast enough to drive a 60 Hz monitor. However, for the programmer it got complicated very quickly: a simple operation like drawing a horizontal line might span all four banks, requiring four separate bank switches and four separate copy operations. Every pixel write required the programmer to calculate which bank it belonged to, switch to that bank if necessary, and compute the correct address within the bank. Forgetting to switch banks or switching to the wrong one produced garbled output with pixels scattered across the screen.

An ordinary programmer would despair at such a setup. But John Carmack instead turned VGA’s weaknesses into a strength. Doom cleverly kept three frames in VRAM: one currently being drawn to the screen, the next frame to display, and one currently being copied from RAM to VRAM. This triple buffering eliminated screen tearing, where the monitor displays a half-drawn frame. The renderer could draw to an off-screen buffer while the monitor displayed a completed one, then swap them during the vertical blanking interval, after the monitor finished drawing a frame.

Sound

Who can forget the iconic soundtrack to E1M1?

Doom’s iconic sound effects and music didn’t come for free. The PC shipped with a very basic speaker that was mostly used to diagnose the health of the system: the number of beeps at boot (one = healthy) gave you some diagnostics. Any serious gamer had to invest in a sound card.

Here’s a comparison of them over time through another classic PC game, The Secret of Monkey Island:

Networking

The Whole Package

If you wanted to play Doom at anything approaching 30 FPS, you needed the following:

  • 486 DX2 ($550)
  • Diamond Stealth Pro Graphics Card ($350-$500)
  • SoundBlaster 16 ($179)

Adding in a motherboard, hard drive, power supply, etc. would run you another $1,000, bringing the total to around $2,179 – close to $5,000 in today’s dollars. For comparison, the median household earned $31,240 in 1993, so such a computer represented 7% of the median family’s annual income. And if you wanted the best sound, you needed two sound cards, not just a SoundBlaster.

According to the book, this could play Doom at 24 FPS. Now we complain if it’s one frame below 60 fps.

Sometimes I like to appreciate just how far we’ve come in computing in the 35 years I’ve been on this planet. I was too young to experience the glory days of Doom.

Works Cited

Crawford, John H. “The i486 CPU: Executing Instructions in One Clock Cycle.” IEEE Micro 10 (1990): 27–36.