A PDP-11 machine. )Image Credit: Wikipedia)
Everyone’s familiar with spellchecking, a handy feature in browsers, text editors, and other software. The tool proved extremely challenging to install as systems lacked sufficient memory to support it at the time. But, a lossless compression algorithm, which is still used today, allowed it to be integrated into a system.
In 1975, AT&T programmers explored ways to use Unix for text processing. However, this required a working spellchecking tool. The easiest way to do this involves putting the dictionary into system memory, and the computer would then look each word up. It’s a tedious task that takes time, and during those years, computers had insufficient memory to handle it.
But, through compression, we can easily load a 250 kB file into 64 kB of RAM. However, this is more challenging than we think. To do this, you would need to use a PC with a superfast multicore processor and lots of RAM. You can create a 256 kB text file with 260,000 characters, compressing it via a compressor and the Lempel-Ziv-Markov chain algorithm to shrink the size to 257 bytes.
AT&T used PDP-11 systems (16-bit microcomputers), which are weaker as they feature less memory (64 kB to a few hundred kB) and processing power. Due to hardware constraints, it would be challenging for the PDP-11 systems to search through a 256 kB text file as the system may struggle to handle that task. It can handle a compressed 257-byte file.
Computer scientist Steve Johnson developed the first Unix-based, disk-backed spellchecking prototype. While functional, the system was slow and made mistakes. Douglas McIlroy developed an algorithm that decreased the amount of memory required to store dictionary words and introduced a data structure, allowing the entire dictionary to load into a few kB of memory. This algorithm only needed 14 bits per word, which meant a 30,000-word dictionary could fit in fewer than 52 kB of memory. Theoretically, the system only required at least 13.57 bits per word.
A big part of the solution was Golomb coding, a lossless compression still used today in Rice coding, which is used in formats like FLAC, Apple Lossless, and Lossless JPEG.
Have a story tip? Message me at: http://twitter.com/Cabe_Atwell