Making 8-bit computers load from "tape" super fast

Making 8-bit computers load from "tape" super fast (posted 2022-04-03)

Back in the early 1980s, kids such as myself had their first computing experiences with 8-bit home computers such as the ZX Spectrum and the Commodore 64. And the only way, or in Europe in those days, the only affordable way to load games and save/load your own programs was from cassette tape. And boy was that slow. I remember that my fighter jet game on the C64 took 30 minutes to load.

Fortunately, there were fastloaders that replaced the built-in routines optimized for reliability with something much faster. In this video, Noel of the Youtube channel Noel's Retro Lab asks:

If we suddenly had a perfectly-reliable cassette tape, how fast could we possibly load data from it? This is a question I started pondering a while ago. To answer it, I had to do look into how data is stored on tape and how exactly we load it. Along the way, we'll find several limits and assumptions we have to work around on our quest for the fastest-possible loading.

Spoiler alert!

The big limitation he ran into was the little device he used as a standin for the cassette player, that just couldn't create the audio pulses with consistent timing as he pushed the speed as far as it would go.

Serial port to the rescue

These old 8-bit computers could just "bit bang" this type of stuff, where you could have a small bit of assembly code read or write I/O pins at completely predictable intervals. Modern computers just can't do this. Even if the sound pulse generating code wouldn't have to pause to read more data from the SD card, variable clock rates and hard or impossible to predict timing because of caches makes for inconsistent timing.

But then modern computers, even small microcontrollers, have all kinds of more advanced I/O options that we could use. For instance, the audio out interface. But I think we can do better, and use an RS232 serial port to generate our fake cassette signal. Not sure how practical this would be in practice, but I thought I'd see how fast this should be able to work in theory.

C64 hardware

Noel uses Z80 assembly on his Amstrad CPC. However, I'm more familiar with the C64 and 6502 assembly, so I'll use that for my proof of concept code below. However, it turns out that the cassette port on a C64 is unsuitable for decoding serial communication, as it will only measure the timing between two pulses, and not let us see the actual state of the signal. So we'll just connect the serial output of our imaginary tape emulation machine to the C64 user port pin that normally handles serial port input. (After accounting for the difference in imaginary voltages!)

Now don't get too excited: we can't just use the C64 implementation of the serial port to receive the data, as that implementation tops out at 2400 bps. And I want to go as fast as 57600 bps. That's about 5 kilobytes per second, enough to load any C64 program in 12 seconds or less.

That's about 200 times faster than the original Kansas City (more or less) standard and still 25 times faster than the improved 2400 bps version used in computers like the MSX.

The C64's 6510 CPU runs at just about 1 MHz. 1000000 clock pulses / 57600 bits = ~ 17 clock cycles per bit. So the challenge I'm setting for myself is write the code that reads one bit in no more than 17 clock cycles.

The protocol

In normal serial communication, both sides are set to the same speed. I think that's an unnecessary limitation for our purposes here. Instead, the sending device sends some synchronization bytes. These contain the byte $7E. (The convention on the C64 is that hexadecimal numbers are preceded by $ to keep them apart from decimal numbers. % for binary numbers.) This generates the following bits on the serial port:

S 0 1 1 1 1 1 1 0 S

Where the first S is the start bit, which is 0, and the second one the stop bit, which is 1. So:

0 0 1 1 1 1 1 1 0 1

When the line is idle, it's set to 1. If we now measure the time from the start of the first transition from 1 to 0 (at the beginning of the start bit) to the start of the second transition from 1 to 0, we know the time it takes to transmit 8 bits. This is more precise than trying to time a single bit. (However, for brevity, timing variations aren't addressed below.)

Then we need a few bytes that tell the receiver we're going to send a data block, and how long that data block is going to be. I'm limiting that length to 256 bytes as that makes the code on the 6510 (the C64's variant of the 6502) faster. We can then insert a few "idle" bytes during which the receiver has time to do some house keeping to get ready for the next data block.

6502 assembly in two paragraphs

The 6502 is a super simple CPU. It only has three 8-bit registers that programmers can use: the A register (accumulator) that can be used for simple math and logic operations, and the X and Y index registers that pretty much just count. There's also a status register that has a number of flags that are set based on the previous operation and can be used to perform branches (conditional jumps).

So as an assembly programmer you need to assemble your algorithms from really small and simple parts. So it's easy enough to do small things. Only the various addressing modes take a bit of time to get used to. The full 6502 instruction set.

The code

We set up our receiving code by loading the address where the incoming data is stored in bytes $FC and $FD. Some other initialization, assuming we're going to receive 256 bytes:

          SEI           ; disable interrupts
          LDY #$00      ; load $00 in Y, index for our data block

So now we first wait for the start bit using:

readbyte  LDA #$01      ; load $01 into A register
waitstart BIT $DD01     ; AND of A register value ($01) and CIA
                        ; register (4 cycles)
          BNE waitstart ; branch if not equal (not zero) back to
                        ; repeat (2/3 cycles)

Depending on when the start bit happened, we're done 2 to 9 cycles after that moment. So we want to use up another 20 cycles before we start reading the first data bit so we sample the data bit more or less halfway through. We still have six cycles worth of setup to do:

          LDX #$FE      ; load value $FE in X register (2 cycles)
          STX $FE       ; store X register in memory at $FE (2
                        ; cycles)
          SEC           ; set the carry (overflow) bit (2 cycles)

Then add another 7 NOPs (no operation) at 2 cycles each:

          NOP           ; #1
          NOP           ; #2
          NOP           ; #3
          NOP           ; #4
          NOP           ; #5
          NOP           ; #6
          NOP           ; #7

And then we can finally start reading a data byte:

readbit   BIT $DD01     ; check the CIA register bit (4 cycles)
          BEQ writezero ; branch if equal (zero) to write a zero,
                        ; otherwise continue and write a one
                        ; (2/3 cycles)
          ROL $00FE     ; rotate left: all bits shift, carry bit
                        ; (which is 1) goes into bit 0 (6 cycles)
          BCS readbit   ; if carry is set, jump back to read next
                        ; bit (3 cycles)
          BCC bitsdone  ; if carry is clear, we're done, skip
                        ; ahead (3 cycles)
writezero ASL $FE       ; arithmetic shift left, 0 bit becomes 0
                        ; (5 cycles)
          BCS readbit   ; if carry is set, jump back to read next
                        ; bit (3 cycles)

Now we have a complete byte, so store that byte and do it all over again until we have our full block:

bitsdone LDA $FE        ; load data from memory at $FE into A
                        ; (2 cycles)
         STA ($FC),Y    ; take memory location from $FC/$FD, add Y
                        ; to it, store A there (6 cycles)
         INY            ; increment Y register (2 cycles)
         BNE readbyte   ; if Y register is not 0, jump back to
                        ; read the next byte (2/3 cycles)

That's 13 cycles that happen during the stop bit, so the less critical parts are all fine.

Are we fast enough?

Now let's look at the timing of code that handles reading the individual bits. This is what gets executed when the incoming bit is zero:

readbit   BIT $DD01     ; check the CIA register bit (4 cycles)
          BEQ writezero ; branch if equal (zero) to write a zero
                        ; (3 cycles)
…
writezero ASL $FE       ; arithmetic shift left, 0 bit becomes 0
                        ; (5 cycles)
          BCS readbit   ; if carry is set, jump back to read next
                        ; bit (3 cycles)

So that's 4 + 3 + 5 + 3 = 15 cycles. And if the incoming bit is 1:

readbit   BIT $DD01     ; check the CIA register bit(4 cycles)
          BEQ writezero ; branch if equal (zero) to write a zero
                        ; (2 cycles)
          ROL $00FE     ; rotate left: all bits shift, carry bit
                        ; (which is 1) goes into bit 0 (6 cycles)
          BCS readbit   ; if carry is set, jump back to read next
                        ; bit (3 cycles)

Note that in this case, the first jump is not taken, and therefore only takes 2 cycles rather than 3. Having different numbers of cycles depending on whether we read a 0 or a 1 would be bad, so we need to compensate for that. However, it's only a 1-cycle difference and all 6502 instructions, even ones without any arguments such as NOP, take at least 2 cycles.

But we can use a trick: address memory location $FE using the full 16-bit address $00FE rather than the 8-bit zero page address $FE with the ROL instruction. This way, the ROL instruction uses an extra cycle, so in this case everything adds up to 4 + 2 + 6 + 3 = 15 cycles, too. So we actually need a 2-cycle NOP to reach our 17-cycle target!

What helped here is that I was able to avoid using the X register to count down the bits. Instead, the memory location $FE where we're going to store the incoming bits is initialized with $FE = %11111110. To make room for an incoming bit, those existing bits all move one position to the left when using the ROL or ASL instructions. When that happens, the highest bit is placed in the carry flag. So after the first 7 bits, the carry is always 1 and then we know we need to read more bits. But after receiving the 8th bit, that lone 0 moves into the carry and we know we're done.

Conclusion

A Commodore 64 with its 1 MHz should be just about fast enough to read serial data at 57600 bps. I'm assuming the same is true for a Z80 clocked at the commonly used 3.5 MHz. Not because the clock is so much faster, as the Z80 uses many more cycles per instruction than the 6502. But the Z80 has a lot more registers so it probably wouldn't be necessary to store intermediate results in memory, which is the slowest part of the 6502 code.

It would be interesting to see how well all of this theory holds up when it hits the real world… interesting, but time consuming.

by Iljitsch van Beijnum.