Decoding Acorn BBC Basic Files
Posted on Sunday, June 23, 2019
During my archeological expedition into ancient computer stuff, I
encountered a bunch of old BBC BASIC programs I'd written on my Acorn
Archimedes A5000 computer in the early 1990s. For the sake of
reminiscence, I wanted to have a look at them, but was thwarted by the
fact that they are stored in a tokenized format that is not directly
readable. So I did some research on the format and created a
converter...
Format of a BBC BASIC file
First, let's have a look at the structure of the lines in a BBC BASIC file. Each line looks like this:
+------+--------+--------+--------+----------------------------------+
| 0x0D | lno hi | lno lo | length | line data (text and tokens).... |
+------+--------+--------+--------+----------------------------------+
- The first byte of each line is always 0x0D (13)
- The second and third bytes are the MSB and LSB of the line number
- The fourth byte is the length of the whole line, including the four-byte preamble
- The rest of the line is the line data consisting of text and tokenized BASIC statements
- A sequence of 0x0D 0xFF at the beginning of a line marks the end of the file
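The line layout above can be sketched in C roughly as follows. This is only a minimal illustration, not the code of my converter: the names bbc_line and bbc_read_line are made up for this example, and it assumes the whole file has already been read into a memory buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: one decoded line header. */
typedef struct {
    unsigned lineno;      /* line number (stored MSB first in the file) */
    size_t length;        /* total line length including the 4-byte preamble */
    const uint8_t *data;  /* first byte of the line data */
    size_t data_len;      /* length of the line data (length - 4) */
} bbc_line;

/* Reads the line header at offset pos of buf.
 * Returns 1 if a line was read, 0 at the end marker (0x0D 0xFF),
 * and -1 if the data is malformed or truncated. */
int bbc_read_line(const uint8_t *buf, size_t size, size_t pos, bbc_line *out)
{
    if (pos + 2 > size || buf[pos] != 0x0D)
        return -1;                      /* every line starts with 0x0D */
    if (buf[pos + 1] == 0xFF)
        return 0;                       /* end of file marker */
    if (pos + 4 > size)
        return -1;                      /* preamble is cut off */

    size_t len = buf[pos + 3];
    if (len < 4 || pos + len > size)
        return -1;                      /* length must cover the preamble */

    out->lineno = ((unsigned)buf[pos + 1] << 8) | buf[pos + 2];
    out->length = len;
    out->data = buf + pos + 4;
    out->data_len = len - 4;
    return 1;
}
```

A caller would start at offset 0 and add the returned length to the offset after each line until the function returns 0.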
Tokens
Detokenizing the line data is pretty straightforward: Tokens are either
one or two bytes long and start with a byte in the range 0x7F..0xFF.
Everything in the range 0x20..0x7E is treated as normal text.
- 0xC6 0xNN is a two-byte function token
- 0xC7 0xNN is a two-byte command token
- 0xC8 0xNN is a two-byte statement token
- The second byte in a two-byte token is always in the range 0x8E..0xFF
- 0x8D introduces a line reference (see below)
- Everything else in the range 0x7F..0xFF is a one-byte token
The list of tokens is included in Appendix B of the BBC BASIC Reference Manual, so it is quite easy to create a decoding table.
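To illustrate the rules above, here is a small sketch in C that classifies a byte of line data and reports how many bytes it covers. The names bbc_kind, bbc_classify and bbc_token1_name are invented for this example, and the lookup function only knows three one-byte tokens (PRINT, GOTO, REM); a real decoding table would have to be filled in from Appendix B of the manual.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum {
    BBC_TEXT,     /* plain text, 0x20..0x7E */
    BBC_TOKEN1,   /* one-byte token */
    BBC_TOKEN2,   /* two-byte token (0xC6/0xC7/0xC8 prefix) */
    BBC_LINEREF,  /* 0x8D plus three encoded bytes */
    BBC_OTHER     /* anything that does not fit the rules */
} bbc_kind;

/* Classifies data[i]; *consumed receives the number of bytes covered. */
bbc_kind bbc_classify(const uint8_t *data, size_t len, size_t i,
                      size_t *consumed)
{
    uint8_t b = data[i];
    *consumed = 1;

    if (b >= 0x20 && b <= 0x7E)
        return BBC_TEXT;
    if (b < 0x7F)
        return BBC_OTHER;               /* control characters */
    if (b == 0x8D) {
        if (i + 3 >= len)
            return BBC_OTHER;           /* line reference is cut off */
        *consumed = 4;
        return BBC_LINEREF;
    }
    if (b >= 0xC6 && b <= 0xC8) {
        if (i + 1 >= len || data[i + 1] < 0x8E)
            return BBC_OTHER;           /* invalid second byte */
        *consumed = 2;
        return BBC_TOKEN2;
    }
    return BBC_TOKEN1;
}

/* Deliberately incomplete sample table: returns the keyword for a
 * one-byte token, or NULL if this sketch does not know it. */
const char *bbc_token1_name(uint8_t token)
{
    switch (token) {
    case 0xE5: return "GOTO";
    case 0xF1: return "PRINT";
    case 0xF4: return "REM";
    default:   return NULL;
    }
}
```

A detokenizer would call bbc_classify in a loop over the line data, print text bytes as they are, replace tokens with their keywords, and advance by the number of bytes consumed.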
Line References
Line references, such as those used in GOTO nnnnn or GOSUB nnnnn
statements, are stored in an internal format. The sequence starts with
0x8D, followed by three bytes:
[0x8D] [b0] [b1] [b2]
The line number in a line reference is calculated as follows:
lineno = ((b2 EOR (b0 * 16)) * 256 + (b1 EOR ((b0 * 4) AND 0xC0))) AND 0xFFFF;
Actually, the internal format of line references was not so easy to figure out. I found the decoding algorithm in the RISC OS source code (in the file "s.bastxt"), but I also learned that there is a document on J.G.Harston's website describing it.
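The formula above translates almost literally into C. The following is a minimal sketch; the function name bbc_decode_line_ref is chosen for this example only, and b0, b1 and b2 are the three bytes that follow 0x8D.

```c
#include <stdint.h>

/* Decodes the three bytes following a 0x8D line reference marker. */
unsigned bbc_decode_line_ref(uint8_t b0, uint8_t b1, uint8_t b2)
{
    /* b0 carries the top two bits of both halves of the line number;
     * b2 holds the rest of the high byte, b1 the rest of the low byte. */
    unsigned hi = b2 ^ (b0 * 16u);
    unsigned lo = b1 ^ ((b0 * 4u) & 0xC0u);

    /* The mask drops the surplus bits that b0 * 16 shifts past bit 7. */
    return (hi * 256u + lo) & 0xFFFFu;
}
```

For example, the bytes 0x54 0x4A 0x40 decode to line 10, and 0x64 0x68 0x43 decode to line 1000.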
Putting it together
After having researched the information above, I created a small program that takes a tokenized BBC BASIC file as its input and translates it into plain text. The program is implemented in C and you can download the source code here.
Conclusion
Although there is no practical use for my old BBC BASIC programs these days, I find it interesting (and sometimes inspiring) to have a look at the stuff I created almost 30 years ago. To achieve this goal, it would most certainly have been easier to just use an emulator, but researching an old file format and building a tool to read it is the most fun part of digital archeology.


