Decoding Acorn BBC Basic Files
Posted on Sunday, June 23, 2019 by TheBlackzone
During my archeological expedition into ancient computer stuff, I encountered a bunch of old BBC BASIC programs I'd written on my Acorn Archimedes A5000 computer in the early 1990s. For the sake of reminiscence, I wanted to have a look at them, but was thwarted by the fact that they are stored in a tokenized format that is not directly readable. So I did some research on the format and created a converter...
Format of a BBC BASIC file
First, let's have a look at the structure of the lines in a BBC BASIC file. Each line looks like this:
+------+--------+--------+--------+----------------------------------+
| 0x0D | lno hi | lno lo | length | line data (text and tokens)....  |
+------+--------+--------+--------+----------------------------------+
- The first byte of each line is always 0x0D (13)
- The second and third bytes are the MSB and LSB of the line number
- The fourth byte is the length of the whole line, including the four-byte preamble
- The rest of the line is the line data, consisting of text and tokenized BASIC statements
- A sequence of 0x0D 0xFF at the beginning of a line marks the end of the file
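As a rough illustration, the layout above could be walked in C along these lines. This is only a sketch: the struct bbc_line and the function bbc_next_line are names made up for this example and are not taken from the converter's source, and the bounds checks are an assumption about how one might guard against truncated files.

```c
#include <stddef.h>
#include <stdint.h>

/* One parsed line; the type and field names are illustrative only. */
typedef struct {
    uint16_t number;        /* line number (stored MSB first in the file) */
    const uint8_t *data;    /* start of the line data (text and tokens) */
    size_t data_len;        /* length byte minus the four-byte preamble */
} bbc_line;

/*
 * Parses the line starting at buf[*pos] and advances *pos past it.
 * Returns 1 on success, 0 at the end-of-file marker (0x0D 0xFF),
 * and -1 if the bytes do not form a valid line.
 */
int bbc_next_line(const uint8_t *buf, size_t size, size_t *pos, bbc_line *out)
{
    size_t p = *pos;

    if (p + 1 >= size || buf[p] != 0x0D)
        return -1;                  /* every line must start with 0x0D */
    if (buf[p + 1] == 0xFF)
        return 0;                   /* 0x0D 0xFF marks the end of the file */
    if (p + 4 > size)
        return -1;                  /* truncated preamble */

    size_t len = buf[p + 3];        /* length includes the preamble */
    if (len < 4 || p + len > size)
        return -1;

    out->number   = (uint16_t)((buf[p + 1] << 8) | buf[p + 2]);
    out->data     = buf + p + 4;
    out->data_len = len - 4;
    *pos = p + len;
    return 1;
}
```

Calling bbc_next_line in a loop, starting with *pos set to 0 and stopping when it no longer returns 1, visits every line in the file in order.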
Tokens
Detokenizing the line data is pretty straightforward: Tokens are either one or two bytes long and start with a byte in the range 0x7F..0xFF. Everything in the range 0x20..0x7E is treated as normal text.
- 0xC6 0xNN is a two-byte function token
- 0xC7 0xNN is a two-byte command token
- 0xC8 0xNN is a two-byte statement token
- The second byte in a two-byte token is always in the range 0x8E..0xFF
- 0x8D introduces a line reference (see below)
- Everything else in the range 0x7F..0xFF is a one-byte token
The list of tokens is included in Appendix B of the BBC BASIC Reference Manual, so it is quite easy to create a decoding table.
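To make the rules above a bit more concrete, here is a minimal C sketch that classifies a byte of line data and looks up a keyword. All names (bbc_kind, bbc_classify, bbc_item_length, bbc_keyword) are hypothetical, and the four token values in the lookup are only a small sample that should be checked against Appendix B; a real decoding table would also need the two-byte tokens.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum {
    BBC_TEXT,       /* 0x20..0x7E: normal text */
    BBC_LINE_REF,   /* 0x8D: followed by three encoded bytes */
    BBC_TWO_BYTE,   /* 0xC6, 0xC7, 0xC8: first byte of a two-byte token */
    BBC_ONE_BYTE,   /* any other byte in 0x7F..0xFF */
    BBC_OTHER       /* below 0x20: not covered by the rules above */
} bbc_kind;

/* Classifies the first byte of an item in the line data. */
bbc_kind bbc_classify(uint8_t b)
{
    if (b >= 0x20 && b <= 0x7E) return BBC_TEXT;
    if (b == 0x8D) return BBC_LINE_REF;
    if (b == 0xC6 || b == 0xC7 || b == 0xC8) return BBC_TWO_BYTE;
    if (b >= 0x7F) return BBC_ONE_BYTE;
    return BBC_OTHER;
}

/* Number of bytes the item starting with b occupies in the line data. */
size_t bbc_item_length(uint8_t b)
{
    switch (bbc_classify(b)) {
    case BBC_LINE_REF: return 4;    /* 0x8D plus b0, b1, b2 */
    case BBC_TWO_BYTE: return 2;
    default:           return 1;
    }
}

/* Sample of one-byte tokens (assumed values, verify against Appendix B). */
const char *bbc_keyword(uint8_t token)
{
    switch (token) {
    case 0xE4: return "GOSUB";
    case 0xE5: return "GOTO";
    case 0xF1: return "PRINT";
    case 0xF4: return "REM";
    default:   return NULL;         /* not in this sample table */
    }
}
```

A detokenizer built on this would step through the line data, copy text bytes unchanged, and replace each token with its keyword, using bbc_item_length to know how far to advance.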
Line References
Line references, such as those used in GOTO nnnnn or GOSUB nnnnn statements, are stored in an internal format. The sequence starts with 0x8D, followed by three bytes:
[0x8D] [b0] [b1] [b2]
The line number in a line reference is calculated as follows:
lineno = ((b2 EOR (b0 * 16)) * 256 + (b1 EOR ((b0 * 4) AND 0xC0))) AND 0xFFFF;
Actually, the internal format of line references was not so easy to figure out. I found the decoding algorithm in the RISC OS source code (in the file "s.bastxt"), but I also learned that there is a document on J.G.Harston's website describing it.
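For illustration, the formula for the line number translates almost literally into C. The function name bbc_decode_line_ref is made up for this sketch; the arithmetic is done in 32 bits so that the final mask with 0xFFFF can discard the surplus high bits, just as in the formula.

```c
#include <stdint.h>

/*
 * Decodes the three bytes that follow 0x8D into a line number.
 * Direct transcription of:
 *   ((b2 EOR (b0 * 16)) * 256 + (b1 EOR ((b0 * 4) AND 0xC0))) AND 0xFFFF
 */
uint16_t bbc_decode_line_ref(uint8_t b0, uint8_t b1, uint8_t b2)
{
    uint32_t hi = (uint32_t)b2 ^ ((uint32_t)b0 * 16u);
    uint32_t lo = (uint32_t)b1 ^ (((uint32_t)b0 * 4u) & 0xC0u);

    /* Bits above 15 are leftovers from b0 and are masked away here. */
    return (uint16_t)((hi * 256u + lo) & 0xFFFFu);
}
```

Working through the formula by hand, the bytes 0x54 0x4A 0x40 give line number 10, and 0x64 0x68 0x43 give line number 1000.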
Putting it together
After having researched the information above, I created a small program that takes a tokenized BBC BASIC file as its input and translates it into plain text. The program is implemented in C and you can download the source code here.
Conclusion
Although there is no practical use for my old BBC BASIC programs these days, I find it interesting (and sometimes inspiring) to have a look at the stuff I created almost 30 years ago. To achieve this goal, it would most certainly have been easier to just use an emulator, but researching an old file format and building a tool to read it are the most fun in digital archeology.