Decoding Acorn BBC Basic Files
Posted on Sunday, June 23, 2019 by TheBlackzone
During my archeological expedition into ancient computer stuff, I encountered a bunch of old BBC BASIC programs I'd written on my Acorn Archimedes A5000 computer in the early 1990s. For the sake of reminiscence, I wanted to have a look at them, but was thwarted by the fact that they are stored in a tokenized format that is not directly readable. So I did some research on the format and created a converter...
Format of a BBC BASIC file
First, let's have a look at the structure of the lines in a BBC BASIC file. Each line looks like this:
+------+--------+--------+--------+----------------------------------+
| 0x0D | lno hi | lno lo | length | line data (text and tokens)....  |
+------+--------+--------+--------+----------------------------------+
- The first byte of each line is always 0x0D
- The second and third bytes are the MSB and LSB of the line number
- The fourth byte is the length of the whole line, including the four-byte preamble
- The rest of the line is the line data consisting of text and tokenized BASIC statements
- A sequence of 0x0D 0xFF at the beginning of a line marks the end of the file
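The line layout above can be sketched in C roughly as follows. This is only an illustration, not the code of my converter: the `bb_line` struct and the `bb_read_line` function are hypothetical names, and the sketch assumes the whole file has already been loaded into a memory buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record describing one line; the names are illustrative. */
typedef struct {
    unsigned       line_number;  /* bytes 2 and 3, MSB first */
    size_t         length;       /* byte 4: line length including the preamble */
    const uint8_t *data;         /* first byte after the four-byte preamble */
    size_t         data_length;  /* length minus the four-byte preamble */
} bb_line;

/* Reads the line starting at offset pos in buf.
   Returns 1 if a line was read, 0 on the end-of-file marker (0x0D 0xFF),
   and -1 if the data does not look like a valid line. */
int bb_read_line(const uint8_t *buf, size_t size, size_t pos, bb_line *out)
{
    if (pos >= size || size - pos < 2 || buf[pos] != 0x0D)
        return -1;
    if (buf[pos + 1] == 0xFF)
        return 0;                       /* end of file */
    if (size - pos < 4)
        return -1;

    size_t len = buf[pos + 3];
    if (len < 4 || len > size - pos)
        return -1;                      /* length must cover the preamble */

    out->line_number = ((unsigned)buf[pos + 1] << 8) | buf[pos + 2];
    out->length      = len;
    out->data        = buf + pos + 4;
    out->data_length = len - 4;
    return 1;
}
```

To walk through a file, one would call this in a loop and advance `pos` by `length` after every line until the function returns 0.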
Detokenizing the line data is pretty straightforward: tokens are either one or two bytes long and are in the range 0x7F..0xFF. Everything in the range 0x20..0x7E is treated as normal text.
- 0xC6 0xNN is a two-byte function token
- 0xC7 0xNN is a two-byte command token
- 0xC8 0xNN is a two-byte statement token
- The second byte in a two-byte token is always in the range
- 0x8D introduces a line reference (see below)
- Everything else in the range 0x7F..0xFF is a one-byte token
The list of tokens is included in Appendix B of the BBC BASIC Reference Manual, so it is quite easy to create a decoding table.
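A minimal sketch of such a decoding step in C could look like this. It is not the code of my converter: `bb_keyword` and `bb_detokenize` are hypothetical names, and the table holds only four one-byte tokens as an excerpt, so a real table would have to cover the complete list from the manual. Two-byte tokens, line references and unknown bytes are merely printed as bracketed placeholders here.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Tiny excerpt of the one-byte token table (illustrative only). */
static const char *bb_keyword(uint8_t tok)
{
    switch (tok) {
    case 0xE4: return "GOSUB";
    case 0xE5: return "GOTO";
    case 0xF1: return "PRINT";
    case 0xF4: return "REM";
    default:   return NULL;   /* not in this excerpt */
    }
}

/* Expands n bytes of line data into out (NUL-terminated, truncated to cap).
   Returns the number of characters written, excluding the NUL. */
size_t bb_detokenize(const uint8_t *data, size_t n, char *out, size_t cap)
{
    size_t w = 0;
    if (cap == 0)
        return 0;
    out[0] = '\0';

    for (size_t i = 0; i < n && w + 1 < cap; i++) {
        uint8_t c = data[i];
        int k;

        if (c >= 0x20 && c <= 0x7E) {
            k = snprintf(out + w, cap - w, "%c", c);          /* plain text */
        } else if ((c == 0xC6 || c == 0xC7 || c == 0xC8) && i + 1 < n) {
            /* two-byte token: placeholder instead of a table lookup */
            k = snprintf(out + w, cap - w, "[%02X %02X]", c, data[i + 1]);
            i += 1;
        } else if (c == 0x8D && i + 3 < n) {
            /* line reference: skip its three bytes, decoding not shown here */
            k = snprintf(out + w, cap - w, "[lineref]");
            i += 3;
        } else if (bb_keyword(c) != NULL) {
            k = snprintf(out + w, cap - w, "%s", bb_keyword(c));
        } else {
            k = snprintf(out + w, cap - w, "[%02X]", c);      /* unknown byte */
        }

        if (k < 0)
            break;
        /* snprintf reports the untruncated length; clamp to what fits */
        w += ((size_t)k < cap - w) ? (size_t)k : cap - w - 1;
    }
    return w;
}
```

Using a `switch` keeps the excerpt short. For the full token set, a 129-entry array indexed by `tok - 0x7F` would probably be the more natural choice.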
Line references, such as those used in GOTO nnnnn or GOSUB nnnnn statements, are stored in an internal format. The sequence starts with 0x8D, followed by three bytes:
[0x8D] [b0] [b1] [b2]
The line number in a line reference is calculated as follows:
lineno = ((b2 EOR (b0 * 16)) * 256 + (b1 EOR ((b0 * 4) AND 0xC0))) AND 0xFFFF;
Actually, the internal format of line references was not so easy to figure out. I found the decoding algorithm in the RISC OS source code (in the file "s.bastxt"), but I also learned that there is a document on J.G.Harston's website describing it.
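Written as a C function, the formula above might look like the following sketch. The function name `bb_decode_line_ref` is made up for this example; `b0`, `b1` and `b2` are the three bytes that follow 0x8D.

```c
#include <stdint.h>

/* Decodes the three bytes following 0x8D into a line number,
   using the formula given in the text. */
unsigned bb_decode_line_ref(uint8_t b0, uint8_t b1, uint8_t b2)
{
    /* high byte: b2 EOR (b0 * 16); excess bits are masked off below */
    unsigned hi = b2 ^ ((unsigned)b0 * 16u);
    /* low byte: b1 EOR ((b0 * 4) AND 0xC0) */
    unsigned lo = b1 ^ (((unsigned)b0 * 4u) & 0xC0u);

    return (hi * 256u + lo) & 0xFFFFu;
}
```

Working through the formula by hand, the bytes 0x54 0x4A 0x40 give line 10, and 0x64 0x68 0x43 give line 1000.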
Putting it together
After having researched the information above, I created a small program that takes a tokenized BBC BASIC file as its input and translates it into plain text. The program is implemented in C and you can download the source code here.
Although there is no practical use for my old BBC BASIC programs these days, I find it interesting (and sometimes inspiring) to have a look at the stuff I created almost 30 years ago. To achieve this goal, it would most certainly have been easier to just use an emulator, but researching an old file format and building a tool to read it are the most fun in digital archeology.
Tags: ancient, riscos