After (re)discovering the semicolon bug in Atari BASIC revision A, I thought I’d spend a bit of time trying to find out exactly why BASIC was exhibiting this behaviour. In order to do this, I had to re-learn how BASIC stores programs in memory.
Atari BASIC uses tokenization to reduce the memory footprint and increase the execution speed of programs. Tokenization replaces keyword strings (such as PRINT) with single-byte tokens (PRINT becomes 0x20). Ultimately, this bug is caused by the tokenization process and incorrect bounds checking.
First, let’s look at a simple PRINT statement, and how Atari BASIC tokenizes it. You may want to reference De Re Atari, which has a good explanation of the tokenizing process, as well as a token table.
| 0A | 00 | 0A | 0A | 20 | 0F | 02 | 48 | 49 | 16 |
|---|---|---|---|---|---|---|---|---|---|
| line number (low) | line number (high) | llen | slen | PRINT | string | strlen | H | I | eol |
Our first example PRINT statement has a 2-byte string constant “HI”, followed by the token for end-of-line (0x16). 0x20 is the token for PRINT, and 0x0F marks a string constant, which is followed by its length byte. llen and slen are the line length and statement length.
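To make the layout concrete, here is a minimal Python sketch that decodes the bytes from the table. The token values come from the example itself; the function name, the field names and the "unknown" fallback are my own labels, and it only understands this one-statement PRINT case, not the full token table.

```python
# Token values taken from the example line; everything else is illustrative.
TOKEN_STRING = 0x0F  # a string constant follows
TOKEN_SEMI = 0x15    # ';'
TOKEN_EOL = 0x16     # end of line


def decode_print_line(raw: bytes) -> dict:
    """Decode a one-statement tokenized PRINT line into its fields."""
    line_number = raw[0] | (raw[1] << 8)  # stored low byte first
    llen, slen, statement = raw[2], raw[3], raw[4]
    items = []
    i = 5
    while i < len(raw):
        tok = raw[i]
        if tok == TOKEN_STRING:
            strlen = raw[i + 1]
            items.append(("string", bytes(raw[i + 2:i + 2 + strlen])))
            i += 2 + strlen  # skip token, length byte and the characters
        elif tok == TOKEN_SEMI:
            items.append(("semicolon", None))
            i += 1
        elif tok == TOKEN_EOL:
            items.append(("eol", None))
            i += 1
        else:
            items.append(("unknown", tok))  # not modelled in this sketch
            i += 1
    return {"line": line_number, "llen": llen, "slen": slen,
            "statement": statement, "items": items}


example1 = bytes([0x0A, 0x00, 0x0A, 0x0A, 0x20, 0x0F, 0x02, 0x48, 0x49, 0x16])
decoded = decode_print_line(example1)
print(decoded["line"], decoded["items"])
# 10 [('string', b'HI'), ('eol', None)]
```

Note that the decoder has to know a string constant is in progress in order to step over its characters. That context is what matters later on.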
| 0A | 00 | 0B | 0B | 20 | 0F | 02 | 48 | 49 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|
| line number (low) | line number (high) | llen | slen | PRINT | string | strlen | H | I | ; | eol |
Our second example adds a semicolon to the end of line. The normal behaviour for semicolon is to suppress the automatic carriage return. Note that the string is still 2 bytes long, followed by the token for the semicolon (0x15).
| 0A | 00 | 0B | 0B | 20 | 0F | 03 | 48 | 49 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|
| line number (low) | line number (high) | llen | slen | PRINT | string | strlen | H | I | ^U | eol |
Our third example now has a 3-byte string constant with no semicolon. The only difference between this example and the previous example is the string constant length.
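A quick way to confirm that claim is to compare the two tokenized lines byte by byte. This is just a Python sketch over the bytes shown in the tables; the variable names are mine.

```python
example2 = bytes([0x0A, 0x00, 0x0B, 0x0B, 0x20, 0x0F, 0x02, 0x48, 0x49, 0x15, 0x16])
example3 = bytes([0x0A, 0x00, 0x0B, 0x0B, 0x20, 0x0F, 0x03, 0x48, 0x49, 0x15, 0x16])

# Collect the offsets at which the two lines differ.
differing = [i for i, (a, b) in enumerate(zip(example2, example3)) if a != b]
print(differing)  # [6]
```

Offset 6 is the strlen byte. The 0x15 at offset 9 is identical in both lines; only the length byte decides whether it is a semicolon token or the third character of the string.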
Now that we’ve seen how the different lines are tokenized, let’s look at the BASIC source code. We need to look at the XPRINT function, which begins at 0xB3B6.
B3B6 XPRINT
B3B6 A5C9 LDA PTABW ; GET TAB VALUE
B3B8 85AF STA SCANT ; SCANT
B3BA A900 LDA #0 ; SET OUT INDEX = 0
B3BC 8594 STA COX
;
B3BE A4A8 :XPR0 LDY STINDEX ; GET STMT DISPL
B3C0 B18A LDA [STMCUR],Y ; GET TOKEN
;
B3C2 C912 CMP #CCOM
B3C4 F053 ^B419 BEQ :XPTAB ; BR IF TAB
B3C6 C916 CMP #CCR
B3C8 F07C ^B446 BEQ :XPEOL ; BR IF EOL
B3CA C914 CMP #CEOS
B3CC F078 ^B446 BEQ :XPEOL ; BR IF EOL
B3CE C915 CMP #CSC
B3D0 F06F ^B441 BEQ :XPNULL ; BR IF NULL
B3D2 C91C CMP #CPND
B3D4 F061 ^B437 BEQ :XPRIOD
;
B3D6 20E0AA JSR EXEXPR ; GO EVALUATE EXPRESSION
B3D9 20F2AB JSR ARGPOP ; POP FINAL VALUE
B3DC C6A8 DEC STINDEX ; DEC STINDEX
B3DE 24D2 BIT VTYPE ; IS THIS A STRING
B3E0 3016 ^B3F8 BMI :XPSTR ; BR IF STRING
;
B3E2 20E6D8 JSR CVFASC ; CONVERT TO ASCII
B3E5 A900 LDA #0
B3E7 85F2 STA CIX
;
B3E9 A4F2 :XPR1 LDY CIX ; OUTPUT ASCII CHARACTERS
B3EB B1F3 LDA [INBUFF],Y ; FROM INBUFF
B3ED 48 PHA ; UNTIL THE CHAR
B3EE E6F2 INC CIX ; WITH THE MSB ON
B3F0 205DB4 JSR :XPRC ; IS FOUND
B3F3 68 PLA
B3F4 10F3 ^B3E9 BPL :XPR1
B3F6 30C6 ^B3BE BMI :XPR0 ; THEN GO FOR NEXT TOKEN
B3F8 :XPSTR
B3F8 209BAB JSR GSTRAD ; GO GET ABS STRING ARRAY
B3FB A900 LDA #0
B3FD 85F2 STA CIX
B3FF A5D6 :XPR2C LDA VTYPE+EVSLEN ; IF LEN LOW
B401 D004 ^B407 BNE :XPR2B ; NOT ZERO BR
B403 C6D7 DEC VTYPE+EVSLEN+1 ; DEC LEN HI
B405 30B7 ^B3BE BMI :XPR0 ; BR IF DONE
B407 C6D6 :XPR2B DEC VTYPE+EVSLEN ; DEC LEN LOW
;
B409 A4F2 :XPR2 LDY CIX ; OUTPUT STRING CHARS
B40B B1D4 LDA [VTYPE+EVSADR],Y ; FOR THE LENGTH
B40D E6F2 INC CIX ; OF THE STRING
B40F D002 ^B413 BNE :XPR2A
B411 E6D5 INC VTYPE+EVSADR+1
B413 :XPR2A
B413 205FB4 JSR :XPRC1
B416 4CFFB3 JMP :XPR2C
;
B419 :XPTAB
B419 A494 :XPR3 LDY COX ; DO UNTIL COX+1 <SCANT
B41B C8 INY
B41C C4AF CPY SCANT
B41E 9009 ^B429 BCC :XPR4
B420 18 :XPIC3 CLC
B421 A5C9 LDA PTABW ; SCANT = SCANT+TAB
B423 65AF ADC SCANT
B425 85AF STA SCANT
B427 90F0 ^B419 BCC :XPR3
;
B429 A494 :XPR4 LDY COX ; DO UNTIL COX = SCANT
B42B C4AF CPY SCANT
B42D B012 ^B441 BCS :XPR4A
B42F A920 LDA #$20 ; PRINT BLANKS
B431 205DB4 JSR :XPRC
B434 4C29B4 JMP :XPR4
;
B437 2002BD :XPRIOD JSR GIOPRM ; GET DEVICE NO.
B43A 85B5 STA LISTDTD ; SET AS LIT DEVICE
B43C C6A8 DEC STINDEX ;DEC INDEX
B43E 4CBEB3 JMP :XPR0 ; GET NEXT TOKEN
;
B441 :XPR4A
B441 E6A8 :XPNULL INC STINDEX ; INC STINDEX
B443 4CBEB3 JMP :XPR0
;
B446 :XPEOL
B446 A4A8 :XPEOS LDY STINDEX ; AT END OF PRINT
B448 88 DEY
B449 B18A LDA [STMCUR],Y ; IF PREV CHAR WAS
B44B C915 CMP #CSC ; SEMI COLON THEN DONE
B44D F009 ^B458 BEQ :XPRTN ; ELSE PRINT A CR
B44F C912 CMP #CCOM ; OR A COMMA
B451 F005 ^B458 BEQ :XPRTN ; THEN DONE
B453 A99B LDA #CR
B455 205FB4 JSR :XPRC1 ; THEN DONE
B458 :XPRTN
B458 A900 LDA #0 ; SET PRIMARY
B45A 85B5 STA LISTDTD ; LIST DVC = 0
B45C 60 RTS ; AND RETURN
I know that’s a lot of code, but let’s follow the bouncing ball. The first key part happens at addresses 0xB3C6 and 0xB3C8: we compare the current token against the eol token (0x16) and, if it matches, we branch to XPEOL (0xB446). What’s the first thing we do at the end of line? We rewind one byte (DEY, decrement Y), and see if it’s the token for semicolon (0x15). If it is, we skip printing a carriage return.
But wait a minute. We blindly rewind one byte, even if that rewind takes us inside a string constant! There’s the bug. In the third example, the byte before the eol token is 0x15, but it is the last character of the string, not a semicolon token, and the carriage return is skipped anyway. We should not be blindly rewinding one byte: we should be checking whether we are inside or outside a string constant.
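Here is a small Python model of that end-of-line check, run against the three example lines. It is a simplified sketch, not a port of XPRINT: it assumes the line contains only string constants and separator tokens, and the function name is my own.

```python
CCOM, CSC, CCR = 0x12, 0x15, 0x16  # comma, semicolon and end-of-line tokens
STRING_CONST = 0x0F                # string-constant token from the tables


def cr_printed(line: bytes) -> bool:
    """Model XPEOL: True if PRINT would emit the trailing carriage return."""
    i = 5  # first byte after line number, llen, slen and the PRINT token
    while line[i] != CCR:
        if line[i] == STRING_CONST:
            i += 2 + line[i + 1]  # skip token, length byte and characters
        else:
            i += 1                # a separator such as ';' or ','
    prev = line[i - 1]            # the DEY / LDA [STMCUR],Y step: no context check
    return prev not in (CSC, CCOM)


hi_plain = bytes([0x0A, 0x00, 0x0A, 0x0A, 0x20, 0x0F, 0x02, 0x48, 0x49, 0x16])
hi_semi = bytes([0x0A, 0x00, 0x0B, 0x0B, 0x20, 0x0F, 0x02, 0x48, 0x49, 0x15, 0x16])
hi_ctrl_u = bytes([0x0A, 0x00, 0x0B, 0x0B, 0x20, 0x0F, 0x03, 0x48, 0x49, 0x15, 0x16])

print(cr_printed(hi_plain))   # True
print(cr_printed(hi_semi))    # False
print(cr_printed(hi_ctrl_u))  # False: the bug, no semicolon was typed
```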
Looking at the code, similar behaviour will happen with the value 0x12, which is the token for a comma.
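One way to avoid both cases would be to remember what kind of item was processed last, instead of re-reading the previous byte. The Python sketch below illustrates that idea only. I don't know how revision C actually fixes the bug, so treat `cr_printed_fixed` and its `last_was_separator` flag as hypothetical; like any simplified model, it only handles string constants and separator tokens.

```python
CCOM, CSC, CCR = 0x12, 0x15, 0x16  # comma, semicolon and end-of-line tokens
STRING_CONST = 0x0F                # string-constant token from the tables


def cr_printed_fixed(line: bytes) -> bool:
    """Hypothetical context-aware check: True if a carriage return is printed."""
    i = 5  # first byte after line number, llen, slen and the PRINT token
    last_was_separator = False
    while line[i] != CCR:
        if line[i] == STRING_CONST:
            i += 2 + line[i + 1]        # string bytes are never inspected
            last_was_separator = False
        else:
            last_was_separator = line[i] in (CSC, CCOM)
            i += 1
    return not last_was_separator


header = [0x0A, 0x00, 0x0B, 0x0B, 0x20]
semi_token = bytes(header + [0x0F, 0x02, 0x48, 0x49, 0x15, 0x16])    # "HI";
comma_token = bytes(header + [0x0F, 0x02, 0x48, 0x49, 0x12, 0x16])   # "HI",
ctrl_u_char = bytes(header + [0x0F, 0x03, 0x48, 0x49, 0x15, 0x16])   # "HI^U"
ctrl_r_char = bytes(header + [0x0F, 0x03, 0x48, 0x49, 0x12, 0x16])   # "HI" + 0x12

print(cr_printed_fixed(semi_token), cr_printed_fixed(comma_token))   # False False
print(cr_printed_fixed(ctrl_u_char), cr_printed_fixed(ctrl_r_char))  # True True
```

Real separator tokens still suppress the carriage return, while a 0x15 or 0x12 byte inside a string constant no longer does.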
This bug has been fixed in revision C BASIC, but I’m not aware of commented source code being available for revision C.