6502 Second Processor programming

A glorious multicore future, long prophesied, is upon us. Can your Tube-equipped BBC Micro participate? And if it can… how? Read on!

The Tube

The Tube is a data link connecting two systems: the host (or I/O) system (an ordinary 2MHz 65x2 BBC Micro with a keyboard and so on), and the parasite system (for the purposes of this document, a 3+MHz 65x2 CPU with some memory).

From a software point of view, this link appears as 4 double-ended FIFOs, numbered 1 to 4 inclusive. Writing to the write port of a FIFO on one system causes the byte to appear in the read port of the other. Each FIFO has status ports that can be polled to find out whether the FIFO is not empty (for reading) or not full (for writing); on the parasite side, certain FIFOs can cause an IRQ or NMI when a byte arrives.
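As a rough illustration, here is a Python model of one direction of such a FIFO. The status bit positions match the polling code in the listings later on (bit 7 tested with BPL for "data available", bit 6 tested with BVC for "not full"); the class name and the `capacity` parameter are made up for this sketch and don't describe the real hardware's buffer sizes.

```python
from collections import deque

DATA_AVAILABLE = 0x80  # status bit 7: polled with BIT/BPL before reading
NOT_FULL = 0x40        # status bit 6: polled with BIT/BVC before writing


class TubeFifoModel:
    """One direction of a Tube FIFO, as seen by software polling it."""

    def __init__(self, capacity=1):
        # capacity is an illustrative assumption, not a hardware figure
        self.capacity = capacity
        self.queue = deque()

    def writer_status(self):
        """Status seen by the writing side: bit 6 set means not full."""
        return NOT_FULL if len(self.queue) < self.capacity else 0

    def reader_status(self):
        """Status seen by the reading side: bit 7 set means a byte is waiting."""
        return DATA_AVAILABLE if self.queue else 0

    def write(self, byte):
        if not self.writer_status() & NOT_FULL:
            raise BufferError("FIFO full: real code would keep polling")
        self.queue.append(byte & 0xFF)

    def read(self):
        if not self.reader_status() & DATA_AVAILABLE:
            raise BufferError("FIFO empty: real code would keep polling")
        return self.queue.popleft()
```

Real code spins on the status port rather than raising an error, of course; the exceptions just make the two conditions explicit.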

There's a control port for setting up the overall system too.

But as it turns out, you don't actually need to worry about any of this too much!

What goes on when the Tube is on?

From the point of view of (Tube-friendly!) code running in the parasite, the system looks very similar to the host. You can call the usual OS routines - OSWRCH, OSRDCH, OSBYTE, and so on - and they appear to do the same stuff.

What's happening here is that the input values in each case are packaged up and sent over the Tube to the host. The host then performs the operation and sends the results back. And the parasite picks the results up, and returns them to the caller appropriately.

By way of example, here's the parasite-side code for OSRDCH:

(These disassembly snippets come from JGH's page.)

OSRDCH:
		LDA #&00        ;Command &00 = OSRDCH
		JSR PSendCommand
PWaitCarryChar:
		JSR PWaitByte
		ASL A           ;Get C flag result
PWaitByte:
		BIT TubeS2
		BPL PWaitByte
		LDA TubeR2      ;Get char result
PNullReturn:
		RTS

PSendCommand:
		; Wait for Tube R2 free
PSendByte:
		BIT TubeS2
		BVC PSendByte
		STA TubeR2      ;Send byte to Tube R2
		RTS

So this sends a command representing OSRDCH through FIFO 2. Then it waits for a response from the host, and reads two bytes: the first is used to set the carry flag (set if there was an error condition), and the second is used to set A (the value of the key pressed). These are just the usual OSRDCH return values.

Here's the parasite-side code for OSWRCH:

OSWRCH:
		; Wait for FIFO 1 available
		BIT TubeS1
		NOP
		BVC OSWRCH
		STA TubeR1      ;Send char through FIFO 1
		RTS

Similar, but it uses a different FIFO, is slightly tighter code (because the polling loop is inlined), and doesn't wait for a result (because there won't be one).

Of course, for this to work, the host has to be listening. Ordinarily, it wouldn't be, but when the Tube is active, it is. Instead of running a language ROM, or a user program, it runs the Tube host code: a few hundred bytes of code copied into the area usually occupied by the language ROM's workspace. (That is: zero page &00-&8F, and addresses &0400-&07FF. Since it's the parasite that's running the language ROM, these areas are of course unused.)

(This copying is done by a ROM - on the BBC B-style models, a DNFS ROM, and on the Master, the MOS. You can see the whole BBC B story in JGH's Tube host code disassembly, which also includes the relevant snippets from OS 1.20 and the DNFS ROM.)

A key part of the Tube host code is the Tube loop, a short loop that polls FIFOs 1 and 2, and services incoming requests appropriately:

HTubeLoop:
		BIT TubeS1      ;Char in FIFO 1?
		BPL HCheck2     ;If not, check FIFO 2
HDoOSWRCH:
		LDA TubeR1      ;Get char from FIFO 1
		JSR OSWRCH      ;Call OSWRCH
HCheck2:
		BIT TubeS2      ;Command in FIFO 2?
		BPL HTubeLoop   ;If not, try again
		BIT TubeS1      ;Char in FIFO 1?
		BMI HDoOSWRCH   ;If so, do an OSWRCH
		LDX TubeR2      ;Get command from R2
		STX HJump+1     ;Use as index into table at &0500
HJump:
		JMP (&0500)

Bytes coming over FIFO 1 are sent to OSWRCH (and evidently low OSWRCH latency is prioritized), and bytes coming over FIFO 2 are treated as commands. Each command byte is used to look up into a table of routines, roughly (but not quite) one per OS entry point that parasite code might use. The routine for the case where the command is OSRDCH is as follows:
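To make the priority order concrete, here is a simplified Python model of the loop. It assumes both queues are already filled and nothing new arrives while it runs, which is not how the real system behaves; `service_pending` is a name invented for this sketch.

```python
from collections import deque


def service_pending(fifo1, fifo2):
    """Return the order in which pending bytes would be serviced.

    Mirrors the structure of HTubeLoop: a command in FIFO 2 is only
    dispatched once FIFO 1 has been checked and found empty.
    """
    order = []
    while fifo1 or fifo2:
        if fifo1:
            # Characters always win: pass the byte to OSWRCH first
            order.append(("OSWRCH", fifo1.popleft()))
            continue
        # FIFO 1 is empty, so the command byte can be dispatched
        order.append(("command", fifo2.popleft()))
    return order
```

With two characters and one command pending, the model services both characters before the command.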

HDoOSRDCH:
		JSR OSRDCH      ;Do the OSRDCH call
		ROR A           ;Get carry in b7
		JSR HSend2      ;Send through FIFO 2
		ROL A           ;Restore A
		JMP HSend2ThenIdle ;Send through FIFO 2 then go back
				   ;to the tube loop

HSend2:
		BIT TubeS2      ;FIFO 2 available?
		BVC HSend2      ;If not, loop
		STA TubeR2      ;Send through FIFO 2
		RTS

HSend2ThenIdle:
		BIT TubeS2      ;FIFO 2 available?
		BVC HSend2ThenIdle ;If not, loop
		STA TubeR2      ;Send through FIFO2
		JMP HTubeLoop   ;Go back to the Tube loop

It should be fairly clear how this maps to the code in the parasite-side OSRDCH: the two bytes sent here are the two bytes read there. This approach extends straightforwardly to the other OS routines as well, at least in principle. (In practice, bulk transfer operations such as OSFILE or OSGBPB are implemented slightly differently.)
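If the ROR/ROL/ASL shuffling isn't obvious, this small Python model may help. It follows the two listings above: the host rotates the carry into bit 7 of the first byte, and the parasite shifts it back out. The function names are invented for this sketch.

```python
def ror(a, carry):
    """6502 ROR A: returns (new_a, new_carry)."""
    return ((carry << 7) | (a >> 1)) & 0xFF, a & 1


def rol(a, carry):
    """6502 ROL A: returns (new_a, new_carry)."""
    return ((a << 1) | carry) & 0xFF, (a >> 7) & 1


def asl(a):
    """6502 ASL A: returns (new_a, new_carry)."""
    return (a << 1) & 0xFF, (a >> 7) & 1


def host_reply(char, carry):
    """The two bytes HDoOSRDCH sends through FIFO 2 after OSRDCH returns."""
    first, c = ror(char, carry)   # carry now sits in bit 7
    second, _ = rol(first, c)     # original character restored
    return [first, second]


def parasite_receive(reply):
    """What the parasite-side OSRDCH hands back: (A, carry)."""
    _, carry = asl(reply[0])      # bit 7 of the first byte becomes carry
    return reply[1], carry
```

Every combination of character and carry survives the round trip, which is all the protocol needs.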

How do you run your own code?

With the Tube active, the parasite processor is of course under your control, and you can run your own code in the usual fashion. But what about getting the host to do something? Perhaps you want to do something specific such as poke I/O or screen memory, under control of the parasite, without the per-call overhead of one of the OSBYTE or OSWORD calls such as OSBYTE 150.

The way to do this is to write some code to run on the host, that hooks some OS vectors and then returns (leaving itself resident in the usual fashion). Acorn recommends assembling the code to run at &2100, so do that.

With BASIC II or HIBASIC you can do the assembling on the parasite and use *SAVE to create the file. Getting the code to then run on the host is pretty straightforward: ensure that the top 16 bits of the intended load and execution addresses are all set.

Then *RUN from the parasite, via OSCLI, and you're set! Once OSCLI finishes, your hooks are in place, and your host-side code can run under control of the parasite.

(Using a utility sideways ROM would also be an option. They always execute in the host system, and are of course given a chance on initialization to override any vectors.)
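The address trick amounts to ORing the 16-bit host address with &FFFF0000. Here's a Python sketch of that, plus a helper that builds a *SAVE string in the usual order (start, end, execution address, reload address). Both function names are invented for illustration, and the start/end values in the usage example are hypothetical.

```python
def host_address(addr):
    """Set the top 16 bits of a 32-bit address, marking it as a host address."""
    if not 0 <= addr <= 0xFFFF:
        raise ValueError("expected a 16-bit address")
    return 0xFFFF0000 | addr


def save_command(name, start, end, exec_addr, load_addr):
    """Build a *SAVE command: name, start, end, execution and reload addresses."""
    return "SAVE {} {:X} {:X} {:X} {:X}".format(
        name, start, end, host_address(exec_addr), host_address(load_addr))
```

For example, `save_command("H.HOOK", 0x3000, 0x3020, 0x2100, 0x2100)` produces a command whose last two fields are both FFFF2100.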

Running your own code - an example

Here's an example, written to be assembled on the parasite, that hooks OSWRCH, and fiddles with screen RAM every time CHR$255 is printed:

10REM>S.HOOK
20DEST=&2100:DIM BUF 100
30WRCHV=&20E
40FOR I%=4 TO 7 STEP 3
50O%=BUF:P%=DEST
60[OPT I%
70.START
80LDA WRCHV+0:STA DOOSWRCH+1
90LDA WRCHV+1:STA DOOSWRCH+2
100LDA #HOOK MOD256:STA WRCHV+0
110LDA #HOOK DIV256:STA WRCHV+1
120RTS
130:
140.HOOK
150CMP #255:BEQ DOHOOK
160.DOOSWRCH JMP &FFFF
170.DOHOOK
180INC &7C00
190RTS
200]
210NEXT I%
220:
230X$="SAVE H.HOOK "+STR$~BUF+" "+STR$~O%+" "+STR$~(&FFFF0000 OR START)+" "+STR$~(&FFFF0000 OR DEST)
240PRINTX$
250OSCLIX$

And here's the test code for it, again for running on the parasite, though this time from BASIC:

10REM>T.HOOK
20MODE 7
30VDU 28,0,24,39,2
40*RUN H.HOOK
50REPEAT
60VDU 255
70*FX 19
80UNTIL FALSE

Run S.HOOK to create H.HOOK, which will execute in the host when run with *RUN. Then run T.HOOK to see it in action! When T.HOOK runs, you'll see the character at the top left of the screen change. This is done by the INC &7C00 running in the OSWRCH hook on the host processor, which runs each time the parasite prints out CHR$ 255.

The protocol is fixed

You can use this sort of approach to send commands from the parasite to the host and get data back. You just need to pick the right OS routine to hook, according to the desired flow and quantity of data.

(In theory you could invent your own protocols, giving yourself pretty much free rein, by having your host code overwrite the Tube loop, and then overwriting the Tube OS in the parasite. This would be a bit of a pain to arrange, though, and probably not of enough benefit to be worth the bother.)

You could hook any OS routine you fancy (that's supported over the Tube), but good routines are probably OSWRCH, OSBYTE and OSWORD. They're pretty easy to hook. From the point of view of your hook code that's running on the host, the following notes apply.

`OSWRCH`

Your OSWRCH routine receives the A register from the parasite. The return value is ignored.

OSWRCH is asynchronous, and the parasite doesn't wait for the call to complete before continuing execution. This gets you a bit of free parallelism. (Discussed below.)

`OSBYTE`

When A < &80, your OSBYTE routine receives the A and X registers from the parasite, and the X register is returned to the parasite; otherwise, it receives A, X and Y from the parasite, and X, Y and the carry flag are returned. (Don't forget to preserve the accumulator; even though it's not sent back to the parasite, the Tube host code relies on its value not changing!)

There are four OSBYTE calls that work slightly unusually.

Three never reach the host, so hooking them is no use: &82 - Read machine high order address (it always returns &0000); &83 - Read OSHWM (always &0800); &84 - Read HIMEM (depends on ROM).

One is asynchronous like OSWRCH: &9D - Fast Tube BPUT. Like OSWRCH, the parasite doesn't wait for the call to complete before continuing execution. This gets you a bit of free parallelism. (Discussed below.)
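The OSBYTE rules above boil down to a small decision table. This Python sketch encodes them as described in this section; the function name and the shape of the return value are invented for illustration.

```python
def osbyte_tube_traffic(a):
    """Return (sent, returned) register names for OSBYTE with accumulator a.

    'sent' is what the host-side hook receives from the parasite;
    'returned' is what goes back to the parasite afterwards.
    """
    if not 0 <= a <= 0xFF:
        raise ValueError("A must be a byte")
    if a in (0x82, 0x83, 0x84):
        # Answered on the parasite: nothing crosses the Tube
        return (), ()
    if a == 0x9D:
        # Fast Tube BPUT: sent like any other high call, but no reply
        return ("A", "X", "Y"), ()
    if a < 0x80:
        return ("A", "X"), ("X",)
    return ("A", "X", "Y"), ("X", "Y", "carry")
```

So a hook on OSBYTE 70 sees A and X and can hand a new X back, while a hook on OSBYTE 157 sees A, X and Y but has nobody waiting for its result.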

`OSWORD`

Since OSWORD has a parameter block in memory, some amount of data might have to be copied across the Tube into host memory before the call is made. Then once the call is finished, some data may need to be copied back, replacing the original parameter block in parasite memory.

The default set of OSWORD calls (A <= &14) have parameter blocks of known size. The Tube OS sends the right number of bytes to the host, and the Tube loop sends the right number of bytes back to overwrite the appropriate part of the parameter block. (OSWORD 0 - Read line - is special-cased, being more akin to something like OSFILE. Unlike most OS calls it handles, the Tube host code makes assumptions about the contents of the parameter block and what they mean.)

Other OSWORD calls, being handled by ROMs or other non-OS code, have parameter blocks of no defined size. So these calls have been split into two groups: one that sends and receives an arbitrary fixed number of bytes (that's probably large enough for most purposes), and one that sends and receives a caller-defined number of bytes.

For non-standard OSWORD calls where A < &80, a fixed number of bytes are sent and received: 16 bytes of the parameter block are sent to the host, and are replaced with the 16 bytes that result after the call.

For OSWORD calls where A >= &80 (which are all non-standard), the first 2 bytes of the parameter block indicate how many bytes should be sent and received. The first byte indicates how many bytes should be sent from parasite to host before the call, and the second byte indicates how many should be returned after the call.

Both values include the counts themselves, and the maximum size is 128 bytes.

(These two bytes are included in the parameter block that OSWORD on the host sees, but at that point they'll already have been read - if they're overwritten, the correct amount of data is still returned to the parasite, including the two overwritten count bytes.)

Note that the parameter block is copied to the host system's stack page, starting at &0128 and finishing (potentially) at &01A8.
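Here's a Python sketch of the size rules for the non-standard calls, following the description above. The per-call sizes for the standard calls live in tables that aren't reproduced in this article, so the sketch deliberately refuses to guess at them. The function name is invented for illustration.

```python
def osword_transfer_sizes(a, block):
    """Return (bytes_sent_to_host, bytes_returned) for a non-standard OSWORD."""
    if a <= 0x14:
        # Standard calls use fixed per-call sizes, not modelled here
        raise NotImplementedError("standard OSWORD sizes are not modelled")
    if a < 0x80:
        # Fixed-size group: 16 bytes each way
        return 16, 16
    # Caller-defined group: the counts are the first two bytes of the
    # block, and each count includes the two count bytes themselves
    send, receive = block[0], block[1]
    if send > 128 or receive > 128:
        raise ValueError("maximum parameter block size is 128 bytes")
    return send, receive
```

For example, with A=&90 and a parameter block starting 5, 3, five bytes go to the host and three come back.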

Parallelism

The default Tube setup doesn't really get you much in the way of parallelism. The average OS routine involves the parasite sending a request to the host, then waiting for a result back. While the host is working, the parasite is idle! And since the host spends a lot of its time running the Tube loop, while the parasite is busy, the host is idle. This isn't really ideal.

As it happens, though, there are two exceptions to the above rule: OSWRCH, and OSBYTE 157, the (somewhat cryptically named) Fast Tube BPUT. The parasite OS routines for these two don't wait for an answer after sending the request to the host; they just return immediately. So, with the right hooks installed, you can call your routine from the parasite, have it start your code on the host, and then have control return to the caller on the parasite while the host is still running your code.

A lot of the time, you don't really need to pay much attention to this to get some benefit. For example, if your OSWRCH hook just does something that's very much along the lines of OSWRCH (i.e., put something on screen, without usually spending all that much time doing it), you can just let it do that, and you get a bit of free parallelism. (And this is, of course, why OSWRCH is set up this way in the first place.)

But if you're trying to do some more time-consuming jobs on both CPUs, you might want to be more explicit.

Parallelism - an example

Here's some code that performs a lengthy operation on the host when the Fast Tube BPUT OSBYTE call is made. This potentially leaves the caller on the parasite time to do something else while it runs.

(The lengthy operation here isn't terribly useful: it's just a delay of about 256 TIME ticks. And while this is going on, the host also does a repeated INC &7C00, so you can see when it's busy and when it's finished.)

10REM>S.PAR
20:
30BYTEV=&20A
40:
50REM HOST SIDE PART
60:
70DEST%=&2100:DIM BUF% 1000
80FOR I%=4 TO 6 STEP 2
90P%=DEST%:O%=BUF%
100[OPTI%
110:
120.START
130LDA BYTEV+0:STA DOOSBYTE+1
140LDA BYTEV+1:STA DOOSBYTE+2
150LDA #OSBYTEHOOK MOD256:STA BYTEV+0
160LDA #OSBYTEHOOK DIV256:STA BYTEV+1
170:
180RTS
190:
200.OSBYTEHOOK
210CMP #70:BEQ UNINSTALL
220CMP #157:BEQ LONGOP
230.DOOSBYTE JMP &FFFF
240:
250.UNINSTALL
260LDA DOOSBYTE+1:STA BYTEV+0
270LDA DOOSBYTE+2:STA BYTEV+1
280RTS
290:
300.LONGOP
310PHA
320JSR DELAY
330PLA
340RTS
350:
360.DELAY
370LDX #CLKWVAL MOD256:LDY #CLKWVAL DIV256:LDA #2:JSR &FFF1
380.DELAYLP
390LDX #CLKRVAL MOD256:LDY #CLKRVAL DIV256:LDA #1:JSR &FFF1
400LDA #19:JSR &FFF4
410INC &7C00
420LDA CLKRVAL+1:BEQ DELAYLP
430RTS
440:
450.CLKWVAL EQUD 0:BRK
460.CLKRVAL EQUD 0:BRK
470]
480NEXT
490:
500X$="SAVE H.PAR "+STR$~BUF%+" "+STR$~O%+" "+STR$~(&FFFF0000 OR START)+" "+STR$~(&FFFF0000 OR DEST%)
510PRINT X$
520OSCLI X$

And here's some test code that works it.

10REM>T.PAR
20MODE7
30:
40*RUN H.PAR
50:
60PRINT'"CALIBRATING..."
70TIME=0
80N%=0:REPEAT N%=N%+1:UNTILTIME>=256
90PRINT"APPROX ";N%;" ITERATIONS"
100:
110PRINT'"STARTING LENGTHY OPERATION..."
120A%=157:CALL &FFF4
130:
140T%=0:REPEAT T%=T%+1:UNTIL T%>N%
150:
160PRINT ;T%;" TIMES."
170:
180A%=70:CALL &FFF4

While the host is busy doing its INC &7C00, this sets the parasite incrementing a variable. It first figures out roughly how many iterations it can do in the time the host will be busy for, then (while the host is busy) iterates that many times. (This way you can see by eye they're genuinely running in parallel, even if the host finishes first.)

You have to be a bit careful when doing this. When the host is busy, it isn't running the Tube loop; and when it isn't running the Tube loop, it can't field requests from the parasite. The parasite OS routines will just hang until the host is ready again. (BASIC code is generally pretty safe, but watch out for implicit OS calls, such as TIME! You're safest with assembly language.)

So this particular demo shows one way of handling that: guess roughly how much work the parasite will be able to do while the host is busy, then do that much work. You just have to hope that this doesn't leave too much dead time on either end.
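To put some numbers on "dead time", here's a trivial Python model comparing the two jobs run back to back with the same jobs run side by side. The function name and the tick values in the example are made up for illustration.

```python
def overlap_summary(host_ticks, parasite_ticks):
    """Return (sequential, parallel, dead_time) for two jobs, in TIME ticks."""
    sequential = host_ticks + parasite_ticks       # one after the other
    parallel = max(host_ticks, parasite_ticks)     # both started together
    dead_time = abs(host_ticks - parasite_ticks)   # whoever finishes first idles
    return sequential, parallel, dead_time
```

If the host job takes 256 ticks and the parasite's guess turns out to be worth 200, the pair finishes in 256 ticks rather than 456, with the parasite idle for the last 56.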

Fortunately, you can do better…

Poking the parasite - a host busy flag

In some situations, there's not much you can do when the host is busy except wait. But for some tasks you might be able to have the parasite keep working, buffering up any results and perhaps sending them across to the host when they're ready.

You can do this by maintaining a flag in parasite memory that indicates whether the host is busy. A reasonable way of doing this is to have a flag in zero page and test its bit 7: you can check whether bit 7 is set using BIT and BMI, which takes only 5 cycles if bit 7 is clear and there's nothing to be done. (BBS would be another option.)

You'll then need some code on the host to set the parasite's busy flag remotely, using one of the bulk transfer operations. The technique is described in the Tube Application Note, p7 - unfortunately this requires poking the Tube hardware directly, but it's pretty easy! The only thing needed is a Tube claimant ID: a 6-bit number uniquely identifying your program. JGH's list of known Tube claimant IDs will show you which values are free.

The steps involved are quite straightforward:

1. Loop until your code has claimed the Tube hardware, by calling the Tube host code entry point at &406;
2. Set up the 32-bit transfer destination address, pointing at the value in parasite memory, again by calling the Tube host code entry point at &406;
3. Store the value you want to write in &FEE5 (Tube FIFO 3) - the Tube host code has set this up for you already;
4. Unclaim the Tube hardware, again via the routine at &406.

(You can also write blocks of data by repeatedly storing to &FEE5 - the Tube Application Note has the full details on this, and on the various other transfer modes. Internally, these transfers use NMIs on the parasite, which runs a bit of code that reads or writes one byte; the host has no direct access to parasite memory.)
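The accumulator values involved are simple to work out. This Python sketch lists the sequence of operations for writing one byte, using the same values as the host-side listing below (&C0+ID to claim, 1 for the transfer type, &80+ID to release). The function name and the tuple layout are invented for illustration.

```python
TUBE_ENTRY = 0x0406   # Tube host code entry point
TUBE_R3_DATA = 0xFEE5  # Tube FIFO 3 data port on the host


def single_byte_write_plan(claimant_id, value):
    """Describe the host-side steps for poking one byte into the parasite."""
    if not 0 <= claimant_id <= 0x3F:
        raise ValueError("claimant ID must fit in 6 bits")
    return [
        # Repeat the claim until the call returns with carry set
        ("call", TUBE_ENTRY, 0xC0 + claimant_id),
        # A=1 starts the transfer; X/Y point at the 4-byte address block
        ("call", TUBE_ENTRY, 0x01),
        ("store", TUBE_R3_DATA, value & 0xFF),
        ("call", TUBE_ENTRY, 0x80 + claimant_id),
    ]
```

With claimant ID 23, the claim call uses A=&D7 and the release call uses A=&97.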

To set this up, you'll need your host code to expose an OSBYTE call that sets the busy flag address. Assume it's in zero page, so the address is just one byte, meaning only one OSBYTE call is required to supply the address. The host code for this new OSBYTE call is pretty simple: it just stores the X register in memory, in byte 0 of a 4-byte buffer that holds the address of the busy flag in parasite memory, as required by the Tube host code that's being called in step 2.

So, here's some improved host-side code that will do this. It's somewhat similar to the previous example, except that you use OSBYTE 70 to specify the (zero page) address of a flag in parasite memory. The host will set it to zero once it's finished its Fast Tube BPUT routine.

10REM>S.POLLPAR
20:
30CLAIMANT=23
40:
50BYTEV=&20A:WRCHV=&20E
60:
70REM HOST SIDE PART
80:
90DEST%=&2100:DIM BUF% 1000
100FOR I%=4 TO 6 STEP 2
110P%=DEST%:O%=BUF%
120[OPTI%
130:
140.START
150LDA BYTEV+0:STA DOOSBYTE+1
160LDA BYTEV+1:STA DOOSBYTE+2
170LDA #OSBYTEHOOK MOD256:STA BYTEV+0
180LDA #OSBYTEHOOK DIV256:STA BYTEV+1
190:
200RTS
210:
220.OSBYTEHOOK
230CMP #70:BEQ SETADDRESS
240CMP #71:BEQ UNINSTALL
250CMP #157:BEQ LONGOP
260.DOOSBYTE JMP &FFFF
270:
280.UNINSTALL
290LDA DOOSBYTE+1:STA BYTEV+0
300LDA DOOSBYTE+2:STA BYTEV+1
310RTS
320:
330.SETADDRESS
340STX ADDRESS+0:RTS
350:
360.LONGOP
370PHA
380LDA ADDRESS:BEQ LONGOPDONE
390JSR DELAY
400JSR RESETBUSY
410.LONGOPDONE
420PLA:RTS
430:
440.DELAY
450LDX #CLKWVAL MOD256:LDY #CLKWVAL DIV256:LDA #2:JSR &FFF1
460.DELAYLP
470LDX #CLKRVAL MOD256:LDY #CLKRVAL DIV256:LDA #1:JSR &FFF1
480LDA #19:JSR &FFF4
490INC &7C00
500LDA CLKRVAL+0:CMP#50:BCC DELAYLP
510RTS
520:
530.RESETBUSY
540.CLAIM LDA #&C0+CLAIMANT:JSR &406:BCC CLAIM
550LDX #ADDRESS MOD256:LDY #ADDRESS DIV256:LDA #1:JSR &406
560JSR DELAY24
570LDA #0:STA &FEE5
580JSR DELAY24
590LDA #&80+CLAIMANT:JSR &406
600RTS
610:
620.DELAY24
630LDX #5:.DELAY24LP DEX:BPL DELAY24LP:RTS
640:
650.ADDRESS EQUD 0
660.CLKWVAL EQUD 0:BRK
670.CLKRVAL EQUD 0:BRK
680]
690NEXT
700:
710X$="SAVE H.POLLPAR "+STR$~BUF%+" "+STR$~O%+" "+STR$~(&FFFF0000 OR START)+" "+STR$~(&FFFF0000 OR DEST%)
720PRINT X$
730OSCLI X$

And here's some test code. Unlike the last version, it doesn't need to guess how much work will fit in one unit; it just keeps running, and polls the flag to see when the host is ready.

10REM>T.POLLPAR
20MODE7
30:
40FLAG=&70
50:
60*RUN H.POLLPAR
70:
80A%=70:X%=FLAG:CALL &FFF4
90?FLAG=255
100:
110PRINT'"STARTING LENGTHY OPERATION..."
120A%=157:CALL &FFF4
130:
140T%=0:REPEAT T%=T%+1:UNTIL ?FLAG=0
150:
160PRINT ;T%;" TIMES."
170:
180A%=71:CALL &FFF4

And now you know…

There wasn't much in the way of Tube-enhanced software in the BBC Micro's heyday. That's hardly surprising, since so few second processors were sold, but it's a shame, because making use of the Tube is actually fairly straightforward.

May somebody read this and be inspired!

Notes/Resources

Host system memory notes

The address &2100 appeared above. This crops up because when the Tube is active, the character set is fully exploded, pushing OSHWM 6 pages higher than normal. Assuming you're using DFS, that means OSHWM is &1900+&0600=&1F00, and in MODEs 0, 1 and 2 there's all of 4K of memory left. And if you've got ADFS installed, I'd be amazed if there's anything left at all.

(I suggested &2100 specifically, because that's what the Tube Application Note recommends - but its calculation doesn't sound quite right. On the other hand, another 2 pages for safety can't hurt much more.)
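The arithmetic can be checked with a few lines of Python. The DFS figure and the 6 extra pages come from the text above; the &3000 screen start for MODEs 0, 1 and 2 is the standard figure for the 20K modes. The function name is invented for illustration.

```python
PAGE = 0x100


def host_memory_summary(base_oshwm=0x1900, exploded_pages=6,
                        screen_start=0x3000, load_address=0x2100):
    """Return (oshwm, bytes free below the screen, pages of safety margin)."""
    oshwm = base_oshwm + exploded_pages * PAGE
    free_bytes = screen_start - oshwm
    margin_pages = (load_address - oshwm) // PAGE
    return oshwm, free_bytes, margin_pages
```

With the defaults, this gives an OSHWM of &1F00, &1100 bytes (a little over 4K) below the screen, and a 2-page gap between OSHWM and &2100.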

You can use OSBYTE 20 (&14) on startup to implode the character definitions and get some memory back.

Products that use the Tube in an interesting way

The famous Elite has a special Second Processor version, Elite Executive Edition, that runs on the parasite and uses the host to draw the lines. It's a fair bit faster than the normal one, runs in colour, and apparently displays more ships at once and in a greater variety. It's a bit flickery, however, compared to the (double-buffered, mostly) Master 128 version.

The interesting part about this version of Elite is that the source code is available from Ian Bell's Elite page. You can find the bulk of the host code in the file P.ELITEZ on drive 0. It works along the lines described in this article, hooking OSWRCH and OSWORD.