6502 Second Processor programming
A glorious multicore future, long prophesied, is upon us. Can your Tube-equipped BBC Micro participate? And if it can… how? Read on!
The Tube
The Tube is a data link connecting two systems: the host (or I/O) system (an ordinary 2MHz 65x2 BBC Micro with a keyboard and so on), and the parasite system (for the purposes of this document, a 3+MHz 65x2 CPU with some memory).
From a software point of view, this link appears as 4 double-ended FIFOs, numbered 1 to 4 inclusive. Writing to the write port of a FIFO on one system causes the byte to appear in the read port of the other. Each FIFO has status ports that can be polled to find out whether the FIFO is not empty (for reading) or not full (for writing); on the parasite side, certain FIFOs can cause an IRQ or NMI when a byte arrives.
There's a control port for setting up the overall system too.
But as it turns out, you don't actually need to worry about any of this too much!
What goes on when the Tube is on?
From the point of view of (Tube-friendly!) code running in the parasite, the system looks very similar to the host. You can call the usual OS routines - OSWRCH, OSRDCH, OSBYTE, and so on - and they appear to do the same stuff.
What's happening here is that the input values in each case are packaged up and sent over the Tube to the host. The host then performs the operation and sends the results back. And the parasite picks the results up, and returns them to the caller appropriately.
By way of example, here's the parasite-side code for OSRDCH:
(These disassembly snippets come from JGH's page.)
    OSRDCH:
        LDA #&00            ;Command &00 = OSRDCH
        JSR PSendCommand
    PWaitCarryChar:
        JSR PWaitByte
        ASL A               ;Get C flag result
    PWaitByte:
        BIT TubeS2
        BPL PWaitByte
        LDA TubeR2          ;Get char result
    PNullReturn:
        RTS

    PSendCommand:           ; Wait for Tube R2 free
    PSendByte:
        BIT TubeS2
        BVC PSendByte
        STA TubeR2          ;Send byte to Tube R2
        RTS
So this sends a command representing OSRDCH through FIFO 2. Then it waits for a response from the host, and reads two bytes: the first is used to set the carry flag (set if there was an error condition), and the second is used to set A (the value of the key pressed). These are just the usual OSRDCH return values.
Here's the parasite-side code for OSWRCH:
    OSWRCH:                 ; Wait for FIFO 1 available
        BIT TubeS1
        NOP
        BVC OSWRCH
        STA TubeR1          ;Send char through FIFO 1
        RTS
Similar, but it uses a different FIFO, is slightly tighter code (because the polling loop is inlined), and doesn't wait for a result (because there won't be one).
Of course, for this to work, the host has to be listening. Ordinarily, it wouldn't be, but when the Tube is active, it is. Instead of running a language ROM, or a user program, it runs the Tube host code: a few hundred bytes of code copied into the area usually occupied by the language ROM's workspace. (That is: zero page &00-&8F, and addresses &0400-&07FF. Since it's the parasite that's running the language ROM, these areas are of course unused.)
(This copying is done by a ROM - on the BBC B-style models, a DNFS ROM, and on the Master, the MOS. You can see the whole BBC B story in JGH's Tube host code disassembly, which also includes the relevant snippets from OS 1.20 and the DNFS ROM.)
A key part of the Tube host code is the Tube loop, a short loop that polls FIFOs 1 and 2, and services incoming requests appropriately:
    HTubeLoop:
        BIT TubeS1          ;Char in FIFO 1?
        BPL HCheck2         ;If not, check FIFO 2
    HDoOSWRCH:
        LDA TubeR1          ;Get char from FIFO 1
        JSR OSWRCH          ;Call OSWRCH
    HCheck2:
        BIT TubeS2          ;Command in FIFO 2?
        BPL HTubeLoop       ;If not, try again
        BIT TubeS1          ;Char in FIFO 1?
        BMI HDoOSWRCH       ;If so, do an OSWRCH
        LDX TubeR2          ;Get command from R2
        STX HJump+1         ;Use as index into table at &0500
    HJump:
        JMP (&0500)
Bytes coming over FIFO 1 are sent to OSWRCH (and evidently low OSWRCH latency is prioritized), and bytes coming over FIFO 2 are treated as commands. Each command byte is used to look up into a table of routines, roughly (but not quite) one per OS entry point that parasite code might use. The routine for the case where the command is OSRDCH is as follows:
    HDoOSRDCH:
        JSR OSRDCH          ;Do the OSRDCH call
        ROR A               ;Get carry in b7
        JSR HSend2          ;Send through FIFO 2
        ROL A               ;Restore A
        JMP HSend2ThenIdle  ;Send through FIFO 2 then go back
                            ;to the tube loop
    HSend2:
        BIT TubeS2          ;FIFO 2 available?
        BVC HSend2          ;If not, loop
        STA TubeR2          ;Send through FIFO 2
        RTS
    HSend2ThenIdle:
        BIT TubeS2          ;FIFO 2 available?
        BVC HSend2ThenIdle  ;If not, loop
        STA TubeR2          ;Send through FIFO2
        JMP HTubeLoop       ;Go back to the Tube loop
Hopefully it should be fairly clear how this maps to the code in the parasite-side OSRDCH. And hopefully it should also be clear that this approach would extend straightforwardly to the other OS routines as well, at least in principle. (Though in practice, bulk transfer operations such as OSFILE or OSGBPB are implemented slightly differently.)
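To make the round trip concrete, here is a minimal Python model of the OSRDCH exchange shown in the two listings above. It is only a sketch: the two queues stand in for the two directions of FIFO 2, and `host_osrdch` is a hypothetical stand-in for the real OS call that always reports the key "k" with carry clear.

```python
from collections import deque

# Hypothetical model: one byte queue per direction of FIFO 2.
to_host = deque()
to_parasite = deque()

def host_osrdch():
    """Stand-in for the host's real OSRDCH: returns (A, carry)."""
    return ord("k"), 0

def parasite_send_command():
    to_host.append(0x00)                          # command &00 = OSRDCH

def host_service_fifo2():
    command = to_host.popleft()
    assert command == 0x00                        # only OSRDCH is modelled here
    a, carry = host_osrdch()
    to_parasite.append((carry << 7) | (a >> 1))   # ROR A: carry is now in bit 7
    to_parasite.append(a)                         # ROL A has restored A

def parasite_receive():
    first = to_parasite.popleft()
    carry = first >> 7                            # ASL A moves bit 7 into carry
    a = to_parasite.popleft()                     # second byte is the character
    return a, carry

parasite_send_command()
host_service_fifo2()
result = parasite_receive()
print(result)
```

The real code polls status registers rather than using queues, but the order and meaning of the bytes is the same as in the disassembly.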
How do you run your own code?
With the Tube active, the parasite processor is of course under your control, and you can run your own code in the usual fashion. But what about getting the host to do something? Perhaps you want to do something specific such as poke I/O or screen memory, under control of the parasite, without the per-call overhead of one of the OSBYTE or OSWORD calls such as OSBYTE 150.
The way to do this is to write some code to run on the host, that hooks some OS vectors and then returns (leaving itself resident in the usual fashion). Acorn recommends assembling the code to run at &2100, so do that.
With BASIC II or HIBASIC you can do the assembling on the parasite and use *SAVE to create the file. Getting the code to then run on the host is pretty straightforward: ensure that the top 16 bits of the intended load and execution addresses are all set (that is, addresses of the form &FFFFxxxx, which refer to host memory). Then *RUN from the parasite, via OSCLI, and you're set! Once OSCLI finishes, your hooks are in place, and your host-side code can run under control of the parasite.
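As a small illustration of the address rule, the following Python sketch builds a *SAVE command string with the top 16 bits of the execution and reload addresses set. The helper names and the buffer addresses &3000 and &3020 are made up for the example; the argument order (start, end, execution, reload) is the standard *SAVE order.

```python
HOST_SPACE = 0xFFFF0000   # top 16 bits set: the address refers to host memory

def host_address(addr16):
    """Return the 32-bit form of a 16-bit host address."""
    return HOST_SPACE | (addr16 & 0xFFFF)

def save_command(name, start, end, exec_addr, reload_addr):
    """Build a *SAVE command; start and end say where the bytes sit right now."""
    fields = [start, end, host_address(exec_addr), host_address(reload_addr)]
    return "SAVE " + name + " " + " ".join(format(f, "X") for f in fields)

# Hypothetical buffer at &3000-&3020, code assembled to run at &2100.
command = save_command("H.HOOK", 0x3000, 0x3020, 0x2100, 0x2100)
print(command)
```

The start and end addresses are left as plain 16-bit values because they describe the assembly buffer, which lives in whichever processor is doing the saving.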
(Using a utility sideways ROM would also be an option. They always execute in the host system, and are of course given a chance on initialization to override any vectors.)
Running your own code - an example
Here's an example, written to be assembled on the parasite, that hooks OSWRCH, and fiddles with screen RAM every time CHR$255 is printed:
    10REM>S.HOOK
    20DEST=&2100:DIM BUF 100
    30WRCHV=&20E
    40FOR I%=4 TO 7 STEP 3
    50O%=BUF:P%=DEST
    60[OPT I%
    70.START
    80LDA WRCHV+0:STA DOOSWRCH+1 \ save old vector in the JMP operand
    90LDA WRCHV+1:STA DOOSWRCH+2
    100LDA #HOOK MOD256:STA WRCHV+0 \ point WRCHV at the hook
    110LDA #HOOK DIV256:STA WRCHV+1
    120RTS
    130:
    140.HOOK
    150CMP #255:BEQ DOHOOK
    160.DOOSWRCH JMP &FFFF \ operand patched by START
    170.DOHOOK
    180INC &7C00 \ top left character in MODE 7
    190RTS
    200]
    210NEXT I%
    220:
    230X$="SAVE H.HOOK "+STR$~BUF+" "+STR$~O%+" "+STR$~(&FFFF0000 OR START)+" "+STR$~(&FFFF0000 OR DEST)
    240PRINTX$
    250OSCLIX$
And here's the test code for it, again for running on the parasite, though this time from BASIC:
    10REM>T.HOOK
    20MODE 7
    30VDU 28,0,24,39,2
    40*RUN H.HOOK
    50REPEAT
    60VDU 255
    70*FX 19
    80UNTIL FALSE
Run S.HOOK to create H.HOOK, which will execute in the host when run with *RUN. Then run T.HOOK to see it in action! When T.HOOK runs, you'll see the character at the top left of the screen change. This is done by the INC &7C00 running in the OSWRCH hook on the host processor, which runs each time the parasite prints out CHR$255.
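The hook follows the usual "save the old vector, install your own, chain to the old one" pattern. The Python sketch below models just that pattern; the `vectors` and `memory` dictionaries and `default_oswrch` are hypothetical stand-ins for the vector table, host RAM and the OS routine.

```python
memory = {0x7C00: 0x20}          # hypothetical: the top-left cell starts as a space
printed = []

def default_oswrch(a):
    """Stand-in for the OS's own OSWRCH."""
    printed.append(a)

vectors = {"WRCHV": default_oswrch}

def install_hook():
    old = vectors["WRCHV"]               # START: remember the old vector
    def hook(a):
        if a == 255:                     # CMP #255:BEQ DOHOOK
            memory[0x7C00] = (memory[0x7C00] + 1) & 0xFF   # INC &7C00
            return
        old(a)                           # JMP to the previous handler
    vectors["WRCHV"] = hook

install_hook()
for ch in (65, 255, 66, 255):
    vectors["WRCHV"](ch)
print(printed, hex(memory[0x7C00]))
```

Characters other than 255 still reach the original routine, which is what keeps normal printing working while the hook is resident.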
The protocol is fixed
You can use this sort of approach to send commands from the parasite to the host and get data back. You just need to pick the right OS routine to hook, according to the desired flow and quantity of data.
(In theory you could invent your own protocols, giving yourself pretty much free rein, by having your host code overwrite the Tube loop, and then overwriting the Tube OS in the parasite. This would be a bit of a pain to arrange, though, and probably not of enough benefit to be worth the bother.)
You could hook any OS routine you fancy (that's supported over the Tube), but good routines are probably OSWRCH, OSBYTE and OSWORD. They're pretty easy to hook. From the point of view of your hook code that's running on the host, the following notes apply.
OSWRCH
Your OSWRCH routine receives the A register from the parasite. The return value is ignored.
OSWRCH is asynchronous, and the parasite doesn't wait for the call to complete before continuing execution. This gets you a bit of free parallelism. (Discussed below.)
OSBYTE
When A < &80, your OSBYTE routine receives the A and X registers from the parasite, and the X register is returned to the parasite; otherwise, it receives A, X and Y from the parasite, and X, Y and the carry flag are returned. (Don't forget to preserve the accumulator; even though it's not sent back to the parasite, the Tube host code relies on its value not changing!)
There are four OSBYTE calls that work slightly unusually.
Three never reach the host, so hooking them is no use:
- &82 - Read machine high order address (it always returns &0000);
- &83 - Read OSHWM (always &0800);
- &84 - Read HIMEM (depends on ROM).
One is asynchronous like OSWRCH: &9D - Fast Tube BPUT. Like OSWRCH, the parasite doesn't wait for the call to complete before continuing execution. This gets you a bit of free parallelism. (Discussed below.)
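The OSBYTE rules above can be summarised as a small lookup. This Python sketch is only a restatement of the notes in this section, with a made-up function name; it reports which registers cross the Tube in each direction and whether the parasite waits for a reply.

```python
LOCAL_CALLS = {0x82, 0x83, 0x84}   # answered on the parasite, never sent to the host
FAST_BPUT = 0x9D                   # sent, but the parasite does not wait for a reply

def osbyte_traffic(a):
    """Return (sent, returned, waits) for an OSBYTE call made on the parasite."""
    if a in LOCAL_CALLS:
        return (), (), False
    if a < 0x80:
        return ("A", "X"), ("X",), True
    if a == FAST_BPUT:
        return ("A", "X", "Y"), (), False
    return ("A", "X", "Y"), ("X", "Y", "C"), True

for call in (0x13, 0x96, 0x9D, 0x83):
    print(hex(call), osbyte_traffic(call))
```

When picking a call number for your own hook, the main choice is therefore between a one-register reply (A < &80), a two-register reply (A >= &80), or no reply at all (&9D).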
OSWORD
Since OSWORD has a parameter block in memory, some amount of data might have to be copied across the Tube into host memory before the call is made. Then once the call is finished, some data may need to be copied back, replacing the original parameter block in parasite memory.
The default set of OSWORD calls (A <= &14) have parameter blocks of known size. The Tube OS sends the right number of bytes to the host, and the Tube loop sends the right number of bytes back to overwrite the appropriate part of the parameter block. (OSWORD 0 - Read line - is special-cased, being more akin to something like OSFILE. Unlike most OS calls it handles, the Tube host code makes assumptions about the contents of the parameter block and what they mean.)
Other OSWORD calls, being handled by ROMs or other non-OS code, have parameter blocks of no defined size. So these calls have been split into two groups: one that sends and receives an arbitrary fixed number of bytes (that's probably large enough for most purposes), and one that sends and receives a caller-defined number of bytes.
For non-standard OSWORD calls where A < &80, a fixed number of bytes are sent and received: 16 bytes of the parameter block are sent to the host, and are replaced with the 16 bytes that result after the call.
For OSWORD calls where A >= &80 (which are all non-standard), the first 2 bytes of the parameter block indicate how many bytes should be sent and received. The first byte indicates how many bytes should be sent from parasite to host before the call, and the second byte indicates how many should be returned after the call.
Both values include the counts themselves, and the maximum size is 128 bytes.
(These two bytes are included in the parameter block that OSWORD on the host sees, but at that point they'll already have been read - if they're overwritten, the correct amount of data is still returned to the parasite, including the two overwritten count bytes.)
Note that the parameter block is copied to the host system's stack page, starting at &0128 and (potentially) finishing at &01A8.
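Here is a short Python sketch of the size rules for non-standard OSWORD calls, as described above. The function name is made up, and the fixed table used for the standard calls (A <= &14) is deliberately not modelled.

```python
STACK_BUFFER = 0x0128      # where the host copy of the parameter block starts
MAX_BLOCK = 128            # maximum block size for calls with A >= &80

def osword_transfer_sizes(a, block):
    """Return (bytes_sent, bytes_returned) for a non-standard OSWORD call."""
    if a <= 0x14:
        raise ValueError("standard call: sizes come from a fixed table, not modelled here")
    if a < 0x80:
        return 16, 16                     # fixed-size group
    sent, returned = block[0], block[1]   # counts include these two bytes
    if sent > MAX_BLOCK or returned > MAX_BLOCK:
        raise ValueError("parameter block is limited to 128 bytes")
    return sent, returned

print(osword_transfer_sizes(0x70, bytes(16)))
print(osword_transfer_sizes(0xE0, bytes([4, 6, 0, 0, 0, 0])))
print(hex(STACK_BUFFER + MAX_BLOCK))
```

In the second call, 4 bytes travel to the host and 6 come back, so the caller's block needs to be at least 6 bytes long.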
Parallelism
The default Tube setup doesn't really get you much in the way of parallelism. The average OS routine involves the parasite sending a request to the host, then waiting for a result back. While the host is working, the parasite is idle! And since the host spends a lot of its time running the Tube loop, while the parasite is busy, the host is idle. This isn't really ideal.
As it happens, though, there are two exceptions to the above rule: OSWRCH, and OSBYTE 157, the (somewhat cryptically named) Fast Tube BPUT. The parasite OS routines for these two don't wait for an answer after sending the request to the host; they just return immediately. So, with the right hooks installed, you can call your routine from the parasite, have it start your code on the host, and then have control return to the caller on the parasite with the host still running your code.
A lot of the time, you don't really need to pay much attention to this to get some benefit. For example, if your OSWRCH hook just does something that's very much along the lines of OSWRCH (i.e., put something on screen, without usually spending all that much time doing it), you can just let it do that, and you get a bit of free parallelism. (And this is, of course, why OSWRCH is set up this way in the first place.)
But if you're trying to do some more time-consuming jobs on both CPUs, you might want to be more explicit.
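As a rough way to think about the potential gain, this Python sketch compares a synchronous call (the parasite waits for the host) with an asynchronous one (both run at once). It is a back-of-envelope model only: the tick counts are invented, and it ignores transfer overhead and any time the host needs to get back to the Tube loop.

```python
def elapsed(host_ticks, parasite_ticks, asynchronous):
    """Total time for one host job plus one parasite job."""
    if asynchronous:
        return max(host_ticks, parasite_ticks)   # both processors run together
    return host_ticks + parasite_ticks           # parasite idles while the host works

def dead_time(host_ticks, parasite_ticks):
    """Time one processor spends idle when the two jobs overlap."""
    return abs(host_ticks - parasite_ticks)

# Hypothetical workloads, in TIME ticks.
print(elapsed(256, 200, asynchronous=False))
print(elapsed(256, 200, asynchronous=True))
print(dead_time(256, 200))
```

The closer the two workloads are in length, the less dead time there is, which is why balancing the work between the two processors matters.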
Parallelism - an example
Here's some code that performs a lengthy operation on the host when the Fast Tube BPUT OSBYTE call is made. This potentially leaves the caller on the parasite time to do something else while it runs. (The lengthy operation here isn't terribly useful: it's just a delay of about 256 TIME ticks. And while this is going on, the host also does a repeated INC &7C00, so you can see when it's busy and when it's finished.)
    10REM>S.PAR
    20:
    30BYTEV=&20A
    40:
    50REM HOST SIDE PART
    60:
    70DEST%=&2100:DIM BUF% 1000
    80FOR I%=4 TO 6 STEP 2
    90P%=DEST%:O%=BUF%
    100[OPTI%
    110:
    120.START
    130LDA BYTEV+0:STA DOOSBYTE+1 \ save old vector in the JMP operand
    140LDA BYTEV+1:STA DOOSBYTE+2
    150LDA #OSBYTEHOOK MOD256:STA BYTEV+0
    160LDA #OSBYTEHOOK DIV256:STA BYTEV+1
    170:
    180RTS
    190:
    200.OSBYTEHOOK
    210CMP #70:BEQ UNINSTALL
    220CMP #157:BEQ LONGOP
    230.DOOSBYTE JMP &FFFF
    240:
    250.UNINSTALL
    260LDA DOOSBYTE+1:STA BYTEV+0
    270LDA DOOSBYTE+2:STA BYTEV+1
    280RTS
    290:
    300.LONGOP
    310PHA \ the Tube host code needs A preserved
    320JSR DELAY
    330PLA
    340RTS
    350:
    360.DELAY
    370LDX #CLKWVAL MOD256:LDY #CLKWVAL DIV256:LDA #2:JSR &FFF1
    380.DELAYLP
    390LDX #CLKRVAL MOD256:LDY #CLKRVAL DIV256:LDA #1:JSR &FFF1
    400LDA #19:JSR &FFF4
    410INC &7C00
    420LDA CLKRVAL+1:BEQ DELAYLP \ loop until the clock reaches 256 ticks
    430RTS
    440:
    450.CLKWVAL EQUD 0:BRK
    460.CLKRVAL EQUD 0:BRK
    470]
    480NEXT
    490:
    500X$="SAVE H.PAR "+STR$~BUF%+" "+STR$~O%+" "+STR$~(&FFFF0000 OR START)+" "+STR$~(&FFFF0000 OR DEST%)
    510PRINT X$
    520OSCLI X$
And here's some test code that works it.
    10REM>T.PAR
    20MODE7
    30:
    40*RUN H.PAR
    50:
    60PRINT'"CALIBRATING..."
    70TIME=0
    80N%=0:REPEAT N%=N%+1:UNTILTIME>=256
    90PRINT"APPROX ";N%;" ITERATIONS"
    100:
    110PRINT'"STARTING LENGTHY OPERATION..."
    120A%=157:CALL &FFF4
    130:
    140T%=0:REPEAT T%=T%+1:UNTIL T%>N%
    150:
    160PRINT ;T%;" TIMES."
    170:
    180A%=70:CALL &FFF4
While the host is busy doing its INC &7C00, this sets the parasite incrementing a variable. It first figures out roughly how many iterations it can do in the time the host will be busy for, then (while the host is busy) iterates that many times. (This way you can see by eye they're genuinely running in parallel, even if the host finishes first.)
You have to be a bit careful when doing this. When the host is busy, it isn't running the Tube loop; and when it isn't running the Tube loop, it can't field requests from the parasite. The parasite OS routines will just hang until the host is ready again. (BASIC code is generally pretty safe, but watch out for implicit OS calls, such as TIME! You're safest with assembly language.)
So this particular demo shows one way of handling that: guess roughly how much work the parasite will be able to do while the host is busy, then do that much work. You just have to hope that this doesn't leave too much dead time on either end.
Fortunately, you can do better…
Poking the parasite - a host busy flag
In some situations, there's not much you can do when the host is busy except wait. But for some tasks you might be able to have the parasite keep working, buffering up any results and perhaps sending them across to the host when they're ready.
You can do this by maintaining a flag in parasite memory that indicates whether the host is busy. A reasonable way of doing this is to have a flag in zero page and test its bit 7: you can check whether bit 7 is set using BIT and BMI, which takes only 5 cycles if bit 7 is clear and there's nothing to be done. (BBS would be another option.)
You'll then need some code on the host to set the parasite's busy flag remotely, using one of the bulk transfer operations. The technique is described in the Tube Application Note, p7 - unfortunately this requires poking the Tube hardware directly, but it's pretty easy! The only thing needed is a Tube claimant ID: a 6-bit number uniquely identifying your program. JGH's list of known Tube claimant IDs will show you which values are free.
The steps involved are quite straightforward:
- Loop until your code has claimed the Tube hardware, by calling the Tube host code entry point at &406;
- Set up the 32-bit transfer destination address, pointing at the value in parasite memory, again by calling the Tube host code entry point at &406;
- Store the value you want to write in &FEE5 (Tube FIFO 3) - the Tube host code has set this up for you already;
- Unclaim the Tube hardware, again via the routine at &406.
(You can also write blocks of data by repeatedly storing to &FEE5 - the Tube Application Note has the full details on this, and various other transfer modes. Internally, these transfers use NMIs on the parasite, each of which runs a bit of code that reads or writes one byte; the host never gets direct access to parasite memory.)
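The four steps above can be modelled in Python to show the order of operations. `FakeTube` is a toy stand-in invented for this sketch, not a description of the real Tube host code; the A values (&C0+ID to claim, 1 for a host-to-parasite byte transfer, &80+ID to release) and the claimant ID 23 are the ones used by the S.POLLPAR listing in this article.

```python
CLAIMANT = 23                                # ID used by the S.POLLPAR listing

class FakeTube:
    """Toy stand-in for the Tube host code entry point and FIFO 3."""
    def __init__(self):
        self.owner = None
        self.address = None
        self.parasite_memory = {}

    def entry(self, a, address=None):
        """Model of JSR &406; returns the carry flag for claim requests."""
        if a >= 0xC0:                        # &C0+ID: claim the Tube
            if self.owner is None:
                self.owner = a - 0xC0
            return self.owner == a - 0xC0    # carry set when the claim succeeded
        if a >= 0x80:                        # &80+ID: release the Tube
            if self.owner == a - 0x80:
                self.owner = None
            return None
        if a == 1:                           # host-to-parasite byte transfer
            self.address = address
        return None

    def write_fifo3(self, value):            # models STA &FEE5
        self.parasite_memory[self.address] = value
        self.address += 1

def clear_busy_flag(tube, flag_address):
    while not tube.entry(0xC0 + CLAIMANT):   # step 1: loop until claimed
        pass
    tube.entry(1, flag_address)              # step 2: set up the destination
    tube.write_fifo3(0)                      # step 3: store the value
    tube.entry(0x80 + CLAIMANT)              # step 4: release

tube = FakeTube()
tube.parasite_memory[0x70] = 255             # hypothetical busy flag at &70
clear_busy_flag(tube, 0x70)
print(tube.parasite_memory[0x70], tube.owner)
```

On real hardware the host also has to leave a short delay around the FIFO write, as the listing does; the model leaves timing out entirely.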
To set this up, you'll need your host code to expose an OSBYTE call that sets the busy flag address. Assume it's in zero page, so the address is just one byte, meaning only one OSBYTE call is required to supply the address. The host code for this new OSBYTE call is pretty simple: it just stores the X register in memory, in byte 0 of a 4-byte buffer that holds the address of the busy flag in parasite memory, as required by the Tube host code that's being called in step 2.
So, here's some improved host-side code that will do this. It's somewhat similar to the previous example, except that you use OSBYTE 70 to specify the (zero page) address of a flag in parasite memory. The host will set it to zero once it's finished its Fast Tube BPUT routine.
    10REM>S.POLLPAR
    20:
    30CLAIMANT=23
    40:
    50BYTEV=&20A:WRCHV=&20E
    60:
    70REM HOST SIDE PART
    80:
    90DEST%=&2100:DIM BUF% 1000
    100FOR I%=4 TO 6 STEP 2
    110P%=DEST%:O%=BUF%
    120[OPTI%
    130:
    140.START
    150LDA BYTEV+0:STA DOOSBYTE+1
    160LDA BYTEV+1:STA DOOSBYTE+2
    170LDA #OSBYTEHOOK MOD256:STA BYTEV+0
    180LDA #OSBYTEHOOK DIV256:STA BYTEV+1
    190:
    200RTS
    210:
    220.OSBYTEHOOK
    230CMP #70:BEQ SETADDRESS
    240CMP #71:BEQ UNINSTALL
    250CMP #157:BEQ LONGOP
    260.DOOSBYTE JMP &FFFF
    270:
    280.UNINSTALL
    290LDA DOOSBYTE+1:STA BYTEV+0
    300LDA DOOSBYTE+2:STA BYTEV+1
    310RTS
    320:
    330.SETADDRESS
    340STX ADDRESS+0:RTS
    350:
    360.LONGOP
    370PHA
    380LDA ADDRESS:BEQ LONGOPDONE
    390JSR DELAY
    400JSR RESETBUSY
    410.LONGOPDONE
    420PLA:RTS
    430:
    440.DELAY
    450LDX #CLKWVAL MOD256:LDY #CLKWVAL DIV256:LDA #2:JSR &FFF1
    460.DELAYLP
    470LDX #CLKRVAL MOD256:LDY #CLKRVAL DIV256:LDA #1:JSR &FFF1
    480LDA #19:JSR &FFF4
    490INC &7C00
    500LDA CLKRVAL+0:CMP#50:BCC DELAYLP
    510RTS
    520:
    530.RESETBUSY
    540.CLAIM LDA #&C0+CLAIMANT:JSR &406:BCC CLAIM
    550LDX #ADDRESS MOD256:LDY #ADDRESS DIV256:LDA #1:JSR &406
    560JSR DELAY24
    570LDA #0:STA &FEE5
    580JSR DELAY24
    590LDA #&80+CLAIMANT:JSR &406
    600RTS
    610:
    620.DELAY24
    630LDX #5:.DELAY24LP DEX:BPL DELAY24LP:RTS
    640:
    650.ADDRESS EQUD 0
    660.CLKWVAL EQUD 0:BRK
    670.CLKRVAL EQUD 0:BRK
    680]
    690NEXT
    700:
    710X$="SAVE H.POLLPAR "+STR$~BUF%+" "+STR$~O%+" "+STR$~(&FFFF0000 OR START)+" "+STR$~(&FFFF0000 OR DEST%)
    720PRINT X$
    730OSCLI X$
And here's some test code. Unlike the last version, it doesn't need to guess how much work will fit in one unit; it just keeps running, and polls the flag to see when the host is ready.
    10REM>T.POLLPAR
    20MODE7
    30:
    40FLAG=&70
    50:
    60*RUN H.POLLPAR
    70:
    80A%=70:X%=FLAG:CALL &FFF4
    90?FLAG=255
    100:
    110PRINT'"STARTING LENGTHY OPERATION..."
    120A%=157:CALL &FFF4
    130:
    140T%=0:REPEAT T%=T%+1:UNTIL ?FLAG=0
    150:
    160PRINT ;T%;" TIMES."
    170:
    180A%=71:CALL &FFF4
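For readers without a second processor to hand, the same busy-flag pattern can be tried out in Python, with a thread playing the part of the host. This is an analogy rather than an emulation: `threading.Event` stands in for the flag byte in parasite zero page, and the 0.05 second sleep is an arbitrary stand-in for the host's lengthy operation.

```python
import threading
import time

busy = threading.Event()          # stands in for the flag byte in parasite memory

def host_long_operation():
    time.sleep(0.05)              # the host's lengthy job (arbitrary duration)
    busy.clear()                  # host writes 0 to the parasite's flag

busy.set()                        # ?FLAG=255
host = threading.Thread(target=host_long_operation)
host.start()                      # like OSBYTE 157: returns without waiting

count = 0
while True:                       # REPEAT T%=T%+1:UNTIL ?FLAG=0
    count += 1
    if not busy.is_set():
        break

host.join()
print(count, "TIMES.")
```

As in T.POLLPAR, the worker never has to guess how long the other side will take; it simply stops when the flag changes.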
And now you know…
There wasn't much in the way of Tube-enhanced software in the BBC Micro's heyday. That's hardly surprising, since there were so few sold, but a shame, because making use of the Tube is actually fairly straightforward.
May somebody read this and be inspired!
Notes/Resources
Host system memory notes
The address &2100 appeared above. This crops up because when the Tube is active, the character set is fully exploded, pushing OSHWM 6 pages higher than normal. Assuming you're using DFS, that means OSHWM is &1900+&0600=&1F00, and in MODEs 0, 1 and 2 there's all of 4K of memory left. And if you've got ADFS installed, I'd be amazed if there's anything left at all.
(I suggested &2100 specifically, because that's what the Tube Application Note recommends - but its calculation doesn't sound quite right. On the other hand, another 2 pages for safety can't hurt much more.)
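The arithmetic behind these figures is easy to check. In the Python sketch below, &3000 as the start of screen memory in MODEs 0, 1 and 2 is a fact added for the calculation rather than something stated above; the other values come from the text.

```python
PAGE = 0x100
DFS_OSHWM = 0x1900                 # OSHWM with DFS, from the text
EXPLODED_FONT = 6 * PAGE           # fully exploded character set
MODE_0_1_2_SCREEN = 0x3000         # start of screen memory in MODEs 0, 1 and 2
RECOMMENDED = 0x2100               # address recommended by the Tube Application Note

oshwm = DFS_OSHWM + EXPLODED_FONT
free_bytes = MODE_0_1_2_SCREEN - oshwm
margin_pages = (RECOMMENDED - oshwm) // PAGE

print(hex(oshwm), free_bytes, margin_pages)
```

So code assembled at &2100 sits two pages above OSHWM, and has a little under 4K to itself before it runs into screen memory in the larger modes.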
You can use OSBYTE 20 (&14) on startup to implode the character definitions and get some memory back.
Products that use the Tube in an interesting way
The famous Elite has a special Second Processor version, Elite Executive Edition, that runs on the parasite and uses the host to draw the lines. It's a fair bit faster than the normal one, runs in colour, and apparently displays more ships at once and in a greater variety. It's a bit flickery, however, compared to the (double-buffered, mostly) Master 128 version.
The interesting part about this version of Elite is that the source code is available from Ian Bell's Elite page. You can find the bulk of the host code in the file P.ELITEZ on drive 0. It works along the lines described in this article, hooking OSWRCH and OSWORD.
Other links
ReCo6502Mini - a remake of the 65C102 Co-Processor (internal version of the Second Processor, suitable for the Master 128) using contemporary parts.
JGH's list of known Tube claimant IDs - so you know not to step on anybody else's.
Tube Application Note - the official word on Tube programming.
JGH's 6502 Tube Stuff - includes parasite OS disassembly for 6502 Second Processor and 6502 Coprocessor. Very helpful when I was first looking at this stuff. (JGH's page provides tokenized BASIC; I've made text versions available here: Tube parasite OS; CoPro client OS.)
JGH's Tube host code disassembly - also very helpful.
The New Advanced User Guide - Chapter 18 deals with the Tube, partly rehashing the Tube Application Note. Not that I can talk! (Chapter 17 also has a couple of notes about creating HI versions of language ROMs, which is outside the scope of this article but Tube-related nonetheless.)