Back in the days of yore, when computers had front panels of switches and lights and core memory was actually made of ferrite cores, an editor seemingly designed for writing editors came to life. Amazingly, it is still alive and can be used on modern Windows, Mac, and Linux systems today.
In those early days, computer makers did not always worry about data interchange, or about representing text beyond the Roman alphabet, numerals, and a few punctuation marks. Eventually, the need for agreement became painfully obvious, and a veritable forest of standards for encoding text followed. Unicode emerged victorious, but the legacy of 7-bit ASCII lives on.
Which raises today's silly question: Can an editor born in 1963 for 7-bit ASCII on 8-bit paper tape edit Unicode text?
TECO
Originating at MIT in the early 1960s, TECO was adopted and ported to a number of computers, and extended in various weird and wonderful ways. A great deal of the power of TECO was due to its use of macros, even to the extent that some features of TECO itself are implemented with macros. Some years later, the first version of emacs was written as a collection of TECO macros. (One rumor has it that the name "emacs" was originally coined from the phrase "editor macros".)
Raw TECO is a monster that can transform files in powerful ways, especially on the sorts of computers where it proliferated: early DEC machines like the PDP-1, PDP-6, PDP-8, PDP-10, and PDP-11, running under both official and unofficial operating systems.
With only a little effort, you can have TECO running on your PC. Start at the GitHub project for the sources, build it to suit, and install according to the documentation. I was able to get it built and apparently working on Windows in less than 30 minutes. There is no slick installer for Windows, and I strongly doubt it is included in any of the package managers for Linux or Mac. This is a DIY tool that won't hold your hand when you use it¹, so a lack of hand-holding during installation is no surprise.
UTF-8
When TECO was young, ASCII was already the prevalent encoding for the letters, numbers, and symbols used on computers. Its competitor, EBCDIC, doesn't really deserve mention except that its use by IBM² and a few other mainframe vendors has kept it alive long past all reason.
ASCII has a number of advantages in an English-speaking world. It is compact, it groups the alphabet in a single block and in order, uppercase letters are a constant offset from their matching lowercase letters, and the decimal digits can be converted to and from binary with simple bit manipulation. But seven bits cover only enough codes to provide 95 printable characters and 33 non-printing control codes. It isn't possible to also include codes for Greek and Cyrillic characters, let alone codes for Chinese, Japanese, Korean, or any of the hundreds of other writing systems in use for living languages. Then consider all the dead languages…
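Those bit-level conveniences are easy to demonstrate. Here is a quick C sketch (purely an illustration) of the case-bit and digit tricks:

```c
#include <stdio.h>

int main(void) {
    char upper = 'G';

    /* Uppercase and lowercase letters differ only in bit 5 (0x20),
       so case conversion is a single OR (or AND with ~0x20). */
    char lower = upper | 0x20;        /* 'G' (0x47) -> 'g' (0x67) */

    /* Digits '0'..'9' are codes 0x30..0x39, so masking off the
       high nibble converts a digit character to its binary value. */
    int value = '7' & 0x0F;           /* 0x37 -> 7 */

    printf("%c -> %c, '7' -> %d\n", upper, lower, value);
    return 0;
}
```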
The first version of Unicode sought to encode the characters in use in all the world's newspapers, which its designers imagined would require only 16 bits to represent. That turned out to be somewhat shortsighted (not really a surprise), and the standard today has mapped out a code space ranging from U+0000 to U+10FFFF, following the standard's preferred notation.
One factor that slowed Unicode's adoption was the observation that it immediately doubled (or more) the size of files holding primarily ASCII text. A representation was needed that allowed a gradual transition from ASCII to Unicode. Enter UTF-8, which built on the clever original decision to assign ASCII the first 128 code points of Unicode (U+0000 through U+007F): bytes in the range 0x00 to 0x7F stand alone, while sequences of bytes in the range 0x80 to 0xFF encode U+0080 through U+10FFFF. That means any plain ASCII text file is immediately also a valid UTF-8 file, and by extension a valid Unicode file.
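The scheme is compact enough to show in full. Here is a minimal C sketch of the UTF-8 rules just described (my own illustration, not from any particular library; it omits checks for surrogates and other invalid code points):

```c
#include <stdio.h>

/* Encode one Unicode code point (U+0000..U+10FFFF) as UTF-8.
   Returns the number of bytes written to out (1..4). */
static int utf8_encode(unsigned long cp, unsigned char out[4]) {
    if (cp <= 0x7F) {                 /* ASCII: one byte, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp <= 0x7FF) {         /* 110xxxxx 10xxxxxx */
        out[0] = 0xC0 | (cp >> 6);
        out[1] = 0x80 | (cp & 0x3F);
        return 2;
    } else if (cp <= 0xFFFF) {        /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = 0xE0 | (cp >> 12);
        out[1] = 0x80 | ((cp >> 6) & 0x3F);
        out[2] = 0x80 | (cp & 0x3F);
        return 3;
    } else {                          /* 11110xxx plus three continuations */
        out[0] = 0xF0 | (cp >> 18);
        out[1] = 0x80 | ((cp >> 12) & 0x3F);
        out[2] = 0x80 | ((cp >> 6) & 0x3F);
        out[3] = 0x80 | (cp & 0x3F);
        return 4;
    }
}

int main(void) {
    unsigned char buf[4];
    int i, n = utf8_encode(0x2126, buf);   /* U+2126 OHM SIGN */
    for (i = 0; i < n; i++)
        printf("%02x ", buf[i]);           /* prints: e2 84 a6 */
    printf("\n");
    return 0;
}
```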
TECO vs. UTF-8?
So can you edit a UTF-8 document with TECO?
Sadly, not with many of the releases made for platforms where 7-bit ASCII ruled, where either there wasn't an eighth bit (some DEC machines that ran TECO used 36-bit words and stored strings as five 7-bit characters per word), or the eighth bit was used for parity checking or simply ignored.
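As an aside, that five-characters-per-word packing is easy to simulate. This C sketch follows the common PDP-10 layout of left-justified characters with the low bit left spare; the exact conventions varied by system, so treat it purely as an illustration:

```c
#include <stdio.h>
#include <stdint.h>

/* Pack five 7-bit ASCII characters into a 36-bit word,
   simulated here in a 64-bit integer. Characters occupy
   bits 35..1; bit 0 is left spare. */
static uint64_t pack5(const char *s) {
    uint64_t word = 0;
    int i;
    for (i = 0; i < 5; i++)
        word |= (uint64_t)(s[i] & 0x7F) << (29 - 7 * i);
    return word;
}

int main(void) {
    /* 5 chars x 7 bits = 35 bits, so one bit of the word goes unused */
    printf("%012llo\n", (unsigned long long)pack5("HELLO"));
    return 0;
}
```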
The DEC PDP-11 (and the VAX family that descended from it), however, had an architecture of 8-bit bytes and 16-bit words. TECO on the PDP-11 was perfectly happy to handle text with the eighth bit preserved.
The surviving port of TECO to C that runs today on Windows, Linux, and Macs also preserves the eighth bit, and in theory will work with UTF-8 text.
To prove this, I retrieved the latest release of the TECOC sources and built it using Visual Studio 2010 for 64-bit Windows. Working in a console window set to the UTF-8 code page (CHCP 65001), I opened a file containing some UTF-8 characters and displayed it:
There are four non-ASCII characters in smile.txt: the micro sign, a plain smiley face emoji, an ohm sign, and a degree sign.
C:...>chcp 65001
Active code page: 65001

C:...>type smile.txt
micro-smile µ☺
500Ω 451°F

C:...>hexdump -C smile.txt
00000000  6d 69 63 72 6f 2d 73 6d 69 6c 65 20 c2 b5 e2 98  |micro-smile ....|
00000010  ba 0d 0a 35 30 30 e2 84 a6 20 34 35 31 c2 b0 46  |...500... 451..F|
00000020  0d 0a                                            |..|

C:...>teco
*ERsmile.txt$$
*Y$$
*HT$$
micro-smile µ☺
500Ω 451°F
*
So far, so good. Basic editing operations work too, but you have to know how many bytes each character occupies, and there is no easy way to tell (a sketch of that byte counting appears below). Also, there is some confusion in the output code that causes the character near the current position to be displayed incorrectly in the Windows console. The exact cause is not clear, but the UTF-8 code page is known to be "quirky" and likely shares the blame.
C:...>teco
*ERsmile.txt$YHT$$
micro-smile µ☺
500Ω 451°F
*12CHT$$
micro-smile ��☺
500Ω 451°F
*-CHT$$
micro-smile µ☺
500Ω 451°F
*CI($L-2CI)$HT$$
micro-smile (µ☺)
500Ω 451°F
*
*-1EX$$
memory expand as per page 84

C:...>
As can be seen, when the current position is just before the micro sign, the output stream shows two replacement-character glyphs � in the console rather than the single character µ.
When the current position is just before the micro sign and smiley at the end of the first line, the commands I($L-2CI)$ do succeed in wrapping just those two characters in a pair of parentheses.
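The byte counting that makes this painful is at least mechanical: the lead byte of every UTF-8 sequence announces how long the sequence is. Here is a small C sketch of that rule (my illustration, not code from TECOC):

```c
#include <stdio.h>

/* Length of a UTF-8 sequence, determined by its lead byte
   (0 = continuation or invalid byte). This is the bookkeeping
   a TECO user must do by hand, since TECO counts bytes. */
static int utf8_len(unsigned char lead) {
    if (lead < 0x80) return 1;    /* 0xxxxxxx: ASCII */
    if (lead < 0xC0) return 0;    /* 10xxxxxx: continuation byte */
    if (lead < 0xE0) return 2;    /* 110xxxxx */
    if (lead < 0xF0) return 3;    /* 1110xxxx */
    if (lead < 0xF8) return 4;    /* 11110xxx */
    return 0;                     /* invalid lead byte */
}

int main(void) {
    /* "µ☺" as raw UTF-8 bytes, matching the hexdump above */
    unsigned char text[] = { 0xC2, 0xB5, 0xE2, 0x98, 0xBA };
    size_t i = 0;
    while (i < sizeof text) {
        int n = utf8_len(text[i]);
        printf("lead 0x%02X -> %d byte(s)\n", text[i], n);
        i += n ? n : 1;           /* skip past bad bytes one at a time */
    }
    return 0;
}
```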
TL;DR.
The almost-54-year-old ASCII text editor TECO can in fact be used to edit a file full of Unicode text stored as UTF-8.
Actually doing so is painful.
If you have a project involving embedded systems, micro-controllers, electronics design, audio, video, or more we can help. Check out our main site and call or email us with your needs. No project is too small!
+1 626 303-1602
Cheshire Engineering Corp.
710 S Myrtle Ave #315
Monrovia, CA 91016
(Written with StackEdit.)
1. When I learned TECO the first time, a common challenge was to imagine what it would do if you simply typed your own name at its prompt. For many common names, the results would be surprising and not at all useful. In fact, nearly any blat of text would do something, and almost always something awful, to your file. TECO is not a user-friendly editor. But it is more user-friendly than using scissors and tape to manually edit a punched paper tape, which is to some degree the bar it was seeking to rise above.
2. Ironically, IBM was a strong voice promoting ASCII and contributed heavily to the standard. But market pressures made it impractical to delay the System/360 until the standard was ready, so they ran with an extension of the coding already used on punched cards, and created the monster that is EBCDIC to haunt ASCII to this day.