Plaintext document format

National Archives of Australia
Published 29 April 2003

Contributors: Simon Davis, Chris Bitmead and Andrew Lee

Contents

Abstract
Status
1. Introduction
2. Dependencies
3. XML namespace
4. XML schema
5. Explanation of elements and attributes
5.1 plaintext element
5.2 tabsize attribute
5.3 xml:space attribute
5.4 line element
6. Views
6.1 Plain Text View
7. References
8. Examples
8.1 Example 1
8.2 Example 2

Abstract

This specification documents the rules for the plaintext document format. The format represents plain (ie, 'unformatted') text content, which is normally encoded in ASCII or some derivative, as XML (Extensible Markup Language) document instances. The format consists of both an XML schema and a set of rendering rules.

Status

This document has been released for public comment.

1. Introduction

The plaintext document format provides an archival representation of unformatted text content. Unformatted text content is data that consist of a sequence of lines with each line consisting of a sequence of characters all belonging to a particular character code. There is no textual or font formatting associated with the characters.

Although text files are commonplace in all major computing systems, sufficient differences in the way such files are implemented between platforms (and sometimes between applications) exist to make the meaning of many text files ambiguous outside the context of their initial use. For instance, text files may be encoded in different character encodings, or use different character sets, or use different end-of-line or end-of-file delimiters. Text files are not self-describing (ie, metadata that documents these details is not a routine part of a text file). As a result it can sometimes be difficult to know the exact information set that was meant to be imparted by a particular text file.
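The following non-normative Python sketch illustrates two of these ambiguities. The sample bytes and the two encodings compared (Latin-1 and IBM code page 437) are assumptions chosen for illustration and are not part of this specification.

```python
# Illustration only: the sample bytes and the two encodings are assumptions.
raw = b"caf\xe9"

as_latin1 = raw.decode("latin-1")   # byte 0xE9 read as Latin-1
as_cp437 = raw.decode("cp437")      # the same byte read as IBM code page 437

# The same bytes yield different characters depending on the assumed encoding.
print(as_latin1 == as_cp437)

# End-of-line delimiters differ between platforms: CR LF, LF and CR all occur.
mixed = "line one\r\nline two\nline three\rline four"
lines = mixed.splitlines()
print(lines)
```

Nothing in the bytes themselves says which reading was intended, which is the gap that a self-describing format is meant to close.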

The plaintext document format is designed to overcome these ambiguities by imposing a standard encoding and structure on text content. This format depends heavily on other, already accepted, standards such as Unicode (reference 7.8) and XML (reference 7.4). The format consists of an XML schema, which defines the structure that instances must conform to, and a set of view requirements, which determine how the various elements within that structure should be rendered. These two components, the schema and the view, can be equated to the concepts of 'data object' and 'information object' in the draft ISO Standard, Open Archival Information System Reference Model (reference 7.6).

Many text files have complex structures that go far beyond the 'sequence of lines' model used here. Initialisation and configuration files, email messages and SGML and XML document instances are commonplace examples. Although this document format can be used to encode such files, specially designed archival document formats that make allowance for these complex structures may be preferable.

2. Dependencies

The key words 'MUST', 'MUST NOT', 'REQUIRED', 'SHALL', 'SHALL NOT', 'SHOULD', 'SHOULD NOT', 'RECOMMENDED', 'MAY' and 'OPTIONAL' in this document are to be interpreted as described in RFC 2119 (reference 7.3).

The terms 'block' and 'visual' in this document are to be interpreted as described in Cascading style sheets, level 2 (reference 7.2).

References to XML, XML namespaces, XML schema and Unicode characters are to be interpreted according to Extensible markup language (reference 7.4), Namespaces in XML (reference 7.5), XML schema part 0 (reference 7.7), XML schema part 1 (reference 7.9), XML schema part 2 (reference 7.1) and The Unicode standard: Version 3.0 (reference 7.8).

3. XML namespace

A plaintext instance SHOULD use the namespace declaration:
http://preservation.naa.gov.au/plaintext/1.0.

A plaintext instance MAY use the namespace prefix: plaintext.

4. XML schema

A plaintext instance MUST conform to the following XML schema:

<?xml version="1.0" ?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://preservation.naa.gov.au/plaintext/1.0"
            version="1.0">
  <xsd:annotation>
    <xsd:documentation xml:lang="en">
      plaintext. A schema to represent plain old text files.
      Developed by the National Archives of Australia. Copyright 2003
      Commonwealth of Australia.
    </xsd:documentation>
  </xsd:annotation>
  <xsd:element name="plaintext">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="line"
                     minOccurs="0"
                     maxOccurs="unbounded">
          <xsd:simpleType>
            <xsd:restriction base="xsd:string" />
          </xsd:simpleType>
        </xsd:element>
      </xsd:sequence>
      <xsd:attribute name="tabsize"
                     type="xsd:positiveInteger"
                     use="optional" />
      <xsd:attribute name="xml:space"
                     fixed="preserve"
                     use="required" />
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
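
As a non-normative illustration, the structure that the schema describes could be checked with a short Python function such as the one below. It is a hand-written approximation using only the standard library and is not a substitute for validation against the schema. The function name check_plaintext is hypothetical, and the sketch assumes that line elements are namespace-qualified, as they are in the examples in section 8.

```python
import re
import xml.etree.ElementTree as ET

NS = "http://preservation.naa.gov.au/plaintext/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def check_plaintext(xml_text):
    """Return a list of problems found; an empty list means none were found.

    This is a hand-written structural check, not XML Schema validation.
    """
    problems = []
    root = ET.fromstring(xml_text)
    if root.tag != "{%s}plaintext" % NS:
        problems.append("root element is not plaintext in the expected namespace")
    for child in root:
        if child.tag != "{%s}line" % NS:
            problems.append("unexpected child element: %s" % child.tag)
    # The parser reports xml:space under the reserved XML namespace.
    if root.get("{%s}space" % XML_NS) != "preserve":
        problems.append('xml:space must be "preserve"')
    tabsize = root.get("tabsize")
    if tabsize is not None:
        value = tabsize.strip()
        if not re.fullmatch(r"\+?[0-9]+", value) or int(value) <= 0:
            problems.append("tabsize must be a positive integer")
    return problems
```

An instance that omits xml:space and sets tabsize="0" would be reported with two problems, while a well-formed instance with only line children returns an empty list.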

5. Explanation of elements and attributes

5.1 plaintext element

This is the root element for a plaintext instance. All the other elements in a plaintext instance MUST be contained within this element. It consists of a sequence of line elements. Sequencing of line elements within the plaintext element is significant and MUST be preserved. A plaintext element has two attributes: tabsize and xml:space. The tabsize attribute sets an interval for tab markers throughout the document; a plaintext element MAY have a value associated with this attribute. The xml:space attribute informs processing applications that they must preserve all whitespace within the plaintext element. The value of this attribute MUST be "preserve".

XML code

<plaintext>

Example fragment

An empty plaintext element:

<plaintext tabsize="8" xml:space="preserve"/>

Attributes

tabsize
xml:space
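
A plaintext element could be assembled programmatically along the following lines. This is a minimal, non-normative sketch that assumes Python's standard xml.etree.ElementTree module; the function name build_plaintext is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = "http://preservation.naa.gov.au/plaintext/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Bind the optional "plaintext" prefix for serialisation.
ET.register_namespace("plaintext", NS)


def build_plaintext(lines, tabsize=None):
    """Build a plaintext element whose line children follow the input order."""
    root = ET.Element("{%s}plaintext" % NS)
    if tabsize is not None:
        root.set("tabsize", str(tabsize))
    # xml:space is required and its value is always "preserve".
    root.set("{%s}space" % XML_NS, "preserve")
    for text in lines:
        line = ET.SubElement(root, "{%s}line" % NS)
        line.text = text
    return root


root = build_plaintext(["line one", "line two", "line three"])
print(ET.tostring(root, encoding="unicode"))
```

Because the line elements are appended in input order, the sequencing requirement is met as long as the caller supplies the lines in their original order.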

5.2 tabsize attribute

This attribute defines the width of the tab intervals for the instance. When the instance is rendered, any tab character (Unicode character 9) MUST be rendered as whitespace extending to the next character column that is a multiple of the tabsize value.

XML code

tabsize

Example fragment

tabsize="4"
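
The tab rule can be sketched in Python as follows. This non-normative example assumes that columns are counted from zero and that every character occupies exactly one column; the function name expand_tabs is hypothetical.

```python
def expand_tabs(text, tabsize):
    """Replace each tab with spaces up to the next multiple of tabsize.

    Columns are counted from zero, which is an assumption of this sketch.
    """
    out = []
    column = 0
    for ch in text:
        if ch == "\t":
            # Number of columns left before the next tab stop.
            width = tabsize - (column % tabsize)
            out.append(" " * width)
            column += width
        else:
            out.append(ch)
            column += 1
    return "".join(out)


print(repr(expand_tabs("ab\tc", 4)))
```

With a tabsize of 4, a tab that follows 'ab' expands to two spaces, so that 'c' lands in column 4. For a single line of text, Python's built-in str.expandtabs produces the same result.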

5.3 xml:space attribute

This attribute is defined in the XML standard (reference 7.4). Its value MUST be set to "preserve". The value of "preserve" specifies that any whitespace within a plaintext instance MUST be retained by a compliant application. Whitespace is defined as any instance or combination of any of the following four characters:

space (Unicode character 32)
tab (Unicode character 9)
carriage return (Unicode character 13)
line feed (Unicode character 10)

The whitespace within an unformatted document can convey significant meaning. For instance, whitespace characters may help to: arrange characters into columns, indent source code (which may be meaningful in some programming languages), or arrange characters into boxes or other graphical objects (known as 'ASCII art'). For this reason it is important to preserve all whitespace characters within a plaintext instance and to render these characters in any view of a plaintext instance.

XML code

xml:space

Example fragment

xml:space="preserve"
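
The non-normative Python sketch below shows what retaining whitespace means for a consuming application: the text of a line element is used exactly as parsed, without stripping or collapsing. The sample instance is an assumption made for illustration, and the set XML_WHITESPACE lists the four whitespace characters of XML 1.0.

```python
import xml.etree.ElementTree as ET

# The four XML 1.0 whitespace characters: space, tab, line feed, carriage return.
XML_WHITESPACE = {"\u0020", "\u0009", "\u000A", "\u000D"}

doc = (
    '<plaintext xmlns="http://preservation.naa.gov.au/plaintext/1.0" '
    'xml:space="preserve">'
    "<line>  name&#9;value  </line>"
    "</plaintext>"
)

root = ET.fromstring(doc)
text = root[0].text

# A compliant consumer keeps every whitespace character in the line.
print(repr(text))
count = sum(1 for ch in text if ch in XML_WHITESPACE)
print(count)
```

The leading spaces, the tab and the trailing spaces all survive parsing; an application that called a strip or normalise function on the text at this point would not be compliant.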

5.4 line element

This represents a line of text within the plaintext instance. A line consists of a sequence of Unicode characters. The order of the characters is significant.

XML code

<line>

Example fragment

<line>Electronic records are performances not physical objects.</line>

Attributes

None

6. Views

A compliant application MUST support the requirements of the Plain Text View.

The Plain Text View is the default view for the plaintext document format: when a plaintext instance is rendered with no other arguments, it MUST be rendered in the Plain Text View.

6.1 Plain Text View

The Plain Text View is designed principally for visual devices.

The content of each line element MUST be rendered within its own block.

Empty line elements MUST be rendered.

The view MUST render a tab character (Unicode character 9) as whitespace until the next character column that is a multiple of the value of the tabsize attribute.

If a value for the tabsize attribute is not present, the view MAY use any value to determine tab stops. The user SHOULD be able to change this value.

The view MUST render the content of each line element using a monospace font.

The view MAY wrap line element content to fit the dimensions of the view.
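
A Plain Text View for a character-based display could be sketched as below. This non-normative Python example returns one string per rendered row. The function name render_plain_text_view, the fallback tab size of 8 and the fixed-width wrapping strategy are assumptions of the sketch, and the choice of a monospace font is left to the display device.

```python
import xml.etree.ElementTree as ET

NS = "http://preservation.naa.gov.au/plaintext/1.0"
DEFAULT_TABSIZE = 8  # assumption: a view may use any value when tabsize is absent


def render_plain_text_view(xml_text, width=None):
    """Return the rendered rows of a plaintext instance as a list of strings."""
    root = ET.fromstring(xml_text)
    tabsize = int(root.get("tabsize", DEFAULT_TABSIZE))
    rows = []
    for line in root.iter("{%s}line" % NS):
        # Expand tabs to the next column that is a multiple of tabsize.
        text = (line.text or "").expandtabs(tabsize)
        if width is None or len(text) <= width:
            rows.append(text)  # empty lines still produce a row
        else:
            # Optional wrapping: cut into fixed-width pieces, keeping whitespace.
            rows.extend(text[i:i + width] for i in range(0, len(text), width))
    return rows
```

Each line element starts a new row, an empty line element yields an empty row, and wrapping applies only when a width is supplied.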

7. References

7.1 Paul V. Biron and Ashok Malhotra (eds), XML schema part 2: datatypes, 2 May 2001. (www.w3.org/TR/2001/REC-xmlschema-2-20010502)

7.2 Bert Bos, Hakon Wium Lie, Chris Lilley and Ian Jacobs (eds), Cascading style sheets, level 2: CSS2 specification, 12 May 1998. (www.w3.org/TR/1998/REC-CSS2-19980512)

7.3 S. Bradner, RFC 2119: key words for use in RFCs to indicate requirement levels, March 1997. (www.ietf.org/rfc/rfc2119.txt)

7.4 Tim Bray, Jean Paoli, C.M. Sperberg-McQueen and Eve Maler (eds), Extensible markup language (XML) 1.0 (second edition), 6 October 2000. (www.w3.org/TR/2000/REC-xml-20001006)

7.5 Tim Bray, Dave Hollander and Andrew Layman (eds), Namespaces in XML, 14 January 1999. (www.w3.org/TR/1999/REC-xml-names-19990114)

7.6 Consultative Committee for Space Data Systems, CCSDS 650.0-R-2: reference model for an open archival information system (OAIS), July 2001. (www.ccsds.org/documents/so2002/spaceops02_p_t5_39.pdf - 240 kb pdf document)

7.7 David C. Fallside (ed.), XML schema part 0: primer, 2 May 2001. (www.w3.org/TR/2001/REC-xmlschema-0-20010502)

7.8 The Unicode Consortium, The Unicode standard: version 3.0, 2000. (www.unicode.org/unicode/uni2book/u2.html)

7.9 Henry S. Thompson, David Beech, Murray Maloney and Noah Mendelsohn (eds), XML schema part 1: structures, 2 May 2001. (www.w3.org/TR/2001/REC-xmlschema-1-20010502)

7.10 Henry S. Thompson and Richard Tobin, XSV (XML schema validator), (software), 1.203.2.47.2.4.2.14/1.106.2.25.2.6 of 2002/06/15. (www.ltg.ed.ac.uk/~ht/xsv-status.html)

8. Examples

These examples are non-normative.

8.1 Example 1

A text file consisting of three lines with no tabs.

8.1.1 Source document

line one
line two
line three

8.1.2 XML markup

<?xml version="1.0" encoding="UTF-16"?>
<plaintext:plaintext xmlns:plaintext="http://preservation.naa.gov.au/plaintext/1.0"
xml:space="preserve">
<plaintext:line>line one</plaintext:line>
<plaintext:line>line two</plaintext:line>
<plaintext:line>line three</plaintext:line>
</plaintext:plaintext>
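
As a non-normative illustration, the source lines can be recovered from an instance of this shape with Python's standard library. The instance below is written out again inside the sketch so that it is self-contained, with the plaintext prefix bound by an xmlns:plaintext declaration; the XML declaration is omitted because the data is passed to the parser as an already-decoded string.

```python
import xml.etree.ElementTree as ET

NS = "http://preservation.naa.gov.au/plaintext/1.0"

# An instance shaped like Example 1 (assumed here for illustration).
instance = (
    '<plaintext:plaintext xmlns:plaintext="%s" xml:space="preserve">\n'
    "<plaintext:line>line one</plaintext:line>\n"
    "<plaintext:line>line two</plaintext:line>\n"
    "<plaintext:line>line three</plaintext:line>\n"
    "</plaintext:plaintext>" % NS
)

root = ET.fromstring(instance)
# Only line elements carry content; an empty line element has no text.
lines = [line.text or "" for line in root.findall("{%s}line" % NS)]
source = "\n".join(lines)  # the end-of-line delimiter is the consumer's choice
print(source)
```

The format records the lines themselves rather than any end-of-line delimiter, so the consumer decides which delimiter to use when writing the text back out.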

8.2 Example 2

A text file that uses whitespace to improve the readability of source code. Note: the '&#9;' in the XML markup is an XML numeric character reference for the tab character (Unicode character 9). A plaintext instance may, but does not have to, use this reference.

8.2.1 Source document

//file: HelloJava.java

public class HelloJava extends javax.swing.JComponent {

	public static void main(String[] args) {
		javax.swing.JFrame f = new javax.swing.JFrame("Hello Java App");
		f.setSize(300,300);
		f.getContentPane().add(new HelloJava());
		f.setVisible(true);
	}

	public void paintComponent(java.awt.Graphics g) {
		g.drawString("Hello, Java!",125,95);
	}
}

8.2.2 XML markup

<?xml version="1.0" encoding="UTF-16"?>
<plaintext:plaintext xmlns:plaintext="http://preservation.naa.gov.au/plaintext/1.0"
tabsize="4"
xml:space="preserve">
<plaintext:line>//file: HelloJava.java</plaintext:line>
<plaintext:line/>
<plaintext:line>public class HelloJava extends javax.swing.JComponent {</plaintext:line>
<plaintext:line/>
<plaintext:line>&#9;public static void main(String[] args) {</plaintext:line>
<plaintext:line>&#9;&#9;javax.swing.JFrame f = new javax.swing.JFrame("Hello Java App");</plaintext:line>
<plaintext:line>&#9;&#9;f.setSize(300,300);</plaintext:line>
<plaintext:line>&#9;&#9;f.getContentPane().add(new HelloJava());</plaintext:line>
<plaintext:line>&#9;&#9;f.setVisible(true);</plaintext:line>
<plaintext:line>&#9;}</plaintext:line>
<plaintext:line/>
<plaintext:line>&#9;public void paintComponent(java.awt.Graphics g) {</plaintext:line>
<plaintext:line>&#9;&#9;g.drawString("Hello, Java!",125,95);</plaintext:line>
<plaintext:line>&#9;}</plaintext:line>
<plaintext:line>}</plaintext:line>
</plaintext:plaintext>
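
The conversion from a source document to markup of this kind could be sketched in Python as follows. This is a non-normative example: the function name text_to_plaintext is hypothetical, the input is assumed to be text that has already been decoded, and characters that XML 1.0 cannot represent (such as most control characters) are not handled. Note that str.splitlines recognises several line boundaries in addition to carriage return and line feed.

```python
from xml.sax.saxutils import escape, quoteattr

NS = "http://preservation.naa.gov.au/plaintext/1.0"


def text_to_plaintext(text, tabsize=None):
    """Convert decoded text into the markup of a plaintext instance."""
    attrs = ["xmlns:plaintext=%s" % quoteattr(NS)]
    if tabsize is not None:
        attrs.append('tabsize="%d"' % tabsize)
    attrs.append('xml:space="preserve"')
    out = ["<plaintext:plaintext %s>" % " ".join(attrs)]
    for line in text.splitlines():
        # escape() handles &, < and >; tabs become the reference &#9;.
        body = escape(line, {"\t": "&#9;"})
        if body:
            out.append("<plaintext:line>%s</plaintext:line>" % body)
        else:
            out.append("<plaintext:line/>")
    out.append("</plaintext:plaintext>")
    return "\n".join(out)


print(text_to_plaintext("class A {\n\tint x;\n\n}\n", tabsize=4))
```

Writing tabs as character references is a choice made in this sketch to mirror the example above; a literal tab character in the line content is equally acceptable.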