National Archives of Australia
Published 29 April 2003
Contributors: Simon Davis, Chris Bitmead and Andrew Lee
Abstract
Status
1. Introduction
2. Dependencies
3. XML namespace
4. XML schema
5. Explanation of elements and attributes
5.1 plaintext element
5.2 tabsize attribute
5.3 xml:space attribute
5.4 line element
6. Views
6.1 Plain Text View
7. References
8. Examples
8.1 Example 1
8.2 Example 2
This specification documents the rules for the plaintext
document format. The plaintext format
represents plain (i.e. 'unformatted') text content, normally encoded in
ASCII or some derivative, as XML (extensible markup language) document instances.
The format consists of both an XML schema and a set of rendering rules.
This document has been released for public comment.
The plaintext document format provides an archival representation of unformatted text content. Unformatted text content is data that consist of a sequence of lines with each line consisting of a sequence of characters all belonging to a particular character code. There is no textual or font formatting associated with the characters.
Although text files are commonplace in all major computing systems, they are implemented differently enough between platforms (and sometimes between applications) to make the meaning of many text files ambiguous outside the context of their initial use. For instance, text files may be encoded in different character encodings, or use different character sets, or use different end-of-line or end-of-file delimiters. Text files are not self-describing (i.e. metadata that documents these details is not a routine part of a text file). As a result it can sometimes be difficult to know the exact information set that was meant to be imparted by a particular text file.
The plaintext
document format is designed to overcome these ambiguities by imposing a standard
encoding and structure on text content. This format depends heavily on other,
already accepted, standards such as Unicode (reference 7.8)
and XML (reference 7.4). The format consists of both an XML
schema to define the structure that instances must conform to as well as a set
of view requirements that determine how the various elements within that structure
should be rendered. These two components – the schema and the view –
can be equated to the concepts of 'data object' and 'information object' in
the draft ISO Standard, Open Archival Information System Reference Model
(reference 7.6).
Many text files have complex structures that go far beyond the 'sequence of lines' model used here. Initialisation and configuration files, email messages and SGML and XML document instances are commonplace examples. Although this document format can be used to encode such files, specially designed archival document formats that make allowance for these complex structures may be preferable.
The key words 'MUST', 'MUST NOT', 'REQUIRED', 'SHALL', 'SHALL NOT', 'SHOULD', 'SHOULD NOT', 'RECOMMENDED', 'MAY' and 'OPTIONAL' in this document are to be interpreted as described in RFC 2119 (reference 7.3).
The terms 'block' and 'visual' in this document are to be interpreted as described in Cascading style sheets, level 2 (reference 7.2).
References to XML, XML namespaces, XML schema, and Unicode characters are to be interpreted according to Extensible markup language (reference 7.4), Namespaces in XML (reference 7.5), XML schema part 0 (reference 7.7), XML schema part 1 (reference 7.9), XML schema part 2 (reference 7.1), and The Unicode standard: Version 3.0 (reference 7.8), respectively.
A plaintext instance
SHOULD use the namespace declaration:
http://preservation.naa.gov.au/plaintext/1.0.
A plaintext instance
MAY use the namespace prefix: plaintext.
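As a rough illustration of the namespace rules above, the following Python sketch builds a small plaintext instance with the standard library's xml.etree.ElementTree. The helper name build_plaintext is hypothetical, and placing the line elements in the same namespace as the root is an assumption of this sketch, not something this section states.

```python
import xml.etree.ElementTree as ET

# Namespace URI and prefix taken from this section.
PLAINTEXT_NS = "http://preservation.naa.gov.au/plaintext/1.0"
# Reserved namespace that is always bound to the "xml" prefix (for xml:space).
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Ask ElementTree to serialise the namespace with the "plaintext" prefix.
ET.register_namespace("plaintext", PLAINTEXT_NS)


def build_plaintext(lines, tabsize=None):
    """Return a plaintext root element with one line element per string.

    Hypothetical helper. Qualifying line with the namespace is an assumption.
    """
    root = ET.Element(f"{{{PLAINTEXT_NS}}}plaintext")
    root.set(f"{{{XML_NS}}}space", "preserve")
    if tabsize is not None:
        root.set("tabsize", str(tabsize))
    for text in lines:
        line = ET.SubElement(root, f"{{{PLAINTEXT_NS}}}line")
        line.text = text
    return root


instance = build_plaintext(["line one", "line two", "line three"], tabsize=8)
xml_text = ET.tostring(instance, encoding="unicode")
print(xml_text)
```

The serialised root carries the xmlns:plaintext declaration and xml:space="preserve"; the xml prefix itself needs no declaration because it is reserved by the XML namespaces recommendation.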
A plaintext instance
MUST conform to the following XML schema:
<?xml version="1.0" ?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://preservation.naa.gov.au/plaintext/1.0"
            version="1.0">
  <xsd:annotation>
    <xsd:documentation xml:lang="en">
      plaintext. A schema to represent plain old text files.
      Developed by the National Archives of Australia. Copyright 2003
      Commonwealth of Australia.
    </xsd:documentation>
  </xsd:annotation>
  <xsd:element name="plaintext">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="line"
                     minOccurs="0"
                     maxOccurs="unbounded">
          <xsd:simpleType>
            <xsd:extension base="xsd:string" />
          </xsd:simpleType>
        </xsd:element>
      </xsd:sequence>
      <xsd:attribute name="tabsize"
                     type="xsd:positiveInteger"
                     use="optional" />
      <xsd:attribute name="xml:space"
                     fixed="preserve"
                     use="required" />
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
plaintext element
This is the root element for a plaintext instance. All the other elements in a plaintext instance MUST be contained within this element. It consists of a sequence of line elements. Sequencing of line elements within the plaintext element is significant and MUST be preserved. A plaintext element has two attributes: tabsize and xml:space.
The tabsize attribute sets an interval for tab markers throughout the document; a plaintext element MAY have a value associated with this attribute. The xml:space attribute informs processing applications that they must preserve all whitespace within the plaintext element. The value of this attribute MUST be "preserve".
XML code | <plaintext>
Example fragment | An empty <plaintext tabsize="8" xml:space="preserve"/>
Attributes | tabsize
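The attribute rules for the root element can be checked mechanically. The following Python sketch is one possible check, not a substitute for schema validation; the function name check_root_attributes and the wording of its messages are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# ElementTree reports xml:space under the reserved XML namespace.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def check_root_attributes(root):
    """Return a list of problems found on a parsed plaintext root element."""
    problems = []
    if root.get(XML_SPACE) != "preserve":
        problems.append('xml:space must be present with the value "preserve"')
    tabsize = root.get("tabsize")
    if tabsize is not None:
        # The schema types tabsize as xsd:positiveInteger: ASCII digits,
        # optionally signed with "+", and greater than zero.
        value = tabsize.strip()
        if not re.fullmatch(r"\+?[0-9]+", value) or int(value) < 1:
            problems.append("tabsize must be a positive integer")
    return problems


good = ET.fromstring('<plaintext tabsize="8" xml:space="preserve"/>')
bad = ET.fromstring('<plaintext tabsize="0"/>')
print(check_root_attributes(good))
print(check_root_attributes(bad))
```

The first call reports no problems; the second reports both the missing xml:space attribute and the non-positive tabsize.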
tabsize attribute
This attribute defines the width of tab intervals for the instance. When the instance is rendered, any tab characters (Unicode character 9) MUST be rendered as whitespace until the next character column that is a multiple of the tabsize value.
XML code | tabsize
Example fragment | tabsize="4"
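The tab rule above can be sketched in a few lines of Python. The sketch assumes that columns are counted from 0 and that every character other than a tab occupies exactly one column; the function name expand_tabs is hypothetical.

```python
def expand_tabs(text, tabsize):
    """Replace each tab with spaces up to the next multiple of tabsize.

    Assumes 0-based columns and one column per non-tab character.
    """
    if tabsize < 1:
        raise ValueError("tabsize must be a positive integer")
    out = []
    column = 0
    for ch in text:
        if ch == "\t":
            # Distance from the current column to the next tab stop.
            width = tabsize - (column % tabsize)
            out.append(" " * width)
            column += width
        else:
            out.append(ch)
            column += 1
    return "".join(out)


print(repr(expand_tabs("a\tbc\td", 4)))  # 'a   bc  d'
```

For text without line breaks, Python's built-in str.expandtabs applies the same rule, so a renderer could rely on it instead of a hand-written loop.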
xml:space attribute
This attribute is defined in the XML (reference 7.4) standard. Its value MUST be set to "preserve". The value of "preserve" specifies that any whitespace within a plaintext instance MUST be retained by a compliant application. Whitespace is defined as any instance or combination of any of the following four characters:
space (Unicode character 32)
tab (Unicode character 9)
carriage return (Unicode character 13)
line feed (Unicode character 10)
The whitespace within an unformatted document can convey significant meaning.
For instance, whitespace characters may help to: arrange characters into columns,
indent source code (which may be meaningful in some programming languages),
or arrange characters into boxes or other graphical objects (known as 'ASCII
art'). For this reason it is important to preserve all whitespace characters
within a plaintext
instance and to render these characters in any view of a plaintext
instance.
XML code | xml:space
Example fragment | xml:space="preserve"
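To illustrate why whitespace has to survive processing, the Python sketch below stores a column-aligned line in an instance, reads it back unchanged, and contrasts that with what a whitespace-collapsing application would produce. The sample text is invented for the illustration, and the namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

# A line whose layout depends on runs of spaces, a tab and trailing spaces.
original = "name      qty\tprice  "

root = ET.Element("plaintext", {XML_SPACE: "preserve"})
ET.SubElement(root, "line").text = original

# Serialise and parse again: element content keeps its whitespace.
serialized = ET.tostring(root, encoding="unicode")
recovered = ET.fromstring(serialized).find("line").text

# What an application that collapses whitespace would keep instead.
collapsed = " ".join(original.split())

print(repr(recovered))
print(repr(collapsed))
```

The recovered text equals the original, while the collapsed form loses the column alignment and the trailing spaces.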
line element
This represents a line of text within the plaintext instance. A line consists of a sequence of Unicode characters. The order of the characters is significant.
XML code | <line>
Example fragment | <line>Electronic records are performances not physical objects.</line>
Attributes | (none)
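Producing line elements from an existing text file means deciding where each line ends. The Python sketch below is one way to do this under stated assumptions: the source encoding is known, the line terminators are CR LF, LF or CR, and a terminator at the very end of the file closes the last line and does not start a new, empty one. The function name text_to_lines is hypothetical, and the namespace is omitted for brevity.

```python
import re
import xml.etree.ElementTree as ET

# Assumed terminator set; CR LF is listed first so it counts as one break.
_EOL = re.compile(r"\r\n|\n|\r")


def text_to_lines(data, encoding):
    """Decode bytes and return the text of each line without terminators."""
    text = data.decode(encoding)
    lines = _EOL.split(text)
    # A trailing terminator ends the last line; it does not add a new one.
    if lines and lines[-1] == "":
        lines.pop()
    return lines


root = ET.Element("plaintext")
for text in text_to_lines(b"one\r\ntwo\nthree\r", "ascii"):
    ET.SubElement(root, "line").text = text

result = ET.tostring(root, encoding="unicode")
print(result)
```

Because each line becomes its own element, the platform-specific terminator is no longer part of the stored content.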
A compliant application MUST support the requirements of the Plain Text View.
The Plain Text View is the default view for the plaintext
document format: when a plaintext
instance is rendered with no other arguments, it MUST be rendered in the Plain
Text View.
The Plain Text View is designed principally for visual devices.
The content of each line
element MUST be rendered within its own block.
Empty line elements
MUST be rendered.
The view MUST render a tab character (Unicode character 9) as whitespace until
the next character column that is a multiple of the value of the tabsize
attribute.
If a value for the tabsize
attribute is not present, the view MAY use any value to determine tab stops.
The user SHOULD be able to change this value.
The view MUST render the content of each line
element using a monospace font.
The view MAY wrap line
element content to fit the dimensions of the view.
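The text-level part of these requirements can be sketched in Python as follows. The sketch returns one string per line element and leaves block layout, the monospace font and optional wrapping to the display layer. The function name render_plain_text_view, the user_tabsize parameter and the fallback of 8 are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

DEFAULT_TABSIZE = 8  # assumed fallback; the view MAY use any value


def render_plain_text_view(xml_text, user_tabsize=None):
    """Return one rendered string per line element, in document order."""
    root = ET.fromstring(xml_text)
    # The tabsize attribute wins; otherwise a user-chosen value, then default.
    tabsize = int(root.get("tabsize") or user_tabsize or DEFAULT_TABSIZE)
    blocks = []
    for child in root:
        # Accept line elements whether or not they are namespace-qualified.
        if child.tag.rsplit("}", 1)[-1] != "line":
            continue
        text = child.text or ""  # empty line elements are still rendered
        # str.expandtabs pads each tab to the next multiple of tabsize.
        blocks.append(text.expandtabs(tabsize))
    return blocks


doc = (
    '<plaintext tabsize="4" xml:space="preserve">'
    "<line>if (x) {</line>"
    "<line>\treturn;</line>"
    "<line/>"
    "<line>}</line>"
    "</plaintext>"
)
for block in render_plain_text_view(doc):
    print(block)
```

The empty line element yields an empty string, so it still occupies its own block in the output.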
7.1 Paul V. Biron and Ashok Malhotra (eds), XML schema part 2: datatypes, 2 May 2001. (www.w3.org/TR/2001/REC-xmlschema-2-20010502)
7.2 Bert Bos, Hakon Wium Lie, Chris Lilley and Ian Jacobs (eds), Cascading style sheets, level 2: CSS2 specification, 12 May 1998. (www.w3.org/TR/1998/REC-CSS2-19980512)
7.3 S. Bradner, RFC 2119: key words for use in RFCs to indicate requirement levels, March 1997. (www.ietf.org/rfc/rfc2119.txt)
7.4 Tim Bray, Jean Paoli, C.M. Sperberg-McQueen and Eve Maler (eds), Extensible markup language (XML) 1.0 (second edition), 6 October 2000. (http://www.w3.org/TR/2004/REC-xml-20040204/)
7.5 Tim Bray, Dave Hollander and Andrew Layman (eds), Namespaces in XML, 14 January 1999. (www.w3.org/TR/1999/REC-xml-names-19990114)
7.6 Consultative Committee for Space Data Systems, CCSDS 650.0-R-2: reference model for an open archival information system (OAIS), July 2001. (www.ccsds.org/documents/so2002/spaceops02_p_t5_39.pdf)
7.7 David C. Fallside (ed.), XML schema part 0: primer, 2 May 2001. (www.w3.org/TR/2001/REC-xmlschema-0-20010502)
7.8 The Unicode Consortium, The Unicode standard: version 3.0, 2000. (www.unicode.org/unicode/uni2book/u2.html)
7.9 Henry S. Thompson, David Beech, Murray Maloney and Noah Mendelsohn (eds), XML schema part 1: structures, 2 May 2001. (www.w3.org/TR/2001/REC-xmlschema-1-20010502)
7.10 Henry S. Thompson and Richard Tobin, XSV (XML schema validator), (software), 1.203.2.47.2.4.2.14/1.106.2.25.2.6 of 2002/06/15. (www.ltg.ed.ac.uk/~ht/xsv-status.html)
These examples are non-normative.
A text file consisting of three lines with no tabs.
line one |
<?xml version="1.0" encoding="UTF-16"?> |
A text file that uses whitespace to improve the readability of source code.
Note: the '&#9;' sequence in the XML markup is an XML numeric character
reference for the tab character (Unicode character 9). A plaintext
instance may, but does not have to, use this reference.
//file: HelloJava.java |
<?xml version="1.0" encoding="UTF-16"?> |