multipage Document Format

National Archives of Australia

Date Published: 1 May 2003

Contributors: Simon Davis, Chris Bitmead, and Andrew Lee

Abstract

This specification documents the rules for the multipage document format. A multipage instance is an electronic document that consists of a sequence of 'page'-like objects that should be presented to an end user one object at a time. The specification consists of an XML schema, an explanation of the elements and attributes in that schema, a set of rendering rules for presenting document instances encoded according to that schema to end users, and non-normative examples.

Status

This document has been released for public comment.

Table of Contents

Abstract
Status
1. Introduction
2. Dependencies
3. XML namespace
4. XML Schema
5. Explanation of elements and attributes
5.1 multipage element
5.2 page element
5.2.1 id attribute
5.2.2 label attribute
6. Views
6.1 Multipage View
7. References
8. Examples
8.1 Example 1
8.2 Example 2

1. Introduction

The multipage document format represents an electronic document that consists of a sequence of identifiable pages which should be presented to a user individually.

Although the capabilities of computer hardware and software mean that digital data objects can exhibit a far wider range of behaviour than paper data objects, many electronic documents still mimic the presentation of traditional paper documents: a collection of pages, each with its own identity (for example, a page number), presented sequentially to a user. Multi-image TIFF images and Adobe's Portable Document Format (PDF) documents are common examples. Other electronic documents might not be rendered one page at a time, but still depend on the structure of a sequence of individual and identifiable pages within a larger work (the 'document'): Internet Engineering Task Force Request for Comments documents are a typical example.

This format provides a simple and open XML vocabulary for representing the structure of such electronic documents. It does not provide elements and attributes for representing the content within a page. Instead, the multipage format is designed to be used together with other, content-focused, XML vocabularies to represent the source document. For instance, this format can be used with the National Archives of Australia's png document format to represent a scanned paper document, a multipage TIFF image, or the basic functionality of a PDF document.

The format consists of an XML Schema, which defines the structure that instances must conform to, and a set of View requirements, which determine how the various elements within that structure should be rendered. These two components, the Schema and the View, can be equated to the concepts of 'data object' and 'information object' in the draft ISO Standard, Open Archival Information System Reference Model (reference 7.5).

2. Dependencies

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 (reference 7.2).

References to XML, XML namespaces, and XML Schema are to be interpreted according to Extensible markup language (reference 7.3), Namespaces in XML (reference 7.4), XML schema part 0 (reference 7.6), XML schema part 1 (reference 7.7), and XML schema part 2 (reference 7.1) respectively.

3. XML namespace

A multipage instance SHOULD use the namespace declaration: http://preservation.naa.gov.au/multipage/1.0.

A multipage instance MAY use the namespace prefix: multipage.
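As an illustration only, the following Python sketch shows how a consumer might locate page elements by namespace rather than by prefix, since the prefix is optional. It uses the standard library's xml.etree.ElementTree; the sample instance, the note element inside each page, and the function name page_labels are assumptions made for this example and are not part of the format.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from section 3 of this specification.
MULTIPAGE_NS = "http://preservation.naa.gov.au/multipage/1.0"

# Hypothetical sample instance; the <note> content is invented for illustration.
SAMPLE = """\
<multipage:multipage xmlns:multipage="http://preservation.naa.gov.au/multipage/1.0">
 <multipage:page label="Page 1"><note>first</note></multipage:page>
 <multipage:page label="Page 2"><note>second</note></multipage:page>
</multipage:multipage>
"""


def page_labels(xml_text):
    """Return the label of every page element, in document order."""
    root = ET.fromstring(xml_text)
    # ElementTree spells a qualified name as "{namespace-uri}local-name",
    # so the lookup works whatever prefix the instance chose.
    pages = root.findall(f"{{{MULTIPAGE_NS}}}page")
    return [page.get("label") for page in pages]


print(page_labels(SAMPLE))
```

Matching on the namespace URI means an instance that binds the same namespace to a different prefix is handled identically.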

4. XML Schema

A multipage instance MUST conform to the following XML Schema:

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://preservation.naa.gov.au/multipage/1.0"
           elementFormDefault="qualified"
           version="1.0">
 <xs:annotation>
  <xs:documentation xml:lang="en">
   multipage. A schema to represent electronic records that
   consist of a sequence of 'page-like' objects
   such as a multi-image TIFF.
   Developed by the National Archives of Australia. Copyright 
   Commonwealth of Australia.
  </xs:documentation>
 </xs:annotation>
 <xs:element name="multipage">
  <xs:complexType>
   <xs:sequence>
    <xs:element name="page" 
                minOccurs="0" 
                maxOccurs="unbounded">
     <xs:complexType>
      <xs:sequence>
       <xs:any namespace="##any" 
               processContents="lax"/>
      </xs:sequence>
      <xs:attribute name="id"
                    type="xs:ID"
                    use="optional"/>
      <xs:attribute name="label"
                    type="xs:string"
                    use="optional"/>
     </xs:complexType>
    </xs:element>
   </xs:sequence>
  </xs:complexType>
 </xs:element>
</xs:schema>

5. Explanation of elements and attributes

5.1 multipage element

The root element for a multipage instance. All other elements in a multipage instance MUST be contained within this element. It consists of zero or more page elements, where each page element represents a distinct object within the sequence of objects that together form a coherent document. The order of pages is significant.

XML code: <multipage>

Example fragment: A multipage element containing fragments of page 1 and page 2 of RFC 2396, URI Generic Syntax, encoded in the National Archives of Australia's plaintext document format:

<multipage>
 <page>
  <plaintext>
   <line>Berners-Lee, et. al.        Standards Track                     [Page 1]</line>
   <line/>
   <line>RFC 2396                   URI Generic Syntax                August 1998</line>
   <line/>
   <line>1. Introduction</line>
   <line/>
   <line>Uniform Resource Identifiers (URI) provide a simple and extensible</line>
   <line>means for identifying a resource. This specification of URI syntax</line>
   <line>[...]</line>
  </plaintext>
 </page>
 <page>
  <plaintext>
   <line>Berners-Lee, et. al.        Standards Track                     [Page 2]</line>
   <line/>
   <line>RFC 2396                   URI Generic Syntax                August 1998</line>
   <line/>
   <line>      The resource is the conceptual mapping to an entity or set of</line>
   <line>      entities, not necessarily the entity which corresponds to that</line>
   <line>      mapping at any particular instance in time. Thus a resource</line>
   <line>      [...]</line>
  </plaintext>
 </page>
</multipage>

Attributes: none

5.2 page element

Container for the data content of a distinct object within the instance. Each page MUST be displayed individually to an end user. Any well-formed XML is allowed, although all the XML content MUST be contained within one 'pseudo-root' element. For instance, the following 'flat' series of paragraph elements is illegal:

<page>
 <paragraph>Paragraph one</paragraph>
 <paragraph>Paragraph two</paragraph>
 <paragraph>Paragraph three</paragraph>
</page>

whereas the following series of paragraphs nested under a 'document' element is legal:

<page>
 <document>
  <paragraph>Paragraph one</paragraph>
  <paragraph>Paragraph two</paragraph>
  <paragraph>Paragraph three</paragraph>
 </document>
</page>

A page MUST NOT be empty; there MUST always be some form of well-formed XML content inside a page element.
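The two rules above (exactly one 'pseudo-root' element per page, and no empty pages) can be checked mechanically. The following Python sketch is one possible check using only the standard library; it is not a substitute for validation against the XML Schema in section 4. The function name check_pages and the wording of the messages are assumptions made for this example, and the sketch assumes that page elements are namespace-qualified as in the examples in section 8.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from section 3 of this specification.
MULTIPAGE_NS = "http://preservation.naa.gov.au/multipage/1.0"


def check_pages(xml_text):
    """Return a list of problems found; an empty list means none were found."""
    root = ET.fromstring(xml_text)
    problems = []
    if root.tag != f"{{{MULTIPAGE_NS}}}multipage":
        problems.append("root element is not multipage")
    # Iterating an element yields its child elements in document order.
    for number, page in enumerate(root, start=1):
        if page.tag != f"{{{MULTIPAGE_NS}}}page":
            problems.append(f"child {number} is not a page element")
            continue
        children = len(page)  # number of child elements; text is not counted
        if children == 0:
            problems.append(f"page {number} is empty")
        elif children > 1:
            problems.append(f"page {number} has {children} top-level elements")
    return problems
```

Applied to the 'flat' fragment shown earlier (wrapped in a multipage root), this check would report three top-level elements for the page, whereas the nested 'document' version would produce no problems.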

An application that uses the multipage format must have access to other sources of information in order to respond appropriately to the contents of a page element. If an application does not have any special processing routines for the content of the element, it MUST pass the content to any client application as unmodified XML (i.e., an application MUST NOT ignore the contents of a page element simply because it does not understand the page's XML vocabulary). The use of XML namespaces to identify the contents of a page element is strongly advised.
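The pass-through requirement above can be sketched as a dispatch on the namespace of the page's content, with a fallback that hands back the XML itself. This is a minimal Python sketch under stated assumptions: the names namespace_of and handle_page_content and the handlers mapping are hypothetical, the page is assumed to contain exactly one child element, and ElementTree re-serialises the content, so namespace prefixes and incidental whitespace may differ from the source text. An application that needs character-for-character fidelity would have to retain the original text instead.

```python
import xml.etree.ElementTree as ET


def namespace_of(element):
    """Return the namespace URI of an element, or None if it has none."""
    tag = element.tag
    # ElementTree stores qualified names as "{namespace-uri}local-name".
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else None


def handle_page_content(page, handlers):
    """Process the single child of a page element.

    handlers maps a namespace URI to a callable that accepts the content
    element. Content in an unrecognised vocabulary is never dropped: it is
    returned as serialised XML for the client application to deal with.
    """
    content = page[0]  # assumes the page holds exactly one child element
    handler = handlers.get(namespace_of(content))
    if handler is not None:
        return handler(content)
    # Fallback: no special routine is known, so pass the XML through.
    return ET.tostring(content, encoding="unicode")
```

Keying the handlers on namespace URI follows the advice above to identify page contents by XML namespace.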

A page element MAY have a value associated with its id attribute. This attribute uniquely identifies the contents of the page throughout the entire multipage instance. A page element MAY also have a value associated with its label attribute. This attribute provides a user-viewable name for the page.

XML code: <page>

Example fragment: A page element containing a fragment of an XHTML document:

<page id="n27" label="Page 1">
 <html:html xmlns:html="http://www.w3.org/1999/xhtml">
  <html:head>
   <html:title>An approach to digital preservation</html:title>
  </html:head>
  <html:body>
   <html:h1>Introduction</html:h1>
   <html:p>The entry of computer systems into the work environment 
        of organisations over the last two decades has dramatically
        altered the way in which employees work, communicate and share
        information. [...]</html:p>
  </html:body>
 </html:html>
</page>

Attributes: id, label

5.2.1 id attribute

Provides a unique identifier for a page within a multipage instance. The identifier MUST be unique throughout the entire instance. The identifier MUST be a valid XML ID.

XML code: id

Example fragment: id="n27"
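
The uniqueness requirement for id values can be illustrated with a short Python sketch. It is a partial check only: it looks at the id attributes of page elements and does not test whether each value is a syntactically valid XML ID, nor does it consider ID-typed attributes that may occur inside page content. The function name duplicate_page_ids is an assumption made for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace URI taken from section 3 of this specification.
MULTIPAGE_NS = "http://preservation.naa.gov.au/multipage/1.0"


def duplicate_page_ids(xml_text):
    """Return, in sorted order, every id value used by more than one page."""
    root = ET.fromstring(xml_text)
    ids = [
        page.get("id")
        for page in root.findall(f"{{{MULTIPAGE_NS}}}page")
        if page.get("id") is not None  # the id attribute is optional
    ]
    counts = Counter(ids)
    return sorted(value for value, seen in counts.items() if seen > 1)
```

An empty result means that no id value is shared between pages.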

5.2.2 label attribute

Provides a user-viewable name for the page. Labels can be used to represent a page number or similar tag to identify a particular page within the multipage instance to an end user. There is no need for the value of a label attribute to be unique. The value of the label attribute MUST be a string.

XML code: label

Example fragment: label="xviii"

6. Views

A compliant application MUST support the requirements of the Multipage View.

The Multipage View is the default view for a multipage instance: when a multipage instance is rendered with no other arguments it MUST be rendered according to the requirements of the Multipage View.

6.1 Multipage View

The Multipage View is designed primarily for visual devices.

The Multipage View presents the contents of a multipage instance to a user one page element at a time. The page element currently presented to the user is known as the current page.

The contents of the current page MUST be rendered according to the requirements of the XML vocabulary used in that content. If the requirements for the XML vocabulary are unknown, then the unmodified XML of the content MUST be rendered as text characters.

The value of the label attribute of the current page (if any) MUST be rendered.

The value of the id attribute of the current page (if any) MAY be rendered.

When a multipage document instance is first presented to a user by an application, the Multipage View MUST select the first page element of the instance as the current page.

The application MUST offer user interface elements that allow a user to select the first, last, next, or previous page element as the current page.
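
The navigation requirements of the Multipage View can be sketched as a small state holder in Python. This is an illustrative model and not a user interface: the class name MultipageView and its method names are assumptions made for this example. Two behaviours are also assumptions, because this specification does not define them: selecting 'next' on the last page (or 'previous' on the first) leaves the current page unchanged, and an instance with no pages is rejected. The render method always falls back to showing the content as XML text, as required when the content vocabulary is unknown.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from section 3 of this specification.
MULTIPAGE_NS = "http://preservation.naa.gov.au/multipage/1.0"


class MultipageView:
    """Tracks the current page of a multipage instance."""

    def __init__(self, xml_text):
        root = ET.fromstring(xml_text)
        self.pages = root.findall(f"{{{MULTIPAGE_NS}}}page")
        if not self.pages:
            # Assumption: this sketch does not model an instance with no pages.
            raise ValueError("instance contains no page elements")
        self.index = 0  # the first page is the current page on first display

    @property
    def current(self):
        return self.pages[self.index]

    def first(self):
        self.index = 0

    def last(self):
        self.index = len(self.pages) - 1

    def next(self):
        # Assumption: stay on the last page rather than wrapping around.
        self.index = min(self.index + 1, len(self.pages) - 1)

    def previous(self):
        # Assumption: stay on the first page rather than wrapping around.
        self.index = max(self.index - 1, 0)

    def render(self):
        """Return the label (if any) followed by the content as XML text."""
        page = self.current
        label = page.get("label")
        body = ET.tostring(page[0], encoding="unicode")
        return f"[{label}]\n{body}" if label is not None else body
```

A real application would replace the text fallback in render with vocabulary-specific rendering wherever the content's requirements are known.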

7. References

7.1 Paul V. Biron and Ashok Malhotra (editors), XML schema part 2: datatypes, 2 May 2001. (http://www.w3.org/TR/2001/REC-xmlschema-2-20010502)

7.2 S. Bradner, RFC 2119: key words for use in RFCs to indicate requirement levels, March 1997. (http://www.ietf.org/rfc/rfc2119.txt)

7.3 Tim Bray, Jean Paoli, C.M. Sperberg-McQueen and Eve Maler (editors), Extensible markup language (XML) 1.0 (second edition), 6 October 2000. (http://www.w3.org/TR/REC-xml)

7.4 Tim Bray, Dave Hollander and Andrew Layman (editors), Namespaces in XML, 14 January 1999. (http://www.w3.org/TR/1999/REC-xml-names-19990114)

7.5 Consultative Committee for Space Data Systems, CCSDS 650.0-R-2: reference model for an open archival information system (OAIS), July 2001. (http://www.ccsds.org/documents/pdf/CCSDS-650.0-R-2.pdf)

7.6 David C. Fallside (editor), XML schema part 0: primer, 2 May 2001. (http://www.w3.org/TR/2001/REC-xmlschema-0-20010502)

7.7 Henry S. Thompson, David Beech, Murray Maloney, Noah Mendelsohn (editors), XML schema part 1: structures, 2 May 2001. (http://www.w3.org/TR/2001/REC-xmlschema-1-20010502)

8. Examples

These examples are non-normative.

8.1 Example 1

A paper document consisting of three pages (two pages of text and one page of diagrams) is scanned into a document management system. The textual pages are put through an optical character recognition program that produces two text data files, and the page of diagrams is converted into a PNG data file. The two text files are then transformed into XML documents encoded in the National Archives of Australia's plaintext document format whilst the PNG image is transformed into an XML document instance encoded in the National Archives of Australia's png document format. The three XML documents are then bundled into one multipage instance for indefinite preservation.

8.1.1 Source documents

Fragments of the two plaintext document instances:

<?xml version="1.0" encoding="UTF-16"?>
<plaintext>
 <line>Introduction</line>
 <line>The entry of computer systems into the work environment of organisations</line>
 <line>over the last two decades has dramatically altered the way in which</line>
 <line>employees work, communicate and share information. [...]</line>
</plaintext>

<?xml version="1.0" encoding="UTF-16"?>
<plaintext>
 <line>Also underpinning the approach are principles developed to ensure that the</line>
 <line>performance model supports the National Archives' values of comprehensive,</line>
 <line>equitable and sustainable access to the Commonwealth's archival resources [...]</line>
</plaintext>

A fragment of the png document instance:

<?xml version="1.0" encoding="UTF-16"?>
<png>
QYFBgEDABQELlKRp2mTqZNq+e05zYlmyvur+0b1yfiuKAv/38q+3mpjZzrLsmZ07L4pyQEZ/TDC3
[...]
</png>

8.1.2 multipage instance

<?xml version="1.0" encoding="UTF-16"?>
<multipage:multipage xmlns:multipage="http://preservation.naa.gov.au/multipage/1.0">
 <multipage:page label="Page 1"
                 xmlns="http://preservation.naa.gov.au/plaintext/1.0">
  <plaintext>
   <line>Introduction</line>
   <line>The entry of computer systems into the work environment of organisations</line>
   <line>over the last two decades has dramatically altered the way in which</line>
   <line>employees work, communicate and share information. [...]</line>
  </plaintext>
 </multipage:page>
 <multipage:page label="Page 2"
                 xmlns="http://preservation.naa.gov.au/plaintext/1.0">
  <plaintext>
   <line>Also underpinning the approach are principles developed to ensure that the</line>
   <line>performance model supports the National Archives' values of comprehensive,</line>
   <line>equitable and sustainable access to the Commonwealth's archival resources [...]</line>
  </plaintext>
 </multipage:page>
 <multipage:page label="Attachment"
                 xmlns="http://preservation.naa.gov.au/png/1.0">
  <png>
  QYFBgEDABQELlKRp2mTqZNq+e05zYlmyvur+0b1yfiuKAv/38q+3mpjZzrLsmZ07L4pyQEZ/TDC3
  [...]
  </png>
 </multipage:page>
</multipage:multipage>
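
The bundling step described in this example could be automated along the following lines. This Python sketch is non-normative and makes several assumptions: the function name bundle is hypothetical, each source document is supplied as a string of well-formed XML with a single root element, and the source documents are copied into the pages as they are (the sketch does not add the plaintext or png namespace declarations shown in the instance above).

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from section 3 of this specification.
MULTIPAGE_NS = "http://preservation.naa.gov.au/multipage/1.0"


def bundle(documents):
    """Wrap source documents in a multipage instance.

    documents is a list of (label, xml_text) pairs in page order; a label
    of None produces a page element without a label attribute.
    """
    # Ask ElementTree to serialise the namespace with the 'multipage' prefix.
    ET.register_namespace("multipage", MULTIPAGE_NS)
    root = ET.Element(f"{{{MULTIPAGE_NS}}}multipage")
    for label, xml_text in documents:
        page = ET.SubElement(root, f"{{{MULTIPAGE_NS}}}page")
        if label is not None:
            page.set("label", label)
        # The root element of the source document becomes the page's
        # single 'pseudo-root' child.
        page.append(ET.fromstring(xml_text))
    return ET.tostring(root, encoding="unicode")
```

Because pages are appended in the order given, the order of the input list determines the page order of the resulting instance.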

8.2 Example 2

A multipage TIFF data file consists of two images. The file is transformed into a multipage instance where each image is represented by a png document instance (using the National Archives of Australia's png document format).

8.2.1 Source documents

Fragments of the two png instances:

<?xml version="1.0" encoding="UTF-16"?>
<png:png xmlns:png="http://preservation.naa.gov.au/png/1.0">
iVBORw0KGgoAAAANSUhEUgAAADcAAABECAIAAACqDzp+AAAIa0lEQVR4nM2aL3TjuBaHf/tewAUB
[...]
</png:png>

<?xml version="1.0" encoding="UTF-16"?>
<png:png xmlns:png="http://preservation.naa.gov.au/png/1.0">
Aj3vCBQIBAgUGBQYFBgMMBgQsCBgQMCAgAUBCwIeMBgQMMBggMGedwwGGCwIGGAwwGCAQIFBgUGB
[...]
</png:png>

8.2.2 multipage instance

<?xml version="1.0" encoding="UTF-16"?>
<multipage:multipage xmlns:multipage="http://preservation.naa.gov.au/multipage/1.0">
 <multipage:page>
  <png:png xmlns:png="http://preservation.naa.gov.au/png/1.0">
  iVBORw0KGgoAAAANSUhEUgAAADcAAABECAIAAACqDzp+AAAIa0lEQVR4nM2aL3TjuBaHf/tewAUB
  [...]
  </png:png>
 </multipage:page>
 <multipage:page>
  <png:png xmlns:png="http://preservation.naa.gov.au/png/1.0">
  Aj3vCBQIBAgUGBQYFBgMMBgQsCBgQMCAgAUBCwIeMBgQMMBggMGedwwGGCwIGGAwwGCAQIFBgUGB
  [...]
  </png:png>
 </multipage:page>
</multipage:multipage>