Python docstring markup - the "fat" specification

pytext-fat release 0.1/"and when they were up..."
(first version, for discussion)

Introduction

For the purposes of this document, I will call the format that is being documented "pytext". This is not to be taken as a final name, but I need to have something to use as a hook herein.

Edward Loper is likely to be working on a "minimal" markup specification, which I will refer to as the "thin" spec. This is thus the "fat" spec. It tries to retain those things from STpy that I believe had been sought by previous turns round the Doc-SIG loop, and which seem (to me) worthwhile.

Rationale

pytext is an attempt at a moderately minimal markup system for Python docstrings. This is something that has been vexing the Doc-SIG since 1997 or earlier. There is a definite demand for the ability to place some markup in docstrings. Some of this demand is for presentation markup (things like emphasis) and some for semantic markup (things like #python_code#).

Early on it was realised that most programmers will not use "heavyweight" markup (examples of this would be HTML, SGML or XML-derivatives, TeX variants, etc.), and Guido himself declared that this would be a bad idea [ref to be provided - email to Doc-SIG in 1997].

One alternative proposed was "setext" [ref to be provided], which was intended to be "markup for emails". From this, Jim Fulton and other Zopistas produced StructuredText (or ST).

ST has its problems - not least a very informal and imprecise definition, and an implementation that is (or so I am told) rather unpredictable. Nonetheless, it has been used to great success in a variety of Zope related arenas, by many people.

The Zopistas at Digicool have been working on a replacement, STNG, and a while back it looked as if the Doc-SIG was going to adopt a relative, developed "in-house", with a more precise definition, tentatively called STpy. Part of the motivation for this adoption was a perceived need for compatibility with what was going on within the Zope world.

Recently, however, Guido himself has intimated that he does not like ST (he appears to have been bitten by the implementation), but more importantly, that he does not regard it as necessary to maintain compatibility with the ST family. This generally throws the cat amongst the pigeons, but also makes it possible to consider some more radical changes that we had wished for.

So, on to the fat version of pytext...


Overview

pytext is primarily intended for use in Python docstrings. It may also be used for other short documents, but that is an incidental convenience. It is specifically not targeted at long documents, books, or even articles. It is intended to provide a minimal amount of markup, and relatively little control of presentation - if someone writing a docstring is worrying about presentation, they're probably worrying about the wrong thing.

pytext takes a docstring (or other short text) as input, and produces a DOM tree as output. The decision to output a DOM tree was taken so that simple processing of the output would be possible (for instance, to find all Python literals, for cross-referencing purposes) - Python itself provides both a minimal DOM implementation (minidom) and more advanced tools (via fourthought, etc. [ref]). Also, it seemed preferable to choose a well-known data structure rather than create yet another ad-hoc one. Finally, producing a DOM tree with a known DTD (specified informally in this document, and still to be finalised) means that one can mix and match pytext parsers and output formatters.

pytext carefully does not assume that it knows anything particular about the final formatter that will be used to represent its data. It is known that all of HTML, XHTML, XML, LaTeX, ST, pytext and PDF are likely to be produced in the future. It is quite possible that an application might never actually output a representation, but just use the DOM tree for some other purpose.

This decision is the other main reason for not making the markup too clever - allowing for all of the possible target systems would make this rather difficult.

As part of the work of developing this specification (and early versions, in the STpy family), I have produced a module called docutils, which provides parsing code for pytext.

Note: at the time of writing, docutils is still nearing its first alpha release, and still needs to be forced sideways from STpy to pytext - but for simplicity I will talk about it in the present tense rather than the near future...

It also supplies a simple command line driven frontend, which allows users to experiment with different possible choices that might be made as to the details of pytext (these will be described at appropriate places below), and a very simplistic (and horribly colourful) HTML formatter. It is important to remember that this utility is not a production tool; it is a testbed. Also note that if it's easy to program, it may have been left out (can I trust those on the Python lists to know what is easy to program and might thus have been left out? Let's assume so).

Note that despite the fact that pytext does not provide/require a formatter of its own, examples will be presented in this document to illustrate the sort of output that a user would be justified to expect.


The document

So, pytext takes as input a docstring or other relatively short and simple document. I shall use the term "document" for the text, for simplicity.

DOM
The top element of the DOM tree representing a document has the tag "pytext". It may have multiple children, representing the top level blocks within the document.

The document is first processed to replace all tabs, using the "normal Python conventions" (i.e., the appropriate string routine). This means that we can assume that there are no tabs left in the document. It also means that the document cannot use tabs for any specialised purpose.

The document is considered to consist of a series of lines, each containing zero or more text characters followed by a newline (i.e., the normal Python convention). The last line of the document is treated as if it possessed a newline, whether this is actually true or not.

The behaviour of non-printing characters [define more carefully] is undefined - specifically, odd things like vertical tab and so on are not considered whitespace.

A line containing only spaces, or having zero length (ignoring the newline) is a blank line. All other lines are text lines.

Each text line has an indentation - this is simply the number of spaces at the beginning of the line. If the document is a docstring, and the first text line is followed immediately by a blank line, and has an indentation of zero, then it is given the same indentation as the following text line (this is the normal procedure for docstring indentation, but phrased slightly differently than normal).

Trailing spaces are not significant, and are removed from all lines.
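
As an illustration, the line handling described above might be sketched in Python as follows. This is not taken from docutils: the name "preprocess" and its return format are invented for the example, and it assumes that "the appropriate string routine" is str.expandtabs with its default tab size.

```python
def preprocess(text, is_docstring=True):
    """Return one (indentation, content) pair per line of *text*.

    Blank lines are returned as (None, ""). Hypothetical helper, for
    illustration only.
    """
    # Assumption: "the appropriate string routine" is str.expandtabs().
    lines = text.expandtabs().split("\n")
    # A final newline terminates the last line; it does not start a new one.
    if lines and lines[-1] == "":
        lines.pop()

    result = []
    for line in lines:
        line = line.rstrip(" ")          # trailing spaces are not significant
        content = line.lstrip(" ")
        if content == "":
            result.append((None, ""))    # blank line
        else:
            result.append((len(line) - len(content), content))

    if is_docstring:
        text_lines = [i for i, (indent, _) in enumerate(result)
                      if indent is not None]
        if len(text_lines) >= 2:
            first, second = text_lines[:2]
            # Indentation zero, and followed immediately by a blank line.
            if result[first][0] == 0 and second > first + 1:
                result[first] = (result[second][0], result[first][1])
    return result
```

Only the space character is stripped or counted, so that odd characters such as vertical tab are left alone, as required above.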

Starting blocks

Text within the document is gathered into blocks. There are two forms of block - one form will be colourised (that is, it may contain markup characters and they will be interpreted), and the other form will not be colourised (that is, characters will not be interpreted as markup).

All characters which are not being interpreted as markup will be passed through untouched, and should be represented "as is" by the final formatter, with the following exceptions:

Note in particular that the characters &, <, > (all significant in HTML) are not special in pytext, and will be represented as themselves.

The following rules are used to split the document into blocks (my apologies that these involve some forwards references):

  1. The first text line in the document starts a block.
  2. A contiguous sequence of blank lines (a) ends a block, and (b) is treated as a single blank line (except in the case of literal blocks - see below).
  3. A text line after a blank line starts a block (again, with that caveat about literal blocks).
  4. A line that starts like a list item does start a list item, and thus also starts a block (except in non-colourised blocks - see literal and doctest blocks below).
  5. A line that starts like an anchor does start an anchor, and thus also starts a block (again, except in non-colourised blocks - see as above).

In practice, this does much what you would expect. For instance:

    This is a block - it is a paragraph.

    This is another paragraph - it follows a blank line.

    2. This is a list item.
       * So is this - and thus it too starts a block,
         even though it does not have a blank line in
         front of it.
    

which might be formatted as:

This is a block - it is a paragraph.

This is another paragraph - it follows a blank line.

  1. This is a list item.
    • So is this - and thus it too starts a block, even though it does not have a blank line in front of it.
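
The splitting rules above can be sketched in a few lines of Python. This is an illustration only, not the docutils code: the names are invented, the regular expression is a rough approximation of the list item and anchor syntaxes defined later in this document, and literal and doctest blocks (the caveats in rules 2 to 5) are ignored entirely.

```python
import re

# Rough approximation of "starts like a list item or an anchor".
STARTS_ITEM = re.compile(r"""
      [*+-] (?: \ + | $ )                          # unordered list item
    | (?: [A-Za-z] | [0-9]+ ) \. (?: \ + | $ )     # ordered list item
    | \.\. \[ [^\]]+ \]                            # anchor
    | \S .*? \ + --- (?: \ + | $ )                 # descriptive list item
""", re.VERBOSE)


def split_blocks(lines):
    """Group lines into blocks; each block is a list of its text lines."""
    blocks = []
    current = None                    # None means "not inside a block"
    for line in lines:
        stripped = line.strip(" ")
        if not stripped:
            current = None            # rule 2: blank lines end a block
            continue
        # Rules 1 and 3: a text line with no open block starts one.
        # Rules 4 and 5: so does anything that looks like an item or anchor.
        if current is None or STARTS_ITEM.match(stripped):
            current = []
            blocks.append(current)
        current.append(line)
    return blocks
```

Applied to the example above, this yields four blocks: the two paragraphs, the numbered list item, and the three-line bulleted item.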

Indentation and levels

Each block has an indentation associated with it. This is determined by the first line of the block. In colourised blocks, this is the only indentation that is deemed significant, and all other leading spaces are regarded as "soft" (i.e., they follow the "many spaces go to one" rule).

For non-colourised blocks, the indentation of each internal line is significant (although not quite directly), but the indentation of the block is still the indentation of the first line.

Conceptually (although not in the final DOM tree), blocks are formed into a hierarchical tree based upon their relative indentations. This happens in a manner which should be fairly familiar to a Python programmer...

Specifically, each block is assigned a level (the first paragraph arbitrarily gets level 0) as follows:

Given a block P at indentation I and level L:
  • If the following block has the same indentation, then it is at the same level, and is a sibling of P.
  • If the following block has more indentation, then it is at the next level (L+1), and is considered a child of P. The indentation corresponding to the new level is remembered.
  • If the following block has less indentation, then the corresponding level is looked up.
    If there is no corresponding level, then an error has occurred. An implementation may choose to continue after such an error, in which case the erroneous block shall be assigned the maximum level corresponding to an indentation that is less than the "bad" indentation.
    The block is added to the hierarchical structure as a sibling of the previous block with the new level, and the indentations corresponding to all greater levels are forgotten.

This makes more sense when explained with an example:

    The first paragraph is at level 0.                 [1]

    This second paragraph is also at level 0.          [2]
       1. This list item has indentation 3 and is      [3]
          at level 1.

          This paragraph has indentation 6 and is      [4]
          thus at level 2.

       This paragraph is back at level 1. The          [5]
       previous indentation for level 2 goes.

            This paragraph has indentation 8, but      [6]
            is at level 2 again.

          This paragraph has indentation 6, which      [7]
          is a mistake...
    

As we process the paragraphs we find:

paragraph  indentation  level    {level:indent}
1          0            0        {0:0}
2          0            0        {0:0}
3          3            1        {0:0, 1:3}
4          6            2        {0:0, 1:3, 2:6}
5          3            1        {0:0, 1:3}
6          8            2        {0:0, 1:3, 2:8}
7          6            invalid  {0:0, 1:3, 2:8}

Paragraph 7 is a problem because its indentation does not match any appropriate preceding indentation (it can't "see" paragraph 4). This mimics the way that Python requires indentation to be consistent.

One would expect an implementation to output some form of error for paragraph 7, and if it continued, to assign it level 1.

docutils performs in this manner, and the example formatter produces some stunningly obvious (or perhaps obnoxious) representation around the offending block. "Real" utilities should preferably be subtler...
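
For the curious, the level assignment rules can be sketched as follows. This is a minimal illustration and not the docutils code; the function name and the (level, ok) return format are invented. What is remembered after recovering from an error is not specified above, so the sketch simply forgets the deeper levels, as it would for a valid dedent.

```python
def assign_levels(indents):
    """Map block indentations (in document order) to (level, ok) pairs.

    ok is False where the indentation matches no remembered level.
    """
    results = []
    remembered = {}       # level -> indentation currently associated with it
    level = 0
    for indent in indents:
        ok = True
        if not remembered:
            remembered[0] = indent            # the first block gets level 0
        elif indent > remembered[level]:
            level += 1                        # child of the previous block
            remembered[level] = indent
        elif indent < remembered[level]:
            matching = [lv for lv, ind in remembered.items() if ind == indent]
            if matching:
                level = matching[0]
            else:
                # Error recovery: the maximum level whose indentation is
                # less than the "bad" indentation.
                ok = False
                smaller = [lv for lv, ind in remembered.items() if ind < indent]
                level = max(smaller) if smaller else 0
            # Indentations for all greater levels are forgotten.
            remembered = {lv: ind for lv, ind in remembered.items()
                          if lv <= level}
        # Equal indentation: a sibling, so nothing changes.
        results.append((level, ok))
    return results
```

Fed the indentations of the seven paragraphs in the example (0, 0, 3, 6, 3, 8, 6), it assigns levels 0, 0, 1, 2, 1, 2 and then flags the last paragraph as an error while giving it level 1.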

Thus we now have a tree structure where the level 0 blocks are children of the "document" as a whole, the level 1 blocks are their children, and so on.

Later in the document the terms "child" and "parent" will be used to identify the relationship of blocks within this hierarchy.

If you have been exposed to ST or STNG in the past, you may be worrying about the uses to which indentation is put within the document, and how important this hierarchy actually is. Please don't worry - it is used much less than in the ST family, and in a much simpler fashion.

Paragraphs

A colourised block which is not a list item, header, anchor, label or other specialised structure is (sensibly enough) termed a paragraph.

DOM
A simple paragraph is represented as a "para" element. It may not contain any other block elements.
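
As a small illustration of what such a tree looks like in practice, the following sketch builds a "pytext" document holding two "para" elements using the standard library's minidom, and then finds them again. The helper name "build_tree" is invented for the example; a real parser would of course produce the tree from the document text.

```python
from xml.dom.minidom import getDOMImplementation


def build_tree(paragraphs):
    """Build a DOM document with a "pytext" root and one "para" per string."""
    doc = getDOMImplementation().createDocument(None, "pytext", None)
    root = doc.documentElement
    for text in paragraphs:
        para = doc.createElement("para")
        para.appendChild(doc.createTextNode(text))
        root.appendChild(para)
    return doc


doc = build_tree(["This is a block - it is a paragraph.",
                  "This is another paragraph - it follows a blank line."])

# Simple processing of the output: locate every paragraph in the tree.
paras = doc.getElementsByTagName("para")
for para in paras:
    print(para.firstChild.data)
```

The same getElementsByTagName call is the sort of "simple processing" that the choice of a DOM tree is meant to make possible.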

Lists

There are three types of list in pytext - descriptive, ordered and unordered. When colourised blocks are being identified, text lines are checked for them in that order (so if there is ambiguity between a descriptive list item and an ordered list item, the list item will be descriptive).

List items do not occur in non-colourised blocks.

Descriptive lists

A descriptive list item is composed of a key (or title), followed by one or more spaces, followed by three hyphens, and then optionally followed by one or more spaces and some text.

The key may not contain a newline, but is otherwise unconstrained. The key may contain markup characters.

The three hyphens (and their delimiting spaces) are not considered to be part of the key or the text, and are thus not retained in the DOM tree.

Any child paragraphs of the descriptive list item are considered to be part of that list item.

A contiguous sequence of descriptive list items (at the same level) will be aggregated together into a single descriptive list.
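
Recognising the first line of a descriptive list item can be sketched with a regular expression. This is an illustration, not the docutils code, and the names are invented. It makes one assumption that the text above does not settle: if the three-hyphen delimiter appears more than once on the line, the last occurrence is taken as the delimiter.

```python
import re

# Assumption: when " ---" occurs more than once, the LAST occurrence is the
# delimiter; the text above does not say which one should win.
DITEM = re.compile(r"(?P<key>\S.*) +---(?: +(?P<text>.*))?$")


def parse_ditem(line):
    """Return (key, text) for a descriptive list item line, else None.

    text is None when nothing follows the three hyphens.
    """
    match = DITEM.match(line.strip(" "))
    if match is None:
        return None
    # The delimiting spaces belong to neither the key nor the text.
    return match.group("key").rstrip(" "), match.group("text")
```

A line containing only a double hyphen, such as "a dash -- like this", is not matched, which is the point of choosing three hyphens.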

For example:

    Descriptive lists look like the following:
      Key  --- and some text.

          And this paragraph is a child.

      Another key --- and some more text
      This key ---

        Has text only in a child paragraph.

      ' --- ' --- is what is used to delimit descriptive list items
    

which might be represented as:

Descriptive lists look like the following:

Key
and some text.

And this paragraph is a child.

Another key
and some more text

This key
has text only in a child paragraph.

' --- '
is what is used to delimit descriptive list items

or perhaps as:

Descriptive lists look like the following:

Key .. and some text.
  And this paragraph is a child.
Another key .. and some more text
This key .. has text only in a child paragraph.
' --- ' .. is what is used to delimit descriptive list items

Note: The ST family uses a double hyphen to delimit descriptive lists. Various people (including Guido and Edward Loper) dislike this because they use double hyphens within text -- like this. Three hyphens together is not a normal usage, so should be safe.

Are there other, better alternatives worth considering?

DOM
The key is rendered as a "key" element, and the text (if present) as a "para" element.

The example would be stored as follows:

    <dlist>
        <ditem>
            <key> Key
            <para> and some text
            <para> And this paragraph is a child
        <ditem>
        ...
      

Ordered lists

An ordered list item is composed of an enumeration, optionally followed by one or more spaces and some text.

An enumeration sequence is a single letter (upper or lower case), or a number, followed by a dot.

Any child paragraphs of the ordered list item are considered to be part of that list item.

A contiguous sequence of ordered list items (at the same level, and of the same enumeration type) will be aggregated together into a single ordered list. Note that upper case and lower case letters are not considered of the same type.
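
The aggregation rule can be sketched as follows, given the first lines of a contiguous run of ordered list items at one level. The names are invented for the illustration, and "letter" is assumed to mean an ASCII letter.

```python
import re
from itertools import groupby

# Assumption: "letter" means an ASCII letter, "number" a run of ASCII digits.
ENUMERATION = re.compile(r"(?P<enum>[A-Za-z]|[0-9]+)\.(?: +(?P<text>.*))?$")


def enum_type(enum):
    """Classify an enumeration as "number", "upper" or "lower"."""
    if enum.isdigit():
        return "number"
    return "upper" if enum.isupper() else "lower"


def group_ordered_items(first_lines):
    """Split contiguous ordered list items into lists of one enumeration type.

    Returns a list of lists of (enumeration, text) pairs.
    """
    items = []
    for line in first_lines:
        match = ENUMERATION.match(line.strip(" "))
        if match is None:
            raise ValueError("not an ordered list item: %r" % line)
        items.append((match.group("enum"), match.group("text") or ""))
    return [list(group)
            for _, group in groupby(items, key=lambda item: enum_type(item[0]))]
```

Note that the enumeration values themselves play no part in the grouping - only their type does.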

For example:

    Ordered lists look like the following:
      1. This is the first item
      3. This is the second item (yes it is)
      a. This is a new list
      b.

         This list has disjoint text.
         A. and a sublist
    

which might be represented as:

Ordered lists look like the following:

  1. This is the first item
  2. This is the second item (yes it is)
  1. This is a new list
  2. This list has disjoint text.
    1. and a sublist

DOM
The enumeration is rendered as a "sequence" attribute, and the text (if present) as a "para" element.

The example would be stored as follows:

    <olist>
        <oitem sequence="1">
            <para> This is the first item
        <oitem sequence="3">
        ...
      

Note that allowing letters in enumerations means that ambiguity is possible - for instance:

    Who am
    I. Me.
      

which will be parsed as:

Who am

  1. Me.

(that is, the "I" is interpreted as introducing a list item).

So, should we only allow numbers? One can still have the same problem, of course!

On the whole, this is something people will need to learn to work around.

Unordered lists

An unordered list item is composed of a bullet, optionally followed by one or more spaces and some text.

A bullet is one of "*", "-" or "+".

Any child paragraphs of the unordered list item are considered to be part of that list item.

A contiguous sequence of unordered list items (at the same level, and with the same bullet) will be aggregated together into a single unordered list.

For example:

    Unordered lists look like the following:
      * This is the first item
      * This is the second item
      - This is a new list
      -

         This list has disjoint text.
         * and a sublist
    

which might be represented as:

Unordered lists look like the following:

  • This is the first item
  • This is the second item
  • This is a new list
  • This list has disjoint text.
    • and a sublist

DOM
The bullet is rendered as a "bullet" attribute (of the unordered list), and the text (if present) as a "para" element.

The example would be stored as follows:

    <ulist bullet="*">
        <uitem>
            <para> This is the first item
        <uitem>
        ...
      

Note that allowing plus and minus as bullets is potentially confusing if someone is doing lots of maths. I think this is an acceptable risk.

Mixing lists

As one might hope, lists can be intermingled in the natural manner, with the obvious results.

Headings and sections

It can be useful to split a document up into named sections. Three levels of section are provided (which should be more than enough).

A block is a heading if:

  1. It is colourised.
  2. It has exactly two lines
  3. The first line contains text which is not composed entirely of "=", "-" or "~".
  4. The second line contains text which is composed entirely of (one of) "=", "-" or "~".

A level 1 heading looks like this:

    A title
    =======
    

A level 2 heading looks like this:

    A subtitle
    ----------
    

A level 3 heading looks like this:

    A subsubtitle
    ~~~~~~~~~~~~~
    

In each case, at least 2 of the "underlining" character must be present (ideally, the right number, of course, but it seems overly pedantic to check, and two seems like a reasonable compromise - it also stops the "-" case being misinterpreted as starting a list item).
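
The heading test can be sketched as below. This is an illustration only (the names are invented), and it assumes that the caller has already established that the block is colourised.

```python
HEADING_LEVELS = {"=": 1, "-": 2, "~": 3}


def heading_level(block_lines):
    """Return 1, 2 or 3 if the block is a heading, otherwise None."""
    if len(block_lines) != 2:
        return None
    title, underline = (line.strip(" ") for line in block_lines)
    # The first line must not be made up entirely of underlining characters.
    if not title or set(title) <= set(HEADING_LEVELS):
        return None
    # The second line must be at least two of ONE underlining character.
    if len(underline) < 2 or underline != underline[0] * len(underline):
        return None
    return HEADING_LEVELS.get(underline[0])
```

The length of the underline is deliberately not compared with the length of the title.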

A heading block starts a new section of the appropriate level, with the first line of the heading as its title. Level 1 is the "top" level. A section continues until another heading of the same level. The blocks within a section need not be indented more than the header block (but they should not be indented less).

The representation of sections will normally only be evident in the representation of the section title (unless, for instance, sections were relatively indented in the output format). One might expect an HTML formatter to choose (for instance) <h3>, <h4> and <h5>.

DOM
Sections are represented by "section1", "section2" and "section3" elements, which may (optionally) contain a "title" element, followed by the (top level) elements for that section. For instance:
    <section1>
        <title> A title
        <para> The first paragraph of that section.
      

If, for some reason, a user has indented a section, then that indentation should be taken as meaningful, and I think that this means that the first block of lesser indentation should end the section. Whether this is useful or not, I'm not entirely sure...

So, for instance, if the user types:

    Here is some text.

        Section 1
        ---------

        Section 1 text.

    This text is not in section 1.
      

then an implementation should treat this as:

    <para> Here is some text.
    <section2>
        <title> Section 1
        <para> Section 1 text.
    <para> This text is not in section 1.
      

If the user specifies a heading of level N before a heading for level N-1 has occurred, an untitled occurrence of a level N-1 section will be inserted "around" the level N section, just to keep the DOM tree pretty.

Headings and sections are optional within a document, and the first heading may occur at any point within the document.

Anchor blocks

Anchor blocks are colourisable blocks which start with two dots, an opening square bracket, an anchor and a closing square bracket, optionally followed by spaces and some text.

An anchor is either a sequence of one or more characters, starting with a letter or an underline, and continuing with zero or more letters, digits, underlines, hyphens or ampersands, or a number.

For instance:

    ..[Tibs] My home page is <http://www.tibsnjoan.co.uk/>
    ..[K&R] Many people regard this as the standard reference for
            the C programming language.
    ..[3] Gosh, reference number 3.
    

Anchors should be represented "as is", but omitting the initial two dots - for instance:

[Tibs] My home page is <http://www.tibsnjoan.co.uk/>

[K&R] Many people regard this as the standard reference for the C programming language.

[3] Gosh, reference number 3.

Anchors are the "far end" of local references, described below in the section on colourising.
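
The anchor syntax translates fairly directly into a regular expression. The following is a sketch, not the docutils code: the names are invented, "letter" is assumed to mean an ASCII letter, and only the first line of the anchor block is examined.

```python
import re

# Either an identifier-like name (which may also contain "-" and "&"),
# or a plain number. Assumption: letters and digits are ASCII only.
ANCHOR = re.compile(
    r"\.\.\[(?P<name>[A-Za-z_][A-Za-z0-9_&-]*|[0-9]+)\](?: +(?P<text>.*))?$")


def parse_anchor(first_line):
    """Return (name, text) if the line starts an anchor block, else None."""
    match = ANCHOR.match(first_line.strip(" "))
    if match is None:
        return None
    return match.group("name"), match.group("text") or ""
```

So "..[K&R]" and "..[3]" are accepted, whilst something like "..[3rd]" is not, since it is neither a number nor a name starting with a letter or underline.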

DOM
The anchor itself (without the square brackets) is held as an attribute on the "anchor" element. For instance:
    <anchor name="Tibs">
        <para> My home page is ...
      

Question: Should anchors support the use of the "inline" element, like labels, so that the formatter can detect that the text for an anchor fitted on one line? I'm not sure...

Label blocks

Label blocks are colourisable blocks which start with a valid label, followed by a colon, optionally followed by spaces and some text.

A label is a sequence of one or more characters, starting with a letter or an underline, and continuing with zero or more letters, digits, underlines or hyphens. Labels are case-insensitive.

A label is only valid if it is in the current set of defined labels. This defaults to:

Question: do we need/want Author(s), Version and History, given the (possible) existence of __author__, __version__ and __history__ (I believe at least the first two are fairly standard)?

If pytext is primarily intended for use within a tool such as pydoc or HappyDoc, it would make sense to use the interrogatable variables rather than the embedded-in-text values.

If an implementation finds text that looks like a label (i.e., appropriate text followed by colon and space), but is not a valid label, then it should be able to provide the user with an appropriate warning.

For each valid label, three properties are defined.

Firstly, a label must state to what DOM element it translates. It is allowed for different labels (for instance, "Author" and "Authors") to translate to the same DOM element (for instance, "author").

Secondly, some labels may be presented in either of two forms:

    Authors: Guido van Rossum and Tim Peters

    Author:
         * Guido van Rossum
         * Tim Peters
    

In the first form, the label block must be one line long, must have text after the colon (and space) and may not have children. In the second form, the label may not have text after the colon, and must have child paragraphs.

DOM
The examples above would be represented as:
    <author>
        <inline> Guido van Rossum and Tim Peters

    <author>
        <ulist bullet="*">
            <uitem> Guido van Rossum
            <uitem> Tim Peters
      

The "inline" element is used to allow the formatter to know that this label block was presented as a single line. This could equally be done by having an "inline" attribute on the label's tag, and using a "para" for the "inline" text, but that would make specifying whether a label could have inline data more complex (see below).

The implementation must provide a way of defining if the "one line" form is allowed for a particular label. It must also provide a way of specifying which elements (in the DOM sense) may be present as children of the label.

As an illustration, consider the docutils implementation. This has a dictionary which defines the valid labels and their translations:

    label_dict = {"Author":"author",
                  "Authors":"author",
                  "Arguments":"arguments"}
    

and another dictionary which indicates the allowable forms and child elements:

    label_children = {"Author":["inline","para"],
                      "Authors":["inline","para","ulist","olist"],
                      "Arguments":["dlist"]}
    

This is not a required or even recommended way of holding the data, it is merely intended as an illustration. One could imagine using a DTD for the same purpose.

An implementation should check that the child blocks of a label are valid according to their specification, and produce an appropriate warning if they are not.

Note that the child elements specified are the immediate child elements - a "dlist" may still contain "para" elements internally, for instance.
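
Continuing the illustration, a check built on those two dictionaries might look like the sketch below. The function "check_label" and the wording of its warnings are invented for the example; the dictionaries are repeated so that the fragment stands alone.

```python
label_dict = {"Author": "author",
              "Authors": "author",
              "Arguments": "arguments"}

label_children = {"Author": ["inline", "para"],
                  "Authors": ["inline", "para", "ulist", "olist"],
                  "Arguments": ["dlist"]}

# Labels are case-insensitive, so look them up by their lower-case form.
_canonical = {name.lower(): name for name in label_dict}


def check_label(label, child_elements):
    """Return (DOM element name, warnings) for a label and its children.

    child_elements lists the tags of the label's immediate children.
    The element name is None if the label is not a defined label.
    """
    name = _canonical.get(label.lower())
    if name is None:
        return None, ["%r looks like a label but is not a defined label" % label]
    allowed = label_children[name]
    warnings = ["<%s> is not allowed directly inside %s:" % (child, name)
                for child in child_elements if child not in allowed]
    return label_dict[name], warnings
```

For instance, check_label("AUTHORS", ["ulist"]) gives the element "author" with no warnings, whereas a "para" directly inside "Arguments" produces a warning.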

Why have this construct? The Doc-SIG perceived the need to allow semi-arbitrary DOM elements (well, they were talking SGML/XML at the time, but the principle stands), with some control over their content. Particular examples given were "author" and "arguments". The latter was felt to be especially important.

It would be possible to use headings instead, but the markup for headings is generally wrong for how people lay out these items, and also the constraint on content would be missing.

The matter of which characters are to be allowed in a valid label is still open for debate - there is a case for multinational labels, for instance.

Literal blocks

Literal blocks are one of the two forms of non-colourisable block.

A literal block is started when:

  1. A colourisable block ends with two colons, and
  2. it has children

The children form the literal block.

Specifically, if a colourisable block with indentation N ends with two colons, and the next block appears to be a child, then the literal block will extend until the start of a block which has indentation N or less.

It is an error for a colourisable block to end with two colons and not have any children.

Note that the two colons at the end of the parent block are replaced by a single colon after the literal block has been recognised.

Within a literal block, blank lines are retained (and the correct number of blank lines is retained). List items, anchors and labels are not recognised within a literal block.

Within a literal block, the indentation of each line is remembered.

When formatting a literal block, if a line has indentation L and the parent (colourisable) block has indentation N, then the line will be output with indentation 'L - N'.

For instance:

    This would be the paragraph introducing literal text::

        This is the first part of the literal block.


        This is another literal "paragraph" at the same indentation
        (of course, in practice it is actually still part of the
        *same* literal block, despite the blank lines above).

     *This* "paragraph" is not allowed by the Pythonic indentation
     rules, but is perfectly OK as a *literal* "paragraph", since
     it is still "under" the introductory ("main") paragraph.

    This paragraph is no longer literal, as it has the same
    indentation as the "main" paragraph.
    

which might be formatted as:

This would be the paragraph introducing literal text:

    This is the first part of the literal block.


    This is another literal "paragraph" at the same indentation
    (of course, in practice it is actually still part of the
    *same* literal block, despite the blank lines above).

 *This* "paragraph" is not allowed by the Pythonic indentation
 rules, but is perfectly OK as a *literal* "paragraph", since
 it is still "under" the introductory ("main") paragraph.

This paragraph is no longer literal, as it has the same indentation as the "main" paragraph.

If a line within the literal block has less indentation than the parent block (that is, it extends to the left of the parent's margin), then the literal block is invalid. An implementation should warn the user of the problem. A good implementation will still try to present the literal text in a suitable manner - perhaps by shifting the whole block rightwards by an appropriate number of spaces.

For instance:

    Here is some literal text::

        This is clearly the literal text,
     but its *internal* indentation is created
  to go too far to the left - naughty, naughty.
    

might be presented as if it were actually:

Here is some literal text:

      This is clearly the literal text,
   but its *internal* indentation is created
to go too far to the left - naughty, naughty.
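
The indentation handling for literal blocks, including the "shift rightwards" recovery, can be sketched as follows. The function name and return format are invented. The sketch takes the output indentation of a line to be its own indentation minus that of the parent block, which is what the examples show, and it assumes that the "appropriate number of spaces" is just enough to bring the leftmost line to the margin.

```python
def format_literal(lines, parent_indent):
    """Return (formatted_lines, shift) for the raw lines of a literal block.

    shift is the number of spaces the block had to be moved rightwards;
    a non-zero shift means the block was invalid and deserves a warning.
    """
    def indent_of(line):
        return len(line) - len(line.lstrip(" "))

    offsets = [indent_of(line) - parent_indent
               for line in lines if line.strip(" ")]
    # Assumption: shift by just enough to bring the leftmost line to column 0.
    shift = max(0, -min(offsets, default=0))

    formatted = []
    for line in lines:
        if line.strip(" "):
            width = indent_of(line) - parent_indent + shift
            formatted.append(" " * width + line.lstrip(" "))
        else:
            formatted.append("")      # blank lines are retained
    return formatted, shift
```

With the invalid example above (lines at indentations 8, 5 and 2 under a parent at indentation 4) the shift is 2, and the lines come out indented by 6, 3 and 0 spaces.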
      

There is no absolute requirement for a literal block's parent block to have any content. Specifically, it is allowed to write:

    1. This is a list item.

       ::

          This is a literal block.
    

which would produce something like:

  1. This is a list item.
       This is a literal block.
    	  

This is (a) not worth preventing, and (b) occasionally useful ((explain why)).

Doctest blocks

Doctest blocks are one of the two forms of non-colourisable block.

A doctest block starts with the characters >>>, and continues until the next blank line (or end of file).

Within a doctest block, list items, anchors and labels are not recognised.

Doctest blocks are intended to be "tested" by the Python doctest utility, whose documentation (see the Python library documentation) describes their form and purpose in detail.

Since doctest blocks are clearly valuable and since their use is encouraged, it makes sense to recognise them so they can be presented nicely in documentation (it helps that they're easy to recognise, too).

Within a doctest block, the indentation of each line is remembered.

When formatting a doctest block, if a line has indentation L and the preceding (colourisable) block has indentation N, then the line will be output with indentation 'L - N'.

For example:

    The following block is a doctest block:

        >>> 1+1
        2

        But this block isn't...
    

which might be formatted as:

The following block is a doctest block:

    >>> 1+1
    2
      

But this block isn't...

Note that literal blocks are detected before doctest blocks, so that:

    This is not a doctest block::

        >>> # maybe the bash shell?
    

can be written.

Paragraph indentation

We have seen how block indentation is used to produce a block hierarchy, which is used to identify parents and children in the context of list items, literal blocks and so on.

However, sometimes a user wishes to use indentation solely as indentation. The simplest rule that appears to work reasonably well is that if a "para" element has children, then they should be indented with respect to that "para". So, for instance:

    An ordinary paragraph.

        This is a child. It is indented.

        1. This list item is indented as well.
    

might be represented in the DOM tree as:

DOM
    <para> An ordinary paragraph.
    <indent>
        <para> This is a child...
        <olist>
            <oitem sequence="1"> This list item...
      

which might look like:

An ordinary paragraph.

    This is a child. It is indented.

      1. This list item is indented as well.

The case where a paragraph is immediately followed by an indented list item is one that deserves consideration - I am not entirely convinced that:

    This is a paragraph.
    1. And a list item
    

should be formatted differently than:

    This is a paragraph.
        1. And a list item
    

On the other hand, if the "indent" element is present in the DOM tree, the formatter might be deemed at liberty to "optimise it out" in such circumstances.
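
The parent/child rule described in this section can be sketched as follows. This is only an illustration under simplifying assumptions: blocks arrive as hypothetical (indentation, text) pairs, and every block is treated as a "para" (the wrapping of list items in "olist" is left out).

```python
def build_tree(blocks):
    """Nest each block under the nearest preceding, less-indented block."""
    root = []
    # Stack of (indentation of owning block, list receiving its children).
    stack = [(-1, root)]
    for indent, text in blocks:
        # Climb back out of any block that is not less indented than this one.
        while indent <= stack[-1][0]:
            stack.pop()
        node = {"text": text, "children": []}
        stack[-1][1].append(node)
        stack.append((indent, node["children"]))
    return root


def dump(nodes, depth=0):
    """Render the tree; children are wrapped in an <indent> element."""
    pad = "    " * depth
    lines = []
    for node in nodes:
        lines.append(f"{pad}<para> {node['text']}")
        if node["children"]:
            lines.append(f"{pad}<indent>")
            lines.extend(dump(node["children"], depth + 1))
    return lines


blocks = [(4, "An ordinary paragraph."),
          (8, "This is a child."),
          (8, "Another child.")]
print("\n".join(dump(build_tree(blocks))))
```

A formatter that wished to "optimise out" the indent element would simply skip emitting it in dump.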

Colourising

After the document structure has been determined, the colourisable blocks are, well, colourised. Specifically, legal markup is located and converted to appropriate structures in the DOM tree.

Markup detection is done in the following order:

  1. Python literals
  2. Literals
  3. References
  4. "Bare" URIs
  5. Local references
  6. Emphasis

In the first release of pytext, markup may not be nested. Thus, whilst it is not forbidden to write:

    This is an *emphasised 'literal'* - also see "*Tibs*":
    <http://www.tibsnjoan.co.uk/>.
    

the result will be:

This is an *emphasised literal* - also see *Tibs*.

and a nice implementation should attempt to warn the author that something is probably amiss.

In future versions of pytext, nested markup may become available, and at that time the fragment above would be expected to be rendered as something like:

This is an emphasised literal - also see Tibs.

(The provision of nested markup is not regarded as of high priority.)
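
To show how ordered, non-nested detection behaves, here is a minimal sketch. It uses two deliberately simplified, hypothetical patterns (literals, then emphasis) and ignores the rules about surrounding characters given in the following sections. Each pass only looks at text that no earlier pass has claimed.

```python
import re

# Simplified stand-ins for the real rules, applied in detection order.
PASSES = [
    ("literal", re.compile(r"'([^'\n]+)'")),
    ("emph", re.compile(r"\*([^*\n]+)\*")),
]


def colourise(text):
    """Split text into (kind, content) segments without nesting."""
    segments = [("text", text)]
    for tag, pattern in PASSES:
        result = []
        for kind, chunk in segments:
            if kind != "text":
                # Already marked up by an earlier pass: leave it alone.
                result.append((kind, chunk))
                continue
            pos = 0
            for match in pattern.finditer(chunk):
                if match.start() > pos:
                    result.append(("text", chunk[pos:match.start()]))
                result.append((tag, match.group(1)))
                pos = match.end()
            if pos < len(chunk):
                result.append(("text", chunk[pos:]))
        segments = result
    return segments


print(colourise("This is an *emphasised 'literal'* here"))
```

Because the literal is found first, the two asterisks end up in separate text segments and no emphasis is detected, which is the behaviour described above.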

Python literals

If there is a character before the starting hash of a Python literal, it must be space. If there is a character after the closing hash of a Python literal, it must be space, newline or punctuation (specifically, one of ).,;:!?" or ').

Python literals may not contain newlines or hashes. The first and last characters in a Python literal may not be space.

For example:

    #fred# and #'spam'#
    

might be presented as:

fred and 'spam'

DOM
    <python> fred
    <text>  and 
    <python> 'spam'
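
The rules for Python literals might be expressed as a regular expression along these lines. This is a sketch, not a definitive pattern: it takes "space" literally (so a literal straight after a newline is not matched), and the name PYTHON_LITERAL is hypothetical.

```python
import re

PYTHON_LITERAL = re.compile(
    r"(?<![^ ])"                        # start of text, or a preceding space
    r"#([^#\n ](?:[^#\n]*[^#\n ])?)#"   # no hashes or newlines, no space at either end
    r"""(?=$|[ \n).,;:!?"'])"""         # end of text, space, newline or punctuation
)

print(PYTHON_LITERAL.findall("#fred# and #'spam'#"))
```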
      

Literals

If there is a character before the starting quote of a literal, it must be space. If there is a character after the closing quote of a literal, it must be space, newline or punctuation (specifically, one of -).,;:!?").

Literals may not contain newlines or single quotes. The first and last characters in a literal may not be space.

For example:

    'fred' and '#spam'
    

might be presented as:

fred and #spam

DOM
    <literal> fred
    <text>  and 
    <literal> #spam
      

Emphasis

If there is a character before the starting asterisk of an emphasised string, it must be space. If there is a character after the closing asterisk of an emphasised string, it must be space, newline or punctuation (specifically, one of -).,;:!?" or ').

For example:

    One can emphasise *one* or *more than one*.
    

might be presented as:

One can emphasise one or more than one.

DOM
    <text> One can emphasise
    <emph> one
    <text>  or 
    <emph> more than one
    <text> .
      

Note: Within the context of docstrings, there seems little need for more than one form of emphasis.

URIs and references

URIs are delimited by < and >. This is similar to practice in email headers, and is preferred to trying to recognise URIs by themselves, because that is quite difficult to do, especially in the presence of trailing punctuation.

If there is a character before the starting < of a URI, it must be space (or double quote and colon - see below). If there is a character after the closing > of a URI, it must be space, newline or punctuation (specifically, one of -).,;:!?" or ').

URIs may be presented "bare":

    See <http://www.tibsnjoan.co.uk/>.
    

or with a representation text:

    See "Tibs-and-Joan": <http://www.tibsnjoan.co.uk/>.
    

The representation text form is written as a double quote, some text (anything except a double quote), a double quote, a colon, optional spaces and/or newlines, and a URI.

The examples should be presented as follows (if the format being output supports URI links, of course):

See http://www.tibsnjoan.co.uk/.

See Tibs-and-Joan.

Note that the < and > are not preserved, nor are the double quotes around the representation text.

DOM
The two examples would be represented as:
    <reference uri="http://www.tibsnjoan.co.uk/"> http://www.tibsnjoan.co.uk/
    <reference uri="http://www.tibsnjoan.co.uk/"> Tibs-and-Joan
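
A sketch of how both forms might be recognised with one regular expression follows. It assumes that a URI contains no whitespace or angle brackets, and that the representation text is followed by a colon as in the example; the names REFERENCE and find_references are hypothetical.

```python
import re

REFERENCE = re.compile(
    r'(?:"([^"]+)":[ \n]*'              # either: "representation text": ...
    r"|(?<![^ ]))"                      # ... or: start of text / a preceding space
    r"<([^<>\s]+)>"                     # the URI itself, between < and >
    r"""(?=$|[ \n\-).,;:!?"'])"""       # end of text, space, newline or punctuation
)


def find_references(text):
    """Return a list of dicts with the URI and the text to display."""
    refs = []
    for match in REFERENCE.finditer(text):
        label, uri = match.group(1), match.group(2)
        # A "bare" URI is displayed as itself.
        refs.append({"uri": uri, "text": uri if label is None else label})
    return refs


print(find_references('See "Tibs-and-Joan": <http://www.tibsnjoan.co.uk/>.'))
```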
      

Local references

Local references are the "other end" of anchors [reference]. They are intended to look somewhat like footnotes or citations.

A local reference is composed of an opening square bracket, an anchor, and a closing square bracket.

If there is a character before the starting bracket of a local reference, it must be space. If there is a character after the closing bracket of a local reference, it must be space, newline or punctuation (specifically, one of -).,;:!?" or ').

For example:

    My name is [Tibs]. Personally, I'm not too keen on [K&R].
    Note that local references can also be numbers [3].
    

might be presented as:

My name is [Tibs]. Personally, I'm not too keen on [K&R]. Note that local references can also be numbers [3].

DOM
The examples would be represented as:
    <localref anchor="Tibs"> [Tibs]
    <localref anchor="K&R"> [K&R]
    <localref anchor="3"> [3]
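
Local references could be located in much the same way. In this sketch the anchor is assumed to be any run of characters other than whitespace and square brackets (the precise form of an anchor is defined elsewhere in this document), and the name LOCAL_REF is hypothetical.

```python
import re

LOCAL_REF = re.compile(
    r"(?<![^ ])"                        # start of text, or a preceding space
    r"\[([^\[\]\s]+)\]"                 # [anchor] - assumed form of an anchor
    r"""(?=$|[ \n\-).,;:!?"'])"""       # end of text, space, newline or punctuation
)

text = ("My name is [Tibs]. Personally, I'm not too keen on [K&R].\n"
        "Note that local references can also be numbers [3].")
for anchor in LOCAL_REF.findall(text):
    print(f'<localref anchor="{anchor}"> [{anchor}]')
```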
      


Author: Tibs (tibs@tibsnjoan.co.uk)

Last modified: Thu Mar 29 22:41:06 BST 2001