The SARA Protocol

 

Version 1.065

SARA version 0.95

This is the second version of level 1.05 of the protocol spec to accompany the first build of software release 0.940. The specification, especially the format of the .dsc file, may change as a result of testing. Comments in square brackets refer to corrections needed before this specification is finalised.

 

Note. At the same time as the Sara indexing toolkit we shall be releasing version 1.0 of the client/server software. This will use version 1.07 of the protocol. At this point all calls that are being maintained solely for compatibility will be removed. These are marked with an asterisk below.

 

 

The SARA protocol is a client-server protocol allowing a central database of texts with SGML markup to be queried by remote clients. The protocol was designed for use with TCP, though any other network could be used. The only assumption made about the network is that it is capable of delivering null-terminated strings in the order they were sent.

 

This document describes the protocol (Section 1) and the associated query language CQL (Section 2). It does not cover the procedures required to build and set up an index, nor does it document the operation of the SARA server. Some trivial SARA related tools are described in Section 4.16.

 

1.     The protocol

 

All strings used as messages are ASCII strings. The length of strings is variable. Strings must be null-terminated; the final 0 cannot be omitted.

 

Some arguments (to be specified later) allow Unicode characters to be embedded in ASCII text. An embedded Unicode value is represented by the character ^U (ASCII 21) followed by a four character text representation of the Unicode value as a hex number.
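The embedding rule can be sketched as follows. This is a minimal illustration, not the server's implementation: the function names are hypothetical, and it assumes that the four hex digits may be upper or lower case and that only code points up to FFFF can be represented.

```python
import string

CTRL_U = "\x15"  # the ^U marker character, ASCII 21


def decode_embedded_unicode(text: str) -> str:
    """Replace each ^U + four hex digits with the character it encodes."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == CTRL_U:
            hex_digits = text[i + 1:i + 5]
            # Assumption: a malformed or truncated escape is treated as an error.
            if len(hex_digits) != 4 or any(c not in string.hexdigits for c in hex_digits):
                raise ValueError("malformed embedded Unicode value")
            out.append(chr(int(hex_digits, 16)))
            i += 5
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def encode_embedded_unicode(text: str) -> str:
    """Escape every non-ASCII character as ^U + four hex digits."""
    parts = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            parts.append(ch)
        elif code <= 0xFFFF:
            parts.append(f"{CTRL_U}{code:04X}")
        else:
            raise ValueError("code point does not fit in four hex digits")
    return "".join(parts)
```

For example, the word "café" would travel as the ASCII string "caf" followed by ^U and the four characters 00E9.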

 

All transactions consist of a message sent from the client to the server followed by a reply from the server to the client. Client messages begin with a keyword and may contain other data, depending on the keyword. Server responses begin either OK or NO. Extra data depends on the client message keyword.

 

Implementation limit: The maximum length of a client message is 6000 characters.
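The framing rules above can be sketched as a few helper functions. This is an illustrative sketch only: the function names are hypothetical, arguments are assumed to be separated by single spaces (as in the syntax lines in this section), and the 6000 character limit is assumed not to count the terminating null.

```python
MAX_CLIENT_MESSAGE = 6000  # implementation limit quoted above


def frame_message(keyword: str, *args: object) -> bytes:
    """Build a null-terminated ASCII client message."""
    text = " ".join([keyword, *map(str, args)])
    if len(text) > MAX_CLIENT_MESSAGE:
        raise ValueError("client message exceeds 6000 characters")
    return text.encode("ascii") + b"\x00"


def split_replies(buffer: bytes) -> tuple[list[str], bytes]:
    """Split received bytes into complete replies plus any unterminated tail."""
    *complete, tail = buffer.split(b"\x00")
    return [part.decode("ascii") for part in complete], tail


def parse_reply(reply: str) -> tuple[bool, str]:
    """Return (ok, rest) for a server reply that begins with OK or NO."""
    status, _, rest = reply.partition(" ")
    if status not in ("OK", "NO"):
        raise ValueError(f"unexpected reply: {reply!r}")
    return status == "OK", rest
```

A client would typically write the result of frame_message to the socket, accumulate received bytes, and hand them to split_replies until one complete reply is available.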

 

There is one exception to the rule that every client message is followed by exactly one server reply. Certain transactions are classified as interruptible. If any urgent data is available from the client socket during an interruptible transaction, the transaction is halted and the server writes the string NO ABORT to the socket. In this case there is one more client message than server reply. The data package sent to interrupt the server should contain the string INT, and the MSG_OOB flag must be specified when sending it. The behaviour of the server on receiving any other data after a read and before a write is undefined.

 

A SARA session consists of these phases:

1.    The client connects to the server and the server accepts the call.

2.    The server tries to create a process to accept data packages from the client. If it cannot do this (say because memory is short on the server) then the server closes the socket.

3.    The user logs on

4.    Setup messages are exchanged

5.    A client session takes place

6.    The user logs off

7.    The server closes the socket

 

Setup messages are messages that are used in phase 4 of this process.

 

Once a connection has been established, the server must receive packages regularly. If the timeout period elapses without any packages being received then the server will close the connection. The timeout period is determined by the server: 10 minutes is a typical value. To keep the connection alive, send any package that is not a legal server command: it is guaranteed that the command TIMER will always be illegal.

 

Note that the timeout only operates while the server is waiting to receive packages. It does not prevent the server from spending a long time in a calculation.

 

Do not send keepalives between a client read and the subsequent write: these may be interpreted as interrupts.

 

The rest of this section documents all messages. Messages marked ‘Deleted’ were in [1] but have since been removed.

 

1.1     ACSCORE

 

 

ACSCORE query l r m word

query   the query to collocate with

l           the left window

r           the right window

m         the scoring algorithm (see 1.8)

word    the word to collocate.

 

Scores a collocate.

 

Response:

OK s

 

Returns the score s.

1.2     ADICT 

 

ADICT n

n          Offset

 

Returns the nth element of the attribute dictionary specified by the most recent LOOKUPA (1.34) call.

 

Response:

OK v f

where v is the value and f its frequency.

 

1.3     BIB*

 

BIB ccc

ccc       three character code for the text

 

Any enquiry for bibliographic data starts with this message.

 

Response:

OK t n

where t indicates the type of data available and n is the number of items. Currently the assigned types are:

 

0 (written type): there are two items, a title and a description

1 (spoken type): there are multiple items, the first a title and the rest descriptions of speakers

 

NO BIB

There is no bibliographic data for the text.

 

 

 

This system is deprecated. It assumes that the server keeps a separate file of bibliographic data. Bibliographic data should preferably be embedded in the individual text headers. See 5.3. The Sara client will only send BIB strings if there are no BIB strings in the dsc file.

1.4     BIBITEM*

 

BIBITEM ccc n

ccc       Three char text id

n          Index of desired item

 

Extracts the nth bibliography string for text ccc.

 

Response:

OK s

where s is the string.

 

NO BIB

No such item

CSCORE (Deleted)

 

Obtain a collocation score. The arguments are a text string s, a number n and a CQL query q.

 

The server responds NO SYNTAX if the CQL query cannot be parsed. Otherwise it establishes the number of occurrences of the word s within n words of a solution to q. It replies OK l where l is this number.

 

1.5     CTAB

 

CTAB s n m p

s           query whose collocations are required

n          left window

m         right window

p          pattern used to restrict the words tested

 

Build the collocation table. The settings used depend on the CTABOPTIONS string last sent; the client always sends a CTABOPTIONS string just before a CTAB string.

 

The window of a query is defined as follows: for a given hit, the window of the hit is the collection of words from the nth word to the left to the mth word to the right of the first word of the hit. The window of a query is the union of the windows of the hits.
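The window definition can be made concrete with a short sketch. Word positions are modelled here as integer indices into the corpus; the function names and the clamping of the window at the corpus boundaries are assumptions made for illustration.

```python
def hit_window(first_word: int, n: int, m: int, corpus_size: int) -> range:
    """Positions from the nth word to the left to the mth word to the right
    of the first word of a hit, inclusive."""
    start = max(0, first_word - n)                  # assumption: clamp at corpus start
    stop = min(corpus_size, first_word + m + 1)     # assumption: clamp at corpus end
    return range(start, stop)


def query_window(hits: list[int], n: int, m: int, corpus_size: int) -> set[int]:
    """The window of a query is the union of the windows of its hits."""
    positions: set[int] = set()
    for first_word in hits:
        positions.update(hit_window(first_word, n, m, corpus_size))
    return positions
```

Because the query window is a union, a word position that lies inside the windows of two overlapping hits is counted only once.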

 

For a given query window there is for any word (collocate) a number, called the co-frequency, that records how frequently a form of the word occurs within the given window of the hit (for forms see 1.15). The co-frequency is transformed into a score using one of a number of rules (see 1.8). A collocation table is a list of words with their scores; the words will all match the pattern p and will be the highest scoring words found; how many words are entered into the table depends on the CTABOPTIONS in force.

 

There is at most one collocation table at any time. It must be freed before another table is built.

 

Response:

OK n

where n is the number of entries in the table.

 

NO 0

If no entries were found

1.6     CTABENTRY

 

CTABENTRY n

n          Index

 

Return the nth member of the collocation table.

 

Response:

OK {w} f sc

w         headword (collocate)

f           co-frequency

sc         score

 

See 1.5 for terminology.

1.7     CTABFREE

 

CTABFREE

 

Free the collocation table. The table must be freed before a new one is built; there can only be one collocation table at any time.

 

Response:

OK

1.8     CTABOPTIONS

 

CTABOPTIONS n m s k

 

Set options for collocations

 

n          scoring algorithm.

            0: Z score

            1: MI score

m         0 if the returns are to be limited by number, 1 if they are limited by score

s           limiting value in either case

k          frequency below which collocates should be ignored

 

As explained in 1.5 a scoring rule is a formula that transforms the frequency of a collocate into a score. Typically the purpose of a score is to distinguish words that are of high frequency because they are significant collocates of the query from words that are frequent merely because they are common.

 

In the following table:

x denotes the co-frequency of the collocate, that is, the number of times it occurs in the window around a hit

p denotes the number of occurrences of the collocate in the corpus (its frequency)

d denotes the window size multiplied by the number of hits for the query

n denotes the total size of the corpus in words.

 

It may help to point out that we would expect the collocate to occur p*d/n  times within the window. Therefore if x is significantly greater than p*d/n we have a good collocate. The actual formulae used are as follows:

 

Z-score

MI-score

 

For details of these and other scoring formulae the reader is referred to Statistics for Corpus Linguistics, Oakes, Edinburgh (1998) Chapter 4.
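As an illustration of how the quantities x, p, d and n combine, the sketch below evaluates the textbook forms of the two scores: a binomial Z-score and a base 2 mutual information score. It is an assumption that these match the formulae used by the server, which may differ in detail; the function names are hypothetical.

```python
import math


def expected_cofrequency(p: int, d: int, n: int) -> float:
    """Number of times the collocate is expected within the window: p*d/n."""
    return p * d / n


def z_score(x: int, p: int, d: int, n: int) -> float:
    """Textbook Z-score: (observed - expected) / standard deviation,
    treating each of the d window positions as a trial with probability p/n."""
    prob = p / n
    expected = prob * d
    return (x - expected) / math.sqrt(d * prob * (1 - prob))


def mi_score(x: int, p: int, d: int, n: int) -> float:
    """Textbook mutual information: log2(observed / expected).
    Undefined for x = 0 (math.log2 raises ValueError)."""
    return math.log2(x / expected_cofrequency(p, d, n))
```

Both scores grow as x rises above p*d/n, which is the sense in which they separate significant collocates from words that are merely common.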

 

 

In calculating a collocation table a word will be included if

(a)     its frequency is greater than or equal to k

(b)     either m=0 and the word is among the s collocates with highest score, or m=1 and the score of the word exceeds s.
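Rules (a) and (b) can be sketched as a selection function. The names are hypothetical, and two points are assumptions: that the frequency test (a) is applied before ranking, and that the frequency in question is supplied by the caller as a single number per collocate.

```python
from typing import NamedTuple


class Collocate(NamedTuple):
    word: str
    frequency: int
    score: float


def select_collocates(candidates: list[Collocate], m: int, s: float, k: int) -> list[Collocate]:
    """Apply the CTABOPTIONS inclusion rules and return entries by descending score."""
    frequent = [c for c in candidates if c.frequency >= k]           # rule (a)
    ranked = sorted(frequent, key=lambda c: c.score, reverse=True)
    if m == 0:
        return ranked[:int(s)]                                       # the s highest scores
    if m == 1:
        return [c for c in ranked if c.score > s]                    # score must exceed s
    raise ValueError("m must be 0 or 1")
```

With m=0 the table has a fixed maximum size; with m=1 its size depends on how many collocates clear the score threshold.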

 

Response:

OK

1.9     CTABTHIN

 

CTABTHIN s1 s2 n m w

 

s1         query to be thinned

s2         name of the new query to be created (obtained as usual using QNAME)

n          left window

m         right window

w         is the headword

 

This creates a new query consisting of those hits for s1 that have the collocate w within the stipulated window.

 

Response:

OK i j

i is the number of solutions, j the number of texts that contain a hit.

 

NO FILES

Could not create query s2.

1.10     DMATCH

 

DMATCH n

n          location in word list created by LOOKUP

 

Must follow a LOOKUP (1.33). The nth member of the wordlist created by LOOKUP is found (s) together with its frequency (f) and form count (f1).

 

Response:

OK f s {u} f1

The returned form u is obsolete. The server no longer translates character entities in the returned word.

 

The first string is the matching word in a form suitable for display while the second should be used in any subsequent queries using the word. The two forms differ, essentially, in having and not having character entities replaced. The display form may contain embedded Unicode characters.

 

1.11     DOWNLOAD

 

DOWNLOAD header

 

Get a file from the server. Currently the three files available are the corpus description file, pagefmt.txt and linefmt.txt. These files are described in Section 4.

 

The form of the message is DOWNLOAD file where file is

 

1.    header to download the corpus header. This file should be saved with the corpus name and extension .dsc.

 

2.    <filename> to download filename.txt

 

Response:

The server replies OK provided the file is available, or NO FILE if it is not.

 

Subsequent LINE messages retrieve the file, block by block.

 

Note: Files for download are stored on the server as Unix text files, with lines terminated by newline characters. When downloaded they are automatically corrected to PC text format with lines terminated by carriage return/newline pairs. If you must move files between a PC and a Unix machine by ftp rather than in SARA, ensure that ftp is set to text and not to binary.

 

 

1.12     FILTER*

 

FILTER q f

q          Query name

f           Name of filter

 

Assign a filter to a query. The arguments are the query name q and the filter name f. Filters are used to process individual solutions before returning them to the client.

 

Response:

OK

 

The following table describes the filters available.

 

1.12.1     ADJPOS

Trim solution so that no partial POS codes are transmitted.

1.12.2     ADJSGML

Trim solution so that no partial SGML tags are transmitted.

1.12.3     CMAP

Map characters as defined in 1.23.

1.12.4     NOPOSX

Delete all old-style POS entities.

1.12.5     NORMCR0

Turn single <LF> characters into <CR><LF>.

1.12.6     NORMSPACE

Normalise all white space so that sequences of white space characters become single spaces.

1.12.7     NOSGMLX

Remove all SGML markup.

1.12.8     WTOPOS

Convert new-style POS markup (w-tags and c-tags) to old style.

 

This mechanism is only supported for compatibility. The latest Sara client software performs filtering itself.

1.13     FENTRY

 

FENTRY n

n          Index

 

Get the nth entry in the frequency table. This must follow an FTAB (1.16) string creating a frequency table.

 

Response:

OK {s} k kk

 

s is the headword and k the frequency. kk is the number of forms.

1.14     FFREE

 

FFREE

 

Free the current frequency table.

 

Response:

OK

1.15     FORM

 

FORM i s

 

i           index of the form required

s           headword

 

Response:

OK {s1} d

where s1 is the form and d its frequency.

 

NO

if the headword has no forms

 

A corpus consists of words. Each word is tagged with part-of-speech and variant information[1]. Therefore for every word there exists a string of characters that is its spelling, a POS (part-of-speech) code and a variant number. Most corpora do not use variant numbers.

 

Each such word also has a headword. The headword is a sequence of strings of which the first is always the spelling of the headword and the remaining strings (optional) convey lexical information of any kind desired. Any word that has headword hw is called a form of hw.

 

The lemmatisation scheme that simply preserves the POS code and variant recorded by the indexer is called null lemmatisation. In null lemmatisation each extended word is a headword and each headword has exactly one form for each frob in which it occurs.

 

A scheme for deducing a headword from a word is called a lemmatisation rule. The indexer supports a number of different lemmatisation schemes: see 1.26 for descriptions. The decision which rules to apply to a particular corpus is taken by the indexer and the server may only access lemmatisation rules that were selected for the corpus when it was indexed.

 

Both s and s1 are extended words, that is, they are strings separated by a separator character prescribed by a lemmatisation scheme. s uses the lemmatisation scheme in force, s1 the null lemmatisation scheme. By convention the equal sign is used as the separator where possible.

 

1.16     FTAB

 

FTAB n ll ul patt

 

n          Max words in table (-1 for no limit)

ll          Lower limit: exclude words with lower than this frequency (-1 for no limit)

ul         Upper limit: exclude words with higher than this frequency (-1 for no limit)

patt      Pattern

 

Build a frequency table for words matching patt with the desired parameters.

 

If a word limit is imposed, the most frequent words that satisfy the other constraints will be selected.

 

Note that FTAB always returns headwords of the current lemmatisation rule.

 

Response:

OK n

where n is the number of words in the table.

 

Use FENTRY (1.13) to get individual entries.

 

1.17     GET*

 

GET q n s gets a single solution from query q. The solution is solution number n in sequence. s denotes the SGML element to be used to bound the solution, or an integer.

 

The reply is OK text s i0 i1 pos ss where:

 

·       text is the text identifier

·       s is the number of the sentence containing the solution, that is, the value of the n attribute of the sentence, or the string “?” if the sentence number cannot be found;

·       i0 is the offset of the solution in the returned text

·       i1 is the length of the solution

·       pos is the part-of-speech code of the solution (obsolete)

·       ss is the solution text; it may include embedded Unicode characters.

 

The string s used to specify the bounding element will usually be the name of an SGML element. However, it is permissible to supply a number of elements separated by commas. In this case, the element whose last start tag before the hit is latest in the file will be used to bound the solution. This is useful if different texts use different markup conventions.

 

If s is an integer then the amount downloaded is the smallest collection of sentences that contain the hit and s words in front of it inclusive.

 

If the text is unavailable an OK response will still be generated. Its solution text will be a message that says that the text is unavailable, its text identifier will be valid, but it can be distinguished from a genuine solution by the fact that the sentence number is set to –1.

 

The reply is NO SOL if for some other reason the solution is not available.

 

The client no longer uses GET; see GETSOL (1.23).

 

1.18     GET1SOL

 

GET1SOL txt elt att

txt        Three char text id

elt        element name

att        attribute name

 

This provides a simple way to query a text for an attribute value. The first occurrence of element elt is found and then, if att is the string the content of the element is extracted (including any markup) or otherwise the attribute with that name on the element is found and its value extracted. The value is returned.

 

Response:

OK s

s is the value.

1.19     GETHEAD

 

GETHEAD txt pos d

txt        text id

pos       offset of the desired information

d          initialises the depth count

 

GETHEAD is used to extract data from a given position for browsing.

 

If the server finds content at the specified offset, it reads all the content into a string s, sets variable bTag false, reads and discards any end tags following the content, adjusting the depth count accordingly but never allowing it to become negative, and sets newpos to the new offset and newd to the new depth.

 

If the server finds an SGML start tag, it reads the text of the tag into s. It sets variable bTag true and sets newd one greater than the depth count. It sets newpos to the offset of the end of the tag. Finally it sets jump to be the offset at which the element being opened ends.

 

Empty elements are treated as content. w-tags and s-tags are treated as content. s is trimmed of leading and trailing blanks, has spacing normalised and characters mapped.

 

Response:

OK newpos jump newd bTag s

 

If the text cannot be found the reply is

NO TEXT

 

1.20     GETHEAD2

 

GETHEAD2 txt pos i0 i1

txt        Text id as in GETHEAD

pos       Position as in GETHEAD

i0         Character offset

i1         Characters in hit

 

This call is used to locate a string whose file position is known in a string returned from GETHEAD. The location cannot be deduced without such a call because of the tidying GETHEAD performs on solutions. i0 and i1 will be recovered using LOC (1.30).

 

txt and pos are the values used to extract the solution in GETHEAD, and i0 and i1 are the coordinates of the solution returned by LOC. The server calculates the offset and length of the solution in the string (j0 and j1, say).

 

Response:

OK j0 j1

 

If the text cannot be found the reply is

NO TEXT

 

Note: Now that solution tidying is performed on the client this and LOC could be removed.

1.21     GETPOS

 

GETPOS s

s           Word

 

The server finds all possible parts of speech for the word s.

 

Response:

OK n s1...sn

where n is the number of solutions and s1...sn the different POS codes.

 

1.22     GETSC

 

GETSC s n

s           Corpus name

n          Text number

 

Get the name of the nth text.

 

Response:

If s is the corpus name:

OK s1 b

where s1 is the name of the text with this number, and b is 1 if the text is available, 0 otherwise.

 

Otherwise:

NO

 

1.23     GETSOL

 

GETSOL q n s

q          Query name

n          Sequence number

s           Scope to be retrieved.

 

Gets a single solution from query q.

 

Response:

OK text s i0 i1 pos ss

text      numerical text identifier – use GETSC to get the name

s          number of the sentence containing the solution, that is, the value of the n attribute of the sentence, or the string “?” if the sentence number cannot be found;

i0         offset of the solution in the returned text

i1         length of the solution

pos       part-of-speech code of the solution (obsolete)

ss         solution text; it may include embedded Unicode characters.

 

NO SOL

This indicates an installation error.

 

The string s used to specify the bounding element will usually be the name of an SGML element. However, it is permissible to supply a number of elements separated by commas. In this case, the element whose last start tag before the hit is latest in the file will be used to bound the solution. This is useful if different texts use different markup conventions.

 

If s is an integer then the amount downloaded is the smallest collection of default scopes that contain the hit and s words in front of it inclusive.

 

If the text is unavailable an OK response will still be generated. Its solution text will be a message that says that the text is unavailable, its text identifier will be valid, but it can be distinguished from a genuine solution by the fact that the sentence number is set to –1.

 

1.24     INFO

 

INFO cp

 

cp        number of the code page that should be used to translate character references

 

This string allows the client and server to exchange information. It is the only message that can legally be sent before the user logs on.

 

The following code pages are available:

 

850

Windows ANSI

 

 

The cp parameter is ignored by the server, which no longer always translates character references using the Unicode code page.

 

Response:

OK n v sv cv nm sc

where

 

1.       n is the server timeout value in seconds,

2.       v is the version number of the corpus description file. The DOWNLOAD message may be used to obtain files whose versions have changed.

3.       sv is the server version number multiplied by 1000

4.       cv is the smallest acceptable client version number, multiplied by 1000

5.       nm is the corpus name

6.       sc is the number of registered subcorpora
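A client might unpack the INFO response along the following lines. This is a sketch under stated assumptions: the fields are taken to be separated by white space, the corpus name is assumed to contain no spaces, and the corpus description file version is kept as a string because its format is not specified here. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ServerInfo:
    timeout_seconds: int        # n
    dsc_version: str            # v, kept verbatim
    server_version: float       # sv / 1000
    min_client_version: float   # cv / 1000
    corpus_name: str            # nm
    subcorpora: int             # sc


def parse_info_reply(reply: str) -> ServerInfo:
    """Parse a reply of the form 'OK n v sv cv nm sc'."""
    fields = reply.split()
    if len(fields) != 7 or fields[0] != "OK":
        raise ValueError(f"unexpected INFO reply: {reply!r}")
    _, n, v, sv, cv, nm, sc = fields
    return ServerInfo(int(n), v, int(sv) / 1000, int(cv) / 1000, nm, int(sc))
```

The timeout value is the natural input for scheduling keepalive packages, and the minimum client version lets a client detect that it is too old before attempting to log on.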

 

1.25     LEMMDESC

 

LEMMDESC name

name   short name of lemmatisation scheme

 

Response:

OK desc

 

If the name is not recognised

NO

 

desc is a one line description of the named scheme.

1.26     LEMMINFO

 

LEMMINFO

 

Get a string describing the current lemmatisation scheme.

 

Response:

OK s

 

Name: Null

Description: The null scheme maps each word to itself

Info: one of

            0 =

            1 = Pos

            2 = Pos=Sense

Notes: Pos is part of speech tag. Sense is variant number.

Extra parameters: none

 

Name: Bnc

Description: The bnc scheme maps each word to a headword with the same spelling and no part of speech information.

Info: 0 =

Notes: none

Extra parameters: none

 

Name: Lancaster

Description: The Lancaster scheme was developed at the University of Lancaster.

Info: 1 = Tag

Notes: Tag is a loose lexical classification

Extra parameters: c6 or c7 (the tagset)[2]

 

Name: Inline

Description: Lemmatisation from an inline attribute.

Info: 0 =

Notes: none

Extra parameters: none

 

The table shows the various lemmatisation schemes available (for terminology see 1.15). In the description strings the first token is an integer giving the number of extra lexical fields allowed in an extended word. The next character is the separator used to separate fields and the final entry consists of names for the extra fields separated by the separator.

 

Several forms occur for null lemmatisation because different corpora mark parts of speech differently (or not at all).
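The layout of a description string such as 2 = Pos=Sense can be illustrated with a small parser. This sketch assumes the three parts are separated by white space exactly as shown in the table; the function names are hypothetical.

```python
def parse_lemm_info(desc: str) -> tuple[int, str, list[str]]:
    """Split a description string into (field count, separator, field names)."""
    parts = desc.split(None, 2)
    if len(parts) < 2:
        raise ValueError(f"malformed description: {desc!r}")
    count = int(parts[0])
    separator = parts[1]
    if len(separator) != 1:
        raise ValueError("separator must be a single character")
    names = parts[2].strip().split(separator) if len(parts) == 3 else []
    if len(names) != count:
        raise ValueError("field count does not match the field names")
    return count, separator, names


def split_extended_word(word: str, separator: str) -> tuple[str, list[str]]:
    """Split an extended word into its spelling and its extra lexical fields."""
    spelling, *fields = word.split(separator)
    return spelling, fields
```

Given the separator reported for the scheme in force, a client can split any extended word returned by the server into its spelling and its extra fields.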

1.27     LEMMSEL

 

LEMMSEL name

 

Set the current lemmatisation scheme.

 

Response:

OK

 

If the name is not recognised

NO

1.28     LEMMTEST

 

LEMMTEST name word pos sense

name   name of lemmatisation scheme

word    spelling of form

pos       POS  code of form

sense   Variant number

 

Get the headword of a given form.

 

Response:

OK headword

 

If the name is not recognised

NO

 

Note: it would have been more consistent to pass the three parameters word, pos and sense as an extended word.

1.29     LINE

 

LINE

 

Get a block of a downloaded file (see 1.11).

 

Response:

OK s

s is the next block

 

The first three characters of the reply (OK and the following space) should be discarded. When there are no more blocks the reply will be

NO MORE.
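The DOWNLOAD and LINE exchange can be sketched as a loop. The transact parameter is a hypothetical callable that sends one client message and returns the server's reply as a string; joining the blocks without any separator is an assumption.

```python
from typing import Callable


def download_file(transact: Callable[[str], str], name: str = "header") -> str:
    """Fetch a file with DOWNLOAD, then collect its blocks with repeated LINE calls."""
    reply = transact(f"DOWNLOAD {name}")
    if not reply.startswith("OK"):
        raise FileNotFoundError(reply)      # for example NO FILE
    blocks = []
    while True:
        reply = transact("LINE")
        if reply.startswith("NO"):          # NO MORE ends the transfer
            break
        blocks.append(reply[3:])            # discard the first three characters
    return "".join(blocks)
```

Passing the transport in as a callable keeps the loop independent of how the socket is managed, which also makes it easy to exercise against canned replies.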

1.30     LOC

 

LOC s n

s           Query name

n          Solution index

 

Finds the location of a solution (unlike GETSOL, which gets the text).

 

Response:

OK nt nc nw

where nt is the text number, nc is the character offset of the solution and nw is the word number.

 

NO FILE

if the query name is invalid

 

Note: LOC and GETHEAD2 (1.20) were implemented to allow the text of a hit to be marked while examining a text in the tree-browse window. The normal way to recover solutions is to call GETSOL repeatedly.

 

1.31     LOG

 

LOG name pwd

name   User name

pwd     Password

 

Used to log on. Before a successful LOG the system replies NO LOGIN to any message.

 

Response:

OK followed by a copyright message

if the login succeeds.

 

NO BADLOG

otherwise.

 

In response to failure of the last allowed login attempt, the server may close the connection without a reply.

 

1.32     LOGOUT

 

LOGOUT

 

Used to log off. There is no reply.

 

1.33     LOOKUP

 

LOOKUP pattern

pattern Initial string to match to words

 

Look up a pattern in the dictionary. The pattern is just an initial string; for wildcard patterns see RLOOKUP.

 

Response:

OK n

where n is the number of words in the dictionary that begin with the string pattern.

 

If there are no words:

NO 0

 

DMATCH can be used to retrieve the words.

 

Note that folding to lower case should be left to the server.

 

1.34     LOOKUPA

 

LOOKUPA elt att

elt        element name

att        attribute name

 

Look up an attribute in an attribute dictionary, that is, a list of all values of an attribute.

 

Response:

OK n

where n is the number of values in the dictionary.

NO

if there are no values.

 

The values can be retrieved using the ADICT (1.2) call.

 

Pre 0.95 servers have no attribute dictionaries and therefore always return NO.

1.35     MAXLENGTH

 

MAXLENGTH n

n          Desired limit on solution length returned.

 

Tells the server the maximum length of a solution that may be returned.

 

Response:

OK n

where n is the limit actually set, which may be smaller than requested.

 

Implementation limit: If n exceeds 5000 it is set to 5000.

 

1.36     MOTD

 

MOTD

 

Gets the Message of the Day from the server.

 

Response:

OK s

where s is the message.

 

OPEN (Deleted)

 

Open a saved query. The argument is the query name. The reply is OK n if the operation is successful and the file contains n solutions; it is NO if the file cannot be opened.

 

1.37     PWD

 

PWD old new

old       Old password

new     New password

 

Change password.

 

Response:

OK

if the change is allowed.

 

NO

otherwise

1.38     QNAME

 

QNAME

 

Allocate a query name. There are no arguments.

 

Response:

OK s

where s is the name.

 

1.39     REMOVE

 

REMOVE s

s           Query name

 

The argument is a query that is no longer needed. The server reclaims the associated resources.

 

Response:

OK

1.40     RFREE

 

RFREE

 

Releases the resources committed by RLOOKUP. After this call there must be no RGET call until another RLOOKUP has taken place.

 

Note: Actually this does nothing at the moment, but it will be enforced in a future release.

1.41     RGET

 

RGET n

n          Index of desired dictionary entry

 

This behaves just like DMATCH (1.10) but recovers a word from the last regular expression word lookup (RLOOKUP).

Must follow a RLOOKUP (1.42). The nth member of the wordlist created by RLOOKUP is found (s) together with its frequency (f) and form count (f1).

 

Response:

OK f s {u} f1

The returned form u is obsolete. The server no longer translates character entities in the returned word.

 

1.42     RLOOKUP

 

RLOOKUP regexp

regexp Regular expression

 

Find all words matching a given regular expression. RLOOKUP stores its results in an internal buffer of limited size.

 

Response:

OK n

if there are n solutions.

 

NO 0

if there are no solutions.

 

NO TOOMANY k

if there are more solutions than can be stored (k gives the maximum).

 

Implementation limit: The largest number of matches allowed is 100000.

 

RGET may be used to recover individual solutions.

 

The following one-character regular expressions match a single character:

 

c

An ordinary character (not one of the special characters discussed below) is a one-character regular expression that matches that character.

\c

A backslash (\) followed by any special character is a one-character regular expression that matches the special character itself. The special characters are + . * [ \ (plus sign, period, asterisk, left square bracket, and backslash, respectively), which are always special, except when they appear within square brackets ([]).

.

A . (period) is a one-character regular expression that matches any character.

[string]

A non-empty string of characters enclosed in square brackets is a one-character regular expression that matches any one character in that string. If, however, the first character of the string is a ^ (a circumflex or caret), the one-character regular expression matches any character other than the remaining characters in the string. The ^ has this special meaning only if it occurs first in the string. The - (minus) may be used to indicate a range of consecutive ASCII characters; for example, [0-9] is equivalent to [0123456789]. The - loses this special meaning if it occurs first (after an initial ^, if any) or last in the string. The ] (right square bracket) does not terminate such a string when it is the first character within it (after an initial ^, if any); that is, []a-f] matches either ] (a right square bracket) or one of the letters a through f inclusive. The five characters + . * [ \ stand for themselves within such a string of characters.

 

The following rules may be used to construct regular expressions:

 

*

A regular expression followed by * (an asterisk) is a regular expression that matches zero or more occurrences of the one-character regular expression.  

+

A regular expression followed by + (a plus sign) is a  regular expression that matches one or more occurrences of the one-character regular expression.  

?

A regular expression followed by ? (a question mark) is a regular expression that matches zero or one occurrences of the one-character regular expression.  

Concatenation

The  concatenation of regular expressions is a regular expression that matches the concatenation of the strings matched by each component of the regular expression.

(  )

A regular expression enclosed in parentheses matches whatever the enclosed regular expression matches.

|

The disjunction of two regular expressions matches anything that matches either of the expressions.

 

The order of precedence of operators at the same parenthesis level is [ ] (character classes), then * + ? (closures), then concatenation, then | (alternation).

 

The regular expression evaluator can detect multiple pattern matches. Thus tea? will match both te and tea.
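Several of the rules above can be tried out with Python's re module, whose syntax happens to agree with this section for the constructs shown. This is only an illustration: the SARA evaluator is a separate implementation and should not be assumed to agree with Python elsewhere. The helper name matches is hypothetical.

```python
import re

def matches(pattern, word):
    """Return True if the whole word matches the pattern."""
    return re.fullmatch(pattern, word) is not None

# A ] placed first in a bracket expression stands for itself.
assert matches(r"[]a-f]", "]") and matches(r"[]a-f]", "c")
assert not matches(r"[]a-f]", "g")

# A leading ^ negates the class; elsewhere it is an ordinary character.
assert matches(r"[^0-9]", "x") and not matches(r"[^0-9]", "7")
assert matches(r"[0-9^]", "^")

# A backslash turns a special character into an ordinary one.
assert matches(r"a\.b", "a.b") and not matches(r"a\.b", "axb")

# Closures bind tighter than concatenation, which binds tighter than |.
assert matches(r"ab*|c", "abbb") and matches(r"ab*|c", "c")
assert not matches(r"ab*|c", "abab")

# tea? matches both te and tea, but not team.
assert [w for w in ("te", "tea", "team") if matches(r"tea?", w)] == ["te", "tea"]
```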

1.261.1     SAVE

This message changes the saved state of a named solution set. The format is SAVE b q where b is 0 to turn off saving and 1 to turn it on, and q is the query name.

1.43     SCADD

 

SCADD name n1 … nk

name   partition

ni         values

 

Add class values for texts to the named partition. k must not exceed the parameter g returned when the partition was defined. After all texts have been added, add an extra 0 identifier to the end of the list; until this is done the partition cannot be used.

 

The integer assigned to a text specifies which class of the partition the text belongs to. The names of classes are uploaded separately.

 

Response:

OK
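A client that has more class values than fit in one call has to split them across several SCADD messages. The sketch below shows one way to do this. It assumes that the terminating 0 is sent as the last integer of the final message and counts towards the limit g; the specification does not say either way. The function name scadd_messages is hypothetical.

```python
def scadd_messages(name, values, g):
    """Yield null-terminated SCADD messages carrying at most g integers each."""
    if g < 1:
        raise ValueError("g must be positive")
    # Assumption: the closing 0 travels in the last message and counts towards g.
    pending = list(values) + [0]
    for i in range(0, len(pending), g):
        chunk = pending[i:i + g]
        text = "SCADD " + name + " " + " ".join(str(v) for v in chunk)
        yield text.encode("ascii") + b"\x00"
```

For example, list(scadd_messages("genre", [1, 2, 3], 2)) produces two messages, the second of which ends with the closing 0.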

1.44     SCCOUNT

 

SCCOUNT query partition p

query               Name of the query as assigned by QNAME

partition           Partition name

p                      Class

 

Count hits that fall within the specified class of the specified partition. For compatibility with SCSET, add 1 to the class number; 0 means use each class. Note that this call does not affect the active class.

 

OK n nt

n                      Number of hits in class

nt                     Number of texts hit in class.

1.45     SCDEF

 

SCDEF name desc

name   Name for new partition

desc     Description for new partition

 

Add a new partition.

 

Response:

OK g

g is the size of the buffer allocated to allow SCADD to add class values to this partition.

 

If a partition with that name exists

NO

 

1.46     SCENUM

 

SCENUM name p n

name   partition name

p          First position required

n          Number of texts required

 

Return a range of class values for text identifiers from a partition. p denotes the index of the first one required and n the number required.

 

Response:

OK t1…tn

 

System limitation: n may not exceed 50.
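Because n may not exceed 50, a client that wants the class values for a long run of texts has to issue several SCENUM calls. The sketch below generates the messages. Whether positions are counted from 0 or 1 is not stated here, so the starting index is left to the caller; the function name scenum_requests is hypothetical.

```python
def scenum_requests(name, first, total, limit=50):
    """Yield SCENUM messages covering `total` texts starting at index `first`."""
    position, remaining = first, total
    while remaining > 0:
        count = min(remaining, limit)  # never ask for more than the system limit
        yield f"SCENUM {name} {position} {count}".encode("ascii") + b"\x00"
        position += count
        remaining -= count
```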

1.47     SCENUMNAMES

 

SCENUMNAMES name

name   Partition name

 

Return a space separated list of class names for a partition.

 

Response

OK s1…sn

 

1.48     SCLIST

 

SCLIST n

n          Index of partition

 

Get details of a partition.

 

Response:

OK name nt bReg fm fs nw nu desc

name   Name

nt         Number of texts

bReg    1 if registered 0 otherwise

fm        Mean frequency

fs          Standard deviation of frequency

nw       Number of different words

nu        Number of words

desc     Text description
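A client might unpack the SCLIST reply as sketched below. The sketch assumes that the partition name contains no spaces, that the two frequency fields are written as decimal numbers, and that everything after the seventh field is the description; none of this is stated in the specification. The names PartitionInfo and parse_sclist are hypothetical.

```python
from typing import NamedTuple, Optional

class PartitionInfo(NamedTuple):
    name: str
    texts: int
    registered: bool
    mean_frequency: float
    sd_frequency: float
    different_words: int
    words: int
    description: str

def parse_sclist(reply: str) -> Optional[PartitionInfo]:
    """Parse 'OK name nt bReg fm fs nw nu desc'; return None for any other reply."""
    fields = reply.rstrip("\x00").split(None, 8)
    if len(fields) < 8 or fields[0] != "OK":
        return None
    name, nt, breg, fm, fs, nw, nu = fields[1:8]
    desc = fields[8] if len(fields) > 8 else ""
    return PartitionInfo(name, int(nt), breg == "1",
                         float(fm), float(fs), int(nw), int(nu), desc)
```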

 

1.49     SCNAMES

SCNAMES name s1 … sn

name               Partition name

si                      ith class name

 

Set class names for a partition.

Response:

OK

1.50     SCSAVE

 

SCSAVE

 

Save the current partition in a temporary file for registration. Registration is described in the indexer documentation.

 

Response:

OK

 

NO

if file could not be created.

1.51     SCSET

 

SCSET name p

name   Partition to be activated

p          Class number to be activated

 

Activate a class; from now on all queries etc. will be evaluated in texts of this class only. For p, use one plus the class offset; if p is 0 then the whole corpus is activated.

 

Response:

OK

 

If the partition does not exist

NO
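The off-by-one convention for p is easy to get wrong, so a client may prefer to hide it behind a helper. The sketch below assumes that the class offset is counted from zero and uses None to stand for the whole corpus; the function names are hypothetical.

```python
def class_argument(offset=None):
    """Map a zero-based class offset to the p argument; None selects p = 0."""
    if offset is None:
        return 0
    if offset < 0:
        raise ValueError("class offset cannot be negative")
    return offset + 1

def scset_message(partition, offset=None):
    """Build a null-terminated SCSET message."""
    return f"SCSET {partition} {class_argument(offset)}".encode("ascii") + b"\x00"
```

The same class_argument helper could serve SCCOUNT, which uses the same numbering.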

1.52     SCWORDS

SCWORDS name p

name               Partition name

p                      Class number

 

Counts the words in all the texts of this class.

 

Response:

OK d

d is the number of words.

1.53     SOLVE*

 

SOLVE q cql

q          Query name

cql       Query

 

This is the call used to solve a CQL query. The form of the call is SOLVE q cql where q is the query name and cql is the query. The client must use a query name allocated by QNAME.

 

Response:

The reply is NO 0

if there are no solutions and

 

OK n nt

if there are n solutions occurring in nt texts.

 

NO SYNTAX

means that the query could not be parsed.

 

NO SPACE

means that the server cannot save the solution because its disk is full.

 

NO STREAMS

means that there are not enough free streams to solve the query.

 

NO FILES

means that the system has more open queries than it can handle.

 

NO ABORT

The client interrupted the operation.

 

Individual solutions may be retrieved using GETSOL.

 

SOLVE is an interruptible transaction.

 

This is maintained for compatibility for the sake of old clients still using CQL. The latest Sara Client uses SOLVEX instead.

1.54     SOLVEX

 

SOLVEX q xcql

q          Query name

xcql     Query

 

This is the call used to solve an XCQL query. The form of the call is SOLVEX q xcql where q is the query name and xcql is the query. The client must use a query name allocated by QNAME. The format of XCQL queries is documented in Section 2.

 

Response:

The reply is NO 0

if there are no solutions and

 

OK n nt

if there are n solutions occurring in nt texts.

 

NO SYNTAX

means that the query could not be parsed.

 

NO SPACE

means that the server cannot save the solution because its disk is full.

 

NO STREAMS

means that there are not enough free streams to solve the query.

 

NO FILES

means that the system has more open queries than it can handle.

 

NO ABORT

The client interrupted the operation.

 

Individual solutions may be retrieved using GETSOL.

 

SOLVEX is an interruptible transaction.
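The replies listed above for SOLVE and SOLVEX can be sorted into three groups: a count of solutions, the special case NO 0, and a named failure. The sketch below treats NO 0 as a successful query with no hits, which is a client-side design choice rather than something the specification requires. The names ERRORS and parse_solve_reply are hypothetical.

```python
ERRORS = {
    "SYNTAX": "the query could not be parsed",
    "SPACE": "the server disk is full",
    "STREAMS": "not enough free streams",
    "FILES": "too many open queries",
    "ABORT": "the client interrupted the operation",
}

def parse_solve_reply(reply):
    """Turn a SOLVE or SOLVEX reply string into a small dictionary."""
    fields = reply.rstrip("\x00").split()
    if len(fields) == 3 and fields[0] == "OK":
        return {"ok": True, "solutions": int(fields[1]), "texts": int(fields[2])}
    if len(fields) == 2 and fields[0] == "NO":
        if fields[1] == "0":
            # No solutions: reported as success with zero counts (our choice).
            return {"ok": True, "solutions": 0, "texts": 0}
        if fields[1] in ERRORS:
            return {"ok": False, "error": fields[1], "reason": ERRORS[fields[1]]}
    raise ValueError(f"unexpected reply: {reply!r}")
```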

 

1.55     SQTABLE

 

Thins a query to a specified list of solutions. The syntax is

 

SQTABLE q1 q2 n1 ... nm

 

q1                    Initial query

q2                    Thinned result

n1...nm             Indices of the solutions to be retained

 


 

Response:

 

OK m mt

 

where m should be the number of solutions requested

and mt will be the number of texts represented.

 

NO FILES

Cannot create query q2

 

NO 0

m=0

 

Implementation limit: m cannot exceed 1000.

 

Note: Recall also the general limit on the length of client messages mentioned earlier.
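A client can check the implementation limit before sending the request. The sketch below enforces only the limit of 1000 indices; the general message-length limit is not checked because its value is not given in this section. The function name sqtable_message is hypothetical.

```python
def sqtable_message(q1, q2, indices, limit=1000):
    """Build a null-terminated SQTABLE message for the given solution indices."""
    indices = list(indices)
    if len(indices) > limit:
        raise ValueError(f"at most {limit} solutions may be retained")
    parts = ["SQTABLE", q1, q2] + [str(i) for i in indices]
    return " ".join(parts).encode("ascii") + b"\x00"
```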

1.56     SUBCORPUS

 

SUBCORPUS name

name   Name of corpus

 

Get details of a corpus.

 

Response:

OK nt fm fs nw nu

nt         Number of texts

fm        Mean frequency

fs          Standard deviation of frequency

nw       Number of different words

nu        Number of words

 

SUBCORPUS name is the required form, but at present name must be the corpus name.

 

SUBCORPUS is not the right name for this.

1.57     THIN

 

Cut a solution set down using one of a number of criteria. The general form is

 

THIN name newname method size seed

 

name               Query name

newname          Name for the thinned query set

method            Integer saying how thinning is to be performed

size                  Desired number of solutions

seed                 Seed for random thinning

 


 

Available methods are:

 

0: thin to initial segment of solutions

1: thin to random subset using seed to seed the random number generator

2: thin to one solution per text. In this case the size parameter is ignored.

 

Response:

OK n m

where n is the number of solutions after thinning and m the number of texts represented in the result.

 

NO FILES

if the new query file cannot be created.
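The three thinning methods can be given names on the client side so that the integer codes do not appear in calling code. The sketch below assumes that all five fields are always sent, even when a method ignores size or seed; the specification does not say whether they may be omitted. The names ThinMethod and thin_message are hypothetical.

```python
from enum import IntEnum

class ThinMethod(IntEnum):
    INITIAL_SEGMENT = 0   # keep the first `size` solutions
    RANDOM_SUBSET = 1     # keep a random subset chosen using `seed`
    ONE_PER_TEXT = 2      # keep one solution per text; `size` is ignored

def thin_message(name, newname, method, size, seed=0):
    """Build a null-terminated THIN message."""
    method = ThinMethod(method)  # raises ValueError for an unknown method
    text = f"THIN {name} {newname} {int(method)} {size} {seed}"
    return text.encode("ascii") + b"\x00"
```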

 

1.58     WEB

 

WEB

 

If the server advertises a web site it will return the URL.

 

Response:

OK url

if there is a site

 

NO

otherwise

 

where url is the web site URL.

This setting will be moved to the dsc file.

1.59     Obsolete strings

 

The following are now withdrawn and will give the protocol error

 

NO DELETED

 

CHAR

CSCORE

CUT

DIR

GETCHEAD

GETDTD

OPEN

PURGE

SAVE

SORT

SORTFILTER

TRACE

WORDLIST

 

FILTER and GET are supported for compatibility but will be withdrawn in a future release; see note at the head of this document.

2.     CQL query structure

In this section we shall explain the structure of CQL queries.

 

<!ENTITY % anyq "seq|or|lemma|form|pos|phrase|word|element|pattern">

<!ELEMENT cql (%anyq;|scope)>

<!ELEMENT seq (%anyq;|neg|all)+>

<!ELEMENT or (%anyq;)+>

<!ELEMENT and (%anyq;)+>

<!ELEMENT prod (%anyq;)+>

<!ELEMENT bprod (%anyq;)+>

<!ELEMENT neg (%anyq;)>

<!ELEMENT all EMPTY>

<!ELEMENT lemma (#PCDATA)>

<!ELEMENT form (#PCDATA)>

<!ELEMENT span EMPTY>

<!ATTLIST span

size NMTOKEN #REQUIRED>

<!ELEMENT scope ((%anyq;|prod|bprod),(element|span))>

<!ELEMENT poscode EMPTY>

<!ATTLIST poscode

tag NMTOKEN #REQUIRED>

<!ELEMENT pos (word|all,poscode)>

<!ELEMENT phrase (#PCDATA)>

<!ATTLIST phrase

case (yes|no) "no"

header (yes|no) "no">

<!ELEMENT word (#PCDATA)>

<!ELEMENT element (attribute)*>

<!ATTLIST element

end (yes|no) "no"

name NMTOKEN #REQUIRED>

<!ELEMENT attribute (#PCDATA)>

<!ATTLIST attribute

name NMTOKEN #REQUIRED

var (yes|no) "no">

<!ELEMENT pattern (#PCDATA)>

 

Any CQL query is either a combined or atomic query or a scope query. Any query identifies a number of hits. A hit is a range of adjacent characters from the same text; for each query we shall define it by its start and end.

2.1     Atomic queries

Atomic queries are not made up of smaller queries, though they may have components.

 

Before explaining the details of the CQL language we remind the reader of the Sara lemmatisation system. It is important to distinguish between searching for a word, for a headword and for a form. These terms have already been defined (1.15).

 

The search for a word will find all usages in the corpus that have exactly the same spelling as the word in the query.

 

The search for a headword will find all usages in the corpus that are lemmata of the headword in the query.

 

The search for a form will find all usages in the corpus that have the same spelling, postag and variant number as in the query. Note that in a corpus with neither pos tagging nor lemmatisation this is just an l-word.

 

In each atomic query, the start and the end of a hit identify the location (text and offset) of the start and end of the token matched.

2.1.1     Word query

Syntax: <word  [case=”yes”] [header=”yes”]>spelling</word>

 

This will find every usage with the given spelling, irrespective of case unless case=”yes” is specified. It will not find usages in the headers of texts unless header=”yes” is specified.

 

In a word query the user takes responsibility for ensuring that the word supplied is what is elsewhere called an l-word, that is, a token in whatever tokenisation scheme was used by the indexer. For example, if the character sequence can’t is tokenised can ‘t then it will never be found in a word query.

 

Example: <word>godly</word>

 

Word queries are not used by Sara servers after 0.93, but they are indirectly used in phrase queries.

 

2.1.2     Headword query

Syntax: <lemma>headword</lemma>

 

This will find every usage that is a lemma of the given headword. The form of the headword is prescribed by the lemmatisation scheme in force. In the null lemmatisation scheme a headword is precisely the same as a form, but other lemmatisation schemes group various forms under a single headword.

 

Example: <lemma>find=VERB</lemma>

The example uses Lancaster lemmatisation and will find forms of find such as found that are verbs. In the text:

We found many pieces of bone. The finds were deposited in the Ashmolean Museum

it will find a hit in the first but not the second sentence.

 

2.1.3     Form query

Syntax: <form>form</form>

 

This will find only usages with the prescribed spelling, pos tag and variant number.

 

Note that the required format may be determined by a client by using the LEMMDESC query for null lemmatisation (see 1.25).

 

Example:

<form>rat=NN1</form>

2.1.4     POS query

Syntax: <pos><word>spelling</word><poscode tag="postag"/></pos>

<pos><all/><poscode tag="postag"/></pos>

 

This is just a simplified form query that is independent of headword/form format. In BNC2 the query

 

<pos><word>rat</word><poscode tag="NN1"/></pos>

 

is just the same as

 

<form>rat=NN1</form>

 

This query is still used by server 0.95 but will become obsolete. However the all pos form has no similar replacement.

 

2.1.5     Pattern query

Syntax: <pattern>regexp</pattern>

 

The solutions are all usages whose spellings match the given pattern. The syntax of regular expressions was given in 1.42.

2.1.6     All query

Syntax: <all/>

 

This query hits all words. It can only be used in seq queries.

 

2.1.7     Phrase query

Syntax: <phrase [case=”yes”] [header=”yes”]>chars</phrase>

 

The string chars is parsed into tokens and then a seq query for the resulting words is performed. For example

 

<phrase case=”yes”>French King</phrase>

 

is just the same as

 

<seq><word case=”yes”>French</word><word case=”yes”>King</word></seq>

 

Note: If the corpus doesn’t use part of speech tags then the same tokenisation rules will be applied to the phrase as were applied by the indexer. This means that each sequence of tokens matching the phrase will be found provided they start at a token boundary. If part of speech tags are used then it is the responsibility of the author of the DSC file to see that the tokenisation rules give the same results as the pos tags. The only exception to this is that phrase queries automatically allow for BNC style frobs.

 

In a phrase query the token _ matches any word. It may not occur at the start or the end of the phrase.

 

It is always possible to prescribe exactly the tokenisation of a character sequence by using a seq query.
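The expansion of a phrase into a seq query can be sketched as follows. The sketch splits on white space only, whereas the server applies the tokenisation rules of the corpus, and it assumes that the _ token corresponds to an all query inside the sequence; that mapping is an inference and is not stated in the specification. The function name phrase_to_seq is hypothetical.

```python
import xml.etree.ElementTree as ET

def phrase_to_seq(chars, case=False):
    """Expand a phrase into the equivalent seq query (simplified tokenisation)."""
    tokens = chars.split()
    if not tokens:
        raise ValueError("empty phrase")
    if tokens[0] == "_" or tokens[-1] == "_":
        raise ValueError("_ may not start or end a phrase")
    seq = ET.Element("seq")
    for token in tokens:
        if token == "_":
            ET.SubElement(seq, "all")       # assumption: _ behaves like <all/>
        else:
            word = ET.SubElement(seq, "word")
            if case:
                word.set("case", "yes")
            word.text = token
    return ET.tostring(seq, encoding="unicode")
```

Calling phrase_to_seq("French King", case=True) reproduces the seq query given in the example above, written with straight quotes.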

2.1.8     SGML queries

Syntax: <element name=”eee”><attribute name=”aa1”>val1</attribute>…<attribute name=”aan”>valn</attribute></element>

 

This will find markup with the prescribed element name and attribute/value pairs. For example, to search for

 

<header rend=”it”>

 

use

 

<element name=”header”><attribute name=”rend”>it</attribute></element>

 

There are two extra attributes available:

 

  • Adding end=”yes” to an element causes a search for the end tag. In this case attributes must not be specified.
  • Adding var=”yes” to an attribute makes a variable with the supplied value as name. In a search this does not limit the value of the attribute except insofar as in a match each occurrence of a given variable in the query must match the same value of the attribute. Variables only work if the attribute in question is indexed as ID or REFID (see Section 3) and variable names must be integers.

 

When matching markup, the order of attributes is not significant.
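An element query can be assembled with a standard XML library rather than by string concatenation, which avoids quoting mistakes in attribute values. The sketch below covers start-tag and end-tag searches but not variables; the function name element_query is hypothetical.

```python
import xml.etree.ElementTree as ET

def element_query(name, attributes=None, end=False):
    """Build an element query; attributes maps attribute names to values."""
    if end and attributes:
        raise ValueError("attributes must not be specified with end=yes")
    elem = ET.Element("element", {"name": name})
    if end:
        elem.set("end", "yes")
    for att, value in (attributes or {}).items():
        child = ET.SubElement(elem, "attribute", {"name": att})
        child.text = value
    return ET.tostring(elem, encoding="unicode")
```

Calling element_query("header", {"rend": "it"}) reproduces the example query above, written with straight quotes.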

 

2.2     Combining queries

2.2.1     Seq query

Syntax: <seq>op1…opn</seq>

 

This finds a sequence of hits for the n queries specified.

 

Two hits hit1 and hit2 are in sequence provided hit1 precedes hit2 and there are no words in between them. Intervening markup is allowed.

 

The arguments can be any atomic or combined query.

 

The start of a hit is the start of the hit for op1. The end of the hit is the end of the hit for opn.

2.2.2     Neg query

Syntax: <neg>op</neg>

 

This finds a word that is not a hit for the query op.

 

Neg queries only occur in seq queries and may not occur at the start or the end.

 

 

2.2.3     Or query

Syntax: <or>op1 … opn</or>

 

This finds the first occurrence of any of the specified queries (that is, the one whose start occurs first).

 

The arguments can be any atomic or combined query.

 

The start and end of the hit are the start and end of whichever query occurred first.

2.3     Scoping queries

Syntax: <scope>op span</scope>

 

This finds all solutions to op within a given span.

 

A span can be specified either as a number of words or as an element query. A numeric span is written <span size=”n”>.

 

Scoping a query restricts all elements of the solution to fall within a given area. The area may be specified as an SGML element or as a number of words.

 

Op can be any atomic or combined query, or an ordered or unordered product.
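A scope query wraps an existing query together with either a numeric span or an element query. The sketch below takes the inner query as an XML string and accepts either an integer (number of words) or an element name for the span. It does not add the outer cql element declared in the DTD, and the function name scope_query is hypothetical.

```python
import xml.etree.ElementTree as ET

def scope_query(op_xml, span):
    """Wrap op_xml in a scope; span is a word count (int) or an element name (str)."""
    scope = ET.Element("scope")
    scope.append(ET.fromstring(op_xml))
    if isinstance(span, int):
        if span < 1:
            raise ValueError("span size must be positive")
        ET.SubElement(scope, "span", {"size": str(span)})
    else:
        ET.SubElement(scope, "element", {"name": span})
    return ET.tostring(scope, encoding="unicode")
```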

 

2.3.1     Atomic or combined query

If the span is an element this will find all hits whose start is preceded by the end of a hit for the element query whose matching end token occurs after the end of the query hit.

 

If the span is a number n then the query finds all hits in which fewer than n words occur between the start and the end of the hit.

 

2.3.2     Ordered product

An ordered product is written <prod>op1…opn</prod> where op1…opn are atomic or combined queries.

 

An ordered product finds all hits for opn where either

  • the scope is an element query, and the start of the hit for opn is preceded by the end of a hit for the element query whose corresponding end tag follows the end of the hit for opn, and there is a sequence of hits for op1 …opn-1 which is ordered in the sense that the end of one hit precedes the start of the next hit, where the start of the hit for op1 follows the end of the element hit and the end of the hit for opn-1 precedes the hit for opn; or
  • the scope is a number x and there is a sequence of hits for op1 …opn-1 which is ordered in the sense that the end of one hit precedes the start of the next hit, where the end of the hit for opn-1 precedes the hit for opn and the number of words between the start of the hit for op1 and the end of the hit for opn is fewer than x.

 

The start and end of the hit are as for opn.

2.3.3     Unordered product

An unordered product is written <bprod>op1…opn</bprod> where op1…opn are atomic or combined queries.

 

An unordered product finds all hits for opn where either

  • the scope is an element query, and the start of the hit for opn is preceded by the end of a hit for the element query whose corresponding end tag follows the end of the hit for opn, and there is a sequence of hits for op1 …opn-1 where the start of each hit for opi follows the end of the element hit and the end of the hit for opi precedes the start of the end element hit; or
  • the scope is a number x and there is a sequence of hits for op1 …opn-1 such that, letting m be the start of the earliest hit opi and mm the end of the latest hit, there are fewer than x words between m and mm.

 

The start and end of the hit are the start and end of whichever of the hits for opi starts latest.

3.     How attributes are indexed

 

The general approach to indexing attributes is to index the value of each attribute as a text string. There are a few cases, however, where special treatment is required. These special cases are explained in this section. The treatment of attribute values in SARA emerged in an extremely ad hoc fashion and it is not difficult to see in retrospect how the whole apparatus could be simplified.

 

Certain attributes are declared as plural. When an attribute is plural, the indexer treats its value as a string of values, decomposes the string into a list of values and indexes each of these separately.

 

Individual attributes are treated in a number of different ways.

 

CDATA

The attribute value is indexed just as it is.

CAT

The attribute value is indexed in upper case

NUMBER

The attribute value is indexed in upper case

NAME

The attribute value is indexed in upper case

ID

The attribute root is indexed. The value is stored as position data

REFID

The attribute root is indexed. The value is stored as position data.

NULL

Suppresses indexing altogether

MULTID

The attribute root is indexed. The value is stored as position data.

MULTIDREF

The attribute root is indexed. The value is stored as position data.

 

The terminology is most misleading. An ID attribute can have any name. A REFID attribute can have any name but must refer to an attribute of another element called ID.

 

The value of ID and similar attributes is composed from a text identifier, the root and an integer using a shifting set of rules, radices etc.

 

This is not the place to explain the reason for storing ID values as position data.

 

4.     File formats

 

As of version 0.930 of the client/server software, all elements of the SARA package (that is, the indexer, the server and the client) use one common file to access information about a BNC-style corpus. The files elements.txt and header.txt, as well as cdif.txt and various parameter files only used by the indexer, may be deleted.

 

This new file is called the corpus description file. Its name must be the corpus name and its extension must be dsc. The following corpus names are in use:

 

bnc1

The main corpus

bncsam1

The old C6 sampler

bncsam2

The new C6 sampler

bnc2

World edition

 

The server tells the client the name of the corpus it uses as part of the INFO exchange (see 1.23).

 

The description file consists of a number of lines each with a keyword and an argument string. Arguments are separated by blanks. Lines beginning with the character # are treated as comments.
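The line structure just described is simple enough to read with a few lines of code. The sketch below returns each statement as a keyword and its argument string. Since the examples in this section write keywords in both upper and lower case, it assumes keywords are case-insensitive and normalises them to upper case. It also applies the rule from 4.1 that VER must be the first line. The function name parse_dsc is hypothetical.

```python
def parse_dsc(text):
    """Split a corpus description file into (KEYWORD, argument string) pairs."""
    lines = text.splitlines()
    first = lines[0].split() if lines else []
    if not first or first[0].upper() != "VER":
        raise ValueError("VER must be the first line of the file")
    statements = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        keyword = parts[0].upper()  # assumption: keywords are case-insensitive
        argument = parts[1].strip() if len(parts) > 1 else ""
        statements.append((keyword, argument))
    return statements
```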

 

The rest of this section lists the keywords supported.

 

4.1     VER n

[All]

 

n is the version number multiplied by 100. This must be the first line of the file and may not be preceded by a comment. With software version 0.930 all header versions have been set to 1.00.

 

4.2     OPTION o

 

o is an option string. The option strings in use are listed below.

 

4.2.1     crisspace

Tells the client that a newline is to count as a space.

4.2.2     namecase

Tells the indexer to treat SGML element and attribute names as case sensitive.

4.2.3     noposindex

Tells the indexer that the corpus does not contain BNC-style pos tags.

 

 

4.3     LABEL e/a

[Server]

 

Sets the label to be used to identify the position of a hit in the corpus. e must be an element name and a an attribute of that element. The server finds the label by searching for the most recent occurrence of the element/attribute pair before the hit and using the value of the specified attribute.

 

4.4     SCOPE s

[Client]

 

Sets one of the search scopes used by the client in GET requests, qv. The first scope statement should be the smallest scope, called the default scope. Sara will always try to trim solutions to multiples of this scope. The last scope statement should represent a whole text. No more than three scopes are allowed. The scopes are referred to in the client as sentence, paragraph and maximum.

4.5     ATT s1 n d desc

[All]

 

s1 is an attribute of the last declared element. n must be one of the following strings

 

Type                  Use of d field

CDATA               Unused

CAT                   Alternatives as in ATTRIB declaration

NUMBER            Unused

NAME                Unused

REFID                Referenced element

ID                      Root

NULL                 Unused

MULTID             Root

MULTIDREFS      Root

 

The terminology of this table is explained in Section 3.

 

Example:

att default CAT YES|NO Whether a default is available

Note that attributes must be listed after the element to which they belong. Attributes listed before the first ELT statement are treated as global, that is, they may belong to any element.

4.6     CHAR name n c

[All]

 

Declares a character entity. For example

 

char yacute 253

 

If the number n is present then it must denote the Unicode character value of the character as a hex number. Note that the indexer replaces character entities with replacement values less than 256 by the characters themselves.

 

If the number is not present, as in

 

char Ycirc

 

then the entity will appear in SGML entity notation.

 

c has the same force as in a LEX statement. If omitted the entity is treated as a character.

 

4.7     LEX name c

 

Declares how a character or character entity should be treated in tokenisation. name should be either a single character or a character entity already defined in a CHAR statement.

 

c          Meaning

c          Treat as letter character

p          Treat as punctuation

s          Treat as space

 

 

Example

 

lex ‘ c

 

This causes a quote mark to be treated as a letter character.

 

It is only necessary to make this declaration if the character is to be treated otherwise than in the Sara default character table.

 

Note that this table is used by the indexer unless explicit word division is used in a corpus (as it is in BNC). However, this table is always used to parse phrase queries. If explicit word division is in force it is important to edit this table to bring it as close as possible to the actual tokenisation system used.
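The effect of a LEX declaration on tokenisation can be illustrated with a toy tokeniser. Everything in the sketch beyond the three classes c, p and s is an assumption: the default table (alphanumerics are letters, white space is space, everything else is punctuation) is invented for the example and is not the Sara default character table, and treating each punctuation character as a token of its own is likewise a guess. The function names are hypothetical.

```python
def char_class(ch, overrides):
    """Return 'c', 'p' or 's' for a character, consulting LEX-style overrides first."""
    if ch in overrides:
        return overrides[ch]
    if ch.isspace():
        return "s"
    if ch.isalnum():
        return "c"
    return "p"  # assumed default for everything else

def tokenize(text, overrides=None):
    """Split text into tokens using the character classes above."""
    overrides = overrides or {}
    tokens, word = [], ""
    for ch in text:
        kind = char_class(ch, overrides)
        if kind == "c":
            word += ch
            continue
        if word:                 # a non-letter ends the current word
            tokens.append(word)
            word = ""
        if kind == "p":
            tokens.append(ch)    # assumption: punctuation is its own token
    if word:
        tokens.append(word)
    return tokens
```

With no overrides the string can't stop. splits at the quote mark; passing {"'": "c"}, the equivalent of the lex declaration in the example above, keeps can't together as one token.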

4.8     LC name1 name2

[Indexer]

 

Declares a lower case form for a character; name1 and name2 must be characters, or character entities already declared in CHAR statements.

 

Example

 

lc &AElig; &aelig;

 

It is not possible to change the case mapping of ASCII characters.

4.9     ELT s n h desc

[All]

 

s is an element of type n. The values of types n are:

 

0          Empty element

e          Non-empty element

s          s-type element

 

By an s-type element is meant an element with omitted end tags where the end tag is always found immediately before the next start tag of the same kind, or at the end of the document if there are no more such start tags.

 

h is a string of flags as follows:

 

·         b if the tag appears in text bodies only

·         e if the tag should be preceded by a line break in page view

·         i if the tag is internal. This causes the structure browser to treat this tag as content.

·         h if it only appears in headers.

·         t if the tag is transparent, that is, if it may occur within a word without breaking the word.

 

desc is a brief description of the element.

 

Example:

elt locale e h description of a place where speech recorded

 

4.10     ENT name s

[All]

 

Declares a special entity and its printed form. For example

 

ent alien [alien]

 

Special entities are always displayed in their printed form when non-SGML output is required.

 

4.11     ITEM

[Client]

 

See MENU.

 

4.12     MENU k shortname

[Client]

 

List a menu to be used to solicit the value of a parameter. All menus must begin with a line such as

 

MENU 0 spoken_class

 

What follows depends on the value of the parameter k. If it is 0 then a number of item statements such as

 

ITEM 1 AB

ITEM 2 C1

ITEM 3 C2

ITEM 4 DE

 

give the actual parameter values and their menu equivalents.

 

If k is non-zero then it must refer to an enumeration declared in a TYPE statement. Note however that the value k=1 is reserved.

 

Note: as of software version 0.930 any attribute may have a menu. Declaring an empty menu for an attribute suppresses its display.

 

A MENU statement must directly follow the attribute it serves.

4.13     POS n desc

[All]

 

A POS statement lists a part-of-speech code. For example

 

POS VBD past form of the verb "BE" , i.e. WAS, WERE

 

4.14     PUN n desc

[All]

 

A PUN statement lists a part-of-speech code that is classed as punctuation. For example

 

PUN PUL left bracket (i.e. ( or [ )

 

4.15     TYPE n

[Client]

 

n must be an integer greater than 1. The items that follow the statement are just the same as after a MENU statement. The only point of the type statement is that it saves having to list the same enumeration of items (eg ISO country codes) in several places. The integer n is used in a MENU statement to show that the items from type n should be used.

[Note that the break table in the client is still hard-coded]

 

Need to add linefmt etc here!

4.16     HIDE e a

[Client]

 

Prevents element e with attribute a from appearing in selection menus in the client.

4.17     RADIX n

[All]

 

Sets the radix for subsequent CHAR statements to n.

 

4.18     BIB s

[Client]

 

This declares element s to be a bibliography element. Bibliographic data for a text is generated by concatenating in sequence the content of each bibliography element in the text, having transformed that content using the bibfmt format (see 5.3).

 

4.19     COL t w e a f

[Client]

 

This adds a column to the list of texts. Like bibliographic data, data in the table of texts is extracted from element content and attribute values in the text itself.

 

t           Column title

w         Column width in dialog units

e          Element name

a          Attribute name or to request element content

f           Flags

 

The only flag allowed is N, which stipulates that the column is numeric; this affects how it is sorted.

4.16     LEMMATA n

[All]

 

Stipulates which lemmatisation schemes (see 1.26) are in use in the corpus. Include a line for each scheme supported.

 

The lemmatisation name is the name from the table just referred to followed by any extra parameters required for that scheme.

 

Example:

lemmata bnc

lemmata lancaster c6

4.17     TAGSETHELP n

[Client]

 

n is a help tag indicating a page in the Sara help file documenting the POS tags in use in a corpus. Use c6tags or c7tags as appropriate.

4.18     WTAG e a

[All]

 

Designates a POS tag. When a word is indexed and POS indexing is switched on the value of the a attribute of its closest containing e tag is used as the POS value.

4.19     LTAG e a

[Indexer]

 

Designates a lemmatisation tag. When a word is indexed and inline lemmatisation is switched on the value of the a attribute of its closest containing e tag is used as the headword.

 

4.20     FMT n s

[Client]

 

Adds a string to a format file. n must name the file and s is the string to be added. These files are described in more detail in Section 5.

4.21     LEMMEX w tag

[Client]

 

Stipulate a word (with optional tag) (that is, an extended word of the null lemmatisation scheme) to be used as an example if the user asks for an example of how different schemes work.

 

The sense is always 0.

4.22     LEMMDEF name

[Server, client]

 

Stipulate the default lemma scheme when a Sara session begins.

4.23     LISTSOURCE s

[Client]

 

Stipulate the content of the source tag used in XML listings.

5.     Format files

 

A format is a set of rules for transforming an XML-encoded string into a display string. A rule is an element rule or an entity rule. All rules consist of a head and a body. The head is a white-space terminated name for the rule; the body a number of double quote delineated strings in which the following escapes are allowed: \n \t \\ \. Entity rules have at most one replacement string: element rules at most two.

 

The name of an entity rule always begins with the character &. In transforming an XML string, two transformations are processed: first the string is processed for element rules, then the result is processed for entity rules.

 

In the first transformation, the string is searched for strings of the form

 

<elt att1=val1 attn=valn>

 

or

 

</elt>

 

When such a string is found

(a)     if there is no element rule for elt then it is deleted

(b)     if the string is of the first form and there is an element rule for elt then it is replaced by the first string in the rule body after variable substitution

(c)     if the string is of the second form and the rule body has two strings then it is replaced by the second string after variable substitution.

 

Variable substitution means that the string %[att] is replaced by the value of attribute att.

 

In the second transformation the string is searched for the character & starting a token; if that token is the name of an entity rule then the name is replaced by the rule body. If the name is followed immediately by a semicolon character this will be deleted.
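The two transformations can be sketched as follows. This is not the client's implementation. The names applyFormat, substitute, parseAttributes and has are hypothetical, and rules is assumed to be an object mapping each rule name (entity rule names include the leading &) to its array of body strings. Where the text above is silent the sketch makes assumptions: an end tag whose rule has only one string is deleted, a %[att] naming a missing attribute becomes the empty string, and an & token that names no rule is left unchanged.

```javascript
function has(o, k) { return Object.prototype.hasOwnProperty.call(o, k); }

// Parse ' att1=val1 attn=valn' (values quoted or bare) into an object.
function parseAttributes(text) {
    var re = /([^\s=]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"']+))/g;
    var atts = {}, m;
    while ((m = re.exec(text)) !== null) {
        atts[m[1]] = m[2] || m[3] || m[4] || "";
    }
    return atts;
}

// Variable substitution: %[att] is replaced by the value of attribute att.
function substitute(template, attText) {
    var atts = parseAttributes(attText);
    return template.replace(/%\[([^\]]+)\]/g, function (m, name) {
        return has(atts, name) ? atts[name] : "";   // assumption: missing -> ""
    });
}

function applyFormat(rules, xml) {
    // First transformation: element rules.
    var out = xml.replace(/<(\/?)([^\s<>\/]+)([^<>]*)>/g,
        function (m, slash, elt, rest) {
            var body;
            if (!has(rules, elt)) return "";        // (a) no rule: delete
            body = rules[elt];
            if (slash == "") {                      // (b) first form
                return body.length > 0 ? substitute(body[0], rest) : "";
            }
            // (c) second form: needs a second string, else deleted (assumption)
            return body.length > 1 ? substitute(body[1], rest) : "";
        });
    // Second transformation: entity rules. A semicolon that follows
    // the name immediately is deleted with it.
    return out.replace(/&[^\s;&<>]+;?/g, function (m) {
        var name = m.charAt(m.length - 1) == ";" ? m.substring(0, m.length - 1) : m;
        if (!has(rules, name)) return m;            // assumption: left as is
        return rules[name].length > 0 ? rules[name][0] : "";
    });
}
```

With the rules { u: ["<%[who]>: "], hi: ["*", "*"], "&amp": ["&"] }, the string <u who=PS001>hello <hi rend=it>big</hi> <w pos=NN1>dog</w> &amp; cat</u> is transformed to <PS001>: hello *big* dog & cat.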

 

It should be noted that although page, line and bibliographic formats are supplied in the dsc file, the client user is allowed to edit them. The client holds local copies of all page formats that it knows about in a single file called pagefmt.txt. This file has a section for each dsc file the client knows of: the section begins with the dsc name in square brackets. When the version number of the dsc file on the server exceeds that on the client the new dsc file is automatically downloaded to the client. This has the side effect that all user edits to these format strings are overwritten.

5.1     page

 

Page formats are applied to the page-per-hit solution window when it is set to custom format.

5.2     line

 

Line formats are applied to the line-per-hit solution window when it is set to custom format.

5.3     bib

 

Bib formats are applied to the strings returned for the various BIB elements listed in the dsc file.

6.     Sara Script

 

SaraScript is an object-oriented extension of JavaScript that allows access to the Sara lexicon and solution mechanism. Readers who are not acquainted with JavaScript are advised to go away and learn it.

 

SaraScript is a higher level version of the Sara Protocol. It is a wrapper around the string-based protocol and is implemented partly on the server and partly on the client. The simplest way to execute SaraScript is from the client, and this is the approach taken below. However, it is possible to execute SaraScript in any Windows scripting host. In that case the functions in 6.1 behave slightly differently. Instead of being functions they become methods of a Sara server object.

 

Consider the function list from the example in 6.8. In the Sara script window it begins:

 

function list(lemma,tag,ff)

{

            var i, k, kk, s;

            s=Solve("^\""+lemma+"="+tag+"=\"");

            k=s.Count(); p=0;

 

but in another application it would look like this:

 

var ss=new ActiveXObject("SaraServer");

 

function list(lemma,tag,ff)

{

            var i, k, kk, s;

            s=ss.Solve("^\""+lemma+"="+tag+"=\"");

            k=s.Count(); p=0;

 

When the server object is created the Sara client will be launched and the user will be prompted to log on to a corpus. At present there is no way to select the corpus in script.

 

Note: in reading examples, remember that JScript treats all strings as Unicode.

6.1     Top level functions

6.1.1     ActivateClass

void ActivateClass(String part,String class)

 

Set the active class.

 

Example:

ActivateClass("type","spoken");

6.1.2     OpenPartition

void OpenPartition(String path)

 

Open a partition.

 

Example:

OpenPartition("type.sc");

6.1.3     Server

String Server(String s1);

 

This offers raw access to the Sara protocol described above. The argument s1 is sent to the server and the string returned by the server is returned.

 

Example:

 

s=Server("MOTD");

 

s gets set to the message of the day.

 

6.1.4     SetLemmata

void SetLemmata(String l)

 

Set the lemmatisation scheme in use.

 

Example:

SetLemmata("bnc");

6.1.5     Solve

 

SaraSolve Solve(String)

 

Get a SaraSolve object that is the solution to a query. If there are no solutions an exception is signalled.
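Because an empty solution is reported as an exception rather than as a SaraSolve object with a count of zero, callers will usually want to guard the call. A minimal sketch follows. The helper name countHits is hypothetical, and solveFn stands for Solve (or for a wrapper around ss.Solve in another scripting host). The sketch does not distinguish the no-solutions case from other failures.

```javascript
// Hypothetical helper: number of hits for a query, or 0 if Solve
// signals an exception.
function countHits(solveFn, query) {
    var s, n;
    try {
        s = solveFn(query);
    } catch (e) {
        return 0;           // no solutions (or some other failure)
    }
    n = s.Count();
    s.Release();            // dispose of the SaraSolve object
    return n;
}
```

In the Sara script window this would be called as countHits(Solve, "frog").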

 

6.1.6     GetWordList

 

SaraWordList GetSaraWordList(String)

 

Get a SaraWordList object that represents the set of words in the dictionary whose headwords match the pattern supplied as the argument.

 

Example:

wl=GetSaraWordList("mang.*");

 

6.2     The SaraSolve class

6.2.1     Count

 

Int s.Count()

 

Returns solution size, that is, the number of hits.

 

Example:

SaraSolve s=Solve("hapax legomenon");

WriteLine(s.Count());

 

1

6.2.2     GetCollocateTableByCount

SaraColloc ct=s.GetCollocateTableByCount(int left,int right,int k,int rare);

 

Build a collocation table for the solution set where

left                    left window

right                  right window

k                      number of collocates wanted

rare                   frequency below which words are to be omitted

 

See CTABOPTIONS (1.8). The k best collocates are returned.

 

Note: at present SaraScript collocation tables always use the Z-score.

6.2.3     GetCollocateTableByScore

SaraColloc ct=s.GetCollocateTableByScore(int left,int right,double k,int rare);

 

Build a collocation table for the solution set where

left                    left window

right                  right window

k                      minimum score

rare                   frequency below which words are to be omitted

 

See CTABOPTIONS (1.8). Collocates scoring more than k are returned.

 

Note: at present SaraScript collocation tables always use the Z-score.

6.2.4     GetSol

SaraHit h=s.GetSol(Int n)

 

Get the nth hit from a solution. If n is out of range an exception is raised.

6.2.5     Release

void s.Release();

 

Dispose of unwanted SaraSolve object.

 

6.3     The SaraHit class

A SaraHit object represents a single hit for a query. In what follows the term solution denotes the entire string found (the default scope is used) whereas the term hit denotes the part of the solution that matches the query.

6.3.1     Filter

void h.Filter(String name);

 

Apply a filter to the solution. The next table lists the filters available. An exception occurs if the filter is not in this table.

 

NOSGMLX             Strip all XML markup

NORMSPACE           Replace multiple white space characters by single spaces

CMAP                Map character entities to characters

XMLENT              Replace & and < by XML entities

 

6.3.2     GetStart

int n=h.GetStart();

 

Get location in solution of start of hit.

6.3.3     GetLength

int n=h.GetLength();

 

Get length of hit.

6.3.4     GetString

String s=h.GetString();

 

Get the solution.

6.3.5     GetText

String text=h.GetText();

 

Get the name of the text in which the solution is found.

6.3.6     GetUnit

String unit=h.GetUnit();

 

Get the name of the unit of the text in which the solution is found. In the BNC this will be the sentence number. In general the attribute returned is controlled by the LABEL line of the dsc file (4.3).
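The SaraHit methods can be combined to cut a solution into its left context, the hit itself and its right context. The sketch below uses the hypothetical helper name splitHit and assumes, as the example in 6.8 does, that GetStart is a zero-based character offset into the string returned by GetString.

```javascript
// Hypothetical helper: describe a SaraHit as a keyword-in-context record.
function splitHit(h) {
    var text = h.GetString();       // the whole solution
    var start = h.GetStart();       // assumed zero-based offset of the hit
    var len = h.GetLength();        // length of the hit
    return {
        text: h.GetText(),          // name of the text
        unit: h.GetUnit(),          // unit label, e.g. the sentence number
        left: text.substring(0, start),
        keyword: text.substring(start, start + len),
        right: text.substring(start + len)
    };
}
```

Any filters wanted should be applied with h.Filter before calling splitHit, as ShowSol does in 6.8.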

 

6.4     The SaraColloc class

6.4.1     Count

Int ct.Count()

 

Return the number of collocates in the collocation table.

6.4.2     GetCollocate

SaraCollocate ct.GetCollocate(int n)

 

Get the nth collocate from a collocation table.

6.4.3     Release

void ct.Release()

 

Free an unwanted collocation table.

6.5     The SaraCollocate class

6.5.1     frequency

int co.frequency

 

This read-only property is the co-frequency of the collocate.

6.5.2     GetHits

SaraSolve co.GetHits();

 

For a given collocate return the solution set, that is, the set of hits for the collocate headword that occur in the collocation window.

6.5.3     score

double co.score

 

This read-only property is the score of the collocate.

6.5.4     word

String co.word

 

This read-only property is the headword of the collocate.
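As a sketch of how the SaraColloc and SaraCollocate classes fit together, the following hypothetical helper (topCollocates is not part of SaraScript) copies the k best collocates of a solution into an ordinary array and then frees the table. It assumes that GetCollocate takes a zero-based index, as in the example in 6.8.

```javascript
// Hypothetical helper: the k best collocates of SaraSolve object s
// as an array of { word, frequency, score } records.
function topCollocates(s, left, right, k, rare) {
    var ct = s.GetCollocateTableByCount(left, right, k, rare);
    var rows = [], i, n = ct.Count(), c;
    for (i = 0; i < n; i++) {
        c = ct.GetCollocate(i);
        rows[rows.length] = { word: c.word, frequency: c.frequency, score: c.score };
    }
    ct.Release();                   // free the collocation table
    return rows;
}
```

For example, topCollocates(Solve("frog"), 3, 3, 10, 2) would ask for the ten best collocates within three words either side, omitting words with a frequency below 2.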

6.6     The SaraWordList class

6.6.1     Count

Int Count()

 

Returns the number of entries in the word table.

 

6.6.2     GetWordListEntry

SaraWordListEntry GetWordListEntry(index)

 

Returns a single entry from the given position in the word list.

 

6.6.3     Release

Void Release()

 

Free resources associated with the word list.

6.7     The SaraWordListEntry class

6.7.1     word

This is a String-valued property giving the matching headword.

6.7.2     frequency

This is a long property giving the number of occurrences of forms of the headword.

6.7.3     forms

This is a long property giving the number of forms of the headword.
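A word list is walked in the same way as a solution or a collocation table. The sketch below uses the hypothetical helper name listWords and assumes that the index passed to GetWordListEntry is zero-based, which this specification does not state.

```javascript
// Hypothetical helper: one tab-separated line per word list entry,
// giving headword, frequency and number of forms.
function listWords(wl) {
    var lines = [], i, n = wl.Count(), e;
    for (i = 0; i < n; i++) {
        e = wl.GetWordListEntry(i);     // assumption: zero-based index
        lines[lines.length] = e.word + "\t" + e.frequency + "\t" + e.forms;
    }
    wl.Release();                       // free the word list
    return lines;
}
```

In the Sara script window this could be called as listWords(GetSaraWordList("mang.*")).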

6.8     Example

 

The following example shows a function report that is used to list to file a random set of hits and some collocation information for a word.

 

Example:

 

var p, ff;

var r=new Array();

var fso=new ActiveXObject("Scripting.FileSystemObject");

 

function report(word,tag)

{

            ff=fso.CreateTextFile("listing.xml",true,true);

            header(ff);

            ff.WriteLine("<listLems>");

            list(word,tag,ff);

            ff.WriteLine("</listLems>");

            ff.Close();

}

 

function list(lemma,tag,ff)

{

            var i, k, kk, s;

            s=Solve("^\""+lemma+"="+tag+"=\"");

            k=s.Count(); p=0;

            kk=k;

            if (k>10) kk=10;

            ff.WriteLine("<lemma freq=\""+s.Count()+"\">");

            ff.WriteLine("<form tag=\"" + tag +"\">"+lemma+"</form>");

            ff.WriteLine("<listHits>");

            for (i=0;i<kk;i++) {

                        t=x(k);

                        sol=s.GetSol(t);

                        ShowSol(sol);

                        delete sol;

            }

            ff.WriteLine("</listHits>");

            Collocates(s,ff,1,0);

            Collocates(s,ff,0,3);

            ff.WriteLine("</lemma>");

            s.Release();

            delete s;

}

 

function ShowSol(sol)

{

            var hit;

            sol.Filter("CMAP");

            sol.Filter("NOSGMLX");

            sol.Filter("NORMSPACE");

            sol.Filter("XMLENT");

            hit=sol.GetString();

            ff.WriteLine("<hit text=\""+sol.GetText()+

                        "\" n=\""+sol.GetUnit()+"\">"+

                        hit.substr(0,sol.GetStart()) +

                        "<kw>"+hit.substr(sol.GetStart(),sol.GetLength()) +

                        "</kw>"+hit.substr(sol.GetStart()+sol.GetLength()) +

                        "</hit>");

}

 

function Collocates(sol,file,left,right)

{

            var i, k, ct, c, ss, hits, sol;

            ct=sol.GetCollocateTableByScore(left,right,10.0,2);

            k=ct.Count();

            file.WriteLine("<listCollocs left=\""+left+"\" right=\""+right+"\">");

            for (i=0;i<k;i++) {

                        c=ct.GetCollocate(i);

                        ss=c.word.split("=");

                        file.WriteLine("<colloc freq=\""+c.frequency

                                    +"\" zscore=\""+decimal(c.score)

                                    +"\"><form tag=\""+ss[1]                                   

                                    +"\">"+ss[0]+"</form>");

                        hits=c.GetHits();

                        sol=hits.GetSol(0);

                        ShowSol(sol);

                        file.WriteLine("</colloc>");

                        delete sol;

                        hits.Release();

                        delete hits;

                        delete c;

            }

            file.WriteLine("</listCollocs>");

            ct.Release();

            delete ct;

}

 

function x(k)

{

            var i, n;

            if (k <= 10) return p++;

            n=Math.random()*k;

            n=Math.floor(n);

            if (n==k) n=k-1;

            while (1) {

            for (i=0;i<p;i++) {

                        if (r[i]==n) break;

            }

            if (i==p) {

                        r[p++]=n;

                        return n;

            }

            n++;

            if (n==k) n=0;

            }

}

 

function decimal(z)

{

            var zz;

            zz=z*100;

            zz=Math.floor(zz);

            zz=zz/100;

            return zz;

}

 

function header(file)

{

            // listing.xml is created as a Unicode file, so declare UTF-16.

            // An XML document may carry only one XML declaration.

            file.WriteLine("<?xml version=\"1.0\" encoding=\"UTF-16\" ?>");

            file.WriteLine("<?xml:stylesheet type=\"text/xsl\" href=\"listlems.xsl\"?>");

            file.WriteLine("<!DOCTYPE listLems SYSTEM \"listlems.dtd\">");

}

 

report("idle","VERB");
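The function x in the example picks distinct random hit numbers by probing forward from a random start until it finds a number not yet used. An alternative, shown here only as a sketch and not as the method used by the example, is a partial Fisher-Yates shuffle. The name pickDistinct and the optional rnd parameter are hypothetical. Note that the sketch builds an array of k numbers, so it is only suitable when the solution is of moderate size.

```javascript
// Hypothetical alternative to x(): return kk distinct random
// integers in the range 0 .. k-1 (all k of them if kk > k).
function pickDistinct(k, kk, rnd) {
    var random = rnd || Math.random;    // rnd lets a caller supply its own generator
    var idx = [], i, j, t;
    if (kk > k) kk = k;
    for (i = 0; i < k; i++) idx[i] = i;
    for (i = 0; i < kk; i++) {
        // choose one of the positions i .. k-1 and swap it into place i
        j = i + Math.floor(random() * (k - i));
        t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    idx.length = kk;                    // keep only the chosen prefix
    return idx;
}
```

In list the loop could then fetch s.GetSol(t) for each t in pickDistinct(k, 10).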

 

7.     Utility programs

7.1     solve

 

The program solve returns solutions to a CQL query, which must appear on the command line. Thus

 

solve frog

 

will enumerate all solutions to the query frog in the format returned by GET (minus the OK, of course). This format was documented in 1.13.

 

solve uses a connection to localhost to contact the server. It can therefore only be run on a machine that is running the server.

7.2     testnet

 

testnet is used to test the server. It accepts ASCII messages as documented in Section 1, which the user types at the prompt. The replies are echoed.
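The framing that a tool such as testnet has to perform follows from Section 1: each message is an ASCII string terminated by a null, and each reply begins OK or NO. The sketch below is not the testnet source. The names frameMessage and parseReply are hypothetical, and skipping spaces between the status and the rest of the reply is an assumption.

```javascript
// Hypothetical helper: check that a message is ASCII and add the
// terminating null that the protocol requires.
function frameMessage(msg) {
    var i, c;
    for (i = 0; i < msg.length; i++) {
        c = msg.charCodeAt(i);
        if (c == 0 || c > 127) {
            throw new Error("message must be ASCII with no embedded null");
        }
    }
    return msg + "\0";
}

// Hypothetical helper: split a reply into its status and its data.
function parseReply(reply) {
    var body = reply, status, data;
    if (body.length > 0 && body.charAt(body.length - 1) == "\0") {
        body = body.substring(0, body.length - 1);  // drop the terminator
    }
    status = body.substring(0, 2);
    if (status != "OK" && status != "NO") {
        throw new Error("reply does not begin OK or NO");
    }
    data = body.substring(2);
    // assumption: spaces between the status and the data are not significant
    while (data.length > 0 && data.charAt(0) == " ") data = data.substring(1);
    return { ok: status == "OK", data: data };
}
```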

 

5.37.3     sample

 

sample is a simple filter that may be used to reduce the number of solutions returned by solve. sample n reads lines from standard input and sends every nth line to standard output. Thus

 

solve frog|sample 3

 

displays every 3rd hit from the query frog.
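The behaviour of sample can be sketched in a few lines. The name sampleLines is hypothetical, and the sketch assumes that the lines passed on are lines n, 2n, 3n and so on. The program itself may start counting from the first line instead.

```javascript
// Hypothetical sketch of the sample filter: keep every nth line.
function sampleLines(lines, n) {
    var out = [], i;
    if (n < 1) throw new Error("n must be at least 1");
    // assumption: the first line kept is line n (index n-1)
    for (i = n - 1; i < lines.length; i += n) {
        out[out.length] = lines[i];
    }
    return out;
}
```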

 

Appendix. The DSC file for BNC2

 

_ver 100

# =================================================

# BNC2.DSC file

#

# 1. First version AD

# 2. Modified by LB 29 apr 97: attribute descs

# 3. Modified by AD 1 may 97: restored portmanteau tags, element text

# 4. Modified by LB 10 May 97: replaced POS table for sampler

#    removed wbpsel and writim (not used in sampler)

#    lowercase all catref values

# 17 nov 98: LB : change prefix for setting ID from LC to SE

# revised for BNC2 : LB : 12 Sep 00

label s/n

scope s

scope p/u

scope bncDoc

option namecase

lemmata bnc

lemmata lancaster c6

lemmdef bnc

#bib sourceDesc

col Title 300 title -

col Size 100 extent -

radix 16

tagsethelp c6tags

wtag w pos

wtag c pos

# default page and line formats

fmt line u "<%[who]>: "

fmt line div "|"

fmt line div1 "|"

fmt line div2 "|"

fmt line div3 "|"

fmt line head "||"

fmt line spkr "||"

fmt line caption "|"

fmt line p "|  "

fmt line item "* "

fmt line pause " ... "

fmt line unclear " [ ... ] "

fmt line event " [%[desc]] "

fmt line vocal " [%[desc]] "

fmt line kinesic " [%[desc]] "

fmt line shift "[%[new]]"

fmt page u "\n%[who] >: "

fmt page div "\n"

fmt page div1 "\n"

fmt page div2 "\n"

fmt page div3 "\n"

fmt page head "\n\n"

fmt page spkr "\n\n"

fmt page caption "\n"

fmt page p "\n  "

fmt page item "\n*\t"

fmt page pause " ... "

fmt page unclear " [ ... ] "

fmt page event " [%[desc]] "

fmt page vocal " [%[desc]] "

fmt page kinesic " [%[desc]] "

fmt page shift "[%[new]]"

#------------------------------------

# First the element and attribute table

# -------------------------------------

att n CDATA 0

att rend CDATA 0

elt w e b word

att pos CDATA 0

elt c e b word

att pos CDATA 0

elt bncDoc e b an individual text in the BNC

att id CDATA 0

elt s s b sentence-like linguistic segment

att p CAT Y|N    manually checked at Lancaster

# not currently used

elt seg e b

att part CAT Y|N|I|M|F

att type CDATA 0

att subtype CDATA 0

elt gap 0 b point where part of source text has been omitted

att desc CDATA 0  description of the part omitted

att resp CDATA 0 identifier of the editor responsible

att reason CDATA 0  reason for making change

elt corr e b editorially regularized part of a written text

att sic CDATA 0  uncorrected form

att resp CDATA 0 identifier of the editor responsible

att reason CDATA 0 reason for making change

elt sic e b apparently erroneous transcription

att corr CDATA 0  suggested correction

att resp CDATA 0 identifier of the editor responsible

att reason CDATA 0  reason for making change

elt ptr 0 b link to a displaced element or to synchronisation point

att target CDATA 0  identifier of target

elt text e b an individual written text

att complete CAT Y|N  Is sample complete (Y or N)?

att org CAT compo|seq Organization (COMPOsite or SEQential)

att decls CDATA 0 Applicable editorial declarations

elt div1 e b first-level subdivision of a written text

att complete CAT Y|N Is division complete (Y or N)?

att type CDATA 0 Function of division eg chapter, part, section, toc, etc.

att org CAT compo|seq Organization (COMPOsite or SEQential)

att decls CDATA 0 Applicable editorial declarations

elt div2 e b second-level subdivision of a written text

att complete CAT Y|N Is division complete (Y or N)?

att type CDATA 0 Function of division eg chapter, part, section, toc, etc.

att org CAT compo|seq Organization (COMPOsite or SEQential)

att decls CDATA 0 Applicable editorial declarations

elt div3 e b third-level subdivision of a written text

att complete CAT Y|N Is division complete (Y or N)?

att type CDATA 0 Function of division eg chapter, part, section, toc, etc.

att org CAT compo|seq Organization (COMPOsite or SEQential)

att decls CDATA 0 Applicable editorial declarations

elt div4 e b fourth-level subdivision of a written text

att complete CAT Y|N Is division complete (Y or N)?

att type CDATA 0 Function of division eg chapter, part, section, toc, etc.

att org CAT compo|seq Organization (COMPOsite or SEQential)

att decls CDATA 0 Applicable editorial declarations

elt head e b head heading at the start of a division of a written text

att type CAT main|sub|byline|unspec (MAIN, SUB, BYLINE, UNSPECified)

elt caption e b floating heading or caption within a written text

att id NULL 0

att type CAT attached|display|byline|unspec  (ATTACHED, DISPLAY, BYLINE, UNSPECified)

elt p e b paragraph in a written text

elt sp e b speech in a written text

att who CDATA 0

elt spkr e b speaker of a speech in a written text

elt stage e b stage direction in a written text       

att id NULL 0

att type CAT m|s|a|d|x|u 

elt age e h

elt occupation e h

elt dialect e h

elt particLinks e h

elt poem e b group of verse lines in a written text

elt l e b line of verse in a written text

att part CAT y|n|u Is verse line complete (Yes, No, Unknown)?

elt lg e b

att type CDATA 0

elt quote e b quotation in a written text

att type CAT inline|display|unspec (INLINE, DISPLAY, or UNSPECified)?

elt list e b list of items in a written text

att type CDATA 0

elt item e b item within a list in a written text

att id CDATA 0

elt label e b label of a list item in a written text

elt note e b note or comment of any kind

att id NULL 0

att resp CDATA 0 identifier of the editor responsible

att place CAT side|foot|end|unspec (SIDE, FOOT, END, or UNSPEC)

att type CAT ed|orig (EDitorial or ORIGinal)

elt bibl e b bibliographic reference in a written text

elt hi e b typographically highlighted phrase in a written text

elt salute e b salutation or greeting in a written text

elt lb 0 b position of line break in written sourcetext

elt pb 0 b position of page break in written sourcetext

elt xref e b

elt editor e h

elt publisher e h

elt body e b

elt stext e b an individual spoken text transcript

att complete CAT Y|N Is transcript complete (Y or N)?

att org CAT compo|seq Organization (COMPOsite or SEQential)

att decls MULTIDREFS SD

att decls MULTIDREFS CN

att decls MULTIDREFS QT

att decls MULTIDREFS HN

att decls MULTIDREFS TR

att decls MULTIDREFS QN

att decls MULTIDREFS SN

att decls MULTIDREFS TN

att decls MULTIDREFS RE

att decls MULTIDREFS SE

elt div e b any subdivision of a spoken text

att complete CAT Y|N Is division complete (Y or N)?

att type CDATA 0

att org CAT compo|seq Organization (COMPOsite or SEQential)

att decls MULTIDREFS RE

att decls MULTIDREFS SE

elt align e b alignment map for synchronizing overlap points in a spoken text

elt loc 0 b synchronisation point within an alignment map in a spoken text

att id ID LC

elt u e b utterance in a spoken text

att who REFID person

elt vocal 0 b non-verbal vocalization in a spoken text

att desc CDATA 0 kind of sound made

att dur NUMBER 0 duration in seconds

elt pause 0 b noticeable pause in a spoken text

att dur NUMBER 0  duration in seconds

elt shift 0 b change in voice quality in a spoken text

att new CDATA 0 voice quality after the shift

elt event 0 b non-verbal event within a spoken text

att desc CDATA 0 description of the event

att dur NUMBER 0  duration in seconds

elt unclear 0 b inaudible or incomprehensible passage in a spoken text

att who REFID person speaker identifier

att dur NUMBER 0  duration in seconds

elt trunc e b truncated form in a spoken text

elt teiHeader e h meta-information describing a corpus text

att date.updated CDATA 0

att creator CDATA 0

att status CAT new|update (NEW or UPDATEd)

att update CDATA 0

att type CAT corpus|text (CORPUS or TEXT)

att id NULL 0

elt fileDesc e h documentation of an electronic text

elt titleStmt e h title statement for a text

elt title e h title within a bibliographic entry

elt respStmt e h statement of responsibility in a bibliographic entry

elt resp e h nature of responsibility

elt ednStmt e h information about a particular edition

elt extent e h size of a corpus text

att kb NUMBER 0 Number of Kbytes

att words NUMBER 0 Number of <w> elements contained

elt publicationStmt e h publication or distribution information

elt address e h postal or other address

elt idno e h identifying number for a text

att type CDATA 0

elt availability e h availability code for file

att status CAT restrict|unknown|free

att region CDATA 0

elt sourceDesc e h description of the source for a written text

elt biblStruct e h structured bibliographic entry

att default CAT YES|NO (YES or NO)

elt analytic e h analytic bibliographic entry

elt monogr e h monographic bibliographic entry

elt author e h author in bibliographic entry

att domicile CDATA 0

att born NUMBER 0

elt edition e h edition in a bibliographic entry

elt imprint e h imprint within a bibliographic entry

elt name e h proper name of person, place etc.

att type CAT place|org|person

att id CDATA 0

elt date e h a date

att value CDATA 0

elt pubPlace e h place of publication within bibliographic entry

elt biblScope e h page range within bibliographic entry

att type CAT vol|issue|pp

elt bibNote e h note within a bibliographic entry

elt editionStmt e h details of an edition

elt distributor e h distributor of an edition

elt addrLine e h part of an address

elt para e h textual note in the header

att id CDATA 0

elt encodingDesc e h encoding description

att id CDATA 0

elt projectDesc e h background information about BNC project

elt recordingStmt e h information about an audio recording

att id ID RS

att default CAT YES|NO

elt recording e h recording details    

att id NULL 0

att dur NUMBER 0  duration in seconds (?)

att date CDATA 0 date of recording

att time CDATA 0  time of day when recording made

att type CAT DAT DAT, WALKman, or UNKNOWN

elt samplingDecl e h description of sampling policy

att id ID SD

elt editorialDecl e h

elt classDecl e h description of classification scheme

elt taxonomy e h

att id CDATA 0

elt tagsDecl e h list of tags used in a particular text

elt tagUsage e h count for a particular tag in a text

att gi CDATA 0 name of tag

att occurs NUMBER 0 frequency of tag in text

elt refsDecl e h description of reference system used

att id CDATA 0

elt category e h a category-value pair

att id MULTID allava

att id MULTID alltyp         

att id MULTID alltim

att id MULTID scgdom

att id MULTID sdeage

att id MULTID sdecla

att id MULTID sdesex

att id MULTID spolog

att id MULTID sporeg 

#att id MULTID wbpSel

att id MULTID wmipub

att id MULTID wriaag

att id MULTID wriabp

att id MULTID wriad

att id MULTID wriaet

att id MULTID wriase

att id MULTID wriaty

att id MULTID wriaud 

att id MULTID wridom

att id MULTID wrilev

att id MULTID wrimed

att id MULTID wripp

att id MULTID wrisam 

att id MULTID wrista

att id MULTID writas

elt profileDesc e h additional information about a text

elt langUsage e h description of languages used in a text

elt particDesc e hy description of spoken text participants

elt catDesc e h description of a category

elt creation e h information about creation of a text

att date CDATA 0

elt person e h information about a speaker

att role CAT resp|other (RESPondent or OTHER)

att sex CAT m|f|u       (Male, Female or Unknown)

att soc CAT AB|C1|C2|DE|UU (AB, C1, C2, DE, UU)

att resp CDATA 0  

att age CAT 0|1|2|3|4|5|X

att dialect CDATA 0

att flang CDATA 0 first language

att educ CDATA 0 educational level reached

att id ID PS

elt relation 0 h relationship between participants in a spoken text

att desc CDATA 0 Relationship type e.g. aunt, mother

att passive CDATA 0 Identifiers for other speakers related

att type CDATA 0

att active CDATA 0

att mutual CAT Y|N

elt settingDesc e h description of setting in which speech occurs

elt setting e h an individual setting in which speech occurs

att county CDATA 0 county name

att spont CAT L|M|H|U Low, Medium, High, or Unknown

att who CDATA 0 Identifiers of speakers at this location

att audSize CDATA 0 Audience size at this location

att id ID SE    

elt locName e h name of place where speech recorded

elt locale e h description of a place where speech recorded

elt activity e h participants' activity during recording

att spont CDATA 0

elt textClass e h text classification

att default CAT YES|NO

elt classCode e h Genre or other classification of a text

att scheme CDATA 0

elt catRef 0 h category codes applicable to a text

att target MULTIDREFS allava Text availability

MENU 0

#all_availability

#ITEM 0 Information not available

#ITEM 1 Freely available worldwide

#ITEM 2 Available worldwide

#ITEM 3 Not available in North America

#ITEM 4 Not available in USA.

#ITEM 5 Not available outside the European Union

att target MULTIDREFS alltyp Text type

MENU 0 all_type

ITEM 0 Information not available

ITEM 1 Spoken demographic

ITEM 2 Spoken context-governed

ITEM 3 Written books and periodicals

ITEM 4 Written-to-be-spoken

ITEM 5 Written miscellaneous

att target MULTIDREFS alltim Publication date

MENU 0 all_time

ITEM 0 unknown

ITEM 1 1960-1974

ITEM 2 1975-1984

ITEM 3 1985-1993

att target MULTIDREFS scgdom  Domain

MENU 0 spoken_domain

ITEM 0 Information not available

ITEM 1 Educational/Informative

ITEM 2 Business

ITEM 3 Public/Institutional

ITEM 4 Leisure

att target MULTIDREFS sdeage  Respondent age

MENU 0 spoken_age

ITEM 0 Information not available

ITEM 1 Under 15

ITEM 2 15-24

ITEM 3 25-34

ITEM 4 35-44

ITEM 5 45-59

ITEM 6 60 or over          

att target MULTIDREFS sdecla Respondent social class

MENU 0 spoken_class

ITEM 0 Information not available

ITEM 1 AB

ITEM 2 C1

ITEM 3 C2

ITEM 4 DE

att target MULTIDREFS sdesex Respondent gender

MENU 0 spoken_sex

ITEM 0 Information not available

ITEM 1 Male

ITEM 2 Female

att target MULTIDREFS spolog Interaction type

MENU 0 spoken_type

ITEM 0 Information not available

ITEM 1 Monologue

ITEM 2 Dialogue     

att target MULTIDREFS sporeg Region of capture

MENU  0 spoken_region

ITEM 0 Information not available

ITEM 1 South

ITEM 2 Midlands

ITEM 3 North        

#att target MULTIDREFS wbpSel  Selection method

#MENU 0 written_selection

#ITEM 0 Information not available

#ITEM 1 Selective

#ITEM 2 Random

att target MULTIDREFS wmipub  Publication status

MENU 0 written_pubstatus

ITEM 0 Information not available

ITEM 1 Published

ITEM 2 Unpublished

att target MULTIDREFS wriaag  Author age band

MENU 0 written_age

ITEM 0 Information not available

ITEM 1 Under 15

ITEM 2 15-24

ITEM 3 25-34

ITEM 4 35-44

ITEM 5 45-59

ITEM 6 60 and over

att target MULTIDREFS wriabp

MENU 0

att target MULTIDREFS wriad  Author domicile

MENU 2 written_domicile

#att target MULTIDREFS wriaet

#MENU 0

att target MULTIDREFS wriase  Author sex

MENU 0 written_sex

ITEM 0 Information not available

ITEM 1 Male

ITEM 2 Female

ITEM 3 Mixed

ITEM 4 Unknown      

att target MULTIDREFS wriaty  Type of author

MENU 0 written_type

ITEM 0 Information not available

ITEM 1 Corporate

ITEM 2 Multiple

ITEM 3 Sole

ITEM 4 Unknown      

att target MULTIDREFS wriaud  Audience type

MENU 0 written_audience

ITEM 0 Information not available

ITEM 1 Child

ITEM 2 Teenager

ITEM 3 Adult

ITEM 4 Any          

att target MULTIDREFS wridom  Domain

MENU 0 written_domain

ITEM 0 Information not available

ITEM 1 Imaginative

ITEM 2 Natural and pure sciences

ITEM 3 Applied sciences

ITEM 4 Social science

ITEM 5 World affairs

ITEM 6 Commerce and finance

ITEM 7 Arts

ITEM 8 Belief and thought

ITEM 9 Leisure

att target MULTIDREFS wrilev   Level of circulation

MENU  0 written_level

ITEM 0 Information not available

ITEM 1 Low

ITEM 2 Medium

ITEM 3 High         

att target MULTIDREFS wrimed Medium

MENU 0 written_medium

ITEM 0 Information not available

ITEM 1 Book 

ITEM 2 Periodical

ITEM 3 Misc. published

ITEM 4 Misc. unpublished

ITEM 5 To-be-spoken

att target MULTIDREFS wripp  Place of publication

MENU 2 written_place

att target MULTIDREFS wrisam  Sample type

MENU 0 written_sample

ITEM 0 Information not available

ITEM 1 Whole text

ITEM 2 Beginning sample

ITEM 3 Middle sample

ITEM 4 End sample

ITEM 5 Composite

att target MULTIDREFS wrista  Reception status

MENU 0 written_status

ITEM 0 Information not available

ITEM 1 Low

ITEM 2 Medium

ITEM 3 High

att target MULTIDREFS writas  Target audience sex

MENU 0 written_gender

ITEM 0 Information not available

ITEM 1 Male

ITEM 2 Female

ITEM 3 Mixed

ITEM 4 Unknown      

#att target MULTIDREFS writim  Time period

#MENU 0 written_time

#ITEM 0 Information not available

#ITEM 1 1960-1974

#ITEM 2 1975-1993

elt keywords e h descriptive keywords for topics of a text

att scheme CDATA 0

elt term e h individual term in a list of keywords

elt revisionDesc e h revision description

elt change e h change note

# ----------------------

# C6 part of speech tags

# ----------------------

POS AJ0 adjective (unmarked) (e.g. GOOD, OLD)

POS AJC comparative adjective (e.g. BETTER, OLDER)

POS AJS superlative adjective (e.g. BEST, OLDEST)

POS AT0 article (e.g. THE, A, AN)

POS AV0 adverb (unmarked) (e.g. OFTEN, WELL, LONGER, FURTHEST)

POS AVP adverb particle (e.g. UP, OFF, OUT)

POS AVQ wh-adverb (e.g. WHEN, HOW, WHY)

POS CJC coordinating conjunction (e.g. AND, OR)

POS CJS subordinating conjunction (e.g. ALTHOUGH, WHEN)

POS CJT the conjunction THAT

POS CRD cardinal numeral (e.g. 3, FIFTY-FIVE, 6609) (excluding ONE)

POS DPS possessive determiner form (e.g. YOUR, THEIR)

POS DT0 general determiner (e.g. THESE, SOME)

POS DTQ wh-determiner (e.g. WHOSE, WHICH)

POS EX0 existential THERE

POS ITJ interjection or other isolate (e.g. OH, YES, MHM)

POS NN0 noun (neutral for number) (e.g. AIRCRAFT, DATA)

POS NN1 singular noun (e.g. PENCIL, GOOSE)

POS NN2 plural noun (e.g. PENCILS, GEESE)

POS NP0 proper noun (e.g. LONDON, MICHAEL, MARS)

POS ONE the word ONE (including numeral and non-numeral uses)

POS ORD ordinal (e.g. SIXTH, 77TH, LAST)

POS PNI indefinite pronoun (e.g. NONE, EVERYTHING)

POS PNP personal pronoun (e.g. YOU, THEM, OURS)

POS PNQ wh-pronoun (e.g. WHO, WHOEVER)

POS PNX reflexive pronoun (e.g. ITSELF, OURSELVES)

POS POS the possessive (genitive) morpheme 'S or '

POS PRF the preposition OF

POS PRP preposition (except for OF) (e.g. FOR, ABOVE, TO)

POS TO0 infinitive marker (i.e. TO)

POS UNC unclassified: i.e. items which are not part of the English lexicon

POS VBB the base forms of the verb "BE" , except infinitive, i.e. AM, ARE

POS VBD past form of the verb "BE" , i.e. WAS, WERE

POS VBG -ing form of the verb "BE" , i.e. BEING

POS VBI infinitive of the verb "BE"

POS VBN past participle of the verb "BE" , i.e. BEEN

POS VBZ -s form of the verb "BE" , i.e. IS, 'S

POS VDB base form of the verb "DO" , except the infinitive

POS VDD past form of the verb "DO" , i.e. DID

POS VDG -ing form of the verb "DO" , i.e. DOING

POS VDI infinitive of the verb "DO"

POS VDN past participle of the verb "DO" , i.e. DONE

POS VDZ -s form of the verb "DO" , i.e. DOES

POS VHB base form of the verb "HAVE" , except the infinitive

POS VHD past tense form of the verb "HAVE" , i.e. HAD, 'D

POS VHG -ing form of the verb "HAVE" , i.e. HAVING

POS VHI infinitive of the verb "HAVE"

POS VHN past participle of the verb "HAVE" , i.e. HAD

POS VHZ -s form of the verb "HAVE" , i.e. HAS, 'S

POS VM0 modal auxiliary verb (e.g. CAN, COULD, WILL, 'LL)

POS VVB base form of lexical verb, except the infinitive (e.g. TAKE, LIVE)

POS VVD past tense form of lexical verb (e.g. TOOK, LIVED)

POS VVG -ing form of lexical verb (e.g. TAKING, LIVING)

POS VVI infinitive of lexical verb

POS VVN past participle form of lexical verb (e.g. TAKEN, LIVED)

POS VVZ -s form of lexical verb (e.g. TAKES, LIVES)

POS XX0 the negative NOT or N'T

POS ZZ0 alphabetical symbol (e.g. A, B, c, d)

POS AJ0-AV0 Adjective or adverb

POS AV0-AJ0 Adjective or adverb

POS AJ0-NN1 Adjective or singular noun

POS NN1-AJ0 Adjective or singular noun

POS AJ0-VVD Adjective or past tense form

POS VVD-AJ0 Adjective or past tense form

POS AJ0-VVG Adjective or -ing form of lexical verb

POS VVG-AJ0 Adjective or -ing form of lexical verb

POS AJ0-VVN Adjective or past participle

POS VVN-AJ0 Adjective or past participle

POS AVP-PRP Adverb particle or preposition

POS PRP-AVP Adverb particle or preposition

POS AVQ-CJS wh-adverb or subordinating conjunction

POS CJS-AVQ wh-adverb or subordinating conjunction

POS CJS-PRP Subordinating conjunction or preposition

POS PRP-CJS Subordinating conjunction or preposition

POS CJT-DT0 THAT as conjunction or determiner

POS DT0-CJT THAT as conjunction or determiner

POS CRD-PNI ONE as number or pronoun

POS PNI-CRD ONE as number or pronoun

POS NN1-NP0 Singular common noun or proper noun

POS NP0-NN1 Singular common noun or proper noun

POS NN1-VVB Singular common noun or base verb form

POS VVB-NN1 Singular common noun or base verb form

POS NN1-VVG Singular common noun or -ing form of verb

POS VVG-NN1 Singular common noun or -ing form of verb

POS NN2-VVZ Plural common noun or -s form of verb

POS VVZ-NN2 Plural common noun or -s form of verb

POS VVD-VVN Past tense verb or past participle

POS VVN-VVD Past tense verb or past participle

PUN PUL left bracket ( or [

PUN PUN punctuation marks . ! , : ; - ? ...

PUN PUQ quotation mark

PUN PUR right bracket ) or ]

# ------------------

# next the character entities

# --------------------

char apos 0027

char ast 002a

char colon 003a

char comma 002c

char equals 003d

char excl 0021

char hyphen 002d

char lpar 0028

char percnt 0025

char period 002e

char plus 002b

char quest 003f

char rpar 0029

char semi 003b

char sol 002f

char brvbar 00a6

char half 00bd

char horbar 2015

char lowbar 005f

char nbsp 00a0

char shy 00ad

char emsp

char ensp

char emsp13

char emsp14

char numsp

char puncsp

char thinsp

char hairsp

char dash

char blank

char nldr

char incare

char block 2588

char uhblk 2580

char lhblk 2584

char blk14

char blk12

char blk34

char marker

char male 2642

char female 2640

char phone

char telrec

char caret

char fflig

char filig f001

char fjlig

char ffilig

char ffllig

char fllig f002

char mldr

char sext

char target

char dlcrop

char drcrop

char ulcrop

char urcrop

char Aacute 00c1

char aacute 00e1

char Abreve 0102

char abreve 0103

char Acirc 00c2

char acirc 00e2

char acute 00b4

char AElig 00c6

char aelig 00e6

char Agr 0391

char agr 03b1

char Agrave 00c0

char agrave 00e0

char Amacr 0100

char amacr 0101

char amp 0026

char Aogon 0104

char aogon 0105

char Aring 00c5

char aring 00e5

char Atilde 00c3

char atilde 00e3

char Auml 00c4

char auml 00e4

char Bgr 0392

char bgr 03b2

char bquo 2018 p

char breve 02d8

char bsol 005c

char bull 2022

char Cacute 0106

char cacute 0107

char caron 02c7

char Ccaron 010c

char ccaron 010d

char Ccedil 00c7

char ccedil 00e7

char Ccirc 0108

char ccirc 0109

char Cdot 010a

char cdot 010b

char cedil 00b8

char cent 00a2

char check

char cir

char circ 02c6

char clubs 2663

char commat 0040

char copy 00a9

char copysr

char cross

char curren 00a4

char dagger 2020

char Dagger

char darr 2193

char dblac

char Dcaron 010e

char dcaron 010f

char deg 00b0

char Dgr 0394

char dgr 03b4

char diams 2666

char die 00a8

char divide 00f7

char dollar 0024

char dot 02d9

char Dstrok

char dstrok

char dtri

char dtrif 25bc

char Eacute 00c9

char eacute 00e9

char Ecaron 011a

char ecaron 011b

char Ecirc 00ca

char ecirc 00ea

char Edot 0116

char edot 0117

char EEgr 0397

char eegr 03b7

char Egr 0395

char egr 03b5

char Egrave 00c8

char egrave 00e8

char Emacr 0112

char emacr 0113

char ENG 014a

char eng 014b

char Eogon 0118

char eogon 0119

char equo 2019 p

char ETH 00d0

char eth 00f0

char Euml 00cb

char euml 00eb

char flat

char frac12 00bd

char frac13

char frac14 00bc

char frac15

char frac16

char frac17

char frac18 215b

char frac19

char frac23

char frac25

char frac27