The SARA Protocol
Version 1.065
SARA version 0.95
This is the second version of level 1.05 of the
protocol spec to accompany the first build of software release 0.940. The
specification, especially the format of the .dsc file, may change as a result
of testing. Comments in square brackets refer to corrections needed before this
specification is finalised.
Note. At the same time as the Sara indexing toolkit we shall be releasing version 1.0 of the client/server software. This will use version 1.07 of the protocol. At this point all calls that are being maintained solely for compatibility will be removed. These are marked with an asterisk below.
The SARA protocol is a client-server protocol allowing a central database of texts with SGML markup to be queried by remote clients. The protocol was designed for use with TCP, though any other network transport could be used. The only assumption made about the network is that it is capable of delivering null-terminated strings in the order they were sent.
This document describes the protocol (Section 1) and the associated query language CQL (Section 2). It does not cover the procedures required to build and set up an index, nor does it document the operation of the SARA server. Some trivial SARA related tools are described in Section 4.16.
All strings used as messages are ASCII strings. The length of a string is variable. Strings must be null-terminated; the final 0 cannot be omitted.
Some arguments (to be specified later) allow Unicode
characters to be embedded in ASCII text. An embedded Unicode value is
represented by the character ^U (ASCII 21) followed by a four character text
representation of the Unicode value as a hex number.
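The embedding rule can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and the use of upper-case hex digits and the restriction to code points up to FFFF are assumptions, since the text above specifies only "a four character text representation ... as a hex number".

```python
CTRL_U = "\x15"  # ^U, ASCII 21, introduces an embedded Unicode value


def encode_embedded(text: str) -> str:
    """Replace every non-ASCII character by ^U plus four hex digits."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128 and ch != CTRL_U:
            out.append(ch)
        elif code <= 0xFFFF:
            # Assumption: upper-case hex, zero padded to four characters.
            out.append(f"{CTRL_U}{code:04X}")
        else:
            raise ValueError("code point does not fit in four hex digits")
    return "".join(out)


def decode_embedded(data: str) -> str:
    """Inverse of encode_embedded."""
    out = []
    i = 0
    while i < len(data):
        if data[i] == CTRL_U:
            digits = data[i + 1:i + 5]
            if len(digits) != 4:
                raise ValueError("truncated embedded Unicode value")
            out.append(chr(int(digits, 16)))
            i += 5
        else:
            out.append(data[i])
            i += 1
    return "".join(out)
```

For example, under these assumptions the word "café" would travel as the ASCII characters c, a, f, ^U, 0, 0, E, 9.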
All transactions consist of a message sent from the client to the server followed by a reply from the server to the client. Client messages begin with a keyword and may contain other data, depending on the keyword. Server responses begin with either OK or NO. Extra data depends on the client message keyword.
Implementation limit: The maximum length of a client message is 6000 characters.
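A minimal sketch of message framing under the rules above. The helper names are hypothetical. It is assumed that arguments are separated by single spaces (as in the message forms shown later) and that the 6000 character limit is counted without the terminating null; neither point is stated explicitly here.

```python
MAX_CLIENT_MESSAGE = 6000  # implementation limit quoted above


def frame_message(keyword: str, *args: object) -> bytes:
    """Build a null-terminated ASCII client message."""
    text = " ".join([keyword, *map(str, args)])
    if len(text) > MAX_CLIENT_MESSAGE:
        raise ValueError("client message exceeds 6000 characters")
    return text.encode("ascii") + b"\x00"


def parse_reply(raw: bytes) -> tuple[bool, str]:
    """Split a server reply into (succeeded, remaining data)."""
    if not raw.endswith(b"\x00"):
        raise ValueError("reply is not null-terminated")
    text = raw[:-1].decode("ascii")
    status, _, rest = text.partition(" ")
    if status not in ("OK", "NO"):
        raise ValueError("reply must begin with OK or NO")
    return status == "OK", rest
```

The caller is still responsible for reading from the socket until the terminating null arrives, since TCP does not preserve message boundaries.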
There is one exception to this rule. Certain transactions are classified as interruptable. If any urgent data is available from the client socket during an interruptable transaction then the transaction is halted and the server writes the string NO ABORT to the socket. In this case there is one more client message than server reply. The content of the data package sent to interrupt the server should be the string INT. The MSG_OOB flag must be specified when sending the string. The behaviour of the server on receiving any other data after a read and before a write is undefined.
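The interrupt can be sketched with the standard Python socket module, which exposes the MSG_OOB flag. The function names are hypothetical, and it is an assumption that the string INT is sent without a terminating null, since the text says only that the content should be the string INT.

```python
import socket


def send_interrupt(sock) -> None:
    """Send the interrupt string as urgent (out-of-band) data."""
    # Assumption: no null terminator is appended to INT.
    sock.send(b"INT", socket.MSG_OOB)


def was_aborted(reply: str) -> bool:
    """True if the server reports that the transaction was interrupted."""
    return reply == "NO ABORT"
```

A client would call send_interrupt only while an interruptable transaction such as SOLVE is outstanding, and then read the NO ABORT reply in the usual way.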
A SARA session consists of these phases:
1. The client connects to the server and the server accepts the call.
2. The server tries to create a process to accept data packages from the client. If it cannot do this (say because memory is short on the server) then the server closes the socket.
3. The user logs on
4. Setup messages are exchanged
5. A client session takes place
6. The user logs off
7. The server closes the socket
Setup messages are messages that are used in phase 4 of this process.
Once a connection has been established, the server must receive packages regularly. If the timeout period elapses without any packages being received then the server will close the connection. The timeout period is determined by the server: 10 minutes is a typical value. To keep the connection alive, send any package that is not a legal server command: it is guaranteed that the command TIMER will always be illegal.
Note that the timeout only operates while the server is waiting to receive packages. It does not prevent the server from spending a long time in a calculation.
Do not send keepalives between a client read and the subsequent write: these may be interpreted as interrupts.
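One way a client might schedule keepalives is sketched below. The names and the safety margin of half the timeout are illustrative choices, not part of the protocol. The flag awaiting_reply models the rule that nothing should be sent while a request is outstanding, on the reading that "between a client read and the subsequent write" refers to the server reading a client message and writing its reply.

```python
KEEPALIVE = b"TIMER\x00"  # TIMER is guaranteed never to be a legal command


def keepalive_due(last_sent: float, now: float, timeout_seconds: float,
                  margin: float = 0.5, awaiting_reply: bool = False) -> bool:
    """Decide whether a keepalive package should be sent now.

    last_sent and now are timestamps in seconds; timeout_seconds is the
    server timeout (for example the value reported by INFO).
    """
    if awaiting_reply:
        # A package sent now might be interpreted as an interrupt.
        return False
    return now - last_sent >= timeout_seconds * margin
```

The server will answer a keepalive like any other illegal command, so the client should read and discard that reply.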
The rest of this section documents all messages. Messages
marked ‘Deleted’ were in [1] but have since been removed.
ACSCORE query l r m word
query the query to collocate with
l the left window
r the right window
m the scoring algorithm (see 1.8)
word the word to collocate
Scores a collocate.
Response:
OK s
Returns the score s.
ADICT n
n Offset
Returns the nth element of the attribute dictionary specified by the most recent LOOKUPA (1.34) call.
Response:
OK v f
where v is the value and f its frequency.
BIB ccc
ccc three character code for the text
Any enquiry for bibliographic data starts with this message.
Response:
OK t n
where t indicates the type of data available and n is the number of items. Currently the assigned types are:
0 (written type): there are two items, a title and a description
1 (spoken type): there are multiple items, the first a title and the rest descriptions of speakers
NO BIB
if there is no bibliographic data.
This system is deprecated. It assumes that the server keeps a separate file of bibliographic data. Bibliographic data ought for preference to be embedded in the individual text headers. See 5.3. The Sara client will only send BIB strings if there are no BIB strings in the dsc file.
BIBITEM ccc n
ccc three character text id
n index of desired item
Extracts the nth bibliography string for text ccc.
Response:
OK s
where s is the string.
NO BIB
if there is no such item.
Obtain a collocation score. The arguments are a text string s, a number n and a CQL query q. The server responds NO SYNTAX if the CQL query cannot be parsed. Otherwise it establishes the number of occurrences of the word s within n words of a solution to q. It replies OK l where l is this number.
CTAB s n m p
s query whose collocations are required
n left window
m right window
p pattern used to restrict the words tested
Build the collocation table. The settings used depend on the CTABOPTIONS string last sent; the client always sends a CTABOPTIONS string just before a CTAB string.
The window of a query is defined as follows: for a given hit, the window of the hit is the collection of words from the nth word to the left to the mth word to the right of the first word of the hit. The window of a query is the union of the windows of the hits.
For a given query window there is for any word (collocate) a number, called the co-frequency, that records how frequently a form of the word occurs within the given window of the hit (for forms see 1.15). The co-frequency is transformed into a score using one of a number of rules (see 1.8). A collocation table is a list of words with their scores; the words will all match the pattern p and will be the highest scoring words found; how many words are entered into the table depends on the CTABOPTIONS in force.
There is at most one collocation table at any time. It must be freed before another table is built.
Response:
OK n
where n is the number of entries in the table.
NO 0
if no entries were found.
CTABENTRY n
n Index
Return the nth member of the collocation table.
Response:
OK {w} f sc
w headword (collocate)
f co-frequency
sc score
See 1.5 for terminology.
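A reply of the form OK {w} f sc could be unpacked as sketched below. The function name is hypothetical; it is assumed that the headword is delimited by the braces shown in the response form, that f is a non-negative integer, and that sc is written as a decimal number.

```python
import re

# Braces delimit the headword, which may itself contain separators.
_CTABENTRY = re.compile(r"^OK \{(?P<word>.*)\} (?P<freq>\d+) (?P<score>\S+)$")


def parse_ctabentry(reply: str) -> tuple[str, int, float]:
    """Return (headword, co-frequency, score) from a CTABENTRY reply."""
    match = _CTABENTRY.match(reply)
    if match is None:
        raise ValueError(f"unexpected CTABENTRY reply: {reply!r}")
    return match["word"], int(match["freq"]), float(match["score"])
```

A client would typically call CTABENTRY for each index below the count returned by CTAB and then send CTABFREE.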
CTABFREE
Free the collocation table. The table must be freed before a new one is built; there can only be one collocation table at any time.
Response:
OK
CTABOPTIONS n m s k
Set options for collocations
n scoring algorithm.
0: Z score
1: MI score
m 0 if the returns are to be limited by number, 1 if they are limited by score
s limiting value in either case
k frequency below which collocates should be ignored
As explained in 1.5 a scoring rule is a formula that transforms the frequency of a collocate into a score. Typically the purpose of a score is to distinguish words that are of high frequency because they are significant collocates of the query from words that are frequent merely because they are common.
In the following table:
x denotes the co-frequency of the collocate, that is, the number of times it occurs in the window around a hit
p denotes the number of occurrences of the collocate in the corpus (its frequency)
d denotes the window size multiplied by the number of hits for the query
n denotes the total size of the corpus in words.
It may help to point out that we would expect the collocate to occur p*d/n times within the window. Therefore if x is significantly greater than p*d/n we have a good collocate. The actual formulae used are as follows:
| Z-score |
| MI-score |
For details of these and other scoring formulae the reader is referred to Statistics for Corpus Linguistics, Oakes,
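The expected co-frequency p*d/n can be computed directly from the definitions above. The two scoring functions in the sketch below are commonly cited textbook forms of the Z-score and MI-score and are given only as an assumption for illustration: this specification does not reproduce the formulae, so they may differ from what the SARA server actually computes.

```python
import math


def expected_cofrequency(p: int, d: int, n: int) -> float:
    """Number of times the collocate is expected in the window: p*d/n."""
    return p * d / n


def mi_score(x: int, p: int, d: int, n: int) -> float:
    """Assumed form: log2 of observed over expected co-frequency."""
    return math.log2(x / expected_cofrequency(p, d, n))


def z_score(x: int, p: int, d: int, n: int) -> float:
    """Assumed form: (observed - expected) over a binomial deviation."""
    expected = expected_cofrequency(p, d, n)
    return (x - expected) / math.sqrt(expected * (1 - p / n))
```

With p = 100, d = 50 and n = 10000 the collocate is expected 0.5 times in the window, so an observed co-frequency of 4 is eight times the expectation.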
In calculating a collocation table a word will be included if
(a) its frequency is greater than or equal to k
(b) either m=0 and the word is among the s collocates with highest score, or m=1 and the score of the word exceeds s.
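The inclusion rule can be sketched as a filter over candidate collocates. The function name and the tuple layout are hypothetical. It is assumed that the frequency test (a) is applied before ranking, so that for m=0 the s best words are chosen among those that pass the test; the text above does not settle this point.

```python
def select_collocates(entries: list[tuple[str, int, float]],
                      m: int, s: float, k: int) -> list[tuple[str, int, float]]:
    """Apply conditions (a) and (b) to (word, frequency, score) entries."""
    # (a) drop words whose frequency is below k
    eligible = [entry for entry in entries if entry[1] >= k]
    if m == 0:
        # (b) limited by number: keep the s highest scoring words
        ranked = sorted(eligible, key=lambda entry: entry[2], reverse=True)
        return ranked[:int(s)]
    # (b) limited by score: keep words whose score exceeds s
    return [entry for entry in eligible if entry[2] > s]
```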
Response:
OK
CTABTHIN s1 s2 n m w
s1 query to be thinned
s2 name of the new query to be created (obtained as usual using QNAME)
n left window
m right window
w the headword
This creates a new query consisting of those hits for s1 that have the collocate w within the stipulated window.
Response:
OK i j
i is the number of solutions, j the number of texts that contain a hit.
NO FILES
Could not create query s2.
DMATCH n
n location in word list created by LOOKUP
Must follow a LOOKUP (1.33). The nth member of the wordlist created by LOOKUP is found (s) together with its frequency (f) and form count (f1).
Response:
OK f s {u} f1
The returned form u is obsolete. The server no longer translates character entities in the returned word.
The first string is the
matching word in a form suitable for display while the second should be used in
any subsequent queries using the word. The two forms differ, essentially, in
having and not having character entities replaced. The display form may contain
embedded Unicode characters.
DOWNLOAD file
Get a file from the server. Currently the three files available are the corpus description file, pagefmt.txt and linefmt.txt. These files are described in Section 4.
file is one of:
1. header to download the corpus header. This file should be saved with the corpus name and extension .dsc.
2. <filename> to download filename.txt
Response:
OK
provided the file is available.
NO FILE
if it is not.
Subsequent LINE messages retrieve the file, block by block.
Note: Files for download are stored on the server as Unix
text files, with lines terminated by newline characters. When downloaded they
are automatically corrected to PC text format with lines terminated by carriage
return/newline pairs. If you must move files between a PC and a Unix machine by
ftp rather than in SARA, ensure that ftp is set to text and not to binary.
FILTER q f
q Query name
f Name of filter
Assign a filter to a query. The arguments are the query name q and the name of the filter f. Filters are used to process individual solutions before returning them to the client.
Response:
OK
The following table describes the filters available.
Trim solution so that no partial POS codes are transmitted.
Trim solution so that no partial SGML tags are transmitted.
Map characters as defined in 1.23.
Delete all old-style POS entities.
Turn single <LF> characters into <CR><LF>.
Normalise all white space so that sequences of white space characters become single spaces.
Remove all SGML markup.
Convert new-style POS markup (w-tags and c-tags) to old style.
This mechanism is only supported for compatibility. The latest Sara client software performs filtering itself.
FENTRY n
n Index
Get the nth entry in the frequency table. This must follow an FTAB (1.16) string creating a frequency table.
Response:
OK {s} k kk
s is the headword and k the frequency. kk is the number of forms.
FFREE
Free the current frequency table.
Response:
OK
FORM i s
i index of the form required
s headword
Response:
OK {s1} d
where s1 is the form and d its frequency.
NO
if the headword has no forms.
A corpus consists of words. Each word is tagged with part-of-speech and variant information[1]. Therefore for every word there exists a string of characters that is its spelling, a POS (part-of-speech) code and a variant number. Most corpora do not use variant numbers.
Each such word also has a headword. The headword is a sequence of strings of which the first is always the spelling of the headword and the remaining strings (optional) convey lexical information of any kind desired. Any word that has headword hw is called a form of hw.
The lemmatisation scheme that simply preserves the POS code and variant recorded by the indexer is called null lemmatisation. In null lemmatisation each extended word is a headword and each headword has exactly one form for each frob in which it occurs.
A scheme for deducing a headword from a word is called a lemmatisation rule. The indexer supports a number of different lemmatisation schemes: see 1.26 for descriptions. The decision which rules to apply to a particular corpus is taken by the indexer and the server may only access lemmatisation rules that were selected for the corpus when it was indexed.
Both s and s1 are extended words, that is, they are strings separated by a separator character prescribed by a lemmatisation scheme. s uses the lemmatisation scheme in force, s1 the null lemmatisation scheme. By convention the equal sign is used as the separator where possible.
FTAB n ll ul patt
n Max words in table (-1 for no limit)
ll Lower limit: exclude words with lower than this frequency (-1 for no limit)
ul Upper limit: exclude words with higher than this frequency (-1 for no limit)
patt Pattern
Build a frequency table for words matching patt with the desired parameters.
If a word limit is imposed, the most frequent words that satisfy the other constraints will be selected.
Note that FTAB always returns headwords of the current lemmatisation rule.
Response:
OK n
where n is the number of words in the table.
Use FENTRY (1.13) to get individual entries.
GET q n s gets a single solution from query q. The solution is solution number n in sequence. s denotes the SGML element to be used to bound the solution, or an integer.
The reply is OK text s i0 i1 pos ss where:
· text is the text identifier
· s is the number of the sentence containing the solution, that is, the value of the n attribute of the sentence, or the string “?” if the sentence number cannot be found;
· i0 is the offset of the solution in the returned text
· i1 is the length of the solution
· pos is the part-of-speech code of the solution (obsolete)
· ss is the solution text; it may include embedded Unicode characters.
The string s used to specify the bounding element will usually be the name of an SGML element. However, it is permissible to supply a number of elements separated by commas. In this case, the element whose last start tag before the hit is latest in the file will be used to bound the solution. This is useful if different texts use different markup conventions.
If s is an integer then the amount downloaded is the smallest collection of sentences that contain the hit and s words in front of it inclusive.
If the text is unavailable an OK response will still be generated. Its solution text will be a message that says that the text is unavailable, its text identifier will be valid, but it can be distinguished from a genuine solution by the fact that the sentence number is set to –1.
The reply is NO SOL if for some other reason the solution is not available.
The client no longer uses GET; see GETSOL (1.23).
GET1SOL txt elt att
txt Three char text id
elt element name
att attribute name
This provides a simple way to query a text for an attribute value. The first occurrence of element elt is found and then, if att is the string -, the content of the element is extracted (including any markup); otherwise the attribute with that name on the element is found and its value extracted. The value is returned.
Response:
OK s
s is the value.
GETHEAD txt pos d
txt text id
pos offset of the desired information
d initialises the depth count
GETHEAD is used to extract data from a given position for browsing.
If the server finds content at the specified offset, it reads all the content into a string s, sets variable bTag false, reads and discards any end tags following the content, adjusting the depth count accordingly but never allowing it to become negative, and sets newpos to the new offset and newd to the new depth.
If the server finds an SGML start tag, it reads the text of the tag into s. It sets variable bTag true and sets newd one greater than the depth count. It sets newpos to the offset of the end of the tag. Finally it sets jump to be the offset at which the element being opened ends.
Empty elements are treated as content. w-tags and s-tags are treated as content. s is trimmed of leading and trailing blanks, has spacing normalised and characters mapped.
Response:
OK newpos jump newd bTag s
NO TEXT
if the text cannot be found.
GETHEAD2 txt pos i0 i1
txt Text id as in GETHEAD
pos Position as in GETHEAD
i0 Character offset
i1 Characters in hit
This call is used to locate a string whose file position is known in a string returned from GETHEAD. The location cannot be deduced without such a call because of the tidying GETHEAD performs on solutions. i0 and i1 will be recovered using LOC (1.30).
txt and pos are the values used to extract the solution in GETHEAD and i0 and i1 are the coordinates of the solution returned by LOC. The server calculates the offset and length of the solution in the string (j0 and j1 say).
Response:
OK j0 j1
NO TEXT
if the text cannot be found.
Note: Now that solution tidying is performed on the client this and LOC could be removed.
GETPOS s
s Word
The argument s is a word; the server finds all possible parts of speech for it.
Response:
OK n s1...sn
where n is the number of solutions and s1...sn the different POS codes.
GETSC s n
s Corpus name
n Text number
Get the name of the nth text.
Response:
OK s1 b
where s1 is the name of the text with this number, and b is 1 if the text is available, 0 otherwise.
NO
if the corpus name is wrong.
GETSOL q n s
q Query name
n Sequence number
s Scope to be retrieved.
Gets a single solution from query q.
Response:
OK text s i0 i1 pos ss
text numerical text identifier (use GETSC to get the name)
s number of the sentence containing the solution, that is, the value of the n attribute of the sentence, or the string “?” if the sentence number cannot be found
i0 offset of the solution in the returned text
i1 length of the solution
pos part-of-speech code of the solution (obsolete)
ss solution text; it may include embedded Unicode characters.
NO SOL
This indicates an installation error.
The string s used to specify the bounding element will usually be the name of an SGML element. However, it is permissible to supply a number of elements separated by commas. In this case, the element whose last start tag before the hit is latest in the file will be used to bound the solution. This is useful if different texts use different markup conventions.
If s is an integer then the amount downloaded is the smallest collection of default scopes that contain the hit and s words in front of it inclusive.
If the text is unavailable an OK response will still be generated. Its solution text will be a message that says that the text is unavailable, its text identifier will be valid, but it can be distinguished from a genuine solution by the fact that the sentence number is set to -1.
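A GETSOL reply might be unpacked as in the following sketch. The class and function names are hypothetical. It is assumed that the six fields are separated by single spaces, that the obsolete pos field is always present as a non-empty token, and that i0 is a 0-based character offset into ss.

```python
from typing import NamedTuple


class Solution(NamedTuple):
    text_id: str
    sentence: str  # sentence number, "?" if unknown, "-1" if unavailable
    offset: int    # i0: offset of the hit within the returned text
    length: int    # i1: length of the hit
    pos: str       # obsolete part-of-speech code
    text: str      # ss: the returned text

    @property
    def hit(self) -> str:
        """The matched span inside the returned text."""
        return self.text[self.offset:self.offset + self.length]

    @property
    def unavailable(self) -> bool:
        """True when the server signals that the text is unavailable."""
        return self.sentence == "-1"


def parse_getsol(reply: str) -> Solution:
    parts = reply.split(" ", 6)
    if len(parts) != 7 or parts[0] != "OK":
        raise ValueError(f"unexpected GETSOL reply: {reply!r}")
    _, text_id, sentence, i0, i1, pos, ss = parts
    return Solution(text_id, sentence, int(i0), int(i1), pos, ss)
```

Checking unavailable before displaying a solution avoids presenting the server's "text unavailable" message as if it were corpus text.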
INFO cp
cp number of the code page that should be used to translate character references
This string allows the client and server to exchange information. It is the only message that can legally be sent before the user logs on.
The following code pages are available:
850
Windows ANSI
The cp parameter is ignored by the server, which no longer always translates character references using the Unicode code page.
Response:
OK n v sv cv nm sc
where
1. n is the server timeout value in seconds,
2. v is the version number of the corpus description file. The DOWNLOAD message may be used to obtain files whose versions have changed.
3. sv is the server version number multiplied by 1000
4. cv is the smallest acceptable client version number, multiplied by 1000
5. nm is the corpus name
6. sc is the number of registered subcorpora
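The INFO reply could be decoded as sketched here. The function names and dictionary keys are hypothetical; the sketch assumes that the corpus name contains no spaces and keeps the two version numbers as integers in thousandths, exactly as transmitted, to avoid floating point comparisons.

```python
def parse_info(reply: str) -> dict:
    """Decode 'OK n v sv cv nm sc' into named fields."""
    parts = reply.split()
    if len(parts) != 7 or parts[0] != "OK":
        raise ValueError(f"unexpected INFO reply: {reply!r}")
    _, n, v, sv, cv, nm, sc = parts
    return {
        "timeout_seconds": int(n),
        "dsc_version": v,
        "server_version_x1000": int(sv),
        "min_client_version_x1000": int(cv),
        "corpus": nm,
        "subcorpora": int(sc),
    }


def client_is_acceptable(info: dict, client_version_x1000: int) -> bool:
    """Compare the client's own version (times 1000) with the minimum."""
    return client_version_x1000 >= info["min_client_version_x1000"]
```

The timeout value returned here is also what a client needs in order to decide how often to send keepalive packages.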
LEMMDESC name
name short name of lemmatisation scheme
Response:
OK desc
desc is a one line description of the named scheme.
NO
if the name is not recognised.
LEMMINFO
Get a string describing the current lemmatisation scheme.
Response:
OK s
| Name | Description | Info | Notes | Extra parameters |
| Null | The null scheme maps each word to itself | 0 = or 1 = Pos or 2 = Pos=Sense | Pos is part of speech tag. Sense is variant number. | |
| Bnc | The bnc scheme maps each word to a headword with the same spelling and no part of speech information. | 0 = | | |
| | The | 1 = Tag | Tag is a loose lexical classification | c6 or c7 (the tagset)[2] |
| Inline | Lemmatisation from an inline attribute. | 0 = | | |
The table shows the various lemmatisation schemes available (for terminology see 1.15). In the description strings the first token is an integer prescribing the number of extra lexical fields that are allowed in an extended word. The next character is the separator used to separate fields and the final entry consists of names for the extra fields separated by the separator.
Several forms occur for null lemmatisation because different corpora mark parts of speech differently (or not at all).
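The layout of a description string can be illustrated with a small parser. The function names are hypothetical, and it is assumed that the three parts are separated by single spaces, as in the string 2 = Pos=Sense.

```python
def parse_lemm_description(desc: str) -> tuple[str, list[str]]:
    """Return (separator, extra field names) from a description string."""
    count_token, _, rest = desc.strip().partition(" ")
    count = int(count_token)
    rest = rest.strip()
    if not rest:
        raise ValueError("description string has no separator")
    separator, names = rest[0], rest[1:].strip()
    fields = names.split(separator) if names else []
    if len(fields) != count:
        raise ValueError("field count does not match the names given")
    return separator, fields


def split_extended_word(word: str, separator: str) -> list[str]:
    """Split an extended word into its spelling and extra fields."""
    return word.split(separator)
```

Once the separator is known, an extended word such as a spelling followed by a POS code can be split into its parts.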
LEMMSEL name
Set the current lemmatisation scheme.
Response:
OK
NO
if the name is not recognised.
LEMMTEST name word pos sense
name name of lemmatisation scheme
word spelling of form
pos POS code of form
sense Variant number
Get the headword of a given form.
Response:
OK headword
NO
if the name is not recognised.
Note: it would have been more consistent to pass the three parameters word, pos and sense as an extended word.
LINE
Get a block of a downloaded file (see 1.11).
Response:
OK s
where s is the next block; the first three characters should be discarded.
NO MORE
when there are no more blocks.
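Reassembling a downloaded file from successive LINE replies might look like the sketch below. The function name is hypothetical. It is assumed that the "first three characters" to be discarded are the leading OK and the space after it, and that blocks are simply concatenated; neither assumption is confirmed by the text.

```python
def collect_download(replies) -> str:
    """Join the blocks carried by successive LINE replies.

    replies is any iterable of reply strings (without the trailing null),
    in the order in which the server sent them.
    """
    blocks = []
    for reply in replies:
        if reply == "NO MORE":
            break
        if not reply.startswith("OK "):
            raise ValueError(f"unexpected LINE reply: {reply!r}")
        blocks.append(reply[3:])  # discard the first three characters
    return "".join(blocks)
```

In a real client the iterable would be produced by sending LINE repeatedly after a successful DOWNLOAD.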
LOC s n
s Query name
n Solution index
Finds the location of a solution (unlike GETSOL, which gets the text).
Response:
OK nt nc nw
where nt is the text number, nc is the character offset of the solution and nw is the word number.
NO FILE
if the query name is invalid.
Note: LOC and GETHEAD2 (1.20) were implemented to allow the text of a hit to be marked while examining a text in the tree-browse window. The normal way to recover solutions is to call GETSOL repeatedly.
LOG name pwd
name User name
pwd Password
Used to log on. Before a successful LOG the system replies NO LOGIN to any message.
Response:
OK
followed by a copyright message, if login succeeds.
NO BADLOG
otherwise.
In response to failure of the last allowed login attempt, the server may close the connection without a reply.
LOGOUT
Used to log off. There is no reply.
LOOKUP pattern
pattern Initial string to match to words
Look up a pattern in the dictionary. The pattern is just an initial string; for wildcard patterns see RLOOKUP.
Response:
OK n
where n is the number of words in the dictionary that begin with the string pattern.
NO 0
if there are no words.
DMATCH can be used to retrieve the words.
Note that folding to lower case should be left to the server.
LOOKUPA elt att
elt element name
att attribute name
Look up an attribute in an attribute dictionary, that is, a list of all values of an attribute.
Response:
OK n
where n is the number of values in the dictionary.
NO
if there are no values.
The values can be retrieved using the ADICT (1.2) call.
Pre 0.95 servers have no attribute dictionaries and
therefore always return NO.
MAXLENGTH n
n Desired limit on solution length
Tells the server the maximum length of a solution that may be returned.
Response:
OK n
where n is the limit actually set, which may be smaller than requested.
Implementation limit: If n exceeds 5000 it is set to 5000.
MOTD
Gets the Message of the Day from the server.
Response:
The response is OK s
where s is the message.
Open a saved query. The
argument is the query name. The reply is OK n if the operation is
successful and the file contains n solutions; it is NO if the file
cannot be opened.
PWD old new
old Old password
new New password
Change password.
Response:
OK
if the change is allowed.
NO
otherwise.
QNAME
Allocate a query name. There are no arguments.
Response:
OK s
where s is the name.
REMOVE s
s Query name
The argument is a query that is no longer needed. The server reclaims associated resources.
Response:
OK
RFREE
Releases the resources committed by RLOOKUP. After this call there must be no RGET call until another RLOOKUP has taken place.
Note: Actually this does nothing at the moment, but it will
be enforced in a future release.
RGET n
n Index of desired dictionary entry
This behaves just like DMATCH (1.4) but
recovers a word from the last regular expression word lookup (RLOOKUP).
Must follow a RLOOKUP (1.42). The nth member
of the wordlist created by RLOOKUP is
found (s) together with its
frequency (f) and form count (f1).
Response:
OK f s {u} f1
The returned form u is obsolete. The server no longer
translates character entities in the returned word.
RLOOKUP regexp
regexp Regular expression
Find all words matching a given regular expression. RLOOKUP stores its results in an internal buffer of limited size.
Response:
OK n
if there are n solutions.
NO 0
if there are no solutions.
NO TOOMANY k
if there are more solutions than can be stored (k gives the maximum).
Implementation limit: The largest number of matches allowed is 100000.
RGET may be used to recover individual solutions.
The following one-character regular expressions match a single character:
| c | An ordinary character (not one of the special characters discussed below) is a one-character regular expression that matches that character. |
| \c | A backslash (\) followed by any special character is a one-character regular expression that matches the special character itself. The special characters are + . * [ \ (plus sign, period, asterisk, left square bracket, and backslash, respectively), which are always special, except when they appear within square brackets ([]). |
| . | A . (period) is a one-character regular expression that matches any character. |
| [string] | A non-empty string of characters enclosed in square brackets is a one-character regular expression that matches any one character in that string. If, however, the first character of the string is a ^ (a circumflex or caret), the one-character regular expression matches any character other than the remaining characters in the string. The ^ has this special meaning only if it occurs first in the string. The - (minus) may be used to indicate a range of consecutive ASCII characters; for example, [0-9] is equivalent to [0123456789]. The - loses this special meaning if it occurs first (after an initial ^, if any) or last in the string. The ] (right square bracket) does not terminate such a string when it is the first character within it (after an initial ^, if any); that is, []a-f] matches either ] (a right square bracket) or one of the letters a through f inclusive. The characters + . * [ \ stand for themselves within such a string of characters. |
The following rules may be used to construct regular expressions:
| * | A regular expression followed by * (an asterisk) is a regular expression that matches zero or more occurrences of the one-character regular expression. |
| + | A regular expression followed by + (a plus sign) is a regular expression that matches one or more occurrences of the one-character regular expression. |
| ? | A regular expression followed by ? (a question mark) is a regular expression that matches zero or one occurrences of the one-character regular expression. |
| Concatenation | The concatenation of regular expressions is a regular expression that matches the concatenation of the strings matched by each component of the regular expression. |
| ( ) | A regular expression enclosed in parentheses matches a match for the regular expression. |
| | | The disjunction of two regular expressions matches anything that matches either of the expressions. |
The order of precedence of operators at the same parenthesis level is [ ] (character classes), then * + ? (closures), then concatenation, then | (alternation).
The regular expression evaluator can detect multiple pattern matches. Thus tea? will match both te and tea.
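The constructs described above also exist, with the same meaning, in Python's standard re module, so a client-side approximation of RLOOKUP matching can be sketched as follows. This is an illustration only: Python's syntax is considerably richer than the subset documented here, the SARA evaluator is a separate implementation, and the function name is hypothetical. It is assumed that a pattern must match a whole word.

```python
import re


def matching_words(pattern: str, words: list[str]) -> list[str]:
    """Return the words that the pattern matches in their entirety."""
    compiled = re.compile(pattern)
    return [word for word in words if compiled.fullmatch(word)]


# tea? matches both te and tea, as in the example above
print(matching_words("tea?", ["te", "tea", "teas", "t"]))
```

Patterns using features outside the documented subset (for example \d or lookahead) would be accepted by Python but should not be sent to the server.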
This message changes the
saved state of a named solution set. The format is SAVE b q where b is 0
to turn off saving and 1 to turn it on, and q is the query name.
SCADD name n1 … nk
name partition
ni values
Add class values for texts to named partition. k must not exceed the parameter g returned when the partition was defined. After all texts have been added, add an extra 0 identifier to the end of the list; until this is done the partition cannot be used.
The integer assigned to a text specifies which class of the partition the text belongs to. The names of classes are uploaded separately.
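Uploading a long list of class values in pieces might be organised as sketched below. The function name is hypothetical. The sketch assumes that successive SCADD messages append to the same partition, that the terminating 0 counts towards the limit g, and that genuine class values are never 0; none of these points is stated above.

```python
def scadd_messages(partition: str, class_values: list[int], g: int) -> list[str]:
    """Split class values into SCADD messages of at most g values each."""
    if g < 1:
        raise ValueError("g must be at least 1")
    values = list(class_values) + [0]  # the extra 0 closes the list
    return [
        " ".join(["SCADD", partition, *map(str, values[i:i + g])])
        for i in range(0, len(values), g)
    ]
```

Each message would be sent in turn, waiting for the OK reply before sending the next.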
Response:
OK
SCCOUNT query partition p
query Name of the query as assigned by QNAME
partition Partition name
p Class
Count hits that fall within the specified class of the specified partition. For compatibility with SCSET, add 1 to the class number; 0 means use each class. Note that this call does not affect the active class.
Response:
OK n nt
n Number of hits in class
nt Number of texts hit in class.
SCDEF name desc
name Name for new partition
desc Description for new partition
Add a new partition.
Response:
OK g
g is the size of the buffer allocated to allow SCADD to add class values to this partition.
NO
if a partition with that name exists.
SCENUM name p n
name partition name
p First position required
n Number of texts required
Return a range of class values for texts from a partition. p denotes the index of the first one required and n the number required.
Response:
OK t1…tn
System limitation: n may not exceed 50.
SCENUMNAMES name
name Partition name
Return a space separated list of class names for a partition.
Response:
OK s1…sn
SCLIST n
n Index of partition
Get details of a partition.
Response:
OK name nt bReg fm fs nw nu desc
name Name
nt Number of texts
bReg 1 if registered, 0 otherwise
fm Mean frequency
fs Standard deviation of frequency
nw Number of different words
nu Number of words
desc Text description
SCNAMES name s1 …
sn
name Partition name
si ith class name
Set class names for a partition
Response:
OK
SCSAVE
Save the current partition in a temporary file for registration. Registration is described in the indexer documentation.
Response:
OK
NO
if the file could not be created.
SCSET name p
name Partition to be activated
p Class number to be activated
Activate a class; from now on all queries etc. will be evaluated in texts of this class only. Use one plus the class offset; if p is 0 then the whole corpus is activated.
Response:
OK
If the partition does not exist
NO
SCWORDS name p
name Partition name
p Class number
Counts the words in all the texts of this class.
Response:
OK d
d is the number.
SOLVE q cql
q Query name
cql Query
This is the call used to solve a CQL query. The form of the call is SOLVE q cql where q is the query name and cql is the query. The client must use a query name allocated by QNAME.
Response:
NO 0
if there are no solutions.
OK n nt
if there are n solutions occurring in nt texts.
NO SYNTAX
means that the query could not be parsed.
NO SPACE
means that the server cannot save the solution because its disk is full.
NO STREAMS
means that there are not enough free streams to solve the query.
NO FILES
means that the system has more open queries than it can handle.
NO ABORT
The client interrupted the operation.
Individual solutions may be retrieved using GETSOL (or GET, which is retained for compatibility).
SOLVE is an interruptible transaction.
This call is maintained for compatibility for the sake of old clients still using CQL. The latest Sara Client uses SOLVEX instead.
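The reply forms listed for SOLVE can be told apart by their first two tokens. The following is a minimal sketch of how a client might classify them, not a description of the actual Sara client: the function name parseSolveReply and the shape of the object it returns are hypothetical, and it assumes the reply has already been read off the network as one string.

```javascript
// Sketch only: parseSolveReply and its result object are illustrative names,
// not part of the SARA protocol or of any Sara client.
function parseSolveReply(reply) {
  // Messages are null-terminated on the wire; drop a trailing NUL if present.
  var text = reply.replace(/\0$/, "").trim();
  var parts = text.split(/\s+/);
  if (parts[0] === "OK") {
    // OK n nt : n solutions occurring in nt texts.
    return {
      status: "ok",
      hits: parseInt(parts[1], 10),
      texts: parseInt(parts[2], 10)
    };
  }
  if (parts[0] === "NO") {
    // NO 0 means the query was solved but has no solutions.
    if (parts[1] === "0") {
      return { status: "empty", hits: 0, texts: 0 };
    }
    // NO SYNTAX, NO SPACE, NO STREAMS, NO FILES, NO ABORT.
    return { status: "error", reason: parts[1] || "UNKNOWN" };
  }
  throw new Error("Unexpected reply: " + text);
}
```

Treating NO 0 as a separate "empty" status, rather than as an error, lets a caller distinguish a query that simply has no hits from one that failed.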
SOLVEX q xcql
q Query name
xcql Query
This is the call used to solve an XCQL query. The form of the call is SOLVEX q xcql where q is the query name and xcql is the query. The client must use a query name allocated by QNAME. The format of XCQL queries is documented in Section 2.
Response:
NO 0
if there are no solutions.
OK n nt
if there are n solutions occurring in nt texts.
NO SYNTAX
means that the query could not be parsed.
NO SPACE
means that the server cannot save the solution because its disk is full.
NO STREAMS
means that there are not enough free streams to solve the query.
NO FILES
means that the system has more open queries than it can handle.
NO ABORT
The client interrupted the operation.
Individual solutions may be retrieved using GETSOL (or GET, which is retained for compatibility).
SOLVEX is an interruptible transaction.
SQTABLE q1 q2 n1 ... nm
q1 Initial query
q2 Thinned result
n1...nm Indices of the solutions to be retained
Thins a query to a specified list of solutions.
Response:
OK m mt
where m should be the number of solutions requested and mt will be the number of texts represented.
NO FILES
Cannot create query q2
NO 0
m=0
Implementation limit: m cannot exceed 1000.
Note: Recall also the general limit on the length of client messages mentioned earlier.
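The two limits interact: a list of 1000 indices can exceed the 6000-character message limit if the indices are large numbers. The following sketch shows a client-side check made before sending. The function name buildSqtable is hypothetical, and counting the terminating null towards the 6000 characters is a cautious assumption, since the specification does not say whether it is included.

```javascript
// Sketch only: buildSqtable is an illustrative helper, not part of any Sara client.
var MAX_SOLUTIONS = 1000;      // implementation limit on m
var MAX_MESSAGE_LENGTH = 6000; // general limit on client messages

function buildSqtable(q1, q2, indices) {
  if (indices.length > MAX_SOLUTIONS) {
    throw new RangeError("SQTABLE: at most " + MAX_SOLUTIONS + " indices allowed");
  }
  var message = ["SQTABLE", q1, q2].concat(indices).join(" ");
  // Assumption: the terminating null counts towards the limit.
  if (message.length + 1 > MAX_MESSAGE_LENGTH) {
    throw new RangeError("SQTABLE: message exceeds " + MAX_MESSAGE_LENGTH + " characters");
  }
  return message;
}
```

For example, buildSqtable("q1", "q2", [0, 5, 9]) returns the string SQTABLE q1 q2 0 5 9, to which the caller appends the terminating null before sending.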
SUBCORPUS name
name Name of corpus
Get details of a corpus.
Response:
OK nt fm fs nw nu
nt Number of texts
fm Mean frequency
fs Standard deviation of frequency
nw Number of different words
nu Number of words
Gets details of a subcorpus. SUBCORPUS sc is the required form, but sc must be the
corpus name. The reply is OK n m s where n is the number of words in the
dictionary, m the mean frequency and s the standard deviation of frequency.
SUBCORPUS is not the right name for this.
THIN name newname method size seed
name Query name
newname Name for the thinned query set
method Integer saying how thinning is to be performed
size Desired number of solutions
seed Seed for random thinning
Cut a solution set down using one of a number of criteria.
Available methods are:
0: thin to initial segment of solutions
1: thin to random subset using seed to seed the random number generator
2: thin to one solution per text. In this case the size parameter is ignored.
Response:
OK n m
where n is the number of solutions after thinning and m the number of texts represented in the result.
NO FILES
if the new query file cannot be created.
WEB
If the server advertises a web site it will return the URL.
Response:
OK url
if there is a site, where url is the web site URL
NO
otherwise
This setting will be moved to the dsc file.
The following are now withdrawn and will give the protocol error
NO DELETED
CHAR
CSCORE
CUT
DIR
GETCHEAD
GETDTD
OPEN
PURGE
SAVE
SORT
SORTFILTER
TRACE
WORDLIST
FILTER and GET are supported for compatibility but will be withdrawn in a future release; see note at the head of this document.
In this section we shall explain the structure of CQL queries.
<!ENTITY % anyq "seq|or|lemma|form|pos|phrase|word|element|pattern">
<!ELEMENT cql (%anyq;|scope)>
<!ELEMENT seq (%anyq;|neg|all)+>
<!ELEMENT or (%anyq;)+>
<!ELEMENT and (%anyq;)+>
<!ELEMENT prod (%anyq;)+>
<!ELEMENT bprod (%anyq;)+>
<!ELEMENT neg (%anyq;)>
<!ELEMENT all EMPTY>
<!ELEMENT lemma (#PCDATA)>
<!ELEMENT form (#PCDATA)>
<!ELEMENT span EMPTY>
<!ATTLIST span
  size NMTOKEN #REQUIRED>
<!ELEMENT scope ((%anyq;|prod|bprod),(element|span))>
<!ELEMENT poscode EMPTY>
<!ATTLIST poscode
  tag NMTOKEN #REQUIRED>
<!ELEMENT pos (word|all,poscode)>
<!ELEMENT phrase (#PCDATA)>
<!ATTLIST phrase
  case (yes|no) "no"
  header (yes|no) "no">
<!ELEMENT word (#PCDATA)>
<!ELEMENT element (attribute)*>
<!ATTLIST element
  end (yes|no) "no"
  name NMTOKEN #REQUIRED>
<!ELEMENT attribute (#PCDATA)>
<!ATTLIST attribute
  name NMTOKEN #REQUIRED
  var (yes|no) "no">
<!ELEMENT pattern (#PCDATA)>
Any CQL query is either a combined or atomic query or a scope query. Any query identifies a number of hits. A hit is a range of adjacent characters from the same text; for each query we shall define it by its start and end.
Atomic queries are not made up of smaller queries, though they may have components.
Before explaining the details of the cql language we remind the reader of the Sara lemmatisation system. It is important to distinguish between searching for a word, for a headword and for a form. These terms have already been defined (1.15).
The search for a word will find all usages in the corpus that have exactly the same spelling as the word in the query.
The search for a headword will find all usages in the corpus that are lemmata of the headword in the query.
The search for a form will find all usages in the corpus that have the same spelling, postag and variant number as in the query. Note that in a corpus with neither pos tagging nor lemmatisation this is just an l-word.
In each atomic query, the start and the end of a hit identify the location (text and offset) of the start and end of the token matched.
Syntax: <word [case="yes"] [header="yes"]>spelling</word>
This will find every usage with the given spelling, irrespective of case unless case="yes" is specified. It will not find usages in the headers of texts unless header="yes" is specified.
In a word query the user takes responsibility for ensuring that the word supplied is what is elsewhere called an l-word, that is, a token in whatever tokenisation scheme was used by the indexer. For example, if the character sequence can’t is tokenised as can ‘t then it will never be found in a word query.
Example: <word>godly</word>
Word queries are not used by Sara servers after 0.93, but they are indirectly used in phrase queries.
Syntax: <lemma>headword</lemma>
This will find every usage that is a lemma of the given headword. The form of the headword is prescribed by the lemmatisation scheme in force. In the null lemmatisation scheme a headword is precisely the same as a form, but other lemmatisation schemes group various forms under a single headword.
Example: <lemma>find=VERB</lemma>
Consider the text
We found many pieces of bone. The finds were deposited in the …
The query will find a hit in the first but not the second sentence.
Syntax: <form>form</form>
This will find only usages with the prescribed spelling, pos tag and variant number.
Note that the required format may be determined by a client by using the LEMMDESC query for null lemmatisation (see 1.25).
Example:
<form>rat=NN1</form>
Syntax: <pos><word>spelling</word><poscode tag="postag"/></pos>
<pos><all/><poscode tag="postag"/></pos>
This is just a simplified form query that is independent of headword/form format. In BNC2 the query
<pos><word>rat</word><poscode tag="NN1"/></pos>
is just the same as
<form>rat=NN1</form>
This query is still used by server 0.95 but will become obsolete. However the all pos form has no similar replacement.
Syntax: <pattern>regexp</pattern>
The solutions are all usages whose spellings match the given pattern. The syntax of regular expressions was given in 1.42.[3]
Syntax: <all/>
This query hits all words. It can only be used in seq queries.
Syntax: <phrase [case="yes"] [header="yes"]>chars</phrase>
The string chars is parsed into tokens and then a sequence query for the resulting words is performed. For example
<phrase case="yes">French King</phrase>
is just the same as
<seq><word case="yes">French</word><word case="yes">King</word></seq>
Note: If the corpus doesn’t use part of speech tags then the same tokenisation rules will be applied to the phrase as were applied by the indexer. This means that each sequence of tokens matching the phrase will be found provided they start at a token boundary. If part of speech tags are used then it is the responsibility of the author of the DSC file to see that the tokenisation rules give the same results as the pos tags. The only exception to this is that phrase queries automatically allow for BNC style frobs.
In a phrase query the token _ matches any word. It may not occur at the start or the end of the phrase.
It is always possible to prescribe exactly the tokenisation of a character sequence by using a seq query.
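The equivalence between a phrase query and a seq query can be sketched in code. This is only an illustration under stated assumptions: the real server tokenises the phrase using the rules in the dsc file, whereas this sketch splits on white space; rendering the wildcard token _ as an all query is a guess based on the description of all queries, not something the specification states; and the names phraseToSeq and escapeXml are hypothetical.

```javascript
// Sketch only: phraseToSeq and escapeXml are illustrative helpers.
// Escape the characters that are special in XML element content.
function escapeXml(s) {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

function phraseToSeq(chars, caseSensitive) {
  // Assumption: white-space tokenisation stands in for the dsc lex rules.
  var tokens = chars.split(/\s+/).filter(function (t) { return t !== ""; });
  if (tokens.length === 0) {
    throw new Error("empty phrase");
  }
  if (tokens[0] === "_" || tokens[tokens.length - 1] === "_") {
    throw new Error("_ may not occur at the start or the end of a phrase");
  }
  var open = caseSensitive ? '<word case="yes">' : "<word>";
  var parts = tokens.map(function (t) {
    // Assumption: the wildcard token corresponds to an all query.
    return t === "_" ? "<all/>" : open + escapeXml(t) + "</word>";
  });
  return "<seq>" + parts.join("") + "</seq>";
}
```

Calling phraseToSeq("French King", true) yields the seq query shown in the example above.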
Syntax: <element name="eee"><attribute name="aa1">val1</attribute>…<attribute name="aan">valn</attribute></element>
This will find markup with the prescribed element name and attribute/value pairs. For example, to search for
<header rend="it">
use
<element name="header"><attribute name="rend">it</attribute></element>
There are two extra attributes available:
When matching markup, the order of attributes is not significant.
Syntax: <seq>op1…opn</seq>
This finds a sequence of hits for the n queries specified.
Two hits hit1 and hit2 are in sequence provided hit1 precedes hit2 and there are no words in between them. Intervening markup is allowed.
The arguments can be any atomic or combined query.
The start of a hit is the start of the hit for op1. The end of the hit is the end of the hit for opn.
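The "in sequence" condition can be made concrete with a small sketch. It assumes a representation that the specification does not prescribe: each hit and each token is an object with a start and an end character offset (end exclusive), tokens carry a kind of "word" or "markup", and hits also name their text. The function names inSequence and seqHit are hypothetical.

```javascript
// Sketch only: the hit/token objects and function names are assumptions.
// hit:   { text: String, start: Number, end: Number }   (end exclusive)
// token: { kind: "word" | "markup", start: Number, end: Number }
function inSequence(hit1, hit2, tokens) {
  if (hit1.text !== hit2.text) return false;  // hits must be in the same text
  if (hit1.end > hit2.start) return false;    // hit1 must precede hit2
  for (var i = 0; i < tokens.length; i++) {
    var t = tokens[i];
    // Any word lying wholly between the two hits breaks the sequence;
    // intervening markup is allowed.
    if (t.kind === "word" && t.start >= hit1.end && t.end <= hit2.start) {
      return false;
    }
  }
  return true;
}

// The hit for the whole seq runs from the start of op1's hit to the end of opn's.
function seqHit(firstHit, lastHit) {
  return { text: firstHit.text, start: firstHit.start, end: lastHit.end };
}
```

With the text "the <hi>big</hi> red dog", the hits for the and big are in sequence because only markup separates them, whereas the hits for the and red are not, because big lies between them.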
Syntax: <neg>op</neg>
This finds a word that is not a hit for the query op.
Neg queries only occur in seq queries and may not occur at the start or the end.
Syntax: <or>op1 … opn</or>
This finds the first occurrence of any of the specified queries (that is, the one whose start occurs first).
The arguments can be any atomic or combined query.
The start and end of the hit are the start and end of whichever query occurred first.
Syntax: <scope>op span</scope>
This finds all solutions to op within a given span.
A span can be specified either as a number of words or as an element query. A numeric span is written <span size="n">.
Scoping a query restricts all elements of the solution to fall within a given area. The area may be specified as an SGML element or as a number of words.
Op can be any atomic or combined query, or an ordered or unordered product.
If the span is an element this will find all hits whose start is preceded by the end of a hit for the element query whose matching end token occurs after the end of the query hit.
If the span is a number n then the query finds all hits in which fewer than n words occur between the start and the end of the hit.
An ordered product is written <prod>op1…opn</prod> where op1…opn are atomic or combined queries.
An ordered product finds all hits for opn where either
The start and end of the hit are as for opn.
An unordered product is written <bprod>op1…opn</bprod> where op1…opn are atomic or combined queries.
An unordered product finds all hits for opn where either
The start and end of the hit are the start and end of whichever of the hits for opi starts latest.
The general approach to indexing attributes is to index the value of each attribute as a text string. There are a few cases, however, where special treatment is required. These special cases are explained in this section. The treatment of attribute values in SARA emerged in an extremely ad hoc fashion and it is not difficult to see in retrospect how the whole apparatus could be simplified.
Certain attributes are declared as plural. When an attribute is plural, the indexer treats its value as a string of values, decomposes the string into a list of values and indexes each of these separately.
Individual attributes are treated in a number of different ways.
CDATA      The attribute value is indexed just as it is.
CAT        The attribute value is indexed in upper case.
NUMBER     The attribute value is indexed in upper case.
NAME       The attribute value is indexed in upper case.
ID         The attribute root is indexed. The value is stored as position data.
REFID      The attribute root is indexed. The value is stored as position data.
NULL       Suppresses indexing altogether.
MULTID     The attribute root is indexed. The value is stored as position data.
MULTIDREF  The attribute root is indexed. The value is stored as position data.
The terminology is most misleading. An ID attribute can have any name. A REFID attribute can have any name but must refer to an attribute of another element called ID.
The value of an ID or similar attribute is composed from a text identifier, the root and an integer using a shifting set of rules, radices etc.
This is not the place to explain the reason for storing ID values as position data.
As of version 0.930 of the client/server software, all elements of the SARA package (that is, the indexer, the server and the client) use one common file to access information about a BNC-style corpus. The files elements.txt and header.txt, as well as cdif.txt and various parameter files only used by the indexer, may be deleted.
This new file is called the corpus description file. Its name must be the corpus name and its extension must be dsc. The following corpus names are in use:
bnc1     The main corpus
bncsam1  The old C6 sampler
bncsam2  The new C6 sampler
bnc2     World edition
The server tells the client the name of the corpus it uses as part of the INFO exchange (see 1.23).
The description file consists of a number of lines each with a keyword and an argument string. Arguments are separated by blanks. Lines beginning with the character # are treated as comments.
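The line structure just described is simple enough to read with a few lines of code. The following is a minimal sketch, not the parser used by the Sara software. The name parseDsc is hypothetical. Two behaviours are assumptions: blank lines are skipped, and keywords are folded to lower case (the examples in this section show both upper-case and lower-case keywords). The argument string is kept whole, because some statements end in free text containing blanks.

```javascript
// Sketch only: parseDsc is an illustrative helper, not part of the Sara software.
function parseDsc(text) {
  var statements = [];
  var lines = text.split(/\r?\n/);
  for (var i = 0; i < lines.length; i++) {
    var line = lines[i];
    if (line.charAt(0) === "#") continue;   // comment line
    var trimmed = line.trim();
    if (trimmed === "") continue;           // assumption: blank lines are ignored
    // First blank-delimited token is the keyword; the rest is the argument string.
    var m = /^(\S+)\s*(.*)$/.exec(trimmed);
    statements.push({
      keyword: m[1].toLowerCase(),          // assumption: keywords are case-insensitive
      args: m[2],
      line: i + 1
    });
  }
  return statements;
}
```

A caller can then split args on blanks as far as each keyword requires, leaving any trailing description intact.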
The rest of this section lists the keywords supported.
[All]
n is the version number multiplied by 100. This must be the first line of the file and may not be preceded by a comment. With software version 0.930 all header versions have been set to 1.00.
o is an option string. At present the only option strings in use are noposindex, which tells the indexer that the corpus does not contain BNC-style pos tags, and namecase, which tells the indexer to treat SGML element and attribute names as case sensitive.
Tells the client that a newline is to count as a space.
[Server]
Sets the label to be used to identify the position of a hit in the corpus. e must be an element name and a an attribute of that element. The server finds the label by searching for the most recent occurrence of the element/attribute pair before the hit and using the value of the specified attribute.
[Client]
Sets one of the search scopes used by the client in GET requests, qv. The first scope statement should be the smallest scope, called the default scope. Sara will always try to trim solutions to multiples of this scope. The last scope statement should represent a whole text. No more than three scopes are allowed. The scopes are referred to in the client as sentence, paragraph and maximum.
[All]
s1 is an attribute of the last declared element. n must be one of the following strings
Type        Use of d field
CDATA       Unused
CAT         Alternatives as in ATTRIB declaration
NUMBER      Unused
NAME        Unused
REFID       Referenced element
ID          Root
NULL        Unused
MULTID      Root
MULTIDREFS  Root
The terminology of this table is explained in Section 3.
Example:
att default CAT YES|NO Whether a default is available
Note that attributes must be listed after the element to which they belong. Attributes listed before the first ELT statement are treated as global, that is, they may belong to any element.
[All]
Declares a character entity. For example
char yacute 253
If the number n is present then it must denote the Unicode character value of the character as a hex number. Note that the indexer replaces character entities with replacement values less than 256 by the characters themselves.
If the number is not present, as in
char Ycirc
then the entity will appear in SGML entity notation.
c has the same force as in a LEX statement. If omitted the entity is treated as a character.
Declares how a character or character entity should be treated in tokenisation. name should be either a single character or a character entity already defined in a CHAR statement.
c  Meaning
c  Treat as letter character
p  Treat as punctuation
s  Treat as space
Example
lex ‘ c
This causes a quote mark to be treated as a letter character.
It is only necessary to make this declaration if the character is to be treated otherwise than in the Sara default character table.
Note that this table is used by the indexer unless explicit word division is used in a corpus (as it is in BNC). However, this table is always used to parse phrase queries. If explicit word division is in force it is important to edit this table to bring it as close as possible to the actual tokenisation system used.
[Indexer]
Declares a lower case form for a character; name1 and name2 must be characters, or character entities already declared in CHAR statements.
Example
lc Æ æ
It is not possible to change the case mapping of ASCII characters.
[All]
s is an element of type n. The values of types n are:
0  Empty element
e  Non-empty element
s  s-type element
By an s-type element is meant an element with omitted end tags where the end tag is always found immediately before the next start tag of the same kind, or at the end of the document if there are no more such start tags.
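The rule for s-type elements means that the extent of each element can be recovered from the positions of the start tags alone. The sketch below illustrates this; it assumes the start tag offsets are already known and sorted, and the name sTypeRanges is hypothetical.

```javascript
// Sketch only: sTypeRanges is an illustrative helper.
// startOffsets: sorted offsets of the start tags of one s-type element.
// documentEnd: offset of the end of the document.
function sTypeRanges(startOffsets, documentEnd) {
  var ranges = [];
  for (var i = 0; i < startOffsets.length; i++) {
    // The implied end tag sits immediately before the next start tag of the
    // same kind, or at the end of the document if there is none.
    var end = i + 1 < startOffsets.length ? startOffsets[i + 1] : documentEnd;
    ranges.push({ start: startOffsets[i], end: end });
  }
  return ranges;
}
```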
h is a string of flags as follows:
· b if the tag appears in text bodies only
· e if the tag should be preceded by a line break in page view
· i if the tag is internal. This causes the structure browser to treat this tag as content.
· h if it only appears in headers.
· t if the tag is transparent, that is, if it may occur within a word without breaking the word.
desc is a brief description of the element.
Example:
elt locale e h description of a place where speech recorded
[All]
Declares a special entity and its printed form. For example
ent alien [alien]
Special entities are always displayed in their printed form when non-SGML output is required.
[Client]
See MENU.
[Client]
List a menu to be used to solicit the value of a parameter. All menus must begin with a line such as
MENU 0 spoken_class
What follows depends on the value of the parameter k. If it is 0 then a number of item statements such as
ITEM 1 AB
ITEM 2 C1
ITEM 3 C2
ITEM 4 DE
give the actual parameter values and their menu equivalents.
If k is non-zero then it must refer to an enumeration declared in a TYPE statement. Note however that the value k=1 is reserved.
Note: as of software version 0.930 any attribute may have a menu. Declaring an empty menu for an attribute suppresses its display.
A MENU statement must directly follow the attribute it serves.
[All]
A POS statement lists a part-of-speech code. For example
POS VBD past form of the verb "BE" , i.e. WAS, WERE
[All]
A PUN statement lists a part-of-speech code that is classed as punctuation. For example
PUN PUL left bracket (i.e. ( or [ )
[Client]
n must be an integer greater than 1. The items that follow the statement are just the same as after a MENU statement. The only point of the type statement is that it saves having to list the same enumeration of items (eg ISO country codes) in several places. The integer n is used in a MENU statement to show that the items from type n should be used.
[Note that the break table in the client is still hard-coded]
Need to add linefmt etc here!
[Client]
Prevents element e with attribute a from appearing in selection menus in the client.
[All]
Sets the radix for subsequent CHAR statements to n.
[Client]
This declares element s to be a bibliography element. Bibliographic data for a text is generated by concatenating in sequence the content of each bibliography element in the text, having transformed that content using the bibfmt format (see 5.3).
[Client]
This adds a column to the list of texts. Like bibliographic data, data in the table of texts is extracted from element content and attribute values in the text itself.
t Column title
w Column width in dialog units
e Element name
a Attribute name or – to request element content
f Flags
The only flag allowed is N, which stipulates that the column is numeric; this affects how it is sorted.
[All]
Stipulates which lemmatisation schemes (see 1.26) are in use in the corpus. Include a line for each scheme supported.
The lemmatisation name is the name from the table just referred to followed by any extra parameters required for that scheme.
Example:
lemmata bnc
lemmata
[Client]
n is a help tag indicating a page in the Sara help file documenting the POS tags in use in a corpus. Use c6tags or c7tags as appropriate.
[All]
Designates a POS tag. When a word is indexed and POS indexing is switched on the value of the a attribute of its closest containing e tag is used as the POS value.
[Indexer]
Designates a lemmatisation tag. When a word is indexed and inline lemmatisation is switched on the value of the a attribute of its closest containing e tag is used as the headword.
[Client]
Adds a string to a format file. n must name the file and s is the string to be added. These files are described in more detail in Section 5.
[Client]
Stipulate a word (with optional tag) (that is, an extended word of the null lemmatisation scheme) to be used as an example if the user asks for an example of how different schemes work. The sense is always 0.
[Server, client]
Stipulate the default lemma scheme when a Sara session begins.
[Client]
Stipulate the content of the source tag used in XML listings.
A format is a set of rules for transforming an XML-encoded string into a display string. A rule is an element rule or an entity rule. All rules consist of a head and a body. The head is a white-space terminated name for the rule; the body is a number of double-quote delimited strings in which the following escapes are allowed: \n \t \\ \". Entity rules have at most one replacement string; element rules at most two.
The name of an entity rule always begins with the character &. In transforming an XML string, two transformations are applied: first the string is processed for element rules, then the result is processed for entity rules.
In the first transformation, the string is searched for strings of the form
<elt att1=val1 … attn=valn>
or
</elt>
When such a string is found
(a) if there is no element rule for elt then it is deleted
(b) if the string is of the first form and there is an element rule for elt then it is replaced by the first string in the rule body after variable substitution
(c) if the string is of the second form and the rule body has two strings then it is replaced by the second string after variable substitution.
Variable substitution means that the string %[att] is replaced by the value of attribute att.
In the second transformation the string is searched for the character & starting a token; if that token is the name of an entity rule then the name is replaced by the rule body. If the name is followed immediately by a semicolon character this will be deleted.
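The two transformations can be sketched as follows. This is an illustration of the rules as stated, not the client's implementation. All names (applyFormat, parseAttributes, substitute, own) are hypothetical, and rules are assumed to be already parsed: element rules as an object mapping an element name to an array of one or two strings, entity rules as an object mapping a name beginning with & to one string. Three points the text leaves open are settled by assumption: an end tag whose rule has only one string is deleted, a %[att] naming an absent attribute becomes the empty string, and attribute values are assumed not to contain the character >.

```javascript
// Sketch only: all function names and the rule representation are assumptions.
function own(obj, key) {
  return Object.prototype.hasOwnProperty.call(obj, key);
}

// Read att=val pairs (quoted or unquoted) from the inside of a start tag.
function parseAttributes(attrText) {
  var attrs = {};
  var re = /([A-Za-z_][\w.\-]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))/g;
  var m;
  while ((m = re.exec(attrText)) !== null) {
    attrs[m[1]] = m[2] !== undefined ? m[2] : m[3] !== undefined ? m[3] : m[4];
  }
  return attrs;
}

// Variable substitution: %[att] becomes the value of attribute att.
function substitute(template, attrs) {
  return template.replace(/%\[([^\]]+)\]/g, function (whole, att) {
    return own(attrs, att) ? attrs[att] : "";  // assumption: missing gives ""
  });
}

function applyFormat(input, elementRules, entityRules) {
  // First transformation: element rules.
  var afterElements = input.replace(/<(\/?)([A-Za-z_][\w.\-]*)([^>]*)>/g,
    function (whole, slash, name, attrText) {
      if (!own(elementRules, name)) return "";              // case (a)
      var body = elementRules[name];
      var attrs = parseAttributes(attrText);
      if (slash === "") {
        return body.length > 0 ? substitute(body[0], attrs) : "";  // case (b)
      }
      return body.length > 1 ? substitute(body[1], attrs) : "";    // case (c)
    });
  // Second transformation: entity rules; a trailing semicolon is consumed.
  return afterElements.replace(/&([A-Za-z_][\w.\-]*);?/g, function (whole, name) {
    var key = "&" + name;
    return own(entityRules, key) ? entityRules[key] : whole;
  });
}
```

Because the entity pass runs on the output of the element pass, an entity reference produced by an element rule body is also expanded, which matches the order of processing given above.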
It should be noted that although page, line and bibliographic formats are supplied in the dsc file, the client user is allowed to edit them. The client holds local copies of all page formats that it knows about in a single file called pagefmt.txt. This file has a section for each dsc file the client knows of: the section begins with the dsc name in square brackets. When the version number of the dsc file on the server exceeds that on the client the new dsc file is automatically downloaded to the client. This has the side effect that all user edits to these format strings are overwritten.
Page formats are applied to the page-per-hit solution window when it is set to custom format.
Line formats are applied to the line-per-hit solution window when it is set to custom format.
Bib formats are applied to the strings returned for the various BIB elements listed in the dsc file.
SaraScript is an object-oriented extension of JavaScript that allows access to the Sara lexicon and solution mechanism. Readers who are not acquainted with JavaScript are advised to go away and learn it.
SaraScript is a higher level version of the Sara Protocol. It is a wrapper around the string-based protocol and is implemented partly on the server and partly on the client. The simplest way to execute SaraScript is from the client, and this is the approach taken below. However, it is possible to execute SaraScript in any Windows scripting host. In that case the functions in 6.1 behave slightly differently. Instead of being functions they become methods of a Sara server object.
Consider the function list from the example in 6.8. In the Sara script window it begins:
function list(lemma,tag,ff)
{
  var i, k, kk, s;
  s=Solve("^\""+lemma+"="+tag+"=\"");
  k=s.Count(); p=0; …
but in another application it would look like this:
var ss=new ActiveXObject("SaraServer");
function list(lemma,tag,ff)
{
  var i, k, kk, s;
  s=ss.Solve("^\""+lemma+"="+tag+"=\"");
  k=s.Count(); p=0; …
When the server object is created the Sara client will be launched and the user will be prompted to log on to a corpus. At present there is no way to select the corpus in script.
Note: in reading examples, remember that JScript treats all strings as Unicode.
void ActivateClass(String part,String class)
Set the active class.
Example:
ActivateClass("type","spoken");
void OpenPartition(String path)
Open a partition.
Example:
OpenPartition("type.sc");
String Server(String s1);
This offers raw access to the Sara protocol described above. The argument s1 is sent to the server and the string returned by the server is returned.
Example:
s=Server("MOTD");
s gets set to the message of the day.
void SetLemmata(String l)
Set the lemmatisation scheme in use.
Example:
SetLemmata("bnc");
SaraSolve Solve(String)
Get a SaraSolve object that is the solution to a query. If there are no solutions an exception is signalled.
SaraWordList GetSaraWordList(String)
Get a SaraWordList object that represents the set of words in the dictionary whose headwords match the pattern supplied as the argument.
Example:
wl=GetSaraWordList("mang.*");
Int s.Count()
Returns solution size, that is, the number of hits.
Example:
SaraSolve s=Solve("hapax legomenon");
WriteLine(s.Count());
1
SaraColloc ct=s.GetCollocateTableByCount(int left,int right,int k,int rare);
Build a collocation table for the solution set where
left left window
right right window
k number of collocates wanted
rare frequency below which words are to be omitted
See CTABOPTIONS (1.8). The k best collocates are returned.
Note: at present SaraScript collocation tables always use the Z-score.
SaraColloc ct=s.GetCollocateTableByScore(int left,int right,double k,int rare);
Build a collocation table for the solution set where
left left window
right right window
k minimum score
rare frequency below which words are to be omitted
See CTABOPTIONS (1.8). Collocates scoring more than k are returned.
Note: at present SaraScript collocation tables always use the Z-score.
SaraHit h=s.GetSol(Int n)
Get the nth hit from a solution. If n is out of range an exception is raised.
void s.Release();
Dispose of an unwanted SaraSolve object.
A SaraHit object represents a single hit for a query. In what follows the term solution denotes the entire string found (the default scope is used) whereas the term hit denotes the part of the solution that matches the query.
void h.Filter(String name);
Apply a filter to the solution. The next table lists the filters available. An exception occurs if the filter is not in this table.
NOSGMLX    Strip all XML markup
NORMSPACE  Replace multiple white space characters by single spaces
CMAP       Map character entities to characters
XMLENT     Replace & and < by XML entities
int n=h.GetStart();
Get location in solution of start of hit.
int n=h.GetLength();
Get length of hit.
String s=h.GetString();
Get the solution.
String text=h.GetText();
Get the name of the text in which the solution is found.
String unit=h.GetUnit();
Get the name of the unit of the text in which the solution is found. In the BNC this will be the sentence number. In general the attribute returned is controlled by the LABEL line of the dsc file (4.3).
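The SaraSolve and SaraHit methods above combine naturally into a loop that splits each solution into left context, hit and right context. The following is a minimal sketch using only methods documented in this section (Count, GetSol, Filter, GetString, GetStart, GetLength, GetText); the function name collectHits and the shape of the objects it returns are hypothetical. It assumes, as the longer example in this document does, that hits are numbered from 0 and that GetStart and GetLength refer to the string as it stands after filtering.

```javascript
// Sketch only: collectHits is an illustrative helper, not part of SaraScript.
// s is a SaraSolve object; limit caps the number of hits examined.
function collectHits(s, limit) {
  var n = Math.min(s.Count(), limit);
  var out = [];
  for (var i = 0; i < n; i++) {
    var h = s.GetSol(i);          // assumption: hits are numbered from 0
    h.Filter("NOSGMLX");          // strip markup
    h.Filter("NORMSPACE");        // collapse runs of white space
    var str = h.GetString();
    var start = h.GetStart();
    var len = h.GetLength();
    out.push({
      text: h.GetText(),
      left: str.substring(0, start),
      keyword: str.substring(start, start + len),
      right: str.substring(start + len)
    });
  }
  return out;
}
```

The function does not call Release on s, so the caller remains responsible for disposing of the SaraSolve object once it has finished with it.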
Int ct.Count()
Return the number of collocates in the collocation table.
SaraCollocate ct.GetCollocate(int n)
Get the nth collocate from a collocation table.
void ct.Release()
Free an unwanted collocation table.
int co.frequency
This read-only property is the co-frequency of the collocate.
SaraSolution co.GetHits();
For a given collocate return the solution set, that is, the set of hits for the collocate headword that occur in the collocation window.
double co.score
This read-only property is the score of the collocate.
String co.word
This read-only property is the headword of the collocate.
Int Count()
Returns the number of entries in the word table.
SaraWordListEntry GetWordListEntry(index)
Returns a single entry from the given position in the word list.
Void Release()
Free resources associated with the word list.
This is a String-valued property giving the matching headword.
This is a long property giving the number of occurrences of forms of the headword.
This is a long property giving the number of forms of the headword.
The following example shows a function report that is used to list to file a random set of hits and some collocation information for a word.
Example:
var p,
ff;
var
r=new Array();
var
fso=new ActiveXObject("Scripting.FileSystemObject");
function
report(word,tag)
{
ff=fso.CreateTextFile("listing.xml",true,true);
header(ff);
ff.WriteLine("<listLems>");
list(word,tag,ff);
ff.WriteLine("</listLems>");
ff.Close();
}
function
list(lemma,tag,ff)
{
var i, k, kk, s;
s=Solve("^\""+lemma+"="+tag+"=\"");
k=s.Count(); p=0;
kk=k;
if (k>10) kk=10;
ff.WriteLine("<lemma
freq=\""+s.Count()+"\">");
ff.WriteLine("<form
tag=\"" + tag
+"\">"+lemma+"</form>");
ff.WriteLine("<listHits>");
for (i=0;i<kk;i++) {
t=x(k);
sol=s.GetSol(t);
ShowSol(sol);
delete sol;
}
ff.WriteLine("</listHits>");
Collocates(s,ff,1,0);
Collocates(s,ff,0,3);
ff.WriteLine("</lemma>");
s.Release();
delete s;
}
function ShowSol(sol)
{
  var hit;
  sol.Filter("CMAP");
  sol.Filter("NOSGMLX");
  sol.Filter("NORMSPACE");
  sol.Filter("XMLENT");
  hit=sol.GetString();
  // Wrap the matched span in <kw>, keeping the left and right context.
  // Note that output goes to the global ff.
  ff.WriteLine("<hit text=\""+sol.GetText()+"\" n=\""+sol.GetUnit()+"\">"+
    hit.substr(0,sol.GetStart())+
    "<kw>"+hit.substr(sol.GetStart(),sol.GetLength())+
    "</kw>"+hit.substr(sol.GetStart()+sol.GetLength())+
    "</hit>");
}
function Collocates(sol,file,left,right)
{
  // first holds the first hit for each collocate; the original reused the
  // name sol for it, which hid the parameter.
  var i, k, ct, c, ss, hits, first;
  ct=sol.GetCollocateTableByScore(left,right,10.0,2);
  k=ct.Count();
  file.WriteLine("<listCollocs left=\""+left+"\" right=\""+right+"\">");
  for (i=0;i<k;i++) {
    c=ct.GetCollocate(i);
    ss=c.word.split("=");   // separate the word form from its tag
    file.WriteLine("<colloc freq=\""+c.frequency+"\" zscore=\""+decimal(c.score)+"\"><form tag=\""+ss[1]+"\">"+ss[0]+"</form>");
    hits=c.GetHits();
    first=hits.GetSol(0);   // show only the first hit for this collocate
    ShowSol(first);
    file.WriteLine("</colloc>");
    delete first;
    hits.Release();
    delete hits;
    delete c;
  }
  file.WriteLine("</listCollocs>");
  ct.Release();
  delete ct;
}
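function x(k)
{
  var i, n;
  // With 10 hits or fewer, simply take them in order.
  if (k <= 10) return p++;
  // Otherwise start from a random position ...
  n=Math.random()*k;
  n=Math.floor(n);
  if (n==k) n=k-1;
  while (1) {
    // ... and step forward until a position not already in r is found.
    for (i=0;i<p;i++) {
      if (r[i]==n) break;
    }
    if (i==p) {
      r[p++]=n;
      return n;
    }
    n++;
    if (n==k) n=0;      // wrap round to the first hit
  }
}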
function decimal(z)
{
  // Round z down to two decimal places.
  var zz;
  zz=z*100;
  zz=Math.floor(zz);
  zz=zz/100;
  return zz;
}
function header(file)
{
  file.WriteLine("<?xml version=\"1.0\" encoding=\"UTF-16\" ?>");
  file.WriteLine("<?xml:stylesheet type=\"text/xsl\" href=\"listlems.xsl\"?>");
  // Alternative declaration and stylesheet. They are commented out because
  // an XML document may carry only one XML declaration, which must come first.
  // file.WriteLine("<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>");
  // file.WriteLine("<?xml:stylesheet type=\"text/xsl\" href=\"l:\\wb\\eng\\pep\\bnc\\plist.xsl\" ?>");
  file.WriteLine("<!DOCTYPE listLems SYSTEM \"listlems.dtd\">");
}
report("idle","VERB");
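The helper x in the example picks distinct hit positions by starting at a random position and stepping forward past positions already used. That is simple, but it can favour positions that follow ones already chosen. As a hedged alternative, not part of SARA, the sketch below draws the positions with a partial Fisher-Yates shuffle instead. The name pickDistinct and its optional rand parameter are assumptions made for the illustration.

```javascript
// Return `count` distinct integers drawn uniformly from 0..k-1, using a
// partial Fisher-Yates shuffle. `rand` is an optional source of numbers
// in [0,1); it defaults to Math.random.
function pickDistinct(k, count, rand) {
  var next = rand || Math.random;
  var pool = [];
  var i, j, tmp;
  if (count > k) count = k;          // cannot pick more than k positions
  for (i = 0; i < k; i++) pool[i] = i;
  for (i = 0; i < count; i++) {
    // Swap a randomly chosen remaining element into slot i.
    j = i + Math.floor(next() * (k - i));
    tmp = pool[i]; pool[i] = pool[j]; pool[j] = tmp;
  }
  return pool.slice(0, count);
}
```

In the example, the loop in list could then iterate over pickDistinct(k, 10) rather than calling x(k) once per hit. The sketch builds an array of k numbers, which may be unwelcome for very large solution sets.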
The program solve returns solutions to a CQL query, which must appear on the command line. Thus
solve frog
will enumerate all solutions to the query frog in the format returned by GET (minus the OK, of course). This format is documented in Section 1.13.
solve uses a connection to localhost to contact the server. It can therefore only be run on a machine that is running the server.
testnet is used to test the server. It accepts ASCII messages as documented in Section 1, which the user types at the prompt, and echoes the server's replies.
sample is a simple filter that may be used to reduce the number of solutions returned by solve. sample n reads lines from standard input and sends every nth line to standard output. Thus
solve frog|sample 3
displays every 3rd hit from the query frog.
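The filtering performed by sample can be sketched as follows. This is an illustration only, not the source of the sample program. The function name sampleLines is hypothetical, and the choice to emit lines n, 2n, 3n and so on (rather than starting with line 1) is an assumption.

```javascript
// Keep every nth line of the input: lines n, 2n, 3n, ... counting from 1.
// Whether the real sample tool starts at line 1 or at line n is assumed here.
function sampleLines(lines, n) {
  var out = [];
  if (n < 1) return out;             // nothing sensible to return for n < 1
  for (var i = n - 1; i < lines.length; i += n) {
    out.push(lines[i]);
  }
  return out;
}
```

With seven input lines and n set to 3, the sketch keeps the third and sixth lines.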
_ver 100
# =================================================
# BNC2.DSC file
#
# 1. First version AD
# 2. Modified by LB 29 apr 97: attribute descs
# 3. Modified by AD 1 may 97: restored portmanteau tags, element text
# 4. Modified by LB 10 May 97: replaced POS table for sampler
#    removed wbpsel and writim (not used in sampler)
#    lowercase all catref values
# revised for BNC2 : LB :
label s/n
scope s
scope p/u
scope bncDoc
option namecase
lemmata bnc
lemmata
lemmdef bnc
#bib sourceDesc
col Title 300 title -
col Size 100 extent -
radix 16
tagsethelp c6tags
wtag w pos
wtag c pos
# default page and line formats
fmt line u "<%[who]>: "
fmt line div "|"
fmt line div1 "|"
fmt line div2 "|"
fmt line div3 "|"
fmt line head "||"
fmt line spkr "||"
fmt line caption "|"
fmt line p "| "
fmt line item "* "
fmt line pause " ... "
fmt line unclear " [ ... ] "
fmt line event " [%[desc]] "
fmt line vocal " [%[desc]] "
fmt line kinesic " [%[desc]] "
fmt line shift "[%[new]]"
fmt page u "\n%[who] >: "
fmt page div "\n"
fmt page div1 "\n"
fmt page div2 "\n"
fmt page div3 "\n"
fmt page head "\n\n"
fmt page spkr "\n\n"
fmt page caption "\n"
fmt page p "\n "
fmt page item "\n*\t"
fmt page pause " ... "
fmt page unclear " [ ... ] "
fmt page event " [%[desc]] "
fmt page vocal " [%[desc]] "
fmt page kinesic " [%[desc]] "
fmt page shift "[%[new]]"
#------------------------------------
# First the element and attribute table
# -------------------------------------
att n CDATA 0
att rend CDATA 0
elt w e b word
att pos CDATA 0
elt c e b word
att pos CDATA 0
elt bncDoc e b an individual text in the BNC
att id CDATA 0
elt s s b sentence-like linguistic segment
att p CAT Y|N manually checked at
# not currently used
elt seg e b
att part CAT Y|N|I|M|F
att type CDATA 0
att subtype CDATA 0
elt gap 0 b point where part of source text has been omitted
att desc CDATA 0 description of the part omitted
att resp CDATA 0 identifier of the editor responsible
att reason CDATA 0 reason for making change
elt corr e b editorially regularized part of a written text
att sic CDATA 0 uncorrected form
att resp CDATA 0 identifier of the editor responsible
att reason CDATA 0 reason for making change
elt sic e b apparently erroneous transcription
att corr CDATA 0 suggested correction
att resp CDATA 0 identifier of the editor responsible
att reason CDATA 0 reason for making change
elt ptr 0 b link to a displaced element or to synchronisation point
att target CDATA 0 identifier of target
elt text e b an individual written text
att complete CAT Y|N Is sample complete (Y or N)?
att org CAT compo|seq Organization (COMPOsite or SEQential)
att decls CDATA 0 Applicable editorial declarations
elt div1 e b first-level subdivision of a written text
att complete CAT Y|N Is division complete (Y or N)?
att type CDATA 0 Function of division eg chapter, part, section, toc, etc.
att org CAT compo|seq Organization (COMPOsite or SEQential)
att decls CDATA 0 Applicable editorial declarations
elt div2 e b second-level subdivision of a written text
att complete CAT Y|N Is division complete (Y or N)?
att type CDATA 0 Function of division eg chapter, part, section, toc, etc.
att org CAT compo|seq Organization (COMPOsite or SEQential)
att decls CDATA 0 Applicable editorial declarations
elt div3 e b third-level subdivision of a written text
att complete CAT Y|N Is division complete (Y or N)?
att type CDATA 0 Function of division eg chapter, part, section, toc, etc.
att org CAT compo|seq Organization (COMPOsite or SEQential)
att decls CDATA 0 Applicable editorial declarations
elt div4 e b fourth-level subdivision of a written text
att complete CAT Y|N Is division complete (Y or N)?
att type CDATA 0 Function of division eg chapter, part, section, toc, etc.
att org CAT compo|seq Organization (COMPOsite or SEQential)
att decls CDATA 0 Applicable editorial declarations
elt head e b heading at the start of a division of a written text
att type CAT main|sub|byline|unspec (
elt caption e b floating heading or caption within a written text
att id NULL 0
att type CAT attached|display|byline|unspec (ATTACHED, DISPLAY, BYLINE, UNSPECified)
elt p e b paragraph in a written text
elt sp e b speech in a written text
att who CDATA 0
elt spkr e b speaker of a speech in a written text
elt stage e b stage direction in a written text
att id NULL 0
att type CAT m|s|a|d|x|u
elt age e h
elt occupation e h
elt dialect e h
elt particLinks e h
elt poem e b group of verse lines in a written text
elt l e b line of verse in a written text
att part CAT y|n|u Is verse line complete (Yes, No, Unknown)?
elt lg e b
att type CDATA 0
elt quote e b quotation in a written text
att type CAT inline|display|unspec (INLINE, DISPLAY, or UNSPECified)?
elt list e b list of items in a written text
att type CDATA 0
elt item e b item within a list in a written text
att id CDATA 0
elt label e b label of a list item in a written text
elt note e b note or comment of any kind
att id NULL 0
att resp CDATA 0 identifier of the editor responsible
att place CAT side|foot|end|unspec (SIDE, FOOT, END, or UNSPEC)
att type CAT ed|orig (EDitorial or ORIGinal)
elt bibl e b bibliographic reference in a written text
elt hi e b typographically highlighted phrase in a written text
elt salute e b salutation or greeting in a written text
elt lb 0 b position of line break in written source text
elt pb 0 b position of page break in written source text
elt xref e b
elt editor e h
elt publisher e h
elt body e b
elt stext e b an individual spoken text transcript
att complete CAT Y|N Is transcript complete (Y or N)?
att org CAT compo|seq Organization (COMPOsite or SEQential)
att decls MULTIDREFS SD
att decls MULTIDREFS CN
att decls MULTIDREFS QT
att decls MULTIDREFS HN
att decls MULTIDREFS TR
att decls MULTIDREFS QN
att decls MULTIDREFS SN
att decls MULTIDREFS TN
att decls MULTIDREFS RE
att decls MULTIDREFS SE
elt div e b any subdivision of a spoken text
att complete CAT Y|N Is division complete (Y or N)?
att type CDATA 0
att org CAT compo|seq Organization (COMPOsite or SEQential)
att decls MULTIDREFS RE
att decls MULTIDREFS SE
elt align e b alignment map for synchronizing overlap points in a spoken text
elt loc 0 b synchronisation point within an alignment map in a spoken text
att id ID LC
elt u e b utterance in a spoken text
att who REFID person
elt vocal 0 b non-verbal vocalization in a spoken text
att desc CDATA 0 kind of sound made
att dur NUMBER 0 duration in seconds
elt pause 0 b noticeable pause in a spoken text
att dur NUMBER 0 duration in seconds
elt shift 0 b change in voice quality in a spoken text
att new CDATA 0 voice quality after the shift
elt event 0 b non-verbal event within a spoken text
att desc CDATA 0 description of the event
att dur NUMBER 0 duration in seconds
elt unclear 0 b inaudible or incomprehensible passage in a spoken text
att who REFID person speaker identifier
att dur NUMBER 0 duration in seconds
elt trunc e b truncated form in a spoken text
elt teiHeader e h meta-information describing a corpus text
att date.updated CDATA 0
att creator CDATA 0
att status CAT new|update (NEW or UPDATEd)
att update CDATA 0
att type CAT corpus|text (CORPUS or TEXT)
att id NULL 0
elt fileDesc e h documentation of an electronic text
elt titleStmt e h title statement for a text
elt title e h title within a bibliographic entry
elt respStmt e h statement of responsibility in a bibliographic entry
elt resp e h nature of responsibility
elt ednStmt e h information about a particular edition
elt extent e h size of a corpus text
att kb NUMBER 0 Number of Kbytes
att words NUMBER 0 Number of <w> elements contained
elt publicationStmt e h publication or distribution information
elt address e h postal or other address
elt idno e h identifying number for a text
att type CDATA 0
elt availability e h availability code for file
att status CAT restrict|unknown|free
att region CDATA 0
elt sourceDesc e h description of the source for a written text
elt biblStruct e h structured bibliographic entry
att default CAT YES|NO (YES or NO)
elt analytic e h analytic bibliographic entry
elt monogr e h monographic bibliographic entry
elt author e h author in bibliographic entry
att domicile CDATA 0
att born NUMBER 0
elt edition e h edition in a bibliographic entry
elt imprint e h imprint within a bibliographic entry
elt name e h proper name of person, place etc.
att type CAT place|org|person
att id CDATA 0
elt date e h a date
att value CDATA 0
elt pubPlace e h place of publication within bibliographic entry
elt biblScope e h page range within bibliographic entry
att type CAT vol|issue|pp
elt bibNote e h note within a bibliographic entry
elt editionStmt e h details of an edition
elt distributor e h distributor of an edition
elt addrLine e h part of an address
elt para e h textual note in the header
att id CDATA 0
elt encodingDesc e h encoding description
att id CDATA 0
elt projectDesc e h background information about BNC project
elt recordingStmt e h information about an audio recording
att id ID RS
att default CAT YES|NO
elt recording e h recording details
att id NULL 0
att dur NUMBER 0 duration in seconds (?)
att date CDATA 0 date of recording
att time CDATA 0 time of day when recording made
att type CAT DAT DAT, WALKman, or UNKNOWN
elt samplingDecl e h description of sampling policy
att id ID SD
elt editorialDecl e h
elt classDecl e h description of classification scheme
elt taxonomy e h
att id CDATA 0
elt tagsDecl e h list of tags used in a particular text
elt tagUsage e h count for a particular tag in a text
att gi CDATA 0 name of tag
att occurs NUMBER 0 frequency of tag in text
elt refsDecl e h description of reference system used
att id CDATA 0
elt category e h a category-value pair
att id MULTID allava
att id MULTID alltyp
att id MULTID alltim
att id MULTID scgdom
att id MULTID sdeage
att id MULTID sdecla
att id MULTID sdesex
att id MULTID spolog
att id MULTID sporeg
#att id MULTID wbpSel
att id MULTID wmipub
att id MULTID wriaag
att id MULTID wriabp
att id MULTID wriad
att id MULTID wriaet
att id MULTID wriase
att id MULTID wriaty
att id MULTID wriaud
att id MULTID wridom
att id MULTID wrilev
att id MULTID wrimed
att id MULTID wripp
att id MULTID wrisam
att id MULTID wrista
att id MULTID writas
elt profileDesc e h additional information about a text
elt langUsage e h description of languages used in a text
elt particDesc e h description of spoken text participants
elt catDesc e h description of a category
elt creation e h information about creation of a text
att date CDATA 0
elt person e h information about a speaker
att role CAT resp|other (RESPondent or OTHER)
att sex CAT m|f|u (Male, Female or Unknown)
att soc CAT AB|C1|C2|DE|UU (AB, C1, C2, DE, UU)
att resp CDATA 0
att age CAT 0|1|2|3|4|5|X
att dialect CDATA 0
att flang CDATA 0 first language
att educ CDATA 0 educational level reached
att id ID PS
elt relation 0 h relationship between participants in a spoken text
att desc CDATA 0 Relationship type e.g. aunt, mother
att passive CDATA 0 Identifiers for other speakers related
att type CDATA 0
att active CDATA 0
att mutual CAT Y|N
elt settingDesc e h description of setting in which speech occurs
elt setting e h an individual setting in which speech occurs
att
att spont CAT L|M|H|U Low, Medium, High, or Unknown
att who CDATA 0 Identifiers of speakers at this location
att audSize CDATA 0 Audience size at this location
att id ID SE
elt locName e h name of place where speech recorded
elt locale e h description of a place where speech recorded
elt activity e h participants' activity during recording
att spont CDATA 0
elt textClass e h text classification
att default CAT YES|NO
elt classCode e h Genre or other classification of a text
att scheme CDATA 0
elt catRef 0 h category codes applicable to a text
att target MULTIDREFS allava Text availability
MENU 0
#all_availability
#ITEM 0 Information not available
#ITEM 1 Freely available worldwide
#ITEM 2 Available worldwide
#ITEM 3 Not available in
#ITEM 4 Not available in
#ITEM 5 Not available outside the European Union
att target MULTIDREFS alltyp Text type
MENU 0 all_type
ITEM 0 Information not available
ITEM 1 Spoken demographic
ITEM 2 Spoken context-governed
ITEM 3 Written books and periodicals
ITEM 4 Written-to-be-spoken
ITEM 5 Written miscellaneous
att target MULTIDREFS alltim Publication date
MENU 0 all_time
ITEM 0 unknown
ITEM 1 1960-1974
ITEM 2 1975-1984
ITEM 3 1985-1993
att target MULTIDREFS scgdom Domain
MENU 0 spoken_domain
ITEM 0 Information not available
ITEM 1 Educational/Informative
ITEM 2 Business
ITEM 3 Public/Institutional
ITEM 4 Leisure
att target MULTIDREFS sdeage Respondent age
MENU 0 spoken_age
ITEM 0 Information not available
ITEM 1 Under 15
ITEM 2 15-24
ITEM 3 25-34
ITEM 4 35-44
ITEM 5 45-59
ITEM 6 60 or over
att target MULTIDREFS sdecla Respondent social class
MENU 0 spoken_class
ITEM 0 Information not available
ITEM 1 AB
ITEM 2 C1
ITEM 3 C2
ITEM 4 DE
att target MULTIDREFS sdesex Respondent gender
MENU 0 spoken_sex
ITEM 0 Information not available
ITEM 1 Male
ITEM 2 Female
att target MULTIDREFS spolog Interaction type
MENU 0 spoken_type
ITEM 0 Information not available
ITEM 1 Monologue
ITEM 2 Dialogue
att target MULTIDREFS sporeg Region of capture
MENU 0 spoken_region
ITEM 0 Information not available
ITEM 1 South
ITEM 2
ITEM 3 North
#att target MULTIDREFS wbpSel Selection method
#MENU 0 written_selection
#ITEM 0 Information not available
#ITEM 1 Selective
#ITEM 2 Random
att target MULTIDREFS wmipub Publication status
MENU 0 written_pubstatus
ITEM 0 Information not available
ITEM 1 Published
ITEM 2 Unpublished
att target MULTIDREFS wriaag Author age band
MENU 0 written_age
ITEM 0 Information not available
ITEM 1 Under 15
ITEM 2 15-24
ITEM 3 25-34
ITEM 4 35-44
ITEM 5 45-59
ITEM 6 60 and over
att target MULTIDREFS wriabp
MENU 0
att target MULTIDREFS wriad Author domicile
MENU 2 written_domicile
#att target MULTIDREFS wriaet
#MENU 0
att target MULTIDREFS wriase Author sex
MENU 0 written_sex
ITEM 0 Information not available
ITEM 1 Male
ITEM 2 Female
ITEM 3 Mixed
ITEM 4 Unknown
att target MULTIDREFS wriaty Type of author
MENU 0 written_type
ITEM 0 Information not available
ITEM 1 Corporate
ITEM 2 Multiple
ITEM 3 Sole
ITEM 4 Unknown
att target MULTIDREFS wriaud Audience type
MENU 0 written_audience
ITEM 0 Information not available
ITEM 1 Child
ITEM 2 Teenager
ITEM 3 Adult
ITEM 4 Any
att target MULTIDREFS wridom Domain
MENU 0 written_domain
ITEM 0 Information not available
ITEM 1 Imaginative
ITEM 2 Natural and pure sciences
ITEM 3 Applied sciences
ITEM 4 Social science
ITEM 5 World affairs
ITEM 6 Commerce and finance
ITEM 7 Arts
ITEM 8 Belief and thought
ITEM 9 Leisure
att target MULTIDREFS wrilev Level of circulation
MENU 0 written_level
ITEM 0 Information not available
ITEM 1 Low
ITEM 2 Medium
ITEM 3 High
att target MULTIDREFS wrimed Medium
MENU 0 written_medium
ITEM 0 Information not available
ITEM 1 Book
ITEM 2 Periodical
ITEM 3 Misc. published
ITEM 4 Misc. unpublished
ITEM 5 To-be-spoken
att target MULTIDREFS wripp Place of publication
MENU 2 written_place
att target MULTIDREFS wrisam Sample type
MENU 0 written_sample
ITEM 0 Information not available
ITEM 1 Whole text
ITEM 2 Beginning sample
ITEM 3 Middle sample
ITEM 4 End sample
ITEM 5 Composite
att target MULTIDREFS wrista Reception status
MENU 0 written_status
ITEM 0 Information not available
ITEM 1 Low
ITEM 2 Medium
ITEM 3 High
att target MULTIDREFS writas Target audience sex
MENU 0 written_gender
ITEM 0 Information not available
ITEM 1 Male
ITEM 2 Female
ITEM 3 Mixed
ITEM 4 Unknown
#att target MULTIDREFS writim Time period
#MENU 0 written_time
#ITEM 0 Information not available
#ITEM 1 1960-1974
#ITEM 2 1975-1993
elt keywords e h descriptive keywords for topics of a text
att scheme CDATA 0
elt term e h individual term in a list of keywords
elt revisionDesc e h revision description
elt change e h change note
# ----------------------
# C6 part of speech tags
# ----------------------
POS AJ0 adjective (unmarked) (e.g. GOOD, OLD)
POS AJC comparative adjective (e.g. BETTER, OLDER)
POS AJS superlative adjective (e.g. BEST, OLDEST)
POS AT0 article (e.g. THE, A, AN)
POS AV0 adverb (unmarked) (e.g. OFTEN, WELL, LONGER, FURTHEST)
POS AVP adverb particle (e.g. UP, OFF, OUT)
POS AVQ wh-adverb (e.g. WHEN, HOW, WHY)
POS CJC coordinating conjunction (e.g. AND, OR)
POS CJS subordinating conjunction (e.g. ALTHOUGH, WHEN)
POS CJT the conjunction THAT
POS CRD cardinal numeral (e.g. 3, FIFTY-FIVE, 6609) (excluding ONE)
POS DPS possessive determiner form (e.g. YOUR, THEIR)
POS DT0 general determiner (e.g. THESE, SOME)
POS DTQ wh-determiner (e.g. WHOSE, WHICH)
POS EX0 existential THERE
POS ITJ interjection or other isolate (e.g. OH, YES, MHM)
POS NN0 noun (neutral for number) (e.g. AIRCRAFT, DATA)
POS NN1 singular noun (e.g. PENCIL, GOOSE)
POS NN2 plural noun (e.g. PENCILS, GEESE)
POS NP0 proper noun (e.g.
POS ONE the word ONE (including numeral and non-numeral uses)
POS ORD ordinal (e.g. SIXTH, 77TH, LAST)
POS PNI indefinite pronoun (e.g. NONE, EVERYTHING)
POS PNP personal pronoun (e.g. YOU, THEM, OURS)
POS PNQ wh-pronoun (e.g. WHO, WHOEVER)
POS PNX reflexive pronoun (e.g. ITSELF, OURSELVES)
POS POS the possessive (genitive) morpheme 'S or '
POS PRF the preposition OF
POS PRP preposition (except for OF) (e.g. FOR, ABOVE, TO)
POS TO0 infinitive marker (i.e. TO)
POS UNC unclassified: i.e. items which are not part of the English lexicon
POS VBB the base forms of the verb "BE", except infinitive, i.e. AM, ARE
POS VBD past form of the verb "BE", i.e. WAS, WERE
POS VBG -ing form of the verb "BE", i.e. BEING
POS VBI infinitive of the verb "BE"
POS VBN past participle of the verb "BE", i.e. BEEN
POS VBZ -s form of the verb "BE", i.e. IS, 'S
POS VDB base form of the verb "DO", except the infinitive
POS VDD past form of the verb "DO", i.e. DID
POS VDG -ing form of the verb "DO", i.e. DOING
POS VDI infinitive of the verb "DO"
POS VDN past participle of the verb "DO", i.e. DONE
POS VDZ -s form of the verb "DO", i.e. DOES
POS VHB base form of the verb "HAVE", except the infinitive
POS VHD past tense form of the verb "HAVE", i.e. HAD, 'D
POS VHG -ing form of the verb "HAVE", i.e. HAVING
POS VHI infinitive of the verb "HAVE"
POS VHN past participle of the verb "HAVE", i.e. HAD
POS VHZ -s form of the verb "HAVE", i.e. HAS, 'S
POS VM0 modal auxiliary verb (e.g. CAN, COULD, WILL, 'LL)
POS VVB base form of lexical verb, except the infinitive (e.g. TAKE, LIVE)
POS VVD past tense form of lexical verb (e.g. TOOK, LIVED)
POS VVG -ing form of lexical verb (e.g. TAKING, LIVING)
POS VVI infinitive of lexical verb
POS VVN past participle form of lexical verb (e.g. TAKEN, LIVED)
POS VVZ -s form of lexical verb (e.g. TAKES, LIVES)
POS XX0 the negative NOT or N'T
POS ZZ0 alphabetical symbol (e.g. A, B, c, d)
POS AJ0-AV0 Adjective or adverb
POS AV0-AJ0 Adjective or adverb
POS AJ0-NN1 Adjective or singular noun
POS NN1-AJ0 Adjective or singular noun
POS AJ0-VVD Adjective or past tense form
POS VVD-AJ0 Adjective or past tense form
POS AJ0-VVG Adjective or -ing form of lexical verb
POS VVG-AJ0 Adjective or -ing form of lexical verb
POS AJ0-VVN Adjective or past participle
POS VVN-AJ0 Adjective or past participle
POS AVP-PRP Adverb particle or preposition
POS PRP-AVP Adverb particle or preposition
POS AVQ-CJS wh-adverb or subordinating conjunction
POS CJS-AVQ wh-adverb or subordinating conjunction
POS CJS-PRP Subordinating conjunction or preposition
POS PRP-CJS Subordinating conjunction or preposition
POS CJT-DT0 THAT as conjunction or determiner
POS DT0-CJT THAT as conjunction or determiner
POS CRD-PNI ONE as number or pronoun
POS PNI-CRD ONE as number or pronoun
POS NN1-NP0 Singular common noun or proper noun
POS NP0-NN1 Singular common noun or proper noun
POS NN1-VVB Singular common noun or base verb form
POS VVB-NN1 Singular common noun or base verb form
POS NN1-VVG Singular common noun or -ing form of verb
POS VVG-NN1 Singular common noun or -ing form of verb
POS NN2-VVZ Plural common noun or -s form of verb
POS VVZ-NN2 Plural common noun or -s form of verb
POS VVD-VVN Past tense verb or past participle
POS VVN-VVD Past tense verb or past participle
PUN PUL left bracket ( or [
PUN PUN punctuation marks . ! , : ; - ? ...
PUN PUQ quotation mark
PUN PUR right bracket ) or ]
# ------------------
# next the character entities
# --------------------
char apos 0027
char ast 002a
char colon 003a
char comma 002c
char equals 003d
char excl 0021
char hyphen 002d
char lpar 0028
char percnt 0025
char period 002e
char plus 002b
char quest 003f
char rpar 0029
char semi 003b
char sol 002f
char brvbar 00a6
char half 00bd
char horbar 2015
char lowbar 005f
char nbsp 00a0
char shy 00ad
char emsp
char ensp
char emsp13
char emsp14
char numsp
char puncsp
char thinsp
char hairsp
char dash
char blank
char nldr
char incare
char block 2588
char uhblk 2580
char lhblk 2584
char blk14
char blk12
char blk34
char marker
char male 2642
char female 2640
char phone
char telrec
char caret
char fflig
char filig f001
char fjlig
char ffilig
char ffllig
char fllig f002
char mldr
char sext
char target
char dlcrop
char drcrop
char ulcrop
char urcrop
char Aacute 00c1
char aacute 00e1
char Abreve 0102
char abreve 0103
char Acirc 00c2
char acirc 00e2
char acute 00b4
char AElig 00c6
char aelig 00e6
char Agr 0391
char agr 03b1
char Agrave 00c0
char agrave 00e0
char Amacr 0100
char amacr 0101
char amp 0026
char Aogon 0104
char aogon 0105
char Aring 00c5
char aring 00e5
char Atilde 00c3
char atilde 00e3
char Auml 00c4
char auml 00e4
char Bgr 0392
char bgr 03b2
char bquo 2018 p
char breve 02d8
char bsol 005c
char bull 2022
char Cacute 0106
char cacute 0107
char caron 02c7
char Ccaron 010c
char ccaron 010d
char Ccedil 00c7
char ccedil 00e7
char Ccirc 0108
char ccirc 0109
char Cdot 010a
char cdot 010b
char cedil 00b8
char cent 00a2
char check
char cir
char circ 02c6
char clubs 2663
char commat 0040
char copy 00a9
char copysr
char cross
char curren 00a4
char dagger 2020
char Dagger
char darr 2193
char dblac
char Dcaron 010e
char dcaron 010f
char deg 00b0
char Dgr 0394
char dgr 03b4
char diams 2666
char die 00a8
char divide 00f7
char dollar 0024
char dot 02d9
char Dstrok
char dstrok
char dtri
char dtrif 25bc
char Eacute 00c9
char eacute 00e9
char Ecaron 011a
char ecaron 011b
char Ecirc 00ca
char ecirc 00ea
char Edot 0116
char edot 0117
char EEgr 0397
char eegr 03b7
char Egr 0395
char egr 03b5
char Egrave 00c8
char egrave 00e8
char Emacr 0112
char emacr 0113
char ENG 014a
char eng 014b
char Eogon 0118
char eogon 0119
char equo 2019 p
char ETH 00d0
char eth 00f0
char Euml 00cb
char euml 00eb
char flat
char frac12 00bd
char frac13
char frac14 00bc
char frac15
char frac16
char frac17
char frac18 215b
char frac19
char frac23
char frac25
char frac27
char frac29
char frac34 00be
char frac35
char frac37
char frac38 215c
char frac45
char frac47
char frac49
char frac56
char frac57
char frac58 215d
char frac59
char frac67
char frac78 215e
char frac79
char frac89
char ft 2032
char gacute
char Gbreve 011e
char gbreve 011f
char Gcedil 0122
char Gcirc 011c
char gcirc 011d
char Gdot 0120
char gdot 0121
char ge 2265
char Ggr 0393
char ggr 03b3
char grave 0060
char gt 003e
char Gt
char Hcirc 0124
char hcirc 0125
char hearts 2665
char hellip 2026
char Hstrok
char hstrok
char hybull
char Iacute 00cd
char iacute 00ed
char Icirc 00ce
char icirc 00ee
char Idot 0130
char iexcl 00a1
char Igr 0399
char igr 03b9
char Igrave 00cc
char igrave 00ec
char IJlig 0132