XXQ - an informal introduction
1. Searching for words
A lemma query is the simplest way to search for a word. The word is specified as the content of the lemma element. For example, to search for the word parrot, the query would be:
<lemma>parrot</lemma>
There are a couple of options you can use with lemma queries. In XXQ, as in many XML vocabularies, attributes are used to carry options. The XXQ convention is that attributes that are either on or off take the values true and false.
ignorecase tells Xaira to search for any word with the spelling supplied, irrespective of capitalisation. This option is true by default, so if you want to search only for the form exactly as spelt you must turn it off, as in:
<lemma ignorecase="false">Parrot</lemma>
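If you generate queries from a program, building them with an XML library guarantees that the closing tag matches the opening one. The following is a minimal sketch in Python, not part of Xaira: the helper name lemma_query is made up for illustration, and it simply assumes that every on/off option is written as an attribute with the value true or false, as described above.

```python
import xml.etree.ElementTree as ET

def lemma_query(word, **options):
    """Build a <lemma> query string (hypothetical helper, not a Xaira API).

    Each keyword argument is treated as an on/off option and written
    as an attribute with the value "true" or "false".
    """
    elem = ET.Element("lemma")
    for name, value in options.items():
        elem.set(name, "true" if value else "false")
    elem.text = word
    return ET.tostring(elem, encoding="unicode")

print(lemma_query("parrot"))                    # <lemma>parrot</lemma>
print(lemma_query("Parrot", ignorecase=False))  # <lemma ignorecase="false">Parrot</lemma>
```

Leaving an option out produces no attribute at all, so the default applies.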
pattern tells Xaira to treat the search value as a regular expression; it is false by default. Regular expressions are a topic in their own right and there isn’t space to discuss them here. For many purposes all you need to know is that a dot matches any single character, and that a character followed by * may occur 0 or more times, a character followed by + 1 or more times, and a character followed by ? 0 or 1 times. To search for both singular and plural parrots, the following query would work:
<lemma pattern="true">parrots?</lemma>
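If you want to experiment with these four operators before running a query, any regular expression engine will do. The sketch below uses Python's re module as a stand-in. It is an assumption that Xaira's regular expression dialect agrees with Python's on these basic operators, and that a pattern has to match a whole word rather than part of one (modelled here with re.fullmatch); the word list is invented for the example.

```python
import re

def matches_whole_word(pattern, word):
    """True if the pattern matches the entire word (assumed Xaira behaviour)."""
    return re.fullmatch(pattern, word) is not None

# Invented sample words, not taken from any corpus.
words = ["parrot", "parrots", "parrotss", "carrot"]

# ? : the preceding character occurs 0 or 1 times
print([w for w in words if matches_whole_word(r"parrots?", w)])
# ['parrot', 'parrots']

# . : any single character
print([w for w in words if matches_whole_word(r".arrot", w)])
# ['parrot', 'carrot']

# + : the preceding character occurs 1 or more times
print([w for w in words if matches_whole_word(r"parrots+", w)])
# ['parrots', 'parrotss']

# * : the preceding character occurs 0 or more times
print([w for w in words if matches_whole_word(r"parrots*", w)])
# ['parrot', 'parrots', 'parrotss']
```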
Remember that if you want to use one of the magic regular expression characters as a character in its own right, you must escape it by preceding it with \. The following two queries find the same words:
<lemma>3.14</lemma>
<lemma pattern="true">3\.14</lemma>
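To see why the backslash matters, compare an escaped and an unescaped dot. This is again a sketch using Python's re module as a stand-in for Xaira's pattern matching, with an invented list of candidate words.

```python
import re

# Invented candidate words for illustration.
candidates = ["3.14", "3x14", "3514"]

# Unescaped: the dot matches any single character, so all three match.
unescaped = [w for w in candidates if re.fullmatch(r"3.14", w)]
print(unescaped)  # ['3.14', '3x14', '3514']

# Escaped: the dot stands only for itself.
escaped = [w for w in candidates if re.fullmatch(r"3\.14", w)]
print(escaped)    # ['3.14']

# Python can add the escapes for you.
print(re.escape("3.14"))  # 3\.14
```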
Before we get into more detail about word searches, it is important to repeat that the ‘atoms’ of XXQ are words, not characters. Reading what we have just said, you might run away with the idea that
<lemma>parr</lemma>
would find the sequence parr in the word parrot, but it will not. This sometimes causes confusion, and the confusion is compounded because:
(a) different languages divide text into words in different ways. In some languages such as Thai the task of dividing a string of characters into words requires use of a special dictionary.
(b) the person who builds a corpus may decide to override the normal rules for word breaking using special tags. In the British National Corpus (BNC), for example, where words are tagged to show their part of speech, words that are run together orthographically are indexed separately: the word can’t, for example, is indexed as ca and n’t to show that ca (=can) is a verb and n’t (=not) is the negative particle. So
<lemma>can’t</lemma>
won’t find any words at all: you would have to look for a sequence of
<lemma>ca</lemma>
<lemma>n’t</lemma>
using techniques we shall describe later. We’ll talk about this a bit more in the section on searching for phrases.
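The idea that words, not characters, are the atoms can be pictured with a toy model. The sketch below is only an illustration in Python and says nothing about how Xaira stores its index: the token list is a made-up example of BNC-style word breaking (written with plain ASCII apostrophes), and the function names lemma_lookup and find_sequence are hypothetical.

```python
def lemma_lookup(query, word_index):
    """A plain lemma query succeeds only if the whole query is an indexed word."""
    return query in word_index

def find_sequence(tokens, sequence):
    """Return the start positions where `sequence` occurs as consecutive tokens."""
    n = len(sequence)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == sequence]

# Made-up tokenisation of "the parrot can't talk" with can't split in two.
tokens = ["the", "parrot", "ca", "n't", "talk"]
word_index = set(tokens)

print(lemma_lookup("parrot", word_index))    # True
print(lemma_lookup("parr", word_index))      # False: parr is not a word
print(lemma_lookup("can't", word_index))     # False: indexed as ca + n't
print(find_sequence(tokens, ["ca", "n't"]))  # [2]
```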