Trecx:DesignIssues

From Bodington Wiki

Back to TReCX

General

It is recommended that we adopt a plug-in architecture. We must build in unit and load testing from the start.

We should use a command line test suite (based on Jakarta Commons HttpClient, see http://jakarta.apache.org/commons/httpclient/) to fire messages at the services and JUnit to perform internal testing.

Presentation layer

It is recommended that we use a common presentation layer framework. Obvious examples are

  • Struts
  • Java Server Faces
  • Velocity
  • JSP

We will use JSP here.

Type of Web Service

We feel that the system should be based on a RESTian approach for simplicity's sake.

Tracking Store

There are two approaches here

  • push - e-learning application posts messages to the tracking store off its own bat
  • pull - tracking store asks for messages every so often

The push approach means each e-learning application can be configured locally. If a reconfiguration is required only one service has to be taken down. If configuration were to be done from the tracking store then the whole datastore would have to go offline and each of the clients would have to buffer their data (and maybe serialise) until the tracking store was back in the land of the living.

In the first implementation, the tracking store will not include any access control on who can see the events. This could be achieved by using Shibboleth (http://shibboleth.internet2.edu/) or placing a GuanXi guard (http://guanxi.sourceforge.net/) in front of the store.

Admin Interface

Browse Data

It would be nice to have an interface for us to browse the contents of the tracking store. I guess this could be achieved by using PostgreSQL's admin interface (and the Hypersonic equivalent)?

Data Protection

In order to comply with data protection legislation we need an admin interface to the remote tracking store. This could be achieved via the SQL prompt.

Reporting Service

The reporting application will send an HTTP GET 'search' request to the tracking store / e-learning application.

The GET request will include one or more search terms formulated via an API. We probably want to be able to search on any of the fields in the 'event table' but we will also want to retrieve 'summaries' of data. (See Functional Requirements document.)
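A sketch of how such a GET request might be assembled. The parameter names (from, to, userID) and the 'search' path are illustrative assumptions, since the API is defined later:

```java
// Hypothetical builder for the reporting application's GET 'search'
// request. Null arguments are simply omitted from the query string.
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SearchUrlBuilder {
    /** Builds a search URL against the given tracking store base URL. */
    public static String build(String storeUrl, String from, String to, String userID)
            throws UnsupportedEncodingException {
        StringBuilder sb = new StringBuilder(storeUrl).append("search");
        char sep = '?';
        if (from != null) {
            sb.append(sep).append("from=").append(URLEncoder.encode(from, "UTF-8"));
            sep = '&';
        }
        if (to != null) {
            sb.append(sep).append("to=").append(URLEncoder.encode(to, "UTF-8"));
            sep = '&';
        }
        if (userID != null) {
            sb.append(sep).append("userID=").append(URLEncoder.encode(userID, "UTF-8"));
        }
        return sb.toString();
    }
}
```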

The reporting service API will be defined later but we have a few design considerations:

  • we will return the results of the query using the same schema as that used to post the messages in the first place
  • we want the reporting application to have to do the minimum amount of results processing (over and above the 'merge' needed to collate the distributed data)
    • we assume that data is always chronologically ordered, but this may be affected by the GROUP BY clause
    • we would want to be able to specify that the data be grouped by a given field (or list of fields)
  • we need to provide a mechanism to map between usernames if, for example, a single user has different identifiers in the target e-learning applications. This is out of scope here.
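Although such mapping is out of scope for the first release, it could be as simple as a look-up table keyed on system plus local identifier. All identifiers in this sketch are hypothetical:

```java
// Hypothetical look-up table relating per-system usernames to a
// canonical person identifier, as the reporting application might
// use when collating distributed results.
import java.util.HashMap;
import java.util.Map;

public class UserIdMapper {
    private final Map<String, String> toCanonical = new HashMap<String, String>();

    /** Registers systemID + local username -> canonical identifier. */
    public void addMapping(String systemID, String localId, String canonicalId) {
        toCanonical.put(systemID + "|" + localId, canonicalId);
    }

    /** Returns the canonical id, or the local id unchanged if unmapped. */
    public String canonicalise(String systemID, String localId) {
        String c = toCanonical.get(systemID + "|" + localId);
        return c != null ? c : localId;
    }
}
```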

As mentioned earlier, it is envisaged that in the future the reporting service should be given an authentication and authorisation layer such as Shibboleth.

Security

We have been advised not to get too hung up about security! We envisage an optional SSL / certificate based approach which could easily be added after the end of the project.

Events Table

Brainstorming Whiteboard

These are the initial scribblings!

Image:Whiteboard.jpg

Example Events

The above diagram has informed our discussion. After much deliberation we have finally come to the conclusion that one 'user event' can actually trigger a series of system events.

Example Event 1 (Forum Post)

For example, consider a user who posts a message to a forum in response to an already existing thread. We need to relate the actual message to its containing thread and to the actual instance of the forum. (A tutor may set up a forum for his Noodle Making 101 course; over a term, this area may contain numerous threads, all about NM101.) The above user event translates to 3 distinct system events

  1. create forum post
  2. update forum thread
  3. update forum instance

It is up to the tool doing the reporting to construct these 3 events. The reporting application will still function if only the 'most significant' event is reported (ie, create forum post) but will not be able to relate back to the containing 'object'.

Example Event 2 (Course enrolment)

Another example, from a course booking system - user books on a course:

  • create course registration
  • update course instance

Example Event 3 (Course un-enrolment)

User un-enrols from a course

  • delete course registration
  • update course instance

Event Structure

So the user event will look like:

(NOTE: the location of the schema under version control is right here).

The types are described here as per XML Schema.

  • timestamp (xs:dateTime) the timestamp of the event.
  • systemID (xs:string) identifier of the system that produced the event. In the short term this is likely to be a domain name, eg, weblearn.ox.ac.uk or aspire.ox.ac.uk.
  • userID - partitioned into
    • "content" is the value of the ID itself.
    • type (xs:string) the restricted enumeration of values is:
      • TRANSIENT (for anonymous (not logged in) users)
      • SYSTEM for events generated by the system (note there will not be a username in this case)
      • INSTITUTION_ID
      • UNIQUE_LEARNER_ID
      • USER_NAME
  • service (xs:string) an extensible list. However, to kick off, it is likely that Wiki, Chat, Blog, Forum, etc will be the initial "core" values.
  • domain (xs:string) the restricted enumeration of values are:
    • USER
    • RESOURCE
    • AUTHENTICATION
  • message a pertinent message which can be displayed to a human to give an idea of what the event is all about; it may be the first line of a forum post or chat message, the title of a wiki page that was created or edited, and so on. This message should be capable of carrying Unicode/UTF-8 characters. Note that the message tag should contain the 'xml:lang' attribute to indicate what language is being used; this is stored in the database.
    • "content" is the message itself.
    • language (xml:lang) is an attribute which describes the language used for the message.
  • applicationID (xs:string) identifier of the resource within the system that produced the event: a given instance of a tool within the above. Again, this will probably be a URL (or URI), although URLs only work for some applications; lots of applications don't have good URLs (JSF, for example).
  • operation (xs:string) taking our cue from the database world, our operation types would be from the restricted enumeration of:
    • CREATE
    • READ
    • UPDATE
    • DELETE
  • servicePart (xs:string) an (optional) sub-part of the service element. Examples would include a thread or a post within a forum, post within a blog, etc.
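The structure above can be sketched as a plain Java class. The enumerations mirror the restricted value lists; the XML schema remains the authoritative definition, and the field types here are simplified (eg, the timestamp is held as a string):

```java
// Plain-Java rendering of the event structure, as a sanity check on
// the field list. Field names mirror the schema elements.
public class TrackingEvent {
    public enum UserIdType { TRANSIENT, SYSTEM, INSTITUTION_ID, UNIQUE_LEARNER_ID, USER_NAME }
    public enum Domain { USER, RESOURCE, AUTHENTICATION }
    public enum Operation { CREATE, READ, UPDATE, DELETE }

    public String timestamp;      // xs:dateTime, eg, 2006-06-09T14:29:36
    public String systemID;       // eg, weblearn.ox.ac.uk
    public String userID;         // the ID "content" itself
    public UserIdType userIDType; // the 'type' attribute
    public String service;        // eg, Wiki, Chat, Blog, Forum
    public String servicePart;    // optional, eg, thread3.post26
    public Domain domain;
    public Operation operation;
    public String message;        // human-readable summary
    public String messageLang;    // xml:lang of the message
    public String applicationID;  // eg, /biog/year04/forum/
}
```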

A TRANSIENT userID is used when a temporary userId has been assigned by the e-learning system, for example, when a user has a 'session' but has not yet logged in (cf. Plone, Bodington). A non-transient userId is some sort of 'permanent' username. We do not intend to connect transient userIds to fixed userIds. Imagine the following scenario: Ming The Merciless accesses several 'open' resources in Plone as an anonymous user and then logs in (using username 'minger') and accesses several more. If Ming The Merciless's tutor does a search on the username 'minger' from within the reporting application they will only see data pertaining to resources accessed after login; all the resources visited when Ming was not logged in will be absent.

One thing we have not considered or taken into account here is event duration. If we want to see how long someone spent in a resource do we have to "walk" the sequence of events, or should we capture an explicit "end" timestamp?

Notes and Queries

This section lists future enhancements to the system.

  • What do we do about changes in systemIDs? Eg, what happens when, say, MVN Forum 'goes live', ie, changes from http://mvnforum.ltg.oucs.ox.ac.uk/ to http://mvnforum.ox.ac.uk ? Do the applicationID and the systemID both change? And how would we get at the old tracking data? The solution would be to use a resolver in the database; in other words, when the message arrives it must be processed to extract the URL/URIs for storage in a resolver table. This will be left to version 2 of the software!
  • A user has the same username in each system. Indeed it is imagined that before too long identity will itself be externalized into a service independent of any one e-Learning tool. However, we will include provision in the reporting application for a look-up table to be accessed so that a mapping may be performed to relate accounts for the same person which use different identifiers.
  • We also do not cater for internationalisation of the 'message' that is sent with the event; we store the language and the string as it is sent. As we have the language string we could perform automatic translation; however, we envisage that it is the responsibility of the e-learning application to supply internationalised strings if requested.


Example Implementation

In the schema, eventList has been declared as nillable. This means that if there are no events (intended to be used as the result of a query when there are no matching events to return) the eventList can just be declared as null (or xsi:nil="true" to be precise!). A non-null eventList should thus always contain at least one event. (NOTE: publishing clients are not intended to send null eventLists; if there are no events to send, do not send anything at all!).
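Under that convention, a query matching no events might be answered with something like the following (assuming the same schema location as the examples below):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<eventList xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:noNamespaceSchemaLocation="../schema/schema.xsd"
   xsi:nil="true"/>
```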

Example messages:

<?xml version="1.0" encoding="UTF-8"?>
<eventList xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:noNamespaceSchemaLocation="../schema/schema.xsd">
   <event>
       <applicationID>/biog/year04/forum/</applicationID>
       <domain>AUTHENTICATION</domain>
       <operation>CREATE</operation>
       <service>Forum</service>
       <systemID>http://vle.ox.ac.uk/</systemID>
       <timestamp>2006-06-09T14:29:36</timestamp>
       <userID type="INSTITUTION_ID">lxsocon</userID>
   </event>
   <event>
       <applicationID>/medsci/year01/wiki/</applicationID>
       <domain>RESOURCE</domain>
       <message xml:lang="en">Corrected spelling mistakes.</message>
       <operation>UPDATE</operation>
       <service>Wiki</service>
       <systemID>http://vle.ox.ac.uk/</systemID>
       <timestamp>2006-06-09T14:29:37</timestamp>
       <userID type="INSTITUTION_ID">jneeskens</userID>
   </event>
   <event>
       <applicationID>/library/staff/internal/</applicationID>
       <domain>RESOURCE</domain>
       <message xml:lang="en">Removed inflammatory feedback.</message>
       <operation>DELETE</operation>
       <service>Blog</service>
       <servicePart>entry20060609d</servicePart>
       <systemID>http://vle.ox.ac.uk/</systemID>
       <timestamp>2006-06-09T14:29:37</timestamp>
       <userID type="INSTITUTION_ID">bookworm</userID>
   </event>
</eventList>


<?xml version="1.0" encoding="UTF-8"?>
<eventList xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:noNamespaceSchemaLocation="../schema/schema.xsd">
   <event>
       <applicationID>/biog/year04/forum_y/</applicationID>
       <domain>RESOURCE</domain>
       <message xml:lang="en">I disagree in the strongest terms.</message>
       <operation>CREATE</operation>
       <service>Forum</service>
       <servicePart>thread3.post26</servicePart>
       <systemID>http://vle.ox.ac.uk/</systemID>
       <timestamp>2006-06-09T14:29:36</timestamp>
       <userID type="INSTITUTION_ID">bmoore</userID>
   </event>
   <event>
       <applicationID>/biog/year04/forum_y/</applicationID>
       <domain>RESOURCE</domain>
       <operation>UPDATE</operation>
       <service>Forum</service>
       <servicePart>thread3</servicePart>
       <systemID>http://vle.ox.ac.uk/</systemID>
       <timestamp>2006-06-09T14:29:36</timestamp>
       <userID type="INSTITUTION_ID">bmoore</userID>
   </event>
   <event>
       <applicationID>/biog/year04/forum/</applicationID>
       <domain>RESOURCE</domain>
       <operation>UPDATE</operation>
       <service>Forum</service>
       <systemID>http://vle.ox.ac.uk/</systemID>
       <timestamp>2006-06-09T14:29:36</timestamp>
       <userID type="INSTITUTION_ID">bmoore</userID>
   </event>
   <event>
       <applicationID>/medsci/year01/wiki/</applicationID>
       <domain>RESOURCE</domain>
       <message xml:lang="en">Corrected spelling mistakes.</message>
       <operation>UPDATE</operation>
       <service>Wiki</service>
       <systemID>http://vle.ox.ac.uk/</systemID>
       <timestamp>2006-06-09T14:35:12</timestamp>
       <userID type="INSTITUTION_ID">jneeskens</userID>
   </event>
   <event>
       <applicationID>/library/staff/internal/</applicationID>
       <domain>RESOURCE</domain>
       <message xml:lang="en">Removed inflammatory feedback.</message>
       <operation>DELETE</operation>
       <service>Blog</service>
       <systemID>http://vle.ox.ac.uk/</systemID>
       <timestamp>2006-06-09T14:41:21</timestamp>
       <userID type="INSTITUTION_ID">bookworm_king</userID>
   </event>
   <event>
       <applicationID>/staff/register/</applicationID>
       <domain>USER</domain>
       <message xml:lang="en">New member of staff: Diego Maradona.</message>
       <operation>CREATE</operation>
       <service>Registration</service>
       <systemID>http://library.ox.ac.uk/</systemID>
       <timestamp>2006-06-09T14:46:09</timestamp>
       <userID type="INSTITUTION_ID">bookworm_king</userID>
   </event>
</eventList>


Database

Platform

Initial Thoughts

It was initially felt that due to the XML nature of this toolkit, we would use the eXist database (see http://exist.sourceforge.net/). It was thought that this would work well and we would be able to use XQuery as the search interface language.

Considerations

Due to our requirement to present a search API as part of the toolkit, if we adopt the above approach we will need to provide a library to translate search API queries into the equivalent XQuery. This is thought to be far too complex for the scope of this project.

So on reflection, our decision to use eXist appears flawed. A better approach is thought to be to use an SQL database and tailor the search API to be flexible enough to allow all desirable queries to be serviced, also using the power of SQL to do things like grouping and sorting.

Current Thinking

We will probably use a relational database at the back-end of the tracking store. We will target

  • HSQLDB for the development environment and 'quickstart' WAR file.
  • PostgreSQL, which is the standard database used by the Bodington VLE.

We initially used Apache Derby, another pure-Java relational database (which Sun bundles with the NetBeans IDE), but switched to HSQLDB because Derby did not perform as we had hoped. We also like HSQLDB's GUI interface.

We will adhere to good practice and use a database abstraction layer.

We considered using HyperJAXB2 (https://hyperjaxb2.dev.java.net/) to interface the XML messages to the database via Hibernate, but the software is not yet mature enough to be usable.

In the end we settled upon JAXB plus JDO, with HSQLDB being used for development.



Database Searches

The tracking store is searched using JDOQL; however, the query interface onto which the REST interface maps is very simple. For the first release there are three types of query

  • search for events between start and end dates (either or both dates can be null)
  • search for events for a given userID between start and end dates (either or both dates can be null)
  • search for events for a given application between start and end dates (either or both dates can be null)

Future work on the system could provide more complex searching.
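A sketch of how those three query types might be turned into a JDOQL filter string inside the tracking store. Field names follow the event structure above; the parameter handling (null means "omit the clause") is an illustrative assumption:

```java
// Hypothetical helper building a JDOQL filter from the simple query
// parameters. Any argument may be null, in which case that clause is
// omitted; an empty string means "all events".
public class JdoqlFilterBuilder {
    public static String build(String userID, String applicationID,
                               String start, String end) {
        StringBuilder f = new StringBuilder();
        if (userID != null) append(f, "userID == \"" + userID + "\"");
        if (applicationID != null) append(f, "applicationID == \"" + applicationID + "\"");
        if (start != null) append(f, "timestamp >= \"" + start + "\"");
        if (end != null) append(f, "timestamp <= \"" + end + "\"");
        return f.toString();
    }

    private static void append(StringBuilder f, String clause) {
        if (f.length() > 0) f.append(" && ");
        f.append(clause);
    }
}
```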


Messaging

There are a few issues to consider:

  1. How many messages per second should we support collecting?
  2. How hard should we work to ensure that messages aren't dropped?
  3. How long do we have from a message being generated to when it should appear in the search results?

As discussed above, to keep things simple, the e-learning applications will post their tracking data to the tracking store. To address the first point, we feel that each service should have its own configuration (see below).

The second point above raises the issue of reliability: how do we know that the tracking store has correctly processed our posting? The answer is that we don't; all we have to go on is the status code of the HTTP response.

We should steer away from listeners - having to write listeners on each app that generates tracking events seems to raise the bar for implementors. This approach would also raise configuration spectres: if there are n apps co-operating, why should one have to configure n(n-1) listeners when one can configure n? The UNIX syslog style, where everything is accepted and apps send their events in themselves (possibly doing queuing and bulk transfers), is a good model.

This is accepted in many large event tracking systems - a central event sink which provides storage, notification and similar services. See, for example, OpenView and other network management systems; syslog as already described; MS Operations Manager [a godsend as Windows has so many event sources and no syslog equivalent]; Nagios. It is far easier to configure; it is far easier to amend that configuration as the number of co-operating systems increases; and it is far easier to query and report on the entire system.

Messaging System Configuration

The applications which dispatch tracking information to the tracking store need to be configurable to reflect the location of the tracking store, bandwidth, system busyness and optimal message size.

The reporting app also needs to know what apps to query.

Configuring the Trackees

One possibility is to use WS-Notification to configure the applications which send tracking data to the tracking store. This was rejected as it is felt to be too heavyweight for the task in hand.

We configure the trackees via a properties file. This file holds the following:

  1. IP address of tracking store
  2. time between flushing message buffer (BatchInterval)
  3. size of buffer (BatchSize)

The idea is that messages are sent every 'so many' seconds unless the buffer fills first, in which case it is flushed and the 'clock' reset.
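That buffering rule can be sketched as follows. The Sender interface stands in for the HTTP POST to the tracking store, and the current time is passed in explicitly (rather than read from the system clock) so the logic is testable; both are illustrative simplifications:

```java
// Minimal sketch of the trackee-side buffer: flush when BatchSize
// events have accumulated or when BatchInterval has elapsed,
// whichever comes first.
import java.util.ArrayList;
import java.util.List;

public class EventBuffer {
    public interface Sender { void send(List<String> batch); }

    private final int batchSize;
    private final long batchIntervalMillis;
    private final Sender sender;
    private final List<String> buffer = new ArrayList<String>();
    private long lastFlush;

    public EventBuffer(int batchSize, long batchIntervalMillis, long now, Sender sender) {
        this.batchSize = batchSize;
        this.batchIntervalMillis = batchIntervalMillis;
        this.sender = sender;
        this.lastFlush = now;
    }

    public synchronized void add(String event, long now) {
        buffer.add(event);
        // Flush if the buffer is full or the interval has elapsed,
        // then reset the 'clock'.
        if (buffer.size() >= batchSize || now - lastFlush >= batchIntervalMillis) {
            sender.send(new ArrayList<String>(buffer));
            buffer.clear();
            lastFlush = now;
        }
    }
}
```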


Configuring the Tracking Store

A servlet filter protects the tracking store. The filter is configured via initialisation parameters in the web.xml file; the filter is told to store events only from the listed sources.

Another instance of this same filter is also used to restrict queries to those that originate from the specified reporting applications.
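The core allow-list check such a filter performs might look like this, extracted as a pure method. The comma-separated init-param format and the addresses are assumptions for illustration:

```java
// Hypothetical allow-list check behind the servlet filter: only
// requests from configured source addresses are accepted. In the
// real filter the source list would come from a web.xml init-param.
import java.util.HashSet;
import java.util.Set;

public class SourceAllowList {
    private final Set<String> allowed = new HashSet<String>();

    /** sources is a comma-separated init-param value, eg "163.1.2.3,163.1.2.4". */
    public SourceAllowList(String sources) {
        for (String s : sources.split(",")) {
            allowed.add(s.trim());
        }
    }

    public boolean isAllowed(String remoteAddr) {
        return allowed.contains(remoteAddr);
    }
}
```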

Validation can be enabled in the tracking store for debugging. Log4J can optionally be used.

Configuring the Reporting Application

This is configured by entries in a properties file specifying the URL of the tracking stores to be searched.

Notes and Queries

The reporting user interface must be able to present a list of possible service types to the user (blog, wiki, forum etc). It would be good to have a vocabulary of service types, so would this be better served by stipulating that the 'service' attribute has a fixed list of possibilities defined as part of the schema?

But then that is not flexible, as one cannot add an arbitrary service to the configuration, implying that the attribute should just be any old string ('PCDATA')?

Messaging Format

Each message will contain one or more events. Would we want to compress the message? Could we do this at the HTTP level with mod_gzip?
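If compression were done in the application rather than transparently at the HTTP level, the mechanics would be the standard GZIP round trip from java.util.zip; this sketch just shows that the XML survives the round trip:

```java
// GZIP round trip for an event message. With mod_gzip / HTTP
// Content-Encoding this would be invisible to application code.
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class MessageCompressor {
    public static byte[] compress(String message) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        GZIPOutputStream gzip = new GZIPOutputStream(bytes);
        gzip.write(message.getBytes("UTF-8"));
        gzip.close();
        return bytes.toByteArray();
    }

    public static String decompress(byte[] data) throws IOException {
        GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(data));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = gzip.read(buf)) > 0) {
            out.write(buf, 0, n);
        }
        return out.toString("UTF-8");
    }
}
```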


Back to TReCX