Main Page
Welcome to the EthnoER wiki pages
These pages allow discussion of the models for online annotation and browsing of ethnographic media corpora.
Go to the home page of this project
This project has been running since 2006 when it was originally funded by an ARC Eresearch grant. Since then development of the system for presenting interlinear text and media online (EOPAS) with the streaming server Annodex has continued as discussed elsewhere on this page.
Presentations from the two conferences run during the funding period can be found on the community portal page which is where there is a list of sample data files in the test directory. There is also a more detailed discussion of the EthnoER online presentation and annotation system (EOPAS) on that page (members only).
Streaming IGT
Streaming media and Interlinear Glossed Text
October 2009: Things have moved on as we predicted and the streaming server discussed below has now been widely adopted. Firefox (version 3.5 and later) needs no plug-ins to play the media from this server, and we can make simple HTML calls to points within large media files. A working demo of this is available here.
Technologies change! October 2008
While the underlying model of EOPAS still works, the streaming server has developed and is now much more robust than when we experimented with it in 2005/2006. The problem is that we have not been able to update our server and so the demo pages listed below are unlikely to work in the latest browsers. BUT! See the metavid pages for an example of what we will be doing next (when our next grant application is approved).
Download the Annodex plugins for Firefox
2008-02-07
Annodex currently only works in Firefox, so, in order to play Annodex media you will need to download the plugin for your platform here.
Linux, 718 Kb
Macintosh (only for non-Intel Macs), 386 Kb
Intel Macintosh, 350 Kb
Windows, 360 Kb
Example of HTML calling the Annodex server
2007-11-27
This will allow us to place any content online and to have it call media without having to download large media files. A dictionary implementation using this method is now available here.
To get to this point we have had to work through a number of issues, as discussed elsewhere on this wiki. The idea is that there could be several Annodex servers, perhaps associated with linguistic archiving projects. Selected media files can be placed on a server (they have to be transcoded to Ogg format; thanks to Stuart Hungerford and Jonathan McCabe of APAC for their help in setting up the PARADISEC Annodex server) and are then available to be called in the way seen here. If you look at the HTML coding you will see that a JavaScript function controls how the call is made (thanks to Shane Stephens of CSIRO, and now of Google, for his assistance here). To get your data into this form you will need to convert the time formats etc. to match the structure of the demo document. I have written an export routine in Audiamus 2.5 to create a skeleton time-aligned document in the correct format.
Nick Thieberger
Getting media into online dictionaries, an example workflow
2007-12-04
I have updated the online dictionary of South Efate to include some 950 spoken headwords and a few spoken example sentences. These use timecodes in streamed media files as discussed elsewhere on this wiki. To get to this point I transcribed the media file – in which Endis Kalsarap patiently reads through a list of headwords – using Transcriber. This resulted in a time-aligned transcript which I imported into Audiamus, both for display and searching the text, as well as providing a conversion tool to output the transcript to various formats. I exported the file into Toolbox format and then combined it into my existing lexical database (using regular expressions in TextWrangler to match the headword to the word in the transcript, and inserting the timecodes into a field in the Toolbox lexicon, e.g., \aud 98015B 2002.345, 2003.3, where the first item is the filename, the second is the starting time and the third is the ending time).
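The matching step described above can also be scripted instead of done with regular expressions in a text editor. What follows is only a minimal Python sketch, not the procedure actually used: the function names, the (start, end, text) layout of the transcript segments, and the exact-match lookup on the headword are all assumptions made for illustration, and the \aud line is written in the space-separated 'filename starttime endtime' form.

```python
def aud_field(filename, start, end):
    """Format a Toolbox \\aud line: filename, then start and end in seconds."""
    return f"\\aud {filename} {start:.3f} {end:.3f}"


def add_audio(lexicon_lines, segments, filename):
    """Insert an \\aud line after each \\lx headword found in the transcript.

    segments: iterable of (start, end, text) tuples taken from a
    time-aligned transcript (a hypothetical layout for this sketch).
    """
    # Index the segments by their text; the first occurrence of a word wins.
    by_word = {}
    for start, end, text in segments:
        by_word.setdefault(text.strip(), (start, end))

    out = []
    for line in lexicon_lines:
        out.append(line)
        if line.startswith("\\lx "):
            headword = line[4:].strip()
            if headword in by_word:
                out.append(aud_field(filename, *by_word[headword]))
    return out
```

Headwords with no matching segment are left untouched, so the output can be checked by hand for entries that still lack audio.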
To produce the online dictionary I imported my Toolbox lexical file into LexiquePro and exported to html. As I wanted to include references to media as well as images I needed to trick LexiquePro into exporting these fields. As I did not have a phonetic field in my database, and LexiquePro does allow you to export phonetic information, I renamed the phonetic field as 'pc', the tag I used for links to image files. I included the audio link together with the headword (e.g., \lx akam 98015B 2050.787 2054.731) and it was also exported into the html version.
I then used TextWrangler again to edit the html pages, converting the phonetic field into a link to an image, and making the audio reference suitable for calling time offsets using the Annodex format (e.g., <A HREF="javascript:play_at(file2, 1531.133,1537.394)">ralim iskei atmat inru,</A>). You can look at the page source to see how the javascript works.
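The same search-and-replace can be sketched in Python. This is an illustration under stated assumptions rather than the edit that was actually made: it assumes the exported HTML contains the headword followed by 'filename starttime endtime' as plain text, and the js_names mapping from a media filename to the JavaScript variable used on the page (such as file2) is hypothetical. Only the play_at(file, start, end) call shape is taken from the example above.

```python
import re

# Matches "<text> <filename> <start> <end>", e.g. "akam 98015B 2050.787 2054.731".
AUDIO_REF = re.compile(
    r"(?P<text>[^\s<>]+) (?P<file>\w+) (?P<start>\d+\.\d+) (?P<end>\d+\.\d+)"
)


def link_audio(html, js_names):
    """Replace audio references in exported HTML with play_at() links.

    js_names maps a media filename to the JavaScript variable that names
    it on the page (a hypothetical mapping for this sketch).
    """
    def repl(m):
        var = js_names.get(m["file"])
        if var is None:
            return m.group(0)  # unknown media file: leave the text untouched
        return (f'<A HREF="javascript:play_at({var}, {m["start"]},{m["end"]})">'
                f'{m["text"]}</A>')

    return AUDIO_REF.sub(repl, html)
```

For example, link_audio('akam 98015B 2050.787 2054.731', {'98015B': 'file2'}) wraps the headword in a link that calls play_at with the two time offsets.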
The advantage of this approach is that the media is not actually downloaded so it is more trouble for someone to copy it, and it takes very little bandwidth and so can be delivered to remote locations with poor internet connections. And, since it builds on a normal workflow, it requires very little extra labor (it took less than a day to go through the process described here). See below for another method for getting individual sound files to play from an online dictionary.
Nick Thieberger
Getting media into online dictionaries, another example workflow
2009-05-05
When the streaming server we were using in the above example stopped working and it seemed like it would be a year or so before the planets would align for it to work again, I decided to take my lexical files and convert the HTML calls to play individual mp3 files. To create the mp3 files I got the wordlists I had had read by a speaker of the language, and time-aligned the recording to those wordlists (using Transcriber). I then exported the text from the Transcriber file in its 'Limsi label' format (which is basically a timecode, a space, and then the textual chunk). I then opened the audio file in Audacity and imported the label file to create labels in Audacity. The Tools>'Export multiple' function then automatically segmented the wav file into hundreds of mp3 files, each named for its label.
Using regular expressions in TextWrangler I could now rewrite the HTML calls to point to the headword with an '.mp3' suffix. This all works and definitely beats cutting audio files and writing HTML links by hand.
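The link rewriting described above can be sketched in Python as well. This is an illustration, not the regular expressions that were actually used: it assumes the existing links have exactly the play_at form shown in the earlier workflow, that each mp3 file is named after the link text, and that the files sit in a directory called audio (a hypothetical name).

```python
import re
from urllib.parse import quote

# Matches links of the form <A HREF="javascript:play_at(...)">text</A>
# (case-sensitive, so it only catches links written exactly this way).
PLAY_AT = re.compile(
    r'<A HREF="javascript:play_at\([^)]*\)">(?P<text>[^<]*)</A>'
)


def to_mp3_links(html, audio_dir="audio"):
    """Point each play_at() link at an mp3 file named after its link text."""
    def repl(m):
        text = m["text"]
        # Percent-encode the name so spaces and non-ASCII letters survive in a URL.
        href = f"{audio_dir}/{quote(text.strip())}.mp3"
        return f'<A HREF="{href}">{text}</A>'

    return PLAY_AT.sub(repl, html)
```

Links that do not match the pattern are left as they are, so the result can be searched for any remaining play_at calls.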
The EthnoER online presentation and annotation system (EOPAS)
2006-08-01
A major outcome for this project will be the development of a model for delivery of online media resulting from outputs of tools routinely used by professional linguists (that is, those keeping up with the requirements of their profession and so using the best current tools). This model will also permit users to annotate the data served by EOPAS.
Using Toolbox for XML outputs
Toolbox generously allows the user to enter all kinds of information in all kinds of orders. This is very useful but provides no constraints on the items in your data. Since the program doesn't provide it, the user will need to exercise some discipline in entering data. If the data is to be exported to an XML format then the user needs to consider establishing a hierarchy within the Toolbox document that will then form the structure for the XML document. One way of supporting this outcome is to distribute Toolbox template files from a central repository and to encourage users to adopt these templates.
Toolbox minimal fields required by EOPAS
By using the following field markers we will have a simpler path to an XML output in Toolbox 1.5 that can be read by EOPAS. Note that the marker hierarchy indicated by indentation below is significant and needs to be encoded in the Toolbox .typ file associated with the data (by including 'following field' information in the document properties).
\id chunk identification number
\aud time reference in the form 'filename starttime endtime' in seconds and milliseconds
\tx text line
\mr morphemic
\mg glosses of the morphemic line
\fg free gloss
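To make the field layout concrete, here is a minimal Python sketch of reading one chunk that uses these markers. It is not part of EOPAS or Toolbox: the names AudioRef, parse_aud and parse_record are hypothetical, and tolerating a comma between the two times is an assumption based on the \aud example given earlier on this page.

```python
from typing import NamedTuple


class AudioRef(NamedTuple):
    filename: str
    start: float  # seconds
    end: float    # seconds


def parse_aud(value):
    """Parse the content of an \\aud field: 'filename starttime endtime'."""
    # Tolerate a comma between the two times, as in '98015B 2002.345, 2003.3'.
    parts = value.replace(",", " ").split()
    if len(parts) != 3:
        raise ValueError(f"expected 'filename starttime endtime', got {value!r}")
    filename, start, end = parts[0], float(parts[1]), float(parts[2])
    if end < start:
        raise ValueError(f"end time precedes start time in {value!r}")
    return AudioRef(filename, start, end)


def parse_record(lines):
    """Collect the markers of one chunk into a dict, e.g. {'id': ..., 'tx': ...}."""
    record = {}
    for line in lines:
        if not line.startswith("\\"):
            continue  # skip blank or continuation lines in this simple sketch
        marker, _, content = line[1:].partition(" ")
        record[marker] = content.strip()
    if "aud" in record:
        record["aud"] = parse_aud(record["aud"])
    return record
```

A record with a malformed \aud field raises ValueError, which is one way of catching timecode errors before the data is exported to XML.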
Metadata should be entered into the EOPAS uploader, but a metadata header for a text file can include the following:
\ti title
\lg language (using the ISO/DIS 639-3 three-letter codes)
\dat date (ISO form: yyyy-mm-dd)
\sp speaker name
\sex sex of speaker
\age age of the speaker in years when recorded
\loc location at which the recording was made
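The formats named in this header (a three-letter language code, an ISO date, an age in years) lend themselves to a simple check before upload. The following Python sketch is only an illustration: the function check_header and the representation of the header as a dict keyed by marker are assumptions, and the check looks only at the shape of the language code, not at whether it is a registered ISO 639-3 code.

```python
import re
from datetime import date


def check_header(header):
    """Return a list of problems found in a metadata header dict.

    header maps a marker name without the backslash (e.g. 'lg') to its
    value. Every field is optional; only fields that are present are checked.
    """
    problems = []

    lg = header.get("lg")
    if lg is not None and not re.fullmatch(r"[a-z]{3}", lg):
        problems.append(f"\\lg should be a three-letter code, got {lg!r}")

    dat = header.get("dat")
    if dat is not None:
        ok = re.fullmatch(r"\d{4}-\d{2}-\d{2}", dat) is not None
        if ok:
            try:
                date.fromisoformat(dat)  # rejects impossible dates such as month 13
            except ValueError:
                ok = False
        if not ok:
            problems.append(f"\\dat should be yyyy-mm-dd, got {dat!r}")

    age = header.get("age")
    if age is not None and not age.isdigit():
        problems.append(f"\\age should be a whole number of years, got {age!r}")

    return problems
```

An empty list means that no problems were found in the fields that were supplied.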
Transcriber XML outputs for EOPAS
The native Transcriber file is in an XML form (<!DOCTYPE Trans SYSTEM "trans-13.dtd">) that provides the following elements to EOPAS:
[Yet to be entered]
Elan XML outputs for EOPAS
The native Elan file is in an XML form
(xsi:noNamespaceSchemaLocation=http://www.mpi.nl/tools/elan/EAFv2.2.xsd) that provides the following elements to EOPAS:
[Yet to be entered]
Toolbox template for use with the EOPAS system
We have provided a template for Toolbox users who want to constrain their data to output to the EOPAS schema. Get it here. This template provides a lexical file and a text ready for interlinearising. Using this template to enter one's texts should allow the normal XML output of Toolbox to be in a form conformant to the EOPAS.xsd being developed by Ronald Schroeter at ITEE at the University of Queensland.
A set of texts from South Efate prepared in this format can be seen here; click on the title to see the whole text. Not all of the media is currently playable (all media will be playable in the near future), but this example does have media linked to the text.
Melbourne workshop, February 2006.
This presentation lists the ACLA Project's needs for tools for transcription, annotation, search and counting, and some problems with past ACLA practice. Jane Simpson
This presentation describes technically how to use the Annodex technologies for ethnographic eResearch. Silvia Pfeiffer
Model of interlinear text structure
This paper presents a model for XML encoding of interlinear text. Towards a General Model for Interlinear Text by Cathy Bow, Baden Hughes and Steven Bird.
Directions for Interlinear Glossing
The principles for interlinear glossing set out here are based on those in: Lehmann, Christian (1982). "Directions for interlinear morphemic translations." Folia Linguistica 16:193-224.
Advanced Glossing
A presentation by Sebastian Drude ('Advanced Glossing — a language documentation format and its implementation with Shoebox') which proposes up to twelve levels of interlinear glosses.
Annodex example
An Annodex example is available here:
http://test03.centie.net.au/cmmlwiki
This site needs to be viewed with the Annodex browser which works best in Firefox/Windows for now. To install the plugin, go to http://www.annodex.net and click on the "Get Firefox Extension" link.
Vannotea demo movies.
Vannotea Demo Videos can be accessed through the following links (the lower the frame rate, the choppier the movements and the lower the file size):
http://maenad.itee.uq.edu.au/Vannotea/VannoteaDemo_Full.html [Shockwave Flash, full frame rate, 55MB]
http://maenad.itee.uq.edu.au/Vannotea/VannoteaDemo_Quarter.html [Shockwave Flash, 1/4 frame rate, 28MB]
http://maenad.itee.uq.edu.au/Vannotea/VannoteaDemo.wmv [Windows Media Video, 1/2 frame rate, 46MB]
http://maenad.itee.uq.edu.au/Vannotea/VannoteaDemo.exe [Standalone executable, Windows only, full frame rate, 43MB]
Audiamus
Audiamus is a tool written by Nick Thieberger which has some of the functions desired by fieldworkers working with time-aligned media. A discussion of the software can be found here and a working version can be found here.
Nick Thieberger. 2006. Well It Works! Reflections on the Audiamus Model for Corpus Building and Where It Could Go from Here (A presentation at EMELD 2006 that summarises the aims of EthnoER) available here (3 Mb)
Survey of Existing Tools, Standards and User Needs for Annotation of Natural Interaction and Multimodal Data
January 2001
Authors: Laila Dybkjær, Stephen Berman, Michael Kipp, Malene Wegener Olsen, Vito Pirrelli, Norbert Reithinger, Claudia Soria
The aim of this report (which is written by the ISLE Natural Interactivity and
Multimodality Working Group and is available at http://isle.nis.sdu.dk/reports/wp11/) is to provide a
survey of some of the most prominent tools in support of natural interactivity and multimodal data
annotation.
The report reviews twelve different tools, some of which have since developed considerably:
Anvil (Annotation of Video and Language Data);
ATLAS (Architecture and Tools for Linguistic Analysis Systems);
CLAN (Computerized Language Analysis);
CSLU Toolkit* (Center for Spoken Language Understanding Toolkit);
MATE (Multilevel Annotation Tools Engineering);
MPI tools* (CAVA and EUDICO) (Computer Assisted Video Analysis and European Distributed Corpora); Eudico is still at an early stage;
MultiTool* (developed as part of a Swedish project on a Platform for Multimodal Spoken Language Corpora);
The Observer is a professional system for the collection, analysis, presentation and management of observational data;
Signstream (developed as part of the American Sign Language Linguistic Research Project);
SmartKom uses tools developed in the Verbmobil project for audio annotation;
SyncWriter is a transcription and annotation tool;
TalkBank is a US project which aims to provide standards and tools for creating, searching, and publishing primary materials via networked computers.
ICT Tools for Searching, Annotation and Analysis of Audiovisual Media
October 2006
This report is the result of a 1-year joint project between Lancaster University and the University of Oxford.
The goal of the project was to inform the Arts and Humanities Research Council about:
1. The current state of computer-related activity in the arts and humanities,
2. Requirements of the research community, and
3. The likely course of future developments.
Our project was focussed on Audio-Visual media (e.g. speech, music, and videos primarily). It discusses technologies for searching and annotating these media. We looked forward to technologies that would be available to the typical researcher in the next few years, limiting ourselves to things that exist at least in the form of applied research.
We looked at what researchers currently do with audio-visual materials, and explored some scenarios for what they might do if upcoming technologies were available.
Joaquim Llisterri's page of links to annotation software
Can be found here: http://liceu.uab.es/~joaquim/phonetics/fon_anal_acus/herram_anal_acus.html
Thomas Schmidt's page of links to annotation software
Can be found here: http://www.exmaralda.org/annotation/index.php/Main_Page
Thomas Schmidt's presentation at LREC 2003
Visualising Linguistic Annotation as Interlinear Text:
Interlinear Text (IT) is a widely used method of data visualisation in linguistics. In spite of
this fact, and although there are quite a number of tools for inputting and outputting such data,
IT has rarely been described from a formal point of view. This paper tries to do this by
a) showing where (in linguistics) IT is used,
b) attempting a characterisation of what IT is, and
c) outlining what may be necessary in order to work with IT
Thomas Schmidt is the author of the EXMARaLDA transcription system
Thorsten Trippel's presentation at EMELD 2006
Thorsten Trippel (Bielefeld University) gave the paper titled 'The Missing Links in Documentary Linguistics: An approach to bridging the gap between annotation tools' at the EMELD conference in 2006. Get the abstract or the paper.
PAULA: Interchange Format for Linguistic Annotations
PAULA stands for Potsdamer Austauschformat für linguistische Annotation ("Potsdam Interchange Format for Linguistic Annotation"). PAULA is an XML-based standoff representation format, which has been designed to represent data annotated at multiple layers. http://www.sfb632.uni-potsdam.de/~d1/paula/doc/PAULA_intro.html
Indiana Annotated Text Processor
AISRI's Annotated Text Processor (ATP) is a text processor designed to manage interlinear text and to support the operations of several kinds of linguistic analysis including parsing and glossing. [more info]
Note: ATP is currently in beta testing phase. Last update: Oct 20 2005
Alexander Nakhimovsky, Chris Hellmuth and Tom Myers, BoxReader/BoxWriter presentation at EMELD 2006
The presentation is called "The Linguist's Toolbox and XML Technologies" and can be seen here. This tool takes a standard Toolbox text file and, together with its '.typ' file, produces XHTML output for display in a web browser. The important point about this process is that it is conceived as being lossless so that editing can take place in the XHTML format and then be reread into Toolbox.
I installed the Java programs on my Mac (not without some problems) and the result has worked, provided I remember to turn Tomcat on.
This tool allows conversion and display of the data, and it should be possible to build the media player capability we are planning in EthnoER into the output given by BoxReader. (Nick Thieberger)
Toolbox XML to XHTML Conversion scripts
An XSLT script is provided here by 'Taliesin' which converts Toolbox XML export into an online XHTML, including tables for wrapping to screen size.
Ulrik Petersen : Emdros – a text database engine for analyzed or annotated text
See the web page here.
Emdros is a text database engine for linguistic analysis or annotation of text. It is applicable especially in corpus linguistics for storing and retrieving linguistic analyses of text, at any linguistic level. Emdros implements the EMdF text database model and the MQL query language.
Ronald Sprouse : The Berkeley Interlinear Text Collector (BITC)
BITC is a system for collecting interlinear texts and is especially designed for group collaboration. BITC is installed on a network server, and users access BITC through a web browser. This is ideal for group work, since, once the server is set up, practically anyone can get involved with the project without having to install special software. Also, everyone benefits from the texts collected by everyone else on the project because all work contributes to a shared dictionary. Using the Internet potentially enables geographically-dispersed researchers to collaborate with each other, too! Currently, BITC is ALPHA software, which means it is not for the faint-of-heart, mainly because installation may be difficult.
[Note there appear to be no texts in this web version]
See the web page here
Text Analysis Info - transcribing (page of links to transcription tools)
See this page for a list of useful tools for transcription: http://www.textanalysis.info/transcribe.htm
Keira Ballantyne's XML work with Yapese texts
On these pages Keira Ballantyne discusses an XML schema she developed for presentation of annotated texts in Yapese (Micronesia).
Current (November 2006) IT display
This page is Ronald Schroeter's sample of the data display for interlinear text (you will need to have installed the Firefox extensions provided above). It shows the text, uses wrapping (thanks to John Thomson of SIL for assistance here), and searches on or supplies a concordance of all texts that share a language code. To access the concordance, double-click on a word in either the text or morphemic line and see a list of all occurrences of that word with some context in the bottom left panel. Clicking on an occurrence will take you to its context in whichever text it occurs in. Access media associated with a chunk by clicking on the identifier in the left column or the triangle there. A paper discussing EOPAS was presented at the conference 'Sustainable data from digital fieldwork' held in Sydney on December 4th and 5th, 2006.
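The concordance behaviour described above (every occurrence of a word, with some context, across all texts sharing a language code) can be illustrated with a short Python sketch. This is not how EOPAS itself is implemented: the function names, the (text_id, language_code, chunks) data layout and the fixed-width word context are all assumptions made for the example.

```python
from collections import defaultdict


def build_concordance(texts, width=3):
    """Index every word by language code, with a little left and right context.

    texts: iterable of (text_id, language_code, chunks), where chunks is a
    list of (chunk_id, words) pairs. This layout is assumed for the sketch.
    """
    index = defaultdict(list)
    for text_id, lang, chunks in texts:
        for chunk_id, words in chunks:
            for i, word in enumerate(words):
                left = " ".join(words[max(0, i - width):i])
                right = " ".join(words[i + 1:i + 1 + width])
                index[(lang, word.lower())].append((text_id, chunk_id, left, right))
    return index


def lookup(index, lang, word):
    """Return all occurrences of word in texts that share the language code."""
    return index.get((lang, word.lower()), [])
```

Each hit carries the text and chunk identifiers, which is what a display would need in order to jump to the occurrence in its original text.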
Sustainable Data from Digital Fieldwork: From creation to archive and back
University of Sydney
December 4 - 6, 2006
Many academic disciplines depend on analysis of primary data captured during fieldwork. Increasingly, researchers today are using digital methods for the whole life cycle of their primary data, from capture to organisation, submission to a repository or archive, and later access and dissemination in publications, teaching resources and conference presentations. This conference and workshop showcased a number of projects that have been developing innovative and sustainable ways of managing such data.
Papers from this conference (partly funded by EthnoER) are available for free download here. A table of contents with links to the files, including mp3 files of the presentations, can be found here and is listed below:
Introduction
Sustainable data from digital fieldwork: the state of the art (Sydney, 2006) - Linda Barwick
Part I: Fieldwork to archive
Issues in the creation of a digital archive of a signed language - Trevor Johnston & Adam Schembri
Powerless in the field: a cautionary tale of digital dependencies - Tom Honeyman
Archiving directly from the field - Laura Robinson
From trees to descriptions and identification tools - Barry Conn & Damas Kipiro
Part II: Best practice?
When best practice isn't necessarily the best thing to do: dealing with capacity limits in a developing country - John Bowden & John Hajek
Proficient, permanent, or pertinent: aiming for sustainability - David Nathan
Finding the locus of best practice: technology training in an Alaskan language community - Andrea Berez & Gary Holton
E-MELD and the School of Best Practices: an ongoing community effort - Jessica Boynton, Steve Moran, Anthony Aristar & Helen Aristar-Dry
Part III: Tools and repositories
EOPAS, the EthnoER online representation of interlinear text - Ronald Schroeter & Nicholas Thieberger
The Annodex platform (2006) - Shane Stephens
Archiving and sharing data using XML - Simon Musgrave
Sowing seeds in the digital garden - Murray Henwood, Susan Hanfling, Rowan Brownlee, Belinda Pellow & Tristan Gutsche
Part IV: Beyond the repository
Past, present and future in Reefs-Santa Cruz research - Åshild Næss
Field, file, data, conference: towards new modes of scholarly publication - Ross Coleman
Contents list of mp3 podcasts of unpublished presentations from the conference:
The Bidwern project - Leo Monus, Kim McKenzie & Murray Garde
The dawning of the age of online collections - Adrian Burton
An ethnography of the EthnoER project - Nicholas Thieberger
Fieldwork Data Sustainability (FIDAS): the FieldHelper project - Steven Hayes
Sustainability models for digital preservation - David Pearson
Using fieldwork data in publications: musicology - Linda Barwick
Copies of the proceedings as a book can be ordered here.
