Tuesday, 6 August 2013

A conceptual model of Text, Documents and Work - Part 1

As part of the thinking I'm doing for THE BOOK, I have been brainstorming collaboratively with a colleague of mine, Geoffroy Noel, who is an expert in modelling and knowledge representation. We have started a journey and we don't know where it will take us, but it looks like a fun journey, so I thought of sharing it!

First, the context: I will only talk about texts, documents and works in a textual scholarship framework, so I'm not talking about the creation of texts or texts transmitted orally or other objects, but about cultural heritage texts that are written.

Second, the entities. In this model, at least for now (we are only at the beginning), there are three entities:
  1. documents
  2. texts
  3. works
The document is a physical object that has some text on it.[1] So maybe a better name for it could be verbal text-bearing object, or VTBO, but I found the expression too long and the acronym too ugly, so I'll stick with document for the time being (I may change my mind).

A document is defined by two sets of characteristics:
  1. It has some physical dimensions which can be measured (length, weight, number of leaves, whatever)
  2. It has some signs on it that can be recognised as words written in a natural language by a competent scholar
It also has other characteristics (it can be found somewhere, for instance), but at this stage I simply want to define it at a general level; the details will come.
Documents have observable dimensions, or, as Michael Sperberg-McQueen put it, an infinite set of observable facts, which, I add, can be grouped into sets of facts or dimensions, motivated by research objectives. Each dimension has a set of document features it draws from. Features are facts that can be observed directly and fairly objectively on the document. For instance, the Palaeographical Dimension will be interested in the way (Feature) a letter (Part of the document) is written. The following is a non-exhaustive list of these dimensions potentially observable in a document.

  • Linguistic dimension: the language in which the text is written, its grammatical rules, and the way it has been realised. 
  • Semantic dimension: what the words mean and what they were intended to mean.
  • Graphematic/palaeographical dimension: the letters that compose the words, their shape and succession, the type of script or typeset and the cultural implications of using one or the other.
  • Literary dimension: style, rhetorical features, genre.
  • Genetic dimension: by whom and when the document was filled with words, the revisions and the additions.
  • Artistic dimension: how words and decoration look, their artistic quality, what they represent and mean.
  • Codicological dimension: how the words have progressively been added to the document and how the particular shape and structure of the document and its production have influenced the words.
  • Cultural dimension: the value we attribute to the document because someone special has written it, because it has been owned by someone special or has played a role in a historical event, because of the scarcity or value of its components, or because of its role in religious devotion.

As I said, there might be many more dimensions (and I think I will add at least another one shortly, when I introduce the concept of work, but for the moment I'll leave it like this), as there is an infinite set of facts which can be observed on the object.
Another way of seeing these dimensions is to interpret them as channels of communication (or Contacts), each one with its own codes (a Meaning encoded as a Feature), as they are made of signs which, in the intentions of their producers (the Senders), were meant to communicate something (the Message) to someone else (the Receivers). I have used the communication model to talk about text transmission here.

Now, these Dimensions are only potentially available, and until someone reads and inspects the document for some reason, the document itself has no particular meaning: it is an inert object with no particular significance. The place where meanings are conjured is the Text, which comes into being when the Document is read by a reader. Or, to put it another way, the Document transmits its message when an interested Receiver can be found.

The Text, then, is the meaning(s) that readers give to the subset of dimensions they derive from a document and that they consider interesting. As a consequence, Texts are immaterial and interpretative. Many people can read from the same document and understand slightly or radically different things, depending on their culture, their understanding, their disposition, their circumstances, etc. There are facts in the object, but their meaning is not factual, it is interpretative. For one reader the only relevant dimension could be the semantic one (what the text means, the plot, who's the murderer); for another it could be the artistic value (maybe she/he cannot read the words written in an old language, but he/she can still admire and make (some) sense of the miniatures).
Editors are very special types of readers and are likely to be engaged with many dimensions at once, but then, according to their scholarly approach and the purpose of their editions, they will weigh the evidence provided by each of these dimensions differently. My previous post tried to use the codicological dimension as a way to position theoretical approaches and editorial outcomes. And in my article of 2011 I used the purpose of the edition as a parameter to choose the features of the document.
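For readers who like to see things spelled out, here is a minimal sketch of how I currently picture these entities, written as TypeScript interfaces. The names and fields are mine and purely illustrative; this is my shorthand, not the formal model Geoffroy and I are building.

```typescript
// Document: a physical object with measurable dimensions and observable features.
interface Feature {
  part: string;        // which part of the document, e.g. "a letter on fol. 3r" (made-up example)
  observation: string; // the directly and fairly objectively observable fact
}

interface Dimension {
  name: string;        // "linguistic", "palaeographical", "codicological", ...
  features: Feature[]; // the document features this dimension draws from
}

interface Document {
  measurements: Record<string, number>; // length, weight, number of leaves, whatever
  dimensions: Dimension[];              // only potentially available until someone reads the document
}

interface Reader {
  interests: string[]; // the dimensions this reader can and wants to decode
}

// Text: immaterial and interpretative; it comes into being when a reader
// gives meaning to the subset of dimensions they consider interesting.
interface Text {
  document: Document;
  reader: Reader;
  meanings: Map<string, string>; // dimension name -> the meaning conjured by this reader
}
```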
So far the model looks like this:



If we then reconsider the communication model I mentioned above, we could define texts as follows:
A text is a multidimensional written message made of a set of meanings transmitted by a document by means of various channels, each of them with its own codes, which are potentially understandable to at least one group of receivers holding the capability or the interest to decipher at least one of such codes.

Ok, I think I'll stop here for the moment. Next step will be the Work.

[1] I note here that this is more or less the same definition given by Huitfeldt and Sperberg-McQueen 2008: "By a document we understand an individual object containing marks. A mark is a perceptible feature of a document (normally something visible, e.g. a line in ink)", p. 297. The difference between these two definitions is that I explicitly say "text" and not "marks".

Thursday, 1 August 2013

Of texts and documents and editing

I have finally been granted a six-month research leave and I am now totally immersed in the writing of The Book.
One of the things I am reflecting upon is the weight we give to documentary evidence in textual scholarship.
The graph below shows in a simplified (and utterly imprecise) way the position of some of the most important editorial theories with respect to the centrality they give to documentary evidence (the materiality of the text).


Similarly, editorial formats can be distributed along a continuous line




While presenting a very complex reality of theoretical positions in an oversimplified way, these graphs have the advantage of showing the path between text and document as a continuum, with no clearly defined boundaries.
The placement is not scientifically weighted; it is just the approximate position I would give to these theories and formats. Comments?

Tuesday, 19 March 2013

Crowd Sourcing and Digital Editing

The term is almost over, and I have finally found some time to write down some of the thoughts I have been chewing on in the past months. Apologies to my numerous followers for the long silence.
The topic this time is crowd-sourcing, which is a bit unusual for me as I have not been directly involved in any crowd-sourcing project, but, as many of you already know, I'm working on a book on Digital Scholarly Editing, which inevitably forces me to consider new forms of editing such as, for instance, crowd-sourcing and its role in editing.
A King's College London project devoted to the classification of different types of crowd-sourcing activity has just concluded, producing a hefty report written by Stuart Dunn, but, by its author's own admission, the classification contained in the report is a bit too comprehensive to be really useful for my purpose, so here is the one I have created (yes, all by myself!). Comments are welcome!

Without any pretension of being exhaustive, crowdsourcing that concerns in some way the editing and publication of texts can be classified according to five parameters:
1.     Context: Crowdsourcing projects can be hosted and supported by: 
a.     Universities and Cultural Heritage Institutions, such as Libraries and Museums. This is the case for some of the projects mentioned above (Transcribe Bentham is hosted and supported by UCL, for instance), and for the National Library of Australia’s Historic Newspaper Digitisation Project, where users have been asked to correct OCRed articles from historical newspapers.
b.     Non-governmental organisations and other private initiatives: this is the case, for instance, of Project Gutenberg, which began in 1971 from the vision of its founder, Michael Hart, and has continued since thanks to donations.
c.     Commercial: this is the case, for instance, of Google, which uses the reCAPTCHA service, asking users to enter words seen in distorted text images on screen, part of which come from unreadable passages of digitised books, thus helping to correct the output of the OCR process while protecting websites from attacks by internet robots (the so-called ‘bots’).
2.     Participants: or better, how they are recruited and which skills they should possess to be allowed to contribute. Some projects issue open calls, for which anybody can enrol and contribute as they wish, with no particular skill being required other than commitment; other projects require their contributors to possess specific skills, which are checked before the user is allowed to do anything. The former is the case for the Historic Newspaper Digitisation project and Project Gutenberg, the latter for the Early English Laws project. Many projects place themselves in between these two categories, closer to one end or the other. In the SOL project, for instance, users are assumed to read and understand Greek, but their competence is verified by the quality of their translations, although to register as editors users are expected to declare their competences, which are checked by the editorial board.
3.     Tasks: the tasks requested of the users can be one or more of:
a.     Transcribing manuscripts or other primary sources, like in the case of Transcribe Bentham.
b.     Translating: as in the case of SOL.
c.     Editing, which is requested by the Early English Laws project.
d.     Commenting and Annotating: as in the case of the Pynchon Wiki 
e.     Correcting: this is the case, for instance, of the National Library of Australia’s project seen above and of Project Gutenberg, where users not only contribute by uploading new material, but also take on proofreading texts in the archive.
f.      Answering specific questions: this is the case for the Friedberg Genizah Project, for instance, which uses the project’s Facebook page to ask its followers specific questions about, for instance, a particular reading of a passage, or whether the hand of two different fragments is the same, and so on.
4.     Quality control: the quality of the work produced by the contributors can be assessed by professional staff hired for that purpose (e.g. Transcribe Bentham), or it can be assured by the community itself, with super-contributors whose controlling roles are gained in the field by becoming major contributors (e.g. Wikipedia), or because of their qualifications (e.g. SOL), or both.
5.     Role in the project: for some projects the crowdsourced material is the final aim of the project, as for Project Gutenberg or the Historic Newspaper Digitisation project, while for others it is a product that will be used in other stages of the project. The transcriptions produced within the Transcribe Bentham project serve a double purpose: they represent the main outcome of the project as, once their quality has been ascertained, they feed into UCL’s digital repository, but they are also meant to be used for the edition of The Collected Works of Jeremy Bentham, in preparation since 1958.
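For those who think better in code than in prose, the same classification can be sketched as a simple data structure. This is a TypeScript sketch of mine; the value names are illustrative, not a fixed vocabulary.

```typescript
// The five parameters of the classification, as a (simplified) data structure.
type Context = "institutional" | "private" | "commercial";
type Recruitment = "open-call" | "skills-declared" | "skills-verified";
type Task = "transcribing" | "translating" | "editing" | "annotating" | "correcting" | "answering-questions";
type QualityControl = "professional-staff" | "community" | "both";
type Role = "final-aim" | "intermediate-product" | "both";

interface CrowdsourcingProject {
  name: string;
  context: Context;
  participants: Recruitment;
  tasks: Task[];
  qualityControl: QualityControl;
  role: Role;
}

// Example: Transcribe Bentham, very roughly, as I classify it above.
const transcribeBentham: CrowdsourcingProject = {
  name: "Transcribe Bentham",
  context: "institutional",
  participants: "open-call",
  tasks: ["transcribing"],
  qualityControl: "professional-staff",
  role: "both", // repository content and material for the printed Collected Works
};
```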

Is there anything else I should have included?

Friday, 11 May 2012

Genetic encoding at work

I get back to my blog after a long silence caused by a rather busy month (March), with three papers given on three different continents on three different topics (Paris on Proust: see below; Providence on modelling in teaching; Canberra on the role of the TEI in DH projects), all of it in the middle of term. Nice. Then there was a rather deserved holiday (Australia seems to get better each time I go!), and MMSDA (April). Finally, catching up with loads of emails, deadlines, etc.

This post reports on the content of the first of the three conferences, i.e. the presentation that Julie André and I gave in Paris on the 1st of March at Proust, l’œuvre des manuscrits. The conference was organised by the "Equipe PROUST" of ITEM-CNRS (Institut des Textes et Manuscrits modernes), with funding from the ANR programme CAHIERS-PROUST (Nathalie Mauriac Dyer, ITEM, dir.).

You can admire the prototype I have created at this address: http://research.cch.kcl.ac.uk/proust_prototype/ You can download the XML and XSLT, if you so wish.

The idea at the base of this prototype is that in digital editions we have so far tried to reproduce print editions without engaging with the new medium in a fruitful or interesting way. Even the most sought-after type of online edition, such as the transcription presented side by side with the facsimile, is not new at all and shows quite a few limitations.
  1. It creates an alternative space which tries to mimic the original space, without ever being able to represent it in full; 
  2. It leaves to the user/reader the task of establishing the relationship between the transcribed and the inscribed text; 
  3. It is bound to present pages (and not, for instance, openings), given the constraints on the width of the screen, an approach that, if applied to Proust’s Cahiers, would indeed falsify the documentary evidence, which shows how Proust considered the double page as his writing space (have a look at these materials on Gallica: they are amazing!).
The normal type of publication format adopted for draft manuscripts is the ultra-diplomatic edition, which presents the transcribed text in a format that tries to mimic the layout of the manuscript page as much as possible. While this type of edition provides many advantages, it lacks one fundamental aspect: the dynamicity of the writing process.

So what, then? For the transcription we have used the new TEI elements for documentary transcription (I talk about this in another post); then I have used SVG to plot the transcribed zones of text on top of the facsimile, and a bit of JavaScript to add some animation to the output, reproducing the sequence of writing and the sequence of reading of such zones. I have also used colour to mark uncertainty: are we sure about the temporal placement of the sequences? The yellower the background, the less sure we are.
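To give an idea of the mechanism (this is not the prototype's actual code, just a hand-waving sketch in TypeScript with made-up element ids), the animation boils down to revealing the SVG zones in their presumed order of writing and tinting them yellow in proportion to our uncertainty:

```typescript
interface Zone {
  id: string;          // id of the SVG shape overlaid on the facsimile (made-up ids below)
  order: number;       // position in the hypothesised sequence of writing
  uncertainty: number; // 0 = certain about the temporal placement, 1 = very unsure
}

// Reveal each zone in turn; the yellower the background, the less sure we are.
function playWritingSequence(zones: Zone[], stepMs = 1500): void {
  [...zones]
    .sort((a, b) => a.order - b.order)
    .forEach((zone, i) => {
      window.setTimeout(() => {
        const el = document.getElementById(zone.id);
        if (!el) return;
        el.style.setProperty("display", "block");
        el.style.setProperty("fill", "yellow");
        el.style.setProperty("fill-opacity", String(0.1 + 0.6 * zone.uncertainty));
      }, i * stepMs);
    });
}

playWritingSequence([
  { id: "zone-main-text", order: 1, uncertainty: 0.1 },
  { id: "zone-margin-addition", order: 2, uncertainty: 0.8 },
  { id: "zone-paperole", order: 3, uncertainty: 0.5 },
]);
```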

I think this type of visualisation is definitely not perfect but it is interesting for many reasons: first, because it tries to do something that the print editions cannot do; second, because it doesn't present a coherent read-me-top-to-bottom type of text (which would be just wrong in this case); and third, because it takes the (facsimile of the) document as its structural support.
What's still missing? Quite a lot, actually, but in particular I can think of these few points now:

  • A way to represent the dynamic sequences across pages and documents: this can be easily done in the XML source, but not yet in the output
  • A way to drag the zones away in order to read what's underneath
  • Microgenesis: timing writing and rewriting at word level.  
But this is for the next project!

Thursday, 26 January 2012

Digital Humanities seen from the outside: a Fish out of water

In this post I would like to reflect on the way Digital Humanities are seen from the outside and on the consequences of misunderstandings.

Apparently, seen from the outside, we are those people counting words and detecting hidden meanings from numbers and statistics; this method is seen as being in contrast with more traditional literary interpretation (close vs. distant reading, to put it in a slogan). Stanley Fish seems to be of this opinion. In his blog post Mind Your P’s and B’s: The Digital Humanities and Interpretation he reports on a DH-like analysis of the Areopagitica of John Milton, where he studies "the dance of the 'b's' and 'p's'" in a given passage. In the end, he concludes that DH-like analysis is not his cup of tea:
But whatever vision of the digital humanities is proclaimed, it will have little place for the likes of me and for the kind of criticism I practice: a criticism that narrows meaning to the significances designed by an author, a criticism that generalizes from a text as small as half a line, a criticism that insists on the distinction between the true and the false, between what is relevant and what is noise, between what is serious and what is mere play. Nothing ludic in what I do or try to do. I have a lot to answer for.
Well, there is nothing wrong with the fact that DH is not everybody's cup of tea. I can live with that, pretty easily, as it happens. The problem is that to make an effective criticism, you should actually know what you are talking about. Mark Liberman has in fact run a test on the very premise of Fish's argument and has discovered that in that passage:

  • The number of 'p's and 'b's is only 1% higher than the average number of 'p's and 'b's in the whole text
  • There are passages that contain even more 'p's and 'b's
  • There are letters that show similar patterns, such as 'x's and 'y's
  • There are letters that show even bigger peaks, such as 'l's
In other words: to do DH-like research you should use DH tools, i.e. use a computer! Had Fish used a program for his own research he could have spared himself a bit of ridicule. To do DH-like research, you should be able to actually do it. DH are not approaches and theories only, they are practice as well (see my definition of DH in an earlier post). It turns out that to count words (or letters) you actually have to count them.
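Just to show how little is needed: a few lines of code are enough to check whether a passage really has an unusual share of 'p's and 'b's compared with the whole text. This is a sketch of mine in TypeScript, not Liberman's actual script.

```typescript
// Share of the given letters among all alphabetic characters in a text.
function letterShare(text: string, letters: string[]): number {
  const targets = new Set(letters.map((l) => l.toLowerCase()));
  let hits = 0;
  let total = 0;
  for (const ch of text.toLowerCase()) {
    if (ch < "a" || ch > "z") continue; // ignore spaces, punctuation, digits
    total += 1;
    if (targets.has(ch)) hits += 1;
  }
  return total === 0 ? 0 : hits / total;
}

// Demo with a placeholder string; for the real test you would load the
// chosen passage and the whole Areopagitica from files and compare the two shares.
const demoPassage = "Peter Piper picked a peck of pickled peppers";
console.log("share of p/b in the demo passage:", letterShare(demoPassage, ["p", "b"]));
```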

What do we learn from this? Two main lessons, I think.
First, that we have to reflect on our image and the way we present our research and ourselves to people who take more traditional approaches to scholarship. Second, that if you want to criticise something you have to make sure you have done your homework (something I discussed in another post). The problem, I think, is that Fish *has*, in my opinion, a point here, namely that the statistical, computational approach is not for everybody (he doesn't say that it is not useful, only that it is not for him) and that there is still a lot of value in doing things traditionally.
But if you want to make a point, make sure your argument is solid, otherwise people will make fun of you and miss something potentially interesting.

Are my students listening?

Wednesday, 25 January 2012

Research without Borders and the TEI

Last Friday (20 January) I was invited by Marjorie Burghart to give a lecture in Lyon as part of a two-day DH event called L’édition électronique dans tous ses états : évolution des pratiques, évolution des besoins (details of the event here and poster here).
It was a lot of fun, in particular because I organised a role-playing game and everybody got very involved.
I also had the opportunity to investigate one of my favourite topics: why on earth do people spend time (and money) working for the TEI when all this work is not credited, i.e. the name of whoever had the idea is not recorded anywhere and the work is just presented as the collaborative effort of the TEI (a.k.a. the Technical Council + the SIGs + TEI-L + etc.)? This is not what academics normally do, right? And even more so, why on earth do institutions accept this?
On a personal level, the best answer to these questions is, in my opinion, that working to improve the TEI is fun: you have the opportunity to meet exceptionally gifted researchers from all over the world and, even if you cannot immediately quantify it or point at something specific, your research is affected by this. Mine has been: I think I am a much better researcher as a result of my past 10-odd years of work with the TEI, as part of the SIG, the Council and now the Board.
At the institutional level, the reason is that the TEI is recognised as one of the foundational bases of DH, for which we are all collectively responsible.
Yes, the TEI has a lot of open issues (last summer's putsch is a shining example of this), but, as always, I think the best way to solve the problems is to get involved. So, au travail mes amis!

Here are the slides of the presentation, in French though... apologies to all non-French speakers and to all French speakers as well (quality of the language is, well, you'll see!).

Monday, 23 January 2012

Medievalists in the making and the digital

For the past few years I have been lucky enough to be involved in a wonderful training course, MMSDA, i.e. Medieval Manuscript Studies in the Digital Age. This course is offered for free to UK PhD students who have to work with medieval manuscripts and are interested in the digital stuff. We have now run the course for three years with exceptional success, which we measure in the number of applicants (65 in the first year, 42 the second and 28 the third) and their enthusiasm and commitment. The main brain behind this initiative is Peter Stokes (yep, my Peter Stokes).
The course was initially funded by the AHRC, so we were forced to offer it only to UK-based students, but, from the very first time we ran it, we were aware of a much larger interest out there. This is the reason why we sought alternative funding, and we were finally lucky enough, thanks to the hard work of Charles Burnett from the Warburg Institute, to secure some substantial funding from a COST Action project, IS1005, 'Medieval Europe - Medieval Cultures and Technological Resources'.

So we opened applications to European countries. The result? We had 90 applications (yes, 90!) from 18 countries for 20 places. The quality of the applicants was outstanding; I have never had to make more difficult choices, really! We have just been through them all and sent the list of successful candidates to the COST office for approval; then we will communicate the results.

This experience is telling me a few things:

  1. There are some amazing young researchers out there; we will face some stiff competition quite soon.
  2. Many people who have to engage with manuscripts lack appropriate training. Even at PhD level, for many a manuscript is little more than a support for a text. 
  3. Young researchers are desperate to acquire essential digital skills (we teach XML, TEI, imaging, so nothing very sophisticated, but very desirable, it seems)
  4. We (i.e. the organisers) have knowingly left out of the course some essential topics: Greek, Hebrew, Arabic, Glagolitic, Cyrillic... all of these languages and scripts and traditions and manuscripts are part of our common European culture, but we tend to, quite conveniently, forget it... In our case it was mostly due to lack of time (there are only so many things you can fit into 5 days, you know), but still there is something to keep in mind here, I think
Food for thought...