Tuesday 6 August 2013

A conceptual model of Text, Documents and Work - Part 1

As part of the thinking I'm doing for THE BOOK, I am brainstorming collaboratively with a colleague of mine, Geoffroy Noel, who is an expert in modelling and knowledge representation. We have started a journey without knowing where it will take us, but it looks like a fun one, so I thought I would share it!

First, the context: I will only talk about texts, documents and works within a textual scholarship framework, so I am not talking about the creation of texts, or about texts transmitted orally, or other objects, but about cultural heritage texts that are written.

Second, the entities. In this model, at least for now (we are only at the beginning), there are three entities:
  1. documents
  2. texts
  3. works
The document is a physical object that has some text on it.[1] So maybe a better name for it could be verbal text-bearing object, or VTBO, but I found the expression too long and the acronym too ugly, so I'll stick with document for the time being (I may change my mind).

A document is defined by two sets of characteristics:
  1. It has some physical dimensions which can be measured (length, weight, number of leaves, whatever)
  2. It has some signs on it that can be recognised as words written in a natural language by a competent scholar
It also has other characteristics (it can be found somewhere, for instance), but at this stage I simply want to define it at a general level; the details will come.
Documents have observable dimensions or, as Michael Sperberg-McQueen put it, an infinite set of observable facts, which, I add, can be grouped into sets of facts, or dimensions, motivated by research objectives. Each dimension has a set of document features it draws from. Features are directly and fairly objectively observable facts about the document. For instance, the Palaeographical Dimension will be interested in the way (Feature) a letter (part of the document) is written. The following is a non-exhaustive list of the dimensions potentially observable in a document.

  • Linguistic dimension: the language in which the text is written, its grammatical rules, and the way it has been realised. 
  • Semantic dimension: what the words mean and what they were intended to mean.
  • Graphematic/palaeographical dimension: the letters that compose the words, their shape and succession, the type of script or typeset and the cultural implications of using one or the other.
  • Literary dimension: style, rhetorical features, genre.
  • Genetic dimension: by whom and when the document was filled with words, the revisions and the additions.
  • Artistic dimension: how words and decoration look, their artistic quality, what they represent and mean.
  • Codicological dimension: how the words have progressively been added to the document and how the particular shape and structure of the document and its production have influenced the words.
  • Cultural dimension: the value we attribute to the document because someone special wrote it, or it was owned by someone special, or it played a role in a historical event; because of the scarcity or the value of its components; or because of its role in religious devotion.

As I said, there might be many more dimensions (and I think I will add at least another one shortly, when I introduce the concept of work, but for the moment I'll leave it like this), as there is an infinite set of facts which can be observed on the object.
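To make these entities a bit more concrete, here is a minimal sketch of the model so far as a handful of Python classes. This is just my own illustrative doodle, not the formal model Geoffroy and I are building: all names (Feature, Dimension, Document) and the example values are provisional.

```python
from dataclasses import dataclass, field

@dataclass
class Feature:
    """A directly and fairly objectively observable fact about a part of the document."""
    part: str          # the part of the document it is observed on (a letter, a leaf, an initial...)
    observation: str   # the fact itself (e.g. the way the letter is written)

@dataclass
class Dimension:
    """A grouping of features, motivated by a research objective."""
    name: str
    features: list[Feature] = field(default_factory=list)

@dataclass
class Document:
    """A physical object bearing verbal text (the 'VTBO')."""
    measurements: dict[str, float]                      # length, weight, number of leaves...
    dimensions: list[Dimension] = field(default_factory=list)

# The palaeographical dimension is interested in the way (Feature) a letter (part) is written:
palaeographical = Dimension(
    name="palaeographical",
    features=[Feature(part="letter 'a', fol. 1r", observation="two-compartment 'a'")],
)
ms = Document(measurements={"leaves": 120.0}, dimensions=[palaeographical])
```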
Another way of seeing these dimensions is to interpret them as channels of communication (or Contacts), each one with its own codes (a Meaning encoded as a Feature), as they are made of signs which, in the intentions of their producers (the Senders), were meant to communicate something (the Message) to someone else (the Receivers). I have used the communication model to talk about text transmission here.

Now, these Dimensions are only potentially available, and until someone reads and inspects the document for some reason, the document itself has no particular meaning: it is an inert object with no particular significance. The place where the meanings are conjured is the text, which comes into being when the Document is read by a reader. Or, to put it another way, the Document transmits its message when an interested Receiver can be found.

The Text, then, is the meaning(s) that readers give to the subset of dimensions they derive from a document and that they consider interesting. As a consequence, Texts are immaterial and interpretative. Many people can read the same document and understand slightly or radically different things, depending on their culture, their understanding, their disposition, their circumstances, etc. There are facts in the object, but their meaning is not factual: it is interpretative. For one reader the only dimension could be the semantic one (what the text means, the plot, who the murderer is); for another it could be the artistic value (maybe she/he cannot read the words written in an old language, but she/he can still admire and make (some) sense of the miniatures).
Editors are very special types of readers and are likely to be engaged with many dimensions at once, but then, according to their scholarly approach and the purpose of their editions, they will weigh the evidence provided by each of these dimensions differently. My previous post tried to use the codicological dimension as a way to position theoretical approaches and editorial outcomes. And in my 2011 article I used the purpose of the edition as a parameter to choose the features of the document.
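Continuing the doodle, the coming-into-being of a Text could be sketched as a reading function: the same document read by readers with different interests yields different texts. Again, every name here is hypothetical, and the "interpretation" is a mere placeholder for the reader's own act.

```python
from dataclasses import dataclass

@dataclass
class Text:
    """Immaterial and interpretative: the meanings a reader gives to the subset of
    dimensions of a document she/he finds interesting."""
    reader: str
    meanings: dict[str, str]   # dimension name -> meaning conjured by this reader

def read(document: dict[str, str], reader: str, interests: set[str]) -> Text:
    """A Text comes into being only when a Document is read: the reader selects the
    dimensions she/he is interested in (and able to decode) and gives them meaning."""
    selected = {dim: facts for dim, facts in document.items() if dim in interests}
    meanings = {dim: f"{reader}'s interpretation of {facts}" for dim, facts in selected.items()}
    return Text(reader=reader, meanings=meanings)

# Same document, two readers, two different (partial) texts:
doc = {"semantic": "the words and their plot", "artistic": "the miniatures"}
text_a = read(doc, "a reader of the old language", {"semantic"})
text_b = read(doc, "a reader who cannot read it but loves the miniatures", {"artistic"})
```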
So far the model looks like this:



If we then reconsider the communication model I mentioned above, we could define texts as follows:
A text is a multidimensional written message, made of the set of meanings transmitted by a document by means of various channels, each of them with its own codes, which are potentially understandable to at least one group of receivers holding the capability or the interest to decipher at least one of such codes.
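Read operationally, the definition says that a document yields a text as soon as at least one group of receivers can decipher the code of at least one of its channels. Here is a toy sketch of that condition; the channels, codes and receiver groups are all invented for illustration.

```python
# Each channel (dimension) of the document has its own code;
# each group of receivers can decipher some codes and not others.
document_channels = {              # channel -> code needed to decipher it
    "linguistic": "Old French",
    "graphematic": "Gothic script",
    "artistic": "visual literacy",
}
receiver_groups = {                # group -> codes its members can decipher
    "medievalists": {"Old French", "Gothic script"},
    "art lovers": {"visual literacy"},
    "casual visitors": set(),
}

def decipherable_channels(channels: dict[str, str], groups: dict[str, set[str]]) -> dict[str, set[str]]:
    """For each group of receivers, the channels whose code it can decipher.
    The document functions as a text as long as at least one group gets at least one channel."""
    return {group: {ch for ch, code in channels.items() if code in codes}
            for group, codes in groups.items()}

reach = decipherable_channels(document_channels, receiver_groups)
# medievalists -> linguistic and graphematic; art lovers -> artistic; casual visitors -> nothing
```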

Ok, I think I'll stop here for the moment. Next step will be the Work.

[1] I note here that this is more or less the same definition given by Huitfeldt and Sperberg-McQueen 2008: "By a document we understand an individual object containing marks. A mark is a perceptible feature of a document (normally something visible, e.g. a line in ink)", p. 297. The difference between these two definitions is that I explicitly say "text" and not "marks".

Thursday 1 August 2013

Of texts and documents and editing

I have finally been granted a six-month research leave and I am now totally immersed in the writing of The Book.
One of the things I am reflecting upon is the weight we give to documentary evidence in textual scholarship.
The graph below shows, in a simplified (and utterly imprecise) way, the position of some of the most important editorial theories with respect to the centrality they give to documentary evidence (the materiality of the text).


Similarly, editorial formats can be distributed along a continuous line:




While presenting a very complex reality of theoretical positions in an oversimplified way, these graphs have the advantage of showing the path between text and document as a continuum, with no clearly defined boundaries.
The positioning is not scientifically weighted; it is just the approximate placement I would give to these theories and formats. Comments?

Tuesday 19 March 2013

Crowd Sourcing and Digital Editing

The term is almost over, and I have finally found some time to write down some of the thoughts I have been chewing over in the past months. Apologies to my numerous followers for the long silence.
The topic this time is crowd-sourcing, which is a bit unusual for me as I have not been directly involved in any crowd-sourcing project, but, as many of you already know, I'm working on a book on Digital Scholarly Editing, which inevitably forces me to consider new forms of edition such as, for instance, crowd-sourcing and its role in editing.
A King's College London project devoted to the classification of different types of crowd-sourcing activity has just concluded, producing a hefty report written by Stuart Dunn, but, by its author's own admission, the classification contained in the report is a bit too comprehensive to be really useful for my purpose, so here you have the one I have created (yes, all by myself!). Comments are welcome!

Without any pretension of being exhaustive, crowdsourcing that concerns in some way the editing and publication of texts can be classified according to five parameters (a rough sketch of this classification in code follows the list):
1. Context: crowdsourcing projects can be hosted and supported by:
a. Universities and Cultural Heritage Institutions, such as Libraries and Museums. This is the case of some of the projects mentioned above (Transcribe Bentham is hosted and supported by UCL, for instance), and of the National Library of Australia’s Historic Newspaper Digitisation Project, where users have been asked to correct OCRed articles from historical newspapers.
b. Non-governmental organisations and other private initiatives: this is the case, for instance, of Project Gutenberg, which began in 1971 from the vision of its founder, Michael Hart, and has continued since thanks to donations.
c. Commercial companies: this is the case, for instance, of Google, which uses the reCAPTCHA service, asking users to enter words seen in distorted text images on screen, part of which come from unreadable passages of digitised books, thus helping to correct the output of the OCR process while protecting websites from attacks by internet robots (the so-called ‘bots’).
2. Participants: or rather, how they are recruited and which skills they should possess to be allowed to contribute. Some projects issue open calls, for which anybody can enrol and contribute as they wish, with no particular skill being required other than commitment; other projects require their contributors to possess specific skills, which are checked before the user is allowed to do anything. The former is the case for the Historic Newspaper Digitisation Project and Project Gutenberg, the latter for the Early English Laws project. Many projects place themselves in between these two categories, closer to one end or the other. In the SOL project, for instance, users are assumed to be able to read and understand Greek, and their competence is verified through the quality of their translations; to register as editors, users are expected to declare their competences, which are checked by the editorial board.
3. Tasks: the tasks requested of the users can be one or more of the following:
a. Transcribing manuscripts or other primary sources, as in the case of Transcribe Bentham.
b. Translating: as in the case of SOL.
c. Editing, which is requested by the Early English Laws project.
d. Commenting and Annotating: as in the case of the Pynchon Wiki.
e. Correcting: this is the case, for instance, of the National Library of Australia’s project seen above and of Project Gutenberg, where users not only contribute by uploading new material, but also take on proofreading texts already in the archive.
f. Answering specific questions: this is the case of the Friedberg Genizah Project, for instance, which uses the project’s Facebook page to ask its followers specific questions about, say, a particular reading of a passage, or whether the hand of two different fragments is the same, and so on.
4. Quality control: the quality of the work produced by the contributors can be assessed by professional staff hired for that purpose (e.g. Transcribe Bentham), or it can be assured by the community itself, with super-contributors whose controlling roles are earned in the field by becoming major contributors (e.g. Wikipedia), or because of their qualifications (e.g. SOL), or both.
5. Role in the project: for some projects the crowdsourced material is the final aim of the project, as for Project Gutenberg or the Historic Newspaper Digitisation Project, or it can be a product that will be used in other stages of the project. The transcriptions produced within the Transcribe Bentham project serve a double purpose: they represent the main outcome of the project as, once their quality has been ascertained, they feed into UCL’s digital repository, but they are also meant to be used for the edition of The Collected Works of Jeremy Bentham, in preparation since 1958.
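For the digitally minded, the five parameters can also be written down as a small data structure. The sketch below is mine alone and purely tentative: the controlled values simply mirror the categories above, and the example record for Transcribe Bentham encodes only what is said in this post.

```python
from dataclasses import dataclass
from enum import Enum

class Context(Enum):
    INSTITUTIONAL = "universities and cultural heritage institutions"
    PRIVATE = "NGOs and other private initiatives"
    COMMERCIAL = "commercial"

class Task(Enum):
    TRANSCRIBING = "transcribing"
    TRANSLATING = "translating"
    EDITING = "editing"
    COMMENTING = "commenting and annotating"
    CORRECTING = "correcting"
    ANSWERING = "answering specific questions"

@dataclass
class CrowdsourcingProject:
    name: str
    context: Context
    participants: str          # how they are recruited / which skills are required
    tasks: set[Task]
    quality_control: str       # professional staff, the community itself, or both
    role_in_project: str       # final aim of the project, or input to further stages

transcribe_bentham = CrowdsourcingProject(
    name="Transcribe Bentham",
    context=Context.INSTITUTIONAL,   # hosted and supported by UCL
    participants="open call, no particular skill required other than commitment",
    tasks={Task.TRANSCRIBING},
    quality_control="assessed by professional staff hired for that purpose",
    role_in_project="feeds UCL's digital repository and the Collected Works edition",
)
```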

Is there anything else I should have included?