Guest post by Jean Rime (University of Fribourg / Paul-Valéry University, Montpellier). For a French version, please go to our partner infoclio.ch's website.
Contributions focused on three types of material: diplomatic documents, press archives and born-digital sources. The first-hand experiences described by speakers were placed in a broader context by discussants so as to highlight related methodological issues.
The day began with a session on diplomatic documents. JOËL PRAZ and CHRISTIANE SIBILLE (Bern) presented Dodis (Diplomatic Documents of Switzerland), a research centre set up in 1972 with the aim of providing access to sources on Swiss foreign relations. In 1995, the project developed a digital database, which currently contains some 30 000 documents, as well as 40 000 individuals, 20 000 organisations and 7 000 geographical locations that have been manually indexed. Its activities involve the selection and scientific editing of documents published online, the most important of which are given a critical apparatus and published as print reference works. Around this primary objective, various related tools have been developed, including a collection of open-access studies (Quaderni di Dodis), a bibliography and thematic e-reports. Dodis is committed to ensuring the interoperability of its resources with other research instruments via the Metagrid portal and the HistHub platform. The longevity of the project provides an opportunity for critical reflection on the sustainability of the information produced, transcription standards for online publishing and issues raised by the retrospective indexing of previous volumes. Conclusions reached include the need to adopt common standards such as TEI (Text Encoding Initiative) to facilitate the interoperability of descriptors.
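The interoperability that TEI affords can be illustrated with a minimal sketch. This is not Dodis's actual workflow, and the document title and dodis.ch identifier below are hypothetical; the point is simply how manually indexed persons can travel with a text as machine-readable descriptors:

```python
# A minimal TEI sketch (not Dodis's actual pipeline): a document plus a
# back-matter personography whose entries point to shared identifiers.
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)

def t(tag):
    """Qualify a tag name with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"

def tei_document(title, paragraphs, persons):
    """Build a minimal TEI tree; `persons` maps names to reference URLs."""
    tei = ET.Element(t("TEI"))
    header = ET.SubElement(tei, t("teiHeader"))
    file_desc = ET.SubElement(header, t("fileDesc"))
    title_stmt = ET.SubElement(file_desc, t("titleStmt"))
    ET.SubElement(title_stmt, t("title")).text = title
    text = ET.SubElement(tei, t("text"))
    body = ET.SubElement(text, t("body"))
    for para in paragraphs:
        ET.SubElement(body, t("p")).text = para
    # Indexed entities go into a back-matter list of persons.
    back = ET.SubElement(text, t("back"))
    list_person = ET.SubElement(back, t("listPerson"))
    for name, ref in persons.items():
        person = ET.SubElement(list_person, t("person"))
        ET.SubElement(person, t("persName"), ref=ref).text = name
    return tei

doc = tei_document(
    "Note on trade negotiations",                   # hypothetical document
    ["The delegation met in Bern."],
    {"Max Petitpierre": "https://dodis.ch/P123"},   # hypothetical identifier
)
xml_text = ET.tostring(doc, encoding="unicode")
```

Because the personography is ordinary XML, any TEI-aware platform can resolve the `ref` attributes against a shared authority file, which is what makes descriptors interoperable across projects.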
Drawing on a series of examples from his own research and from other projects, FRÉDÉRIC CLAVERT (Luxembourg) explored the relevance of specifically digital approaches, which do not merely enable a shift of scale in the processing of corpora but aim to turn mass phenomena into data – datafication – and to analyse those data through machine-learning techniques. Discussing Franco Moretti’s notion of distant reading, he outlined how this paradigm has been put into practice through statistics, topic modelling and data clustering, while also noting that it can obscure analytical processes, since visualisation techniques often mask the underlying algorithms and make it impossible to verify results against the original documents. These results suffer from a number of recurrent biases, whether because of the quality of the digital corpus and its automatic reading (OCR) or because of its blind spots: the growing but uneven digitisation of sources produces an “illusionary order” (Ian Milligan) that conceals the sectors digitisation covers less thoroughly, thereby creating potentially misleading imbalances.2
Led by EMMANUEL MOURLON-DRUOL (Glasgow), discussions continued on the limits of digital approaches. The existence of vast reams of documents waiting to be digitised raises questions as to the criteria employed to select archives. Moreover, the corpora available and the computer tools used to analyse them inevitably influence the choice of research questions. These issues are closely related to more traditional methodological concerns.
The second session focused on the digitised historical press. Representing the tradition of interdisciplinary research (literature and cultural history), GUILLAUME PINSON (Quebec), one of the protagonists of the Numapresse project, is analysing these corpora not as reservoirs of information but as a way of investigating the workings of the medium itself, how it is perceived and conceptualised, as studied in the previous projects La Civilisation du journal3 and Médias19. Press digitisation at international level and the prospect of improved interoperability between platforms are extending research in this field, which was previously confined to a traditional monographic or national approach, towards an observation of the media system as a whole. Newspapers are therefore seen as a “laboratory for data circulation” characterised by reprinting and by the “viral” nature of content (see Ryan Cordell, www.viraltexts.org/). The new possibilities raised by optical character recognition (OCR) enable automatic tracking of the circulation of articles, including in unexpected areas of the corpus; they help formalise and detect the generic nature of certain discourses, shedding light on the development of forms of journalism. Given the technological challenges (interoperability), legal challenges (partially copyrighted newspapers) and scientific challenges (the status and complexity of journalistic discourse), it nevertheless seems wise to argue for a “light digital approach” that also involves an immersive approach to the original texts.
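The automatic tracking of reprinted articles that OCR makes possible can be sketched, in the spirit of the Viral Texts project, as n-gram overlap between texts. This is a deliberate toy simplification (the sample sentences are invented, and real pipelines must also cope with OCR noise and fuzzy matching), but it conveys the principle:

```python
# Toy reprint detection: two articles are flagged as a probable reprint
# pair when they share a high proportion of 5-grams of words.
def ngrams(text, n=5):
    """Return the set of word n-grams of a text, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_ngram_ratio(a, b, n=5):
    """Share of n-grams in common, relative to the shorter text,
    so that a short excerpt embedded in a longer article still scores high."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / min(len(ga), len(gb))

original = "the steamer arrived in port yesterday carrying news from europe"
reprint = "we learn that the steamer arrived in port yesterday carrying news from europe"
unrelated = "wheat prices fell sharply on the local market this week"

high = shared_ngram_ratio(original, reprint)     # close to 1.0
low = shared_ngram_ratio(original, unrelated)    # 0.0
```

Normalising by the shorter text is what lets such methods surface circulation "in unexpected areas of the corpus": a paragraph lifted into a much longer article still registers as a match.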
Based on a case study on anti-European sentiment in the Swiss and Luxembourgish press carried out as part of the Impresso project, ESTELLE BUNOUT (Luxembourg) demonstrated the complex nature of digitised sources, which are composed not only of the scanned text but also of a series of technical, bibliographical and administrative metadata documenting the processes undergone by the scanned sources. Ideally, all these layers of information are needed to create a relevant corpus. Corpora are no longer formed by systematically going through an entire collection but by a cross-examination technique based on keywords, relevant descriptors or filtering by type of article. Comparing these various strata of a digitised corpus (which in no way excludes the pragmatic use of more traditional methods, depending on the documents being processed) is vital as a means of delineating, for example, the media space in which anti-European sentiment is expressed, thereby avoiding skewing the analysis with an illusory levelling of the medium of origin.
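The layered corpus-building described here can be sketched schematically. The data model below is hypothetical (it is not Impresso's actual interface); it simply shows how keyword, article-type and technical-metadata filters combine, so that poorly digitised items do not silently skew the resulting corpus:

```python
# Schematic corpus building over a digitised press collection: filter by
# article type, OCR quality (technical metadata) and keyword, instead of
# reading the whole collection sequentially.
from dataclasses import dataclass, field

@dataclass
class Article:
    text: str
    newspaper: str
    year: int
    article_type: str            # e.g. "editorial", "news", "advert"
    descriptors: set = field(default_factory=set)
    ocr_confidence: float = 1.0  # technical metadata from digitisation

def build_corpus(collection, keywords, types, descriptors=None, min_ocr=0.8):
    """Keep articles of an allowed type, with trustworthy OCR,
    matching any keyword (and, optionally, any descriptor)."""
    return [
        a for a in collection
        if a.article_type in types
        and a.ocr_confidence >= min_ocr
        and (descriptors is None or a.descriptors & descriptors)
        and any(k.lower() in a.text.lower() for k in keywords)
    ]

collection = [  # invented sample data
    Article("L'Europe nous menace...", "Gazette", 1930, "editorial",
            {"Europe"}, 0.92),
    Article("Marché aux bestiaux...", "Gazette", 1930, "advert",
            set(), 0.95),
    Article("europe illisible...", "Journal", 1935, "editorial",
            set(), 0.55),  # excluded: OCR too poor to trust
]
corpus = build_corpus(collection, keywords=["Europe"], types={"editorial"})
```

Keeping the OCR-confidence layer in the query is one way of avoiding the "illusory levelling" mentioned above: an article that matches a keyword only because of garbled text can be held back or flagged rather than silently counted.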
Summarising the methodological issues raised by the two previous speakers, ALAIN CLAVIEN (Fribourg) asked whether the digitisation of the press simply makes traditional sources easier to consult, or whether this implicit “remediation” actually creates a new source that is not equal to the sum of scanned historical newspapers but could be described generally as the collective expression of a “discourse”, in the sense used by Foucault or Angenot.
The third session looked at born-digital sources. Taking the example of his institution, ALEXANDRE GARCIA (ICRC, Geneva) presented the viewpoint of an archivist. Even though the ICRC opened its doors to researchers in 1996, its archiving policy remains mainly guided by internal objectives: operational continuity, accountability to partners or member states, and institutional memory. The emergence of born-digital documents has changed the life cycle of archives. Processing now takes place well in advance of any conservation procedures, sometimes as soon as the documents are created, to allow for the future addition of metadata that will facilitate their subsequent classification. While email archiving is now well managed, other types of data are more complex to deal with: geographical data, WhatsApp messages, social media, cloud storage, etc. These data are characterised by a life cycle that transforms institutional communication into composite, multi-channel information whose volume, forms, content and means of dissemination are constantly changing – making the case for adjustments to the legislative framework.
While collecting digital data is a tricky task at institutional level, it is all the more complex at the scale of the global web. VALÉRIE SCHAFER (Luxembourg) presented the various pitfalls involved. Since the Internet Archive was set up in 1996, it has accumulated and currently stores 330 billion pages. But despite this massive quantitative growth, the collection also suffers from structural biases, which need to be understood if it is to be used properly. For example, crawlers programmed for speed skip some images or, when harvesting updated content on different dates, do not systematically re-save elements that are supposed to remain more stable, such as banners.4 This means that the archive consulted by historians often reproduces neither the appearance nor the features of the site as it appeared to its original users. This variability of the archive is only increasing over time, now that users can customise the content they display. This discrepancy raises the question of exactly what we can and want to analyse in web archives.
CAROLINE MULLER (Reims) widened the debate by emphasising how difficult it is to keep pace with these rapid developments in historians’ training, research practices and even the representations of their activities. We are repeatedly told that we are a “generation of transition”, but the dizzying pace of technological change could make this a permanent state of affairs, a prospect that is both frightening and stimulating. While digital documents were initially treated by analogy with paper documents (PowerPoint presentations conserved as PDFs, emails printed then scanned, etc.), there is an increasing need for ad hoc procedures for sources that are far removed from the bibliographical skills that we were taught: codes, algorithms, software, etc. Unless changes are made to the political, legal and scientific frameworks governing archiving practices, vast swathes of our current heritage are at risk of disappearing entirely: what will the Facebook archives look like a few decades from now?
By the end of the day, the diversity of the projects presented had enabled us to take stock of the progress made in accessing and processing primary sources thanks to digital technologies, while also identifying some of the methodological problems raised by these new tools. Resisting both fascination with attractive visualisation tools (maps, clouds, etc.) and undue scepticism, participants' historiographical reflection on these developments showed a degree of maturity, revealing several areas of convergence among the multidisciplinary audience, in terms of both the questions raised and the proposals put forward to address them.
A first area concerned the scope of technological innovations. Do they merely facilitate the work of historians, or are they creating an epistemological revolution? While it goes without saying that the scale of corpora has changed, several participants noted that the results produced – at great cost – were often not radically new, and that it was even rare for theories to emerge as a direct result of work with a computer. Indeed, identifying and countering the various biases in corpora of digital or digitised archives (gaps, disproportions, etc.) ultimately boils down to applying traditional source criticism methods. The fantasy of a “total archive” is therefore replaced with well-reasoned procedures that give corpora a coherence and representativeness in light of issues that have been clearly predefined and specified for each research question. Although there has undoubtedly been a paradigmatic shift, it is not merely a result of the use of machines; it is first and foremost down to the conceptualisation of new objects, which can be attributed to a more complex combination of factors – as was also the case with the emergence of social history.
Reflecting this midpoint between a blindly all-digital approach and traditional practices, there seemed to be broad agreement on a compromise between “distant reading” and “close reading” (or “deep reading”). In contrast with Franco Moretti’s “little pact with the devil”,5 a hybrid method seems to be called for, not only to address the shortcomings of digitisation (sources that are only physical or are illegible), but also to refine quantitative results using qualitative analysis, even if applying these precautions means missing out on the most innovative – but also the most speculative and least controllable – possibilities of digital technologies.
While historical methods, especially external criticism, are strengthened by their contact with digital technologies, the conceptual vision of the profession is troubled by this development. The prospect of immediate access to previously hard-to-reach sources, or of research being governed by the “hard” sciences, is both attractive and disheartening, and it can elicit a certain nostalgia. Our relationship with digital technologies is primarily conditioned by a series of representations or affects, over and above the actual impact of technical data. It is therefore also important to analyse these reactions when looking to the future, since it is they, perhaps more than technical innovations, that will determine the “allure of the archives”1 in future.
Module 1 – Diplomatic documents
Christiane Sibille & Joël Praz, Academic Researchers, Dodis: “Les Documents Diplomatiques Suisses à l’ère numérique: pratiques et perspectives” [Swiss diplomatic documents in the digital age: practices and prospects]
Frédéric Clavert, Senior Research Scientist, C2DH, AIHCE: “Du document diplomatique à l’histoire des relations internationales par les données massives” [From diplomatic documents to the history of international relations through big data]
Discussion moderated by Emmanuel Mourlon-Druol, Senior Lecturer, University of Glasgow, AIHCE
Module 2 – The digitised press
Guillaume Pinson, Professor at Université Laval, member of the Numapresse project: “La presse ancienne à l’ère numérique: enjeux scientifiques d’une remédiation” [The historical press in the digital age: the scientific issues of remediation]
Estelle Bunout, Research Associate, C2DH: “Une recherche plus fouillée dans un corpus imparfait? L’étude de la question européenne dans la presse numérisée suisse et luxembourgeoise (1848-1945)” [A deeper exploration of an imperfect corpus? Investigating the European question in the digitised Swiss and Luxembourgish press (1848-1945)]
Discussion moderated by Alain Clavien, Professor at the University of Fribourg
Module 3 – Born-digital sources
Alexandre Garcia, Archivist, ICRC: “Comment se constitue un fonds d’archives aujourd’hui? L’exemple du CICR” [How are archives compiled today? The example of the ICRC]
Valérie Schafer, Professor in Contemporary European History, C2DH: “Les archives du Web, une lecture à plusieurs niveaux” [Web archives, a reading at several levels]
Discussion moderated by Caroline Muller, PRAG, University of Reims
- 1. Farge Arlette, Le Goût de l’archive, Paris, Seuil, 1997. See in particular the case she makes for “the tactile and direct approach of the material, the feel of touching traces of the past” (p. 15) when describing what at the time was a state-of-the-art technique: reproduction on microfilm.
See also: Muller Caroline and Clavert Frédéric (eds), Le goût de l’archive à l’ère numérique, http://gout-numerique.net/
- 2. This is true of the over-representation of digitised newspapers in press studies. See Milligan Ian, “Illusionary Order: Online Databases, Optical Character Recognition, and Canadian History, 1997–2010”, The Canadian Historical Review, 94/4, 2013, p. 540‑569.
- 3. Kalifa Dominique, Régnier Philippe, Thérenty Marie-Ève and Vaillant Alain (eds), La Civilisation du journal. Histoire culturelle et littéraire de la presse française au XIXe siècle, Paris, Nouveau Monde, 2011.
- 4. Another striking example: pages that are supposed to have been recorded in summer display winter weather maps…
- 5. Moretti Franco, “Conjectures on World Literature”, Distant Reading, London / New York, Verso, 2013, p. 48. French version: Études de lettres, 2, 2001.