The usefulness in anthropology and digital ethnography of the free software corpus analysis tool TXM

Working on it!


I would like to thank Psychoslave for informing me about the existence of the Wikimedia-l mailing list, the team of the Textométrie project developing and supporting the TXM software[1], the instructors of the course Méthodologie de l'analyse de corpus en linguistique taught at UCLouvain during the academic years 2017-2018 and 2018-2019[2], and finally the company DeepL GmbH[3] for its free online translation service, which I used to write this text from my native French.

Introduction and initial question

As a result of the rise of ICTs, anthropologists, like other researchers in the social sciences and humanities, are increasingly confronted with the use and observation of means of communication deployed in various digital spaces. Social networks, forums, instant messaging groups, mailing lists and collaborative sites have become places of communication and life for many human communities. As a result of this evolution, science, including anthropology, has turned them into new fields of research, such as digital anthropology or virtual ethnography[4].

This research is therefore situated in the field of anthropology as a science and of ethnography as a methodology. We will not discuss linguistic anthropology here, nor the analysis of anthropological literature[5], nor online ethnography as a complement to the linguistic analysis of log data[6], but rather corpus analysis software as a research tool in ethnographic fieldwork. Corpora are numerous within digital communication spaces, as is corpus analysis software, but this case study is limited to the archives of one mailing list and to the TXM software.

Thus, the initial question of this research could be summarized as follows:

"How can the corpus analysis software TXM help a digital anthropologist in his or her ethnographic fieldwork?"

The Wikimedia mailing list as a corpus

Why the Wikimedia mailing list?

One reason for choosing the Wikimedia mailing list as a corpus is that the Wikimedia movement is the focus of my doctoral thesis. Another reason was the ease of creating the corpus: the content of the mailing list archive can be copied and pasted, month by month, into separate files in .txt format that are directly usable by the software.

The content of this mailing list was of interest to me, but difficult to exploit by simple reading. The idea of using TXM as a sophisticated search engine to explore these thousands of messages came to mind and is at the origin of the present research.

As a last argument, the archives of the mailing list are directly copied and pasted, month by month, from a web page displaying the CC-BY 3.0 license[7], which makes their use very easy.

Mailing list description

The Wikimedia community mailing list, named "Wikimedia-l"[8], is a discussion list for the Wikimedia community and the larger network of organizations (Wikimedia Foundation, chapter organizations, affiliates, partners) supporting its work. This mailing list can, for example, be used for:

  • The initial planning phase of potential new Wikimedia projects and initiatives
  • Organizational issues of the Wikimedia Foundation, chapter organizations, others
  • Discussing the set-up of local Wikimedia chapters
  • Developing and evaluating grant-making programs
  • Planning elections, polls and votes
  • Discussion of projects that don't already have a mailing list
  • Finding ways to raise funds
  • Other Wikimedia-related issues

Corpus description

The corpus initially consisted of a folder containing X files (one file per month from April 2004 to April 2018) for a total of X MB. The text as a whole amounted to X words.
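Figures of this kind (number of files, total size, word count) can be measured directly on the folder of monthly .txt files. The following is only a minimal Python sketch: the function name `corpus_stats` and the folder layout are assumptions made for illustration, and a word is simply taken to be a whitespace-separated string, so the count will differ from the one produced by TXM's own tokenizer.

```python
from pathlib import Path


def corpus_stats(folder):
    """Return (file count, total size in MB, whitespace-separated word count)."""
    files = sorted(Path(folder).glob("*.txt"))
    total_bytes = sum(f.stat().st_size for f in files)
    total_words = 0
    for f in files:
        # errors="replace" keeps the count going if an archive contains odd bytes.
        with f.open(encoding="utf-8", errors="replace") as handle:
            for line in handle:
                total_words += len(line.split())
    return len(files), total_bytes / 1_000_000, total_words
```

Calling `corpus_stats("path/to/corpus")` on the corpus folder (the path is a placeholder) returns the three values in that order.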

The free software TXM as analysis tool

TXM User Manual Version 0.7 ALPHA

Why TXM

Some people claim to be vegetarians and do not eat meat. For my part, I claim to be a "libriste" (a free software advocate) and do not "eat" proprietary software, as defined by Richard Stallman[9]. TXM 0.7.9[10] meets my expectations in this respect. In addition, TXM is developed by a team of French researchers, and good documentation of the software in French was available from the project's website[11] in the form of a manual[12] and video tutorials[13]. Finally, the project has a mailing list[14] and a wiki[15] that gave me the opportunity to receive support from community members in French.

TXM description[16]

TXM is a free, open-source, Unicode, XML and TEI compatible text/corpus analysis environment and graphical client based on CQP and R. It is available for Microsoft Windows, Linux, Mac OS X and as a J2EE web portal. It provides the features listed below.

Qualitative analysis

  • Concordances of lexical patterns based on the efficient CQP full text search engine and its CQL query language
  • CQL pattern frequency lists for any word property (type, lemma, pos...) thanks to the integration of TreeTagger for lemmatization and pos tagging
  • CQL pattern occurrence graphics
  • lexical patterns are expressed in the CQL query language, based on word and structure level properties, for example:
    • "aiming" to simply search for the word "aiming"
    • ".*ing" to search for words ending in "ing" (including mainly verb forms)
    • [pos="VERB" & word=".*ing"] to search for verb forms ending in "ing" (where Part of Speech annotation is present)
    • [lemma="group"] []{0,3} [pos="VERB" & word=".*ing"] to search for the collocation of the lemma "group" followed by a verb form ending in "ing", with at most 3 words in between
  • rich HTML-based text edition navigation with links from all other tools
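To make the logic of the CQL examples above more concrete, here is a small Python sketch that emulates them on a handful of hand-tagged tokens. It is not CQP and does not call TXM: the token list, the tag names and the helpers `token_matches`, `find` and `find_with_gap` are all hypothetical, and the emulation is deliberately simplified. It relies only on the fact that a CQL regular expression has to match the whole value of a property.

```python
import re

# Hypothetical tagged tokens: (word, lemma, pos). In TXM these properties
# would be produced at import time (for example by TreeTagger).
TOKENS = [
    ("The", "the", "DET"), ("group", "group", "NOUN"),
    ("is", "be", "VERB"), ("aiming", "aim", "VERB"),
    ("at", "at", "ADP"), ("something", "something", "PRON"),
    ("interesting", "interesting", "ADJ"),
]


def token_matches(token, **constraints):
    """True if every given property fully matches its regular expression."""
    props = dict(zip(("word", "lemma", "pos"), token))
    return all(re.fullmatch(rx, props[name]) for name, rx in constraints.items())


def find(tokens, **constraints):
    """Emulate a one-token query such as [pos="VERB" & word=".*ing"]."""
    return [t[0] for t in tokens if token_matches(t, **constraints)]


def find_with_gap(tokens, first, second, max_gap=3):
    """Emulate [first] []{0,max_gap} [second]; return (start, end) index pairs."""
    hits = []
    for i, tok in enumerate(tokens):
        if not token_matches(tok, **first):
            continue
        # The second token may sit 1 to max_gap + 1 positions after the first.
        for j in range(i + 1, min(i + 2 + max_gap, len(tokens))):
            if token_matches(tokens[j], **second):
                hits.append((i, j))
    return hits
```

On this toy sample, `find(TOKENS, word=".*ing")` also returns "something" and "interesting", which shows why the pattern catches mainly, but not only, verb forms; adding `pos="VERB"` narrows the result to "aiming".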

Quantitative analysis

  • factorial correspondence analysis
  • contrastive word specificities
  • hierarchical classification
  • analysis of cooccurring words or lexical patterns

Corpus Data Model

  • Indexes words and their properties as well as hierarchical structure of texts
  • Indexes external or internal metadata of texts or speakers
  • Allows construction of various subcorpora and partitions (for contrastive analysis between text structures or groups of words)
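The idea of a partition can be illustrated outside TXM with a few lines of Python. This is only a sketch of the concept, not of TXM's data model: the file names in the form "YYYY-MM.txt" and the helpers `partition`, `year_of` and `month_of` are assumptions made for the example.

```python
from collections import defaultdict


def partition(file_names, key):
    """Group file names into parts according to a key function."""
    parts = defaultdict(list)
    for name in file_names:
        parts[key(name)].append(name)
    return dict(parts)


def year_of(name):
    """Part label for a partition by year: '2004-04.txt' gives '2004'."""
    return name[:4]


def month_of(name):
    """Part label for a partition by month of the year: '2004-04.txt' gives '04'."""
    return name[5:7]
```

With one file per month, `partition(files, year_of)` yields one part per year and `partition(files, month_of)` one part per calendar month, which are the two kinds of grouping that contrastive analyses can then compare.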

Personal feedback on installation, import and use of features

Before TXM, I had used very few textometry programs, and only occasionally. Getting to grips with this software did not seem excessively difficult to me, but it certainly would have been if I had not previously acquired some knowledge of corpus analysis in linguistics. Without this training, I would have had to assimilate a whole set of concepts such as occurrence, lemma and token while discovering the software.

Honestly, it seems to me that someone with enough time can successfully install the software and use it with the manual alone. I used the French version myself, but an English version also exists in beta.

In the end, the only problems I encountered in this experiment concerned the installation and use of the TreeTagger automatic tagging software, which, unlike the R statistical processing software, is not pre-installed within TXM. These problems were due to configuration errors on my part and, in one case, probably to a corrupted downloaded file.

Finally, it should be noted that importing my corpus, which creates an XML file containing the categorization and lemmatization information, took more than three hours on a desktop computer. At the end of the process, an overload of my 8 GB of RAM forced the computer to use swap space on the hard disk. The folder holding the corpus in binary format, produced after more than one hour of calculation, was 6.5 GB in size and could not be loaded on my laptop for lack of disk space, even though more than 15 GB was available.

It therefore seems important to me to point out that, before embarking on the analysis of a corpus with TXM, it is necessary to ensure that the computer hardware is powerful enough for the size of the text. As another example, after I created two partitions (12 months and 14 years), the software's start-up time increased from a few seconds to nearly five minutes.

The software seemed relatively stable to me as long as a new calculation is not launched before the previous one has finished. Given the size of the corpus and the power of my desktop computer, some processes can reach high or even excessive execution times. When the software freezes and has to be shut down through the computer's operating system, some of the work done before the shutdown may be lost. It is therefore advisable to restart the application after completing an important task.

Useful information for the ethnographer provided by TXM features

One by one, we will discuss here the functionalities offered by the TXM software, and their ability to provide useful information for the ethnographer. For each useful feature, we will give an example applied to the analysis of the archives of the Wikimedia-l mailing list.

Edition

The Edition function lets you browse the entire corpus in an HTML view, with a tooltip on each word indicating its lexical category.

Navigation is done file by file, with the file name as the tab header, and a right-click context menu lets you send a word to the concordancer.


Without leaving the TXM software, this function gave me the opportunity to browse the corpus briefly in full text, to get a sense of its structure, and to launch concordance searches from the name or pseudonym of any known person. This is an easy way to go through all the contributions of an actor you would like to follow within the mailing list. We will come back to the concordancer later.
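What a concordance returns for a name can be pictured with a minimal keyword-in-context (KWIC) sketch in Python. It only illustrates the principle, under simplifying assumptions: the function `kwic` is hypothetical, the text is already split into tokens, and the comparison is a plain case-insensitive match rather than a CQL query.

```python
def kwic(tokens, keyword, width=4):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines
```

Each returned line shows one occurrence of the name with a few words on either side, which is what makes it possible to follow one participant across messages.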

Lexicon

A lexicon analysis (a list of words ordered by frequency) already gives the ethnographer useful information. By showing which words are used most often by the community on the mailing list, it provides the researcher with:

  • information on the main topics of discussion in the community, useful for guiding individual semi-structured interviews;
  • information on the most active members of the mailing list, useful for selecting interviewees;
  • information on the most used email providers, useful for knowing which communication channel will reach the largest number of contacts.

Example from the corpus:

  • In this corpus, made up of a mailing list archive, the lexicon shows 863676 occurrences of "@", which means that the same number of messages was posted on the mailing list.
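The principle of such a frequency list can be sketched in a few lines of Python. This is a rough illustration and not TXM's Lexicon command: the functions `lexicon` and `providers` are hypothetical, the tokenization is a simple regular expression that keeps "@" as a token of its own, and the sample addresses in the usage note are invented.

```python
from collections import Counter
import re


def lexicon(text):
    """Frequency list of word forms; '@' is kept as a token of its own."""
    tokens = re.findall(r"\w+|@", text.lower())
    return Counter(tokens)


def providers(text):
    """Frequency list of the domains that follow an '@' sign."""
    return Counter(re.findall(r"@([\w.-]+)", text.lower()))
```

For instance, on a text containing the two invented addresses alice@example.org and bob@example.org, `lexicon(text)["@"]` is 2 and `providers(text)` counts "example.org" twice; `lexicon(text).most_common(20)` gives the twenty most frequent forms.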

Conclusion

Other types of corpus are possible.

Theoretical resources

  • Strategic Interaction and Knowledge Sharing in the KDE Developer Mailing List[17].
  • What Can OSS Mailing Lists Tell Us? A Preliminary Psychometric Text Analysis of the Apache Developer Mailing List[18].
  • Analyse de complexité des textes coutumostratiques[19]
  • Outline of natural language processing[20]
  • TXM French user manual[21]

Papers to explore

  • Explore, play, analyse your corpus with TXM[22]

External resources

Notes and sources

  1. "L'équipe TXM - Projet Textométrie". Retrieved 2018-12-16.
  2. UCL/SGSI (2018). "Méthodologie de l'analyse de corpus en linguistique". UCL Catalogue des formations 2018-2019 (in French). Retrieved 2018-12-16.
  3. "DeepL Translator". Retrieved 2018-12-16.
  4. As an indication, in the English-speaking sphere, the article cyber-anthropology, renamed digital anthropology, appeared on 11 September 2005, and the article on virtual ethnography, renamed cyber-ethnography, on 1 February 2006.
  5. This kind of research could be very interesting, for example as an extension of the digital archiving work carried out by the ODAS platform.
  6. Androutsopoulos, Jannis (2008-09-04). "Potentials and Limitations of Discourse-Centred Online Ethnography". Language@Internet 5 (8). ISSN 1860-2029. 
  7. Source:
  8. Source:
  9. Williams, Sam (2011-11-30). Free as in Freedom (Paperback): Richard Stallman's Crusade for Free Software (in English). O'Reilly Media, Inc. ISBN 9781449324643.
  10. Heiden, S. (2010). The TXM Platform : Building Open-Source Textual Analysis Software Compatible with the TEI Encoding Scheme. In K. I. Ryo Otoguro (Ed.), 24th Pacific Asia Conference on Language, Information and Computation (p. 389-398). Institute for Digital Enhancement of Cognitive Development, Waseda University.
  11. "Projet Textométrie". Retrieved 2018-12-13.
  12. Heiden, Serge (2018-02-26). "Manuel de TXM 0.7 FR". (in French). Retrieved 2018-12-13.
  13. "Atelier d'initiation à TXM de Bénédicte Pincemin du 27 septembre 2012". Retrieved 2018-12-13.
  14. "txm-users - TXM users mailing list - subrequest". Retrieved 2018-12-13.
  15. "index [Le wiki de la liste txm-users]". Retrieved 2018-12-13.
  16. "Présentation - Projet Textométrie". Retrieved 2018-12-11.
  17. Kuk, George (2006-07). "Strategic Interaction and Knowledge Sharing in the KDE Developer Mailing List". Management Science 52 (7): 1031–1042. doi:10.1287/mnsc.1060.0551. ISSN 0025-1909. 
  18. "Purchase: What Can OSS Mailing Lists Tell Us? A Preliminary Psychometric Text Analysis of the Apache Developer Mailing List". Retrieved 2018-11-20.
  19. "Recherche:Analyse de complexité des textes coutumostratiques — Wikiversité". (in French). Retrieved 2018-11-20.
  20. "Outline of natural language processing". Wikipedia. 2018-10-08. 
  21. Heiden, Serge (2018-02-26). "Manuel de TXM 0.7 FR". (in French). Retrieved 2018-11-20.
  22. "Explore, play, analyse your corpus with TXM | DHd-Blog". Retrieved 2018-12-13.