Open Global Health/ContentMine
This page organizes the work regarding the ContentMine Fellowship awarded to Ale Abdo as part of the Open Global Health effort.
Tag along
Interested in following up or contributing to this project?
Ongoing
Roadmap
The first step is to organize a database of PubMed and conference abstracts relevant to global health issues from recent years (after 2000).
Then, some effort will involve recognizing what kinds of facts are actually interesting and extractable. This will likely involve creating dictionaries and improving the software.
Once we're able to extract facts, create a nice interface to make facts and their sources easily discoverable. The main idea for now is a tool to assist a person looking for an overview and references on a combination of conditions, locations and social aspects.
Following that, check whether we can trace how early facts appear in conference abstracts relative to the published literature. If successful, figure out which features of conference abstracts could help us infer their reliability.
Tasks
See below, and perhaps some issues on GitLab.
Data
Currently working on the dataset coming from an EPMC search for:
(KW:"global health") AND (PUB_TYPE:"Review" OR PUB_TYPE:"review-article" OR PUB_TYPE:"Meta-Analysis")
The specific command is
node node_modules/getpapers/bin/getpapers.js -q '(KW:"global health") AND (PUB_TYPE:"Review" OR PUB_TYPE:"review-article" OR PUB_TYPE:"Meta-Analysis")' -o data/globalhealth.all -xa
Currently it sometimes breaks getpapers, see this issue.
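The query above can also be assembled programmatically, which helps when varying the keyword or the publication types. What follows is a minimal Python sketch, assuming the public Europe PMC REST search endpoint; the helper names `build_query` and `build_search_url` are hypothetical, and the code only builds the URL, it is not a replacement for getpapers.

```python
from urllib.parse import urlencode

# Base URL of the Europe PMC REST search service (assumed endpoint).
EPMC_SEARCH_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_query(keyword, pub_types):
    """Compose an EPMC query: one keyword AND any of the publication types."""
    types = " OR ".join(f'PUB_TYPE:"{t}"' for t in pub_types)
    return f'(KW:"{keyword}") AND ({types})'


def build_search_url(query, page_size=25):
    """Return a search URL for the query (hypothetical helper, JSON output)."""
    params = {"query": query, "format": "json", "pageSize": page_size}
    return f"{EPMC_SEARCH_URL}?{urlencode(params)}"


query = build_query("global health", ["Review", "review-article", "Meta-Analysis"])
print(query)
print(build_search_url(query))
```

The first printed line reproduces the search string used with getpapers, so the same string can be passed to its `-q` option.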
After getting the files, I run norma to produce scholarly HTML:
./contentmine/norma/bin/norma --project data/globalhealth.all --input fulltext.xml --output scholarly.html --transform nlm2html
Finally, I access the data from Python using pycproject, with one quirk, however.
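For a quick sanity check that norma produced output, the project directory can also be walked with the standard library alone. This is only a sketch and does not use pycproject; it assumes the project holds one subdirectory per paper, each containing `fulltext.xml` and, after norma, `scholarly.html`. The function name `list_ctrees` is hypothetical.

```python
from pathlib import Path


def list_ctrees(project_dir, required="scholarly.html"):
    """Return the per-paper directories of a project that contain `required`."""
    root = Path(project_dir)
    if not root.is_dir():
        # Missing project directory: nothing to list.
        return []
    return sorted(
        child for child in root.iterdir()
        if child.is_dir() and (child / required).is_file()
    )


# Print the papers that already have a scholarly.html file.
for ctree in list_ctrees("data/globalhealth.all"):
    print(ctree.name)
```

Calling it with `required="fulltext.xml"` and comparing the two lists shows which papers norma has not converted.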
Dictionary
Turning XML MeSH into a CM-formatted dictionary.
For this I put together mesh2cmdict.py. It automatically downloads and processes MeSH into a CM-compatible dictionary, optionally filtering specific branches of the MeSH hierarchy.
Places dictionary: MeSH is actually kinda poor in terms of places, stopping at country level with only a few 'global' cities (and a very colonialist point of view).
Social condition dictionary: MeSH seems OK, matching for the social branch.
Does CM understand synonyms in a dictionary? If so, how? For this and other issues, see this Discourse thread.
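The MeSH-to-dictionary conversion described above can be sketched roughly as follows. This is not the actual mesh2cmdict.py: it skips the download step, works on a tiny inline sample with made-up descriptors, and assumes the CM dictionary is a `<dictionary title="...">` element holding `<entry term="..." name="..."/>` children. The function name `mesh_to_cmdict` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the shape of the MeSH descriptor XML (not real MeSH data).
SAMPLE_MESH = """<DescriptorRecordSet>
  <DescriptorRecord>
    <DescriptorName><String>Example Term A</String></DescriptorName>
    <TreeNumberList><TreeNumber>I01.100</TreeNumber></TreeNumberList>
  </DescriptorRecord>
  <DescriptorRecord>
    <DescriptorName><String>Example Term B</String></DescriptorName>
    <TreeNumberList><TreeNumber>D03.200</TreeNumber></TreeNumberList>
  </DescriptorRecord>
</DescriptorRecordSet>"""


def mesh_to_cmdict(mesh_xml, title, branches=None):
    """Build a dictionary element from MeSH descriptors.

    If `branches` is given, keep only descriptors that have at least one
    tree number starting with one of those prefixes.
    """
    records = ET.fromstring(mesh_xml)
    dictionary = ET.Element("dictionary", title=title)
    for record in records.iter("DescriptorRecord"):
        name = record.findtext("DescriptorName/String")
        if not name:
            continue
        trees = [t.text or "" for t in record.iter("TreeNumber")]
        if branches and not any(t.startswith(tuple(branches)) for t in trees):
            continue
        # Assumed entry layout: lowercase search term plus display name.
        ET.SubElement(dictionary, "entry", term=name.lower(), name=name)
    return dictionary


social = mesh_to_cmdict(SAMPLE_MESH, "social", branches=["I01"])
print(ET.tostring(social, encoding="unicode"))
```

Filtering on tree-number prefixes is what restricts the dictionary to a branch of the hierarchy, such as the social branch mentioned above.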
Fact extraction
The syntax to use the dictionary is supposed to be like:
./contentmine/ami/bin/cmine data/gp_globalhealth word\(search\)w.search:contentmine/dicts/desc2017-cmdict.xml
However, it outputs empty results. This bug is being discussed on Discourse.
It turned out the syntax was not only scarcely documented but also contained an error. The right syntax is:
./contentmine/ami/bin/cmine data/gp_globalhealth 'search(file:///srv/lisis-lab/devroot/home/ale/contentmine/dicts/desc2017-cmdict.xml)'
This is still processing here! =)
Interface
So, I'd love people to interface in the most useful way with the facts extracted. Some ideas...
- I recently met a doctoral student in computational linguistics who has an interesting system that infers agreement relationships between a text and a term and, through a search, displays relevant extracts exposing these relationships in a way people can interpret. I thought it would be cool to use that in this project, if it can, for example, highlight the agreement of papers on a disease, in a certain place, with some form of treatment. I've contacted him about this and am now waiting for a reply.
- He responded: the system is not directly usable here, but he shared some other tips.
- Had another idea: train a word embedding model and enrich abstracts with annotations over words present in MeSH, so one can navigate to abstracts that use the same word in a similar context. I believe this is feasible and that's likely what I'm going for.
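To make the "same word in a similar context" idea concrete, here is a toy Python sketch. It deliberately swaps the trained word embedding model for plain bag-of-words context vectors compared by cosine similarity, so it only illustrates the navigation step, not the model. The abstracts are invented and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def context_vector(text, term, window=3):
    """Count the words within `window` tokens of each occurrence of `term`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == term:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_by_context(abstracts, query_index, term):
    """Order the other abstracts by how similarly they use `term`."""
    query = context_vector(abstracts[query_index], term)
    scores = [
        (cosine(query, context_vector(text, term)), i)
        for i, text in enumerate(abstracts)
        if i != query_index
    ]
    return [i for _, i in sorted(scores, reverse=True)]


# Invented abstracts, for illustration only.
abstracts = [
    "Malaria treatment with bed nets reduced child mortality.",
    "Bed nets for malaria treatment lowered mortality in children.",
    "The history of malaria research funding in Europe.",
]
print(rank_by_context(abstracts, 0, "malaria"))
```

With a real embedding model, `context_vector` would be replaced by the model's representation of the term in each abstract, and the ranking step would stay the same.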
Suggestions for future fellowship programs
We were asked to give some directions for a continued fellowship program. Here are mine, in bullet form because I like guns.
- stimulate fellows to document their efforts - their journals, not their code - in a single collaborative environment such as right here, or keep them linked there;
- could be github as well, each fellow a folder in a single project tree, but my feeling is github is good for code, not so much to organize collaboration.
- challenge fellows to standardize how they document stuff; this will make them pay attention to how the others are setting up their environments, and also to what they've been up to;
- make sure the team responds to issues in place; I've often received an answer through chat to something I had posted on Discourse, and this gets in the way of immediate and future reference.
- better documentation to start with is very much needed; not only better content but also better organization. For each package: installation, usage, behaviour, features.
- use something non-proprietary for conference calls, like Jitsi or Riot, which are both web based. Riot could also substitute for Slack.