The dynamics and social organization of innovation in the field of oncology/Co-occurrence likelihoods

Our goal is to analyse the evolution of themes and terms in a corpus using word co-occurrence likelihoods. These are obtained, in regularized form, from a word embedding model.

Summary

Words occur in a corpus alongside other words. The patterns of co-occurrence between words can be encoded as probabilities. We can use these probabilities to quantify how much any text resembles the ensemble of texts from which the probabilities were derived, that is, the likelihood of the text given the corpus. For such a text, co-occurrences similar to those found in the corpus increase this likelihood, while co-occurrences less common in the corpus decrease it.

If we have a sequence of corpora, we can derive a sequence of corresponding co-occurrence probabilities. For a given text, we can then use said sequence to quantify how much that text resembles each of those corpora. Thus, for example, if the sequence is temporal, it will tell us how the co-occurrences found in the text became more or less common through time.

We can then extend this observation to an entire corpus, and talk about texts whose co-occurrences varied the most, grew the most, or declined the most along whatever that sequence represents.

At this moment we focus on sequences of years, and we try to highlight the texts whose co-occurrences as a whole varied the most over time. Then, for each of those texts, we identify the specific co-occurrences that made the highest contributions to the likelihood of the whole text.

Technical aspects

From corpora to co-occurrence matrices

Given any corpus, we can construct a co-occurrence matrix of the words in it. Once we have chosen the distance window within which words are considered to co-occur, the words within the window of a given word are called its context.

For example, suppose our sub-corpus is the phrase:

the black spotted dog is scared of the black cat

For a symmetric window of length 1, part of this matrix would look like:

	dog	cat	spotted	black
dog	0	0	1	0
cat	0	0	0	1

And for length 3:

	dog	cat	spotted	black
dog	0	0	1	1
cat	0	0	0	1

And for length {math}\infty{/math}:

	dog	cat	spotted	black
dog	0	1	1	2
cat	1	0	1	2
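The counts in the three tables above can be reproduced with a short Python sketch. The function name `cooccurrence_counts` and the convention that `window=None` stands for an infinite window are our own assumptions for illustration, not part of any library.

```python
from collections import Counter, defaultdict

def cooccurrence_counts(tokens, window=None):
    """Count, for every word, the words found within `window` positions of it.

    `window=None` stands for an unbounded (infinite) symmetric window.
    """
    counts = defaultdict(Counter)
    n = len(tokens)
    for i, word in enumerate(tokens):
        lo = 0 if window is None else max(0, i - window)
        hi = n if window is None else min(n, i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word occurrence is not part of its own context
                counts[word][tokens[j]] += 1
    return counts

phrase = "the black spotted dog is scared of the black cat".split()
columns = ("dog", "cat", "spotted", "black")
for window in (1, 3, None):
    counts = cooccurrence_counts(phrase, window)
    for row in ("dog", "cat"):
        print(window, row, [counts[row][c] for c in columns])
```

Each printed row corresponds to one row of the tables above, restricted to the same four columns.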

Words as vectors

Given matrices like these, we can calculate co-occurrence frequencies, which estimate the probability of finding a word in the context of another.

For example, for the word cat, normalizing its row of the infinite-window matrix so that it sums to 1, we get the co-occurrence vector:

cat
dog	cat	spotted	black
0.25	0	0.25	0.5
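As a minimal sketch, the normalization amounts to dividing each count in a row by the row total. The helper name `cooccurrence_vector` is hypothetical; the counts are those of the cat row in the infinite-window table.

```python
def cooccurrence_vector(row):
    """Turn a row of co-occurrence counts into relative frequencies."""
    total = sum(row.values())
    if total == 0:
        # A word that never co-occurs with anything has no defined frequencies.
        return {word: 0.0 for word in row}
    return {word: count / total for word, count in row.items()}

cat_row = {"dog": 1, "cat": 0, "spotted": 1, "black": 2}
cat_vector = cooccurrence_vector(cat_row)
print(cat_vector)
```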

Scores

To study a corpus, we might want to split it into a series of sub-corpora that carry a particular meaning, for example splitting it by year.

Assuming some co-occurrence window we care about, we can use the above frequencies to tell us how similar the co-occurrence frequencies in one document are to those of a whole sub-corpus.

We achieve that by composing, for each word in the document, the sub-corpus frequencies as probabilities of encountering it surrounded by its neighbours. This yields a measure of the compatibility between the distribution of words in the sub-corpus and the particular sequence of words in that document. Let's call this measure the score of a document in a sub-corpus.

With that, we can then compare the scores of different texts with respect to one sub-corpus, and we can also compare the scores of the same text across different sub-corpora. For example, we can ask whether a text scores best in its own sub-corpus.
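One way to make this score concrete, under our own assumptions, is to sum the logarithms of the sub-corpus frequencies over every word-context pair of the document. The function names `context_frequencies` and `score` are hypothetical, and giving unseen pairs a probability of zero (hence a score of minus infinity) is a simplification of this sketch.

```python
import math
from collections import Counter, defaultdict

def context_frequencies(tokens, window):
    """Relative frequency of each context word around each word of a sub-corpus."""
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                counts[word][tokens[j]] += 1
    return {
        word: {ctx: k / sum(row.values()) for ctx, k in row.items()}
        for word, row in counts.items()
    }

def score(document, freqs, window):
    """Sum of log-frequencies over all word-context pairs of the document."""
    total = 0.0
    for i, word in enumerate(document):
        for j in range(max(0, i - window), min(len(document), i + window + 1)):
            if j == i:
                continue
            p = freqs.get(word, {}).get(document[j], 0.0)
            # Pairs never seen in the sub-corpus have probability zero.
            total += math.log(p) if p > 0 else -math.inf
    return total

sub_corpus = "the black spotted dog is scared of the black cat".split()
freqs = context_frequencies(sub_corpus, window=1)
print(score("the black cat".split(), freqs, window=1))
print(score("the spotted cat".split(), freqs, window=1))
```

The second document contains pairs that never occur in the sub-corpus, so its score collapses to minus infinity.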

Practical problems

However, a number of difficulties present themselves in these tasks:

1. Co-occurrence matrices are too big and sparse

They occupy the square of the size of the vocabulary, and are full of zeroes since most combinations of words never appear together.

As a consequence, they are hard to handle computationally.

2. Co-occurrence matrices lack any regularization

Any combination of word and context that is not explicitly present in a corpus is not taken into account, even when it would be sensible to attribute it some value given its semantic similarity with other combinations.

For example, let's say we add to our previous example the sentence below.

A cat is a pet and a dog is a pet

Now, the co-occurrence matrix would associate 'pet' and 'cat', and 'pet' and 'dog'. But it still would not associate 'pet' with 'black', even though both 'cat' and 'dog' are associated with 'black'.

Two sentences, infinite window
	dog	cat	spotted	black	pet
dog	0	2	1	2	2
cat	2	0	1	2	2
pet	2	2	0	0	2

This way, our score would find a text saying "black pet" to be completely exotic, even though that information is semantically present in the corpus.

This problem goes away if one has a very large and representative corpus, but that is often not the case.

3. Comparing scores from different sub-corpora is not straightforward

Several factors can make the numerical values of the scores (log-likelihoods) hard to compare. For example, models may be trained with different parameters or amounts of data, which biases their scores in different ways.

Context-occurrence

Above we represent the relationship between a word and its context as counts of that word paired with each word in the context. We can also represent the context as a single entity formed of several words. This preserves more of the structure of the text, but drastically increases the dimension of the matrix, since one additional column would be needed for each new combination of words.

Dimensional reduction and regularization

To address issues 1 and 2, we resort to a word embedding algorithm called Word2vec.

It works by reducing the dimension of the co-occurrence matrix in such a way that probabilities can still be recovered from it, and that these probabilities are regularized in a fashion consistent with semantics, provided one accepts the distributional hypothesis.

Distributional hypothesis

Semantically similar words have similar co-occurrence frequencies, that is, similar co-occurrence vectors.

Word embedding

It consists of representing each word by a vector of smaller dimension than its co-occurrence vector. One intuition for why this can work: since the co-occurrence matrix is sparse, most co-occurrence vectors are full of zeroes, meaning their effective dimensionality is much lower. They can therefore be compressed into a smaller number of dimensions, keeping the most informative ones and discarding the rest.
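To illustrate the compression intuition only, the sketch below reduces a co-occurrence matrix with a truncated singular value decomposition. This is a swapped-in technique chosen for brevity, not the algorithm we use; it assumes NumPy is available, and the function names are hypothetical.

```python
import numpy as np

def cooccurrence_matrix(tokens, window):
    """Square matrix of co-occurrence counts over the vocabulary of `tokens`."""
    vocab = sorted(set(tokens))
    index = {word: k for k, word in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(vocab)))
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                matrix[index[word], index[tokens[j]]] += 1
    return vocab, matrix

def truncated_svd(matrix, dim):
    """Keep only the `dim` strongest directions of the matrix."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    word_vectors = u[:, :dim] * s[:dim]   # one dim-dimensional vector per word
    context_vectors = vt[:dim, :].T       # one dim-dimensional vector per context word
    return word_vectors, context_vectors

tokens = "the black spotted dog is scared of the black cat".split()
vocab, matrix = cooccurrence_matrix(tokens, window=1)
word_vectors, context_vectors = truncated_svd(matrix, dim=3)

# Scalar products of the short vectors approximate the original counts.
approximation = word_vectors @ context_vectors.T
print(word_vectors.shape, np.linalg.norm(matrix - approximation))
```

Each of the 8 words is now described by 3 numbers instead of 8, at the cost of an approximation error that shrinks as `dim` grows.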

Word2vec (simplified description)

Train a neural network to choose the low-dimensional vectors. It learns by sampling pairs of words and adjusting their vectors so that the vectors' scalar product better approximates the logarithm of the co-occurrence frequency of the two words.

We call the resulting word vectors a model for the original co-occurrence matrix, and we can recover approximate co-occurrence probabilities from their scalar products. Also, because the algorithm may discard some infrequent words to get better results, we call the words retained by the model its vocabulary.
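As a hedged illustration of how scalar products can be turned back into probabilities, the sketch below exponentiates and normalizes them (a softmax over the candidate context words). The vectors are made-up numbers, not the output of a trained model, and Word2vec implementations avoid computing this full normalization for efficiency.

```python
import math

def context_probabilities(word_vector, context_vectors):
    """Softmax of the scalar products between a word and each candidate context."""
    products = {
        ctx: sum(a * b for a, b in zip(word_vector, vec))
        for ctx, vec in context_vectors.items()
    }
    # Subtract the maximum before exponentiating, for numerical stability.
    top = max(products.values())
    weights = {ctx: math.exp(x - top) for ctx, x in products.items()}
    norm = sum(weights.values())
    return {ctx: w / norm for ctx, w in weights.items()}

# Hypothetical two-dimensional vectors, for illustration only.
pet = [1.0, 0.5]
contexts = {
    "black": [0.8, 0.6],
    "spotted": [0.2, 0.1],
    "scared": [-0.5, 0.3],
}
probabilities = context_probabilities(pet, contexts)
print(probabilities)
```

Every context word receives a non-zero probability, including pairs that were never observed together, which is the regularization effect we are after.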

Conveniently, the Word2vec implementation in Gensim, a Python library, contains a score function that calculates the equivalent of the score discussed above.

Score issues: the score calculated by Gensim is the logarithm of the likelihood of the text under the model. However, it has a text-length bias, because it simply adds the log-likelihoods of each word-context pair (equivalent to multiplying the likelihoods). To get a length-neutral score, we divide it by the number of words in the scored text that belong to the model's vocabulary. That is, we use the average log-likelihood.
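The length normalization can be sketched as follows. The helper name `average_log_likelihood` is our own; `model` is assumed to be a Gensim 4 `Word2Vec` instance trained with hierarchical softmax (`hs=1, negative=0`), whose `score` method and `wv.key_to_index` mapping are the only attributes the sketch relies on.

```python
def average_log_likelihood(model, tokens):
    """Log-likelihood of a tokenized text divided by its number of in-vocabulary words.

    Assumes `model.score(sentences, total_sentences=...)` returns one
    log-likelihood per sentence, as Gensim's Word2Vec.score does.
    """
    known = [token for token in tokens if token in model.wv.key_to_index]
    if not known:
        # No word of the text is known to the model: the average is undefined.
        return float("nan")
    log_likelihood = model.score([tokens], total_sentences=1)[0]
    return float(log_likelihood) / len(known)
```

With one model per sub-corpus, calling this helper on the same text with each model gives the sequence of scores that we compare across sub-corpora.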

Word2vec (technical aspects)

We use the continuous bag-of-words (CBOW) model. It supposedly performs better for infrequent words, such as the words we might be interested in when looking at the breakout of innovations. [needs reference]

We use the hierarchical softmax scheme, both because it is the one for which the score function is implemented and because it seems to be the one that best maximizes the probabilities we are interested in, at the expense of performance in the word-analogy tasks Word2vec is famous for. [needs reference]
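In Gensim 4, these two choices correspond to the `sg`, `hs` and `negative` arguments of `Word2Vec`. The sketch below collects them in a dictionary; the values of `vector_size`, `window` and `min_count` are placeholders for illustration, not the settings we actually use.

```python
# Keyword arguments for gensim.models.Word2Vec (Gensim 4 parameter names).
WORD2VEC_PARAMS = {
    "sg": 0,             # 0 selects continuous bag of words, 1 would select skip-gram
    "hs": 1,             # train with hierarchical softmax
    "negative": 0,       # disable negative sampling, so only hierarchical softmax is used
    "vector_size": 100,  # placeholder: dimension of the word vectors
    "window": 5,         # placeholder: co-occurrence window
    "min_count": 5,      # placeholder: rarer words are dropped from the vocabulary
}

# Usage, assuming Gensim is installed and `tokenized_texts` is a list of token lists:
#     from gensim.models import Word2Vec
#     model = Word2Vec(sentences=tokenized_texts, **WORD2VEC_PARAMS)
```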

Balance corpora for over-fitting and over-ranking

Neural networks are susceptible to over-fitting, that is, to finding a result that fits the training examples too specifically. Regularization procedures are usually applied to mitigate this. In our case, the dimensionality reduction should additionally help avoid over-fitting. But, given corpora that are not sufficiently large, one may still end up with a solution that is much more sensitive to the specific examples provided. This should create a bias where models of smaller corpora display stronger specificity.

To account for that, we train the models on equally sized training sets: every sub-corpus larger than the smallest one is reduced to a random sample of the smallest one's size.
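A minimal sketch of this balancing step, assuming each sub-corpus is held as a list of texts keyed by year. The function name `balanced_samples` and the fixed seed are our own choices for illustration.

```python
import random

def balanced_samples(sub_corpora, seed=0):
    """Reduce every sub-corpus to the size of the smallest one by random sampling."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    size = min(len(texts) for texts in sub_corpora.values())
    return {
        key: list(texts) if len(texts) == size else rng.sample(texts, size)
        for key, texts in sub_corpora.items()
    }

# Hypothetical sub-corpora of different sizes.
sub_corpora = {
    1990: ["text a", "text b"],
    1991: ["text c", "text d", "text e", "text f"],
    1992: ["text g", "text h", "text i"],
}
training_sets = balanced_samples(sub_corpora)
print({year: len(texts) for year, texts in training_sets.items()})
```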

Rank comparison

~~To solve problem 3, we resort to a mathematical trick to recover some comparison information from incompatible measures.~~

~~For each measure - i.e. model, i.e. sub-corpus - we rank all objects with it and substitute the measure by this rank order.~~

Furthermore, when ranking texts from sub-corpora of different sizes, one introduces a filling bias. When the texts of a larger sub-corpus tend to rank in one direction, they push other texts much further down the rank than those of a smaller sub-corpus would. To account for this, we weight how much a text increases the rank, making it inversely proportional to the number of texts in its sub-corpus.
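One possible reading of this weighting, offered only as a sketch: when texts are ordered by score, each text advances the rank counter by one over the size of its sub-corpus instead of by one. The function name `weighted_ranks` is hypothetical, and ties are not treated specially.

```python
from collections import Counter

def weighted_ranks(scores, sub_corpus_of):
    """Rank texts by decreasing score, each text weighing 1 / (size of its sub-corpus)."""
    sizes = Counter(sub_corpus_of[text] for text in scores)
    ranks, position = {}, 0.0
    for text in sorted(scores, key=scores.get, reverse=True):
        ranks[text] = position
        position += 1.0 / sizes[sub_corpus_of[text]]
    return ranks

# Hypothetical scores: four texts from sub-corpus A outrank one text from B.
scores = {"a1": -1.0, "a2": -1.5, "a3": -2.0, "a4": -2.5, "b1": -3.0}
sub_corpus_of = {"a1": "A", "a2": "A", "a3": "A", "a4": "A", "b1": "B"}
ranks = weighted_ranks(scores, sub_corpus_of)
print(ranks)
```

In this example the text from B sits four places down in a plain ranking, but only one unit down in the weighted one, since the four texts above it jointly count as a single sub-corpus.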

Thematic diversity and specificity, border effects

Another issue that may come up is diversity. For example, if our models come from increasingly diverse corpora, the earlier models will be more specific not only to their own texts, but also to the fractions of the other sub-corpora that share their style, yielding higher scores.

Then, there may be border effects: if the sub-corpora represent a window of an evolutionary process, there are fractions of each sub-corpus that might be reminiscent or predictive of times outside the window, and these will thus tend to score higher at the extremities.

Word-level scores

Since scores are the compositions of probabilities of word-context pairs in a text, we can also look at the scores for individual word-context pairs. These can be used, for example, to understand what parts of a text contributed more significantly to its overall score.
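As a sketch of this decomposition, the function below lists the log-probability of every word-context pair of a document, sorted so that the pairs lowering the score the most come first. The name `pair_contributions` is hypothetical, and the frequencies are those obtained from the example phrase "the black spotted dog is scared of the black cat" with a window of 1, restricted to the words used here.

```python
import math

def pair_contributions(document, freqs, window):
    """Log-probability of each word-context pair, most negative first."""
    pairs = []
    for i, word in enumerate(document):
        for j in range(max(0, i - window), min(len(document), i + window + 1)):
            if j != i:
                p = freqs.get(word, {}).get(document[j], 0.0)
                pairs.append((word, document[j], math.log(p) if p > 0 else -math.inf))
    return sorted(pairs, key=lambda pair: pair[2])

# Context frequencies for a window of 1 in the example phrase.
freqs = {
    "the": {"black": 2 / 3, "of": 1 / 3},
    "black": {"the": 0.5, "spotted": 0.25, "cat": 0.25},
    "cat": {"black": 1.0},
}
contributions = pair_contributions("the black cat".split(), freqs, window=1)
for word, context, log_p in contributions:
    print(word, context, round(log_p, 3))
```

Summing the third column recovers the score of the whole text, so each entry can be read as that pair's share of it.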