Talk:Preprint/Molecular source attribution

Review by Bethany Dearlove

This is a comprehensive article, giving a useful introduction to source attribution using genetic sequence data. I was pleased to see that the article covers a number of challenges and limitations with source attribution methods, and I thank the authors for contributing this.

Major comments

The focus of the article is clearly on human-to-human transmission, though I understand source attribution to be a little wider than this, including water- or food-borne outbreaks and zoonoses (the Campylobacter example used later on is a nice example of this, attributing human infections to various host species – human-to-human infection of Campylobacter being rare). Therefore, it could be seen as “who or what infected whom”. It may also be worth adding a couple of (general) examples to give more context to what a “source population, individual or location” might be. Why is it interesting/important to infer the source? How might this information be used?

There is a lot of technical detail in this page, which I think is understandable given the cross-disciplinary nature of the topic, but in places it would help to add signposting for the reader as to how this relates back to source attribution. For example, before going too far into Microbial subtyping, it would be helpful to lay out for the reader that for source attribution, you need two things: some information about the sample of interest (e.g. subtype), and the same information from the possible sources of infection for comparison. This would then naturally lead into 1) types of data, and 2) ways to compare them. The Phylogenetic methods section could also lean more heavily on the pages for ‘Phylogenetic tree’ and ‘Computational Phylogenetics’ (the latter of which gives much more detail on how to reconstruct phylogenies and node support methods, https://en.wikipedia.org/wiki/Computational_phylogenetics), to help keep more of the focus on source attribution.

What about non-genetic methods of source attribution, e.g. case-control studies for food-borne outbreaks? This might help lay out the fundamentals for a general audience, both in grouping cases together (what food did all the cases from an event all eat?), and methods (e.g. Bayes theorem using the food example).

Does ‘Whole genome sequencing’ belong under ‘Single and multi-locus typing’? I think it may be better in its own section, as there seems to be a disconnect here with what follows about genetic clustering. The advantage of WGS is the increased resolution, which I think is worth stating explicitly (again, signposting back to why is this relevant for source attribution).

I wonder whether much of the ‘Genetic Clustering’ section might be better placed with the ‘Phylogenetic methods’ section, given it features phylogenetic trees before they have been fully introduced. This might help better delineate the terms ‘cluster’ and ‘subtype’, which I think come across as a little muddled.

‘It is generally easier to reconstruct a phylogenetic tree from genetic sequences than to reconstruct the transmission tree from other sources of information, such as contact tracing.’ – I’m not sure ‘easier’ is right here, surely this depends on the data/expertise you have available?

‘If there is little to no evolution has occurred among these infections (in other words, if they are genetically almost identical) then it is difficult to reconstruct the order that different lineages descended from their common ancestors.’ Should this be observed evolution, given that the example that follows suggests that there was evolution when using the increased resolution of genomes versus MLST?

‘Conversely, if there has been too much evolution because we are working with a transmission tree on an extremely long time scale, then the genetic similarity that implies common ancestry will have eroded.’ I think this is a little unclear. Is it trying to say that if the rate of transmission is sufficiently slow relative to the evolutionary rate, too much mutation may have obscured the common ancestry?

It might be worth mentioning the effects of recombination on source attribution in the section on secondary infection – both how it might affect tree inference, and what it means for the transmission chains.

The discrete state migration models can be applied to the population-level attribution to host-species for zoonotic/food-borne pathogens, as well as individuals, which might be worth adding if you do expand the definition to include those.

The ‘Phylodynamic methods’ needs a more focused introduction to how it helps with source attribution (I’d suggest it’s more commonly used for population-level transmission dynamics/epidemiological processes). This section also re-introduces the coalescent, which makes the distinction between phylogenetics and phylodynamics less clear.

‘Continuous-state models could be employed for source attribution at the level of geographic regions, especially if precise geolocation data were available; however, we have not yet found such an application of ancestral reconstruction for source attribution in the literature.’ – Dellicour et al (2018; Nature Communications https://www.nature.com/articles/s41467-018-03763-2) fitted continuous diffusion models for phylogeography, though admittedly they didn’t use it directly for source attribution, but rather for evaluating potential barriers to transmission.

Some of the comments relating to Phylogenetic uncertainty are not just limited to phylodynamic approaches (alignment uncertainty; sequencing error). Might these be better included under a general ‘Limitations of genetic data’ heading? (Recombination might also be a good fit here, as might accurately rooting the tree).

Minor comments

• Inconsistency in quotes for the coin toss of ‘heads’ or “heads”.
• Molecular epidemiology and Phylodynamics can have a wikilink.
• Under ‘Transmission clustering’, there’s a broken link for ‘Forensic applications of phylogenetic clustering’.
• Under ‘Bayesian inference’, ‘paramters’ should be ‘parameters’.
• Under Time scales, ‘If there is little to no evolution has occurred among these infections’ should be ‘If little to no evolution has occurred among these infections’.
• Under Time scales, ‘For example, this limitation is the driving force behind the growing adoption of whole-genome sequencing for the molecular epidemiology and source attribution[6] of the bacterium tuberculosis, which makes it possible to distinguish between infections that would otherwise be assigned to the same subtype according to the standard multi-locus genotyping method[7] that targets only 24 loci of the M. tuberculosis genome, which comprises roughly 4.3 million nucleotides encoding over 4,000 genes.’ – this might be clearer rephrased or split into two sentences.
• Under Phylogenetic uncertainty, ‘greatly expanded model space also makes convergence more challenging attain’ should be ‘greatly expanded model space also makes convergence more challenging _to_ attain’.


Review by Matthew Hall

The page is a quite thorough and comprehensive introduction to this subject, covering the main bases well. I appreciate that the ethical considerations are taken into account and the command of the literature is generally good. I do feel that it could use some reorganisation.

I have four major recommendations. Firstly, be clear exactly what you are talking about! Even the first sentence is a little clumsy - the words “from a source population, individual or location, to subsequent hosts” do not give the impression that the source entities may be of the same nature as the second (i.e. “hosts”). I am not sure what this page intends to define as falling within the category “source attribution”. Is e.g. phylogeography source attribution? Is the identification of reservoir species for zoonoses? If so, then there is plenty of relevant material which is not covered here.

Secondly, although the stated focus of the page is source attribution in general, it does tend to focus mostly on genetic methods. This is not surprising given the state of the art, but it fails to flow well. The introduction is on the general topic, but the page then introduces microbial subtyping before returning to non-genetic methods in the next section (‘"Dutch"/Hald models and Bayesian inference’). I would recommend switching the order of these, and making the genetics section as self-contained as possible.

Thirdly, if the authors wish to introduce concepts such as Bayesian inference and coalescent theory to a general audience, I would recommend relying much less on dense text blocks. A picture tells a thousand words, and for example, Bayes’ theorem is not even stated. The coalescent is also effectively introduced twice. There is a tendency to leap from very basic explanations to jargon that would be unfamiliar to lay readers, for example going from explaining the concept of a prior distribution to talking about uniform priors without really defining them.

Finally, I would recommend a full section describing the differences between viruses and bacteria (and potentially other types of pathogen) that are relevant to this topic.

Overall, though, I like this page, and think it forms the basis for a very sound introduction to the topic.

More specific comments:

“Encountering too many types on the population, however, makes it likely that every individual carries a unique subtype.” I understand what is meant here but it needs rewording.

Subtyping assigns samples to subtypes, not types.

It’s not clear whether the authors regard a distinct genome as a “subtype” and whether subtyping and WGS are being treated as fundamentally different procedures or not. Under “whole-genome sequencing” it rather suggests that they are not, but under “defining subtypes” the impression is the opposite. It would be a little odd to treat a unique full genome sequence as a subtype.

Related, I find the suggestion that the primary reason that WGS clustering algorithms are used is that there is too much resolution in the sequences to be rather clumsy. More resolution really cannot be a bad thing and it would anyway be trivially easy to decrease it were that true. A cluster is not a subtype. (E.g. the sampling of new isolates cannot ever join two subtypes up.) This leads the text to implicitly suggest that the HA/NA influenza classification is somehow a procedure of a similar nature to clustering WGS influenza sequences by similarity. Clustering is rather an answer to the question of “what can we do to link samples together, given that we have high resolution?”

Phylogenetics is introduced suddenly in the “defining subtypes” section without definition, and is then introduced later.

I do not think that what is meant by “mass” in the Hald section would be clear to the layman.

I also do not think the hiking metaphor is very good, since it suggests that the goal of the hike is to reach a relative plateau where the hikers, presumably, want to keep walking forever. This is not intuitive. Nor is the idea that “the amount of time the hikers spend in a particular area will be proportional to the overall elevation” intuitive. I am not sure this page is the right place for a basic introduction to MCMC in any case.

The first paragraph in “Inferring transmission history from the phylogeny” contradicts itself, by first stating that internal nodes are treated as transmissions, and then saying that it is a common mistake to do this. (It is not really a mistake, but an assumption.)

The paragraph beginning “The amount of time we have to follow two or more lineages back in time until we encounter a common ancestor” lacks clarity. As above, if this is to be an introduction for the layman, it needs more work.

The more sophisticated versions of the CTMC discrete-state models commonly used in phylogeography do not work by ancestral state reconstruction, but rather by estimation of rates of lineage movement. See e.g. Lemey et al 2009.

On the other hand, Phyloscanner does not use a CTMC model of character evolution – it is parsimony on a fixed tree.

Finally on that subject, and as alluded to before, it is not really clear whether this page regards phylogeography as a source attribution method or not.

“A group of infections is paraphyletic if they are related by a common ancestor that also has one or more members that are not assigned to this group.” Reword this – it suggests that common ancestors have members.

Modern bootstrapping uses approximations that make it less time-intensive than simply re-running the analysis many times.

The phylodynamics section also lacks clarity. Ultimately the most commonly-used phylodynamic methods of both types (backwards-time and forwards-time) are not source attribution methods and are concerned with reconstructing overall dynamics. You cannot in general use a skyline plot or standard birth-death methods to do source attribution. I do not feel this really comes across at all.

The COVID case counts from March should be updated.

Response to reviews

On behalf of my co-authors, I would like to thank the reviewers for their thorough and insightful evaluation of our article, and the editors for granting us the opportunity to address the issues raised by the reviewers with a revised version. Here, we provide a point-by-point response to the reviews, using blockquote elements to itemize each comment. Also, a summary of differences between the current version and the original submission can be reviewed at this history page.

Response to review by Bethany Dearlove

This is a comprehensive article, giving a useful introduction to source attribution using genetic sequence data. I was pleased to see that the article covers a number of challenges and limitations with source attribution methods, and I thank the authors for contributing this.

Thank you for this encouraging feedback!

Major comments

The focus of the article is clearly on human-to-human transmission, though I understand source attribution to be a little wider than this, including water- or food-borne outbreaks and zoonoses (the Campylobacter example used later on is a nice example of this, attributing human infections to various host species – human-to-human infection of Campylobacter being rare). Therefore, it could be seen as “who or what infected whom”. It may also be worth adding a couple of (general) examples to give more context to what a “source population, individual or location” might be. Why is it interesting/important to infer the source? How might this information be used?

Our objective was to provide a publicly accessible overview of source attribution, particularly for communities who carry a greater burden of stigmatization or criminal prosecution for the transmission of infectious disease. Thus, we have tried to describe the technical challenges of accurate attribution, as well as its ethical issues and controversies. Nevertheless, we have taken the reviewer's suggestion by clarifying the different ways that source attribution can be applied to populations, individuals or locations; and added some examples for each scenario. These changes were mostly applied to the introductory section.

There is a lot of technical detail in this page, which I think is understandable given the cross-disciplinary nature of the topic, but in places it would help to add signposting for the reader as to how this relates back to source attribution. For example, before going too far into Microbial subtyping, it would be helpful to lay out for the reader that for source attribution, you need two things: some information about the sample of interest (e.g. subtype), and the same information from the possible sources of infection for comparison. This would then naturally lead into 1) types of data, and 2) ways to compare them.

We have expanded the introductory section to more clearly explain the role of subtyping for source attribution. In addition, we have substantially reduced the technical detail in parts of the article (e.g., Bayesian inference) based on feedback from both reviewers.

The Phylogenetic methods section could also lean more heavily on the pages for ‘Phylogenetic tree’ and ‘Computational Phylogenetics’ (the latter of which gives much more detail on how to reconstruct phylogenies and node support methods, https://en.wikipedia.org/wiki/Computational_phylogenetics), to help keep more of the focus on source attribution.

We have followed the reviewer's advice and removed much of this section to focus more on source attribution.

What about non-genetic methods of source attribution, e.g. case-control studies for food-borne outbreaks? This might help lay out the fundamentals for a general audience, both in grouping cases together (what food did all the cases from an event all eat?), and methods (e.g. Bayes theorem using the food example).

Thank you for pointing this out. Although these are important topics, incorporating non-genetic methods of source attribution would greatly expand the scope of this article. Hence, we decided to modify the title of the article to "Molecular source attribution", and added a paragraph to the introduction section to clarify that we are talking about only a subset of source attribution methods using molecular data.

Does ‘Whole genome sequencing’ belong under ‘Single and multi-locus typing’? I think it may be better in its own section, as there seems to be a disconnect here with what follows about genetic clustering. The advantage of WGS is the increased resolution, which I think is worth stating explicitly (again, signposting back to why is this relevant for source attribution).

We agree that "Whole genome sequencing" should not be subsumed under "Single and multi-locus typing", and relocated this section accordingly.

I wonder whether much of the ‘Genetic Clustering’ section might be better placed with the ‘Phylogenetic methods’ section, given it features phylogenetic trees before they have been fully introduced. This might help better delineate the terms ‘cluster’ and ‘subtype’, which I think come across as a little muddled.

The problem is that subtypes are often defined from genetic clusters - for example, consider the hepatitis C virus genotypes and subtypes, and the HIV-1 subtypes. Thus, we maintain that subtypes are always clusters (just not necessarily derived from genetic information, e.g., serotypes), but clusters are not necessarily subtypes. We have tried to clarify the role of genetic clustering in microbial subtyping by adding a figure and explanatory text.

‘It is generally easier to reconstruct a phylogenetic tree from genetic sequences than to reconstruct the transmission tree from other sources of information, such as contact tracing.’ – I’m not sure ‘easier’ is right here, surely this depends on the data/expertise you have available?

Fair point. We have revised this line as follows: "In conjunction with reconstructing the transmission tree from other sources of information, such as contact tracing, reconstructing a phylogenetic tree can serve as a useful, additional information source especially when genetic sequences are already available."

‘If there is little to no evolution has occurred among these infections (in other words, if they are genetically almost identical) then it is difficult to reconstruct the order that different lineages descended from their common ancestors.’ Should this be observed evolution, given that the example that follows suggests that there was evolution when using the increased resolution of genomes versus MLST?

Thank you for pointing this out. We have revised this sentence as follows: "If little to no evolution has occurred among these infections (i.e., they are almost genetically identical) or if the existing divergence is not captured due to incomplete sequencing of the respective genomes, then it is difficult to reconstruct the order that different lineages descended from their common ancestors. For example, this limitation is the driving force behind the growing adoption of whole-genome sequencing for the molecular epidemiology and source attribution[6] of the bacterium tuberculosis."

‘Conversely, if there has been too much evolution because we are working with a transmission tree on an extremely long time scale, then the genetic similarity that implies common ancestry will have eroded.’ I think this is a little unclear. Is it trying to say that if the rate of transmission is sufficiently slow relative to the evolutionary rate, too much mutation may have obscured the common ancestry?

We apologize for the confusing language. We were trying to point out that internal branches become more difficult to resolve when they are deeper in the tree. During our revisions, we realized that this point relates to phylogenetic uncertainty in general, so we removed this paragraph altogether and integrated the material in a more concise way in the preceding section on uncertainty.

It might be worth mentioning the effects of recombination on source attribution in the section on secondary infection – both how it might affect tree inference, and what it means for the transmission chains.

This is a good point. We have added the following paragraph to the section on Phylogenetic uncertainty: "Genetic recombination is the exchange of genetic material between individual genomes. For pathogens, recombination can occur when a cell is infected by multiple copies of the pathogen. If some hosts were infected multiple times by two or more divergent variants from different sources (i.e., superinfection), then recombination can produce mosaic genomes that complicate the reconstruction of an accurate phylogeny[21]. In other words, different segments of a recombinant genome may be related to other genomes through discordant phylogenies in such a way that cannot be accurately represented by a single tree. In practice, it is common to screen for recombinant sequences and discard them before reconstructing a phylogeny from an alignment that is assumed to be free of recombination [22]."

The discrete state migration models can be applied to the population-level attribution to host-species for zoonotic/food-borne pathogens, as well as individuals, which might be worth adding if you do expand the definition to include those.

Thanks for pointing this out - we have incorporated these points into the section on Ancestral host-state reconstruction.

The ‘Phylodynamic methods’ needs a more focused introduction to how it helps with source attribution (I’d suggest it’s more commonly used for population-level transmission dynamics/epidemiological processes). This section also re-introduces the coalescent, which makes the distinction between phylogenetics and phylodynamics less clear.

We apologize - the coalescent indeed appeared in multiple parts of the original text. We have greatly reduced the redundant text, and clarified the distinction between the two applications of the coalescent in phylodynamics (namely to model common ancestry within hosts, and transmission among hosts).

‘Continuous-state models could be employed for source attribution at the level of geographic regions, especially if precise geolocation data were available; however, we have not yet found such an application of ancestral reconstruction for source attribution in the literature.’ – Dellicour et al (2018; Nature Communications https://www.nature.com/articles/s41467-018-03763-2) fitted continuous diffusion models for phylogeography, though admittedly they didn’t use it directly for source attribution, but rather for evaluating potential barriers to transmission.

Thank you for pointing out the Dellicour paper - we have amended this statement accordingly.

Some of the comments relating to Phylogenetic uncertainty are not just limited to phylodynamic approaches (alignment uncertainty; sequencing error). Might these be better included under a general ‘Limitations of genetic data’ heading? (Recombination might also be a good fit here, as might accurately rooting the tree).

As suggested, we relocated these topics to the section on phylogenetic uncertainty, and added a paragraph on recombination.

Minor comments

• Inconsistency in quotes for the coin toss of ‘heads’ or “heads”.

Fixed, thank you.

• Molecular epidemiology and Phylodynamics can have a wikilink

Fixed, thank you.

• Under ‘Transmission clustering’, there’s a broken link for ‘Forensic applications of phylogenetic clustering’.

This link now functions properly.

• Under ‘Bayesian inference’, ‘paramters’ should be ‘parameters’

Typo has been corrected, thank you.

• Under Time scales, ‘If there is little to no evolution has occurred among these infections’ should be ‘If little to no evolution has occurred among these infections’.

Fixed, thank you.

• Under Time scale, ‘For example, this limitation is the driving force behind the growing adoption of whole-genome sequencing for the molecular epidemiology and source attribution[6] of the bacterium tuberculosis, which makes it possible to distinguish between infections that would otherwise be assigned to the same subtype according to the standard multi-locus genotyping method[7] that targets only 24 loci of the M. tuberculosis genome, which comprises roughly 4.3 million nucleotides encoding over 4,000 genes.’ – this might be clearer rephrased or split into two sentences.

As suggested, we split this text into two sentences. We also moved it into the section on whole-genome sequencing, since the original text was somewhat redundant.

• Under Phylogenetic uncertainty, ‘greatly expanded model space also makes convergence more challenging attain’ should be ‘greatly expanded model space also makes convergence more challenging _to_ attain’

Fixed, thanks.

Response to review by Matthew Hall

The page is a quite thorough and comprehensive introduction to this subject, covering the main bases well. I appreciate that the ethical considerations are taken into account and the command of the literature is generally good. I do feel that it could use some reorganisation.

Thank you for the diligent feedback. As suggested, we have extensively reorganized the article.

I have four major recommendations. Firstly, be clear exactly what you are talking about! Even the first sentence is a little clumsy - the words “from a source population, individual or location, to subsequent hosts” do not give the impression that the source entities may be of the same nature as the second (i.e. “hosts”). I am not sure what this page intends to define as falling within the category “source attribution”. Is e.g. phylogeography source attribution? Is the identification of reservoir species for zoonoses? If so, then there is plenty of relevant material which is not covered here.

We have revised the opening paragraph. In particular, we have clarified that applications of phylogeography for source attribution do exist, and have cited some examples.

Secondly, although the stated focus of the page is source attribution in general, it does tend to focus mostly on genetic methods. This is not surprising given the state of the art, but it fails to flow well. The introduction is on the general topic, but the page then introduces microbial subtyping before returning to non-genetic methods in the next section (‘"Dutch"/Hald models and Bayesian inference’). I would recommend switching the order of these, and making the genetics section as self-contained as possible.

This is a good point. We have renamed the article to "Molecular source attribution" and added a paragraph to the introduction section explicitly recognizing that our focus is on source attribution using molecular (genetic) data. We also acknowledge that Hald models can be applied to non-genetic subtypes (e.g., resistance phenotypes).

Thirdly, if the authors wish to introduce concepts such as Bayesian inference and coalescent theory to a general audience, I would recommend relying much less on dense text blocks. A picture tells a thousand words, and for example, Bayes’ theorem is not even stated. The coalescent is also effectively introduced twice. There is a tendency to leap from very basic explanations to jargon that would be unfamiliar to lay readers, for example going from explaining the concept of a prior distribution to talking about uniform priors without really defining them.

We decided to greatly reduce the amount of text describing Bayesian inference and coalescent theory, by focusing on the basic explanations and relying on references to the literature and Wikipedia for more detailed background. Also, we removed the redundant text on the coalescent, and provide an explanation on uniform priors in the context of the Hald model.
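For reference, the theorem mentioned in the comment above can be stated as follows. The symbols are illustrative notation, not taken from the article: \theta stands for the model parameters and D for the observed data.

```latex
P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}
```

Here P(\theta) is the prior, P(D \mid \theta) is the likelihood, and P(\theta \mid D) is the posterior. A uniform prior assigns the same P(\theta) to every admissible value of \theta, so that the posterior is proportional to the likelihood.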

Finally, I would recommend a full section describing the differences between viruses and bacteria (and potentially other types of pathogen) that are relevant to this topic.

As requested in another comment, we elaborated on the differences between viruses and bacteria in the context of whole genome sequencing (also see "Time scales"). Because the article is already quite long, covering several topics, we are reluctant to add a full section to further elaborate on the distinction between viruses and bacteria.

Overall, though, I like this page, and think it forms the basis for a very sound introduction to the topic.

Thank you for the encouraging feedback!

More specific comments:

“Encountering too many types on the population, however, makes it likely that every individual carries a unique subtype.” I understand what is meant here but it needs rewording.

We revised this statement and added a figure illustrating the issues that arise when every sampled pathogen carries a unique sequence.

Subtyping assigns samples to subtypes, not types.

We apologize - we have replaced all instances of "type" with "subtype".

It’s not clear whether the authors regard a distinct genome as a “subtype” and whether subtyping and WGS are being treated as fundamentally different procedures or not. Under “whole-genome sequencing” it rather suggests that they are not, but under “defining subtypes” the impression is the opposite. It would be a little odd to treat a unique full genome sequence as a subtype.

We have added a figure (Figure 1) to try to clarify the issue of whether a distinct genotype (single or multi-locus) is handled as a unique subtype, or if clustering is used to group similar genomes together. Although it would indeed be unusual in many cases to "treat a unique full genome sequence as a subtype", some studies do exactly that - for example, TM Walker et al. (2013, Lancet Inf Dis 13: 137) present a network in which each node represents one or more infections with exactly the same M. tuberculosis genome sequence. We have tried to clarify this in our revision:

"Whole genome sequencing (WGS) can confer a significant advantage for source attribution over single- or multiple-locus subtyping. Sequencing the entire genome is the maximal extent of multi-locus typing, in that all possible loci are covered. Having whole genome sequences will tend to make one-to-one subtyping (Figure 1) less useful, since most genomes will be unique by at least one mutation for rapidly evolving pathogens. Consequently, applications of WGS for source attribution at a population level will likely have to cluster similar genomes together."

Related, I find the suggestion that the primary reason that WGS clustering algorithms are used is that there is too much resolution in the sequences to be rather clumsy. More resolution really cannot be a bad thing and it would anyway be trivially easy to decrease it were that true. A cluster is not a subtype. (E.g. the sampling of new isolates cannot ever join two subtypes up.) This leads the text to implicitly suggest that the HA/NA influenza classification is somehow a procedure of a similar nature to clustering WGS influenza sequences by similarity. Clustering is rather an answer to the question of “what can we do to link samples together, given that we have high resolution?”

We apologize for the misunderstanding. Our intent was to make the case that WGS makes frequency-based source attribution more complicated if we interpret every unique genome as a "subtype" (what we refer to as one-to-one subtyping). It would mean that there are very many small frequencies to track. Hence, clustering WGS data becomes necessary.
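
To illustrate the point about many small frequencies, the following is a minimal sketch rather than the method used in the article. It assumes six made-up toy sequences, a Hamming distance, single-linkage clustering, and an arbitrary threshold of one mismatch. All names (`hamming`, `single_linkage_clusters`, `genomes`) are hypothetical.

```python
from collections import Counter
from itertools import combinations


def hamming(a, b):
    """Number of mismatched positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def single_linkage_clusters(seqs, threshold):
    """Label each sequence with a cluster id using single linkage.

    Two sequences end up in the same cluster if they are connected by a
    chain of pairs whose Hamming distance is <= threshold.
    """
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if hamming(seqs[i], seqs[j]) <= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(seqs))]


# Toy, made-up aligned "genomes" (assumption: purely illustrative data).
genomes = [
    "AAAAAAAAAA",
    "AAAAAAAAAT",
    "AAAAAAAATT",
    "CCCCCCCCCC",
    "CCCCCCCCCG",
    "CCCCCCCCCG",
]

# One-to-one subtyping: every distinct genome is its own subtype.
one_to_one = Counter(genomes)

# Clustering: genomes within one mismatch of each other are linked.
labels = single_linkage_clusters(genomes, threshold=1)
clustered = Counter(labels)

print(len(one_to_one), "one-to-one subtypes;", len(clustered), "clusters")
```

With one-to-one subtyping nearly every sequence forms its own category, whereas linking similar genomes yields a small number of groups whose frequencies can be compared across sources. The threshold and linkage rule are design choices that would need to be justified for any real pathogen.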

We think the underlying problem is that the term "subtype" is being used at different scales. The Hald model on which much of our discussion of frequency-based source attribution is based refers to subtypes, which may be defined from microbial phenotypes or genotypes. These subtype definitions can be quite fine-grained. For example, the multi-locus sequence typing (MLST) database for Campylobacter jejuni/coli contains over 1.3 million alleles for seven loci, where a subtype consists of a specific combination of alleles. In contrast, an influenza A virus HA/NA-based subtype encapsulates many sequence variants.

We have extensively rewritten this part of the article, including a figure with which we try to communicate the idea that clustering is used to link samples together given high resolution.

Phylogenetics is introduced suddenly in the “defining subtypes” section without definition, and is then introduced later.

We agree that having phylogenetics appear in the "defining subtypes" section was confusing. We have removed these paragraphs from this section and incorporated some of the material into the later section on Phylogenetic methods (now renamed "Comparative methods").

I do not think that what is meant by “mass” in the Hald section would be clear to the layman.

We revised this sentence to read: "the observed total amount (mass) of the j-th food source", where "mass" links out to the Wikipedia article on mass as a property of a physical object.

I also do not think the hiking metaphor is very good, since it suggests that the goal of the hike is to reach a relative plateau where the hikers, presumably, want to keep walking forever. This is not intuitive. Nor is the idea that “the amount of time the hikers spend in a particular area will be proportional to the overall elevation” intuitive. I am not sure this page is the right place for a basic introduction to MCMC in any case.

We have deleted the paragraph with the hiking metaphor, and indeed much of the material on Bayesian inference. We agree that this page is not the right place to introduce readers to MCMC. Instead we link out to the existing Wikipedia page.

The first paragraph in “Inferring transmission history from the phylogeny” contradicts itself, by first stating that internal nodes are treated as transmissions, and then saying that it is a common mistake to do this. (It is not really a mistake, but an assumption.)

Good point. We have revised this paragraph to make it clear that this is a common assumption.

The paragraph beginning “The amount of time we have to follow two or more lineages back in time until we encounter a common ancestor” lacks clarity. As above, if this is to be an introduction for the layman, it needs more work.

We have revised this statement for clarity, thank you.

The more sophisticated versions of the CTMC discrete-state models commonly used in phylogeography do not work by ancestral state reconstruction, but rather by estimation of rates of lineage movement. See e.g. Lemey et al 2009.

This is a good point. Because rates of lineage movement characterize transmission at a more global level, estimating them is not really a method of source attribution. Therefore, we have revised this section to focus on other articles in the phylogeography literature that specifically employed ancestral reconstruction with CTMC discrete-state models.

On the other hand, Phyloscanner does not use a CTMC model of character evolution – it is parsimony on a fixed tree.

We apologize for this error. This is fixed in the revised article.

Finally on that subject, and as alluded to before, it is not really clear whether this page regards phylogeography as a source attribution method or not.

We apologize. We had intended to present phylogeography as a form of source attribution, where different geographic locations are potential sources instead of individuals. To clarify this, we have revised the introductory paragraph as well as the section on ancestral reconstruction.

“A group of infections is paraphyletic if they are related by a common ancestor that also has one or more members that are not assigned to this group.” Reword this – it suggests that common ancestors have members.

Thank you, we have reworded this statement: "A group of infections is paraphyletic if the group includes the most recent common ancestor, but does not include all its descendants."

Modern bootstrapping uses approximations that make it less time-intensive than simply re-running the analysis many times.

Thank you, we have revised the text as follows:

"Non-parametric bootstrapping is a time-consuming process that scales linearly with the number of replicates, since every bootstrap sample is processed by the same method as the original tree, and post-processing steps are required to enumerate clades. The precision of estimating the node support values increases with the number of bootstrap replicates. For instance, it is not possible to obtain a node support of 99% if fewer than 100 bootstrap samples have been processed. Consequently, it is now more common to use faster approximate methods to estimate the support values associated with different nodes of the tree (for instance, see approximate likelihood-ratio testing below)."

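The statement about precision follows from simple arithmetic: with n bootstrap replicates, a node's support can only take values k/n for whole numbers k. The sketch below is an illustration only and is not taken from any phylogenetics package. The function names `can_observe_support` and `support_step` are hypothetical.

```python
from fractions import Fraction


def can_observe_support(target_percent, n_replicates):
    """True if some whole number k of n replicates gives exactly target_percent.

    Support is k / n, so the target is attainable only when
    target_percent / 100 * n is a whole number.
    """
    k = Fraction(target_percent, 100) * n_replicates
    return k.denominator == 1


def support_step(n_replicates):
    """Smallest possible difference (in percent) between two support values."""
    return Fraction(100, n_replicates)


# 99% needs 99 of 100 replicates; with 50 replicates the nearest
# attainable values are 98% and 100%.
print(can_observe_support(99, 100))
print(can_observe_support(99, 50))
print(support_step(50))
```

Exact fractions are used instead of floating point so that the whole-number check is not affected by rounding.
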
The phylodynamics section also lacks clarity. Ultimately the most commonly-used phylodynamic methods of both types (backwards-time and forwards-time) are not source attribution methods and are concerned with reconstructing overall dynamics. You cannot in general use a skyline plot or standard birth-death methods to do source attribution. I do not feel this really comes across at all.

We apologize for the lack of clarity in this section. It was not our intent to characterize all phylodynamic methods as being a form of source attribution, which is certainly incorrect! We have revised this section to make it clearer to the reader that only a subset of phylodynamic models are used for this purpose.

The COVID case counts from March should be updated.

Good point. We updated the case counts to global estimates as of July 8, 2021.
