anagallis: a program for parsimony analysis of character hierarchies

In parsimony analysis, the problem of inapplicables (see Maddison W.P. 1993. Syst. Biol. 42, 576-581) can be overcome by maximizing the amount of similarity that can be interpreted as homology, an idea that I first discussed in this 2002 talk.

Maximization of homology also provides the key to extending parsimony to the analysis of unaligned sequence data, as discussed in this 2004 talk and in this 2005 paper. For tree alignment programs such as POY, cost regime 3221 (gap opening cost three, transition and transversion costs two, and gap extension cost one) is shown to provide an optimal approximation of the cost set that maximizes homology when all instances of homology are equally weighted. A discussion of differential weighting of homologies can be found in this 2015 paper (section on approximations and section on sensitivity analysis).
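As a rough illustration of what the four numbers in cost regime 3221 mean, the following Python sketch scores a single, already aligned pair of DNA sequences. It is not code from POY or anagallis, and it does not perform tree alignment. The function name alignment_cost and its parameters are hypothetical. The sketch assumes that the first position of a gap is charged the gap opening cost and every further position of the same gap the gap extension cost; programs differ on whether the opening cost is added on top of the extension cost, so the convention of a given program should be checked in its documentation.

```python
# Hypothetical helper, for illustration only: scores one pairwise alignment
# under a 3221-style cost regime. Assumed convention: the first position of a
# gap costs gap_open, each further position of that gap costs gap_extend.

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def alignment_cost(a, b, gap_open=3, transition=2, transversion=2, gap_extend=1):
    """Return the cost of two aligned DNA strings that use '-' for gaps."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    cost = 0
    prev_gap_row = None  # row that had a gap in the previous column, if any
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot consist of gaps only")
        if x == "-" or y == "-":
            row = "a" if x == "-" else "b"
            # Continue an existing gap in the same row, or open a new one.
            cost += gap_extend if prev_gap_row == row else gap_open
            prev_gap_row = row
        else:
            prev_gap_row = None
            if x != y:
                pair = {x, y}
                is_transition = pair <= PURINES or pair <= PYRIMIDINES
                cost += transition if is_transition else transversion
    return cost


# One transition (C/T, cost 2) plus one gap of length two (3 + 1).
print(alignment_cost("ACGT--A", "ATGTCCA"))  # prints 6
```

Because transitions and transversions have the same cost under 3221, the distinction between them does not affect the score here; the two parameters are kept separate only to mirror the four digits of the regime.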

Inapplicables as they arise in the classic approach are a special case of inapplicables as they arise in sequence data. This special case can be tackled with algorithms that are computationally less complex. A discussion can be found in the 2005 paper (pp. 110-111) and the 2015 paper (section on inapplicables) mentioned above.

Anagallis is a computer program that provides tree evaluation and tree searches with such algorithms. The first public version (v0.998 beta, Linux only) was released on 16 April 2018. Anagallis v1.01, the first version to include a high-level description of the main algorithm, was released on 11 December 2018. It was also the first version for which a macOS executable was available.

The current version is anagallis v1.03 (8 September 2020). A dump of its built-in documentation can be found here. It includes a discussion of the theory behind the program, of the high-level structure of its main optimization algorithm, and of the scope and limits of that algorithm. A gzipped tarfile that contains the macOS executable, a dynamically linked 64-bit Linux executable, and several files to get started with the program can be found here. Basic instructions, including how to get started once the tarfile is downloaded, are here.

In October 2017, Brazeau, Guillerme and Smith posted this interesting paper on morphological analysis with inapplicable data on bioRxiv. The main difference from my approach seems to be that they optimize single-column characters with inapplicables independently, rather than optimizing character hierarchies as a whole. This may give good results under a wide range of conditions, but in general the optimization of a character hierarchy on a tree cannot be reduced to a series of independent single-character optimizations on that tree. Doing so may yield a fast approximation of the score of a character hierarchy on a tree, but it can miss optimal state reconstructions, miss the optimal score, and ultimately identify non-optimal trees as optimal during tree search. The first of these three issues can be illustrated using the example of their Fig. 3. It is discussed in more detail here (also available as this OSF preprint).