Last update: 22 March, 2026.
Launched: 27 December, 2011.
Note: the largest public bibliography of references on Zipf’s law for word frequencies is maintained separately. This page offers only a selection of references on Zipf’s law in animal behavior and organic chemistry.
Hint for browsing: Heaps’ law is another name for Herdan’s law.
2024
Li, Wentian; Almirantis, Yannis; Provata, Astero
Range-limited Heaps’ law for functional DNA words in the human genome Journal Article
In: Journal of Theoretical Biology, vol. 592, pp. 111878, 2024.
Tags: Herdan's law
@article{Li2024a,
title = {Range-limited Heaps’ law for functional DNA words in the human genome},
author = {Wentian Li and Yannis Almirantis and Astero Provata},
url = {https://www.sciencedirect.com/science/article/pii/S0022519324001620},
doi = {10.1016/j.jtbi.2024.111878},
year = {2024},
date = {2024-01-01},
journal = {Journal of Theoretical Biology},
volume = {592},
pages = {111878},
abstract = {Heaps’ or Herdan-Heaps’ law is a linguistic law describing the relationship between the vocabulary/dictionary size (type) and word counts (token) to be a power-law function. Its existence in genomes with certain definition of DNA words is unclear partly because the dictionary size in genome could be much smaller than that in a human language. We define a DNA word as a coding region in a genome that codes for a protein domain. Using human chromosomes and chromosome arms as individual samples, we establish the existence of Heaps’ law in the human genome within limited range. Our definition of words in a genomic or proteomic context is different from other definitions such as over-represented k-mers which are much shorter in length. Although an approximate power-law distribution of protein domain sizes due to gene duplication and the related Zipf’s law is well known, their translation to the Heaps’ law in DNA words is not automatic. Several other animal genomes are shown herein also to exhibit range-limited Heaps’ law with our definition of DNA words, though with various exponents. When tokens were randomly sampled and sample sizes reach to the maximum level, a deviation from the Heaps’ law was observed, but a quadratic regression in log\textendash{}log type-token plot fits the data perfectly. Investigation of type-token plot and its regression coefficients could provide an alternative narrative of reusage and redundancy of protein domains as well as creation of new protein domains from a linguistic perspective.},
keywords = {Herdan's law},
pubstate = {published},
tppubtype = {article}
}
2021
Caetano-Anollés, Gustavo
The compressed vocabulary of microbial life Journal Article
In: Frontiers in Microbiology, vol. 12, pp. 1273, 2021, ISSN: 1664-302X.
Tags: Herdan's law, Menzerath-Altmann law, Zipf's law for word frequencies, Zipf's law of abbreviation
@article{10.3389/fmicb.2021.655990,
title = {The compressed vocabulary of microbial life},
author = {Gustavo Caetano-Anoll\'{e}s},
url = {https://www.frontiersin.org/article/10.3389/fmicb.2021.655990},
doi = {10.3389/fmicb.2021.655990},
issn = {1664-302X},
year = {2021},
date = {2021-01-01},
journal = {Frontiers in Microbiology},
volume = {12},
pages = {1273},
abstract = {Communication is an undisputed central activity of life that requires an evolving molecular language. It conveys meaning through messages and vocabularies. Here, I explore the existence of a growing vocabulary in the molecules and molecular functions of the microbial world. There are clear correspondences between the lexicon, syntax, semantics, and pragmatics of language organization and the module, structure, function, and fitness paradigms of molecular biology. These correspondences are constrained by universal laws and engineering principles. Macromolecular structure, for example, follows quantitative linguistic patterns arising from statistical laws that are likely universal, including the Zipf’s law, a special case of the scale-free distribution, the Heaps’ law describing sublinear growth typical of economies of scales, and the Menzerath\textendash{}Altmann’s law, which imposes size-dependent patterns of decreasing returns. Trade-off solutions between principles of economy, flexibility, and robustness define a “triangle of persistence” describing the impact of the environment on a biological system. The pragmatic landscape of the triangle interfaces with the syntax and semantics of molecular languages, which together with comparative and evolutionary genomic data can explain global patterns of diversification of cellular life. The vocabularies of proteins (proteomes) and functions (functionomes) revealed a significant universal lexical core supporting a universal common ancestor, an ancestral evolutionary link between Bacteria and Eukarya, and distinct reductive evolutionary strategies of language compression in Archaea and Bacteria. A “causal” word cloud strategy inspired by the dependency grammar paradigm used in catenae unfolded the evolution of lexical units associated with Gene Ontology terms at different levels of ontological abstraction.
While Archaea holds the smallest, oldest, and most homogeneous vocabulary of all superkingdoms, Bacteria heterogeneously apportions a more complex vocabulary, and Eukarya pushes functional innovation through mechanisms of flexibility and robustness.},
keywords = {Herdan's law, Menzerath-Altmann law, Zipf's law for word frequencies, Zipf's law of abbreviation},
pubstate = {published},
tppubtype = {article}
}
2017
Nasir, Arshan; Kim, Kyung Mo; Caetano-Anollés, Gustavo
Phylogenetic Tracings of Proteome Size Support the Gradual Accretion of Protein Structural Domains and the Early Origin of Viruses from Primordial Cells Journal Article
In: Frontiers in Microbiology, vol. 8, pp. 1178, 2017, ISSN: 1664-302X.
Tags: Herdan's law
@article{10.3389/fmicb.2017.01178,
title = {Phylogenetic Tracings of Proteome Size Support the Gradual Accretion of Protein Structural Domains and the Early Origin of Viruses from Primordial Cells},
author = {Arshan Nasir and Kyung Mo Kim and Gustavo Caetano-Anoll\'{e}s},
url = {https://www.frontiersin.org/article/10.3389/fmicb.2017.01178},
doi = {10.3389/fmicb.2017.01178},
issn = {1664-302X},
year = {2017},
date = {2017-01-01},
journal = {Frontiers in Microbiology},
volume = {8},
pages = {1178},
abstract = {Untangling the origin and evolution of viruses remains a challenging proposition. We recently studied the global distribution of protein domain structures in thousands of completely sequenced viral and cellular proteomes with comparative genomics, phylogenomics, and multidimensional scaling methods. A tree of life describing the evolution of proteomes revealed viruses emerging from the base of the tree as a fourth supergroup of life. A tree of domains indicated an early origin of modern viral lineages from ancient cells that co-existed with the cellular ancestors. However, it was recently argued that the rooting of our trees and the basal placement of viruses was artifactually induced by small genome (proteome) size. Here we show that these claims arise from misunderstanding and misinterpretations of cladistic methodology. Trees are reconstructed unrooted, and thus, their topologies cannot be distorted a posteriori by the rooting methodology. Tracing proteome size in trees and multidimensional views of evolutionary relationships as well as tests of leaf stability and exclusion/inclusion of taxa demonstrated that the smallest proteomes were neither attracted toward the root nor caused any topological distortions of the trees. Simulations confirmed that taxa clustering patterns were independent of proteome size and were determined by the presence of known evolutionary relatives in data matrices, highlighting the need for broader taxon sampling in phylogeny reconstruction. Instead, phylogenetic tracings of proteome size revealed a slowdown in innovation of the structural domain vocabulary and four regimes of allometric scaling that reflected a Heaps law. These regimes explained increasing economies of scale in the evolutionary growth and accretion of kernel proteome repertoires of viruses and cellular organisms that resemble growth of human languages with limited vocabulary sizes. 
Results reconcile dynamic and static views of domain frequency distributions that are consistent with the axiom of spatiotemporal continuity that is tenet of evolutionary thinking.},
keywords = {Herdan's law},
pubstate = {published},
tppubtype = {article}
}