Sunday, April 12, 2009

Numerical Analysis of Profanity in the PubMed Database

Here at UotH headquarters, we are primarily concerned with bringing about the Hydrocalypse and self-aggrandizement. In order to continue receiving our government welfare stipend checks, however, we must occasionally do research. As Systems Biologists, this means we spend as much time on the computer as possible trying to avoid doing actual experiments. This often means looking at data that other people collected and pulling conclusions out of a hat. As an example, we present an analysis of profanity in the PubMed database.

PubMed is a massive database curated by the NIH that contains most articles written this century on biology and medicine. It is also full of swears:
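For the curious, here is a minimal sketch of how a tally like this could be put together. It is not our actual pipeline: the word list, the `count_swears` helper, and the sample input (just the three paper titles quoted in this post) are all hypothetical stand-ins, and a real count would have to query PubMed itself rather than a hard-coded list.

```python
import re
from collections import Counter

# Hypothetical word list, for illustration only.
SWEARS = ["poop", "bitch", "fart"]

def count_swears(texts, swears=SWEARS):
    """Count how many texts contain each swear as a whole word (case-insensitive)."""
    counts = Counter({w: 0 for w in swears})
    for text in texts:
        # Split into lowercase alphabetic words so "farther" does not match "fart".
        words = set(re.findall(r"[a-z]+", text.lower()))
        for w in swears:
            if w in words:
                counts[w] += 1
    return counts

# Sample input: the paper titles quoted in this post (an assumption, not a PubMed query).
titles = [
    "Ovarian teratoma in a bitch.",
    "Automatic analysis of signals with symbolic content.",
    "A mutant crp allele that differentially activates the operons of the fuc regulon in Escherichia coli.",
]

print(count_swears(titles))
```

Matching on whole words keeps innocent terms out of the count, which is also why "fuc" genes would need their own entry in the word list.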

This is a good first pass at the data, but we know that power law distributions are all the rage these days, so we need to make the data look more power-law-y. Randomly adding parameters to our model revealed a missing parameter that completes our graph:

Much better. A thoughtful scientist would want to delve deeper into the data to discover the mechanisms through which swears end up on PubMed. Running a hidden-Markov chain model on the data classified the causes of bad words on PubMed into the following categories:

1. Snarky scientists
"Poop" is a prime offender for this one, as evidenced by the following titles of actual papers on PubMed:

Using cutesy terms in articles about dysfunctionally defecating babies is something we approve of.

2. Poor abbreviation choices
The study of the metabolism of fucose has led to the awesomest group of genes in the history of biology: the "fuc" genes. As is tradition in the identification of E. coli genes, each gene in a pathway gets a four-letter name, leading to fucA, fucB, fucC...all the way to the #1 gene of all time: fucK, which as everyone knows encodes the enzyme L-fuculokinase. As expected, this leads to some great papers, like this one:

A mutant crp allele that differentially activates the operons of the fuc regulon in Escherichia coli.

Department of Microbiology and Molecular Genetics, Harvard Medical School, Boston, Massachusetts 02115.

L-Fucose is used by Escherichia coli through an inducible pathway mediated by a fucP-encoded permease, a fucI-encoded isomerase, a fucK-encoded kinase, and a fucA-encoded aldolase. The adolase catalyzes the formation of dihydroxyacetone phosphate and L-lactaldehyde. Anaerobically, lactaldehyde is converted by a fucO-encoded oxidoreductase to L-1,2-propanediol, which is excreted. The fuc genes belong to a regulon comprising four linked operons: fucO, fucA, fucPIK, and fucR. The positive regulator encoded by fucR responds to fuculose 1-phosphate as the effector. Mutants serially selected for aerobic growth on propanediol became constitutive in fucO and fucA [fucO(Con) fucA(Con)], but noninducible in fucPIK [fucPIK(Non)]. An external suppressor mutation that restored growth on fucose caused constitutive expression of fucPIK. Results from this study indicate that this suppressor mutation occurred in crp, which encodes the cyclic AMP-binding (or receptor) protein. When the suppressor allele (crp-201) was transduced into wild-type strains, the recipient became fucose negative and fucose sensitive (with glycerol as the carbon and energy source) because of impaired expression of fucA. The fucPIK operon became hyperinducible. The growth rate on maltose was significantly reduced, but growth on L-rhamnose, D-galactose, L-arabinose, glycerol, or glycerol 3-phosphate was close to normal. Lysogenization of fuc+ crp-201 cells by a lambda bacteriophage bearing crp+ restored normal growth ability on fucose. In contrast, lysogenization of [fucO(Con)fucA(Con)fucPIK(Non)crp-201] cells by the same phage retarded their growth on fucose.

This paper received extra points for remarking that growth on fucose is "retarded."

3. Unfortunate language issues
Many scientific papers are written in English by scientists who don't primarily speak English. As ugly Americans, we feel free to mock these papers. It appears that "bitch" is a favorite in this category, as female dog research seems to be popular overseas. Our old friend "poop" again makes an appearance in this category, along with his PG-13 friend, "shit":

Of course, defunctionalized poop is no laughing matter, but farts are always funny:

Automatic analysis of signals with symbolic content.

Department of Applied Physics, University of La Laguna, C/ Astrofísico Sánchez. Ed. de Física y Matemáticas, CP 38200, La Laguna, Spain.

This paper presents a set of methods for helping in the analysis of signals with particular features that admit a symbolic description. The methodology is based on a general discrete model for a symbolic processing subsystem, which is fuzzyfied by means of a fuzzy inference system. In this framework a number of design problems have been approached. The curse of dimensionality problem and the specification of adequate membership functions are the main ones. In addition, other strategies, which make the design process simpler and more robust, are introduced. Their goals are automating the production of the rule base of the fuzzy system and composing complex systems from simpler subsystems under symbolic constrains. These techniques are applied to the analysis of wakefulness episodes in the sleep EEG. In order to solve the practical difficulty of finding remarkable situations from the outputs of the symbolic subsystems an unsupervised adaptive learning technique (FART network) has been applied.

We couldn't gain access to the actual paper for this one, so we took the liberty of reconstructing what the FART network looks like:

As you can see, the FART network is also scale-free.

4. Funny last names
Amazingly, despite the breakthroughs in fuc research, many of the papers that appear on a PubMed search for "fuck" are written by scientists with the last name of "Fuck." Since we've never seen ads for Fuck family reunions around these parts, we're guessing that this is also a language issue.

5. BONUS: In rare cases, multiple swears end up in a single paper. The preponderance of Dr. Fucks, combined with the important bitch research being conducted throughout the world, has led to a single paper:

Ovarian teratoma in a bitch.

Department of Preventive Veterinary Medicine, Universidade Estadual de Londrina, Londrina, Paraná 86051-990, Brazil.

Scientific writing doesn't often inspire poetry, but this paper is beautiful in its simplicity and directness. Feel free to recite this at your next poetry slam:

Ovarian teratoma in a bitch.

Fuck. Fuck.


Equinspire said...

You deserve an Ig Nobel prize nomination for that effort. Awesome work.

khaynes said...

"a lambda bacteriophage bearing crp+," fucK, fucO, and fucR all in the same abstract. Priceless.