---
date: "`r format(Sys.time(), '%d. %B %Y')`"
output: 
    word_document:
      reference_docx: style_ref.docx
---

```{r setup, include=F}
load("/Users/paulsanfilippo/Dropbox/Research Projects/2017/Polygamous Affiliations/polygamy.RData")
library(knitr)
library(tidyverse)
```

#####

### Polygamy in biomedical research: Multiple institutional affiliations are associated with improved research output

PG Sanfilippo PhD,1,2 AW Hewitt PhD FRANZCO,1,2,3 DA Mackey MD FRANZCO 1,2,3

1. Centre for Ophthalmology and Visual Science, University of Western Australia, Lions Eye Institute, Perth, Australia.
2. Centre for Eye Research Australia, University of Melbourne, Royal Victorian Eye and Ear Hospital.
3. School of Medicine, Menzies Research Institute Tasmania, University of Tasmania, Hobart, Tasmania, Australia. 


Word Count: 

Corresponding Author  
Dr Paul Sanfilippo  
Centre for Eye Research Australia  
32 Gisborne St, East Melbourne.  

E-MAIL: prseye@gmail.com

Declaration of competing/conflicts of interest  
No conflicts of interest were related to this work.

Financial Support  
The Centre for Eye Research Australia receives Operational Infrastructure Support from the Victorian Government. PGS is supported by an NHMRC Early Career Fellowship.

#####

```{r summary, include=F}
num_art <- length(append_wide$Study)
max_cit <- max(append_wide$Tot_Cites)
mea_cit <- sprintf("%0.1f", mean(append_wide$Tot_Cites))
med_cit <- sprintf("%0.1f", median(append_wide$Tot_Cites))
max_aut <- max(append_wide$Num_Auth)
mea_aut <- sprintf("%0.1f", mean(append_wide$Num_Auth))
med_aut <- sprintf("%0.1f", median(append_wide$Num_Auth))
max_aff <- max(append_wide$Num_Affil)
mea_aff <- sprintf("%0.1f", mean(append_wide$Num_Affil))
med_aff <- sprintf("%0.1f", median(append_wide$Num_Affil))
n_nat <- length(append_wide$Journal[append_wide$Journal=="NATURE"])
n_sci <- length(append_wide$Journal[append_wide$Journal=="SCIENCE"])
n_pna <- length(append_wide$Journal[append_wide$Journal=="PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF   AMERICA"])
n_plo <- length(append_wide$Journal[append_wide$Journal=="PLOS BIOLOGY"])
n_most <- length(append_wide$Num_Affil[append_wide$Cite_Split==1])
n_leas <- length(append_wide$Num_Affil[append_wide$Cite_Split==0])
```

With the Digital Revolution, the time-honoured model of scientific discovery as contingent on a singular intellect working independently of others has expired. In the modern age of global travel and the interactive capabilities afforded by the internet, there is an expectation that good researchers are internationally mobile, both physically and virtually. Researcher mobility is not a goal in itself, but rather a means of fostering collaborative networks at the many levels (e.g. institutional, interdisciplinary, international) that may drive successful scientific discovery. The increasing dominance of collaborative teams, both within and between institutions, has been documented to enhance efficiency and productivity as well as to produce better science.{Wuchty, 2007 #2742; Jones, 2008 #2741} Within this collaborative research milieu, the institutional affiliations held by a researcher may also be viewed as a marker of capacity to facilitate knowledge exchange.{ESF, 2013 #2744} However, to date there has been little research from the burgeoning scientometric and bibliometric fields exploring the role of multiple institutional affiliations in scientific output.{Hottenrott, 2017 #2740} To improve our understanding of this phenomenon, we conducted a large-scale analysis of scientific publications from four multi-disciplinary science journals (Science, Nature, Proceedings of the National Academy of Sciences [PNAS], PLOS Biology [PLOS]).  

We retrieved all 'articles' listed for the above journals from Web of Science (WoS) for the years 2010–2014, inclusive (search performed on 14/06/17). Articles were exported from WoS as BibTeX files with complete metadata, then imported into the R statistical environment {Team, 2017 #275} for further processing. The bibliometrix package {Aria, 2016 #2739} was used to create a bibliographic data frame with cases (rows) corresponding to manuscripts and variables (columns) corresponding to Field Tags (metadata) in the original BibTeX file. In this way the bibliographic attributes for each article (i.e. title, authors' names, authors' affiliations, citation count, document type, keywords, etc.) are formatted appropriately for subsequent analysis. The most important Field Tag for the purposes of this study is the Author Address (C1) tag, which provides institutional address information for each author and, where an author has multiple affiliations, lists these addresses separately. We split each manuscript record by author name and affiliation address, and took the number of times an author's name occurred as the number of distinct affiliations for that author.  
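The counting step described above can be sketched in a few lines of base R. This is a minimal illustration, not the processing code used for the study: the example string is invented, the helper name `count_affiliations` is hypothetical, and the sketch assumes the C1 field follows the pattern `[author; author] address; [author] address`, with one bracketed author list per address.

```r
# Hypothetical C1 string in an assumed "[authors] address" layout (not real data)
c1 <- paste(
  "[Smith, J; Doe, A] Univ A, Dept X, City, Country;",
  "[Smith, J] Inst B, City, Country;",
  "[Lee, K] Univ A, Dept X, City, Country"
)

# Hypothetical helper: count how many address blocks each author name appears in
count_affiliations <- function(c1) {
  # Extract every bracketed author list, one per address
  blocks <- regmatches(c1, gregexpr("\\[.*?\\]", c1, perl = TRUE))[[1]]
  # Remove the brackets and split each list into individual author names
  authors <- unlist(lapply(blocks, function(b) {
    trimws(strsplit(gsub("^\\[|\\]$", "", b), ";", fixed = TRUE)[[1]])
  }))
  # Occurrences of a name = number of addresses listed for that author
  counts <- table(authors)
  setNames(as.integer(counts), names(counts))
}

affil_counts <- count_affiliations(c1)
affil_counts
```

In this toy record the author listed under two addresses receives a count of two, and the largest count on the record is the quantity later referred to as the maximum affiliation.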

Of the 27,651 articles retrieved, 39 did not have affiliation data recorded and were excluded. The total number of articles available for analysis was `r num_art`, comprising Science (n = `r n_sci`), Nature (n = `r n_nat`), PNAS (n = `r n_pna`), and PLOS (n = `r n_plo`). The maximum number of citations for a single paper was `r max_cit` (mean and median: `r mea_cit` and `r med_cit`, respectively). The maximum number of authors for a single paper was `r max_aut` (mean and median: `r mea_aut` and `r med_aut`, respectively), and the maximum number of author affiliations was `r max_aff` (mean and median: `r mea_aff` and `r med_aff`, respectively). Author affiliations were recorded as presented by WoS.

Table 1: Frequency distribution of articles and author appearances in most- and least-cited articles, stratified by the number of author affiliations attached to each article. Because an individual article may contain authors with different numbers of affiliations, it may be counted in more than one row of the summary; likewise, an author may appear on multiple papers.

```{r citesplit, echo=F}
lowcite <- subset(append_wide, Cite_Split == 0)
highcite <- subset(append_wide, Cite_Split == 1)
lowcite_sum <- map(lowcite[,8:18], sum)
lowcite_sum <- unlist(lowcite_sum)
highcite_sum <- map(highcite[,8:18], sum)
highcite_sum <- unlist(highcite_sum)
row_tot <- lowcite_sum + highcite_sum
row_perc <- row_tot/sum(row_tot)
col_tot <- c(sum(lowcite_sum), sum(highcite_sum), sum(row_tot), 1) 
col_perc <- col_tot/sum(row_tot)
citedf <- data.frame(lowcite_sum, highcite_sum, row_tot, row_perc)
citedf <-  rbind(citedf[1:10,], c(0, 0, 0, 0) , citedf[-(1:10),]) #Insert row for Affil 11
citedf <- rbind(citedf, col_tot, col_perc)
citedf$row_perc[14] <- NA
rownames(citedf)[13] <- "col_tot"
rownames(citedf)[14] <- "col_perc"
kable(citedf)

cit_af1 <- as.numeric(sprintf("%0.1f", citedf$row_perc[1]*100))
cit_afr <- 100 - cit_af1
cit_af2 <- sprintf("%0.1f", citedf$row_perc[2]*100)
num_art2 <- length(append_wide_sub$Study)
cor_auaf <- cor.test(append_wide_sub$Num_Affil, append_wide_sub$Num_Auth)
cor_auci <- cor.test(append_wide_sub$Tot_Cites, append_wide_sub$Num_Auth)
cor_afci <- cor.test(append_wide_sub$Tot_Cites, append_wide_sub$Num_Affil)
```

Table 1 shows the distribution of publications and author appearances stratified by the number of author affiliations for the most- and least-cited articles, split at the median citation value (most-cited = citations > `r med_cit` [n = `r n_most`], least-cited = citations ≤ `r med_cit` [n = `r n_leas`]). While the vast majority of author appearances were associated with only one institutional affiliation (`r cit_af1`%), `r cit_afr`% of author appearances were linked with two (`r cit_af2`%) or more affiliation addresses. The maximum number of institutional affiliations held by an author was 12. As these are non-independent observations, classical tests of contingency tables are not appropriate; however, one can easily appreciate the increased frequency of author appearances in the more-cited publications. Indeed, the correlation between the citations a paper received and the number of authors on that paper was statistically significant ($\rho$ = `r round(cor_auci$estimate[[1]],2)`, p = `r cor_auci$p.value`). Similarly, the correlation coefficient for the citations a paper received and the number of institutional affiliations on that paper was `r round(cor_afci$estimate[[1]],2)` (p = `r cor_afci$p.value`). The correlation between the number of authors and the number of affiliations listed for each paper was greater, indicating closer correspondence between these variables ($\rho$ = `r round(cor_auaf$estimate[[1]],2)`, p = `r cor_auaf$p.value`).     
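Very small p-values from `cor.test()` print in scientific notation (or as 0) when inserted inline. The sketch below shows one possible way to format a correlation estimate and its p-value for reporting. The data are simulated stand-ins, not the study data, and the variable names are illustrative. Note that `cor.test()` computes Pearson's correlation unless another `method` is requested.

```r
set.seed(1)
# Simulated stand-ins for per-article author and affiliation counts (not study data)
n_auth  <- rpois(200, lambda = 6) + 1
n_affil <- n_auth + rpois(200, lambda = 2)

ct <- cor.test(n_affil, n_auth)  # Pearson's correlation by default

# Round the estimate and cap very small p-values at a reporting threshold
est   <- round(unname(ct$estimate), 2)
p_txt <- format.pval(ct$p.value, digits = 2, eps = 0.001)

cat(sprintf("correlation = %.2f, p %s\n", est, p_txt))
```

With `eps = 0.001`, any p-value below that threshold is rendered as a "less than" string rather than a long decimal.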

To facilitate a simple yet fruitful investigation of the relationship between the citations a paper received and the influence of authorship and affiliation frequency, we categorised the latter two variables. The number of authors attached to each paper was split into quartiles to create an 'Author Number' variable, with the following categories: 1 = 1 – 3 authors/article, 2 = 4 – 5 authors/article, 3 = 6 – 9 authors/article, and 4 = 10 – 2908 authors/article. Owing to the low cell counts (Table 1), and to improve estimation in subsequent modelling, we restricted the analysis to papers on which no author held more than six affiliations. This resulted in the exclusion of a further 47 papers, leaving `r num_art2` articles available for analysis. 'Maximum Affiliation' represents the maximum number of institutional affiliations held by a single author on an article. For example, if WoS listed an article with three authors each having two affiliations and two authors each having three affiliations, maximum affiliation would equal 3. Table 2 shows the frequency distribution of articles by author number and maximum affiliation.    
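The two derived variables can be illustrated with a short sketch. The first part reproduces the worked example from the text. The second part bins author counts with `cut()`, using break points that are an assumption read from the category boundaries above (with the upper category taken to start above nine authors), not values taken from the analysis code.

```r
# Worked example from the text: three authors with 2 affiliations, two with 3
affil_per_author <- c(2, 2, 2, 3, 3)
affil_max <- max(affil_per_author)
affil_max

# Illustrative author counts, binned into the four Author Number categories.
# Break points are assumed: (0,3], (3,5], (5,9], (9,Inf]
num_auth <- c(1, 3, 4, 5, 6, 9, 10, 2908)
cat_auth <- cut(num_auth, breaks = c(0, 3, 5, 9, Inf), labels = 1:4)

data.frame(num_auth, cat_auth)
```

Because `cut()` uses right-closed intervals by default, a paper with exactly nine authors falls in category 3 and a paper with ten authors in category 4.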

Table 2: Frequency distribution (%) of articles in each category of author number and maximum affiliation.

```{r freqdist, echo=F}
source("/Users/paulsanfilippo/Dropbox/Research Projects/2017/Polygamous Affiliations/Functions/crosstab.r")
xtabf <- crosstab(append_wide_sub, row.vars = "Cat_Auth", col.vars = "Affil_Max", type = "f") # Frequency
xtabj <- crosstab(append_wide_sub, row.vars = "Cat_Auth", col.vars = "Affil_Max", type = "j") # Percent
fmat <- as.matrix(unlist(xtabf$table))
jmat <- as.matrix(unlist(xtabj$crosstab))
kable(fmat, digits=4)
kable(jmat, digits=4)
```

Maximum Affiliation = The maximum number of affiliations held by a single author for each article.  
Author Number:  
1 = 1 – 3 authors/article.  
2 = 4 – 5 authors/article.  
3 = 6 – 9 authors/article.  
4 = 10 – 2908 authors/article.  

Figure 1: Boxplots of citation counts stratified by author number and maximum affiliation. The horizontal line and adjacent number indicate the median, the top and bottom of the boxes the interquartile range, and the number below each plot, the mean citation count. Citations are truncated at 500.

```{r boxplot, echo=F}
# Label helpers for stat_summary(): each returns the y position and text of a label
med.n <- function(x){
  # Median label, placed 10 citation units above the median line
  return(c(y = median(x)+10, label = round(median(x),0))) 
}
mean.n <- function(x){
  # Mean label, placed at a fixed position (y = -8) below each box
  return(c(y = -8, label = round(mean(x),0))) 
}
# BOXPLOTS
ggplot(append_wide_sub, aes(Affil_Max, Tot_Cites)) +
geom_boxplot(outlier.size = 0.2) +
coord_cartesian(ylim=c(0, 500)) +
facet_grid(~Cat_Auth) + theme(plot.subtitle = element_text(vjust = 1), 
    plot.caption = element_text(vjust = 1), 
    axis.line = element_line(colour = "gray50", size = 0.5, linetype = "solid"), 
    panel.background = element_rect(fill = NA)) + labs(x = "Maximum Affiliation", y = "Citations") +
  stat_summary(fun.data = mean.n, geom = "text") + 
  stat_summary(fun.data = med.n, geom = "text")
```

Figure 1 shows boxplots of citation counts for each category of author number and maximum affiliation. There is a general trend of citation count increasing across both factors. We explored this relationship further in a linear regression model with citation count as the outcome, and author number and maximum affiliation as predictor variables (Supplementary Table). Although these are technically count data, the mean citation value is high and the distribution of the count model approximates the normal. Consequently, we have considered citations a continuous variable and utilised a linear model. We initially fit a model with an interaction term (author number x maximum affiliation) and evaluated its significance with a Wald test. The resulting p value was < 0.001, suggesting that the 15 coefficients for the interaction terms are not simultaneously equal to zero and that an interaction effect exists between the two variables (i.e. the effect of maximum affiliation on citations received varies depending on the value of author number). The model was checked for multicollinearity using the generalized variance inflation factor (GVIF). The raw output from the regression model is supplied in the Supplementary Table. As interaction terms make coefficient interpretation difficult, results for the effect of each level of one predictor are presented in a stratified manner, while holding the other predictor constant (Table 3).    
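A joint test of the interaction terms can be sketched as follows. The data here are simulated purely for illustration (six affiliation levels and four author categories, mirroring the design described above), and the nested-model F-test from `anova()` is shown as one common way to test all interaction coefficients at once in a linear model; it is not necessarily the exact procedure used for the reported result. The GVIF check is not reproduced in this sketch.

```r
set.seed(42)
n <- 2000
# Simulated stand-in data with the same factor structure (not the study data)
d <- data.frame(
  Affil_Max = factor(sample(1:6, n, replace = TRUE)),
  Cat_Auth  = factor(sample(1:4, n, replace = TRUE))
)
a <- as.integer(d$Affil_Max)
b <- as.integer(d$Cat_Auth)
# Outcome built with a deliberate interaction so the test has something to find
d$Tot_Cites <- 20 + 5 * b + 3 * (a - 1) * (b - 1) + rnorm(n, sd = 15)

mod_main <- lm(Tot_Cites ~ Affil_Max + Cat_Auth, data = d)  # main effects only
mod_int  <- lm(Tot_Cites ~ Affil_Max * Cat_Auth, data = d)  # plus interaction

# Joint test that all interaction coefficients equal zero
joint <- anova(mod_main, mod_int)
n_int <- joint$Df[2]            # number of interaction coefficients tested
p_int <- joint[["Pr(>F)"]][2]   # p-value of the joint test
joint
```

With six levels of one factor and four of the other, the interaction contributes (6 - 1) x (4 - 1) = 15 coefficients, which is the number of terms tested jointly.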

Table 3 shows the effect of each combination of maximum affiliation and author number on citation count. To further facilitate interpretation, we have limited maximum affiliation data to four addresses. The effect size (Average Change in Citation Count) was computed using a series of linear contrasts, which enable comparison of differences among coefficients beyond the standard regression output. There are two main findings from these data: first, the effect on citation count of an author holding more institutional affiliations increases as the number of authors on a paper grows; and second, increasing the number of authors on a paper tends to result in more citations received, irrespective of the number of affiliations held.  
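To make the idea of a linear contrast concrete, the sketch below estimates the effect of holding two affiliations rather than one within author number category 3. Under R's default treatment coding this effect is the sum of the main-effect coefficient and the matching interaction coefficient. The data are simulated and the object names are illustrative; this is a sketch of the general technique, not the code behind Table 3.

```r
set.seed(7)
n <- 1200
# Simulated stand-in data (not the study data)
d <- data.frame(
  Affil_Max = factor(sample(1:4, n, replace = TRUE)),
  Cat_Auth  = factor(sample(1:4, n, replace = TRUE))
)
a <- as.integer(d$Affil_Max)
b <- as.integer(d$Cat_Auth)
d$Tot_Cites <- 20 + 5 * b + 3 * (a - 1) * (b - 1) + rnorm(n, sd = 15)

fit <- lm(Tot_Cites ~ Affil_Max * Cat_Auth, data = d)

# Contrast vector: 1 for the coefficients that make up the effect, 0 elsewhere
k <- setNames(numeric(length(coef(fit))), names(coef(fit)))
k[c("Affil_Max2", "Affil_Max2:Cat_Auth3")] <- 1

est <- sum(k * coef(fit))                     # estimated change in citations
se  <- sqrt(drop(t(k) %*% vcov(fit) %*% k))   # standard error of the contrast
p   <- 2 * pt(abs(est / se), df = df.residual(fit), lower.tail = FALSE)

c(estimate = est, se = se, p = p)
```

Because the model includes every interaction term, the contrast estimate equals the difference between the two corresponding cell means, which gives a simple check on the contrast vector.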

When there are 1 – 5 authors/article (author number = 1 or 2), increasing the number of affiliations an author holds (relative to one) does not affect the average change in citation count. However, when there are 6 – 9 authors/article, an author with two institutional affiliations (relative to one) will, on average, increase the citations a paper receives by 11.8 (p < 0.001). This effect is even more pronounced when there are more than 9 authors listed; here, citations increase on average by 20.8 (p < 0.001) for two affiliations, 39.2 (p < 0.001) for three affiliations and 57.3 (p < 0.001) for four affiliations, relative to the reference group.

If we now interpret these effects while holding the number of affiliations constant, for researchers with only one affiliation, increasing the number of authors on a paper results in a mean increase in the citations received across all levels of author number (e.g. 35.8 for author number = 4, relative to 1, p < 0.001). However, as the maximum number of affiliations held increases, this effect remains significant only for the greater author numbers (i.e. 4 vs 1).


Table 3: Summary of regression model output for the effect of author number and maximum affiliation on average citation counts. Within each stratum, the average change in citation count is relative to the first (reference) level. 

These data align with previous observations in highlighting the increasing leverage of teamwork in scientific research.{Wuchty, 2007 #2742; Jones, 2008 #2741} However, they also provide some insight into the relatively novel notion that multiple author affiliations may play a positive role in the production of high-impact science.{Hottenrott, 2017 #2740} The holding of multiple affiliations by authors should be viewed by institutional boards as a virtue and not a vice, as it appears that this 'polygamous' behaviour may be advantageous to all. 


#####


Supplementary Table: Linear regression results for modelling the effects of author number and maximum affiliation on citations.

```{r model, echo=F}

# Linear model with both main effects and their interaction
# (Affil_Max * Cat_Auth expands to Affil_Max + Cat_Auth + Affil_Max:Cat_Auth)
mod1a <- lm(Tot_Cites ~ Affil_Max * Cat_Auth, data = append_wide_sub)
# Assemble estimates, standard errors, 95% CIs, t statistics and p-values for the table
cia <- as.data.frame(confint(mod1a))
mod1a_sum <- coef(summary(mod1a))
mod1a_out <- cbind(coef(mod1a), mod1a_sum[, 2], cia[, 1], cia[, 2], mod1a_sum[, 3], mod1a_sum[, 4])
mod1a_out <- data.frame(mod1a_out)
colnames(mod1a_out) <- c("beta", "S.E.", "2.5%", "97.5%", "t", "p-value")
kable(mod1a_out, digits=6)
```