This is a working document, and is not yet ready for public consumption.
1 Background
In July 2021, Roche stopped development with proprietary statistical software to focus on a new backbone of R packages for the analysis of clinical trials. A 10+ year old codebase written in a proprietary language (named STREAM) moved to maintenance-only updates, and development resources were shifted in their entirety to the new R-based codebase, comprising rOAK, admiral (Straub et al. 2023) and NEST (NEST 2023), which form the core of the pharmaverse (pharmaverse 2023). The design of clinical trials and exploratory data analysis at Roche have a longer history of R use, with packages like rpact (Anders Bilgrau and Krøgholt 2023) and crmPack (Sabanés Bové et al. 2019) used for many years. This use has continued to increase in recent years through initiatives like openstatsware (openstatsware 2023), which aim to collaboratively fill software gaps in clinical trial design with open source software.
1.1 Aim
In this document we explore the return on investment (ROI) from the perspective of an organisation, through both qualitative and quantitative assessments, shedding light on the tangible benefits derived from collaborative development in the pharmaceutical domain.
2 Qualitative analysis
Regardless of whether a company open sources its own code, as our industry moves away from proprietary languages we are likely to be both depending on and extending open source software. A core question is then whether there is an added benefit to open sourcing our own code and actively contributing back to the projects we use.
2.1 PHUSE guidance
The PHUSE Open Source Guidance (PhUSE 2023) document outlines the benefits of open source across 3 main points:
- Code used in clinical reporting is about the summation and presentation of insights where the process of how they were generated should be transparent. The guidance refers to this code as ‘post-competitive’ intellectual property.
- The creation of ADaM datasets and TFLs is a process that is repeated across companies, so there is the opportunity to co-create with a common goal.
- Open source software ensures outputs are reproducible without needing to purchase a licence.
Post-competitive intellectual property is a term defined in the guidance as:
A less common term we have defined to be where code collaboration improves the efficiency of insights, rather than the creation of insights that would otherwise not be possible. In the context of PHUSE collaborators, this includes packages that take CDISC data and apply templated data steps and visualizations to prepare a CSR, like those seen in the pharmaverse.
2.2 Increasing acceptance
‘A rising tide lifts all boats’ embodies the position Roche took in the early days of open sourcing. The specific intent was that, by open sourcing our code and collaborating with other companies, we could help regulators and our industry gain confidence in the use of open source software, regardless of whether we saw external contributions back to the code we released.
This was a strategy that worked across three layers:
- Open sourcing our own code, and contributing to other projects we leverage
- Meaningful support for pharma-specific packages
  - Examples include allocating sufficient time for employees to contribute to the codebase as a core of ‘how we work’, the creation of a dedicated team to support pharmaverse packages like admiral and NEST, and active investment in co-founding R/Pharma and its non-profit Open Source in Pharma (Pharma 2023)
- Supporting the wider R community
  - Our pharma-specific R code depends on non-industry-specific R packages.
  - Examples include being a platinum member of the R Consortium (Consortium 2023), and being the first pharma company to co-host useR!, the main global R conference.
2.3 Mitigating risks
2.3.1 PHUSE OS guidance
The PHUSE Open Source Guidance (PhUSE 2023) document highlights several risks from open sourcing code. Many of these can be directly countered, as shown in Table 1.
| Risk statement | Rebuttal |
|---|---|
| We could be liable to end users | All licences Roche use have provisions in them that explicitly state the code is |
| Opening up our code could risk our reputation if bugs are found | Ultimately this would be to the benefit of patients if we are made aware of ways to improve the code |
| Open sourcing is an unnecessary risk | |
2.3.2 EU CRA act
Another developing complexity is the Cyber Resilience Act (Commission 2022), currently being discussed in the European Union. While we currently only have a draft of the new law, it is likely to have implications for the use of, and contributions to, open source software by pharma companies.
Examples include:
- Executing risk assessments, including vulnerability scanning
- Having a policy and process for timely response to vulnerabilities
- Maintaining a ‘bill of materials’ summarising the components used
- Having this overseen by an ‘open-source software steward’, which if not Roche could potentially be another entity like the R Consortium or PHUSE
2.4 Other benefits
- Many graduates today (or from the last 5 years) across quantitative science degrees have R and/or Python experience, and are often motivated by the idea of using common open source languages rather than the proprietary and banking/pharma languages of the past
- We can share ideas, and get feedback from the wider community, allowing innovation to more easily impact across companies
- We can more easily collaborate with other companies, and have a common language to discuss ideas and code
- Talent can be more easily attracted to Roche, as they can see the contributions we make to the wider community, and the opportunities to work on open source projects
- Talent can more easily flow between companies, as we share more of our ADaM and TLG code
3 Quantitative analysis
As the codebase is version controlled with git, we can extract some information about who has made contributions, and what types of contributions those were. There are 20 packages in the pharmaverse relevant to Roche, and 6 packages from openstatsware. Across these packages, we have 203 Roche people working across 205 known GitHub user accounts.
In this analysis we look at commits (Section 3.2), lines of code (Section 3.3) and issues/comments (Section 3.4) to better understand how our codebase is influenced by individuals from outside our own company.
3.1 Method
We maintain a list of the GitHub handles that Roche employees use to contribute to the pharmaverse packages we consider relevant to submissions within Roche, or to the openstatsware packages used for statistical design or analysis.
The following known caveats exist with this analysis.
- Some projects were started internally on our self-hosted git server, then migrated to github.com. This means we know the commits can be attributed to Roche, but we do not have GitHub user data attributed to commits generated when the project was internal only.
- Some people have moved companies, but are still involved in the pharmaverse.
- We are likely missing some Roche GitHub handles, so true Roche contributions are likely to be higher.
- When working with lines of code, some code is generated by scripts (e.g. devtools::document()), which can automatically add the majority of the code in a repo, and so certain file types are ignored.
- Commits are an indicator of contributions, and a single commit could range from adding a key piece of functionality to fixing a spelling mistake.
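The attribution rule used in this analysis (a person counts as Roche if their handle is on the list, while a commit counts as Roche only if it was made during their employment) can be sketched as follows. This is a minimal illustration in Python, not the code behind this report; the handles, dates and the `EMPLOYMENT` structure are all hypothetical.

```python
from datetime import date

# Hypothetical handle list: GitHub handle -> (start, end) of Roche employment.
# end=None means still employed. All names and dates are made up.
EMPLOYMENT = {
    "alice-r": (date(2019, 1, 1), None),
    "bob-r": (date(2020, 6, 1), date(2022, 12, 31)),
}

def is_roche_commit(handle, commit_date, employment=EMPLOYMENT):
    """True if the handle is known and the commit falls in the employment window."""
    window = employment.get(handle)
    if window is None:
        return False  # unknown handle: counted as external (a possible false negative)
    start, end = window
    return start <= commit_date and (end is None or commit_date <= end)

def roche_share(commits, employment=EMPLOYMENT):
    """Return (% contributors ever at Roche, % commits made while at Roche)."""
    handles = {handle for handle, _ in commits}
    roche_people = {handle for handle in handles if handle in employment}
    roche_commits = sum(is_roche_commit(h, d, employment) for h, d in commits)
    pct_people = 100 * len(roche_people) / len(handles)
    pct_commits = 100 * roche_commits / len(commits)
    return pct_people, pct_commits

commits = [
    ("alice-r", date(2021, 3, 1)),
    ("bob-r", date(2021, 3, 2)),
    ("bob-r", date(2023, 5, 1)),    # after leaving: the person counts, the commit does not
    ("carol-ext", date(2021, 4, 1)),
]
pct_people, pct_commits = roche_share(commits)
```

Keeping the two percentages separate makes the caveat about people moving companies visible: a former employee still raises the contributor share, but only their commits made while employed raise the commit share.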
3.2 Commits
3.2.1 Commits summary
Figure 1 shows.
3.2.2 Commits by package
Contributions to the Roche relevant pharmaverse
Only includes commit data for now; other forms of contribution (e.g. issues) will be added.

| Package | Age (y) | Contributors (Roche + externals) | Commits (Roche + externals) | % Roche contributors¹ | % Roche commits² | Monthly trend³ |
|---|---|---|---|---|---|---|
| pharmaverse | | | | | | |
| | 6.7 | 4 | 252 | 100% | 100% | |
| | 6.3 | 26 | 1017 | 92% | 100% | |
| | 1.7 | 9 | 373 | 89% | 99% | |
| | 7 | 55 | 2809 | 93% | 95% | |
| | 5.4 | 30 | 1255 | 93% | 91% | |
| | 2 | 20 | 163 | 90% | 91% | |
| | 7 | 38 | 2666 | 89% | 91% | |
| | 6.5 | 48 | 1801 | 94% | 90% | |
| | 2 | 22 | 303 | 91% | 89% | |
| | 1.7 | 21 | 564 | 62% | 88% | |
| | 2 | 23 | 228 | 91% | 85% | |
| | 1.3 | 15 | 142 | 40% | 79% | |
| | 2.1 | 23 | 1000 | 48% | 78% | |
| | 3.1 | 74 | 4335 | 50% | 63% | |
| | 1.7 | 22 | 626 | 59% | 50% | |
| | 3.2 | 9 | 766 | 44% | 4% | |
| | 1.5 | 18 | 1505 | 22% | 3% | |
| | 3.1 | 8 | 369 | 12% | 1% | |
| | 3 | 4 | 87 | 0% | 0% | |
| | 2.2 | 3 | 160 | 0% | 0% | |
| openstatsware | | | | | | |
| | 2.8 | 8 | 1008 | 75% | 100% | |
| | 1.8 | 9 | 302 | 89% | 93% | |
| | 9.7 | 20 | 870 | 50% | 79% | |
| | 2.2 | 8 | 131 | 75% | 76% | |
| | 1.9 | 22 | 333 | 50% | 74% | |
| | 1 | 4 | 267 | 0% | 0% | |

Our list of GitHub handles of Roche employees may not be complete. Our data on when employed (if they joined or left Roche) is manually added, so may not be exact. Generated with 23332 commits from 2014-06-12 to 2023-12-28.
¹ Person counted if ever employed by Roche, so contributions could occur while employed externally.
² Where a person joined or left Roche, commits are assigned to Roche only if employed when the commit was made.
³ Trend in % Roche commits month to month. X axis is not fixed across packages.
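The monthly trend reported in the table above can be read as the share of each calendar month's commits that come from Roche authors. A minimal Python sketch of that calculation follows; the data are made up, and the `is_roche` flag is assumed to have been computed beforehand from the handle list.

```python
from collections import defaultdict
from datetime import date

def monthly_roche_share(commits):
    """Map 'YYYY-MM' -> % of that month's commits flagged as Roche.

    `commits` is an iterable of (commit_date, is_roche) pairs. Months
    without any commits are simply absent from the result.
    """
    totals = defaultdict(int)
    roche = defaultdict(int)
    for commit_date, is_roche in commits:
        month = commit_date.strftime("%Y-%m")
        totals[month] += 1
        roche[month] += int(is_roche)
    return {month: 100 * roche[month] / totals[month] for month in sorted(totals)}

# Made-up commits for a single hypothetical package.
commits = [
    (date(2023, 1, 5), True),
    (date(2023, 1, 20), False),
    (date(2023, 2, 3), True),
    (date(2023, 2, 9), True),
    (date(2023, 2, 14), True),
    (date(2023, 2, 28), False),
]
trend = monthly_roche_share(commits)
```

Because months with no activity are omitted rather than reported as 0%, young or quiet packages yield short series, which is one reason the x axis cannot be fixed across packages.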
3.3 Lines of code
This section looks at the lines of code per person present in the most recent version of the code. We exclude any code in the following folders in order to remove the risk of machine generated code skewing the results:
- man/
- misc/
- inst/
- data/
The number of people may differ, as in this section we are using the information from the person’s .gitconfig file, rather than the GitHub user account information.
3.3.1 Lines of code by package
This data source uses information from the git commits, so we are relying on defining whether a person is Roche or not based on what email was in their .gitconfig file (or their GitHub account, if the change was made in the GitHub UI). This is currently very inaccurate, so there are many false negatives where we miss Roche people contributing from a personal email account we do not know.
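To make the email-based attribution concrete, the sketch below counts surviving lines per author email from text in the format produced by `git blame --line-porcelain`, skips files under excluded folders, and reports the share attributed to a company domain. It is a hedged Python illustration rather than the code behind this report: the excluded folder set, the `roche.com` domain rule and the sample blame text are assumptions.

```python
from collections import Counter
from pathlib import PurePosixPath

# Assumption: top-level folders to skip; adjust to match the real analysis.
EXCLUDED_DIRS = {"man", "misc", "inst", "data", "docs"}
# Assumption: e-mail domains treated as Roche, for illustration only.
ROCHE_DOMAINS = {"roche.com"}

def is_excluded(path):
    """True if the file sits under an excluded top-level folder."""
    parts = PurePosixPath(path).parts
    return bool(parts) and parts[0] in EXCLUDED_DIRS

def count_lines_by_email(porcelain):
    """Count lines per author e-mail in `git blame --line-porcelain` text."""
    counts = Counter()
    for line in porcelain.splitlines():
        # Header fields are unindented; source lines start with a tab,
        # so code that merely mentions "author-mail" is never counted.
        if line.startswith("author-mail "):
            email = line[len("author-mail "):].strip().strip("<>").lower()
            counts[email] += 1
    return counts

def pct_roche(counts):
    """Percentage of counted lines whose author e-mail has a Roche domain."""
    total = sum(counts.values())
    roche = sum(n for email, n in counts.items()
                if email.rsplit("@", 1)[-1] in ROCHE_DOMAINS)
    return 100 * roche / total if total else 0.0

def fake_blame(emails):
    """Build abbreviated porcelain-style text (real output has more header fields)."""
    blocks = []
    for i, email in enumerate(emails, start=1):
        blocks += [f"{'0' * 40} {i} {i}", f"author-mail <{email}>", "\tsome_code()"]
    return "\n".join(blocks)

blame_by_file = {
    "R/derive.R": fake_blame(["a@roche.com", "a@roche.com", "b@example.org"]),
    "R/utils.R": fake_blame(["c@roche.com"]),
    "man/derive.Rd": fake_blame(["b@example.org"] * 10),  # generated docs: skipped
}

counts = Counter()
for path, text in blame_by_file.items():
    if not is_excluded(path):
        counts += count_lines_by_email(text)

share = pct_roche(counts)
```

A domain rule like this illustrates the false negatives described above: a Roche employee committing from a personal address would be counted as external unless their address is added to a lookup.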
Lines of code attributed to Roche authors
Data comes from git blame of the current package, without the /docs, man or inst folders.

| Package | % Roche¹ |
|---|---|
| pharmaverse | |
| | 100% |
| | 100% |
| | 98% |
| | 93% |
| | 93% |
| | 89% |
| | 87% |
| | 83% |
| | 75% |
| | 68% |
| | 60% |
| | 57% |
| | 55% |
| | 50% |
| | 48% |
| | 47% |
| | 17% |
| | 0% |
| | 0% |
| | 0% |
| openstatsware | |
| | 100% |
| | 98% |
| | 90% |
| | 64% |
| | 63% |
| | 0% |

Our list of GitHub handles of Roche employees may not be complete. Our data on when employed (if they joined or left Roche) is manually added, so may not be exact.
¹ A line of code is attributed to Roche if the person ever worked for Roche. xportr is an example where lines of code were contributed before the person joined Roche’s team.
3.4 Issues & Comments
Issues, and comments on issues, capture an additional form of collaboration that does not require writing code.
3.4.1 Issue and comments summary
There are 1808 issues and 6165 comments in the pharmaverse. There are 205 issues and 1530 comments in openstatsware.
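The per-package shares of issues and comments coming from Roche can be derived by grouping records by package and kind, then dividing the Roche count by the total. The Python sketch below shows one way to do this; the package names, handles and record layout are hypothetical, and it is not the code behind this report.

```python
from collections import defaultdict

def roche_share_by_package(records, roche_handles):
    """Return {package: {kind: % authored by Roche}}.

    `records` is an iterable of (package, kind, author_handle) tuples,
    where kind is "issue" or "comment".
    """
    totals = defaultdict(lambda: defaultdict(int))
    roche = defaultdict(lambda: defaultdict(int))
    for package, kind, author in records:
        totals[package][kind] += 1
        if author in roche_handles:
            roche[package][kind] += 1
    return {
        package: {kind: 100 * roche[package][kind] / n for kind, n in kinds.items()}
        for package, kinds in totals.items()
    }

# Hypothetical handles and records.
roche_handles = {"alice-r", "bob-r"}
records = [
    ("pkgA", "issue", "alice-r"),
    ("pkgA", "issue", "carol-ext"),
    ("pkgA", "comment", "bob-r"),
    ("pkgB", "issue", "dave-ext"),
]
shares = roche_share_by_package(records, roche_handles)
```

Only kinds that actually occur for a package appear in its result, so a package with issues but no comments reports no comment percentage rather than a misleading 0%.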
3.4.2 Issues and comments by package
Issues and comments from Roche vs external collaborators
Data is captured going back 12 months only.

| Package | Issues (Roche + externals) | Comments (Roche + externals) | % Roche issues | Issues trend¹ | % Roche comments | Comments trend¹ |
|---|---|---|---|---|---|---|
| pharmaverse | | | | | | |
| | 415 | 1680 | 51% | | 52% | |
| | 85 | 238 | 38% | | 31% | |
| | 35 | 86 | 77% | | 78% | |
| | 62 | 107 | 81% | | 80% | |
| | 59 | 138 | 7% | | 7% | |
| | 19 | 34 | 100% | | 100% | |
| | 47 | 95 | 49% | | 46% | |
| | 8 | 14 | 12% | | 7% | |
| | 5 | 17 | 20% | | 29% | |
| | 1 | 2 | 0% | | 0% | |
| | 193 | 515 | 92% | | 91% | |
| | 127 | 592 | 80% | | 81% | |
| | 49 | 233 | 96% | | 98% | |
| | 72 | 253 | 85% | | 77% | |
| | 77 | 264 | 81% | | 78% | |
| | 57 | 158 | 77% | | 80% | |
| | 209 | 1016 | 85% | | 79% | |
| | 187 | 385 | 97% | | 95% | |
| | 101 | 338 | 15% | | 24% | |
| openstatsware | | | | | | |
| | 44 | 214 | 0% | | 0% | |
| | 153 | 388 | 100% | | 100% | |
| | 76 | 244 | 100% | | 100% | |
| | 128 | 580 | 55% | | 54% | |
| | 10 | 37 | 60% | | 35% | |
| | 22 | 67 | 86% | | 94% | |

Our list of GitHub handles of Roche employees may not be complete. Our data on when employed (if they joined or left Roche) is manually added, so may not be exact.
¹ Trend in % coming from Roche over the last 12 months.
4 Discussion
5 References
Citation
@online{black2024,
author = {Black, James},
title = {Exploring External Contributions to the {R} Codebase Used by
{Roche} to design and Analyse Late-Stage Clinical Trials},
date = {2024-03-12},
langid = {en},
abstract = {R is increasingly used in the pharmaceutical industry as
the backbone for the pan-study codebase for the design and analysis
of clinical trials. In parallel with this shift to R, many
companies are open sourcing, and collaborating, on the
post-competitive code used across studies. The Pharmaverse and
openstatsware are two example initiatives for statistical
programming, and biostatistics, respectively. While numerous benefits
come from companies open sourcing their R codebase, from better
talent acquisition, to transparency with regulators, activity on git
repos provides an insight into the return on investment (ROI) from
external contributions to the codebase a company depends on. In this
document we explore the ROI as assessed via external contributions
to the late-stage codebase at Roche, shedding light on the tangible
benefits derived from collaborative development in the
pharmaceutical domain.}
}


