You Don’t Need OOP for Data Science

tidyverse

tidymodels

For R users, the R6 package - among several object-oriented systems for R - brings forth the elegance of object-oriented programming (OOP) to a traditionally ‘stats first’ language. That said, do you need to know OOP to be a successful data scientist? No, probably not.

Author

Affiliation

Javier Orraca-Deatcu

Scatter Podcast

Published

November 23, 2023

OOP with R

For R users, the R6 package - one of several object-oriented systems for R - brings forth the elegance of object-oriented programming (OOP) to a traditionally ‘stats first’ language. My friend the Tidy Wizard, a barely configured GPT tool I built that specializes in the R language, says that R6 “bestows upon the user the ability to craft sophisticated and modular software, a boon in various business scenarios.”

The data scientists I meet in the wild are almost always Python users, and they often ask me the rhetorical question, “Can R do OOP and encapsulation?” I used to be embarrassed about this… Did I miss some mission-critical component of data science in grad school? Have I been learning from the wrong books and tutorials?

Having worked alongside amazing software engineers over recent years, I’ve come to realize OOP is fundamental in computer science education. The de facto OOP nature of Python makes the transition from C/C++/Java to Python a breeze and delight. This is one of the many reasons why we see so much OOP development in data science organizations from Python-based data science practitioners and educators. But what if you don’t have a formal software background and gravitated to R for statistics, functional programming, and data science?

My thoughts on this topic will differ from those of others in the R community, most likely folks with formal computer science / software engineering backgrounds. I do not have a formal CS background, so bear with me while I work through this heresy: You don’t need to know OOP to be a successful data scientist.

In data science, would knowing OOP help your future career prospects? Probably, in the same way a PhD might help you climb the corporate ladder more quickly. In my experience (and again, as a primary R user), not knowing the intricacies of OOP shouldn’t hold you back from interactive web app development, ML training / production at scale, containerization via Docker, parallelized / vectorized function writing, package development, NLP, deep learning, etc.

On a daily basis, most R users don’t need OOP because they work with objects and functions: they apply functions to objects to create new objects, and most of this happens with one class, the data.frame/tibble. OOP in R also isn’t limited to R6. In fact, we regularly call functions in R that rely on generic-function OOP. For example, base R’s print() is an S3 generic, and sum() is an internal generic that dispatches on an object’s class in much the same way.
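To make the generic-function idea concrete, here is a minimal sketch of S3 dispatch using only base R. The `temperature` class, its constructor `new_temperature()`, and the method body are hypothetical names invented for illustration; the point is only that calling `print()` on an object picks a method based on the object’s class.

```r
# Hypothetical constructor: tags a numeric vector with the S3 class "temperature"
new_temperature <- function(celsius) {
  stopifnot(is.numeric(celsius))
  structure(celsius, class = "temperature")
}

# print() is an S3 generic, so defining print.<class> is enough
# for print() to dispatch here when given a "temperature" object
print.temperature <- function(x, ...) {
  cat("Temperature (C):", format(unclass(x), nsmall = 1), "\n")
  invisible(x)
}

temps <- new_temperature(c(21.5, 19))

print(temps)   # dispatches to print.temperature()
class(temps)   # the class attribute is what drives the dispatch
```

No class definition or constructor call on a class object is involved: the method is just a function whose name follows the `generic.class` convention.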

In Advanced R, Hadley Wickham shares the following insights about the different OO systems in R:

Generally in R, functional programming is much more important than object-oriented programming, because you typically solve complex problems by decomposing them into simple functions, not simple objects. Nevertheless, there are important reasons to learn each of the three systems:

S3 allows your functions to return rich results with user-friendly display and programmer-friendly internals. S3 is used throughout base R, so it’s important to master if you want to extend base R functions to work with new types of input.

R6 provides a standardised way to escape R’s copy-on-modify semantics. This is particularly important if you want to model objects that exist independently of R. Today, a common need for R6 is to model data that comes from a web API, and where changes come from inside or outside of R.

S4 is a rigorous system that forces you to think carefully about program design. It’s particularly well-suited for building large systems that evolve over time and will receive contributions from many programmers. This is why it is used by the Bioconductor project, so another reason to learn S4 is to equip you to contribute to that project.
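The point about R6 escaping copy-on-modify semantics is easier to see side by side. The sketch below compares a plain list with a small R6 class; `Account` and its `withdraw()` method are hypothetical names used only for this illustration.

```r
library(R6)

# A plain list follows copy-on-modify: changing the copy leaves the original alone
account_list <- list(balance = 100)
copy_list <- account_list
copy_list$balance <- 50
account_list$balance   # still 100

# Hypothetical R6 class: its objects have reference semantics
Account <- R6Class("Account",
  public = list(
    balance = NULL,
    initialize = function(balance) {
      self$balance <- balance
    },
    withdraw = function(amount) {
      self$balance <- self$balance - amount
      invisible(self)
    }
  )
)

account <- Account$new(100)
same_account <- account          # a second name for the same object, not a copy
same_account$withdraw(50)
account$balance                  # 50: the change is visible through both names

independent <- account$clone()   # clone() makes an explicit copy
independent$withdraw(25)
account$balance                  # still 50
```

This in-place mutation is what makes R6 a natural fit for objects that represent external state, and it is also why an explicit `clone()` is needed whenever an independent copy is wanted.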

Not mentioned above is the OOP system S7, expected to be the successor to S3 and S4. Because this framework is still at an early stage, I’ve focused most of this post on R6. S7 “has been designed and implemented collaboratively by the R Consortium Object-Oriented Programming Working Group,” and their post on the high-level motivation for S7 details how it aims to resolve some of the challenges with R’s two built-in OO systems, S3 and S4.

Additional Resources

After a pattern of writing, modifying, and deleting a more robust background section for object-oriented systems with R, I’ve decided to instead point you to some of the best resources I’ve come across:

The Advanced R book, by Hadley Wickham, includes several chapters on OOP with R including base types, S3, R6, S4, and a trade-offs chapter. The Introduction to OOP section is a great place to start for a high-level overview.
Dario Radečić published a phenomenal complete R6 guide for Appsilon’s blog. It’s a bookmark-worthy R6 overview that touches on its classes and constructors/methods including $new(), $initialize(), $print(), and more.

R6 Example Use Cases

With the assistance of the Tidy Wizard, let’s dive into some examples that aim to “illuminate a few practical applications where R6 can lend its prowess.” 🧙

CRM Systems

When we ponder Customer Relationship Management, we find ourselves amidst a vast sea of data, encompassing customer interactions, preferences, transaction histories, and much more. In the traditional approach, this data might be stored in spreadsheets or databases as tabular data. While this method is straightforward, it often leads to a fragmented view of the customer, with data scattered across multiple tables or sheets.

Herein lies the profound advantage of employing R6 in a CRM context:

Unified Customer View: By defining an R6 class for a customer, you encapsulate all relevant information and behaviors related to that customer in one place. This approach fosters a holistic view of each customer, integrating data from various sources. It allows for more coherent and comprehensive analysis and decision-making.

Behavior Encapsulation: With R6, you’re not just storing data; you’re also encapsulating behaviors. For instance, an R6 class for a customer could include methods to calculate lifetime value, predict churn, or even trigger personalized marketing actions. This encapsulation of behavior with data enriches the CRM’s capabilities.

Flexibility and Scalability: R6’s object-oriented nature allows for more scalable and adaptable solutions. As business needs evolve, you can extend or modify the R6 classes without overhauling your entire data structure. This flexibility is crucial in the ever-changing business landscape.

Interactivity and Automation: R6 classes can interact with each other, enabling automated workflows. For instance, a class representing an email campaign can interact with the customer class to send personalized emails, log interactions, and update customer profiles, all in an automated manner.

# Defining a Customer class
Customer <- R6::R6Class("Customer",
    public = list(
        name = NULL,
        email = NULL,
        transaction_history = list(),
        initialize = function(name, email) {
            self$name <- name
            self$email <- email
        },
        record_transaction = function(transaction) {
            self$transaction_history <- c(self$transaction_history, list(transaction))
        },
        calculate_lifetime_value = function() {
          self$transaction_history |> 
            purrr::map_dbl(~ .x$amount) |> 
            sum()
        }
    )
)

# Example usage
customer1 <- Customer$new(name = "John Doe", email = "john@example.com")
customer1$record_transaction(list(date = "2021-01-01", amount = 100))
customer1$record_transaction(list(date = "2021-02-01", amount = 150))
lifetime_value <- customer1$calculate_lifetime_value()

print(lifetime_value)

[1] 250

In this example, the Customer class not only stores basic information but also tracks transaction history and calculates the lifetime value. This model provides a more dynamic and insightful perspective on the customer than static spreadsheets or basic tabular data.
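Since encapsulation is the question Python users tend to raise, it is worth noting that every field in the class above is public. The following is a minimal sketch, not part of the original example, of how the same idea could hide its data using R6’s `private` and `active` lists. `PrivateCustomer`, the `amounts` field, and the read-only `lifetime_value` binding are all hypothetical names.

```r
library(R6)

# Hypothetical variant of the Customer class with a private transaction log
PrivateCustomer <- R6Class("PrivateCustomer",
  public = list(
    name = NULL,
    initialize = function(name) {
      self$name <- name
    },
    record_transaction = function(amount) {
      # Validate input before touching internal state
      stopifnot(is.numeric(amount), length(amount) == 1, amount >= 0)
      private$amounts <- c(private$amounts, amount)
      invisible(self)  # returning self allows method chaining
    }
  ),
  active = list(
    # Read-only view of the total; assigning to it raises an error
    lifetime_value = function(value) {
      if (!missing(value)) stop("lifetime_value is read-only", call. = FALSE)
      sum(private$amounts)
    }
  ),
  private = list(
    amounts = numeric(0)
  )
)

customer <- PrivateCustomer$new("John Doe")
customer$record_transaction(100)$record_transaction(150)
customer$lifetime_value   # 250
customer$amounts          # NULL: the private field is not reachable from outside
```

The trade-off is a little more ceremony in exchange for a guarantee that transactions can only be added through `record_transaction()`.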

Financial Modeling

Let’s craft an R6 class that models a portfolio of financial instruments, each with its own characteristics and behaviors. This portfolio can include various types of investments like stocks, bonds, and mutual funds, each represented by their own R6 classes.

The beauty of using R6 in this context lies in its ability to encapsulate the complexities of financial instruments within their respective classes, thereby making the overall portfolio management more intuitive and robust. Let’s delve into this with a code example:

library(R6)

# Define a generic FinancialInstrument class
FinancialInstrument <- R6::R6Class("FinancialInstrument",
    public = list(
        identifier = NULL,
        principal = NULL,
        initialize = function(identifier, principal) {
            self$identifier <- identifier
            self$principal <- principal
        },
        calculate_return = function() {
            # Abstract method: each subclass supplies its own return calculation
            stop("calculate_return() must be implemented by a subclass")
        }
    )
)

# Extend FinancialInstrument for specific types
Stock <- R6::R6Class("Stock", inherit = FinancialInstrument,
    public = list(
        dividend_yield = NULL,
        initialize = function(identifier, principal, dividend_yield) {
            super$initialize(identifier, principal)
            self$dividend_yield <- dividend_yield
        },
        calculate_return = function() {
            return(self$principal * self$dividend_yield)
        }
    )
)

Bond <- R6::R6Class("Bond", inherit = FinancialInstrument,
    public = list(
        interest_rate = NULL,
        initialize = function(identifier, principal, interest_rate) {
            super$initialize(identifier, principal)
            self$interest_rate <- interest_rate
        },
        calculate_return = function() {
            return(self$principal * self$interest_rate)
        }
    )
)

# Define a Portfolio class that can hold multiple financial instruments
Portfolio <- R6::R6Class("Portfolio",
    public = list(
        instruments = list(),
        add_instrument = function(instrument) {
            self$instruments[[instrument$identifier]] <- instrument
        },
        total_return = function() {
          self$instruments |> 
            purrr::map_dbl(\(x) x$calculate_return()) |> 
            sum()
        }
    )
)

# Example usage
portfolio <- Portfolio$new()
portfolio$add_instrument(Stock$new("AAPL", 10000, 0.02))
portfolio$add_instrument(Bond$new("US-Gov", 5000, 0.03))

total_return <- portfolio$total_return()
print(total_return)

[1] 350

In this expanded scenario, we have:

A base class FinancialInstrument that defines a generic financial instrument.
Derived classes Stock and Bond that inherit from FinancialInstrument and implement specific behaviors, such as different ways of calculating returns.
A Portfolio class that can contain a collection of different financial instruments. It can calculate the total return of the portfolio by summing the returns of each individual instrument.

This object-oriented approach provided by R6 allows for clear separation and encapsulation of the logic for each type of financial instrument. It makes the code modular and easier to understand and maintain. For a business leader or data analyst, this translates to a system that is more adaptable to new types of financial instruments and changing financial models, ultimately aiding in making more informed investment decisions.
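As a rough illustration of that adaptability, the sketch below adds a new instrument type without touching the portfolio logic. To keep it self-contained, it redefines compact stand-ins for FinancialInstrument and Portfolio (using base R’s `vapply()` in place of purrr); the `MutualFund` class, its `gross_yield` and `expense_ratio` fields, and the return formula are assumptions made up for this example, not a real valuation method.

```r
library(R6)

# Compact stand-ins for the base classes defined above
FinancialInstrument <- R6Class("FinancialInstrument",
  public = list(
    identifier = NULL,
    principal = NULL,
    initialize = function(identifier, principal) {
      self$identifier <- identifier
      self$principal <- principal
    }
  )
)

Portfolio <- R6Class("Portfolio",
  public = list(
    instruments = list(),
    add_instrument = function(instrument) {
      self$instruments[[instrument$identifier]] <- instrument
      invisible(self)
    },
    total_return = function() {
      # Relies only on each instrument exposing calculate_return()
      sum(vapply(self$instruments, function(x) x$calculate_return(), numeric(1)))
    }
  )
)

# Hypothetical new type: a fund whose yield is reduced by an expense ratio
MutualFund <- R6Class("MutualFund", inherit = FinancialInstrument,
  public = list(
    gross_yield = NULL,
    expense_ratio = NULL,
    initialize = function(identifier, principal, gross_yield, expense_ratio) {
      super$initialize(identifier, principal)
      self$gross_yield <- gross_yield
      self$expense_ratio <- expense_ratio
    },
    calculate_return = function() {
      self$principal * (self$gross_yield - self$expense_ratio)
    }
  )
)

portfolio <- Portfolio$new()
portfolio$add_instrument(MutualFund$new("INDEX-FUND", 20000, 0.05, 0.01))
portfolio$total_return()   # 20000 * (0.05 - 0.01)
```

Portfolio never needs to know which kinds of instruments it holds; any class with a `calculate_return()` method fits.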

Library System

Let’s develop a simplified system for managing a library which includes tracking books, patrons, and book loans. This example will demonstrate how R6 can be used to model real-world entities and their interactions.

In our library system, we’ll have three main classes:

Book: Represents a book in the library.
Patron: Represents a library member who can borrow books.
Library: Manages the collection of books and patrons.

Here’s the R code for this use case:

library(R6)

# Book class
Book <- R6::R6Class("Book",
    public = list(
        title = NULL,
        author = NULL,
        is_loaned = FALSE,
        initialize = function(title, author) {
            self$title <- title
            self$author <- author
            self$is_loaned <- FALSE
        },
        loan = function() {
            self$is_loaned <- TRUE
        },
        return_book = function() {
            self$is_loaned <- FALSE
        }
    )
)

# Patron class
Patron <- R6::R6Class("Patron",
    public = list(
        name = NULL,
        books_loaned = list(),
        transaction_history = list(),
        initialize = function(name) {
            self$name <- name
            self$books_loaned <- list()
            self$transaction_history <- list()
        },
        borrow_book = function(book) {
            if (!book$is_loaned) {
                book$loan()
                self$books_loaned[[book$title]] <- book
                self$transaction_history <- c(self$transaction_history, 
                                              list(paste(Sys.time(), ": Borrowed", book$title)))
            } else {
                message("This book is currently loaned out.")
            }
        },
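        return_book = function(book) {
            # Only accept books this patron actually has on loan
            if (!(book$title %in% names(self$books_loaned))) {
                message("This patron has not borrowed this book.")
                return(invisible(self))
            }
            book$return_book()
            self$books_loaned[[book$title]] <- NULL
            self$transaction_history <- c(self$transaction_history, 
                                          list(paste(Sys.time(), ": Returned", book$title)))
        },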
        get_transaction_history = function() {
            return(self$transaction_history)
        }
    )
)

# Library class
Library <- R6::R6Class("Library",
    public = list(
        books = list(),
        patrons = list(),
        initialize = function() {
            self$books <- list()
            self$patrons <- list()
        },
        add_book = function(book) {
            self$books[[book$title]] <- book
        },
        add_patron = function(patron) {
            self$patrons[[patron$name]] <- patron
        }
    )
)

# Example usage
library_system <- Library$new()
book1 <- Book$new("The Great Gatsby", "F. Scott Fitzgerald")
book2 <- Book$new("1984", "George Orwell")
patron1 <- Patron$new("John Doe")

library_system$add_book(book1)
library_system$add_book(book2)
library_system$add_patron(patron1)

# John Doe borrows "The Great Gatsby"
patron1$borrow_book(library_system$books[["The Great Gatsby"]])

# Check if the book is loaned
print(library_system$books[["The Great Gatsby"]]$is_loaned)

[1] TRUE

# John Doe returns "The Great Gatsby"
patron1$return_book(library_system$books[["The Great Gatsby"]])

# Check if the book is loaned
print(library_system$books[["The Great Gatsby"]]$is_loaned)

[1] FALSE

# View John Doe's transaction history
print(patron1$get_transaction_history())

[[1]]
[1] "2023-11-28 14:49:54.173788 : Borrowed The Great Gatsby"

[[2]]
[1] "2023-11-28 14:49:54.175161 : Returned The Great Gatsby"

In the above example:

The Book class represents a book with methods to loan and return it
The Patron class represents a library member who can borrow and return books
- The Patron class includes a transaction_history list to record each borrowing and returning action
- Methods borrow_book and return_book append a descriptive string and timestamp to the transaction_history each time they are called
- get_transaction_history returns the transaction history
The Library class manages the collection of books and patrons
- It allows adding books and patrons to the library

This code provides a simple yet effective demonstration of how R6 can be used to model and manage real-world entities and their interactions, showcasing the utility of object-oriented programming in R. Any time John Doe borrows or returns a book, the transaction is recorded. You can then view all his transactions by calling the get_transaction_history method on his Patron object.
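R6 objects also mix comfortably with the functional style that most R users already rely on. As a small sketch under stated assumptions, the snippet below defines a pared-down Book class (title and loan status only) and a hypothetical helper, `available_titles()`, which is an ordinary function applied to a list of Book objects.

```r
library(R6)

# Pared-down stand-in for the Book class above
Book <- R6Class("Book",
  public = list(
    title = NULL,
    is_loaned = FALSE,
    initialize = function(title) {
      self$title <- title
    }
  )
)

# Hypothetical helper: a plain function over a list of Book objects
available_titles <- function(books) {
  on_shelf <- Filter(function(book) !book$is_loaned, books)
  vapply(on_shelf, function(book) book$title, character(1), USE.NAMES = FALSE)
}

books <- list(Book$new("The Great Gatsby"), Book$new("1984"))
books[[1]]$is_loaned <- TRUE   # mark the first book as loaned out

available_titles(books)        # "1984"
```

Nothing about the helper is object-oriented; it simply filters and maps over objects, which is the workflow most R code follows anyway.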

Closing Remarks

It’s crucial to recognize that OOP, while a powerful paradigm, is not a fundamental requirement for effective data science in R. The essence of R, especially in the realms of statistical programming, data analysis and visualization, is encapsulated in its rich ecosystem of packages and the functional programming approach, which often supersedes the need for a traditional OOP approach.

In the world of R, the tidyverse and tidymodels collections of packages stand out as quintessential examples of how functional programming can be both efficient and sufficient for most data science tasks. These packages provide a comprehensive suite of tools that are designed to work seamlessly with data objects, enabling users to perform complex data manipulations, analyses, and modeling without delving deeply into the nuances of OOP. Even web app development in R via Shiny allows data scientists to build robust client- and server-side programs without the need for OOP.

The functional programming paradigm empowers R users to apply a plethora of functions to objects in a way that is often more aligned with the typical workflows in reproducible statistics and machine learning. I write packages often and have yet to need traditional OOP in my package development work; that said, some of my favorite R frameworks leverage OOP (R6) in the backend. Whether you choose to dive into OOP or stick with the functional programming nature of R, you’ll most certainly be offered the tools and frameworks necessary to achieve your data science goals efficiently and effectively.

sessionInfo()

R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.1.1

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] quarto_1.3  purrr_1.0.2 R6_2.5.1   

loaded via a namespace (and not attached):
 [1] digest_0.6.33     later_1.3.1       fastmap_1.1.1     xfun_0.41         magrittr_2.0.3   
 [6] knitr_1.45        htmltools_0.5.7   rmarkdown_2.25    lifecycle_1.0.4   ps_1.7.5         
[11] cli_3.6.1         processx_3.8.2    vctrs_0.6.4       compiler_4.3.2    rstudioapi_0.15.0
[16] tools_4.3.2       evaluate_0.23     yaml_2.3.7        Rcpp_1.0.11       rlang_1.1.2      
[21] jsonlite_1.8.7

Reuse

CC BY 4.0