In Drew Conway’s famous Venn diagram of data science skills, programming chops might be the most elusive target. New libraries and packages are constantly emerging — and sometimes even entire new programming languages enter the stage.
True, Python continues to dominate. But despite its widespread footing and ease of use, it hardly holds a monopoly in data science. And the alternatives extend well beyond R, despite what partisans in the tired Python vs. R debate might have you think.
With that in mind, we assembled a short list of notable data science programming languages, along with a breakdown of their strengths and weaknesses. Plenty of this will be old hat for seasoned professionals, but anyone needing a quick survey of the landscape should read on.
Whether it’s the most popular programming language in the world or simply in the top three, Python has undoubtedly boomed in recent years. Even though it’s been around for 30 years, the language finally began to catch on more broadly around 2007, when Python-heavy Dropbox launched, providing extensive real-world proof of concept for its strengths. (And the fact that Google signed on around the same time was also huge.) But nowhere is Python’s popularity more evident than in data science circles, where it’s often considered the go-to.
Python has become widely used for several reasons. It prioritizes readability, it’s dynamically typed and it sports intuitive syntax, which makes it relatively easy to learn and use. Also, it has an incredibly rich ecosystem of libraries for data preprocessing and analysis (NumPy, SciPy, Pandas), visualization (Matplotlib, Seaborn, Bokeh) and more. Perhaps most notably, as machine learning and deep learning matured and grew mainstream, so too did Python, which added watershed platforms and libraries like scikit-learn, Keras, the Facebook-developed PyTorch and the Google-developed TensorFlow.
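To make the readability and dynamic-typing points concrete, here is a minimal sketch that uses only Python's standard library. The `readings` values are invented for illustration and do not come from any real data set.

```python
from statistics import mean

# Hypothetical readings; the values are made up for illustration.
# A list can hold floats, ints and strings at the same time.
readings = [12.5, 14.0, "13.5", 15]

# No type declarations are needed: each value is converted as it is encountered.
cleaned = [float(value) for value in readings]

average = mean(cleaned)
above_average = [value for value in cleaned if value > average]

print(f"mean={average:.2f}, above average: {above_average}")
```

In practice, a data scientist would likely reach for NumPy or Pandas for this kind of work, but the same plain-English flavor carries over to those libraries.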
And as Python’s traction has cemented, so too have its support resources. The language has a large, dedicated support community and a plethora of Python-specific books, bootcamps and courses. These are all reasons, perhaps, that the language regularly lands among the “most loved” in Stack Overflow’s annual developer surveys.
Even though its reputation as less-than-desirable for production is something of a misconception, R is still considered best-suited for data mining and statistical analysis, for which it offers a wide range of options. (Indeed, R’s favored status among statisticians is something of a chronic meme-generator.) It’s also regarded as more approachable than Python for non-developers, since it’s possible to whip up a statistical model and a sharp-looking visualization with just a few lines of code.
R also sports a robust constellation of packages. Most notable is the tidyverse family, created by R-community icon Hadley Wickham. It features popular packages for organizing data (tidyr), munging data (dplyr) and visualization (the groundbreaking ggplot2).
You can see it in action via the much-loved #TidyTuesday project. Each week, a new data set is released for data scientists to practice and demonstrate their data wrangling and visualization skills. It’s the kind of learning (and honing) in public that’s also emblematic of R’s famously supportive — and inclusive — community.
Much has been made of Julia’s swift ascendancy in data science circles over the last few years, and for good reason. It’s now a top-20 language in the IEEE Spectrum rankings, and it also cracked the top 20 of the influential TIOBE index last year. (As of writing, it stands at #32.)
As developer Anupam Chugh noted in Built In last year, Julia is faster than Python; dynamic, yet more type-safe (thanks to its just-in-time compiler and multiple dispatch); and better equipped for distributed and parallel computing.
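For readers unfamiliar with multiple dispatch, the idea is that the implementation chosen for a function call depends on the runtime types of all of its arguments, not just the first. The sketch below imitates that idea in Python with a small hand-rolled registry. It is not how Julia implements dispatch: the names `register` and `combine` are hypothetical, and the lookup matches exact types only, whereas Julia also accounts for subtypes and picks the most specific method.

```python
# Registry mapping (type of first arg, type of second arg) -> implementation.
_implementations = {}


def register(type_a, type_b):
    """Decorator that registers an implementation for a pair of argument types."""
    def decorator(func):
        _implementations[(type_a, type_b)] = func
        return func
    return decorator


def combine(a, b):
    """Choose an implementation from the runtime types of BOTH arguments."""
    func = _implementations.get((type(a), type(b)))
    if func is None:
        raise TypeError(
            f"no combine() method for ({type(a).__name__}, {type(b).__name__})"
        )
    return func(a, b)


@register(int, int)
def _combine_ints(a, b):
    return a + b


@register(str, str)
def _combine_strs(a, b):
    return f"{a} {b}"


@register(str, int)
def _repeat_str(a, b):
    return a * b


print(combine(2, 3))
print(combine("data", "science"))
print(combine("ab", 2))
```

The same call, `combine(x, y)`, runs three different pieces of code depending on the types of both arguments. In Julia this behavior is built into the language rather than bolted on.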
That power has made it an emergent option for big-data analysis in the private sector (BlackRock, Apple, Oracle and Google are all users) and, especially, for scientific research, where it’s being used in noteworthy projects for climate modeling, weather forecasting and astronomical surveys, among others. Its wide-ranging interoperability — compatible with everything from Python to old-school Fortran — and its continued boost from MIT, where it has roots, also make Julia a likely long-term contender. And like Python, it too garners a lot of love.
The venerable, general-purpose language C and its object-oriented cousin, C++, are hardly considered must-knows for general data science, but both sometimes pop up in job listings for machine learning engineering roles. That may have to do with the fact that the major ML libraries PyTorch and TensorFlow are both, at the core, written largely in the low-level C++, even if most practitioners use both through Python. (PyTorch is considered more “pythonic,” but the Python wrapper is essentially still just that.)
Another object-oriented language, the “write once, run anywhere” Java is most associated with web development, but it has a prominent role in data contexts, in large part because Java Virtual Machines (JVMs) are a common component in big-data frameworks. The Apache Hadoop ecosystem (including Hive, Spark and MapReduce) relies on JVMs, which means at least a passing familiarity will be helpful for efficiently running data jobs with large storage and processing requirements.
Speaking of Java, Scala was explicitly designed to be a cleaner, less wordy alternative to that popular language. “You can write hundreds of lines of confusing-looking Java code in less than 15 lines in Scala,” wrote Packt. Designed in Switzerland and released in 2004, Scala runs on the same JVMs mentioned above, which makes it a hand-in-glove fit for distributed big-data projects and data pipelines. A principal software engineer at Seattle-based Hiya told Built In in 2019 that he’s built petabyte-scale data projects on Scala.
SQL sometimes isn’t even considered a proper programming language, since it’s domain-specific. That’s silly of course, and just because you can’t, say, build an app with SQL doesn’t mean it’s not a must-know. SQL extensions can be added to allow more complex processes to run alongside a database, but it’s primarily used as a way to communicate with relational databases. Most data scientists won’t be futzing with actual database administration, but querying the data — and being able to investigate the nuances of how data might be manipulated by the database — are crucial skills to have.
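As a rough illustration of the querying skills described above, the following sketch runs a typical aggregate query against an in-memory SQLite database through Python's built-in `sqlite3` module. The `orders` table, its columns and its rows are hypothetical, made up purely for the example.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, invented for illustration.
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("east", 120.0), ("west", 80.0), ("east", 30.0), ("west", 45.5)],
)

# A typical analyst query: count and total the orders per region.
rows = conn.execute(
    """
    SELECT region, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    GROUP BY region
    ORDER BY total DESC
    """
).fetchall()

for region, n_orders, total in rows:
    print(region, n_orders, total)

conn.close()
```

The SQL itself would look much the same against PostgreSQL or MySQL. Only the connection code changes, which is part of why the language travels so well between jobs.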
True to its name, Swift, an Apple-developed, compiled language, quickly garnered a reputation for speed after making its public debut in 2014. (Like Rust and the Clang C/C++ compiler, Swift is built on the powerful compiler framework LLVM.) As Built In previously noted, machine-learning circles have taken particular notice for a few reasons: Swift is interoperable with Python, and Google, which once helped establish Python’s popularity, is strongly supporting Swift. (The language’s creator joined a top Google deep-learning research team in 2017.)