The data scientist’s tools
27 March 2017
The data scientist must be comfortable with different types of analytics tools. In the engineering of data processing systems, three basic tools stand out for analyzing huge volumes of information: Python, R, and Hadoop. Python and R are programming languages, while Hadoop is a framework. Although these tools are relatively new and not yet as widespread, they are easier to grasp for professionals already proficient in programming languages like Java or C.
R is considered the standard among statistical programming languages, and some know it as “the golden boy” of data science. It is a free software environment dedicated to statistical computing and graphics, compatible with UNIX, Windows, and macOS platforms. It is a must in data science, and being proficient in it practically guarantees a job offer, given the increasing number of commercial applications and its versatility.
- R is free: anyone can install, use, upgrade, clone, modify, redistribute, and even resell R. Not only does it save money on technology projects, but it also provides constant updates, which are always useful for any statistical programming language.
- R is a powerful language that helps users handle large data sets, making it a great tool for managing big data. It’s also well suited to resource-intensive simulations.
- Given all its advantages, it is increasingly popular. It has about 2 million users, who make up an active and supportive community. There are more than 2,000 free libraries with statistical resources devoted to finance, cluster analysis, and much more.
Python is another flexible and straightforward open-source programming language. A programmer working with Python ends up writing less code thanks to its beginner-friendly features, such as code readability, simplified syntax, and ease of implementation.
- As with R, programming in Python is suited to a great many industries and applications. Python is used at Google, as well as at YouTube, Dropbox, and Reddit. Institutions such as NASA, IBM, and Mozilla also depend heavily on Python.
- Python is also free, which benefits startups and small businesses. Since the language favors simplicity, it can be handled by small teams. And a good knowledge of the basics of this object-oriented language lets you migrate to a similar language just by learning its syntax.
- Python is often the language chosen to build applications quickly. Plus, its huge library of resources provides the necessary help to ensure that productivity is just a few clicks away.
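As a small illustration of the readability point, here is a minimal sketch that finds the most frequent words in a piece of text using only Python's standard library. The function name `top_words` and the sample sentence are assumptions made up for this example; they do not come from the study.

```python
import re
from collections import Counter


def top_words(text, n=3):
    """Return the n most common words in text, ignoring case."""
    # Split the lowercased text into runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    # Counter tallies each word; most_common sorts by frequency.
    return Counter(words).most_common(n)


# Hypothetical sample sentence, for illustration only.
sample = "data science needs data, and data needs cleaning"
print(top_words(sample, 2))
```

The whole task fits in one short function, with no boilerplate around it, which is the kind of brevity the section describes.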
Hadoop is another staple for anyone who wants to venture into the analysis of big data. Available as an open-source framework, it facilitates the storage and processing of huge amounts of data. It is considered the cornerstone of any flexible and forward-thinking data platform.
- Hadoop is one of the technologies with the greatest potential for growth within the data industry. Companies like Dell, Amazon Web Services, IBM, Yahoo, Microsoft, Google, eBay, and Oracle are firmly committed to Hadoop’s implementation.
- One of its major benefits is helping companies with their marketing needs: identifying customer behavior patterns on a website, providing recommendations and custom targeting, etc.
- Hadoop opens up great career opportunities in a wide variety of positions. Given its relevance in many industries, Hadoop specialists can find work as architects, developers, administrators, or data scientists.
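Hadoop's classic processing model is MapReduce. The sketch below imitates its three steps (map, shuffle, reduce) in plain Python to count page visits per customer, in the spirit of the customer-behavior example above. It runs in a single process and does not use any Hadoop API. The click log and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical click log: (customer_id, page) pairs.
CLICKS = [
    ("alice", "/shoes"),
    ("bob", "/hats"),
    ("alice", "/shoes"),
    ("alice", "/hats"),
    ("bob", "/hats"),
]


def map_phase(records):
    """Emit a ((customer, page), 1) pair for every click."""
    for customer, page in records:
        yield (customer, page), 1


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the counts collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


visit_counts = reduce_phase(shuffle_phase(map_phase(CLICKS)))
for (customer, page), count in sorted(visit_counts.items()):
    print(customer, page, count)
```

In a real cluster the map and reduce steps would run in parallel on many machines over data stored in a distributed file system; the logic of each step stays as simple as shown here.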
Another frequent interaction in the data scientist’s work is with databases. Here it’s common to work with NoSQL databases and with processing tools like Apache Spark and Apache Storm.
Visualization tools are not as important for creating value as they are for convincing. In this sense, they’re associated with the phase of communicating results and of showing others the value found in the data: trawling through numbers is not the same as presenting them. Programs such as QlikView, Tableau, and Spotfire are used for this.
Finally, there’s a pretty unglamorous part of the data scientist’s work: a process known as data wrangling. Raw data often arrives in a confused or imperfect state, so it first needs to be manually collected and cleaned up before it can be converted into a structured format to be explored and analyzed. This task can take up more than 50% of the data scientist’s working time, using tools like OpenRefine or Fusion Tables.
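To make the idea of data wrangling concrete, here is a minimal sketch of the kind of cleanup involved, written with Python's standard library. It is not how OpenRefine or Fusion Tables work internally. The raw export, its column names, and the function `clean_rows` are hypothetical and exist only for this example.

```python
import csv
import io

# Hypothetical raw export: stray spaces, inconsistent casing,
# a missing age, and a duplicated row.
RAW = """name, city ,age
 Ana ,MADRID,34
luis, bilbao ,
 Ana ,MADRID,34
"""


def clean_rows(raw_text):
    """Turn messy CSV text into a list of tidy, de-duplicated records."""
    reader = csv.reader(io.StringIO(raw_text))
    # Normalize the header names.
    header = [h.strip().lower() for h in next(reader)]
    seen = set()
    cleaned = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        record = dict(zip(header, (v.strip() for v in row)))
        # Normalize casing and convert age to a number (None if missing).
        record["name"] = record["name"].title()
        record["city"] = record["city"].title()
        record["age"] = int(record["age"]) if record["age"] else None
        key = tuple(record.items())
        if key in seen:
            continue  # drop exact duplicates
        seen.add(key)
        cleaned.append(record)
    return cleaned


for record in clean_rows(RAW):
    print(record)
```

Even in this toy case, most of the code deals with inconsistencies rather than analysis, which helps explain why wrangling takes up so much of the working day.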
Open source or proprietary software?
As in any area where specific software is required, data science professionals can choose between programs marketed by private companies and open-source software.
Before embarking on a data science project, it’s very important to know exactly what the technological requirements will be, so that resources and budgets can be adapted accordingly. This is one of the reasons why more and more companies are opting for the flexibility of open-source alternatives. The variety of options arising from the open-source environment has also helped to expand the use of new technologies and knowledge. Fee-charging commercial tools that dominated the market until recently are increasingly losing prominence to free alternatives.
Some experts have warned about manufacturers who try to impose their commercial solutions on businesses, which end up investing heavily in proprietary applications that often have an open-source alternative. This vendor lock-in can be avoided with open-source projects, which are scalable and can offer performance comparable to that of proprietary software.
This article is part of the study “Data Scientists: Who are they? What do they do? How do they work?”, available on Rebel Thinking.