One of the crucial decisions in data analysis is the choice of statistics software and language. In this article, I will review the options and help you choose the right data mining and statistics software for your purpose. Over the years I have used the following 12 software packages and languages (tools), with varying levels of experience. Here are my comfort levels with these tools:
| Experience | Statistics Software (tools) |
| Good | Excel; SAS; SPSS; R; Minitab; Statistica |
| Medium | KNIME; Stata; WEKA; Python |
This Worries Me ..
Let me get this off my chest before I explore the tools above. I am really worried about the way many companies and new analysts approach the science of analysis. They feel that software can do everything for them. Trust me, software packages are just calculators (highly sophisticated calculators). In the end, humans have to be creative in identifying and defining the problem, analyzing it, and generating actionable insights. In this process, software serves as a faithful assistant as long as the analyst stays in control of the problem. You are the most valuable resource your company has for generating cool insights for business growth. In short, stay curious while doing the analysis, and always question the results your software of choice throws at you.
Let us look at how different statistical tools are used in industry. The data for the adjacent figure is taken from r4stats.com (link). It dates from early 2014 and was generated by text mining job descriptions (link) on indeed.com, the most popular job site in the US. By and large, I believe the trend is similar globally, though at the current pace I expect R and Python (both open source languages) to surpass SAS in the next 5 years. Beyond industrial usage, the focus of this article is on helping you choose the right statistical tool for your purpose. SAS, Python and R are relatively heavy-duty tools with a fairly steep learning curve. The other tools have their own benefits, which we will explore in some detail as we go further.
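To make the idea of text mining job descriptions a little more concrete, here is a minimal Python sketch of how tool mentions could be counted. This is my own illustration, not the method r4stats.com actually used: the job descriptions, the `TOOL_PATTERNS` table, and the `count_tool_mentions` helper are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical job descriptions, invented purely for illustration.
JOB_DESCRIPTIONS = [
    "Seeking analyst with SAS and SQL experience; R is a plus.",
    "Data scientist: Python, R, and machine learning required.",
    "Statistician needed. SPSS or Stata preferred, SAS helpful.",
]

# One case-sensitive pattern per tool. The word boundaries (\b) keep a
# short name like "R" from matching inside longer words.
TOOL_PATTERNS = {
    "SAS": re.compile(r"\bSAS\b"),
    "R": re.compile(r"\bR\b"),
    "Python": re.compile(r"\bPython\b"),
    "SPSS": re.compile(r"\bSPSS\b"),
    "Stata": re.compile(r"\bStata\b"),
}


def count_tool_mentions(descriptions):
    """Count how many descriptions mention each tool at least once."""
    counts = Counter()
    for text in descriptions:
        for tool, pattern in TOOL_PATTERNS.items():
            if pattern.search(text):
                counts[tool] += 1
    return counts


mention_counts = count_tool_mentions(JOB_DESCRIPTIONS)
for tool, n in mention_counts.most_common():
    print(f"{tool}: {n} posting(s)")
```

Even this toy version shows why you should question such numbers: a single-letter name like R is easy to over-count (think "R&D"), so any real analysis would need more careful matching rules.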
Criteria for Evaluation
I have set the following criteria for evaluating these 12 tools.
- Ease of Use : to evaluate relative effort required to learn these tools
- Price Tag : to assess dollar impact
- Analysis Algorithms : inclusion of data mining and statistical algorithms in the tool
- Data Handling : capabilities of the tool to handle large data, cleansing, and modification
I have then ranked these tools on a scale of 5, where 5 represents the best score. It would be great if you could also rank the software of your preference on the same scale. These are early scores, and I will keep revising them based on your inputs. I share my reasons for the scores in the subsequent sections. I have divided the tools into the following 4 categories:
- Tools to Start Your Journey : these tools are great for starting to learn data analysis, since they are really easy to use and come with no extra baggage for new learners.
- GUI : once you have learned the art of analysis and are in no mood to learn an additional coding language, these graphical user interface (GUI) based tools are a perfect fit. Some of them are on par with the ‘Power Tools’ for heavy-lifting analysis.
- Power Tools : these tools rely heavily on coding to do analysis. Coding is a good way to simplify repetitive analysis, and it gives the analyst additional freedom.
- Specialized Analysis : these tools are good for time series analysis and forecasting for economic data.
Review of Data Mining & Statistics Software / Language
Tools to Start Your Journey
Microsoft Excel: Many people blame Microsoft for crappy software design. However, Excel is a wonder tool from the Microsoft factory. I am a huge fan of Excel because, as an analyst, you cannot get this close to your data with any other tool. Though it is not obvious, Excel can actually be used for high-end modeling like logistic regression and cluster analysis. Additionally, if you do it in Excel you will learn the finer nuances of these techniques in a way that is hard to match in other software. Read the book ‘Data Smart: Using Data Science to Transform Information into Insight’ by John W. Foreman to learn more.
Minitab was originally designed for quality management, but because of its simple GUI it is a good place to start learning analysis. While using Minitab you will spend very little time learning the software, so most of your effort will go into the analysis. Minitab includes modeling techniques like survival analysis and logistic regression, so it has a decent set of algorithms to start your journey in the analysis world.
GUI
SPSS & Modeler: SPSS (Statistical Package for the Social Sciences), as the name suggests, was originally designed for social science research in the late 1960s. Over the years, however, it has evolved into a full-fledged statistical tool with an easy-to-use graphical interface. Modeler, an SPSS product, is an added tool with a drag-and-drop interface for data mining. Together, SPSS and Modeler make a highly advanced toolset for data analysis (on par with SAS). IBM acquired SPSS Inc in 2009, and since then the products have been called IBM SPSS and IBM SPSS Modeler.
Statistica is another great GUI tool, on par with SPSS & Modeler in terms of functionality. It has an add-on data mining suite which is possibly slightly inferior to Modeler, but on the surface Statistica and SPSS are equally viable options. (StatSoft) Statistica is now a Dell product.
KNIME: this is high-quality open source software for analysis. The team at KNIME is putting consistent and considerable effort into an exceptional user experience. The interface is somewhat similar to that of commercial GUI based software, which makes life simple for analysts. There are a few minor performance issues with really large datasets, but the team at KNIME is constantly working to resolve them. Additionally, KNIME’s free (open source) and licensed versions are exactly the same software. The licensed version adds a support subscription, which an individual learner or small company can do without. I strongly recommend KNIME. Download it and judge it for yourself.
Rapid Miner was free, open source software until just a year ago, when the latest version (v. 6) was launched under a commercial license without much warning. This sudden move has certainly created substantial angst among its users, who feel cheated. The software is good, but I believe they could have adopted a more innovative strategy to go commercial. I find the interface of Rapid Miner a bit clunky, and I didn’t get much into it even when it was free.
WEKA is another free, open source package, used extensively in academia. WEKA has a simple interface (some may argue a bit bland), but it does its job. However, WEKA may choke a bit on larger datasets. Try this one for yourself, as it is free to use.
Power Tools
SAS & E-Miner: SAS (Statistical Analysis System) is great software and the market leader in analysis software, but it is bloody expensive. In recent times, Python and R have democratized this highly monopolistic market. There are exciting times ahead, and SAS will struggle to stay the market leader if it continues with its current strategy. Having said this, SAS still has the edge in data handling capabilities.
R is a wonderful tool, and it is reassuring that universities and smart folks around the globe are contributing to its development. One can find virtually any advanced algorithm in its package repository (CRAN). Coding in R has a slightly steep learning curve, but I can assure you it is perfectly logical. You could learn R through these 12 resources (Link).
Python is another great language, with wide usage outside data analysis. Python is an excellent language for data management and handling. For data analysis, there are advanced add-on libraries like NumPy, SciPy, and Matplotlib. However, R still has an edge over Python for data analysis.
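To give a flavour of what analysis with these add-on libraries looks like, here is a minimal sketch using NumPy alone (I have left SciPy and Matplotlib out to keep it short). The data is made up for illustration, and the variable names `spend` and `sales` are my own assumptions.

```python
import numpy as np

# Made-up data for illustration: advertising spend vs. sales,
# both in arbitrary units.
spend = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
sales = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

# Basic descriptive statistics.
mean_sales = sales.mean()
std_sales = sales.std(ddof=1)  # sample standard deviation

# Pearson correlation between the two variables.
r = np.corrcoef(spend, sales)[0, 1]

# Least-squares straight line: sales ~ slope * spend + intercept.
slope, intercept = np.polyfit(spend, sales, deg=1)

print(f"mean sales: {mean_sales:.2f}, std: {std_sales:.2f}")
print(f"correlation: {r:.3f}")
print(f"fitted line: sales = {slope:.3f} * spend + {intercept:.3f}")
```

A handful of lines covers summary statistics, correlation, and a simple regression, which is roughly the kind of work you would otherwise do through menus in the GUI tools.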
Specialized Analysis
EViews: I have never seen anyone other than econometricians and economists using EViews. It is a great tool for time series modeling and forecasting. However, its graphical interface is not that great for a new user.
Stata is another great tool, with an SPSS-like interface, and it has all the important time series algorithms.
I look forward to hearing from you about your experience with the different software. It would be great if you could also rank the software of your preference on a scale of 5 for the following criteria:
- Ease of Use
- Price Tag
- Analysis Algorithms
- Data Handling
I will update my table with your numbers, so let us start the discussion!
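For the curious, here is a minimal Python sketch of one way reader rankings could be folded into the table: a plain average per criterion, plus an overall average. The tool names and scores below are hypothetical placeholders, not real ratings, and I am assuming scores run from 1 to 5. A simple mean is just one possible choice.

```python
from statistics import mean

CRITERIA = ("Ease of Use", "Price Tag", "Analysis Algorithms", "Data Handling")

# Hypothetical reader submissions, invented for illustration only.
reader_scores = {
    "Tool A": [
        {"Ease of Use": 5, "Price Tag": 3, "Analysis Algorithms": 2, "Data Handling": 2},
        {"Ease of Use": 4, "Price Tag": 3, "Analysis Algorithms": 3, "Data Handling": 2},
    ],
    "Tool B": [
        {"Ease of Use": 2, "Price Tag": 5, "Analysis Algorithms": 5, "Data Handling": 4},
    ],
}


def summarize(submissions):
    """Average each criterion across submissions, then average the criteria."""
    for scores in submissions:
        for criterion in CRITERIA:
            # Assumption: valid scores are integers from 1 to 5.
            if not 1 <= scores[criterion] <= 5:
                raise ValueError(f"score out of range for {criterion}")
    row = {c: mean(s[c] for s in submissions) for c in CRITERIA}
    row["Overall"] = mean(row[c] for c in CRITERIA)
    return row


summary = {tool: summarize(subs) for tool, subs in reader_scores.items()}
for tool, row in summary.items():
    print(tool, row)
```

If some criteria matter more to you than others (say, Price Tag for a student), you could swap the overall mean for a weighted one.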