One of the crucial decisions in data analysis is an appropriate choice of statistics software and language. In this article, I am going to analyze the options and help you choose the right data mining and statistics software for your purpose. I have used the following 12 software packages / languages (tools) with different levels of experience over the years. Let me report my comfort levels and experience with these tools.
| Experience | Statistics Software (Tools) |
|---|---|
| Good | Excel; SAS; SPSS; R; Minitab; Statistica |
| Medium | KNIME; Stata; WEKA; Python |
| Low | EViews; RapidMiner |
This Worries Me…
Let me get this off my chest before I explore the above tools. I am really worried about the way many companies and new analysts are approaching the science of analysis. They feel that software can do everything for them. Trust me, software packages are just calculators (highly sophisticated calculators). In the end, humans have to be creative in problem identification and definition, analysis, and the generation of actionable insights. In this process, software serves as a faithful assistant as long as the analyst is in control of the problem. I must tell you that you are the most valuable resource your company can have to generate cool insights for business growth. In short, stay curious while doing the analysis, and always question the results your software of choice is throwing at you.
Industrial Usage
Let us have a look at how different statistical tools are used in industry. The data for the adjacent figure is taken from r4stats.com (link). The data is from early 2014 and was generated by text mining the job descriptions (link) on the most popular job site in the US, indeed.com. By and large, I believe this trend is similar globally. Given the current trend, I expect R and Python (both open-source languages) to surpass SAS in the next 5 years. In addition to industrial usage, the focus of this article is on helping you choose the right statistical tool for your purpose. SAS, Python, and R are relatively heavy-duty tools with a fairly steep learning curve. Other tools have their own benefits, which we will explore in some detail as we go along.
Criteria for Evaluation
I have set the following criteria for evaluating these 12 tools.
- Ease of Use: to evaluate the relative effort required to learn these tools
- Price Tag: to assess dollar impact
- Analysis Algorithms: inclusion of data mining and statistical algorithms in the tool
- Data Handling: capabilities of the tool to handle large data, cleansing, and modification
After this, I have ranked these tools on a scale of 5, where 5 represents the best score. It would be great if you could also rank the software of your preference on the same scales. These are early ranking scores, and I will keep modifying them based on your inputs. Let me share my reasons for the scores I have assigned to these tools in the subsequent sections. I have divided these tools into the following 4 categories:
- Tools to Start Your Journey: these tools are great for starting to learn data analysis since they are really easy to use and come with no extra baggage for new learners.
- GUI: once you have learned the art of analysis and are in no mood to learn an additional coding language, these graphical-interface-based (GUI) tools are perfect. Some of these tools are on par with the 'Power Tools' in terms of heavy-lifting analysis.
- Power Tools: these tools rely heavily on coding to do analysis. Coding is a good way to simplify repetitive analysis, and it also gives the analyst additional freedom.
- Specialized Analysis: these tools are good for time series analysis and forecasting of economic data.
Review of Data Mining & Statistics Software / Language
Tools to Start Your Journey
Microsoft Excel: Many people blame Microsoft for crappy software design. However, Excel is a wonder tool from the Microsoft factory. I am a huge fan of Excel because, as an analyst, you cannot get this close to your data with any other tool. Excel, though it is not obvious, can actually be used for high-end modeling like logistic regression and cluster analysis. Additionally, if you do it in Excel, you will learn the finer nuances of these techniques in a way that is not possible in any other software. Read the book "Data Smart: Using Data Science to Transform Information into Insight" by John W. Foreman to learn more.
Minitab was originally designed for quality management, but because of its simple GUI, it is a good place to start learning analysis. While using Minitab you will waste very little time on learning the software, and most of your effort will go toward analysis. Minitab includes modeling techniques like survival analysis and logistic regression, and hence has a decent set of algorithms to start your journey in the analysis world.
SPSS & Modeler: SPSS (Statistical Package for the Social Sciences), as the name suggests, was originally designed for social science research in the late 1960s. However, over the years it has evolved into a full-fledged statistical tool with an easy-to-use graphical interface. Modeler, an SPSS product, is an added tool with a drag-and-drop interface for data mining. Collectively, SPSS and Modeler are a highly advanced toolset for data analysis (on par with SAS). In 2009 IBM acquired SPSS Inc., and since then the products have been called IBM SPSS Statistics and IBM SPSS Modeler.
Statistica is another great GUI tool that is on par with SPSS & Modeler in terms of functionality. It has an add-on data mining suite that is possibly slightly inferior to Modeler, but on the surface, Statistica and SPSS are equally viable options. Statistica, formerly a StatSoft product, is now a Dell product.
KNIME: this is high-quality open-source software for analysis. The team at KNIME is consistently putting in great effort toward an exceptional user experience. The interface is somewhat similar to commercial GUI-based software, which makes life simple for analysts. There are a few minor performance issues with really large datasets; however, the team at KNIME is constantly working to resolve them. Additionally, KNIME's free (open-source) and licensed versions are exactly the same. The licensed version comes with a support subscription, which an individual learner or small company can do without. I strongly recommend KNIME. You must download KNIME and judge it for yourself.
RapidMiner was free, open-source software until just a year ago, when, without much warning, the latest version (v6) was launched under a commercial license. This sudden move has certainly created substantial angst among its user base, which feels cheated. The software is good, but I believe they could have adopted a more innovative strategy for going commercial. I find the interface of RapidMiner a bit clunky; I never got much into it even when it was free.
WEKA is another open source and free software used extensively in academia. WEKA has a simple interface (some may argue a bit bland), but it does its job. However, WEKA may choke a bit with larger data size. Try this one for yourself as it is free to use.
SAS & E-Miner: SAS (Statistical Analysis System) is great software and the market leader in analytics, but it is bloody expensive. In recent times, Python and R have democratized this highly monopolistic market landscape. There are exciting times ahead, and SAS will struggle to stay the market leader if they continue with their current strategy. Having said this, SAS still has the edge in terms of data handling capabilities.
R is a wonderful tool, and it is reassuring that universities and smart folks around the globe are contributing to its development. One can find virtually any advanced algorithm in its package library (CRAN). Coding in R can have a slightly steep learning curve, but I can assure you it is perfectly logical. You can learn R through these 12 resources (Link).
Python is another great language, with wide usage outside data analysis. Python has excellent data management and handling capabilities. For data analysis, there are advanced add-on libraries like NumPy, SciPy, and Matplotlib. However, R still has an edge over Python for data analysis.
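To give a feel for these libraries, here is a minimal sketch of an everyday analysis task in Python using NumPy and SciPy. The data here is simulated purely for illustration; think of the two samples as, say, metrics from two groups in an A/B test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two hypothetical samples (simulated data, for illustration only)
a = rng.normal(loc=100.0, scale=15.0, size=500)
b = rng.normal(loc=103.0, scale=15.0, size=500)

# Descriptive statistics with NumPy
print(f"mean(a) = {a.mean():.2f}, sd(a) = {a.std(ddof=1):.2f}")
print(f"mean(b) = {b.mean():.2f}, sd(b) = {b.std(ddof=1):.2f}")

# Two-sample t-test with SciPy
t_stat, p_value = stats.ttest_ind(a, b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

A few lines get you from raw arrays to a formal test; this conciseness is a big part of Python's appeal for analysts.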
EViews: I have never seen anyone other than econometricians and economists using EViews. It is a great tool for time series modeling and forecasting. However, the graphical interface of EViews is not that great for a new user.
Stata is another great tool with an SPSS-like interface, and it has all the important time series algorithms.
Sign-off Note
Look forward to hearing from you about your experience with the different software. It would be great if you could also rank the software of your preference on a scale of 5 for the following criteria.
- Ease of Use
- Price Tag
- Analysis Algorithms
- Data Handling
I will update my table with your numbers so let us start the discussion!
Thanks for the summary. It is very useful, especially for those who are about to acquire this knowledge. By the way, how do you rank MATLAB in terms of the 4 criteria?
In my opinion, MATLAB is a completely different tool; it's a great (probably the best) tool for engineering applications, simulations, and numeric computations. However, for the kind of data analysis and predictive analytics we generally do in industry (banks, telcos, retail, etc.), you would mostly want to avoid MATLAB. Hence, I feel it's not appropriate to compare MATLAB with the above tools. Hope this helped.
Hi Roopam:
With regard to statistical tools and statistics, one must take great care to ensure that the necessary and sufficient conditions of a particular statistic are met. Absent that, the results may be incorrect and misleading.
With regard to predictive analytics, experience has shown that it is very difficult to predict the future.
I've never quite understood data mining. What would cause someone to go data mining? Moreover, is the mined data correct?
Regards,
Jay Sorenson
Hi Jay,
That's a really good question. Let me try to answer it, as best I can, in a few words. However, I must tell you there are deeper philosophical and conceptual puzzles in your question. First, one needs to distinguish between correlation and causation. Most statistical analyses (except controlled experiments) primarily try to establish correlation between two or more variables. The same is true for data mining: it is all about correlation; don't confuse it with causation. For many business problems, correlation is good enough to move forward. Secondly, statistical tests of significance usually play a crucial role with smaller data, but with larger data those significance tests break down; with data mining, noise may often look significant. Hence, while mining data one needs to be extremely careful and make astute business and scientific judgments at every step. This is something you learn with a lot of experience, and the best way to get there is to keep asking questions.
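The point about noise looking significant can be made concrete with a quick simulation sketch (a hypothetical example in Python, not from the discussion itself): test 1,000 pure-noise features against a random target and, by construction, roughly 5% will clear the usual p < 0.05 bar despite carrying no signal at all.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_rows, n_features = 1000, 1000

# Pure noise: the target has NO real relationship with any feature
target = rng.normal(size=n_rows)
features = rng.normal(size=(n_rows, n_features))

# Test every feature for correlation with the target
p_values = np.array(
    [stats.pearsonr(features[:, j], target)[1] for j in range(n_features)]
)

false_hits = int((p_values < 0.05).sum())
print(f"{false_hits} of {n_features} noise features look 'significant' at p < 0.05")
```

Around 50 spurious "discoveries" is exactly what the 5% significance level promises over 1,000 tests, which is why astute judgment (and corrections for multiple testing) matter so much when mining data.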
On a more philosophical level, could I ever say my experience is the only experience? And even if I extend it with the experience of everybody who has ever lived, is that the complete experience? I believe the beauty of life is in living with incomplete information and its unparalleled mysteries; whether you are mining data or doing statistical analysis, it stays the same. The idea is to identify that small piece of information (within excessive noise) as best as possible.
Rexer Analytics does a survey every year comparing a lot of the same tools. http://www.rexeranalytics.com/Data-Miner-Survey-Results-2013.html They’ll send you a summary for free on request.
Paige
For some years now, what you call "SPSS" has been named IBM SPSS Statistics (and Modeler is IBM SPSS Modeler). It would be helpful to use the current names.
On substance, SPSS Statistics has extensive APIs for using Python or R code (as well as Java). Currently there are 101 extensions posted on the SPSS Community website that are implemented in Python or R. SPSS Modeler, as of version 16, also has R APIs as well as some support for Python.
Thanks Jon, that is a good point you made. Additionally, even Statistica is now a Dell product. However, I deliberately kept the names without the names of the giant parent companies to emphasize the statistical software itself. I have mentioned these facts in the short write-ups about the software in the above article. I am glad you brought this point up.
Roopam,
I think you’ve done a great job here, but I’d like to offer some (hopefully) constructive criticism. The score for SAS on algorithms is, IMO, far too high.
1. Because of their business model, SAS will always be behind R and Python when it comes to algorithm implementation.
2. Because SAS functions so poorly as a programming language, no one is realistically creating their own model types from base algorithms. Want to create a new flavor of boosting? Have an interesting ensembling algorithm? Good luck doing anything with those ideas in SAS.
Perhaps you've limited the comparison to "baked in" algorithms and have ignored the extensibility aspect, OR perhaps you are mainly talking about EDA and not modeling?
Thanks Dean, I accept your constructive criticism. I think I have been generous to SAS in my rating for modeling algorithms. I have avoided SAS for so long (while enjoying R) that I had forgotten how annoying it is. Moreover, even for 'designing experiments' or 'optimization tasks' one has to buy add-on packages at obnoxious prices. This is particularly annoying when your organization has already paid more than a few hundred thousand dollars for enterprise licenses. I am really happy that R and Python are democratizing the playing field.
Thanks for the summary Roopam. I’m currently using SAS and SPSS. I’m very keen on learning to use Python and R.
All the best Julien while learning Python and R. You will enjoy them! If you are migrating from SAS and SPSS, I suggest you start with R as I find it more suitable for analysis.
2 years later, and I have chosen Python as my language of choice. R was great, but Python was addictive. I'm also participating in Kaggle competitions, which is fun. I don't see much of a future for SAS or SPSS; managers are getting skeptical about costs and licenses.
That's wonderful, Julien, that you could find your match in Python. I believe it will be extremely useful for other readers if you could describe what aspects of Python you find addictive over R. A short comparison between them based on your personal experience will be much appreciated.
In the chart on usage, the horizontal scale is ‘thousands. Of what?
In the chart on ranking, which is better, 1 or 5
Get serious.
Thanks for your comment. It’s interesting that you noticed ‘thousands’ in the bar chart. Ask yourself does it really matter whether it is thousands or millions? It’s a sample of job-postings on ‘Indeed’. The bar chart is to present the relative positions of software in terms of job requirement. In the table, 5 is the best score from the users’ perspective as already mentioned in the article.
Hi Roopam,
Thanks for the fruitful information indeed. May I request a comparison of the available algorithms in R vs. Python (scikit-learn)? I struggled with vector size allocation in R even though I was using a 32 GB machine for production algorithm development.
As far as I understand, R is very, very slow with a data frame (3 GB in size) having more than 2 crore records.
Many Thanks
Mahi
Thanks Mahi,
For high-performance tasks and machine learning, the Python NumPy and SciPy libraries work significantly faster than R because of their internal design. I have personally experienced this with large datasets. However, Python's data analysis packages are still not as evolved as R's. Python has good libraries for machine learning tasks such as SVM, regularization, etc. However, I must point out that R is designed and developed from the perspective of statisticians and analysts, not programmers. Hence, R has highly advanced libraries and packages for analysis and data thinking.
The size of a data frame is determined by both rows (number of records) and columns (number of variables). Hence, if you have lots of variables and several of them are in character format, then far fewer records than 2 crore (20 million) can add up to a 3 GB size.
For your analysis, I recommend you do most of your data pre-processing in SQL before analyzing the data in R. Also, see if you could divide your analysis into pieces (this is usually possible through the right sampling or by using random forest or ensemble methods). 32 GB of RAM, in my opinion, should handle datasets for most practical analyses as long as you have spent enough time on data pre-processing and refined your problem nicely. However, there are always exceptions, and it might not be possible in your case. In such cases, try Python or Julia (which is better for high-performance analysis tasks but, again, not as evolved as R).
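The rows-and-columns point above can be sketched in Python with pandas (a hypothetical illustration of the same principle that applies to R data frames; the table and column names are made up). Character columns dominate memory use, and encoding them compactly, or working on a sample, often makes an "impossible" dataset fit comfortably.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 100_000  # stand-in for a much larger production table

df = pd.DataFrame({
    "customer_id": np.arange(n),
    "segment": rng.choice(["retail", "corporate", "sme"], size=n),
    "balance": rng.normal(50_000, 12_000, size=n),
})

# Rows AND columns (and their dtypes) drive memory use, not row count alone
mb = df.memory_usage(deep=True).sum() / 1e6
print(f"approx. in-memory size: {mb:.1f} MB")

# Character columns are costly; a categorical encoding shrinks them
df["segment"] = df["segment"].astype("category")
mb_after = df.memory_usage(deep=True).sum() / 1e6
print(f"after categorical encoding: {mb_after:.1f} MB")

# When the full table still will not fit, work on a random sample
sample = df.sample(frac=0.1, random_state=0)
```

The same thinking applies in R (e.g., factors instead of character vectors, and sampling before modeling), which is why pre-processing in SQL and refining the problem first stretches a 32 GB machine a long way.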
Thanks for your suggestions. I am learning Python for fun while I am using R for data mining. My question is how about Julia?
Julia is projected as a language for high-performance computing, data analysis, and big data (whatever that means). Julia has an advantage over R and Python when it comes to the speed of performing operations. However, both R and Python have advanced packages for high-end analyses that are missing in Julia. I think Julia has applications particularly in algorithmic trading and similar sorts of work where speed is a crucial factor. Other than speed, Julia, with its current structure, has very little to offer over both R and Python. Additionally, for speed, C and C++ are better suited than Julia. As of now, I think Julia still has a way to go before it can bring some serious value to the table for business analysts and data scientists.
It is a very good article; I learned a lot by reading it. I am looking forward to some more articles that will help beginners.
Hello.
You forgot Mathematica, well known for its ability to do complex symbolic, infinite-precision calculations and solve abstract problems. But it's also a great tool for statistics, interactive graphics, data cleansing, and almost any other task you could do with the tools you mention in your article.
I used Mathematica a long time ago and believe it is more similar to MATLAB than to the specialized data science software used in industry.
Hi, I also very much enjoyed your brief comparison as well as the readers' comments. I generally have a long history with MATLAB but have started to use Python a lot in recent years for data visualization, calculations, and some statistics. I have a book on R on my shelf but haven't started it yet. For a few years I used Minitab for analysis of production data, since it was favored by my company at the time.
Now I work as a consultant and face a customer that wants to use Minitab for analysis of production data. I am in two minds here and am considering also using Python (or perhaps R). The main reason is that I understand they want to build up routines for continuous monitoring and evaluation of production, and therefore need to streamline the work with scripts tailored to their processes. I am trying to figure out how convenient Minitab's scripting functionality is. I did not use its script functionality when I did this kind of work ten years back, and my job was not really to streamline the analysis for future use, but rather to make a thorough baseline report. When I google Minitab, I find very little documentation around macros and scripts and how Minitab interacts with, say, an SQL database, etc. I would appreciate some comments from you.
For recurring and iterative analysis, I recommend you use either R or Python. Minitab has a scripting language of its own, but it's not that great. Minitab is a good tool for quick analysis and was never meant for iterative work. I think you will find the following links to learn R and Python useful:
1) Learning resources for R – Free R Books
2) Learning resources for Python – Free Python Books
Hi Roopam,
I think MATLAB is highly underrated for data science purposes. It's relatively cheap, very stable, and supports a lot of machine learning methods. My wife, who is also a data scientist, used it to build a complex predictive model for the largest retailer in the Netherlands, and they were very happy with it.
I think MATLAB is fairly unknown because they don't do much advertising. I've used it for the past 10 years in an electronics lab environment. The usual tool there is LabVIEW, but MATLAB outperforms LabVIEW in almost every way (I've tried both and decided on MATLAB). When I asked the MATLAB people about it, they said that they knew they were better, but they just don't advertise it. I find this very strange!
In the coming months I'll be learning SAS and R because I'm making a move to data science and these are the most used tools; however, I would not write MATLAB off.
Best regards,
Hans
Hi Hans,
I used MATLAB a long time ago, and mostly for engineering applications. I have never used MATLAB for machine-learning-related work in industry. It would be great if you could share your experience while learning these tools.
Cheers,
Roopam
Dear Roopam,
This is my and my wife's experience with MATLAB up till now; we've both been using it for 10 years. I used it for measurement and signal processing, and my wife for business analysis / data science. First, the supported machine learning and statistics methods can be found on these pages.
http://nl.mathworks.com/products/statistics/
http://nl.mathworks.com/products/neural-network/
They also have curve fitting and optimisation toolboxes. My wife did her PhD in artificial intelligence mostly with MATLAB, and after that she started working for the biggest retailer in Holland, as I mentioned, making a very complex forecasting model using huge records with millions of transactions for training the model. What I really like about MATLAB is the incredibly well-written help pages that get you going in minutes most of the time. I once tried building a neural network 5 years ago (which I had never done before), but it took me less than an hour to write a full program that created a noisy 4th-order function and used a neural network to retrieve the 4th-order function perfectly. This is due to the great help pages, and it is why I think MATLAB is worth the money. It's not that expensive; MATLAB with all the toolboxes you need for machine learning is probably less than 10,000 euros for a single licence, and the great support earns back the money. Furthermore, it's an incredibly flexible programming language, so you can do whatever you want and interface with anything you want. My wife has been in data science for over 10 years now, and she has knowledge of WEKA, MATLAB, SAS, RapidMiner, and some R, but if she has a free choice of software, it will be MATLAB. So MATLAB is affordable, can do the job, can handle big datasets (if you have a powerful PC), has fantastic help pages, is highly flexible, and has proven to work perfectly for large, complex projects. That's why I'm saying: 'Do not write it off; it is a very powerful tool, it's just not promoted for this job.' I think that is a big pity, because it really does a good job at a very fair price compared to the competition.
What is also a big plus with MATLAB is backwards compatibility. In my 10 years of using MATLAB, I've never had a backwards-compatibility problem. A colleague of mine has been extensively using Python for the last few months, and he ran into a wall of compatibility problems where he had to install exactly the same versions of every Python 'sub-tool' to get a piece of code running that was also used at another site. This took days to get going. I consider this a risk that comes with open-source software: there's not much quality control being done before something is sent out, and backwards compatibility certainly does not seem to have priority. The time lost here would already pay a big part of the cost of buying MATLAB. The remark here is: cheap is not always better. However, SAS and SPSS are way too expensive; those prices feel a bit like theft to me, and that is why I think they'll get in trouble in the coming years. Free software is on the other end of the scale. So my feeling is that MATLAB is the perfect combination of all the variables here, with its good price, fantastic support, flexibility, and performance.
I find it a pity that MATLAB is not more 'famous' in the data science world. Again, I think it is highly underrated. If anyone wants to play with MATLAB, they're very generous with one-month trials. I regularly got those to try a toolbox I was curious about. This gives you a good opportunity to play with the tool and see what it does.
Best regards,
Hans
Thanks Hans, for sharing this extensive feedback on Matlab along with your personal experience. I am sure readers on YOU CANalytics will find it useful.
I agree with your point about issues of backward compatibility for Python (or even R) when compared with MATLAB. The former, being open-source platforms, will always struggle a bit with backward compatibility because of their wide communities of users and developers. However, this wide community is also Python and R's strength.
Personally, I think creativity and innovative thinking are more key to successful data science implementation than the platform. All the platforms are more or less on par with each other. If I had to make a large investment, I would invest in human capital and creative thinkers rather than a platform.
Thanks. Some compatibility issues are a fact of life, but I like the idea of the wide community that is developing. It's very good to have these kinds of insights from someone who is very experienced in this field.
I really agree with the last part: in the end, it's the creative problem solvers who can handle any problem. You can really easily pick them out of the crowd you work with. These are also the people who learn from previous mistakes, because they recognize the similarity in a different situation and don't make that mistake again. And these are the people who solve a problem when they sense it, rather than leaving it in there, causing a huge amount of delay and cost later on in the project (like not being 100% sure of the business question that needs to be solved).
Nice blog post. The information is very useful and helpful. Thank you for sharing.