I am a huge fan of open source software because I believe it democratizes the playing field and promotes creativity. I am always on the lookout for open source software for predictive analytics and data mining. KNIME, the Konstanz Information Miner, is one of the best open source platforms for data mining, analytics, reporting, and data integration. KNIME has a wonderful, easy-to-use graphical user interface that competes with commercial software like SAS and IBM SPSS. KNIME has most of the state-of-the-art data mining and statistical algorithms built in. Additionally, it provides features to integrate other open source software such as Weka and R.
Today we have the president and one of the founders of KNIME, Prof. Dr. Michael Berthold, with us at YOU CANalytics. He will talk to us about different facets of KNIME, including its inception, development process, use cases, and future plans. Michael has promised us that KNIME will always be open source software! So I recommend that, after reading this interview, you download and try KNIME for yourself without worrying about license fees. The following is my conversation with Michael:
Roopam Upadhyay: Hi Michael, thanks for talking to YOU CANalytics! I have read that you are the author of the first line of code for KNIME.
Michael Berthold: Well, let me first say that I am _arguably_ the author of the first line of KNIME code – there are three other founders who claim the same. Bernd claims that he may not have written the first line but that his first line is at least still part of the current code base.
Roopam: It has been slightly more than ten years since you started working on KNIME in January 2004. This was just a few months after you joined the University of Konstanz. What motivated you to work on KNIME?
Michael Berthold: We decided to start KNIME because three of us had worked together at a software company where we actually worked on a similar type of architecture – I still remember heated discussions in 2002/2003 when we argued over moving data in buckets or tables or… So when we started the group in Konstanz, we knew what we wanted to build and we actually had done what many software projects sorely miss: build a prototype, trash it, and start from scratch. KNIME, in a way, is really the second generation of its kind already.
It is interesting to go back and check what we aimed for when we started with KNIME: we wanted to build an open environment that was intuitive, easily extensible, and application agnostic. We wanted to build a professional and scalable architecture from the start, too, because some of the applications that drove the development, such as pharmaceutical data analysis, were already doing “big data” before that term became hype. So, in contrast to many other open source projects, KNIME is _not_ a commercialized version of a PhD student’s project. It was designed to be professional software from day one. It was also clear that the platform needed to be open source so that others could deploy their cool algorithms using it as well.
Roopam: These 10 years must have been a great journey; what motivates you now to work on KNIME?
Michael Berthold: It’s fun to see how – even without excessive/aggressive marketing – lots of people use and love KNIME. Actually we are quite proud of the latter: in surveys about analytics tools, KNIME users are often the most satisfied ones. That’s what still drives us: talking to happy users, sometimes meeting people who’ve been using KNIME for years and are doing truly powerful things.
Roopam: Have things changed since you started?
Michael Berthold: Have things changed? Absolutely and in many ways. We now have a professional organization in place, next to the research group in Konstanz still feeding cool new technology into the open source platform. But the key vision is still the same: we are building an integrative, transparent, flexible, collaborative and open platform. And through its openness it is actually more powerful than many of the other, closed solutions out there.
Roopam: KNIME has a great graphical user interface that is really easy to use, though it still takes a little time to learn. What learning resources can you suggest to help new users of KNIME get familiar with the interface and the incredible list of data-mining algorithms it offers?
Michael Berthold: We have a lot of resources out there – our learning hub is the best place to get started with tutorials, white papers, YouTube videos, and more. And KNIME connects directly to our example workflow server, which hosts hundreds of nice examples as well. Then there are the KNIME Press books, written by Rosaria Silipo, with one co-authored by Mike Mazanetz – both long-time KNIME experts. And last, but definitely not least: use our forum! The KNIME community is super active and very helpful 24/7!
Roopam: Could you describe one of your favorite data-mining projects on KNIME, including the problem statement, analysis, and insights?
Michael Berthold: Oh, there are many – but the coolest one was probably a presentation at our User Group Meeting in 2013: someone had analyzed hundreds of KNIME workflows across the company and pumped them through KNIME’s frequent subgraph mining algorithm to identify pieces of workflows that appear often. It was using lots of KNIME technology to mine KNIME – and it actually had quite a few surprises for us, too. We expanded a number of nodes to include functionality that people seemed to use often together.
There are many other real-world examples out there, too – I am not quite sure which ones to pick. The ones I like are those where people use the integrative powers of KNIME, pulling together various different data sources, integrating them in KNIME, and then running joint analyses. Maybe three examples can give a bit of an impression of how widely KNIME is being used:
Triggered by a large telco that uses KNIME for online discussion analysis, we built our own workflow to analyze the KNIME forum, combining text, network, and “classical” data mining to find out who influences the forum. It is interesting (although maybe not surprising) to see that positive sentiment tends to result in longer, more in-depth discussions. It was also really nice for us to see that we now have many more external experts active on our forum than just a few years ago – and that is not because we are no longer posting and replying, but because the KNIME community is really taking over the forum.
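The two ingredients of that forum analysis – a sentiment score per post and an author interaction network – can be sketched in a few lines. This is only an illustrative toy (the real analysis used KNIME's text and network mining nodes); the posts, word lists, and scoring rule are all invented for the example.

```python
from collections import defaultdict

# Hypothetical forum posts: (author, author_replied_to, text).
posts = [
    ("alice", None,    "great tool, thanks for the help"),
    ("bob",   "alice", "thanks, this solved my problem"),
    ("carol", "alice", "broken and confusing"),
]

# A tiny hand-made lexicon; real sentiment analysis uses much larger ones.
POSITIVE = {"great", "thanks", "solved", "help"}
NEGATIVE = {"broken", "confusing", "problem"}

def sentiment(text):
    """Score = (# positive words) - (# negative words)."""
    words = [w.strip(",.") for w in text.lower().split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# Build the author interaction network: an edge means "replied to".
replies_to = defaultdict(set)
for author, parent, _ in posts:
    if parent is not None:
        replies_to[author].add(parent)

scores = {author: sentiment(text) for author, _, text in posts}
print(scores)            # per-author sentiment scores
print(dict(replies_to))  # who replies to whom
```

From structures like these one can then correlate sentiment with thread length, or find influential authors by looking at who receives the most replies.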
A second, more machine learning oriented example was presented by a group at our last User Group Meeting – they are using KNIME to train literally thousands of models for price prediction. The level of automation is helped by infrastructure built by one of our partners, Dymatrix. Their Dynamine environment connects to the KNIME Server and handles automatic training and re-training and, of course, continuous evaluation of those models.
At the complete opposite end of the spectrum are applications such as the one by a local bank – all they use KNIME for is to integrate their diverse data sources in a well-documented, reproducible way. They used to do that in Excel and spent days every quarter manually putting all of this together. Now they literally click a button on the KNIME WebPortal and the final report is ready. Even we find it interesting to see how often KNIME is being used to replace pretty sophisticated Excel spreadsheets that are hard to understand or to use in a reproducible way.
There are many more interesting applications in predictive maintenance, finance, health care, customer intelligence, pharma, games, online shopping – even casinos are using KNIME to better understand their customers. We have a number of white papers on our web site that describe some of these applications (or similar ones where the data is not publicly available) in much more detail (link). For all white papers you can download the corresponding workflow and modify it for your own data and analyses.
Roopam: KNIME has an exhaustive list of data management, statistics, and data mining algorithms. What is your process for research, selection, and integration of new algorithms to KNIME? Also how often do you modify the existing program?
Michael Berthold: We include what we believe are standard, often-used algorithms and constantly grow that list – and we also pay attention to what people are using via some of our integrations (most notably R or Weka). If we see some functionality pop up more frequently, we consider adding it to the library of native KNIME nodes – those scale better (KNIME nodes can, but are not forced to, run in-memory) and can process more complex data types.
But we do not aim to replace R (or Weka) – quite a few of the more bleeding-edge algorithms are available there and it’s a great asset for people to be able to reach out to those libraries and use whatever they want to use.
Note that we never silently modify algorithms – if we make changes to an algorithm in a way that alters its behavior, we deprecate the previous node so that existing workflows still use that version and produce the same results as before. That way, KNIME workflows stay 100% backwards compatible, and only newly created workflows make use of those modifications. Users are, of course, free to upgrade their workflows to the newer implementations.
Roopam: I think it’s great that the entire ‘Open Source’ community is leveraging each other’s strengths. Coming to Text mining and Natural Language Processing (NLP), they are quickly becoming essential tools in the toolkit for data scientists. Could you describe a few applications of text mining and NLP and explain how KNIME assists in solving these business problems?
Michael Berthold: We have integrations with a few packages for this type of data – the Stanford package comes to mind, and the Palladin extensions. So you can combine textual data with other sources right in KNIME. The forum analysis that I mentioned above is such an example: it provides a sentiment score on posts and extracts a network of authors from their interactions. In the bioinformatics domain, people have used the text processing extensions to look at research articles and other types of scientific information. And one of our partners is using these tools to do language detection for automatic email sorting.
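To make the email-sorting use case concrete, here is a deliberately naive language-detection sketch of my own (the partner mentioned above would use a proper NLP library): guess the language by stopword overlap. The stopword sets and example sentences are invented for illustration.

```python
# Tiny hand-picked stopword sets per language; real detectors use character
# n-gram statistics or trained models, but the overlap idea is the same.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "you"},
    "de": {"der", "und", "ist", "von", "zu", "sie"},
    "fr": {"le", "et", "est", "de", "vous", "la"},
}

def detect_language(text):
    """Return the language whose stopword set overlaps the text the most."""
    words = set(text.lower().split())
    return max(STOPWORDS, key=lambda lang: len(words & STOPWORDS[lang]))

print(detect_language("the invoice is attached and ready to send"))  # "en"
print(detect_language("der Bericht ist von der Abteilung"))          # "de"
```

An email router would then dispatch each message to the support queue for the detected language.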
Roopam: Personally, I am a fan of KNIME and occasionally try to replicate my analysis done with other software on KNIME for fun. It works great on most occasions but still there are times when KNIME struggles. What is a good forum for analysts to reach out to you? Is there a suggested configuration for the computer (personal machines and servers) for working with KNIME seamlessly?
Michael Berthold: Let us know about it! We are always curious to hear about problems people encounter. We do have a very extensive test setup but there are always types of data and weird setups that cause problems and we are eager to fix that asap – but we need to know about it, of course.
Roopam: KNIME provides extension nodes for other open source packages such as R and WEKA. What is the purpose of these extensions? And what is the best way for a new user to get comfortable with these extensions?
Michael Berthold: See also my comments above – we don’t believe in closed analytics packages. No single vendor can really claim to provide all of the functionality out there. If anyone can make that claim, it is the R community – but that is a community effort, and let’s face it, if you aren’t a programmer it’s hard to use! We believe in the power of R for analytics and in the power of an intuitive, easy-to-use environment to get analytics onto people’s desktops. So being able to do lots of ETL and analytics in KNIME while reaching out to more advanced (R) or specialized (Weka) packages for other routines when needed gives you both: simplicity and transparency, plus a connection to bleeding-edge methods when you want them. Note that in KNIME you can even wrap R scripts so nicely that your non-programming neighbor never needs to touch R code…
Roopam: About a year ago RapidMiner, another open source product, decided to cease its open source development and launched its latest version under a commercial license. Personally, I would be really sad if KNIME went the same way, though I understand there is a huge cost and effort involved in software development that needs to be compensated. Does KNIME have any such plans?
Michael Berthold: KNIME will always be open source; we believe strongly in the value of our community – see also above. KNIME is powerful as-is, but it’s really powerful when you reach out to the cool extensions provided by the community or our partners. This is a real asset and we do not plan to abandon it.
Plus, I personally believe that the platform for data (analysis) has become a commodity – it’s more like an operating system that you need in order to combine the data and tools you want to use. You cannot afford to be limited by a vendor’s choice of tools if you really want to trigger data-driven innovation – that’s also why we coined the “open for innovation” tag-line. In order to be open for innovation, you need an open platform that allows you to do what _you_ want to do, not what your vendor thinks you should be able to do.
Roopam: Are there creative ways in which the business analytics community can support KNIME (through finance, ideas and work)?
Michael Berthold: Well, you can always donate extensions adding to the wealth of analytics functionality in the KNIME ecosystem. We won’t return checks either but you can help us most by spreading the word. If you use and like KNIME – talk about it!
Roopam: What are some of the exciting new things the team at KNIME is working on? What can users expect in the near future?
Roopam: Thanks so much Michael for talking to us and sharing your ideas and views. I am really excited about the latest release of KNIME!
- Download KNIME from this link
- Read an earlier post about comparing statistical & data mining software