Comment on “Why R is Bad for You”
There seems to be a lot of confusion about the role of programming in relation to the Data Science platforms that research firms Gartner and Forrester have identified as the future of Data Science in large corporations. For example, numerous people have stated that SAS is pushing a drag-and-drop platform (Enterprise Miner) that somehow limits choices and is destined to fail because programming (that is, R) allows greater flexibility.
It’s certainly true that Gartner is telling people (based on feedback from the largest firms) that “drag and drop” is a minimum requirement for large data science teams (corporate and line-of-business teams, to use Gartner definitions) and that open source programming is not manageable in this context. Gartner does not address the issue of whether programming in general can, or should, play any part in these platforms. This is most likely because Gartner (and the large firms) are focused on whether unsupported open source programming, with all its shortcomings, can be used in an enterprise-level “production environment.” Various folks in the open-source community apparently assume this means that data science platforms and programming are mutually exclusive, that platforms are therefore limiting, that R is the obvious choice in terms of extensibility/innovation, and that R will prevail over Data Science platforms. However, Enterprise Miner is not just a point-and-click app. It actually has a full programming interface, as seen in the excerpt below from the Enterprise Miner Reference document.
So, you ask, why does SAS include a full programming capability if drag-and-drop Data Science platforms (with modules, building blocks, etc.) are the wave of the future in Data Science? It’s because of what several people have said at various points in this thread (me included; I’m the “Paul” whose note Bill pasted in a few days back. Thanks, Bill.) That is, it’s not feasible to build drop-downs for every possible contingency, even with the exhaustive capabilities of a top-notch data science platform. For instance, if you look at the initial course for SAS Enterprise Miner (“Applied Analytics for EM”), it’s essentially all drop-downs. But in the advanced course (“Advanced Analytics Using EM”), you’re using drop-downs for 2/3 of the material and programming statements for maybe 1/3. As the SAS description below says, programming allows Enterprise Miner users to draw on the entire SAS system and extend EM beyond the capabilities provided in the drop-downs. (Full programming capability has been included for years, not just in Enterprise Miner but also in Enterprise Guide, a point-and-click interface that’s very good but which falls short of the capabilities of Enterprise Miner. See the papers below that talk about how you use the programming interface in Enterprise Guide.)
My own view of this is that while it’s nice to have drop-downs for something relatively simple like Enterprise Guide, I’d prefer to use programming statements, so that’s what I do using the main SAS application (comparable to using an IDE for R/Python). That’s because the programming interface for Enterprise Guide is “too good”: it’s better than the programming window in the main SAS app. For now, I actually want to be staring at a blank window to input programming rather than having SAS suggest code, files, etc. as I start typing text. If I don’t know the right programming statement or get the syntax wrong, I want to know that, learn the right code, and not make the same mistake again. The types of things you do in Enterprise Guide are relatively “discrete,” and programming statements work best for me (this situation is comparable to what you’d be doing with Python or R, in my opinion).
However, when you get to the enormous complexity of Data Science tasks in big firms with large Data Science staffs, drag and drop applications like Enterprise Miner become a necessity. You may be running hundreds/thousands of models using multiple techniques (regression, decision trees, neural networking, perhaps additional input from sources like R/Python), then scoring everything and developing your own unique model/equation. You could no doubt write the code for all that but it would be enormously time consuming.
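To make the point about hand-coding concrete, here is a minimal sketch of what "fit several model types, score them, pick the best" looks like when written by hand for a toy case. It is only an illustration under stated assumptions: it uses Python with the standard library only, the data is synthetic, and every name in it (fit_mean, fit_linear, fit_stump, compare_models) is hypothetical and has nothing to do with Enterprise Miner or any SAS API. It covers one predictor and three simple candidates; a real project multiplies this by many variables, techniques, and data sets.

```python
import random
from statistics import mean

def fit_mean(xs, ys):
    """Baseline candidate: always predict the training mean."""
    m = mean(ys)
    return lambda x: m

def fit_linear(xs, ys):
    """Ordinary least squares with one predictor (closed form)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def fit_stump(xs, ys):
    """One-split regression tree: choose the threshold with the lowest
    training squared error. Assumes at least two distinct x values."""
    best = None
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        ml, mr = mean(left), mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda x: ml if x < t else mr

def mse(model, xs, ys):
    """Mean squared error of a fitted model on (xs, ys)."""
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys))

def compare_models(train, valid, fitters):
    """Fit every candidate on train, score it on valid,
    and return (scores_by_name, name_of_best)."""
    tx, ty = zip(*train)
    vx, vy = zip(*valid)
    scores = {name: mse(fit(tx, ty), vx, vy) for name, fit in fitters.items()}
    return scores, min(scores, key=scores.get)

# Synthetic data (an assumption for illustration): y = 2x + 1 plus noise.
rng = random.Random(0)
data = [(x, 2.0 * x + 1.0 + rng.gauss(0, 0.5))
        for x in (rng.uniform(0, 10) for _ in range(200))]
train, valid = data[:150], data[150:]

FITTERS = {"mean": fit_mean, "linear": fit_linear, "stump": fit_stump}
scores, best = compare_models(train, valid, FITTERS)
print("best model:", best)
for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: validation MSE = {score:.3f}")
```

Even in this toy form, each added technique needs its own fitting routine and every comparison needs its own bookkeeping, which is the work a platform's prebuilt modules take off your hands.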
At some point, however, you may need to push the analysis of a specific aspect even deeper than the platform’s drag-and-drop modules allow, and at this juncture you can use programming statements, since there is no effective limit on what you can do with programming. Needless to say, if you’re using Enterprise Miner and are regularly using programming statements to extend Enterprise Miner’s capabilities, then you are in a very different world than most people who use Enterprise Guide or “discrete programming,” whether that means Base SAS, R, or Python. This sort of capability probably exceeds what is normally necessary in firms, and probably doesn’t get much explicit attention from the syndicated research firms or from companies (for example, as a job requirement), since few people have the combined expertise in Enterprise Miner, SAS database programming, and statistical programming that allows them to implement these kinds of super-advanced capabilities.
So the question for Data Science is not “drag and drop or programming?” It all depends on the complexity of what you’re doing (i.e. is it “discrete” analysis, or work that requires a full-fledged Data Science platform) and whether you know programming. Enterprise Guide and Enterprise Miner are agnostic about what you use: you can do either, or both in the same workflow, with each complementing the other.
In terms of the original question posed in this thread regarding R, if (like me) you’re fine with paying a yearly license fee for Base SAS, then R is not the best option for learning or doing data science (I can fill in whatever gaps remain with some R/Python after learning SAS programming.) If you don’t like license fees, go with R but then you “pay another price” in terms of increased complexity/learning difficulty/productivity issues, and also face the question of how to deal with the need for data science platform capabilities when R doesn’t seem to have anything like this (one way to partially address this shortcoming, I guess, is to use a cloud service like Azure ML.) Regardless of whether you use SAS programming, R, or something else, if you are good at programming you have the option of extending the capability of the data science platform you use on the rare occasions when that’s required.
SAS Enterprise Miner 14.1 Reference Help
Chapter 75 – SAS Code Node (p. 1121)
“The SAS Code node enables you to incorporate new or existing SAS code into process flow diagrams that were developed using SAS Enterprise Miner. The node extends the functionality of SAS Enterprise Miner by making other SAS System procedures available for use in your data mining analysis. You can also write SAS DATA steps to create customized scoring code, conditionally process data, or manipulate existing data sets. The SAS Code node is also useful for building predictive models, formatting SAS output, defining table and plot views in the user interface, and for modifying variables metadata. The SAS Code node can be placed at any location within a SAS Enterprise Miner process flow diagram…..The exported data that is produced by a successful SAS Code node run can be used by subsequent nodes in a process flow diagram….The code pane is where you write new SAS code or where you import existing code from an external source. Any valid SAS language program statement is valid for use in the SAS Code node with the exception that you cannot issue statements that generate a SAS windowing environment.”
Technical Papers on Programming in SAS Enterprise Guide