American Community Survey data in R

American Community Survey data in R

I met up with an old grad school friend a few weeks back. He’s an accomplished professor of demography, basically the statistical study of populations, particularly surrounding births, deaths, and fertility. Where he once traveled extensively to conduct demographic surveys in remote parts of the world, my friend’s now content to work with available U.S. census data to develop models and test hypotheses. Like me, he’s a stats/techie guy accomplished in statistical packages such as SPSS, SAS, and Stata. He notes he’s played around with R but was warned early on it couldn’t scale for his demanding needs. We then morphed to an engaging discussion about open source and the evolution of statistical software. My sense was that he was wary of the R nay-Sayers, but found his statistical comfort zone hard to abandon.

Alas, though perhaps not appreciated in many statistical settings, the R of 2018 is quite a step up over the R of 2008, driven significantly by contributions such as tidyverse and data.table from its extensive open source ecosystem. Indeed, the R code I write today would be unrecognizable to a core R stickler of even 5 years ago. And as for R’s very real RAM limitation, the 64 GB RAM, 2 TB SSD computer I purchased 2 years ago for $2500 has handled 95% of the challenges I’ve thrown at it with aplomb. The belligerent 5% I take to the Cloud and Spark.

As our discussion progressed, we opted to “challenge” R with the type of work he now does in his research. We settled on the 5 year (2012-2016) ACS Public Use Microdata Sample files PUMS as our data source. The PUMS “are a sample of the actual responses to the American Community Survey and include most population and housing characteristics. These files provide users with the flexibility to prepare customized tabulations and can be used for detailed research and analysis. Files have been edited to protect the confidentiality of all individuals and of all individual households. The smallest geographic unit that is identified within the PUMS is the Public Use Microdata Area (PUMA).”. Tellingly, there’re freely-available R packages demography and censusapi that make life a lot easier for R demographers.

The data we used consists of a series of household and population files. Each household has one or more persons (population), while each person is for one and only one household. It turns out that there are about 7.4M households consisting of 15.7M persons represented in this sample. A common key, “serialno”, connects the data.tables. Our task was to load both household and population data, with the mandate of joining the two to build topic-oriented data sets on demand. The raw household and population data are downloaded from zip files consisting of 4 CVS’s each. 

The development environment consists of Jupyter Notebook with a Microsoft R kernel. R’s tidyverse, data.table, and fst packages drive the data build. What follows is the implementation code.

Read the entire blog here.

American Community Survey data in R

The R Medicare Factor

The R Medicare Factor

Several years ago, I came across a report that led me to a new data set revolving on Medicare payment and utilization for physicians and other health care providers. The data created quite a storm, highlighting what appear to be extreme payments to individual physicians and practices. In 2012, just 100 physicians received 160 M in Medicare reimbursements, with one Florida ophthalmologist netting over 21 M.

To get a feel for the data, I confirmed much of the analyses that were published early on by the mainstream media. Over time, my interest in the data grew, as additional annual files were added. At this point, there are four text files, 2012-2015, each in excess of 9 M records. An analysis pattern for these data can be summarized as follows:

1) readily downloadable files, generally delimited or xls. These files can either be copied by hand or moved programmatically. 2) multiple files, often dimensioned by time or other variable(s). 3) a common, consistent format to the files, so that “reads” will work similarly on each and the data can be “stacked”. 4) a structured file naming convention, either given or assigned, that provides dimensional info for data loads. Date/Time is the most common cut.

My interest in the data has as much to do with its format as it’s content. Those familiar with R have not doubt been exposed to the factor data type, used to store categorical or ordinal data. Factors consist of levels and labels, and are represented as one integer per record signifying the level and pointing to the relevant character label that is stored only once. Factors “compete” with character attributes, since it’s generally true that a factor can be stored as a character and vice-versa. Historically, factors have been used mostly to represent dimensional attributes such as gender, race, or income category, but in theory at least, there may be an opportunity for factor variables to save storage for any character column where there are relatively few unique column values compared to the total number of records.

With over 37 M records and 30 attributes as of today, the size alone of this data creates challenges which can help answer questions that smaller fabricated data sets cannot. And many of the attributes such as name, street address, and city are inherently character, allowing testing as to how they might optimally be stored. So for me, the decision to experiment with the character storage options was a no-brainer.

To conduct the tests, I downloaded the four annual files to my notebook, developing scripts using Jupyter Notebook and Microsoft R 3.4.3. With the R data.table package, I created two versions of the medicarephysician structure, the first storing character columns as character, the second storing the same as factors. I compared the memory requirements of each data.table, and contrasted performance/size with writing/reading of serializable export files.

The results follow…..

Read the entire article here.

The R Medicare Factor

College Hoops Conference Rankings

College Hoops Conference Rankings

It’s that time of the year again — the start of March Madness, 2018. Last Sunday, the selection show announced the 68 teams, and the preliminary round games start tonight. Of the 68 schools, 32 were automatically selected as conference champions, while the remaining 36 received at large bids.

What determines what 36 teams are chosen for at large slots? That generally boils down to won-loss performance against strength of schedule. Teams playing well against tough competition are rated highly. On the other hand, a solid record against mediocre competition might not make the cut. The 32 D1 conferences are quite unequal in hoops. The Atlantic Coast Conference (ACC) historically has been the cream of the crop, so one would expect multiple at large bids there. The Colonial Conference, on the other hand, might get no at larges, sending only the conference champ.

A month or so into the season, once teams have played ten games or so, the ratings systems such as the rating percentage index (rpi)  kick in, providing a metric to rank all teams from 1 to 351 for basketball. rpi is composed of a team’s winning percentage (25%), its opponents’ winning percentage (50%), and the winning percentage of those opponents’ opponents (25%). There’s also an adjustment for home, away, and neutral venue games. A team’s strength of schedule is thus critical to it’s rpi potential, giving large school conferences a big advantage. Other metrics such as kenpom refine the rpi.

Since the 32 conferences are very unequal in their rpi and kenpom metrics, one would predict the “majors” to differentiate on rankings derived from rpi/kenpom, and subsequently to get more at large bids. And indeed that played out in spades this year, with the ACC hoarding 9 tournament slots (champion + 8 at large), the Southeastern Conference (SEC) 8, the Big 12 (B12), 7, and the BIG East (BE), 6. 

Read the entire article here.

College Hoops Conference Rankings

Tutorial Day at Strata Data San Jose, 2018

Tutorial Day at Strata Data San Jose, 2018

Tuesday of Strata Data Conference is my favorite of the four days. The calm before the storm of the keynotes and short presentations of Wednesday-Thursday, Tuesday revolves on half day training sessions that afford reasonably deep dives into technical data science topics. This year my choices were Using R and Python for scalable data science, machine learning, and AI in the AM and Time series data: Architecture and use cases in the afternoon.

I was somewhat wary going into the first session, since the presenters were from Microsoft, which markets a commercial version of open source R, as well as Azure, its comprehensive set of cloud services that competes with AWS. My concern was that the technology presented would be geared to uniquely Microsoft solutions that wouldn’t generalize and thus be of limited value in non-Microsoft environments. It turns out I was both right and wrong: yes, the solutions revolved on Azure and utilized Microsoft extensions to R and Python; but no, at the same time, the material was of significant value for a non-Microsoft-committed developer like myself.

The presentations were on featurization and active supervised learning of data sets with limited percentages of final “y” outputs. In these cases, labeling is often very expensive and otherwise painful, so that labels are often built over time and used to construct training data in drips and drabs. Two interesting use cases, involving both text and image classification, were comprehensively reviewed.

The development environments were Jupyter Notebook for Python/Spark and Rstudio/RMarkdown for R/Spark. The illustrated code was comprehensive and showed both the data build and machine learning processes. While I wasn’t hands-on as others during the session, I was able to follow the thinking, and will download the code and run it on Azure at a later time.

This was a very solid tutorial by senior data scientists from Microsoft. Not only were they knowledgeable presenters, but they also well-covered the large classroom to handle live technical issues. Two thumbs up.

I had the opportunity to take a time series analysis class in grad school with the venerable George Box back in the day. And in the 40 years I’ve been in the data/statistics work world since, I’d guesstimate that 80% of my effort has been devoted to assembling data, with 80% of the remaining statistics work given to forecasting.

I must acknowledge that I sometimes interchange time series with forecasting but, as confirmed this afternoon, the two are quite different. As I now appreciate, time series in its most general usage is much larger than forecasting – having to do with data assembly, management, and analytics.

So the time series session was more about the latest in storage technologies such as S3 and HDFS, streaming products like Apache Flink, Apache Kafka, Apache Storm, and Spark Streaming, and access engines Cassandra and Spark SQL, than it was about specific forecasting algorithms.

But no worry for me. With a background in R, Python, SQL, and Spark as well as the time series data management packages in R, I was able to follow along seamlessly. Especially the first two hours that were devoted to the logic of time series concepts like lead and lag, tumbling windows, sliding windows, session windows, event times vs processing times, and inflection points.

I did struggle with configuring the hands-on Kafka and Spark Streaming tutorials – but what the heck, I’m more on the data/analytics side of data science than I am on adminstration/app dev. Students implementing on streaming platforms in their current jobs hit the technical mother lode with instructor Ted Malaska. Malaska is very accomplished, if a bit disorganized at times. He was at his best discussing specific technologies like Kafka and Spark and answering participant questions. In all a quite productive session, perhaps half a step below the first. 1.5 thumbs up.

Tutorial Day at Strata Data San Jose, 2018

"Kindof" Big Data in R

“Kindof” Big Data in R

Ask for feedback from just about any critic of the R statistical package and you’ll hear two consistent responses: 1) R is a difficult language to learn, and 2) R’s data size limitation to physical RAM consigns it to toy academic applications. Having worked with R for over 15 years, I probably shared those views to some extent early on, but no longer.

Yes, R’s a language where array-oriented processing and functional programming reign, but that’s a good thing and pretty much the direction of modern data science languages — think Python with Pandas, NumPy, SciPy, and scikit-learn. As for the memory limitation on data size, that’s much less onerous now than even five years ago.

I point this out as I develop a Jupyter Notebook on a nifty ThinkPad with 64 GB RAM, 1 TB SSD, and 1 TB SSD that I purchased two years ago for $2500. That’s about annual license maintenance for a single seat of one of R’s commercial competitors.

With my 64 GB RAM, I generally haven’t overwhelmed R except when I set out to do so — to find its limits. Indeed, when all is said and done, most consulting R analyses I’ve completed over recent years have been on final R data sizes of 2 GB or less.

My recommendations for R installations are for SSDs along with 64 GB RAM for notebook computers, and SSDs plus 256 GB RAM+ for servers. Memory is a good investment. Also, for legitimately large R ecosystem data, analysts should certainly configure R interoperability with relational/analytic databases such as PostgreSQL/MonetDB, which support virtual memory. The Spark/R collaboration also accommodates big data, as does Microsoft’s commercial R server.

Most R aficionados have been exposed to the on-time flight data that’s a favorite for new package stress testing. For me its a double plus: lots of data plus alignment with an analysis “pattern” I noted in a recent blog. The pattern involves multiple files for input, each of which has the same structure and also dimension information encoded in its name.

I took on the analysis to elucidate a strategy for loading “pretty large” data in R while showcasing favorite packages data.table, tidyverse, and the splendid new fst for “Lightning Fast Serialization of Data Frames for R”. Incidentally, none of these packages is part of core R, but rather the bounty of an exuberant R ecosystem. Sadly, some naysayers are still fighting the can’t do battle from ten years ago — before the ecosystem largesse exploded and changed the R munging landscape.

What follows is an illustration of using a functional programming style plus powerful package capabilities to combine in-memory and file processing. The focus is on data assembly/wrangling rather than broader data analysis. To that end, I downloaded 216 monthly on-time flight files from 2000 through 2017. Each of these consists of 110 attributes and about 500,000 records. With my 64 GB RAM, I was able to load only about 2/3rds of the total data into memory at once. So my strategy became to handle three years at a time using data.table with tidyverse, then to offload to a compressed file with fst. At the end of the work stream, I’d created six fst files for subsequent use. From there I could exploit fst’s speedy access to select a subset of columns and build a new data.table by stacking the files. The power and efficiency of data.table, tidyverse, and fst delivered.

Below is the code used to read the text files and write the fst files — and then to read and stack the fst data to produce sample projected data.tables. The top-level technologies are JupyterLab with an R 3.4 kernel.

See the entire post, here.

“Kindof” Big Data in R

A Common Data Analysis Pattern with a Simple Solution in R

A Common Data Analysis Pattern with a Simple Solution in R

It seems that much of the data analysis work I’ve done over the last few months has followed a “script”. First, identify data, often government-sponsored and freely-available, that’s of keen interest. Next, find the websites that house the data and download the relevant files to my notebook. The downloads might ultimately be one-off or included in the data analysis programs. Finally, load the data into either R or python and have at it with queries, visualizations, and statistical/learning models.

Several examples of recent work come to mind. The first involves data from the Bureau of Labor Statistics on jobs and educational attainment from 1992-2017. On this site, 24 spreadsheets are available detailing monthly statistics on six measures of employment by 4 levels of educational attainment. For a previous blog, I downloaded 12 of these files, including measurements laborforce, employed, and unemployed by educational levels no HS, HS grad, some college, and college grad. I renamed the spreadsheets on my SSD to include indicators of dimension — eg “clf_laborforce_collegegrad.xls”.

The second example, the topic of this blog, has to do with Medicare charges summarized by hospital and diagnostic related group (DRG). At this point there are five files (either csv or spreadsheet) available for download, representing the years 2011-2015. Year is embedded in the website file names.

A third illustration, and the topic of a coming blog, revolves on census data from the American Community Survey.

The pattern behind the three cases includes:

  1. readily downloadable files, generally either csv or xls. These files can either be copied by hand or moved programmatically.
  2. multiple of these files, often dimensioned by time or other variable(s).
  3. a common, consistent format to the files, so that “reads” will work similarly on each.
  4. a structured file naming convention, either given or assigned, that provides dimensional info for data loads. Date/Time is the most common diemnsion.

In this article, I looked at the Medicare charges data that consists of five yearly csv files. I expect the next year in the sequence, 2016, to be posted in the near future. I first downloaded the five files to my laptop, programmatically in this case. I then used Jupyter Notebook and R 3.4 to analyze the data with code that takes advantage of the commonalities. The data.table package along with the tidyverse ecosystem and fast file compression, read/write library fst, plus a functional approach, are central to the work.

This is not your older sibling’s R. None of the highly-performant packages used here are included in core R. Rather, they are the bounty of an exhuberant and capable R ecosystem of developers. Incidentally, if this notebook were written in python, I’d be giving similar accolades to its community.

The remainder of the notebook details the R code to implement the loading of this data into R and some simple subsequent analysis. The structure of the code in this instance is very similar to that of the other examples cited.

To read entire content, click here

A Common Data Analysis Pattern with a Simple Solution in R