
Wednesday, December 12, 2012

What is Big Data



Anyone who reads about data science will be aware of the term “big data”. It’s one of those terms that seems to have multiple definitions, so I thought it was about time to put my own thoughts down on paper and work out what the term means for myself. I also wanted to know whether traditional statistical ideas remain relevant with “big data”.


Business Intelligence Sales Pitch

Let’s get this use of the term out of the way first. Big data is used as a buzzword by Business Intelligence vendors to suggest a solution to almost all business problems: if a business could analyse all the data it owns, it could find both the problems it is facing and their solutions.

The best response to this use of the term is to ask why bother with big data if you can’t get basic customer service issues right.

Robert Plant uses the example of scheduling service calls:

Ever waited hours, in vain, for a repair service to arrive at your home? Of course you have. We all have. Chances are you've also shifted your allegiance away from a company that made you wait like that.

So why do companies spend millions on big data and big-data-based market research while continuing to ignore the simple things that make customers happy? Why do they buy huge proprietary databases yet fail to use plain old scheduling software to tell you precisely when a technician is going to arrive?

Size

The next obvious meaning of big data is the size of the data sets involved: quite simply, these are big data sets.

The size of big data sets is a consequence of technology. If we go back to the early 1900s, when Fisher, Gossett and others were developing sampling theory, data was collected manually. This necessarily meant that sample sizes were small (at least by today’s standards). For example, when Gossett was developing small-sample techniques (such as Student’s t-test), biometricians were using samples comprising hundreds of observations and saw no reason to develop small-sample techniques.

Today, computer technology allows the production and storage of data sets that can be many terabytes in size: click data from a website, for example. An ecologist can collect numerous, virtually continuous measurements using digital instruments: temperature, wind speed and direction, humidity, sunlight and so on.

However, it is worth remembering that technology has developed over the centuries, and size is relative. Annie Pettit sums it up well when she says that “there is no such thing as big data, just bigger data sets than you are used to working with”. Annie continues by observing that to work with those bigger data sets, “you just need the right tools”.


Tools and Technology

The increasing size of data sets leads us on to tools and technology. When the term big data is used, what is often being referred to is the technology used to process data sets that are large in comparison to the computer resources available. In this context, big data refers to computer software that can handle data sets that are, for example, larger than the RAM available.
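
To make this concrete, here is a minimal sketch in Python of the out-of-core idea: summarising a file too big for RAM by streaming it through memory in chunks with pandas. The file name and column are placeholders, not a real data set.

    import pandas as pd

    # Stream a CSV that may be larger than RAM in fixed-size chunks,
    # keeping only a small running summary in memory.
    # "clicks.csv" and its "page" column are hypothetical placeholders.
    rows = 0
    page_counts = {}
    for chunk in pd.read_csv("clicks.csv", chunksize=1_000_000):
        rows += len(chunk)
        for page, n in chunk["page"].value_counts().items():
            page_counts[page] = page_counts.get(page, 0) + n

    top = sorted(page_counts.items(), key=lambda kv: -kv[1])[:5]
    print(rows, "rows; top pages:", top)

Nothing statistical changes here; the chunking is purely an engineering workaround for limited memory.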

A related use of big data refers to software that can handle data of differing structures and sources. 

For example, think of the word “Hadoop” – here’s what the Cloudera website says:

Apache Hadoop was born out of necessity as data from the web exploded, and grew far beyond the ability of traditional systems to handle it. Hadoop was initially inspired by papers published by Google outlining its approach to handling an avalanche of data, and has since become the de facto standard for storing, processing and analyzing hundreds of terabytes, and even petabytes of data.
Apache Hadoop is 100% open source, and pioneered a fundamentally new way of storing and processing data. Instead of relying on expensive, proprietary hardware and different systems to store and process data, Hadoop enables distributed parallel processing of huge amounts of data across inexpensive, industry-standard servers that both store and process the data, and can scale without limits. With Hadoop, no data is too big. And in today’s hyper-connected world where more and more data is being created every day, Hadoop’s breakthrough advantages mean that businesses and organizations can now find value in data that was recently considered useless.
[http://www.cloudera.com/content/cloudera/en/why-cloudera/hadoop-and-big-data.html]

Here, whilst technologies such as Hadoop enable different analytical approaches, they are themselves analytics-agnostic. This approach to big data provides the infrastructure to work with large data sets in a timely and efficient manner.

Data Structure

The ability to work with large volumes of data also enables the researcher to work with disparate forms of data, or at least with disparate databases and sources. An example would be using geographical location data to link existing data sets to one another.
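
As a toy illustration (entirely made-up data), here is how two otherwise separate data sets might be linked through a shared geographic key, using pandas:

    import pandas as pd

    # Entirely made-up example: attach area-level census statistics
    # to customer records via a shared postcode field.
    customers = pd.DataFrame(
        {"customer_id": [1, 2, 3], "postcode": ["3000", "3220", "3220"]}
    )
    census = pd.DataFrame(
        {"postcode": ["3000", "3220"], "median_income": [52_000, 48_000]}
    )

    linked = customers.merge(census, on="postcode", how="left")
    print(linked)

The join itself is routine; the big data aspect is being able to do it across sources and volumes that would once have been impractical.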

Sampling versus Census

Implicit in many uses of the term big data is the assumption that there is no need to sample: we can analyse all data “because we can”.

What are the implications of analysing “all data” rather than a sample? With big data, we are mostly still sampling; it’s just that the sample sizes are enormous. We can store and analyse click data from a website, but it still remains a sample over time (past, present and future).

First, let’s look at why we analyse a sample rather than a census. The main reason for sampling in a social sciences environment is that taking a census is either too expensive or not practical. 

Once we decide to sample, however, the next decision is what sample size we need: how many cases should we analyse?

In classical statistics, one issue we need to be aware of as sample size increases is that weak effects become statistically significant. This suggests a possible issue with big data: with huge sample sizes, very weak relationships or patterns will be identified, and these effects will not be very insightful.
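
A quick simulation makes the point. The standard error of an estimate shrinks with the square root of the sample size, so even a negligible difference between two groups becomes “significant” once n runs into the millions. This Python sketch (illustrative numbers only) uses scipy’s two-sample t-test:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)

    # Two groups whose true means differ by 0.01 standard deviations:
    # a practically meaningless effect.
    n = 5_000_000
    a = rng.normal(0.00, 1.0, n)
    b = rng.normal(0.01, 1.0, n)

    t, p = stats.ttest_ind(a, b)
    print(f"t = {t:.1f}, p = {p:.1e}")  # p is essentially zero at this n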

Signal versus Noise

Big data often includes a lot of noise. This is obvious with data sources such as social media; anyone who has worked with a Twitter feed will understand.

Big data also involves repetition. A continuous digital record of, say, temperature contains substantial redundant information: the temperature this second is generally not greatly different from the temperature in 15 seconds’ time.
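
A simulated temperature trace (the figures are illustrative only) shows how little new information each additional reading carries:

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulate one day of once-per-second temperature readings: a slow
    # daily cycle plus a little sensor noise (values are illustrative).
    t = np.arange(86_400)
    temp = 20 + 5 * np.sin(2 * np.pi * t / 86_400) + rng.normal(0, 0.05, t.size)

    # Correlation between readings 15 seconds apart is nearly 1,
    # i.e. each new reading is almost entirely redundant.
    r = np.corrcoef(temp[:-15], temp[15:])[0, 1]
    print(f"lag-15-second autocorrelation: {r:.4f}")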

Ability To Predict

The ability to produce better quality predictions is often cited as an outcome or benefit of big data. There is a well-known example of Target in the US being able to predict whether its female customers were pregnant.

[http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html]

This illustrates what I think is one aspect where big data has a different focus to classical statistics. Classical statistics has always had the goal of prediction: its aim is to produce insights that can be generalised to the population. Big data instead seems to have the aim of classifying individual transactions or individual people so that a customised transaction can take place. With classical statistics in a social science setting, the aim is to produce an actionable insight that can be used to inform policy development.

Case versus Variable

Another differentiating feature of big data is the increase in not only the number of cases, but also the number of variables.

Douglas Merrill makes this point:

This means that to get more accurate results, you'll need to expand your data set. There are a couple of ways to scale up the amount of data you are using to make better predictions:

First, you can add more cases.

But the more powerful way is to add signals. Adding signals (columns) allows you to do two things: First, it can reveal new relationships, enabling new inferences — with a new variable, you may see a correlation in the data you never realized before. Second, adding signals makes your inferences less subject to bias in any number of individual signals. You add cases, keeping the same signals, to make your understanding of those variables better. In contrast, you add signals to make it possible to overcome errors in other signals you rely on.

Although much of the discussion of big data has focused on adding cases (in fact, the common perception of "big data" is being able to track lots of transactions), adding signals is most likely to transform a business. The more signals you have, the more new knowledge you can create. For example, Google uses hundreds of signals to rank web pages.

In the early 1970's, Fair Isaac rose to global prominence as a provider of the standardized FICO score that supplanted much of the credit officers' role. The standardized score massively increased credit availability and thus lowered the cost of borrowing. However, FICO scores have their limits. The scores perform especially poorly for those without much information in their credit files, or those with relatively bad credit. It's not FICO's fault — it's the math they use. With fairly few signals in their models, the FICO score doesn't have the ability to distinguish between credit risk in a generally high risk group.

The way to address this is to add more signals. For example, thousands of signals can be used to analyze an individual's credit risk. This can be everything from excess income available, to the time an applicant spent on the application, to whether an applicant's social security number shows up as associated with a dead person. The more signals used, the more accurate a financial picture a lender can get, particularly for thin file applicants who need the access to credit and likely don't have the traditional data points a lender analyzes. 
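
Merrill’s cases-versus-signals distinction is easy to see in a toy regression. In the synthetic data below, the outcome depends on two signals; adding the second signal explains variation that no number of extra rows on the first signal alone could:

    import numpy as np

    rng = np.random.default_rng(1)

    # Synthetic data: the outcome depends on two signals, but suppose
    # at first we only observe the first one.
    n = 10_000
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    y = 2 * x1 + 3 * x2 + rng.normal(size=n)

    def r_squared(X, y):
        """R-squared of an ordinary least squares fit with an intercept."""
        X = np.column_stack([np.ones(len(y)), X])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return 1 - ((y - X @ beta) ** 2).mean() / y.var()

    print(f"one signal : R^2 = {r_squared(x1, y):.2f}")  # about 0.29
    print(f"two signals: R^2 = {r_squared(np.column_stack([x1, x2]), y):.2f}")  # about 0.93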


However, whichever way we increase the size of the sample, we are not qualitatively changing the way we analyse the data.

Summary

So in summary, what is “big data”? How is it different from classical statistics?

Overall, my view is that “big data” is not a new way of looking at the world (as might be the case when comparing Frequentist and Bayesian statistics), nor does it offer new methodologies. Instead, I prefer to see big data as an incremental step, a response to changing and evolving technology.

However, there are a number of issues to consider:


  •  Sampling versus census

One view is that big data avoids the need to sample. However, in my view this is not the case. Big data sets are still samples, albeit large ones. For example, click data from a website is a sample up to a point in time, and the target population is all click data: past, present and future.

The difficulty then with very large sample sizes is that very weak effects will virtually always be statistically significant. This increases the chance of researchers detecting patterns and relationships that are not meaningful, and which don’t expand the insights gained from the data.
The issue here is that the information in a sample does not increase at the same rate as the amount of data in the sample: the precision of an estimate improves only with the square root of the sample size.

A positive advantage of large sample sizes is that rare cases or relationships can be detected more easily. This may well be the case with the Target data mining example above, where data mining enabled the retailer to identify pregnant women, and indeed roughly what stage of pregnancy they were in. A sample of even several thousand Target shoppers (and their transactions) may not have included enough pregnant women to identify meaningful patterns in their purchasing.


  • Big Data has led to increased use and development of data mining and machine learning techniques.  This has had the advantage of expanding the number of tools available to statisticians and data scientists.


The disadvantage with the use of many of these machine learning tools is that users have not gained a good understanding of how they work, and we have “black box” models which detect correlations that may or may not be meaningful. The issue here is that this approach allows the practitioner to avoid developing a theory with a hypothesis about a causal relationship, which the researcher can then test.

The black box use of machine learning algorithms can also run into the classical problem of multiple comparisons: the more comparisons that are made, the more likely it is that a statistically significant effect will appear just by chance alone (see the sketch at the end of this list).





  • One difference big data is making is at the transactional level. With small data (in a social science context), the findings of a survey would be generalised to the target population and used to inform, for example, policy development.
Big data has enabled organisations to skip broad policy development and instead develop customised responses for individuals. This was the “breakthrough” of the Obama data strategy. But it is not a methodological or theoretical breakthrough; it’s just an adaptation to improving technology.


  •    Finally, it’s clear that the technical aspects of big data are beneficial in allowing larger data sets to be processed and disparate data sources to be combined. But this in itself is not a theoretical or methodological breakthrough. We’ve seen that sort of progress before: at one time researchers had to do their calculations by hand, and then mechanical calculators were developed. What we are seeing with big data is a continuation of evolving technology.
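
Here is the multiple comparisons problem in miniature: a Python simulation (pure noise, illustrative only) in which testing enough unrelated variables guarantees that some will look significant by chance.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)

    # 200 candidate "signals" of pure noise, tested against a pure-noise
    # outcome: at alpha = 0.05 we expect about 10 spurious "discoveries".
    n, k = 1_000, 200
    X = rng.normal(size=(n, k))
    y = rng.normal(size=n)

    p_values = [stats.pearsonr(X[:, j], y)[1] for j in range(k)]
    hits = sum(p < 0.05 for p in p_values)
    print(f"{hits} of {k} noise variables look significant at the 5% level")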

Tuesday, December 4, 2012

Job Search 2012



When I found out my then employer was going out of business, I knew I needed to reinvent the way I went about looking for a new job. Business conditions in 2012 are not as prosperous as they were pre-GFC, and the SME sector in which I work has been particularly hard hit, not least in its sense of how bad conditions are.

I needed to revisit my job search strategy and techniques to be successful. 

After a lot of hard work and numerous applications and interviews, I ended up landing a position with a company where I’m very happy. This post outlines my reinvented job search strategy: what worked, and lessons learnt.

What Worked
One Page Summary

These days, it’s stating the obvious to say that advertisers are swamped by the number of applications they receive. It’s important to have a well-presented cover letter and resume; however, there are several things you can do to help your application stand out.

One idea is to prepare a one page summary, as shown below. This works well if you are able to e-mail your application and are not restricted by Seek.com’s two-attachment limit.


[Image: sample one page summary]

I’ve included sections on (from top to bottom, left to right): general description, employment summary, education, training and affiliations, industry and technology summary, and work experience highlights.
As a CPA-qualified accountant, I added the CPA logo in the top left corner to add visual interest.
My experience is that the one page summary worked particularly well with employer-direct recruitment.

Resume Modifications

The resume is a vital part of anyone’s job search tool box. Typically, it’s an evolving document, updated on each job search occasion.  It’s worth taking the time to completely rewrite your resume, taking the opportunity to freshen up both the presentation and the content.


  • Achievements - after a few years, it’s easy to forget the detail of what you accomplished in earlier roles. One forward-looking solution is to keep track of work achievements progressively: every month, e-mail yourself an update of what you accomplished. Also keep track of the key characteristics of your role, for example, what computer systems you used in your job. I’ve completely forgotten what financial system I used at a company I worked for 10 years ago. Whilst in reality it is, in most cases, completely irrelevant, it could be a minor advantage to be able to list the system under your technology summary. If it is a popular system, having used it (no matter how long ago) may get you past the first culling of applications. If you haven’t kept a detailed outline of your achievements, then another solution is to ask former colleagues and staff for their assistance. It’s quite likely that you can validly claim responsibility for the achievements of staff who reported to you.



  • Use highlighting to emphasize particular points in your resume to address the specific position. This won’t work on all occasions, but will work when applications are reviewed on-screen or printed on a color printer.



  • Use hybrid titles. Different organisations use different titles for similar positions, and hybrid titles can be used to emphasize common themes. My current title is CFO; however, that title was more about creating an impression outside the organisation, and it put me at a disadvantage when applying for financial controller positions, as it looked as though I was taking a backward step. The hybrid title “Financial Controller / CFO” emphasized career continuity.



  •  Consider incorporating a one page summary as the first page of your resume. This resume version could then be used when applying through Seek.com, where only two attachments are allowed.


Networking
Ask recruiters for the opportunity to introduce yourself. Each time I received a “we regret to advise you” e-mail from a recruiter, I took that as an opportunity to get another contact name and e-mail address.


E-mail the recruiter requesting an opportunity to meet and introduce yourself. I regularly scheduled annual leave days to allow me to request “meet and greet” opportunities.

These “meet and greet” interviews were very useful;  they provided excellent interview practice opportunities and were a great source of ideas on how to present my career and experience.  

Recruiters were also a good source of ideas on where to look for a new job. For example, one recruiter suggested Geelong as a possible location (closer to home than many of the positions I applied for, but out of mind because it isn’t part of the Melbourne metropolitan area), and another recommended a couple of agencies that I hadn’t considered.

Finally, these “meet and greet” interviews resulted in interviews with two clients.

Once you have met with a recruiter, keep in contact. I’d e-mail recruiters every four to six weeks to help stay top of mind.

Career or Executive Coach

In the day-to-day routine of working and job search, it’s not easy to step back and look at the big picture. That’s why a career or executive coach can be useful.
I worked with Robyn Pulman (http://www.robyn.com.au).
What I liked about Robyn’s approach was that she could move from offering specific job search tips (put your name and phone number on each page of your resume) to providing a conceptual framework for the overall job search process. Job search is essentially a business development activity, and business development is about product or service, communication, and relationships. Looking at job search this way encourages a businesslike and professional approach.
I’d schedule an hour’s conversation with Robyn every four to six weeks, and we would cover issues such as:
  • Developing referrals and networking
  • Effective questioning
  • Achieving high performance
  • Goal setting
  • Career planning
  • Alternative career and life options
  • Interview technique and planning
All up, these conversations helped keep me motivated and energized.

Cover Letter Checklist

A successful cover letter needs to be customized to the position being applied for.  I started keeping a list of all the phrases that I developed to highlight different strengths, experiences and abilities.  This made customizing cover letters a much easier task.


Lessons Learnt

Interview Preparation

After 10 or so interviews, I found myself getting a bit blasé and not preparing sufficiently for interviews.
Solution: prepare an interview checklist, so nothing is overlooked. For example, print out the “About Us” page of the organisation’s website.

Build Up Momentum Earlier

If I had to redo the job search, one thing I would do differently is to “move quicker” and be prepared to spend some money. For example, one idea was to produce a “With Compliments” slip. This would have made it possible to thank recruiters and interviewers in a much more memorable way than by sending an e-mail.

Another idea I had but did nothing with was to get a professional to rewrite my resume. The resume writer was based in the US (http://www.theladders.com/resume) and may well have rewritten all or part of my resume in a way that improved its effectiveness.

Manage Morale and Enthusiasm

Job search can be a tiring and demoralizing process. The longer it takes, the easier it is to start doubting your experience and skills.  The last thing you want to come across as in an interview is desperate – you want to present as interested and interesting.

Plan B

Depending on your circumstances, it may be worthwhile to develop a Plan B. When it looked as though my current role would come to an end before I had obtained a new position, my Plan B was to go back to study full-time and finish off a Master’s degree that I was halfway through.


Project Management

Job search is a tiring activity, so it makes sense not to keep revisiting decisions all the time. Establish a plan, and generally keep to it. In my situation, I decided at what point I would suspend my job search and resume full-time study, at what point I would later resume the job search, and so on.
The aim is to reduce decision fatigue and encourage a sense of purposefulness.

Recommended Reading

Most of these books were recommended by Robyn Pulman.

  • Habits Aren’t Just For Nuns – Robyn Pulman
  • Endless Referrals – Bob Burg
  • Skill With People – Les Giblin
  • Fierce Conversations – Susan Scott
  • Purple Squirrels
  • How to Stop Worrying and Start Living – Dale Carnegie