Tuesday, December 10, 2013

The end of Masters - end of BI lecture!

This week is rather special - my Master's degree comes to an end! It's been a great experience - rewarding simply because I have learned so much and pushed myself to limits I never really imagined!

Well, Business Intelligence, being my major and area of interest, was the course I looked forward to the most (sounds weird, eh?!). The University of Arizona is well known for this course and its professor, so I was pretty excited to see what was in store.

Here is a quick outline of what I learned in the past 4 months:

- Concepts of Data Warehouses
By far, the first 4-5 lectures were the most informative and helpful. I say this because it is definitely a broad topic, not something that you can learn by yourself in a short period of time, and, more importantly, it was taught very well.

- Don't have any biases when it comes to Data Analytics! 
- Your analysis is only as good as your data! With big data comes big responsibilities!

- Importance of Data Quality Analysis

- Network Analysis - through my own project as well as other teams' presentations. That said, I still think there's a long way to go for network analysis to actually look & feel right - and more on that later!

- Google Analytics


To do and not to do! Graphs & More

In this post I'm planning to explore various types of charts. To be precise - when to use certain kinds of charts and what not to do.

OBIEE - Intelligent or not?

Something annoying about Business Intelligence - OBIEE

Before I give my opinion on the tool, let me give an overview of what exactly I did with OBIEE.

To begin with, we were given a Cloud Airlines data warehouse.

We were to perform our analysis on the various facts, dimensions and then try to answer a few questions.

Here is a sample of what the questions were like:

1.    Flight Delay Analysis: Are there any general patterns of flight delays for Cloud Airlines? Do flight delays typically occur from a set of specific departure and arrival cities/airports? In general, are there certain times of the year when there are more or fewer delays?

2.    Aircraft Utilization Analysis: How well is each aircraft utilized? Is there any variation in aircraft utilization over time?

3.    Seat Utilization Analysis: How well is Cloud Airlines doing in terms of seat utilization across their various arrival and departure cities, airports, and over time? Are there any seasonal and/or temporal variations?
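To make the flight delay question a bit more concrete, here is a minimal sketch in plain Python of what "delays by route" and "delays by time of year" boil down to: a group-by and an average over the fact rows. The rows, the column order and the `average_delay` helper are all made up for illustration; they are not the actual Cloud Airlines schema, and this is not how OBIEE does it internally.

```python
from collections import defaultdict

# Hypothetical fact rows: (departure_airport, arrival_airport, month, delay_minutes).
# These values are invented for illustration; they are NOT from the Cloud Airlines warehouse.
flights = [
    ("PHX", "LAX", 1, 35),
    ("PHX", "LAX", 1, 5),
    ("PHX", "LAX", 7, 0),
    ("TUS", "DEN", 1, 50),
    ("TUS", "DEN", 7, 10),
    ("TUS", "DEN", 12, 60),
]

def average_delay(rows, key):
    """Group rows by key(row) and return the mean delay (in minutes) per group."""
    totals = defaultdict(lambda: [0, 0])  # group -> [sum of delays, number of flights]
    for row in rows:
        bucket = totals[key(row)]
        bucket[0] += row[3]
        bucket[1] += 1
    return {group: total / count for group, (total, count) in totals.items()}

# Delays sliced by a dimension: route (departure, arrival) or month of the year.
delay_by_route = average_delay(flights, key=lambda r: (r[0], r[1]))
delay_by_month = average_delay(flights, key=lambda r: r[2])

print(delay_by_route)
print(delay_by_month)
```

The same pattern, a measure averaged over one or more dimensions, would apply to aircraft and seat utilization; only the measure and the grouping key change.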

Now, as you might imagine, these questions seem quite straightforward when you think about them logically. When you actually get to the data warehouse and try to answer them based on the facts, there comes a bit of trouble.
I had the opportunity to work with OBIEE 10g a few weeks back and my notion that Oracle produces the crappiest of products was strengthened!


Thursday, December 5, 2013

Network Analysis

In a world dominated by social media, it is no surprise that the next big leap would be trying to analyze and effectively use people's interactions on social media platforms.

Network Analysis has been around for a while now, with people trying to analyze complex sets of relationships between members of social systems. I believe that, with the hype around Big Data, social network analysis has gained some spotlight in recent years.

LinkedIn introduced "InMaps", which it describes as "What if you could visualize what your network looks like?  Would your connections form clusters or groups?  Wouldn’t it be great if you could see the way all your connections are related to each other? Even be able to identify the elusive hubs between your professional worlds?" 

Here is an example of my LinkedIn profile's network visualization:

LinkedIn Maps:

Observations:
- Thick Clusters are being formed
- UoA forms a Densely Connected network
- Undirected network 

Network Visualization using Gephi:

- Gephi is another tool meant for network analysis. I used it as part of a project to analyze the behavior of news agencies on Twitter - specifically Bloomberg & The Economist.

Following is the list of information I gathered on initial Analysis:
- Bloomberg has double the tweets compared to The Economist
- Number of followers increased by 1M for The Economist in the past 2 months
- Most tweeted topic with Bloomberg and Economist: Culture & Politics
- Most popular Hashtags: #Shutdown & #Syria
- Geographically significant areas for agencies
  - Bloomberg - USA
  - The Economist - UK
- Days during which news agency is less active
  - BloombergNews: Sunday & Monday
  - TheEconomist: Consistent

Network Developed:



- Comparison of news agencies based on:
  - Rate of Spread
  - Life Span
- Rate of spread for the top tweet is faster for Bloomberg
- The original top tweet for Bloomberg started at Sep 27 05:01:52 and the last retweet was done at Nov 14 06:12:23
- The original top tweet for Economist started at Oct 3 11:43:01 and the last retweet was done at Oct 3 23:48:09
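
The life span of each top tweet is just the gap between the original tweet and its last retweet. Here is a small sketch of that arithmetic using the timestamps above. The timestamps carry no year, so 2013 is an assumption on my part (based on when this was posted), and the `life_span` helper is a name made up for this example.

```python
from datetime import datetime

FMT = "%Y %b %d %H:%M:%S"

def life_span(first_tweet, last_retweet, year=2013):
    """Return the time between the original tweet and its last retweet.

    The year is an assumption: the timestamps in the post do not include one.
    """
    start = datetime.strptime(f"{year} {first_tweet}", FMT)
    end = datetime.strptime(f"{year} {last_retweet}", FMT)
    return end - start

bloomberg = life_span("Sep 27 05:01:52", "Nov 14 06:12:23")
economist = life_span("Oct 3 11:43:01", "Oct 3 23:48:09")

print("Bloomberg top tweet life span:", bloomberg)
print("Economist top tweet life span:", economist)
```

This works out to a little over 48 days for the Bloomberg tweet and about 12 hours for The Economist's.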

Sunday, October 27, 2013

Balanced Scorecard & More



DataWarehouses - What happens in Reality?
Learning Data Warehouses 101 brought to light quite a few concepts that were fuzzy in my brain, but as a good read always does, it raised more questions.
Before I pour out my questions and thoughts, let me write about DW as I have understood it, with the help of an example:
Consider the airline industry - well, I have read and heard a lot about this, so that's the example I'm going for. Say the Senior Management of Southwest wants to know how our business is performing and expects me to present a full picture. Each of these big shots wants to know about a particular aspect - obviously, they don't tell me what it is they want to see - they probably don't even know!
I begin digging and try to figure out the entire process.
Multiple Data sources – As one would expect, there are a lot of sources of data – e.g. customer-related data, supply-related data from vendors, data about my own employees which is internal to the company, inventory and so on. These are termed the Operational Source systems.



Anyone who knows DW has made or seen this diagram, and with that assumption, I am not going to delve into why we need an ODS or what a Data mart is.
So, coming back to my airline example, I now begin to build my report, trying to incorporate what the Senior Management would really like to see.
1.       How has the overall revenue increased from the previous Quarter to the current Quarter, and what do the forecasts look like? – This is easy enough to present because I know I need to get this number from my Warehouse section.

Big Data Symposium

You hear about and use a multitude of Social Media – Twitter, Facebook, Pinterest, Vine, LinkedIn, Google+; I could fill more than a couple of lines with these, but you get the picture – and obviously, this has led to the generation of massive amounts of data. More recently, the term “Big Data” has been thrown around and used as just another term. This comic, I felt, was more real than funny!
Last year, one of my Professors raised a simple question - can you quantify and tell me how much data is being generated? Some 50-odd students sitting in the class murmured numbers, because we all knew that there was a ‘huge’ amount of data – but how much was it? Petabytes? Is that even a correct term?!
The term ‘big data’ is being used in a variety of industries – Health, Government, Social media, Commercial & Retail, Sports, etc. Now is the time when people are wondering – okay, so I have this huge pile of data – garbage & useful – how do I put it to use? This is where investing in Big Data comes into play. Millions & billions of dollars are being invested to dig through data and get some meaning out of it – i.e. Analytics.
Much like everyone else, the University of Arizona is also edging towards educating students about Big Data and, as a result, a Big Data Symposium was held on October 10, 2013.
To be honest, I went in thinking: let's learn something and not criticize it for not taking my breath away. It was quite exciting to walk into a room full of people who were immersed in the world of data; I even expected a few of the details to be beyond my comprehension.


We had some great speakers:
1.    Brian Gentile, Chairman & CEO, Jaspersoft
2.    Tim Hood, Global V.P., Strategic Technologies, Chief Solution Architect, Retail Industry, SAP AG
3.    David Cowart, Director Strategic Solutions, Mandiant
4.    Michele Polz, Head of Patient Insights, Sanofi and Mikki Nasch, Co-Founder, AchieveMint
5.    Zaheer Benjamin, VP of Business and Basketball Analytics, Phoenix Suns
6.    Brenda Dietrich, IBM Fellow and V.P. Strategy & CTO for Business Analytics
7.    Darren Stoll, GVP, Interactive Marketing - Operations and Analytics, macys.com
8.    Kerem Tomak, Vice President, Marketing Analytics, macys.com
9.    Sudha Ram, Professor of MIS, Director, INSITE Center for Business Intelligence and Analytics

The most important concept to remember when it comes to Big Data is the 4 V’s.
Volume – As expected, this refers to the vast amount of data being generated via various platforms.
Velocity – The speed at which data is being generated. “98,000 tweets, 695,000 Facebook status updates, and 11 million instant messages are sent through the Internet every 60 seconds”
Variety – The different forms of data. Consider, for example, internet consumption, probably one of the most important factors contributing to the variety of data; a close second is the data being generated in Healthcare.
Veracity – How far the data can be trusted. In my opinion, this is the most important factor contributing to the effort & skills needed in Big Data. Why? With the speed at which data is being generated, it could soon become outdated. Take, for example, a superstore such as Walmart during Christmas time. Say customers are actively posting about Christmas lights being out of stock. The news spreads fast through Social Media. Next thing you know, customers aren’t coming to Walmart and are instead going to its competitor. Is it of any use if Walmart stocks up on the lights after Christmas? Walmart just lost a whole lot of $$$$$, not to mention many unsatisfied customers.


That’s just Big Data explained in simple words! 

Monday, September 16, 2013

My first try at Public Data Visualization

I am a bit of a Data Enthusiast (wanting to be a good storyteller) - well, they call it many names, and I'm not too sure where I fit in right now, but among the things that fascinate me are: Data, Exploring Data & Deriving meaning from Data. If only it were as easy as it sounds. Hell, it doesn't even sound easy to me anymore.

I do have a tendency to ramble on - even in writing - so I'm going to cut to the chase and write what I really wanted to in the first place.

So, I have been on the lookout for open data sources that would help me derive meaning and visualize it in a way that helps tell a story. Despite the huge amounts of data available, you'd be surprised at how little of it is clean. I think an even bigger problem, which I really need to find a solution to, is this: Okay, so I have the data - 100,000 rows, for example. What do I really want to use it for and how am I going to use it?

Over the past week and a half - I have been playing with ESPN's cricket data on Sachin Tendulkar. This is what I have been able to build:

Things I learnt:
  • In ODIs, 1998 was the time when SRT was at his peak and India won
  • Despite his great record in terms of strike rate and # of 100s, I couldn't help but notice that the last 3 years before he retired from ODIs were the ones in which he made the fewest runs of his career.
UPDATE [Sept 22, 2013]
After a lot of drilling/slicing and doing what not, I decided to update this post with a few more of my analyses: Included Test data in addition to ODIs :)

Sachin v/s other players - 1998
  • In the matches that India played with Sachin in the playing XI, against Australia, Sri Lanka, Pakistan, South Africa and England, he wasn't really the deciding factor of the match, in the sense that he did not really single-handedly win the game for India. In fact, if you see the tab titled "Sachin v/s other players" below, you will notice that in 1998, he was hardly ever responsible for more than half the runs that India scored.
Sachin - timeline
  • Well, I must say, this took me a while to build, but being a newbie, I am still not certain if it adds value. I tried to show SRT's performances (# of runs) against various teams throughout his career. It's a motion chart, and the play button will work only if you download the workbook and open it on your local machine.
Last - Ground & Tournament
  • For the last tab, I began wondering what percentage of SRT's total runs were scored in big, important matches, and resorted to the Funnel chart. This confirmed my initial hypothesis that he scored the most in the Preliminary Matches. The table shows the exact number of matches played in major tournaments - a great strike rate, and he seems to like playing against the Aussies (ignoring the Match Result, of course).
  • This leads to the final bit - the grounds where SRT got his runs. In ODIs he scored the most runs in Asia, but in the case of Test matches it's more or less balanced.