Article Preview
Column
Issue: 14.4 (July/August 2016)
Author: Craig Boyd
Author Bio: Craig Boyd is currently a data architect and senior consultant for a growing business intelligence consultancy. But in his 19 years of IT experience, he has been everything from a PC Technician to an iSeries System Administrator, iSeries Programmer, Sr. Technical Lead, Data Modeler, Data Architect, and Oracle DBA. He lives in the great state of Texas with his wife and two kids.
Article Description: No description available.
Article Length (in bytes): 10,922
Starting Page Number: 77
Article Number: 14410
Related Link(s): None
Excerpt of article text...
It seems that with each passing year, more and more acronyms, terms, and accompanying technologies are thrust upon us, demanding our attention. I want to take a step back and help clear up some of the confusion around data architecture terms and concepts, and perhaps touch on some of the technologies that go with them. In this particular article, I am going to examine big data and its related terms and concepts.
The first term we are going to discuss is big data itself. Big data is "high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation" (Gartner 2012).

The high-volume aspect is the one we can most easily wrap our heads around, except that the magnitude being referenced here is not megabytes or gigabytes but terabytes or petabytes. The velocity aspect refers to the rate at which we receive data from all of one's various sources. The tricky piece of this is that we typically have no control over how fast we receive data from many of these sources, or over the cycle of that flow. For example, a company might have a steady influx of seventy tweets per week, but if it releases a new product that captures the public's interest or, worse, does something to earn the public's ire, those seventy tweets a week may quickly become 7,000, sustained for several days, perhaps weeks. Making sure your infrastructure can handle such sudden changes is critical in this day and age.

High variety refers to the fact that we are not consuming just one or two different types of data. You will often receive text, audio, video, and machine input: YouTube videos, mentions on streaming audio, tweets, Facebook posts, or whatever the newest rage is. Being able to store and aggregate all these different types of data is a must if you want to understand how your company is perceived in the public arena.

The cost-effective aspect is important for a couple of reasons. First and foremost, companies have come to loathe paying for a multi-year data warehouse project that produces nothing during that time. Data warehouse projects that are not run in an agile fashion, constantly and consistently producing usable products for the end users, are almost certainly doomed to failure. Projects that deliver results the business benefits from every 6 to 12 weeks, on the other hand, are very palatable to end users and make them more open-minded about the expenditures that accompany this approach. The remaining aspects of the definition, enhanced insight, decision making, and process automation, are important because they influence which other related technologies may be implemented and how they are deployed.
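To put rough numbers on such a burst, consider a quick back-of-the-envelope sketch. The per-tweet size (about 2 KB including metadata) and the reading of the 7,000 figure as a weekly rate are illustrative assumptions, not figures from the text:

AVG_TWEET_BYTES = 2 * 1024  # assumed size of one tweet plus its metadata

def weekly_bytes(tweets_per_week: int) -> int:
    """Inbound tweet volume for one week, in bytes."""
    return tweets_per_week * AVG_TWEET_BYTES

baseline = weekly_bytes(70)    # the steady seventy tweets per week
burst = weekly_bytes(7_000)    # the spike after a product launch or misstep

print(f"baseline: {baseline / 1024:,.0f} KB/week")
print(f"burst:    {burst / 1024:,.0f} KB/week ({burst // baseline}x baseline)")

Even at these modest assumed sizes, the inbound rate jumps two orders of magnitude overnight, which is the sizing problem the velocity aspect is getting at.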
It is important to note that big data strategies are not necessarily the sole domain of enterprise companies. There are, of course, appropriate cases for mid-sized companies to deploy this particular branch of data architecture, but enterprise-level companies are typically best equipped to do so because of the volume of data and the associated costs.
The next term we will address, related to big data, is the data lake. A data lake is "a collection of storage instances of various data assets additional to the originating data sources. These assets are stored in a near-exact, or even exact, copy of the source format. The purpose of a data lake is to present an unrefined view of data to only the most highly skilled analysts, to help them explore their data refinement and analysis techniques independent of any of the system-of-record compromises that may exist in a traditional analytic data store (such as a data mart or data warehouse)" (Gartner).

This definition, I think, does a good job of explaining what a data lake is, but I want to add a couple of comments. Data lakes are fairly straightforward to implement, since they are nearly exact copies of the source systems from which they came. There can be variations, though. For example, it might be beneficial to keep snapshots of the source systems, which means either timestamping the data as it is brought in or using some kind of third-party tool to create point-in-time snapshots. The bottom line is that we are talking about storing huge volumes of data, and it is best that this be done on the most cost-effective medium possible, which might also mean storing the data in a DBMS that differs from the source. If that is an option, it should most certainly be explored. Like big data, data lakes are typically the domain of enterprise-size companies, but they can be found in some mid-sized companies as well.
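As a minimal sketch of that timestamping approach: each raw record lands in the lake as a near-exact copy of the source, with a load timestamp added so point-in-time views can be reconstructed later. The record layout and the land_record() helper are illustrative assumptions, not part of the Gartner definition:

from datetime import datetime, timezone

def land_record(raw: dict, lake: list) -> None:
    """Copy a source record into the lake, stamped with its load time."""
    stamped = dict(raw)                                 # exact copy of the source fields
    stamped["_loaded_at"] = datetime.now(timezone.utc)  # the snapshot key
    lake.append(stamped)

lake: list = []
land_record({"order_id": 1001, "status": "OPEN"}, lake)
land_record({"order_id": 1001, "status": "SHIPPED"}, lake)

# Both versions of the row are retained; filtering on _loaded_at gives a
# point-in-time view without a third-party snapshot tool.
for rec in lake:
    print(rec)

The design choice here is append-only storage: nothing from the source is overwritten, which is what preserves the "unrefined view" the definition calls for.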
Typically, there is a process flow to big data. The various stages of this flow are: Collect Data, Load Data, Store Data, Transform Data, Report, and Analyze Data. Please refer to Figure 1 to see how these different pieces flow together.
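As a minimal sketch of how these stages chain together, here is the flow reduced to function stubs. Only the stage names follow Figure 1; the stub bodies and sample source names are placeholders:

def collect_data(sources):       # identify and profile the incoming sources
    return list(sources)

def load_data(collected):        # bring the raw data in from each source
    return collected

def store_data(loaded):          # persist it on cost-effective storage
    return loaded

def transform_data(stored):      # cleanse and shape it for consumption
    return stored

def report_and_analyze(ready):   # deliver results to the end users
    print(f"{len(ready)} data sets ready for reporting and analysis")

sources = ["orders_rdbms", "support_tweets", "call_center_audio"]
report_and_analyze(transform_data(store_data(load_data(collect_data(sources)))))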
The Collect Data piece involves identifying the data sources that will feed the big data solution. Metrics such as size, growth rate, and type of data (RDBMS, video, audio, streaming, spreadsheets, CSV files, etc.) are gathered. Once all the incoming data sources are identified and understood, the Load Data piece must be addressed. This piece is very important because of the downstream impacts it can have if the strategies surrounding it are not carefully thought out. Issues that must be addressed are:
...End of Excerpt. Please purchase the magazine to read the full article.