In all the discussion about Big Data’s capabilities, it’s easy to lose sight of where it gathers its input. To understand that, we have to become familiar with a process called Extract, Transform, and Load (ETL).
Many companies take data from their CRM, ERP, and other stores and gather it together into a single Data Warehouse. The process of gathering this data is the “Extract” of ETL. Changing it into the format that the Data Warehouse database prefers is “Transform”, and putting it into that database is “Load”. Because of the predictable format of this data, we call it “structured”.
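To make the three steps concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `crm_rows` and `erp_rows` records, their field names, and the `customers` table are invented, and an in-memory SQLite database stands in for a real Data Warehouse.

```python
import sqlite3

# Hypothetical source records; the field names are invented for illustration.
crm_rows = [{"customer": "Acme Corp", "signup": "03/15/2021"}]
erp_rows = [{"cust_name": "Globex", "created": "2020-11-02"}]


def extract():
    """Extract: gather raw records from each source system."""
    return [("crm", row) for row in crm_rows] + [("erp", row) for row in erp_rows]


def transform(source, row):
    """Transform: map each source's layout onto one warehouse schema."""
    if source == "crm":
        # The assumed CRM stores dates as MM/DD/YYYY; convert to YYYY-MM-DD.
        month, day, year = row["signup"].split("/")
        return (row["customer"], f"{year}-{month}-{day}", source)
    # The assumed ERP already uses YYYY-MM-DD.
    return (row["cust_name"], row["created"], source)


def load(conn, records):
    """Load: write the uniform records into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(name TEXT, signup_date TEXT, source TEXT)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", records)
    conn.commit()


def run_etl():
    """Run all three steps and return the warehouse contents."""
    conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
    records = [transform(source, row) for source, row in extract()]
    load(conn, records)
    return conn.execute(
        "SELECT name, signup_date, source FROM customers ORDER BY name"
    ).fetchall()
```

Calling `run_etl()` returns rows that all share the same columns and date format, whichever system they came from. That uniformity is what "structured" means here.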
Pretty simple so far, right? Hadoop repeats this process, but adds data from sources beyond the Data Warehouse. Potential sources include:
- Social media feeds (Facebook searches, Twitter feeds, etc.)
- Stock exchange streaming feeds
- Email records
- Transactional feeds
- Web server logs
- Images (satellite imaging, product pictures, X-rays or CAT scans; use your imagination…)
The list is almost limitless. Because most of this data doesn’t fit into the traditional structure of relational databases, we call it “unstructured”.
Hadoop’s strength is its ability to gather structured and unstructured data (Extract), put it into a searchable format (Transform), and keep it all in one place (Load).
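The idea can be sketched in a few lines of plain Python. This is not Hadoop code and uses no Hadoop API; a real cluster would spread this work across many machines. The sample records, the log-line pattern, and the `search` helper are all assumptions made up for illustration.

```python
import re

# Assumed pattern for a common web server access-log layout.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3})'
)


def extract():
    """Extract: gather structured rows and raw text lines (invented samples)."""
    warehouse_rows = [{"customer": "Acme Corp", "region": "EU"}]
    log_lines = [
        '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /pricing HTTP/1.1" 200 2326',
        "not a log line",
    ]
    return warehouse_rows, log_lines


def transform(warehouse_rows, log_lines):
    """Transform: turn every input into a dictionary that can be searched."""
    records = [{"kind": "customer", **row} for row in warehouse_rows]
    for line in log_lines:
        match = LOG_PATTERN.match(line)
        if match:
            records.append({"kind": "web_hit", **match.groupdict()})
        else:
            # Keep text we cannot parse instead of discarding it.
            records.append({"kind": "raw", "text": line})
    return records


def load(store, records):
    """Load: keep everything together in one store (here, a plain list)."""
    store.extend(records)
    return store


def search(store, **criteria):
    """Return the records whose fields match every given criterion."""
    return [r for r in store if all(r.get(k) == v for k, v in criteria.items())]
```

With `store = load([], transform(*extract()))`, a call such as `search(store, kind="web_hit")` finds the parsed log entry sitting alongside the customer row. Note that text which fits no known pattern is kept as-is rather than thrown away, so it can still be processed later.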