Data engineering is a crucial subset of the data science toolbox because access to data is required for high-quality analysis and storytelling. A data scientist must understand the tasks and time required for data engineering and be prepared to roll up her sleeves, which may include hammering out low-level scripts or developing company-wide software to create a fully functional data science environment. Data can set an organization up for success or failure, and a data scientist’s work is only as good as her access to data. This post provides questions to evaluate the time requirements for engineering the ideal environment, data access, and resources.
How much time should a data scientist spend engineering her environment, data, and resources?
“All clients lie about how much data they have above all else,” wrote CEO and tech writer Slater Victoroff in Big Data Doesn’t Exist. He tells us to assume that any current or potential clients have only a fraction of the data they advertise externally. Victoroff applies a very low rule of thumb for this fraction: one-thousandth (1/1000). I assume that clients have the purported data but only have access to 25% of it. I believe the difference comes from how “data access” is defined, especially since managers and executives typically interact with aggregated and rolled-up numbers. Data scientists typically require de-aggregated or raw numbers.
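To make the gap between the two rules of thumb concrete, here is a quick back-of-the-envelope comparison; the 10 TB figure is just an illustrative number, not from either source:

```python
# Back-of-the-envelope: how much advertised data is actually usable
# under the two rules of thumb discussed above.

advertised_tb = 10.0  # hypothetical client claiming 10 TB of data

victoroff_estimate = advertised_tb / 1000  # Victoroff's 1/1000 rule
access_estimate = advertised_tb * 0.25     # my 25% access assumption

print(f"Victoroff's rule: {victoroff_estimate} TB usable")
print(f"25% access rule:  {access_estimate} TB accessible")
```

Even under the more generous assumption, planning around the advertised number will leave you short.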
If you’re evaluating a potential job or project, the following questions can determine the level of data access in an organization:
- Does the business have a central database for collecting data? Non-technical people may not know the answer to this question. You’re better off saving this question for engineering, IT, or operations.
- What data do different departments view on a daily basis? A hint that there is less data than advertised: you only hear software-as-a-service references, such as web analytics tools (Google Analytics or WebTrends), social publishing and listening tools (Hootsuite, Synthesio, or SocialFlow), sales/CRM tools (Salesforce or Operative), etc.
- How does your CEO keep a pulse on the business? Tease out the workflow and process for getting information to the CEO and other decision makers. It would be helpful to get a rough list of reports seen by executives and managers.
- How are reports generated and shared in your organization (note the size of the organization)? Organizations with less fluid data access will create reports and share Excel or Google spreadsheets. A sign that the business has an established data warehouse and pipelines — that are actually in use — is a central dashboard system. Caveat: Excel has neat tricks for accessing databases and APIs, and it is a decent choice for a smaller organization.
- What is an example of a data-driven decision made within your organization? Listen for the story; you may hear an indirect hint at their data culture.
- How many developers are dedicated to data, business intelligence, or reporting tasks? Ask about the specific data engineering and DevOps tasks required to get things up and running.
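If you do get hands-on access while evaluating an organization, a quick inventory of what actually lives in the central database is revealing: a handful of tables suggests thin data access, whatever the pitch says. A minimal sketch using Python’s built-in sqlite3 as a stand-in (in practice you would swap in your warehouse’s driver and connection string; the table names here are made up):

```python
import sqlite3

def list_tables(conn):
    """Return the names of all user tables in a SQLite database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]

# Stand-in database; in practice, connect to the real warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")

print(list_tables(conn))  # a short list is itself a data point
```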
An organization or project with no data is not a deal breaker. A thoughtful plan for data development can be established within the first few months on the job. In the process of crafting an ideal data science environment, take note of how much of your time is spent on data engineering and what the responsibilities are. If it turns out that there are limited resources for you, then try to leverage awesome SaaS tools to make these tasks less painful:
- Splunk – Data from your server log files in an instant
- Xplenty – Data integration and processing services; get your data from one place to another
- Alooma – Data plumbing SaaS. Potentially skip APIs when “plug-ins” are provided for commonly used SaaS products
- Amazon Web Services – EC2 to S3 to Redshift. Nothing is automatic, but you won’t have to manage any databases. I chose PostgreSQL for some of my analysis, which is a wonderful tool, but database management is a pain!
- Import.io – Web scraping with machine learning tools for non-engineers
- Tableau – Business intelligence tool that can be plugged into a database or fed with Excel uploads
- More advanced frameworks for data engineering and pipelining include Airflow (Airbnb) and Luigi (Spotify), both Python frameworks and packages
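Airflow and Luigi are built around the same core idea: a pipeline is a graph of tasks, each declaring what it depends on and what it produces, and the scheduler runs dependencies first and skips tasks whose output already exists. A framework-free sketch of that pattern — task names are made up for illustration, and real Luigi/Airflow tasks track outputs as files or database state rather than an in-memory flag:

```python
# A toy version of the requires/run pattern that Luigi and Airflow
# formalize: run each task's dependencies first, and skip tasks
# whose output already exists (idempotent re-runs).

completed = []  # record of task names actually executed

class Task:
    requires = []  # upstream Task classes
    done = False   # stands in for "output file exists"

    @classmethod
    def run_pipeline(cls):
        for dep in cls.requires:
            dep.run_pipeline()
        if not cls.done:  # only run if output is missing
            cls().run()
            cls.done = True

    def run(self):
        completed.append(type(self).__name__)

class ExtractLogs(Task):
    pass

class CleanLogs(Task):
    requires = [ExtractLogs]

class BuildReport(Task):
    requires = [CleanLogs]

BuildReport.run_pipeline()
print(completed)  # dependencies execute before the tasks that need them
```

Calling `BuildReport.run_pipeline()` a second time runs nothing, because every task’s output already exists — the same property that lets real pipelines recover cheaply from partial failures.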
Good luck on chasing down the data!