Extended Team Model: An Alternative To Outsourcing

Most companies and startups struggle to bring the right resources on board and to build a team with a proper balance of skill sets. Outsourcing is often used to get the work done, but without synchronization and the right talent on board, execution can be quite challenging! This is where the Extended Team Model can accelerate an in-house team’s progress by complementing it and fostering efficiency.

Imagine having extra hands complementing your team’s skill set and augmenting the development team! That is precisely how an extended team contributes to an onsite team. In this article, we will look at how the Extended Team Model is a modern alternative to outsourcing and why it has benefited many software development companies.

What Is An Extended Team Model (ETM) in Analytics?

Businesses today scale faster by building effective connections within the team and a network of professionals who can maximize productivity. The Extended Team Model is an alternative to outsourcing in which the in-house team is extended with a virtual team of professionals who join the core team to supply the skill sets it lacks.

If your company is aiming for long-term growth at a larger scale, then the Extended Team Model is your way to scale higher. With an extended team, there is more transparency and flexibility: the team members are in constant communication, working as one team toward a fixed goal. Your focus also becomes the extended team’s focus, fostering highly effective collaboration between the in-house and extended teams.

How is an Extended Team Model different from Outsourcing?

One might ask how an extended team differs from outsourcing, since the job gets done both ways. The answer is that the quality of the final delivery is always better when a team works in close collaboration, which is the crux of the Extended Team Model. In outsourcing, the company cannot communicate directly with the developers or gain insight into how the work is done; it simply receives the delivered code or product.

The Extended Team Model bridges this gap between the in-house and offshore teams, making the execution process more efficient and collaborative.

Let’s look at the features of the Extended Team Model to understand why it is a better, more constructive alternative to outsourcing:

  1. Complements the core team: An Extended Team Model is meant to complement the core team, not replace it. If your local talent pool lacks specific technical skill sets or business expertise, the extended team steps in to bridge that gap. The core team works onsite, while the extended team may operate offshore.
  2. Shares the same work culture: An extended team is hired when a company wants to grow on a long-term basis and wishes to keep working with the same team on future projects. The extended and in-house teams therefore share the same focus, and the extended team will not get sidetracked by other projects. The extended team receives the same objectives and training (depending on its expertise) and shares the same work culture as the core team. This develops a strong team spirit and sense of oneness between the in-house and extended teams, naturally improving the quality of the final product.
  3. You have control over the project: In outsourcing, the requirements are handed to the offshore team, and you have no say in its day-to-day productivity or ways of working. In an Extended Team Model, a single point of contact and authority oversees the work of both the core and extended teams.
  4. Flexibility in team composition: One of the significant advantages of hiring an extended team is the complete flexibility to add people with new skill sets or reduce the team to the required professionals as you move through the delivery process. You can swap out developers who are no longer required for the project or grow your headcount as you see fit.
  5. Working towards a common goal: In an Extended Team Model, responsibilities are shared among all team members (both onsite and offshore) according to their skill sets, and everyone is equally responsible for the success or failure of the project. This keeps the entire team invested in the execution process and committed to delivering a quality product.
  6. Easy hiring process: Once you provide your requirements to the extended-team provider, they perform an initial screening and propose the professionals best suited for your project. You can run your own screening and interview process before making the final call on who joins your extended team. Finding the right match for a missing skill set in your local talent pool can be costly and time-consuming; by hiring an extended team, you can access top developers across the globe and narrow down the best fit for your project in a cost-friendly manner.

Now, the most crucial aspect is understanding at what point you should consider hiring an extended team for your project. Read on to find out!

When to opt for an Extended Team?

Suppose you wish to expand your business over the long term and deliver projects that are currently beyond your company’s scope, and you have the resources to make it happen. In that case, this is the perfect opportunity to hire an extended team and grow your business significantly.

An extended team will help you enrich your in-house team for long-term projects and augment the development process by bringing in skill sets that your local talent lacks.

Moreover, hiring skill sets from the local pool is often costly. If you hire an extended team instead, the charges are far more reasonable, and the extended-team provider allocates all the required resources, such as computers and workspace.

To conclude, scaling a business involves considering options that help it expand and attract potential clients, which requires resources that benefit you and your team. An Extended Team Model is an excellent way to deliver value while adding skills and resources to your team.

Data Modeling in Power BI: Building a Strong Foundation for Insights

Introduction:

In today’s data-driven world, Power BI has emerged as a leading tool for transforming raw data into actionable insights. However, the true power of Power BI lies not only in its visualization capabilities but also in its robust data modeling features. Effective data modeling lays the groundwork for meaningful analyses and impactful visualizations. In this blog, we will delve into the nuances of mastering data modeling in Power BI, exploring essential concepts and best practices to construct a solid foundation that maximizes the potential of your data.

Understanding Data Modeling:

At its essence, data modeling in Power BI involves structuring data to facilitate analysis and visualization. This begins with importing data from diverse sources, such as databases or spreadsheets, into Power BI Desktop. Subsequently, relationships between different tables are established, calculated columns and measures are created, and the data model is optimized for performance.

1. Establishing Relationships:

Relationships dictate how tables in the data model are connected. In Power BI, relationships are based on shared fields or keys between tables. By defining relationships, Power BI can perform complex cross-table calculations and aggregations. It is crucial to understand the various relationship types (one-to-one, one-to-many, many-to-many) and to select the appropriate cardinality and cross-filter direction based on the data structure and analysis requirements.

Fig. Relationship between tables (Source: learn.microsoft.com)
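As a plain-Python analogy (not DAX; the tables here are hypothetical), a one-to-many relationship lets a filter on the dimension table flow to the fact table through the shared key:

```python
customers = {1: "Alice", 2: "Bob"}  # "one" side (dimension table)
orders = [                          # "many" side (fact table)
    {"order_id": 10, "customer_id": 1, "amount": 40},
    {"order_id": 11, "customer_id": 1, "amount": 60},
    {"order_id": 12, "customer_id": 2, "amount": 30},
]

# Filter the dimension, then let the filter propagate through the key.
selected = {cid for cid, name in customers.items() if name == "Alice"}
total = sum(o["amount"] for o in orders if o["customer_id"] in selected)
print(total)  # 100
```

Power BI performs this propagation automatically once the relationship, its cardinality, and its cross-filter direction are defined.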

2. Creating Calculated Columns and Measures:

Calculated columns and measures serve as integral elements within the realm of data modeling. Calculated columns allow for the generation of new columns through the application of calculations to existing data, while measures dynamically aggregate data based on predefined conditions. Leveraging DAX (Data Analysis Expressions), the proprietary formula language of Power BI, provides a wide array of functionalities including summation, averaging, and counting. It’s imperative to prioritize simplicity, efficiency, and reusability when crafting calculated columns and measures to uphold maintainability and optimize performance.

Fig. Calculated Column
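The distinction can be sketched in plain Python (an analogy only; in Power BI these are written in DAX): a calculated column is materialized per row when the model loads, while a measure is evaluated on demand over the current filter context:

```python
rows = [  # hypothetical sales table
    {"product": "A", "qty": 2, "price": 10.0},
    {"product": "B", "qty": 1, "price": 25.0},
    {"product": "A", "qty": 3, "price": 10.0},
]

# Calculated column: computed once per row and stored with the data.
for r in rows:
    r["line_total"] = r["qty"] * r["price"]

# Measure: aggregated at query time over whatever rows the filter selects.
def total_sales(filtered_rows):
    return sum(r["line_total"] for r in filtered_rows)

print(total_sales(rows))                                      # 75.0
print(total_sales([r for r in rows if r["product"] == "A"]))  # 50.0
```

This is why measures are usually cheaper than calculated columns: they consume no storage and always respect the user's current slicers and filters.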

3. Optimizing Performance:

Performance optimization is paramount in data modeling, particularly with large datasets or intricate calculations. Employing techniques such as minimizing calculated columns, utilizing appropriate data types, avoiding unnecessary relationships, and optimizing DAX formulas can significantly enhance report responsiveness and efficiency. Additionally, features like query folding, partitioning, and incremental data refresh can further optimize performance.

Best Practices and Tips:

To master data modeling in Power BI effectively, adhere to these best practices and tips:

– Thorough Understanding: Gain a comprehensive understanding of data sources, relationships, and business requirements before constructing the data model.

– Simplicity: Strive for simplicity by minimizing complexity and redundancy in the data model. Simplified models are easier to maintain and troubleshoot.

– Descriptive Naming: Use clear and descriptive names for tables, columns, relationships, and measures to enhance clarity and comprehension.

– Testing and Iteration: Rigorously test the data model with sample data and iterate based on feedback and performance evaluations.

– Stay Updated: Keep abreast of the latest Power BI features and updates to leverage new functionalities and optimizations.

Conclusion:

Developing proficiency in data modeling within Power BI is crucial for establishing a solid foundation conducive to in-depth analysis and engaging visualizations. By grasping fundamental concepts, embracing best practices, and employing optimization strategies elucidated in this guide, users can fully harness the capabilities of Power BI. Whether you’re new to the platform or a seasoned user, dedicating time and effort to mastering data modeling will undoubtedly yield significant benefits, enabling you to make informed decisions and propel business success through data-driven insights.

Follow us for more such updates!

Azure Databricks: Overview

What is Azure Databricks?

Azure Databricks is an Apache Spark-based big data and analytics platform optimized for Azure cloud services. Databricks includes an interactive notebook environment, monitoring tools, and security controls that make it easy to leverage Spark. Azure Databricks supports multiple languages, such as Scala, Python, R, and SQL, along with multiple APIs. Azure Databricks offers three environments for developing data-intensive applications:

  1. Databricks SQL
  2. Databricks Data Science & Engineering
  3. Databricks Machine Learning

Fig. Azure Environment

Azure Databricks empowers organizations to extract meaningful insights from their data, whether through interactive analysis, batch processing, or machine learning, and is a key component in Microsoft’s Azure analytics and data services ecosystem.

How do Databricks work in Azure?

Azure Databricks is optimized for Azure and tightly integrated with other Azure services, such as Data Lake Storage, Azure Data Factory, and Power BI, to store all data in a simple, open lakehouse. On top of this, Azure Databricks integrates seamlessly with Azure Active Directory for access control and authentication. Overall, Azure Databricks provides a well-architected and tightly integrated environment for big data analytics and machine learning on Azure.

Components of Azure Databricks:

The key components of the Azure Databricks platform include:

  1. Workspace
  2. Notebooks
  3. Clusters
  4. Workflows
  5. Delta Lake
  6. Auto Loader

Workspace: The Azure Databricks workspace is an integrated development environment (IDE) provided by Microsoft Azure for data engineering, data science, and machine learning tasks. It is a collaborative platform that allows multiple users to work together on big data analytics. We can write code and configure jobs from the workspace.

Fig. Azure workspace

Notebooks: Azure Databricks provides a notebook interface where users can write and execute code in multiple languages, such as Python, Scala, SQL, and R. Notebooks are interactive documents that combine live code, visualizations, and narrative text, making it easy to explore, analyze, and visualize data. Any business logic can be written in a notebook and applied to the data.

Fig. Sample Notebook

Clusters: A Databricks cluster is a set of computation resources and configurations on which you run data engineering, data science, and data analytics workloads. These workloads, such as ETL pipelines, streaming analytics, and ad hoc analytics, run as a set of commands in a notebook or as a job. There are primarily two types of clusters: all-purpose clusters and job clusters. All-purpose clusters are used to analyze data collaboratively in interactive notebooks, while job clusters run automated jobs in an expeditious and robust way. It is better to use all-purpose clusters for ad hoc requests and development work. A cluster usually takes 3 to 6 minutes to start; we can stop it manually, or it is auto-terminated after a set idle limit. There is also SQL warehouse compute for ad hoc SQL queries, which takes relatively less time to start.

Fig. Clusters

Delta Lake: Delta Lake is the technology at the heart of the Azure Databricks platform. It is an open-source technology that enables building a data lakehouse on top of existing storage systems. Delta Lake builds upon standard data formats: it is primarily powered by data stored in Parquet, one of the most popular open-source formats for working with big data. Delta Lake is also the default format for all tables created in Azure Databricks.

Databricks Auto Loader: Auto Loader provides an easy-to-use mechanism for incrementally and efficiently processing new data files as they arrive in cloud file storage. This optimized solution lets data teams load raw data from cloud object stores at lower cost and latency, with no tuning or manual code required. Auto Loader can load files from ADLS Gen2, Azure Blob Storage, and the Databricks File System, and it is especially useful and efficient when used with Delta Live Tables.
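A typical Auto Loader ingestion looks like the sketch below. This fragment runs only inside a Databricks workspace (where `spark` is predefined), and the paths and table name are hypothetical placeholders:

```python
# Incrementally ingest new JSON files as they land in cloud storage.
df = (spark.readStream
        .format("cloudFiles")                                        # Auto Loader source
        .option("cloudFiles.format", "json")                         # format of incoming files
        .option("cloudFiles.schemaLocation", "/mnt/schemas/orders")  # where inferred schema is tracked
        .load("/mnt/raw/orders"))                                    # hypothetical landing path

(df.writeStream
   .option("checkpointLocation", "/mnt/checkpoints/orders")          # progress tracking for exactly-once loads
   .trigger(availableNow=True)                                       # process all pending files, then stop
   .toTable("bronze_orders"))                                        # write to a Delta table
```

The checkpoint and schema locations are what let Auto Loader pick up only new files on each run, without any manual bookkeeping.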

Workflows: A workflow is a way to run non-interactive code on Databricks clusters. For example, you can run an ETL workload interactively or on a schedule. A workflow can consist of a single task or be a large, multi-task workflow with complex dependencies. Azure Databricks manages the task orchestration, cluster management, monitoring, and error reporting for all the jobs. We can run jobs immediately or periodically through an easy-to-use scheduling system, and we can set a dependency on an upstream job by using a file-arrival trigger in the workflow.

Fig. Workflow Schedule and Triggers

Summary: Azure Databricks can be a game changer in modern big data analysis thanks to its optimized environment, persistent collaboration in notebooks, real-time teamwork, and user-friendly workspace. Azure Databricks also integrates closely with Power BI for hands-on visualization, which can be very effective for ad hoc analysis.

For more updates like this, please get in touch with us.

Getting Started with Pentaho Data Integration

Introduction:

Pentaho Data Integration (PDI) stands as a cornerstone in the realm of data integration and analytics. Whether you’re a seasoned data professional or a newcomer to the field, this guide will navigate you through the crucial initial steps in leveraging Pentaho Data Integration for your ETL (Extract, Transform, Load) needs.

Unveiling Pentaho Data Integration (PDI)

Introduction to PDI:

Pentaho Data Integration, often referred to as Kettle, serves as the data integration powerhouse within the Pentaho Business Analytics suite. Renowned for its user-friendly graphical interface, PDI empowers users to craft intricate ETL processes without delving into intricate coding. Supporting an extensive array of data sources, PDI emerges as a versatile solution for diverse data integration challenges.

Installation and Configuration:

Step 1: Acquiring Pentaho Data Integration

Initiate your journey by downloading the latest version of Pentaho Data Integration from the official website.

Step 2: Installation Guidance

Click on “Download Now” on the official website and choose the version you want to install. Typically, we opt for the one labeled “Pentaho Data Integration (Base Install).”

Navigating the Pentaho Data Integration Interface

Crafting Your Inaugural ETL Job:

Step 1: Initiating a Transformation

Within Spoon, PDI’s graphical design tool, create a new transformation: a set of interconnected steps defining your ETL process. Introduce source and destination steps to depict the data flow.

Step 2: Step Configuration

Configure the source step to establish connectivity with your data source, whether it’s a database, CSV file, or another format. Simultaneously, configure the destination step to specify where your transformed data will be loaded.

Step 3: Exploration of Transformation Steps

Delve into the diverse transformation steps PDI offers. For beginners, commence with fundamental steps such as Select Values, Filter Rows, and Add Constants to manipulate your data effectively.

Step 4: Transformation Execution

Execute your transformation to witness the ETL process in action. Monitor the log window for any potential errors or warnings during the execution.

Preservation and Reusability of Transformations:

Step 1: Save Your Transformation:
Once content with your transformation, save your work. This preserves your efforts and facilitates future modifications.

Step 2: Transformation Reusability:
PDI advocates for the reuse of transformations across different jobs, fostering a modular and efficient approach to ETL design. This approach proves invaluable in saving time and effort when encountering similar data integration tasks.

Conclusion:

Embarking on your Pentaho Data Integration journey unveils a realm of possibilities in the ETL landscape. This guide has initiated you into crafting ETL processes with PDI’s intuitive graphical interface. As you grow more accustomed to the tool, explore advanced features such as job orchestration, scripting, and integration with big data technologies.  Always remember, proficiency in Pentaho Data Integration is cultivated through practice. Begin with uncomplicated transformations and progress towards more intricate scenarios. The Pentaho community and documentation serve as indispensable resources for ongoing learning and troubleshooting. Happy ETL endeavors!

Please get in touch with us if our content piques your interest.

An overview on Azure’s NoSQL Cosmos DB

Azure Cosmos DB is a fully managed platform-as-a-service (PaaS) database. It offers NoSQL and relational models for building low-latency, highly available applications, with support for multiple data stores: relational, document, vector, key-value, graph, and table. Azure Cosmos DB offers single-digit-millisecond response times, high scalability, guaranteed SLA-backed availability, and enterprise-grade security.

Global distribution: Cosmos DB is a globally distributed database that allows users to read and write from multiple regions across the world, which helps in building low-latency, highly available applications. Cosmos DB replicates data across the globe with guaranteed consistency levels and offers 99.999% read and write availability for multi-region databases.

Consistency levels: Azure Cosmos DB supports five consistency levels.

  • Strong: Linearizable reads.
  • Bounded staleness: Consistent Prefix. Reads lag behind writes by k prefixes or t interval.
  • Session: Consistent Prefix. Monotonic reads, monotonic writes, read-your-writes, write-follows-reads.
  • Consistent prefix: Updates returned are some prefixes of all the updates, with no gaps.
  • Eventual: Out-of-order reads; replicas converge eventually.

Cosmos DB resource hierarchy:

A Cosmos DB account can hold multiple databases. A Database can hold multiple containers.

Data is stored in containers. Each container has a partition key, which helps distribute the data evenly across all partitions. The partition key should be selected cautiously, because a poorly chosen partition key will increase the consumption of Request Units (RUs). The easiest way to choose a partition key is to pick the field most often used in your WHERE clause. Data is physically stored in physical partitions; Cosmos DB maps logical partitions onto them, one logical partition per distinct partition-key value. If a container holds 10 distinct partition-key values, 10 logical partitions are created. Each physical partition is replicated at least 4 times to increase availability and durability.
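As a plain-Python illustration (not the Cosmos DB SDK; the items and key are hypothetical), logical partitioning groups items by their partition-key value, and a query that filters on that key touches only one partition:

```python
from collections import defaultdict

# Hypothetical items; "city" is the chosen partition key.
items = [
    {"id": "1", "city": "Pune", "sales": 100},
    {"id": "2", "city": "Pune", "sales": 250},
    {"id": "3", "city": "Mumbai", "sales": 180},
]

# One logical partition per distinct partition-key value.
logical_partitions = defaultdict(list)
for item in items:
    logical_partitions[item["city"]].append(item)

print(len(logical_partitions))  # 2 distinct values -> 2 logical partitions

# A query filtering on the partition key reads a single partition:
pune_sales = sum(i["sales"] for i in logical_partitions["Pune"])
print(pune_sales)  # 350
```

A query that does not filter on the partition key would have to fan out across every partition, which is exactly why a poorly chosen key inflates RU consumption.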

Containers are schema-agnostic, which means items in a container can have different schemas, but they all share the same partition key path. All items are indexed automatically, and a custom indexing policy is also available.

Pricing: Azure Cosmos DB meters all database operations in Request Units (RUs), irrespective of the API. One request unit corresponds to a point read of a 1 KB item by its partition key and ID value.
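A rough back-of-the-envelope sketch of that rule, assuming roughly 1 RU per KB for a point read (actual charges also depend on indexing, consistency level, and operation type, so treat this as illustrative only):

```python
import math

def point_read_rus(item_size_kb: float) -> float:
    """Approximate RU charge for a point read: ~1 RU per KB, minimum 1 RU."""
    return max(1.0, float(math.ceil(item_size_kb)))

print(point_read_rus(0.5))  # 1.0  (small items still cost about 1 RU)
print(point_read_rus(4.0))  # 4.0  (a 4 KB item costs about 4 RUs)
```

Keeping items small and reading them by partition key and ID is therefore the cheapest access pattern Cosmos DB offers.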

There are three modes we can use to setup the cosmos DB.

  • Provisioned Throughput: A fixed number of RUs per second is assigned to the Cosmos DB account based on the requirement.
  • Serverless: No assignment needed; you are billed based on consumption. Serverless mode comes with some limitations: single region only, a maximum of 1 TB of storage, and RU throughput ranging between 5,000 and 20,000.
  • Autoscale: Scales automatically based on consumption. Suitable for building scalable, highly available applications with unpredictable traffic, with no need to handle rate-limiting operations yourself.

Cosmos DB emulator: Cosmos DB also offers an emulator that can be installed on your local system. The emulator comes with limited features and can be used for developing and testing applications locally without creating an actual cloud account. Fixed RUs, fixed consistency levels, and support for only the NoSQL API are a few of its limitations.

Follow us for more such updates.

Optimizing Power BI Performance: Unleashing the Full Potential of Your Reports

Power BI stands as a robust tool for transforming raw data into actionable insights. However, as reports and dashboards become more intricate, optimizing performance becomes paramount. Slow-loading reports and sluggish interactions can hinder user experience, diminishing the impact of your data-driven decisions. In this post, we will explore key strategies to optimize Power BI performance, ensuring a seamless and responsive user experience.

1. Streamlined Data Modeling

The foundation of every Power BI report is its data model. A well-designed data model not only improves report performance but also enhances overall usability. Here are some tips for efficient data modeling:

  • Simplify Relationships:

Ensure that relationships between tables are necessary and optimized. Remove unnecessary relationships and use bi-directional filtering judiciously.

  • Optimal Data Types:

Choose the appropriate data types for your columns to minimize storage requirements and enhance query performance. Avoid unnecessary data conversions.

  • Leverage Aggregations and Summary Tables:

Pre-calculate and store summarized data using aggregations and summary tables. This reduces the load on the system during report rendering.

2. Power Query Optimization

Power Query is a potent tool for data transformation, but inefficient queries can slow down the entire data refresh process. Consider these optimizations:

  • Early Data Filtering:

Apply filters as early as possible in Power Query transformations to reduce the data loaded into Power BI.

  • Column Limitation:

Import only the columns you need. Eliminate unnecessary columns during the Power Query stage to minimize the data transferred to Power BI.

  • Harness Query Folding:

Utilize query folding to push some transformations back to the data source, reducing the amount of data brought into Power BI for processing.

Fig. Power Query Editor optimization steps, showing the outcome of each stage when you select it.

3. Effective Visualization Techniques

Visualizations are the face of your reports, and optimizing them is crucial for a responsive user experience:

  • Limit Visual Elements:

Avoid cluttering your report with too many visuals. Each visual adds to the load time, so prioritize key insights and remove non-essential elements.

  • Aggregations in Visuals:

Aggregate data within visuals instead of relying on Power BI to aggregate large datasets, significantly improving rendering speed.

  • Optimize Maps:

If using maps, limit the number of data points displayed, and consider using aggregated data for better performance.

Fig. Sales dashboard using optimized visualizations.

4. Monitor and Optimize DAX Calculations

DAX calculations can be resource-intensive, impacting report performance. Optimize DAX with the following tips:

  • Measure Dependencies:

Review dependencies of your measures and ensure they are calculated only when needed. Avoid unnecessary recalculations.

  • Optimize Time-Intensive Calculations:

Identify and optimize time-consuming DAX calculations, especially those involving large datasets or complex logic.

5. Maintain Security and Governance

Implementing proper security and governance measures contributes to a secure and well-maintained Power BI environment:

  • Role-Level Security (RLS):

Utilize RLS to restrict data access based on user roles, ensuring each user sees only relevant data, thus improving query performance.

  • Regular Review and Clean Up:

Regularly review and clean up your Power BI workspace. Remove unnecessary datasets, reports, and dashboards to streamline the environment.

Conclusion

Optimizing Power BI performance is an ongoing process that involves efficient data modeling, optimized queries, effective visualization techniques, and careful monitoring of DAX calculations. By implementing these best practices, you can unlock the full potential of Power BI, providing users with fast, responsive, and impactful reports and dashboards. A well-optimized Power BI environment is the key to turning data into insights that drive informed decision-making.

For further updates, get in touch with us.

Databricks Lakehouse: Next Level of Data Brilliance

Databricks Lakehouse is a new data management architecture that merges the best parts of the data warehouse with the best parts of the data lake. It combines the ACID transactions and data governance of a DWH with the flexibility and cost efficiency of a data lake to enable BI and ML on all data. It keeps data in massively scalable cloud object storage in open-source data standards. The lakehouse radically simplifies the enterprise data infrastructure and accelerates innovation in an age when ML and AI are disrupting every industry. The data lakehouse replaces the current dependency on separate data lakes and data warehouses for modern data companies that require:

  1. Open, Direct access to data stored in standard data formats.
  2. Low query latency and high reliability for BI and Analytics.
  3. Indexing protocols optimized for ML and Data Science.
  4. Single source of truth, eliminate redundant costs, and ensure data freshness.

Fig. A simple flow of data through the Lakehouse

Components of Lakehouse

1. Delta Table

With the help of Delta tables, we can enable downstream data scientists, analysts, and ML engineers to leverage the same production data used in current ETL workloads as soon as it is processed. Delta tables take care of ACID transactions, data versioning, and ETL. Metadata used to reference the table is added to the metastore under the declared schema.

2. Unity Catalog

It ensures that we have complete control over who gains access to which data, and it provides a centralized mechanism for managing access control without needing to replicate data. Unity Catalog gives administrators a unified location to assign permissions on catalogs, databases, tables, and views to groups of users.

3. Delta Lake

Delta Lake is a file-based, open-source storage format that provides ACID transactions and scalable metadata handling and unifies streaming and batch data processing. It runs on top of existing data lakes and integrates with all major analytics tools.

Fig. Lakehouse with Databricks

Medallion Lakehouse Architecture – Delta Design pattern

The Medallion architecture describes the series of data layers that denote the quality of data stored in Lakehouse. The term Bronze, Silver and Gold describe the quality of data in each of these layers. We can make multiple transformations and apply business rules while processing data through the different layers. This multilayered approach helps to build a single source of truth for enterprise data products.

  1. Bronze – Raw data ingestion.
  2. Silver – Validated, Filtered data.
  3. Gold – Enriched data, Business level aggregates.
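The three layers above can be sketched in plain Python (an illustration only; in practice each layer would be a Delta table processed with Spark, and the records here are hypothetical):

```python
# Bronze: raw ingestion, kept as-is (including bad records).
bronze = [
    {"order_id": 1, "amount": "100", "region": "west"},
    {"order_id": 2, "amount": None,  "region": "east"},  # invalid record
    {"order_id": 3, "amount": "50",  "region": "west"},
]

# Silver: validated and filtered, with types cleaned up.
silver = [
    {**r, "amount": float(r["amount"])}
    for r in bronze
    if r["amount"] is not None
]

# Gold: business-level aggregates.
gold = {}
for r in silver:
    gold[r["region"]] = gold.get(r["region"], 0.0) + r["amount"]

print(gold)  # {'west': 150.0}
```

Because each layer is persisted, any downstream consumer can pick the quality level it needs: data scientists may explore silver, while BI dashboards read gold.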

Data Objects in Databricks Lakehouse

The Databricks Lakehouse organizes data stored with Delta Lake in cloud object storage using familiar relations such as databases, tables, and views. There are five primary objects in the Databricks Lakehouse:

  • Catalog
  • Database
  • Table
  • View
  • Function

Lakehouse Platform Workloads

  • Data Warehousing
  • Data Engineering
  • Data Streaming
  • Data Science & Machine Learning

Pros

  • Adds Reliability, performance, governance and quality to existing data lakes.
  • ACID Transactions
  • Handling large metadata
  • Unified data teams
  • Reducing the risk of vendor lock-in
  • Storage is decoupled from Compute.
  • ML and Analytics support

Cons

  • Complex setup & Maintenance – The platform can be complex to set up and maintain, requiring specialized skills and resources.
  • Its advanced capabilities may not be suitable for some lower functionalities use cases.

The purpose of this post is to provide a broad overview of Databricks Lakehouse. Please get in touch with us if our content piques your interest.

Transforming data using DBT (Data Build Tool)

DBT is a software tool that allows us to transform and model data directly in the data warehouse.

DBT supports the ELT (Extract, Load, Transform) process: data is extracted, loaded into a data warehouse, and then transformed using DBT.

The industry shift from ETL to ELT has driven the popularity of DBT.

How DBT Differs from Other ETL Tools:

While traditional ETL tools focus on moving and transforming data before it reaches the warehouse, DBT operates directly within the data warehouse.

Capabilities of DBT:

  • Performance: By transforming data directly in the warehouse, the computational power of modern data warehouses (Snowflake, Big Query, and Redshift) is enhanced.
  • Version Control: DBT uses SQL and Jinja2, which helps in version control. This ensures transparency and traceability of changes made to data models.
  • Data Modeling: DBT offers a strong framework for creating and maintaining data models. SQL-based reusable models that are simple to maintain.
  • Testing and Documentation: Data transformations can be automatically tested with DBT, thus ensuring accuracy and integrity. It automatically creates documentation, offering insight into the data transformation procedure.
  • Workflow management and collaboration: DBT makes it possible for team members to work on the same project at the same time, which promotes cooperation. By integrating with version control systems, it facilitates an organized change and release workflow.
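To make the testing capability concrete, here is a plain-Python sketch of the checks that dbt's built-in `unique` and `not_null` tests perform. This is illustrative only: real dbt tests compile to SQL and run inside the warehouse, and the table and column names below are made up.

```python
# Hypothetical rows standing in for a warehouse table.
rows = [
    {"order_id": 1, "customer_id": 10},
    {"order_id": 2, "customer_id": 11},
    {"order_id": 3, "customer_id": None},
]

def is_unique(rows, column):
    """Spirit of dbt's `unique` test: no duplicate non-null values."""
    values = [r[column] for r in rows if r[column] is not None]
    return len(values) == len(set(values))

def is_not_null(rows, column):
    """Spirit of dbt's `not_null` test: no missing values."""
    return all(r[column] is not None for r in rows)

print(is_unique(rows, "order_id"))       # True: the ids are distinct
print(is_not_null(rows, "customer_id"))  # False: the third row has no customer
```

In dbt itself these checks are declared in YAML rather than written by hand, and a failing test blocks the dependent models from building.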

Transformation flow in DBT:

  • DBT has native integration with cloud data warehouse platforms.
  • Development: Write data transformation code in .sql and .py files.
  • Testing and documentation: It is possible to run local tests on all models before promoting them to the production repository.
  • Deploy: Deploy code to various environments. Version control through Git allows for collaboration.

Versions of DBT:

There are two versions of DBT:

  • DBT Core: It is an open-source command-line tool that allows for local data transformation.
  • DBT Cloud: It is a web interface that enables fast and reliable implementation of DBT. Through this interface, it is possible to develop, test, schedule, and visualize models.

Core components of DBT:

  • Models: SQL queries that define data transformations.
  • Tests: Ensure data quality by validating models. DBT supports both built-in tests (like unique or not_null) and custom tests defined in SQL.

  • Snapshots: Track historical changes in data.
  • Documentation: Auto-generates documentation for clarity on data processes.
  • Macros: Reusable SQL code snippets.
  • Data: This directory is used to store CSV files or other data sources used for testing or development purposes.
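A minimal, hypothetical `schema.yml` fragment shows how models and their built-in tests are declared together (the model and column names here are assumptions for illustration):

```yaml
version: 2

models:
  - name: orders            # assumed model name
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'returned']
```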

Basic commands in DBT:

dbt init: Initializes a new dbt project.

dbt debug: Tests your database connection and displays debugging information about the project setup.

dbt compile: Compiles the SQL in your dbt project, generating the final SQL code that will be executed against your data warehouse.

dbt run: Executes the compiled SQL in your data warehouse.

dbt test: Runs tests defined in your dbt project, checking for errors or inconsistencies in your data.

dbt deps: Installs dependencies for your dbt project.

dbt docs generate: Generates documentation for your dbt project.

dbt docs serve: Serves the documentation generated by dbt docs generate on a local server.

dbt seed: Loads CSV files from your project’s seeds directory into your data warehouse.

dbt snapshot: Executes the snapshots defined in your dbt project, capturing the point-in-time state of slowly changing data.

dbt source freshness: Checks the freshness of your source data and reports which sources are out of date.

dbt run-operation: Runs a custom macro defined in your dbt project.

 

For more updates like this, please get in touch with us.

Power BI’s Role in Data Storytelling

In the ever-evolving realm of business intelligence, the ability to weave a compelling narrative through data sets successful professionals apart. Enter Power BI, Microsoft’s formidable analytics tool, offering a canvas for crafting narratives through dynamic visualizations. In this blog, we delve into the core of data storytelling and how Power BI serves as your artistic tool in this creative endeavor.

The Power of Data Storytelling

While data alone may seem dry and intricate, when skilfully woven into a narrative, it transforms into a potent decision-making tool. Data storytelling is the art of translating raw numbers into a coherent and engaging tale that resonates with your audience. It transcends mere charts and graphs, aiming to make data relatable and easily understandable.

Fig: From raw data to a visually compelling data story.

Unleashing the Potential of Power BI

Power BI stands as an ideal platform for sculpting data stories, boasting an intuitive interface and robust visualization capabilities. With an array of chart types, maps, and customizable dashboards, Power BI empowers users to create compelling narratives that drive insights and actionable outcomes.

Steps to Craft a Data Story in Power BI

  1. Define Your Audience:

Recognize your audience and tailor your story to their specific requirements and interests.

  2. Identify Key Insights:

Before visualizing, pinpoint the crucial insights your data can offer. Determine the story you want to tell and the answers your audience needs.

  3. Choose the Right Visualizations:

Select visualizations that enhance your story and effectively convey your data. Power BI offers a vast collection of graphs, charts, and maps to bring your narrative to life.

  4. Create a Logical Flow:

Organize your visualizations in a logical order to guide readers through the information effortlessly.

  5. Add Context and Commentary:

Enhance your visualizations with context through annotations and commentary. Explain the significance of each data point, highlighting trends that contribute to the overarching narrative.

  6. Use Interactivity to Engage:

Leverage Power BI’s interactive features to allow users to explore the data themselves, fostering engagement and a deeper understanding of the story.

Real-world Examples

1. Sales Performance Dashboard:

Showcase a sales team’s achievements through a Power BI dashboard, highlighting fluctuations, successful strategies, and the impact of market dynamics.

     Sales Performance Dashboard

2. Operational Efficiency Story:

Illustrate how process improvements increased efficiency using Power BI visuals, demonstrating cause-and-effect relationships through clear data representation.

Operational Efficiency Dashboard

Conclusion

Mastering the craft of data storytelling is imperative in the era of data-driven decision-making. Power BI serves as a creative tool, enabling individuals and organizations to communicate complex information in an engaging and approachable manner. Each chart, annotation, and color choice contributes to the narrative canvas. Utilize Power BI to unleash the creative potential of your data, telling a compelling tale that inspires action and fosters success.

Reach out to us for more updates like these.

Snowflake’s Unparalleled Cloud Data Warehouse Features


Snowflake is a cloud data warehousing solution that offers several unique features that give it an edge over other data warehouse solutions. Here are six of the distinctive features of Snowflake:

1. Major Cloud Platform Support: Snowflake is a cloud-agnostic solution that is available on all three major cloud providers: AWS, Azure, and GCP.

    • All major functionalities and features are available across the cloud providers.
    • This enables support for multiple cloud regions and organizations can host the instances based on their business requirements.
    • Pricing depends not on the cloud provider but on the Snowflake edition that you choose for your data platform.
    • You only pay for the data you store and the compute you run. When compute is not used, you are not charged for it.

2. Scalability: Snowflake is natively built on cloud technologies, so it benefits from very high scalability, elasticity, and redundancy. You can store more data and scale your computing resources up or down as needed.

    • Snowflake has implemented auto-scaling and auto-suspend features.
    • The auto-scaling feature enables Snowflake to automatically start and stop resource clusters during unpredictable load processing.
    • The auto-suspend feature stops the virtual warehouse when resource clusters have been idle for a defined period.
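As a sketch, both behaviors are configured when the warehouse is created (the warehouse name is hypothetical; multi-cluster scaling is an edition-dependent feature):

```sql
CREATE WAREHOUSE analytics_wh
  WAREHOUSE_SIZE    = 'XSMALL'
  AUTO_SUSPEND      = 300     -- seconds of idle time before suspending
  AUTO_RESUME       = TRUE    -- wake up automatically when a query arrives
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3;      -- multi-cluster auto-scaling range
```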

3. Near-Zero Administration: Snowflake is a true SaaS offering with no hardware (virtual or physical) to select, install, configure, or manage.

    • Snowflake handles ongoing maintenance, management, upgrades, and tuning.
    • Companies can set up and manage their database solution without any significant involvement from DBA teams.
    • Storage, compute, cloud service, and data transfer monitoring and alerts (via notification & hard stop) are provided by Snowflake so compute credits can be managed by businesses very easily.

4. Support for Semi-Structured Data: Snowflake allows the storage of structured and semi-structured data.

    • Snowflake supports reading and loading of CSV, JSON, Parquet, AVRO, ORC, and XML files.
    • Snowflake can store semi-structured data with the help of a schema on read data type called VARIANT.
    • As data gets loaded, Snowflake parses the data, extracts the attributes, and stores it in a columnar format.
    • Snowflake supports ANSI SQL plus Extended SQL. You can query data using simple SQL statements. Snowflake extended SQL is very feature-rich and adds many useful libraries to help you become more productive.

 VARIANT datatype to store Semi-Structured Data
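A brief sketch of the VARIANT workflow described above (table, column, and attribute names are hypothetical):

```sql
-- A single VARIANT column can hold whole JSON documents.
CREATE TABLE raw_events (v VARIANT);

-- Path syntax extracts attributes; ::TYPE casts the result.
SELECT
  v:event_type::STRING      AS event_type,
  v:payload.user_id::NUMBER AS user_id
FROM raw_events
WHERE v:event_type::STRING = 'purchase';
```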

5. Time Travel and Fail Safe: As part of a continuous data protection lifecycle, Snowflake allows you to access historical data (table, schema, or database) at any point within the defined retention period.

    • Time Travel allows querying, cloning, and restoring historical data in tables, schemas, and databases based on the retention period. This retention period is adjustable from 0 to 90 days depending on the Snowflake edition.
    • This feature can help in restoring data objects that might have been accidentally deleted or for duplicating or backing up data from key points in the past.
    • Fail Safe is a data recovery service that can be utilized after all other options have been exhausted.
    • It provides a 7-day time window during which Snowflake can retrieve prior data. This time begins after the Time Travel retention period expires.
    • Both these features require additional data storage and hence incur additional storage costs as well.
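A few representative Time Travel statements, as a sketch (the table names are hypothetical):

```sql
-- Query the table as it looked one hour ago
SELECT * FROM orders AT (OFFSET => -3600);

-- Restore an accidentally dropped table within the retention period
UNDROP TABLE orders;

-- Zero-copy clone of the table as of a point in time
CREATE TABLE orders_backup CLONE orders AT (OFFSET => -3600);
```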

6. Continuous Data Loading: Snowflake has a serverless component called Snowpipe, which can be integrated with external object storage like S3 or Azure Blob.

    • It facilitates rapid and automated data ingestion into Snowflake tables. It allows for immediate loading of data from files as soon as they become available.
    • It doesn’t require manual specification of a warehouse because Snowflake automatically provides the necessary resources for its execution.
    • Once set up, Snowpipe automatically reads files that arrive in the source location and loads them into target tables without any manual execution or predefined schedule.
    • Snowpipe works closely with two other objects, Streams and Tasks; these capture data changes, and in combination they can help build micro-batch or CDC solutions.
Loading Files from Amazon S3 to Snowflake using Snowpipe
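As a sketch, a pipe over an external stage looks like this (the pipe, table, and stage names are hypothetical, and the stage must already point at the S3 bucket):

```sql
CREATE PIPE sales_pipe
  AUTO_INGEST = TRUE          -- load as soon as files land in the stage
AS
COPY INTO raw_sales
FROM @my_s3_stage
FILE_FORMAT = (TYPE = 'JSON');
```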

These are the few major distinguishing features of Snowflake Cloud Data Warehouse. Snowflake offers many other features that have made it a go-to Cloud Data Warehouse solution for countless enterprises.

For further updates, get in touch with us.

Workflows: A full-fledged orchestration tool in Databricks platform

Introduction:

Databricks workflows is a component of the Databricks platform used to orchestrate data processing, analytics, machine learning, and AI applications. Workflows have evolved so much that a third-party external orchestration tool is no longer required; depending on external tools for orchestration adds complexity to managing and monitoring pipelines.

Databricks workflows offer everything from basic features like scheduling, dependency management, and Git integration to advanced capabilities like retries, duration thresholds, repair, and conditional tasks.

A task is a unit of execution in the workflow. Workflows support a wide range of task types: notebooks, SQL and Spark scripts, Delta Live Tables pipelines, Databricks SQL queries, and dbt jobs. Because it supports such a variety of tasks, a workflow can be sufficient to orchestrate an end-to-end pipeline for a subject area.

Databricks job:

A job in Databricks workflows groups multiple tasks into one unit for better management and reusability. For each job, we can set different types and sizes of compute clusters, notifications, and triggers based on requirements. Databricks clusters add reliability and scalability to jobs. Databricks jobs automatically generate lineage, showing the upstream and downstream tables for each job.

Jobs can be managed from Databricks UI or Databricks REST API. REST API opens a whole set of capabilities to easily integrate with any outside tool.

For example, a data engineering team can create a job for ETL, a data science team can create a job for their ML models, and an analytics team can create a job for a dashboard refresh. All these jobs can be tied together into a single parent workflow, reducing complexity and improving manageability.

A company dashboard or report can only be built from data processed by different teams in the organization, so each team’s job depends on the preceding jobs. To express these dependencies, we can set dependencies on preceding jobs, schedule jobs at fixed times, or set file-arrival triggers on external storage services like ADLS.

Notable features:

The Retry Policy, as shown in the picture below, allows you to set the maximum number of retries and a defined interval between the attempts.
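As a hedged sketch of how a retry policy appears in a job definition (Jobs API-style JSON; the job name, notebook path, and values are hypothetical):

```json
{
  "name": "nightly-etl",
  "tasks": [
    {
      "task_key": "ingest",
      "notebook_task": { "notebook_path": "/Repos/etl/ingest" },
      "max_retries": 2,
      "min_retry_interval_millis": 60000,
      "retry_on_timeout": false
    }
  ]
}
```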

Repair is a very useful feature for developers while testing a job, as well as for production failures. When we repair a job run, it doesn’t start from the beginning; it re-triggers the pipeline from the failed task. In contrast, the re-run feature runs the pipeline from the beginning.

Workflows also provide a graphical matrix view to monitor runs at the task level.

Databricks workflows are also integrated with Git. Using Databricks REST API we can streamline the deployment process of the workflows by a CI/CD pipeline.

Like any other component in Databricks, workflows also come with access control. There are four types of access levels available.

Databricks workflows integrate with popular tools like Azure Data Factory, Apache Airflow, dbt, and Fivetran.

Notifications:

An orchestration tool is not complete without notifications and alerts, and Databricks workflows provide several types. Email notifications contain the start time, run duration, and status of the job. Other supported integrations are Microsoft Teams, Slack, PagerDuty, and custom webhooks.

Control flow:

Control flow mainly provides two functions.

  1. Run Job triggers a task based on the status of preceding tasks; the “Run if” dependency setting offers six different options.
  2. If/else condition triggers a task based on job parameters or dynamic values.

To summarize, Databricks workflows have evolved into an alternative to external orchestration tools. Their advanced features and capabilities make it a no-brainer to opt for Databricks workflows over external tools for managing Databricks pipelines.

If you wish to know more about our data engineering services, drop a line here.

Snowflake Database: An Effective Cloud-Native Data Warehousing Platform

What is Snowflake?

Snowflake is a cloud-based analytical data warehouse offered as Software-as-a-Service. It was created in 2012 by three data warehousing experts who formerly worked at Oracle Corporation. Snowflake’s architecture is different from traditional data warehousing technologies like SQL Server, Teradata, and Oracle, and from cloud data warehouses like AWS Redshift and Google BigQuery.

How is Snowflake Special?

Data warehouses typically use either a Shared Disk or a Shared Nothing architecture. In a Shared Disk architecture, multiple nodes access data on a central storage system, while in a Shared Nothing architecture each node or cluster stores a portion of the data. Snowflake combines both and creates a hybrid architecture.

Snowflake employs a centralized storage layer for data persistence that is available to all computing nodes. Snowflake also uses Massively Parallel Processing (MPP) clusters to process queries, with each node storing a fraction of the whole data locally.

Snowflake’s data architecture consists of three layers:

  1. Storage
  2. Compute/Query Processing
  3. Cloud Services

Each layer can scale independently and includes built-in redundancy.

 Fig: Snowflake architecture showing the different layers.

How does it work?

Storage Layer: Snowflake stores data in databases. A database is a logical group of objects consisting primarily of tables and views organized into schemas. Snowflake supports structured relational data in the form of tables using standard SQL data types. Additionally, Snowflake’s variant data type stores semi-structured non-relational data such as JSON, parquet, etc. ANSI standard SQL is used to perform data-related tasks for all datatypes.

Snowflake uses secure cloud storage to maintain data. Snowflake converts the stored data into a compressed format and encrypts it using AES 256 encryption.

Compute Layer: This is the layer where queries are executed using resources provisioned from a cloud provider. Unlike conventional data warehouses, Snowflake creates independent compute clusters called virtual warehouses that can access the same data storage layer without compromising performance.

To create a virtual warehouse, we can simply give it a name and specify a size. Snowflake automatically handles the provisioning and configuration of the underlying computational resources. There is no downtime when scaling up or down a virtual warehouse. Anytime a virtual warehouse is resized, the extra resources are available for use by any subsequent queries. Snowflake’s architecture also enables read/write concurrency without any resource contention. For instance, separate virtual warehouses can be used for loading and querying simultaneously. As all virtual warehouses access the same data storage layer, inserts and updates are immediately available to other warehouses.

Cloud Services Layer: This layer manages the entire system. It authenticates users, secures data, manages sessions, and performs query compilation and optimization. This layer also coordinates the data storage updates and access, to ensure all virtual warehouses can see the latest data instantaneously once a transaction is completed. A vital part of this layer is the metadata store as it enables many features like time travel, zero-copy cloning and data sharing. Snowflake maintains the services layer using resources distributed across multiple zones to ensure high availability.

Connecting to Snowflake is pretty easy using clients such as the JDBC or ODBC drivers. Snowflake also provides a web interface and a command-line client.

What do you need to manage Snowflake?

Most of the knobs that traditional data warehouses require for performance tuning are eliminated by Snowflake. You only need to create virtual warehouses, databases, and tables, load data, and run queries. Snowflake handles everything else.

How much does Snowflake cost?

Pricing is based on usage. Just pay for the computational and storage resources that are used. Storage costs are determined by the amount of compressed data stored in database tables, and the data retained to support Snowflake’s data recovery features. Compute prices are based on the warehouse size and how long the warehouse runs.

The purpose of this post is to provide a broad overview of data analytics stacks. Please get in touch with us if our analytics service and skills-as-a-service pique your interest.

Images credit: sarasanalytics.com

Best Practices for Designing Effective Power BI Dashboards

Introduction:

Developing a compelling Power BI dashboard extends beyond mere aesthetics; it’s about delivering actionable insights. Let’s delve into pivotal best practices to shape dashboards that are not only visually appealing but also offer substantial value.

Define Clear Objectives:

Before delving into the design process, articulate the dashboard’s purpose. Establishing the key questions it should answer ensures a focused design that resonates with Power BI users, meeting their specific needs.

Simplify and Declutter:

Maintain a clean and clutter-free dashboard with a minimalistic approach. This enhances the user experience, allowing users to concentrate on critical data points without unnecessary distractions, thereby optimizing the efficacy of your Power BI visualization.

Choose the Right Visualizations:

Visualizations serve as storytellers within Power BI dashboards. Opt for those that effectively convey your message. While classics like bar charts, line graphs, and pie charts are reliable, explore alternative options based on the nuances of your data.

Consistent Design and Branding:

Create a cohesive visual identity by ensuring consistency in color schemes, fonts, and branding elements. This fosters a professional appearance and reinforces your organization’s identity within the Power BI platform.

This picture showcases a Power BI dashboard designed with various visuals serving different purposes. The use of a consistent font and color scheme makes it look clean and attractive, helping to tell the full story.

Prioritize Data Quality:

Maintain data integrity by ensuring accuracy and currency. Regularly clean and organize your data to prevent inaccuracies. Remember, in Power BI, the adage “garbage in, garbage out” holds, making data quality a cornerstone of effective dashboard design.

Optimize for Performance:

Prioritize speed by trimming unnecessary calculations and limiting resource-intensive visuals. This optimizes Power BI dashboard loading times, providing a seamless user experience, and ensuring efficient data consumption.

Enable Interactivity:

Engage your audience within the Power BI platform by leveraging interactive features such as drilldowns and filters. Empower users to explore the data independently, fostering a sense of ownership over insights.

Documentation and Training:

Facilitate user understanding by incorporating tooltips and guides within the Power BI dashboard. This is especially crucial for users new to the platform, ensuring a smooth onboarding experience and maximizing the potential of your visualizations.

Regularly Review and Update:

Maintain relevance by regularly reviewing your Power BI dashboard’s performance against objectives. Update it as needed, ensuring that it evolves with the dynamic nature of your business. A constantly evolving dashboard ensures continued value within the Power BI ecosystem.

Conclusion:

Effective Power BI dashboard design strikes a delicate balance between simplicity and functionality. By incorporating these tips, you’ll create visually appealing dashboards and empower users with valuable insights, facilitating informed decision-making within your organization. Power BI becomes a transformative tool when these best practices are woven into the fabric of your dashboard design.

Enterprise-wide Analytics implementation for a Pharmaceutical Business

The pharmaceutical business generates vast amounts of data from various sources, including sales, transactions, inventory management systems, customer interactions, supply chain, manufacturing, and marketing campaigns. One of our clients encountered several data-related challenges, such as data fragmentation, inefficient data movement, and lack of data orchestration. To overcome those obstacles and unlock the value of their data, we decided to move to Azure Cloud and implement Azure Data Factory.

Problem

The client struggled with disparate data sources that hindered their ability to gain a unified view of customer behavior, sales performance, and inventory management. Manual data movement processes were time-consuming, error-prone, and difficult to scale. The data was scattered across multiple systems, making it difficult to gain a holistic view of the business, and with the increasing number of data breaches, the client needed a secure and compliant way to store and manage its data. The data was growing exponentially, and the client’s existing infrastructure could not handle its scale and complexity. The client therefore required a solution architecture that could automate data integration, provide seamless data movement, and facilitate data orchestration for efficient analytics and decision-making.

Data Solution

DataTheta utilized Azure Data Factory, a flexible and scalable data integration platform that successfully addressed the client’s data-related challenges. The key tenets of the solution include:

  1. Data Movement: Azure Data Factory was used to ingest data from various sources such as sales databases, inventory management systems, and marketing platforms. Data Factory facilitated efficient data movement across on-premises and cloud storage systems.
  2. Data Transformation: Using Azure Data Factory’s data flow feature, we performed data transformations, cleansing, and enrichment operations to ensure data quality and consistency. This included mapping data from different sources, applying business rules, and performing aggregations.
  3. Data Orchestration: Azure Data Factory was used to create and manage data pipelines and data flows. We automated the end-to-end data integration process, including scheduling, dependency management, and error handling, to ensure the seamless execution of data workflows. The data was now centralized and accessible to the entire organization, enabling teams to make data-driven decisions.
  4. Integration with Other Azure Services: Azure Data Factory was integrated with Azure services such as Azure SQL Database and Azure Storage for data storage and analytics. It can also be integrated with Azure Key Vault for secrets and key management and Azure Logic Apps for specific use cases.
  5. Data presentation Layer: Data aggregated using the data pipelines were well presented using Power BI to various teams such as the sales group, supply chain group, production planning, etc., to effectively utilize the information about various operations.
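For illustration, a simplified Data Factory pipeline definition with a single Copy activity of the kind described above might look like this (all pipeline, activity, and dataset names are hypothetical):

```json
{
  "name": "CopySalesData",
  "properties": {
    "activities": [
      {
        "name": "CopySalesDbToAdls",
        "type": "Copy",
        "inputs":  [ { "referenceName": "SalesDbDataset", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "AdlsRawDataset", "type": "DatasetReference" } ]
      }
    ]
  }
}
```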

Fig: Data Factory implementation along with other Azure services (ref: Azure.com)


Implementation and Benefits

The implementation of Azure Data Factory and other Azure Services yielded significant benefits,

  1. Unified View of Data: Azure Data Factory facilitated the integration of data from multiple sources, providing clients with a unified view of customer behaviour, sales performance, and inventory management. This enabled the client to make informed decisions based on accurate and up-to-date information.
  2. Automated Data Workflows: Azure Data Factory automated the end-to-end data integration process, reducing manual effort and improving operational efficiency, ensuring timely data movement and transformation.
  3. Scalability and Flexibility: Azure Data Factory offered scalability to handle large volumes of data and flexibility to accommodate changing business requirements, Client was able to scale resources based on demand, ensuring efficient data processing and reducing costs.
  4. Data Security: Azure Cloud provided robust security features, including encryption, identity and access management, and threat detection.
  5. Data Quality and Consistency: By leveraging Azure Data Factory’s data transformation capabilities, the client improved data quality and consistency. The platform allowed us to apply data cleansing rules, perform validations and enforce data integrity, ensuring reliable insights and analytics.
  6. Time and Cost Saving: Azure Data Factory reduced the time required for data integration and movement, resulting in faster access to data and accelerated analytics. The automation capabilities of Data Factory also led to cost savings by minimizing manual work and optimizing resource utilization.

Value Creation: DataTheta’s solution, powered by Azure Data Factory on Azure Cloud, provided the client with a comprehensive platform for managing and analyzing its data. The client was able to overcome its data-related challenges and achieve significant benefits. DataTheta’s architecture and solution capabilities enabled the seamless integration of various data sources, creating a rich ecosystem for data processing and analytics. The rich data visualizations fostered information democratization among the various business groups.

Monitoring Key Performance Indicators for fact-based business decision making

Must-know information about data analytics, data stacks and business value realization for a decision-maker.

In our day-to-day life, we monitor a lot of indicators of physical health, financial health, mental health, and more. Similarly, there is a multitude of indicators that aid enterprises in understanding their current and target business operation states. If you are on the journey of capturing these KPIs, you need to know the data sources and the software ecosystem that render error-free, information-rich, actionable KPIs from the available data. This exercise of building KPIs is essentially data analysis, which is like baking.

  • Well curated and good quality ingredients or “data” for expected results (or outcome)
  • The proportional blending of various ingredients or “math applied on the data” gives the best bake (information-rich KPIs)
  • Appropriate tools or “data infrastructure” makes the baking process easier
  • “Baking Skill” cannot be replaced with the best ingredients and bakery equipment – “the human intellect advantage”

Using unclean ingredients makes the pastry unconsumable; similarly, unclean data needs to be processed appropriately before it is brought to the baker’s table. The right ingredient quantities and oven temperature bring out crispy cookies; similarly, math well applied to the data brings error-free, acceptable KPIs.


Good data infrastructure coupled with competent data analysis brings the dependable KPIs for making your business data-driven.

Let us discuss the data stack and data definition.

Data Stack: A data stack is a set of software units that helps move data from different data sources (SAP, CRM, HRMS, financial systems, etc.), load it into a new unified destination, clean it, and make it ready for data visualization (for business users) and consumption by data scientists (for advanced use cases). You can learn more details here.

Data Definition: A data definition describes how various data points (variables) are arithmetically processed to arrive at a final value that helps in making a business decision. Let me demonstrate this with an example.

In the data set below, the sales of outlets are captured. Product visibility in the storefront enables easy access to the product and, eventually, more sales. But some necessary items, such as fruits and vegetables, generate enough sales even when moved to less visible areas. If you create a new KPI concerning product placement/visibility in a Type I supermarket in a Tier 1 location, it will help accelerate sales. This requires answering more questions about product attributes, day of sale, and current product visibility.


The statistical and mathematical calculation that renders this new KPI for business users, enabling error-free, repeatable decisions about product placement across cities and supermarket types, is termed the data definition (some practitioners call this data augmentation or concoction).

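As a rough illustration of how such a data definition might be computed, here is a minimal Python sketch that derives a hypothetical "sales per unit of shelf visibility" KPI grouped by outlet type and city tier. The field names and figures are invented for illustration, not taken from the data set above.

```python
from collections import defaultdict

# Hypothetical sales records; field names and values are illustrative only.
sales = [
    {"outlet_type": "Type I",  "city_tier": "Tier 1", "visibility": 0.16, "sales": 3735.14},
    {"outlet_type": "Type I",  "city_tier": "Tier 1", "visibility": 0.02, "sales": 443.42},
    {"outlet_type": "Type II", "city_tier": "Tier 3", "visibility": 0.09, "sales": 2097.27},
]

def visibility_adjusted_sales(records):
    """Derive a new KPI: total sales per unit of shelf visibility,
    grouped by outlet type and city tier."""
    totals = defaultdict(lambda: [0.0, 0.0])  # key -> [sum of sales, sum of visibility]
    for r in records:
        key = (r["outlet_type"], r["city_tier"])
        totals[key][0] += r["sales"]
        totals[key][1] += r["visibility"]
    return {k: round(s / v, 2) for k, (s, v) in totals.items() if v > 0}

kpi = visibility_adjusted_sales(sales)
```

The "definition" here is the whole recipe: which variables enter the formula, how they are grouped, and what arithmetic produces the final number business users see.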
What to consider as a decision-maker?

Your company should set up its infrastructure with a central database that harbours the data for analysis (by your business users and data scientists) and reporting.

This paves the path to a data-driven business. You then have a single version of the data that provides the necessary information for your business operations. This data needs to be cleaned and packed into different boxes that different groups can access.

Ok, but where to begin?

The first and foremost is the C-suite support. This goes without saying.

What could be the potential use case in your industry? In most settings, the best place to start is a business intelligence project rather than a data science project.

After deciding on the use cases, you need to work out the data stack (or data infrastructure), and then determine who will handle the data and its governance within your organization. This answers two questions:

  1. Who is the data owner?
  2. Who has the privilege to access what type of data?
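In practice, the second question is often answered with a simple role-to-dataset privilege map. The sketch below is a hypothetical illustration (the roles, dataset names, and owners are invented), not a real governance implementation:

```python
# Illustrative data-governance map. Every dataset has exactly one owner
# (question 1) and an explicit set of roles allowed to read it (question 2).
DATA_OWNERS = {
    "sales": "head_of_sales",
    "payroll": "hr_director",
}

ACCESS = {
    "sales": {"analyst", "data_scientist", "head_of_sales"},
    "payroll": {"hr_director"},
}

def can_access(role: str, dataset: str) -> bool:
    """Return True only if the role is explicitly granted access;
    unknown datasets are denied by default."""
    return role in ACCESS.get(dataset, set())
```

Real deployments enforce this in the database or a catalog tool rather than in application code, but the underlying decision table looks much the same.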

Your data infrastructure decision is very much dependent on the type of data you have (structured/unstructured) and the use cases that you have decided to work on.

Few more things to consider:

  1. Vendor dependence: Ensure that you are not locked into a vendor. When your data volume increases and the number of data consumers grows, the cost can escalate substantially. Be wise when drafting a contract with your vendor.
  2. Automation: Automation is helpful, but play it with caution. Test the system thoroughly before deployment.
  3. Data Science: Don't venture into data science projects initially; start with KPI or BI visualization projects. Data science projects require skilled stakeholders to develop, implement, and deploy, and they have a longer development lifecycle that includes model performance monitoring and model versioning.
  4. Adoption: If you already have a BI tool that the team is comfortable using, build your data presentation layer on it.

The general layout of the BI project is as follows:

Data Loading:

Data loading is the process of moving data from source systems such as ERP, CRM, HRMS, and other third-party applications into the data warehouse. There are two options:

  1. Make: Use schedulers to schedule the load jobs and write custom code to connect with the sources. DataTheta prefers Apache Airflow, an open-source scheduler, to execute code written in Python.
  2. Buy: Use services such as Fivetran, Matillion, or Stitch. These are SaaS products with competitive pricing.
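As a toy illustration of the "make" option, the sketch below loads a CSV export from a source system into a warehouse table using only Python's standard library; sqlite3 stands in for the warehouse, and the table and column names are hypothetical. In a real setup this function would be wrapped in an Airflow task and run on a schedule.

```python
import csv
import sqlite3

def load_csv_to_warehouse(csv_path: str, conn: sqlite3.Connection, table: str) -> int:
    """A minimal 'make'-style load job: read a source export and append it
    to a warehouse table. Returns the number of rows loaded."""
    with open(csv_path, newline="") as f:
        rows = list(csv.reader(f))
    header, data = rows[0], rows[1:]

    cols = ", ".join(header)
    placeholders = ", ".join("?" for _ in header)

    # Create the landing table on first run, then append the new batch.
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({cols})")
    conn.executemany(f"INSERT INTO {table} ({cols}) VALUES ({placeholders})", data)
    conn.commit()
    return len(data)
```

The "buy" option replaces all of this (connectors, scheduling, retries, schema drift handling) with a managed service, which is precisely what you are paying for.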

So, how do you choose between these options?

  • Do you have large amounts of data, or a large one-year growth projection?
  • Do you want to have the data in your own data centre due to its trade/regulatory confidentiality?
  • Do you have machines (PLC, DCS, SCADA) as your data source?

If your answer is "yes" to any of the above, you should go with the "make" decision. If your situation differs from those quoted, write to DataTheta to obtain our opinion.

Should I build my data engineering team?

There are multiple factors you need to consider:

Speed to data consumption: How fast does your team want to consume the data? If you have ample time in hand, plan for your first data engineer hire. If you are new to the business intelligence space, it is advisable to outsource the data engineering work, as the workload for the initial projects will be less than three weeks.

Skill Availability: Skilled data engineers are scarce due to skyrocketing market demand. As I draft this article, there are 14,000+ job openings in India, and demand is high in other markets too.

Cost: Hiring a data engineer with 3+ years of experience costs around €80k to €110k per annum in Europe and $110k per annum in the US. Moreover, the workload is heavier in the initial days of a project. Hence, outsourcing makes perfect sense in most cases.

Data Storage:

What is the amount of data your business will aggregate in the next year? The expected volume of data is the deciding factor for the selection of a database.

There are regular databases and massively parallel processing (MPP) databases.

If your data tables do not exceed 5–8 million records, you may opt for a regular SQL database; we recommend PostgreSQL for various reasons. If your business stores more data, consider Snowflake, Redshift, or BigQuery. If you have multiple petabytes of data, you need to consider a Data Lake architecture. Is this term new to you? Read more about Data Lakes here.
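The rule of thumb above can be sketched as a small helper. The thresholds are this article's heuristics (and the petabyte cutoff is my own rough stand-in expressed in rows), not hard limits:

```python
def suggest_storage(rows: int) -> str:
    """Rule-of-thumb storage selection: regular SQL up to ~8M rows per table,
    an MPP warehouse beyond that, and a data lake at petabyte scale.
    The row-count boundary for 'petabyte scale' is an illustrative guess."""
    if rows <= 8_000_000:
        return "regular SQL database (e.g. PostgreSQL)"
    if rows <= 1_000_000_000_000:  # loosely: still below petabyte scale
        return "MPP warehouse (Snowflake, Redshift, BigQuery)"
    return "data lake architecture"
```

In practice the decision also weighs query patterns, concurrency, and budget, but projected volume is the first filter.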

Data Transformation:

After the data arrives in the central database, it must be broken into clean, small, useful chunks and stored in different tables. This process is called data transformation and aggregation. A multitude of tools is available for this job.

DataTheta uses Pentaho for data transformation jobs. dbt is another tool worth mentioning: it is open source and reduces a data engineer's repetitive work. Automation of CRUD procedures and tracking of table lineage are useful features, alongside data testing and version control.
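As a minimal illustration of such a transformation step, the sketch below aggregates a hypothetical `raw_sales` landing table into a clean summary table. The SQL is generic and tool-agnostic (sqlite stands in for the warehouse; a dbt model or Pentaho transformation would express the same logic), and the table and column names are invented:

```python
import sqlite3

def transform_sales(conn: sqlite3.Connection) -> None:
    """One transformation/aggregation step: rebuild a small, clean
    per-outlet summary table from the raw landing table."""
    conn.executescript("""
        DROP TABLE IF EXISTS sales_by_outlet;
        CREATE TABLE sales_by_outlet AS
        SELECT outlet_id,
               SUM(amount) AS total_sales,
               COUNT(*)    AS n_transactions
        FROM raw_sales
        GROUP BY outlet_id;
    """)
```

A real pipeline would run many such models in dependency order, which is exactly the bookkeeping that tools like dbt automate.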

Data Visualization:

Data visualization is the critical component of a BI project. Visual appeal alone is not the deciding factor; instead, consider the following:

  1. Cost
  2. How much customization does it support for report consumers?
  3. Ease of use
  4. Ability to handle very large tables

At DataTheta, we tried Power BI, Tableau, Metabase, and Highcharts.

Metabase is an open-source tool. If fewer than 50 users access the dashboard and those users know how to read information from data, this is for you. Hosted on your own server, Metabase is free to use, yet powerful.

If you are a bigger organization with more than 100 end users and require centralized control of the dashboards, then Power BI is the best option. Other tools, such as Tableau and Qlik, are also worth exploring.

We also tried Highcharts, which comes with a perpetual license model. If you have an internal team to handle BI, it is a low-cost alternative to Power BI and Tableau.

Cloud Service Provider:

The cloud service provider plays an important role in your business intelligence journey. If you prefer to stick to an on-premise data centre, you need to rethink the decision. Cloud services are useful and efficient in various aspects.

AWS, Azure, and Google Cloud are the market leaders in this space. If you are already utilizing the service of any of these providers, consider building the data stacks with them. You may negotiate a better deal based on your existing relationship. This article has covered the subject comprehensively.

It is important to know how the entire data ecosystem works, and more important still to derive business value from data-driven decision-making. Data literacy is the key outcome of these efforts. It can be achieved as follows:

  1. Cherry-pick the best use case that suits your business.
  2. Understand the information content present in your data. Then, connect all the necessary data sources to a single database.
  3. Choose and set up the data stack suitable for your use case and organization culture. If you are not sure, it is helpful to talk to people who have done that.

This is an ever-growing field, and the technology evolves faster, so allow the right people to support your data analytics journey.

This article is intended to give an overall picture of data analytics stacks. If you are interested in our analytics services and skills-as-a-service, reach out to us. You are welcome to write your comments and queries.

Images Credit: Freepik.com and Excalidraw.com