Data Centric

Data is everywhere. Data-Centric is here to help turn that data into information.

17/11/2024

Python is known for its versatility, readability, and ease of use, making it one of the most popular programming languages today. Key features of Python include:

1. **Simple and Readable Syntax**: Python's syntax is designed to be clean and easy to understand, which makes it a great language for beginners and promotes readability, even in complex projects.

2. **Interpreted Language**: Python is an interpreted language, so you don't need a separate compile step before running your code. The standard interpreter (CPython) compiles the source to bytecode on the fly and then executes it. This short edit-and-run cycle makes debugging easier.

3. **Dynamically Typed**: In Python, you don’t need to declare the type of a variable explicitly; types belong to values and are checked at runtime, so the same name can refer to values of different types over time. This flexibility allows for rapid prototyping and experimentation.

4. **Object-Oriented and Functional**: Python supports both object-oriented and functional programming paradigms, so you can use classes and objects as well as functional constructs like lambda functions, map, filter, and reduce.

5. **Extensive Standard Library**: Python has a comprehensive standard library that includes modules and functions for common tasks like file I/O, math, web development, data handling, and more, reducing the need for external dependencies.

6. **Cross-Platform Compatibility**: Python is cross-platform, meaning it runs on various operating systems like Windows, macOS, and Linux with little or no modification to the code.

7. **Large Community and Libraries**: Python has a massive community that contributes to a vast ecosystem of libraries and frameworks, such as NumPy and pandas for data analysis, TensorFlow and PyTorch for machine learning, and Django and Flask for web development.

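8. **Support for Multithreading and Multiprocessing**: Python supports both multithreading and multiprocessing. In the standard CPython build, the global interpreter lock (GIL) lets only one thread execute Python bytecode at a time, so threads are most useful for I/O-bound work, while multiprocessing runs CPU-bound tasks such as data processing and computation in parallel across separate processes.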

9. **Strong Support for Data Science and Machine Learning**: With libraries like pandas, NumPy, scikit-learn, and TensorFlow, Python has become the language of choice for data analysis, machine learning, and artificial intelligence applications.

10. **Interactive Mode**: Python has an interactive mode, allowing you to test snippets of code quickly in the Python shell or via tools like Jupyter Notebook, making it excellent for exploratory programming and learning.

These features make Python ideal for various applications, from web development and automation to data science, machine learning, and beyond.
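A few of the features above can be sketched in a handful of lines. This is only an illustration: the variable names and the sample sentence are made up for the example, and it uses nothing beyond the standard library.

```python
from collections import Counter
from functools import reduce

# Dynamic typing: the same name can be bound to values of different types.
value = 42
type_before = type(value).__name__
value = "forty-two"
type_after = type(value).__name__

# Functional constructs: lambda, map, and reduce.
squares = list(map(lambda n: n * n, [1, 2, 3, 4]))
total = reduce(lambda acc, n: acc + n, squares, 0)

# Standard library: count word frequencies without third-party packages.
word_counts = Counter("data is everywhere data is useful".split())

print(type_before, type_after)  # int str
print(squares, total)           # [1, 4, 9, 16] 30
print(word_counts["data"])      # 2
```

You can paste the same lines one at a time into the Python shell or a Jupyter Notebook cell to try the interactive mode described in point 10.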

17/11/2024

Programming is the process of creating a set of instructions that tell a computer how to perform a specific task or solve a problem. These instructions are written in a *programming language*, such as Python, Java, or C++. At its core, programming involves:

1. **Understanding the Problem**: Breaking down a problem into smaller, manageable parts and figuring out what steps are needed to solve it.

2. **Writing Code**: Using a programming language to translate the solution into code that a computer can understand and execute.

3. **Testing and Debugging**: Ensuring the program works as expected by running tests, finding errors (bugs), and fixing them.

4. **Refining and Optimizing**: Improving the code to make it faster, more efficient, or easier to understand.

5. **Executing Programs**: Running the code so the computer can perform the task automatically, reliably, and repeatedly.
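The steps above can be sketched with a very small, made-up problem: computing the average of a list of order amounts. The function name `average` and the sample numbers are assumptions chosen for illustration only.

```python
# 1. Understanding the problem: an average is the sum of the values
#    divided by how many values there are; an empty list has no average.

# 2. Writing code: translate that idea into a function.
def average(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    if not values:
        raise ValueError("values must not be empty")
    return sum(values) / len(values)

# 3. Testing and debugging: check cases whose answers are known in advance.
assert average([10, 20, 30]) == 20
assert average([5]) == 5

# 4. Refining: the empty-list check above was added so the function fails
#    with a clear message instead of a ZeroDivisionError.

# 5. Executing: run the code on real input.
result = average([120.0, 80.0, 100.0])
print(result)  # 100.0
```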




19/03/2024

🔻 Here are 20 frequently used DAX functions with examples:

1. SUM:

Example: Calculate the total sales amount.
Total Sales = SUM(Sales[Amount])

2. AVERAGE:

Example: Calculate the average of order quantities.
Average Quantity = AVERAGE(Orders[Quantity])

3. COUNT:

Example: Count the number of orders.
Number of Orders = COUNT(Orders[OrderID])

4. MAX and MIN:

Example: Find the highest and lowest order amounts.
Highest Order Amount = MAX(Orders[Amount])
Lowest Order Amount = MIN(Orders[Amount])

5. RELATED:

Example: Get the name of the customer related to an order.
Customer Name = RELATED(Customers[Name])

6. ALL and ALLEXCEPT:

Example: Calculate the total sales amount ignoring filters except for the Product Category.
Total Sales (All except Product Category) = CALCULATE(SUM(Sales[Amount]), ALLEXCEPT(Sales, Products[Category]))

7. SWITCH:

Example: Categorize order quantities based on predefined ranges.
Quantity Category =
SWITCH(
TRUE(),
Orders[Quantity]

02/11/2023

What is Data Science?
Data science is an interdisciplinary field that combines various techniques, algorithms, processes, and systems to extract insights and knowledge from structured and unstructured data. It encompasses a wide range of skills and methods, including statistics, machine learning, data analysis, data visualization, and domain expertise, to solve complex problems and make data-driven decisions.
What is Data Analytics?
Data analytics is the process of examining large sets of data to uncover hidden patterns, correlations, trends, and insights. It involves various techniques and tools to transform raw data into meaningful information that can be used for making informed decisions.

Difference Between Data Science & Data Analytics

Scope:
Data Science: Data science is a broader field that encompasses various aspects of data processing, including data collection, cleaning, analysis, modeling, and interpretation. It often involves complex algorithms and advanced machine learning techniques to extract insights and build predictive models.
Data Analytics: Data analytics is a subset of data science that primarily focuses on analyzing data to answer specific questions or solve immediate business problems. It involves descriptive analytics (summarizing historical data) and, to some extent, diagnostic analytics (understanding why certain events occurred).
Objective:
Data Science: The primary goal of data science is to extract valuable insights and knowledge from large and complex datasets. This can involve tasks like building predictive models, uncovering hidden patterns, and developing algorithms for various applications.
Data Analytics: Data analytics is more focused on providing actionable insights to support decision-making. It aims to answer specific questions or address particular business challenges using data.
Techniques:
Data Science: Data science often involves advanced statistical techniques, machine learning, and sometimes deep learning to build predictive models. It may also involve working with unstructured data, such as text and images.
Data Analytics: Data analytics primarily uses descriptive and diagnostic analytics techniques, which involve summarizing historical data and understanding the reasons behind specific outcomes.
Depth of Analysis:
Data Science: Data scientists typically perform in-depth analyses that may involve complex modeling and algorithm development. They are often responsible for creating models that can make predictions or recommendations based on data.
Data Analytics: Data analysts focus on analyzing data to provide insights for immediate decision-making. The depth of analysis may not be as extensive as in data science, but it is highly relevant and actionable.

Tools and Technologies:
Data Science: Data scientists often use advanced programming languages like Python or R, along with specialized libraries and frameworks for machine learning and statistical analysis.
Data Analytics: Data analysts commonly use tools like Excel, SQL, and visualization platforms (e.g., Tableau, Power BI) to conduct analysis and present findings in a clear and understandable manner.
Job Roles:
Data Science: Job roles in data science often include Data Scientist, Machine Learning Engineer, and AI Researcher. These roles involve a deep understanding of algorithms, modeling, and advanced statistical techniques.
Data Analytics: Job roles in data analytics may include Data Analyst, Business Analyst, and Business Intelligence Analyst. These roles focus on extracting actionable insights from data for business decision-making.
What to Know to Start Your Data Science Journey?
To start data science, it's important to have a solid understanding of several key concepts, skills, and tools. Here are some important things to know in data science:
Programming Languages: Proficiency in programming languages like Python and R is crucial. These languages are widely used in data science for tasks such as data manipulation, analysis, and building machine learning models.
Statistics and Probability: A strong foundation in statistics and probability theory is essential for tasks like hypothesis testing, regression analysis, and understanding the uncertainty associated with data.
Data Wrangling and Cleaning: Knowing how to clean and preprocess data is a critical skill. This involves tasks like handling missing values, dealing with outliers, and transforming data for analysis.
Data Visualization: The ability to create effective visualizations to communicate insights from data is important. Tools like Matplotlib, Seaborn, and Tableau are commonly used for data visualization.
Machine Learning: Familiarity with machine learning algorithms and techniques is a core component of data science. This includes both supervised learning (classification, regression) and unsupervised learning (clustering, dimensionality reduction).
Deep Learning (Optional): For tasks like image recognition, natural language processing, and other complex problems, knowledge of deep learning frameworks like TensorFlow or PyTorch can be beneficial.
SQL and Database Management: Understanding how to query and manipulate data in databases using SQL is important, as many organizations store their data in databases.
Big Data Technologies (Optional): Familiarity with tools like Hadoop, Spark, and distributed computing can be useful for handling and analyzing large volumes of data.
Domain Knowledge: Having domain-specific knowledge (e.g., finance, healthcare, marketing) can be incredibly valuable in understanding the context and nuances of the data.
Experimental Design and A/B Testing: Understanding how to design experiments, conduct A/B tests, and interpret results is important for making data-driven decisions.
Communication Skills: Being able to effectively communicate findings and insights to both technical and non-technical stakeholders is crucial in data science roles.
Ethical Considerations: Understanding the ethical implications of data science, including issues related to privacy, bias, and fairness, is becoming increasingly important.
Version Control (e.g., Git): Knowing how to use version control systems is important for collaboration and managing code and data.
Continuous Learning: The field of data science is constantly evolving, so a willingness and ability to keep learning and staying up-to-date with new tools and techniques is important.
Problem-Solving Skills: Being able to approach complex problems, break them down into manageable parts, and find creative solutions is a key skill in data science.
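As a small illustration of the data wrangling and cleaning skill listed above, the sketch below drops missing values and then filters outliers with the common 1.5 × IQR rule. The readings are invented for the example, and dropping missing values is only one of several possible strategies (imputing them is another).

```python
import statistics

# Hypothetical raw readings: None marks a missing value,
# and 250.0 is an obviously suspicious measurement.
raw = [21.5, 22.0, None, 23.1, 250.0, 22.4, None, 21.9, 22.8, 21.7, 22.6]

# Handle missing values: here we simply drop them.
present = [x for x in raw if x is not None]

# Detect outliers with the 1.5 * IQR rule.
q1, _, q3 = statistics.quantiles(present, n=4)  # quartile cut points
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

cleaned = [x for x in present if low <= x <= high]
outliers = [x for x in present if x < low or x > high]

print(len(raw), len(present), len(cleaned))  # 11 9 8
print(outliers)                              # [250.0]
```

In practice the same steps are usually done with pandas, but the logic is the same: decide how to treat missing values first, then decide how to treat extreme ones.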

Buzzwords everyone should keep in mind
Population: The entire group of individuals or items that you are interested in studying.
Sample: A subset of the population that is selected for data collection. Samples are often used to make inferences about the entire population.
Variable: A characteristic or quantity that can take on different values. In statistics, variables can be classified as dependent (response) or independent (predictor) variables.
Descriptive Statistics: Methods used to summarize and describe data, such as mean, median, and standard deviation.
Inferential Statistics: Techniques used to make predictions or inferences about a population based on data collected from a sample.
Mean: The average value of a set of data points. It is calculated by summing all values and dividing by the number of data points.
Median: The middle value in a set of data when the values are arranged in order. If there is an even number of data points, the median is the average of the two middle values.
Mode: The value that appears most frequently in a set of data.
Standard Deviation: A measure of the spread or dispersion of data. A low standard deviation indicates that the data points are close to the mean, while a high standard deviation indicates that they are more spread out.
Normal Distribution: A symmetric, bell-shaped probability distribution that is commonly observed in many natural phenomena.
Hypothesis Testing: A method to assess, based on sample data, whether there is enough evidence to reject a specific statement (hypothesis) about a population parameter.
Confidence Interval: A range of values within which a population parameter is estimated to lie, along with a level of confidence associated with the estimate.
Regression Analysis: A statistical technique used to examine the relationship between one or more independent variables and a dependent variable.
Correlation: A measure of the strength and direction of a linear relationship between two variables. It is usually expressed as the correlation coefficient.
Outlier: An extreme value in a data set that significantly differs from the other values and can potentially skew the results.
Probability: The likelihood or chance of an event occurring, typically expressed as a value between 0 (impossible) and 1 (certain).
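Significance Level (Alpha): The predetermined threshold used in hypothesis testing to decide whether to reject or fail to reject the null hypothesis. It is the probability of rejecting the null hypothesis when it is actually true, and is commonly set at 0.05.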
Null Hypothesis (H0): A statement that there is no significant difference or effect in the population being studied.
Alternative Hypothesis (Ha or H1): A statement that contradicts the null hypothesis and suggests a significant difference or effect in the population.
P-value: The probability of obtaining a test statistic as extreme as, or more extreme than, the one observed, assuming that the null hypothesis is true. A smaller p-value indicates stronger evidence against the null hypothesis.
Frequency: The count of how often a particular value or category appears in a data set.
Skewness: A measure of the asymmetry in the distribution of data. A positively skewed distribution has a long tail on the right, while a negatively skewed distribution has a long tail on the left.
Kurtosis: A measure of the "tailedness" of a distribution. It assesses whether the data has heavy tails (leptokurtic) or light tails (platykurtic) compared to a normal distribution.
Variance: A measure of how much individual data points deviate from the mean. It's the average of the squared differences from the mean.
Categorical Data: Data that represents categories or labels, typically not numerical in nature. For example, colors or types of animals.
Continuous Data: Data that can take on an infinite number of values within a given range, often measured on a scale. For example, height or temperature.
Central Limit Theorem: A fundamental concept in statistics stating that the sampling distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution, provided the observations are independent and the population has a finite variance.
Confounding Variable: An extraneous variable that is not the primary focus of a study but can influence the relationship between the variables being studied.
Chi-Square Test: A statistical test used to determine if there is a significant association between two categorical variables.
Covariance: A measure of how two variables change together. Positive covariance indicates that the two variables tend to increase or decrease together, while negative covariance indicates that one tends to decrease as the other increases.
ANOVA (Analysis of Variance): A statistical method used to compare the means of two or more groups to determine if they are statistically different.
Regression Coefficients: The values that represent the strength and direction of the relationship between independent variables and the dependent variable in a regression model.
Multicollinearity: A situation in regression analysis where two or more independent variables are highly correlated, making it difficult to distinguish their individual effects on the dependent variable.
Histogram: A graphical representation of the distribution of numerical data, which is divided into bins or intervals to show the frequency of values within each bin.
Random Sample: A sample in which each member of the population has an equal and independent chance of being selected, resulting in an unbiased representation of the population.
Statistical Significance: A result is statistically significant when its p-value falls below the predefined significance level, meaning that data this extreme would be unlikely if the null hypothesis were true.
Power (Statistical Power): The probability of correctly rejecting a false null hypothesis, or the ability of a statistical test to detect a true effect when it exists.
Time Series Data: Data collected or recorded at successive points in time, often used for forecasting and analyzing trends over time.
Cross-Tabulation: A method of summarizing and analyzing data from two or more categorical variables using a contingency table to show the relationships between them.
Coefficient of Determination (R-squared): A measure that represents the proportion of the variance in the dependent variable that is explained by the independent variables in a regression model.
Descriptive Statistics: These techniques help summarize and describe data. Common descriptive statistics include measures like mean, median, mode, standard deviation, and range. They provide an initial overview of the data's characteristics.
Inferential Statistics: These methods are used to draw conclusions or make predictions about a population based on a sample of data. Inferential statistics include hypothesis testing, confidence intervals, and regression analysis.
Probability: Probability theory is fundamental in data science for understanding uncertainty and making predictions. It includes concepts like conditional probability, Bayes' theorem, and random variables.
Data Distributions: Understanding different data distributions, such as the normal distribution, Poisson distribution, and binomial distribution, is crucial for modeling and analyzing data effectively.
Hypothesis Testing: Hypothesis testing is used to make decisions based on sample data. Data scientists use it to assess the significance of observed effects, such as A/B testing for website optimization.
Regression Analysis: Regression models help in understanding and predicting the relationship between one or more independent variables and a dependent variable. Linear regression is a common technique, but there are many variations, including multiple regression, logistic regression, and polynomial regression.
Machine Learning: Data science often involves various machine learning algorithms, which are based on statistical principles. These algorithms include decision trees, support vector machines, k-means clustering, and neural networks.
Resampling Techniques: Techniques like bootstrapping and cross-validation are used to assess the stability and reliability of statistical models and to avoid overfitting.
Dimensionality Reduction: Principal Component Analysis (PCA) and other dimensionality reduction techniques are used to reduce the complexity of datasets by identifying important variables and patterns.
Statistical Software: Data scientists use statistical software such as R and Python with libraries like NumPy, pandas, and SciPy to perform data analysis and statistical modeling.
Time Series Analysis: For data with a temporal component, data scientists use time series analysis to model and predict trends and patterns over time.
Sampling Techniques: Data scientists often work with large datasets and use various sampling methods, like random sampling or stratified sampling, to obtain representative samples for analysis.
Outlier Detection: Identifying and handling outliers is a common statistical task to ensure that unusual or erroneous data points do not unduly influence analysis or model performance.
Confounding Variables: Data scientists need to account for and control confounding variables, which can affect the relationship between variables being studied.
Feature Engineering: Statistical techniques are used to create new features from existing data, which can improve the performance of machine learning models.
Statistical Visualization: Data visualization is essential for understanding data patterns and relationships. Data scientists use techniques like histograms, scatter plots, and box plots to explore and present data effectively.
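Several of the terms above (mean, median, mode, variance, standard deviation, correlation, and R-squared) can be computed with Python's built-in statistics module. The sketch below uses a small invented dataset of study hours and exam scores. The `pearson` helper is written out by hand for illustration, and taking R-squared as the square of the correlation coefficient holds only for simple linear regression with one independent variable.

```python
import statistics

hours_studied = [2, 4, 4, 5, 7, 8]      # hypothetical independent variable
exam_scores = [50, 60, 60, 65, 80, 85]  # hypothetical dependent variable

# Descriptive statistics
mean_score = statistics.mean(exam_scores)
median_score = statistics.median(exam_scores)  # average of the two middle values
mode_score = statistics.mode(exam_scores)      # most frequent value
variance = statistics.pvariance(exam_scores)   # mean squared deviation from the mean
std_dev = statistics.pstdev(exam_scores)       # square root of the variance


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    co_deviation = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = sum((x - mx) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - my) ** 2 for y in ys) ** 0.5
    return co_deviation / (spread_x * spread_y)


r = pearson(hours_studied, exam_scores)
r_squared = r ** 2  # share of variance explained in a one-variable linear fit

print(median_score, mode_score)  # 62.5 60
print(round(mean_score, 2), round(std_dev, 2))
print(round(r, 3), round(r_squared, 3))
```

Note that `pvariance` and `pstdev` treat the data as the whole population; for a sample drawn from a larger population, `statistics.variance` and `statistics.stdev` divide by n - 1 instead.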
Thanks
Hasnat Osman

16/03/2023

Sales dashboard using Power BI.





28/01/2023

Git vs. GitHub

What Is Git?
Git is a free, open-source distributed version control system for tracking changes to source code during software development. It was originally created to coordinate work among programmers, but today it is widely used to track changes in any set of files. The key objectives of Git are as follows:

Speed and efficiency
Data integrity
Support for distributed and non-linear workflows

What Is GitHub?
GitHub is a web-based hosting service for Git repositories, with cloud-based storage. It offers all the distributed version control and source code management functionality of Git while adding its own features, which makes it easier to collaborate using Git.

Additionally, GitHub repositories can be made public (or kept private). Developers worldwide can interact with public repositories and contribute to one another’s code by modifying or improving it, which makes GitHub a networking site for web professionals. This process of interaction and contribution is also called social coding.


