Top 25 Data Science Interview Questions and Answers (2025) — Freshers & Experienced | Jobdexo

If you are preparing for a data science interview, you already know how overwhelming it can feel. Companies like Amazon, Google, Flipkart, TCS, Infosys, and hundreds of startups are hiring data scientists right now — but their interviews are tough. I have compiled the 25 most commonly asked data science interview questions based on real interview experiences shared by candidates on Glassdoor, LinkedIn, and our own Jobdexo community.

Whether you are a fresher from the 2024 or 2025 batch or someone with 1-2 years of experience, these questions cover what interviewers actually ask.

1. What is the difference between supervised and unsupervised learning?
Supervised learning uses labeled data to train a model — for example, predicting whether an email is spam or not. Unsupervised learning works with unlabeled data and finds hidden patterns — like clustering customers into groups based on their buying behavior.

2. What is overfitting and how do you prevent it?
Overfitting happens when a model learns the training data too well, including its noise, and performs poorly on new data. You can detect it with cross-validation and reduce it with regularization (L1/L2), pruning decision trees, or using more training data.

3. Explain the bias-variance tradeoff.
Bias is the error from wrong assumptions in the model. Variance is the error from sensitivity to small changes in training data. High bias leads to underfitting, high variance leads to overfitting. A good model balances both.

4. What is a confusion matrix?
A confusion matrix is a table that shows the performance of a classification model. It contains four values — True Positive, True Negative, False Positive, and False Negative. From these you can calculate accuracy, precision, recall, and F1 score.
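If the interviewer asks you to build one by hand, a minimal sketch could look like this. The labels are made up for illustration (1 = spam, 0 = not spam), and `confusion_counts` is a hypothetical helper name, not a library function.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, tn, fp, fn) for a binary classification problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # predicted positive, really positive
        elif predicted == positive:
            fp += 1  # predicted positive, really negative
        elif actual == positive:
            fn += 1  # predicted negative, really positive
        else:
            tn += 1  # predicted negative, really negative
    return tp, tn, fp, fn


# Hypothetical labels: 1 = spam, 0 = not spam
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
print(tp, tn, fp, fn, accuracy)  # 3 3 1 1 0.75
```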

5. What is the difference between precision and recall?
Precision measures how many of the predicted positives are actually positive. Recall measures how many actual positives were correctly identified. When false positives are costly, optimize for precision. When false negatives are costly, optimize for recall.
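To make the two definitions concrete, here is a small sketch that computes precision, recall, and F1 from raw counts. The counts are invented for the example, and returning 0.0 when a denominator is zero is a convention chosen here, not the only option.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    # Precision: of everything predicted positive, how much was right?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything actually positive, how much did we find?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

In this made-up case the model is right most of the time when it says "positive" (high precision) but misses half of the real positives (low recall).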

6. What is cross-validation?
Cross-validation is a technique to evaluate model performance by splitting data into k subsets (folds). The model trains on k-1 folds and tests on the remaining one. This process repeats k times so that every fold is used for testing once, and the k scores are averaged. It gives a more reliable performance estimate than a single train-test split.
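The splitting step can be sketched in plain Python as below. `k_fold_indices` is a hypothetical helper written for this example; it does not shuffle, whereas in practice you would usually shuffle the data first (except for time series).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for train, test in folds:
    print("train:", train, "test:", test)
```

You would fit the model on each `train` split, score it on the matching `test` split, and average the k scores.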

7. Explain the difference between a random forest and a decision tree.
A decision tree is a single tree that splits data based on feature values. A random forest trains many decision trees, each on a bootstrap sample of the data with a random subset of features considered at each split, and combines their results by voting or averaging. This reduces overfitting, so a random forest usually generalizes better than a single tree.

8. What is regularization? Explain L1 and L2.
Regularization adds a penalty to the loss function to reduce overfitting. L1 (Lasso) adds the sum of the absolute values of the coefficients and can shrink some of them to exactly zero, effectively doing feature selection. L2 (Ridge) adds the sum of the squared coefficients; it keeps all features but shrinks their coefficients toward zero.
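As a rough illustration of how the two penalties differ, the sketch below adds each one to a made-up base loss. The weights, the base loss value, and the strength `lam` are all assumptions for the example.

```python
def l1_penalty(weights, lam):
    """Lasso penalty: lam times the sum of absolute coefficient values."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """Ridge penalty: lam times the sum of squared coefficient values."""
    return lam * sum(w * w for w in weights)


weights = [0.5, -2.0, 0.0, 3.0]  # hypothetical model coefficients
base_loss = 1.25                 # hypothetical unregularized loss

lasso_loss = base_loss + l1_penalty(weights, lam=0.1)
ridge_loss = base_loss + l2_penalty(weights, lam=0.1)
print(round(lasso_loss, 3), round(ridge_loss, 3))
```

Notice that the large coefficient (3.0) contributes far more to the L2 penalty than to the L1 penalty, because it is squared.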

9. What is gradient descent?
Gradient descent is an optimization algorithm used to minimize the loss function. It works by calculating the gradient of the loss and moving in the opposite direction by a small step (learning rate) until it reaches the minimum.
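A toy version of the update rule, assuming a one-dimensional loss f(x) = (x - 3)^2 whose minimum we already know is at x = 3. The function name, learning rate, and step count are choices made for this sketch.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x


# Loss f(x) = (x - 3)^2 has gradient f'(x) = 2 * (x - 3)
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(round(minimum, 4))  # 3.0
```

If the learning rate is too large (for this loss, anything above 1.0), the steps overshoot and the values diverge instead of converging, which is a common follow-up question.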

10. What is the difference between bagging and boosting?
Bagging builds multiple models in parallel on random subsets and combines them — Random Forest uses bagging. Boosting builds models sequentially where each new model corrects the errors of the previous one — XGBoost and AdaBoost use boosting.

11. What is PCA?
Principal Component Analysis is a dimensionality reduction technique. It transforms high-dimensional data into fewer dimensions while retaining as much variance as possible. It is useful when you have too many features and want to reduce computation time.

12. How do you handle missing values in a dataset?
Common approaches include removing rows with missing values, filling them with mean, median, or mode, using forward or backward fill for time series, or using machine learning models to predict missing values.
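Two of these approaches, mean imputation and forward fill, can be sketched without any libraries. Here missing values are represented as `None`, and the sample columns are invented; with pandas you would normally reach for its built-in fill methods instead.

```python
from statistics import mean


def fill_with_mean(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def forward_fill(values):
    """Replace None with the most recent observed value (time series)."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)  # stays None if nothing has been observed yet
    return filled


ages = [25, None, 35, 30, None]        # hypothetical column
prices = [100, None, None, 103, None]  # hypothetical daily series

print(fill_with_mean(ages))  # [25, 30, 35, 30, 30]
print(forward_fill(prices))  # [100, 100, 100, 103, 103]
```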

13. What is the difference between correlation and causation?
Correlation means two variables move together. Causation means one variable directly causes the change in another. A classic example — ice cream sales and drowning rates are correlated but ice cream does not cause drowning. Both increase in summer.
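The ice cream example can be simulated with made-up numbers. In the sketch below both series are generated from temperature only, so neither causes the other, yet their Pearson correlation comes out essentially perfect. The coefficients used to generate the data are arbitrary assumptions.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: temperature is the common cause of both series
temperature = [10, 15, 20, 25, 30]
ice_cream_sales = [2 * t + 5 for t in temperature]
drownings = [0.3 * t + 1 for t in temperature]

r = pearson_r(ice_cream_sales, drownings)
print(round(r, 3))  # 1.0
```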

14. What is a p-value?
A p-value is the probability of observing results at least as extreme as yours, assuming the null hypothesis is true. It is not the probability that the null hypothesis is true. A p-value below 0.05 is conventionally treated as statistically significant, and you reject the null hypothesis.
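A small worked case can help here. Suppose you flip a coin 20 times and get 16 heads, and the null hypothesis is that the coin is fair. The sketch below computes an exact two-sided p-value by counting every outcome at least as far from 10 heads as the one observed. The scenario and the function name are invented for illustration.

```python
from math import comb


def two_sided_p_value_fair_coin(n_flips, n_heads):
    """P(result at least as far from n/2 as observed | the coin is fair)."""
    distance = abs(n_heads - n_flips / 2)
    # Count the equally likely flip sequences that are at least this extreme
    extreme = sum(comb(n_flips, k) for k in range(n_flips + 1)
                  if abs(k - n_flips / 2) >= distance)
    return extreme / 2 ** n_flips


p = two_sided_p_value_fair_coin(n_flips=20, n_heads=16)
print(round(p, 4))  # 0.0118
```

Since this is below 0.05, you would reject the hypothesis that the coin is fair at the usual threshold.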

15. Explain the Central Limit Theorem.
The Central Limit Theorem states that if you take large enough random samples from a population with finite variance, the distribution of sample means will be approximately normal, regardless of the shape of the original distribution. This is the foundation of many statistical tests.
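You can check this with a quick simulation. The sketch below draws samples from a heavily skewed exponential population (mean 1, standard deviation 1) and looks at the sample means. The sample size, number of samples, and seed are arbitrary choices for the demo.

```python
import random
from statistics import mean, stdev


def sample_means(draw, sample_size, n_samples, seed=0):
    """Draw n_samples samples of sample_size each and return their means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [sum(draw(rng) for _ in range(sample_size)) / sample_size
            for _ in range(n_samples)]


# Skewed population: exponential distribution with mean 1
means = sample_means(lambda rng: rng.expovariate(1.0),
                     sample_size=50, n_samples=2000)

center = mean(means)   # should be close to the population mean, 1
spread = stdev(means)  # should be close to 1 / sqrt(50), about 0.14
print(round(center, 2), round(spread, 2))
```

Plotting a histogram of `means` would show a roughly bell-shaped curve even though the underlying population is far from normal.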

16. What is the difference between Type I and Type II errors?
Type I error is a false positive — you reject the null hypothesis when it is actually true. Type II error is a false negative — you fail to reject the null hypothesis when it is actually false.

17. What is feature engineering?
Feature engineering is the process of using domain knowledge to create new features or transform existing ones to improve model performance. For example, extracting day of week from a date column or combining two columns into a ratio.
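Both examples from the answer can be sketched on a single record. The column names (`order_date`, `debt`, `income`) and the values are hypothetical.

```python
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def add_features(row):
    """Return a copy of the row with three engineered features added."""
    enriched = dict(row)
    weekday = row["order_date"].weekday()  # Monday = 0 ... Sunday = 6
    enriched["day_of_week"] = DAY_NAMES[weekday]
    enriched["is_weekend"] = weekday >= 5
    # Ratio feature built from two raw columns
    enriched["debt_to_income"] = row["debt"] / row["income"]
    return enriched


row = {"order_date": date(2025, 1, 4), "debt": 20000, "income": 80000}
features = add_features(row)
print(features["day_of_week"], features["is_weekend"], features["debt_to_income"])
# Saturday True 0.25
```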

18. What is the difference between RMSE and MAE?
RMSE (Root Mean Squared Error) penalizes large errors more because it squares them before averaging. MAE (Mean Absolute Error) weights each error in proportion to its size. Use RMSE when large errors are particularly bad. Use MAE when you want a metric that is less sensitive to outliers.
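A tiny made-up example shows the difference. Both sets of predictions below have the same total absolute error, but one spreads it evenly and the other concentrates it in a single large miss.

```python
from math import sqrt


def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Squared Error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


y_true = [10, 10, 10, 10]
small_errors = [8, 12, 8, 12]     # four errors of size 2
one_big_error = [10, 10, 10, 18]  # a single error of size 8

print(mae(y_true, small_errors), rmse(y_true, small_errors))    # 2.0 2.0
print(mae(y_true, one_big_error), rmse(y_true, one_big_error))  # 2.0 4.0
```

MAE cannot tell the two cases apart, while RMSE doubles for the single large miss.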

19. What is A/B testing?
A/B testing is an experiment where you split users into two groups — one sees version A and the other sees version B. You measure which version performs better using statistical tests. It is widely used in product and marketing decisions.
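One common choice of statistical test for conversion rates is the two-proportion z-test, sketched below with invented numbers (2,000 users per group). This is only one option; which test is appropriate depends on the metric and the experiment design.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical results: A converts 200 of 2000 users, B converts 260 of 2000
z, p_value = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=260, n_b=2000)
print(round(z, 2), round(p_value, 4))
```

With these numbers the p-value is well below 0.05, so you would conclude that version B's higher conversion rate is unlikely to be due to chance alone.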

20. Explain the concept of a ROC curve and AUC.
ROC (Receiver Operating Characteristic) curve plots True Positive Rate against False Positive Rate at different thresholds. AUC (Area Under the Curve) measures the overall performance — a value of 1 is perfect, 0.5 means the model is no better than random guessing.
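AUC also has a handy interpretation: it is the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition on made-up scores. It compares every positive-negative pair, so it is only meant for small examples.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # hypothetical predicted probabilities

auc = roc_auc(y_true, scores)
print(auc)  # 0.75
```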

21. What is the difference between deep learning and machine learning?
Machine learning uses algorithms that learn patterns from data. Deep learning is a subset of machine learning that uses neural networks with many layers. Deep learning is better for unstructured data like images, audio, and text but requires much more data and computing power.

22. What is the curse of dimensionality?
As the number of features increases, the data becomes increasingly sparse in the high-dimensional space. Models need exponentially more data to perform well. This is why dimensionality reduction techniques like PCA are important.

23. What is time series analysis?
Time series analysis deals with data points collected over time in sequential order. It is used to identify trends, seasonality, and patterns to make future predictions. Common models include ARIMA, SARIMA, and LSTM.

24. What is the difference between a parametric and non-parametric model?
Parametric models assume a specific form for the relationship between variables and have a fixed number of parameters — like linear regression. Non-parametric models make no such assumptions and can be more flexible — like KNN or decision trees.

25. How would you explain a machine learning model to a non-technical stakeholder?
This is a communication question. A good answer focuses on the business outcome rather than technical details. For example — instead of saying "we used a random forest classifier," say "we built a system that predicts which customers are likely to leave so the sales team can reach out to them first."

Final Tip
Do not just memorize these answers. Practice explaining them out loud as if you are in a real interview. Companies like Google, Amazon, and Flipkart care more about how you think through a problem than whether you know the exact definition.
Good luck with your data science interview. If you found this helpful, share it with your friends who are also preparing. Check out more interview preparation resources on Jobdexo.
