Mastering Data Science Interviews: Top 25 Questions to Ace Your Next Job Hunt

Core Concepts Questions
• These focus on core concepts like modeling, algorithms, and data handling.
1. What is overfitting, and how does it differ from underfitting? Overfitting occurs when a model performs well on training data but poorly on new data, a symptom of high variance and low bias; underfitting, caused by high bias and low variance, performs poorly on both.
2. Explain the bias-variance tradeoff. High bias leads to underfitting (an overly simplistic model), high variance to overfitting (the model captures noise); balancing the two minimizes total error.
3. What is a confusion matrix? A table (2×2 for binary classification) of true positives, false positives, true negatives, and false negatives, used to compute metrics such as precision and recall.
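A minimal Python sketch of the binary case, using illustrative 0/1 label lists:

```python
# Build binary confusion-matrix counts from label lists and derive
# precision and recall. y_true/y_pred below are made-up example data.
def confusion_counts(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, tn, fn

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)   # 2 / 3
recall = tp / (tp + fn)      # 2 / 3
```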
4. Differentiate logistic regression from linear regression. Logistic regression predicts binary outcomes by passing a linear score through the sigmoid to obtain probabilities; linear regression predicts continuous values directly.
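The contrast can be sketched in a few lines (single-feature case, with illustrative weight and bias values):

```python
# Logistic regression maps a linear score to a probability via the
# sigmoid; linear regression returns the score itself.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear(w, b, x):
    return w * x + b                  # continuous output (linear regression)

def logistic(w, b, x):
    return sigmoid(linear(w, b, x))   # probability in (0, 1)
```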
5. What is a random forest? An ensemble of decision trees trained with bagging and random feature selection; predictions come from a majority vote (classification) or averaging (regression).
6. Define p-value. The probability of observing results at least as extreme as those found, assuming the null hypothesis is true; a low value (commonly ≤ 0.05) leads to rejecting the null, while a high value means there is insufficient evidence to reject it.
7. What are Type I and Type II errors? Type I: a false positive (rejecting a true null hypothesis); Type II: a false negative (failing to reject a false null hypothesis).
8. Explain gradient descent. An optimization algorithm that minimizes a loss function by iteratively adjusting parameters in the direction of steepest descent (the negative gradient).
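A minimal sketch on a toy one-dimensional loss, f(x) = (x − 3)², whose gradient is 2(x − 3); the learning rate and step count are illustrative choices:

```python
# Gradient descent on f(x) = (x - 3)^2: step against the gradient
# until the iterate approaches the minimum at x = 3.
def gradient_descent(x0, lr=0.1, steps=100):
    x = x0
    for _ in range(steps):
        grad = 2 * (x - 3)   # derivative of the loss at x
        x -= lr * grad       # move opposite the gradient
    return x

x_min = gradient_descent(x0=0.0)   # converges close to 3.0
```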
9. What is feature engineering? Creating or transforming variables to improve model performance, such as scaling numeric features or encoding categorical ones.
10. Describe selection bias. When samples are chosen non-randomly, the resulting data does not represent the population and skews results.
SQL and Coding Questions
• These expect live coding for data manipulation.
11. Write SQL to fetch customer orders with customer info. SELECT o.*, c.* FROM orders o JOIN customers c ON o.customer_id = c.id;
12. How should you handle a feature with more than 30% missing values? Drop the rows (or the feature) if the dataset is large, or impute with the mean/median; assess the impact on the distribution first.
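A minimal sketch of mean imputation in plain Python, using None to mark missing entries (the input list is illustrative):

```python
# Replace missing values (None) with the mean of the observed values.
def impute_mean(values):
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

filled = impute_mean([1.0, None, 3.0, None])   # [1.0, 2.0, 3.0, 2.0]
```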
13. Calculate Jaccard similarity. The size of the intersection of two sets divided by the size of their union.
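This definition translates directly to Python set operations:

```python
# Jaccard similarity: |A ∩ B| / |A ∪ B| for two Python sets.
def jaccard(a, b):
    if not a and not b:
        return 1.0           # convention: two empty sets are identical
    return len(a & b) / len(a | b)

sim = jaccard({1, 2, 3}, {2, 3, 4})   # 2 / 4 = 0.5
```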
14. What are window functions? SQL functions such as ROW_NUMBER() or LAG() that compute values over a partition of rows without collapsing them into one row.
15. Reverse a linked list (pseudocode). Iterate with two pointers (previous and current), redirecting each node's next pointer, or use a recursive approach.
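The iterative approach can be sketched in Python (Node is a minimal illustrative class):

```python
# Iterative reversal: walk the list once, redirecting each node's
# next pointer to the previous node.
class Node:
    def __init__(self, val, next=None):
        self.val, self.next = val, next

def reverse(head):
    prev = None
    while head:
        head.next, prev, head = prev, head, head.next
    return prev

lst = Node(1, Node(2, Node(3)))
rev = reverse(lst)   # 3 -> 2 -> 1
```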
Statistics and ML Questions
• These probe foundational math.
16. What is RMSE? Root mean squared error, \(\sqrt{\frac{1}{n} \sum_i (y_i - \hat{y}_i)^2}\), measuring average prediction deviation.
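A direct Python translation of the formula, on illustrative values:

```python
# RMSE: square the residuals, average them, take the square root.
import math

def rmse(y_true, y_pred):
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

err = rmse([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])   # sqrt((1 + 0 + 4) / 3)
```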
17. Explain correlation vs. covariance. Covariance measures the direction of joint variability between two variables; correlation is covariance normalized to the range −1 to 1.
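The normalization step can be shown explicitly (population covariance, illustrative data):

```python
# Pearson correlation = covariance divided by the product of the
# two standard deviations.
import math

def covariance(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    sx = math.sqrt(covariance(xs, xs))   # standard deviation of xs
    sy = math.sqrt(covariance(ys, ys))   # standard deviation of ys
    return covariance(xs, ys) / (sx * sy)

r = correlation([1, 2, 3], [2, 4, 6])   # perfectly linear -> 1.0
```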
18. What is cross-validation? Split the data into k folds and rotate which fold is held out for testing while training on the rest, to assess generalization (e.g., k-fold cross-validation).
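A minimal sketch of the k-fold split itself (index bookkeeping only, no model; `k_fold_indices` is a hypothetical helper, not a library function):

```python
# k-fold splitting: each index lands in the test fold exactly once
# across the k rotations; the rest form the training set.
def k_fold_indices(n, k):
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    splits, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        splits.append((train, test))
        start += size
    return splits

splits = k_fold_indices(n=6, k=3)   # three (train, test) index pairs
```

In practice one would shuffle indices first and use a library implementation; the sketch only shows the rotation structure.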
19. Differences between supervised and unsupervised learning. Supervised learning uses labeled data for prediction; unsupervised learning finds patterns in unlabeled data.
20. What is PCA? Principal Component Analysis reduces dimensionality by projecting data onto the eigenvectors of its covariance matrix.

Behavioral Questions
• These demonstrate real-world application.
21. Describe a failed project. Outline the problem, your actions (e.g., debugging overfitting), and the lessons learned (e.g., better validation).
22. How do you align data projects with business goals? Start with KPIs, iterate with stakeholders, and measure ROI.
23. Explain ML to a non-technical person. Use an analogy: a model is like a recipe that learns from examples to predict outcomes.
24. Time series vs. regression? Time series models account for autocorrelation and seasonality; standard regression assumes independent observations.
25. How often should you update an algorithm? When data drifts, performance drops, or business needs change; monitor metrics continuously to decide.