These Spark interview questions will help you prepare your answers and secure the job. Get it right, and you’ll end up working smarter, not harder. Interviews are not just about knowing the answers; they are also about knowing how to answer, and reading through this article carefully will help with both.
What is Apache Spark Architecture?
Apache Spark is an open-source cluster computing framework that is setting the world of Big Data on fire. According to Spark Certified Experts, Spark’s performance is up to 100 times faster in memory and 10 times faster on disk when compared to Hadoop.
In this blog, I will give you a brief insight into Spark Architecture and the fundamentals that underlie it. Apache Spark has a well-defined layered architecture where all the Spark components and layers are loosely coupled.
This architecture is further integrated with various extensions and libraries. Apache Spark Architecture is based on two main abstractions:
Resilient Distributed Dataset (RDD)
Directed Acyclic Graph (DAG)
100 Spark Interview Questions and Answers
Below are answers to common Spark interview questions.
1. What is PageRank in GraphX?
PageRank measures the importance of each vertex in a graph, assuming an edge from u to v represents an endorsement of v’s importance by u. For example, if a Twitter user is followed by many others, the user will be ranked highly.
GraphX comes with static and dynamic implementations of PageRank as methods on the PageRank object.
Static PageRank runs for a fixed number of iterations, while dynamic PageRank runs until the ranks converge (i.e., stop changing by more than a specified tolerance). GraphOps allows calling these algorithms directly as methods on Graph.
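The difference between the static and dynamic variants can be sketched in plain Python. This is only an illustration of the idea, not the GraphX API: the function names, the 0.15 reset probability, and the tiny `follows` graph are assumptions made for the example.

```python
def pagerank_step(ranks, out_edges, reset_prob=0.15):
    """One update: every vertex splits its rank evenly among the vertices it points to."""
    new_ranks = {v: reset_prob for v in ranks}
    for src, dsts in out_edges.items():
        if not dsts:
            continue
        share = (1.0 - reset_prob) * ranks[src] / len(dsts)
        for dst in dsts:
            new_ranks[dst] += share
    return new_ranks


def initial_ranks(out_edges):
    """Collect every vertex that appears as a source or a destination."""
    vertices = set(out_edges)
    for dsts in out_edges.values():
        vertices.update(dsts)
    return {v: 1.0 for v in vertices}


def static_pagerank(out_edges, num_iter):
    """Static variant: run a fixed number of iterations."""
    ranks = initial_ranks(out_edges)
    for _ in range(num_iter):
        ranks = pagerank_step(ranks, out_edges)
    return ranks


def dynamic_pagerank(out_edges, tol, max_iter=1000):
    """Dynamic variant: stop once no rank changes by more than `tol`."""
    ranks = initial_ranks(out_edges)
    for _ in range(max_iter):
        new_ranks = pagerank_step(ranks, out_edges)
        delta = max(abs(new_ranks[v] - ranks[v]) for v in ranks)
        ranks = new_ranks
        if delta < tol:
            break
    return ranks


# Hypothetical "follows" graph: a and b both follow c, and c follows a.
follows = {"a": ["c"], "b": ["c"], "c": ["a"]}
fixed = static_pagerank(follows, num_iter=100)
converged = dynamic_pagerank(follows, tol=1e-6)
```

In this toy graph, the vertex with the most incoming edges (`c`) ends up with the highest rank, which matches the Twitter example above.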
2. What is the significance of Sliding Window operation?
In networking, a sliding window controls the transmission of data packets between computers. In Spark Streaming, the term refers to windowed computations, where transformations on RDDs are applied over a sliding window of data.
Whenever the window slides, the RDDs that fall within the particular window are combined and operated upon to produce new RDDs of the windowed DStream.
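A minimal sketch of the windowing idea in plain Python, assuming each inner list stands for the RDD of one batch interval. The function name and the window and slide sizes (measured in batches) are hypothetical, chosen only for illustration.

```python
def windowed_sums(batches, window_length, slide_interval):
    """Sum all values in each window of `window_length` batches,
    moving the window forward by `slide_interval` batches each time."""
    results = []
    for end in range(window_length, len(batches) + 1, slide_interval):
        window = batches[end - window_length:end]
        # Combine the batches that fall inside the window, then reduce them.
        merged = [value for batch in window for value in batch]
        results.append(sum(merged))
    return results


# Five batches; a window covers 3 batches and slides forward by 2.
batches = [[1, 2], [3], [4, 5], [6], [7, 8]]
sums = windowed_sums(batches, window_length=3, slide_interval=2)
```

Here the first window covers batches 1 to 3 and the second covers batches 3 to 5, so the middle batch contributes to both results.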
3. What do you understand by Transformations in Spark?
Transformations are functions applied to RDDs, resulting in another RDD. A transformation does not execute until an action occurs.
Functions such as map() and filter() are examples of transformations. The map() function applies a function to every element of the RDD and returns the results as a new RDD.
The filter() function creates a new RDD by selecting the elements of the current RDD that pass the function argument.
4. What makes Spark good at low latency workloads like graph processing and Machine Learning?
Apache Spark stores data in memory for faster processing and for building machine learning models. Machine Learning algorithms require multiple iterations and different conceptual steps to create an optimal model.
Graph algorithms repeatedly traverse the nodes and edges of a graph. Because these iterative workloads reuse the same data many times, keeping it in memory avoids repeated disk reads and leads to increased performance.
5. How is Streaming implemented in Spark?
Spark Streaming is used for processing real-time streaming data. Thus it is a useful addition to the core Spark API. It enables high-throughput and fault-tolerant stream processing of live data streams.
The fundamental stream unit is DStream which is basically a series of RDDs (Resilient Distributed Datasets) to process the real-time data.
Data from sources such as Flume and HDFS is streamed, processed, and finally pushed out to file systems, live dashboards, and databases. It is similar to batch processing because the input stream is divided into small batches.
6. Explain Lazy Evaluation.
Lazy evaluation, also known as call-by-need, is a strategy that defers the evaluation of an expression until its value is actually needed.
7. Illustrate some demerits of using Spark.
Since Spark relies on in-memory processing, it can consume more memory than Hadoop MapReduce, which may cause problems for large workloads. Developers need to be careful while running their applications on Spark.
To resolve the issue, they can distribute the workload over multiple nodes of the cluster instead of running everything on a single node.
8. What are the types of Transformation on DStream?
There are two types of transformation on DStream:
Stateless transformation: In stateless transformation, the processing of each batch does not depend on the data of its previous batches. Each stateless transformation applies separately to each RDD.
Examples: map(), flatMap(), filter(), repartition(), reduceByKey(), groupByKey().
Stateful transformation: Stateful transformations use data or intermediate results from previous batches to compute the result of the current batch.
In other words, they allow us to combine data across time. Examples: updateStateByKey and mapWithState.
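The contrast between the two kinds of transformation can be illustrated in plain Python. This is a sketch under stated assumptions, not the DStream API: each inner list stands for one batch, and `stateless_counts` and `update_state` are hypothetical names.

```python
from collections import Counter


def stateless_counts(batch):
    """Stateless: the result depends only on the current batch."""
    return dict(Counter(batch))


def update_state(state, batch):
    """Stateful: fold the current batch into counts carried over from earlier batches."""
    new_state = dict(state)
    for word in batch:
        new_state[word] = new_state.get(word, 0) + 1
    return new_state


batches = [["a", "b", "a"], ["b", "c"]]

per_batch = [stateless_counts(batch) for batch in batches]

running = {}
for batch in batches:
    running = update_state(running, batch)
```

The per-batch counts forget everything between batches, while the running state keeps accumulating, which is the behaviour a stateful operation such as updateStateByKey is meant to provide.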
9. Explain Executor Memory in a Spark application.
Every Spark application has a fixed heap size and a fixed number of cores for each Spark executor.
The heap size is what is referred to as the Spark executor memory, which is controlled with the spark.executor.memory property or the --executor-memory flag.
Typically, a Spark application has one executor on each worker node. The executor memory is basically a measure of how much of the worker node’s memory the application will use.
10. Why is BlinkDB used?
BlinkDB is a query engine for executing interactive SQL queries on huge volumes of data and renders query results marked with meaningful error bars.
BlinkDB helps users balance ‘query accuracy’ with response time. BlinkDB builds a few stratified samples of the original data and then executes the queries on the samples, rather than the original data in order to reduce the time taken for query execution.
The sizes and numbers of the stratified samples are determined by the storage availability specified when importing the data. BlinkDB consists of two main components:
Sample building engine: determines the stratified samples to be built based on workload history and data distribution.
Dynamic sample selection module: selects the correct sample files at runtime based on the time and/or accuracy requirements of the query.
Big Data Interview Questions
11. Hadoop uses replication to achieve fault tolerance. How is this achieved in Apache Spark?
The data storage model in Apache Spark is based on RDDs. RDDs achieve fault tolerance through lineage: an RDD always retains the information on how it was built from other datasets.
If any partition of an RDD is lost due to failure, lineage helps build only that particular lost partition.
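Lineage-based recovery can be sketched with a toy class in plain Python. `TinyRDD`, its `cache` dictionary, and the `computed` log are hypothetical names invented for this illustration; real RDDs are far more involved.

```python
class TinyRDD:
    """Toy dataset that remembers its source partitions and the functions applied to them."""

    def __init__(self, source_partitions, lineage=()):
        self.source_partitions = source_partitions
        self.lineage = tuple(lineage)   # ordered list of map functions
        self.cache = {}                 # partition index -> computed data
        self.computed = []              # log of which partitions were (re)computed

    def map(self, fn):
        # A transformation only extends the lineage; nothing is computed here.
        return TinyRDD(self.source_partitions, self.lineage + (fn,))

    def get_partition(self, index):
        if index not in self.cache:
            data = list(self.source_partitions[index])
            for fn in self.lineage:
                data = [fn(x) for x in data]
            self.cache[index] = data
            self.computed.append(index)
        return self.cache[index]


rdd = TinyRDD([[1, 2], [3, 4], [5, 6]]).map(lambda x: x * 10).map(lambda x: x + 1)
all_parts = [rdd.get_partition(i) for i in range(3)]

# Simulate losing partition 1, then rebuild it from the lineage.
del rdd.cache[1]
recovered = rdd.get_partition(1)
```

Only the lost partition is recomputed; the other two stay in the cache untouched.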
12. Explain the key features of Apache Spark.
Polyglot: Spark provides high-level APIs in Java, Scala, Python, and R. Spark code can be written in any of these four languages. It provides a shell in Scala and Python.
The Scala shell can be accessed through ./bin/spark-shell and Python shell through ./bin/pyspark from the installed directory.
Speed: Spark runs up to 100 times faster than Hadoop MapReduce for large-scale data processing. Spark is able to achieve this speed through controlled partitioning.
It manages data using partitions that help parallelize distributed data processing with minimal network traffic.
Multiple Formats: Spark supports multiple data sources such as Parquet, JSON, Hive, and Cassandra. The Data Sources API provides a pluggable mechanism for accessing structured data through Spark SQL.
Data sources can be more than just simple pipes that convert data and pull it into Spark.
Lazy Evaluation: Apache Spark delays its evaluation till it is absolutely necessary.
This is one of the key factors contributing to its speed. For transformations, Spark adds them to a DAG of computation, and only when the driver requests some data is this DAG actually executed.
Real-Time Computation: Spark’s computation is real-time and has low latency because of its in-memory computation.
Spark is designed for massive scalability, and the Spark team has documented users of the system running production clusters with thousands of nodes. It also supports several computational models.
Hadoop Integration: Apache Spark provides smooth compatibility with Hadoop. This is a great boon for all the Big Data engineers who started their careers with Hadoop.
Spark is a potential replacement for the MapReduce functions of Hadoop, while Spark has the ability to run on top of an existing Hadoop cluster using YARN for resource scheduling.
Machine Learning: Spark’s MLlib is the machine learning component that is handy when it comes to big data processing. It eradicates the need to use multiple tools, one for processing and one for machine learning.
Spark provides data engineers and data scientists with a powerful, unified engine that is both fast and easy to use.
13. What are the functions of Spark SQL?
Spark SQL is Apache Spark’s module for working with structured data.
Spark SQL loads the data from a variety of structured data sources.
It queries data using SQL statements, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC).
It provides a rich integration between SQL and regular Python/Java/Scala code, including the ability to join RDDs and SQL tables and expose custom functions in SQL.
14. What are Spark Datasets?
Datasets are data structures in Spark (added in Spark 1.6) that provide the JVM object benefits of RDDs (the ability to manipulate data with lambda functions), alongside a Spark SQL-optimized execution engine.
15. Explain Caching in Spark Streaming.
DStreams allow developers to cache/persist the stream’s data in memory. This is useful if the data in the DStream will be computed multiple times. This can be done using the persist() method on a DStream.
For input streams that receive data over the network (such as Kafka, Flume, sockets, etc.), the default persistence level is set to replicate the data to two nodes for fault tolerance.
17. What does MLlib do?
MLlib is a scalable Machine Learning library provided by Spark. It aims at making Machine Learning easy and scalable with common learning algorithms and use cases like clustering, regression, collaborative filtering, dimensionality reduction, and the like.
18. What is YARN in Spark?
YARN is Hadoop’s central resource management platform for delivering scalable operations throughout the cluster, and running on YARN is one of the key deployment options that Spark supports.
YARN is a cluster management technology, whereas Spark is a tool for data processing.
19. Is there any benefit of learning MapReduce if Spark is better than MapReduce?
Yes, MapReduce is a paradigm used by many big data tools including Spark as well. It is extremely relevant to use MapReduce when the data grows bigger and bigger. Most tools like Pig and Hive convert their queries into MapReduce phases to optimize them better.
20. What are the benefits of Spark over MapReduce?
Due to the availability of in-memory processing, Spark performs processing around 10 to 100 times faster than Hadoop MapReduce, whereas MapReduce makes use of persistent storage for any of the data processing tasks.
Unlike Hadoop, Spark provides inbuilt libraries to perform multiple tasks from the same core, such as batch processing, streaming, machine learning, and interactive SQL queries. Hadoop MapReduce, however, only supports batch processing.
Hadoop is highly disk-dependent, whereas Spark promotes caching and in-memory data storage.
Spark is capable of performing computations multiple times on the same dataset. This is called iterative computation, and Hadoop MapReduce has no built-in support for it.
Spark Programming Questions
21. What is shuffling in Spark? When does it occur?
Shuffling is the process of redistributing data across partitions, which may lead to data movement across the executors. It occurs while joining two tables or while performing byKey operations such as groupByKey or reduceByKey.
22. What is the difference between createOrReplaceTempView and createGlobalTempView?
createOrReplaceTempView is used when you want to store the table for a particular Spark session, and createGlobalTempView is used when you want to share the temp table across multiple Spark sessions of the same application.
23. How can you trigger automatic clean-ups in Spark to handle accumulated metadata?
You can trigger the clean-ups by setting the parameter ‘spark.cleaner.ttl’ or by dividing the long-running jobs into different batches and writing the intermediary results to the disk.
24. Compare map() and flatMap() in Spark.
In Spark, map() transformation is applied to each row in a dataset to return a new dataset. flatMap() transformation is also applied to each row of the dataset, but a new flattened dataset is returned.
In the case of flatMap, if a record is nested (e.g. a column which is in itself made up of a list, array), the data within that record gets extracted and is returned as a new row of the returned dataset.
Both map() and flatMap() transformations are narrow, which means that they do not result in a shuffling of data in Spark.
flatMap() is said to be a one-to-many transformation function because each input row can produce zero, one, or many output rows, so the result often has more rows than the input. map() returns the same number of records as were present in the input.
flatMap() can give a result that contains redundant data in some columns.
flatMap() can be used to flatten a column that contains arrays or lists. It can be used to flatten any other nested collection too.
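The difference is easy to see on an ordinary Python list. The helpers below only mimic the behaviour described above; `rdd_map` and `rdd_flat_map` are hypothetical names, not Spark functions.

```python
def rdd_map(records, fn):
    """One output element per input element."""
    return [fn(record) for record in records]


def rdd_flat_map(records, fn):
    """Each input element yields a sequence; the sequences are flattened into one list."""
    return [item for record in records for item in fn(record)]


lines = ["hello world", "spark"]

mapped = rdd_map(lines, lambda line: line.split(" "))
flat_mapped = rdd_flat_map(lines, lambda line: line.split(" "))
```

With map() the result keeps one (nested) entry per input line, while flatMap() returns the individual words as separate records.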
25. Define Actions in Spark.
An action helps in bringing back the data from RDD to the local machine. An action’s execution is the result of all previously created transformations.
Actions trigger execution using the lineage graph: Spark loads the data into the original RDD, carries out all intermediate transformations, and returns the final results to the driver program or writes them out to the file system.
reduce() is an action that applies the function passed to it again and again until only one value is left. The take(n) action brings the first n values from the RDD to the local node, while collect() brings back all of them.
26. Explain Sparse Vector.
A vector is a one-dimensional array of elements. In many applications, however, most of the vector elements are zero; such vectors are said to be sparse.
27. What is a Parquet file?
Parquet is a columnar file format supported by many data processing systems. Spark SQL performs both read and write operations on Parquet files and considers Parquet to be one of the best big data analytics formats so far.
The advantages of having columnar storage are as follows:
‣ Columnar storage limits IO operations.
‣ It can fetch specific columns that you need to access.
‣ Columnar storage consumes less space.
‣ It gives better-summarized data and follows type-specific encoding.
28. Explain Pair RDD.
Special operations can be performed on RDDs of key/value pairs, and such RDDs are referred to as Pair RDDs. Pair RDDs allow users to access each key in parallel.
They have a reduceByKey() method that aggregates data for each key and a join() method that combines different RDDs based on the elements having the same key.
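The two operations can be mimicked on plain lists of (key, value) tuples. This is a conceptual sketch only: `reduce_by_key`, `join`, and the sample data are assumptions for illustration and say nothing about how Spark distributes the work.

```python
from operator import add


def reduce_by_key(pairs, fn):
    """Combine all values that share a key using the binary function `fn`."""
    combined = {}
    for key, value in pairs:
        combined[key] = fn(combined[key], value) if key in combined else value
    return sorted(combined.items())


def join(left, right):
    """Inner join: pair up values from both sides that share the same key."""
    return [(lk, (lv, rv)) for lk, lv in left for rk, rv in right if lk == rk]


sales = [("apple", 2), ("pear", 1), ("apple", 3)]
prices = [("apple", 0.5), ("pear", 0.8)]

totals = reduce_by_key(sales, add)
joined = join(totals, prices)
```

After the reduction each key appears once, and the join attaches the matching price to each total.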
29. What do you understand by SchemaRDD in Apache Spark RDD?
SchemaRDD is an RDD that consists of row objects (wrappers around the basic string or integer arrays) with schema information about the type of data in each column. In Spark 1.3, SchemaRDD was renamed to DataFrame.
SchemaRDD was designed as an attempt to make life easier for developers in their daily routines of code debugging and unit testing on SparkSQL core module.
The idea can boil down to describing the data structures inside RDD using a formal description similar to the relational database schema.
On top of all basic functions provided by common RDD APIs, SchemaRDD also provides some straightforward relational query interface functions that are realized through SparkSQL.
30. Explain Lazy Evaluation.
Spark is intelligent in the manner in which it operates on data. When you tell Spark to operate on a given dataset, it heeds the instructions and makes a note of them so that it does not forget, but it does nothing unless asked for the final result.
When a transformation like map() is called on an RDD, the operation is not performed immediately. Transformations in Spark are not evaluated till you perform an action. This helps optimize the overall data processing workflow.
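The "make a note, do nothing until asked" behaviour can be sketched with a small class in plain Python. `LazyDataset` and the `calls` log are hypothetical, introduced only to make the deferred execution visible.

```python
class LazyDataset:
    """Records transformations and runs them only when an action is called."""

    def __init__(self, data, ops=()):
        self.data = data
        self.ops = tuple(ops)

    def map(self, fn):
        return LazyDataset(self.data, self.ops + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self.data, self.ops + (("filter", predicate),))

    def collect(self):
        """Action: replay every recorded transformation and return the result."""
        result = list(self.data)
        for kind, fn in self.ops:
            if kind == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        return result


calls = []


def double(x):
    calls.append(x)  # log every time the function actually runs
    return x * 2


dataset = LazyDataset([1, 2, 3]).map(double).filter(lambda x: x > 2)
calls_before_action = list(calls)   # still empty: nothing has run yet
result = dataset.collect()          # the action triggers the work
```

The log stays empty until collect() is called, at which point both transformations run in order.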
Databricks Interview Questions
31. What are the various levels of persistence in Apache Spark?
Apache Spark automatically persists the intermediary data from various shuffle operations. However, it is often suggested that users call the persist() method on an RDD if they plan to reuse it.
Spark has various persistence levels to store the RDDs on disk or in memory, or as a combination of both with different replication levels.
The various storage/persistence levels in Spark include MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER, DISK_ONLY, and OFF_HEAP.
32. What are receivers in Apache Spark Streaming?
Receivers are those entities that consume data from different data sources and then move them to Spark for processing.
They are created by using streaming contexts in the form of long-running tasks that are scheduled for operating in a round-robin fashion. Each receiver is configured to use up only a single core.
The receivers are made to run on various executors to accomplish the task of data streaming. There are two types of receivers depending on how the data is sent to Spark:
Reliable receivers: Here, the receiver sends an acknowledgment to the data source after the data has been successfully received and replicated in Spark storage.
Unreliable receivers: Here, no acknowledgment is sent to the data source.
33. What is SparkContext in PySpark?
A SparkContext represents the entry point to connect to a Spark cluster. It can be used to create RDDs, accumulators and broadcast variables on that particular cluster.
Only one SparkContext can be active per JVM. A SparkContext has to be stopped before creating a new one. PySpark uses the library Py4J to launch a JVM and create a JavaSparkContext. By default, the PySpark shell has a SparkContext available as ‘sc’.
Hence, creating another SparkContext there will not work.
34. Does Apache Spark provide checkpointing?
Lineage graphs are always useful to recover RDDs from failure, but recovery is generally time-consuming if the RDDs have long lineage chains. Spark has an API for checkpointing, i.e. a REPLICATE flag to persist.
However, the decision on which data to checkpoint is left to the user. Checkpoints are useful when the lineage graphs are long and have wide dependencies.
35. Under what scenarios do you use Client and Cluster modes for deployment?
If the client machine is not close to the cluster, Cluster mode should be used for deployment. This avoids the network latency that would otherwise affect communication between the driver and the executors in Client mode.
Also, in Client mode, the entire process is lost if the client machine goes offline.
If the client machine is inside the cluster, Client mode can be used for deployment.
Since the machine is inside the cluster, network latency is not an issue, and since the maintenance of the cluster is already handled, there is no cause for worry in case of failure.
36. Name the types of Cluster Managers in Spark.
The Spark framework supports three major types of Cluster Managers.
Standalone: A basic Cluster Manager to set up a cluster
Apache Mesos: A generalized/commonly-used Cluster Manager, running Hadoop MapReduce and other applications
YARN: A Cluster Manager responsible for resource management in Hadoop
37. What is a Lineage Graph?
A Lineage Graph is a dependency graph between an existing RDD and a new RDD. It means that all the dependencies between the RDDs are recorded in a graph, rather than the original data.
An RDD lineage graph is needed when we want to compute a new RDD or recover lost data from a lost persisted RDD. Spark does not support data replication in memory.
So, if any data is lost, it can be rebuilt using RDD lineage. It is also called an RDD operator graph or RDD dependency graph.
38. What is RDD?
RDD stands for Resilient Distributed Dataset. It is a fault-tolerant collection of elements that can be operated on in parallel. The partitioned data of an RDD is distributed and immutable. There are two types of datasets:
Parallelized collections: Existing collections that are distributed so they can be operated on in parallel.
Hadoop datasets: These perform operations on file records stored on HDFS or other storage systems.
39. What are DStreams?
A DStream, or discretized stream, is the high-level abstraction provided in Spark Streaming that represents a continuous stream of data. DStreams can be created either from input sources such as Kafka, Flume, or Kinesis, or by applying high-level operations on existing DStreams.
Internally, a DStream is represented by a continuous series of RDDs. Each RDD in a DStream contains data from a certain interval.
40. What are scalar and aggregate functions in Spark SQL?
In Spark SQL, scalar functions are those functions that return a single value for each row. Built-in scalar functions include array functions and map functions.
Aggregate functions return a single value for a group of rows. Some of the built-in aggregate functions include min(), max(), count(), countDistinct(), and avg(). Users can also create their own scalar and aggregate functions.
Spark Transformations Questions
41. Explain coalesce in Spark.
Coalesce in Spark is a method that is used to reduce the number of partitions in a DataFrame. Reducing partitions with the repartition() method is an expensive operation because it performs a full shuffle.
Instead, the coalesce method can be used. Coalesce does not perform a full shuffle: instead of creating new partitions, it merges the data into a subset of the existing partitions.
On a DataFrame, the coalesce method can only be used to decrease the number of partitions. Coalesce is ideally used in cases where one wants to store the same data in a smaller number of files.
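The idea of merging whole partitions instead of redistributing individual records can be sketched in plain Python. This is a simplification under stated assumptions: the `coalesce` helper below is hypothetical and assigns partitions by position, whereas Spark’s real assignment also takes data locality into account.

```python
def coalesce(partitions, num_partitions):
    """Merge neighbouring partitions until only `num_partitions` remain.
    Whole partitions are moved; individual records are never re-hashed."""
    if num_partitions >= len(partitions):
        # This sketch never increases the partition count.
        return [list(p) for p in partitions]
    merged = [[] for _ in range(num_partitions)]
    for index, partition in enumerate(partitions):
        target = index * num_partitions // len(partitions)
        merged[target].extend(partition)
    return merged


partitions = [[1], [2, 3], [4], [5, 6]]
fewer = coalesce(partitions, 2)
unchanged = coalesce(partitions, 8)
```

Asking for more partitions than currently exist leaves the layout as it is, mirroring the "decrease only" behaviour described above.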
42. How does Spark Streaming handle caching?
Caching can be handled in Spark Streaming by means of a change in settings on DStreams. A Discretized Stream (DStream) allows users to keep the stream’s data persistent in memory.
By using the persist() method on a DStream, every RDD of that particular DStream is kept in memory, which is useful if the data in the DStream has to be used for computation multiple times.
Unlike RDDs, in the case of DStreams, the default persistence level involves keeping the data serialized in memory.
43. What is RDD Lineage?
Spark does not support data replication in memory, and thus, if any data is lost, it is rebuilt using RDD lineage. RDD lineage is a process that reconstructs lost data partitions. The best part is that an RDD always remembers how it was built from other datasets.
44. Why is there a need for broadcast variables when working with Apache Spark?
Broadcast variables are read-only variables that are cached in memory on every machine. When working with Spark, using broadcast variables eliminates the need to ship a copy of a variable with every task, so data can be processed faster.
Broadcast variables help in storing a lookup table in memory, which makes retrieval more efficient than an RDD lookup().
45. What are the steps involved in structured API execution in Spark?
You write DataFrame, Dataset, or SQL code. If the code is valid, Spark converts it to a Logical Plan.
Spark transforms this Logical Plan into a Physical Plan, checking for optimizations along the way.
Spark then executes this Physical Plan (RDD manipulations) on the cluster.
46. Explain Immutable in reference to Spark.
Once a value has been created and assigned, it cannot be changed. This property is called immutability. RDDs in Spark are immutable by nature: they do not accept updates or alterations, and a transformation produces a new RDD instead. Note that the data storage is not immutable, but the data content is.
47. What are the various functionalities supported by Spark Core?
Spark Core is the engine for parallel and distributed processing of large data sets. The various functionalities supported by Spark Core include:
Scheduling and monitoring jobs
48. What do you understand by worker node?
Worker node refers to any node that can run the application code in a cluster. The driver program must listen for and accept incoming connections from its executors and must be network addressable from the worker nodes.
A worker node is basically a slave node. The master node assigns work, and the worker node actually performs the assigned tasks. Worker nodes process the data stored on the node and report their resources to the master.
Based on resource availability, the master schedules tasks.
49. Define Partitions in Apache Spark.
As the name suggests, a partition is a smaller, logical division of data, similar to a ‘split’ in MapReduce. It is a logical chunk of a large distributed data set.
Partitioning is the process of deriving logical units of data to speed up processing. Spark manages data using partitions that help parallelize distributed data processing with minimal network traffic for sending data between executors.
By default, Spark tries to read data into an RDD from the nodes that are close to it. Since Spark usually accesses distributed partitioned data, to optimize transformation operations it creates partitions to hold the data chunks.
Everything in Spark is a partitioned RDD.
Spark Optimization Techniques Questions
52. Explain receivers in Spark Streaming.
Receivers are special entities in Spark Streaming that consume data from various data sources and move them to Apache Spark.
Receivers are usually created by streaming contexts as long-running tasks on various executors and scheduled to operate in a round-robin manner with each receiver taking a single core.
53. Does Apache Spark provide checkpoints?
Yes, Apache Spark provides an API for adding and managing checkpoints. Checkpointing is the process of making streaming applications resilient to failures.
It allows you to save the data and metadata into a checkpointing directory. In case of a failure, Spark can recover this data and start from wherever it stopped.
There are 2 types of data for which we can use checkpointing in Spark.
Metadata Checkpointing: Metadata means the data about data. It refers to saving the metadata to fault-tolerant storage like HDFS. Metadata includes configurations, DStream operations, and incomplete batches.
Data Checkpointing: Here, we save the RDD to reliable storage because its need arises in some of the stateful transformations. In this case, the upcoming RDD depends on the RDDs of previous batches.
54. What is the role of accumulators in Spark?
Accumulators are variables used for aggregating information across the executors. This information can be about the data or API diagnosis like how many records are corrupted or how many times a library API was called.
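The corrupted-record use case can be sketched in plain Python, with each inner list standing in for the records handled by one task. The names and the sample data are hypothetical, and the loop only imitates what an accumulator does: tasks add to a counter and the driver reads the total.

```python
def process_partition(records):
    """Parse the records of one partition and count the ones that cannot be parsed."""
    parsed, corrupted = [], 0
    for record in records:
        try:
            parsed.append(int(record))
        except ValueError:
            corrupted += 1
    return parsed, corrupted


partitions = [["1", "x", "3"], ["4", "", "6"]]

corrupted_total = 0   # plays the role of the accumulator read by the driver
parsed_all = []
for partition in partitions:
    parsed, bad = process_partition(partition)
    parsed_all.extend(parsed)
    corrupted_total += bad
```

The driver ends up with a single diagnostic count without having to ship the bad records themselves back from the executors.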
55. What is Shark?
Most data users know only SQL and are not good at programming. Shark is a tool developed for people from a database background, giving them access to Scala MLlib capabilities through a Hive-like SQL interface.
Shark tool helps data users run Hive on Spark – offering compatibility with Hive metastore, queries, and data.
56. List some use cases where Spark outperforms Hadoop in processing.
Sensor Data Processing: Apache Spark’s ‘In-memory computing’ works best here, as data is retrieved and combined from different sources.
Real-Time Querying: Spark is preferred over Hadoop for real-time querying of data.
Stream Processing: For processing logs and detecting fraud in live streams for alerts, Apache Spark is the best solution.
57. What is a Sparse Vector?
A sparse vector has two parallel arrays –one for indices and the other for values. These vectors are used for storing non-zero entries to save space.
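The two-parallel-arrays layout can be shown in a few lines of plain Python. The helper names are hypothetical, and this is an illustration of the representation rather than Spark’s own vector classes.

```python
def to_sparse(dense):
    """Return (size, indices, values), keeping only the non-zero entries."""
    indices = [i for i, value in enumerate(dense) if value != 0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def to_dense(size, indices, values):
    """Rebuild the full vector from the sparse representation."""
    dense = [0.0] * size
    for i, value in zip(indices, values):
        dense[i] = value
    return dense


dense = [1.0, 0.0, 0.0, 3.0, 0.0]
size, indices, values = to_sparse(dense)
round_trip = to_dense(size, indices, values)
```

Only two of the five entries are stored, and the original vector can still be reconstructed exactly.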
58. What is RDD?
RDDs (Resilient Distributed Datasets) are the basic abstraction in Apache Spark that represent the data coming into the system in object format.
RDDs are used for in-memory computations on large clusters, in a fault-tolerant manner. RDDs are read-only, partitioned collections of records that are:
Immutable: RDDs cannot be altered.
Resilient: If a node holding a partition fails, another node takes over the data.
59. Explain transformations and actions in the context of RDDs.
Transformations are functions executed on demand, to produce a new RDD. All transformations are followed by actions. Some examples of transformations include map, filter, and reduceByKey.
Actions are the results of RDD computations or transformations. After an action is performed, the data from RDD moves back to the local machine. Some examples of actions include reduce, collect, first, and take.
60. What are the languages supported by Apache Spark for developing big data applications?
Scala, Java, Python, and R are supported officially. Clojure can also be used, but only through third-party bindings.
My name is John U., I’m the founding Editor at Demzyportal and an SEO Expert. I make sure every blog post published on Demzyportal is EPIC.