
Latest [Sep 06, 2022] Associate-Developer-Apache-Spark Exam Dumps - Valid and Updated Dumps
Free Sales Ending Soon - 100% Valid Associate-Developer-Apache-Spark Exam Dumps with 179 Questions
Importance of Databricks Associate Developer Apache Spark Exam to Secure your future
Data science is a rapidly growing field that's being used in a wide range of industries today. It's an essential skill for any developer or business person looking to make the most out of their data. Whether you want to be a part of the rapidly expanding world of data science or are just starting out with your career, the Data Science Associate Developer exam is the best way to test your knowledge and skills in this field. Databricks Associate Developer Apache Spark exam dumps are the best way to prepare for this exam.
In the current market, the demand for data scientists is increasing day by day. As a data scientist, you are required to analyze data and predict future trends. To do so, you need to find the right data and use the right tools. Various tools are available for data analysis, and you can choose the ones that suit your needs, for example R, Python, Tableau, or SAS. Data scientists are required to work on many different platforms and tools, and they should be able to switch between them easily.
NEW QUESTION 83
The code block shown below should return a DataFrame with two columns, itemId and col. In this DataFrame, for each element in column attributes of DataFrame itemsDf there should be a separate row in which the column itemId contains the associated itemId from DataFrame itemsDf. The new DataFrame should only contain rows derived from rows of DataFrame itemsDf in which the column attributes contains the element cozy.
A sample of DataFrame itemsDf is below.
Code block:
itemsDf.__1__(__2__).__3__(__4__, __5__(__6__))
- A. 1. filter
2. "array_contains(attributes, 'cozy')"
3. select
4. "itemId"
5. explode
6. "attributes"
- B. 1. filter
2. "array_contains(attributes, cozy)"
3. select
4. "itemId"
5. explode
6. "attributes"
- C. 1. filter
2. array_contains("cozy")
3. select
4. "itemId"
5. explode
6. "attributes"
- D. 1. where
2. "array_contains(attributes, 'cozy')"
3. select
4. itemId
5. explode
6. attributes
- E. 1. filter
2. "array_contains(attributes, 'cozy')"
3. select
4. "itemId"
5. map
6. "attributes"
Answer: A
Explanation:
The correct code block is:
itemsDf.filter("array_contains(attributes, 'cozy')").select("itemId", explode("attributes"))
The key here is understanding how to use array_contains(). You can either use it as an expression in a string, or you can import it from pyspark.sql.functions. In that case, the following would also work:
itemsDf.filter(array_contains("attributes", "cozy")).select("itemId", explode("attributes"))
Static notebook | Dynamic notebook: See test 1 (https://flrs.github.io/spark_practice_tests_code/#1/29.html, https://bit.ly/sparkpracticeexams_import_instructions)
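To make the two steps concrete, here is a minimal plain-Python sketch (not PySpark) of what filtering with array_contains and then calling explode does to the rows. The sample rows are an assumption borrowed from the itemsDf samples shown for other questions in this document, and filter_and_explode is a hypothetical helper, not a Spark function.

```python
# Hypothetical sample rows mirroring the itemId and attributes columns of itemsDf.
items = [
    {"itemId": 1, "attributes": ["blue", "winter", "cozy"]},
    {"itemId": 2, "attributes": ["red", "summer", "fresh", "cooling"]},
    {"itemId": 3, "attributes": ["green", "summer", "travel"]},
]


def filter_and_explode(rows, wanted):
    """Keep rows whose attributes contain `wanted`, then emit one row per attribute."""
    result = []
    for row in rows:
        # Plays the role of filter("array_contains(attributes, 'cozy')").
        if wanted in row["attributes"]:
            # Plays the role of explode("attributes"): one output row per element.
            for attribute in row["attributes"]:
                result.append({"itemId": row["itemId"], "col": attribute})
    return result


exploded = filter_and_explode(items, "cozy")
for row in exploded:
    print(row)
```

Only item 1 contains cozy, so the sketch yields three rows, one per attribute of that item, each carrying itemId 1.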
NEW QUESTION 84
Which of the following code blocks returns a copy of DataFrame transactionsDf in which column productId has been renamed to productNumber?
- A. transactionsDf.withColumn("productId", "productNumber")
- B. transactionsDf.withColumnRenamed("productId", "productNumber")
- C. transactionsDf.withColumnRenamed(col(productId), col(productNumber))
- D. transactionsDf.withColumnRenamed(productId, productNumber)
- E. transactionsDf.withColumnRenamed("productNumber", "productId")
Answer: B
Explanation:
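withColumnRenamed() takes the existing column name as its first argument and the new column name as its second argument, both as strings.
More info: pyspark.sql.DataFrame.withColumnRenamed - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2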
NEW QUESTION 85
Which of the following code blocks returns only rows from DataFrame transactionsDf in which values in column productId are unique?
- A. transactionsDf.drop_duplicates(subset="productId")
- B. transactionsDf.dropDuplicates(subset=["productId"])
- C. transactionsDf.dropDuplicates(subset="productId")
- D. transactionsDf.distinct("productId")
- E. transactionsDf.unique("productId")
Answer: B
Explanation:
Although one of the answers uses a method called unique(), that method does not actually exist in PySpark. In PySpark, it is called distinct(). However, distinct() is not the right method to use here either: it takes no arguments and removes duplicates across entire rows, so while it could be combined with select() to return the unique values of a specific column, it cannot return entire rows that are unique in productId.
So the trick is to use dropDuplicates with the subset keyword parameter. In the documentation for dropDuplicates, the examples show that subset should be used with a list. And this is exactly the key to solving this question: The productId column needs to be fed into the subset argument in a list, even though it is just a single column.
More info: pyspark.sql.DataFrame.dropDuplicates - PySpark 3.1.1 documentation
Static notebook | Dynamic notebook: See test 1
NEW QUESTION 86
Which of the following code blocks returns a single-row DataFrame that only has a column corr which shows the Pearson correlation coefficient between columns predError and value in DataFrame transactionsDf?
- A. transactionsDf.select(corr(predError, value).alias("corr"))
- B. transactionsDf.select(corr(col("predError"), col("value")).alias("corr"))
- C. transactionsDf.select(corr(["predError", "value"]).alias("corr")).first()
- D. transactionsDf.select(corr("predError", "value"))
- E. transactionsDf.select(corr(col("predError"), col("value")).alias("corr")).first()
Answer: B
Explanation:
In difficulty, this question is above what you can expect from the exam. What this question wants to teach you, however, is to pay attention to the useful details included in the documentation.
pyspark.sql.functions.corr is not a very common method, but it deals with Spark's data structure in an interesting way.
The command takes two columns over multiple rows and returns a single row - similar to an aggregation function. When examining the documentation (linked below), you will find this code example:
a = range(20)
b = [2 * x for x in range(20)]
df = spark.createDataFrame(zip(a, b), ["a", "b"])
df.agg(corr("a", "b").alias('c')).collect()
[Row(c=1.0)]
See how corr just returns a single row? Once you understand this, you should be suspicious about answers that include first(), since there is no need to just select a single row. A reason to eliminate those answers is that DataFrame.first() returns an object of type Row, but not DataFrame, as requested in the question.
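As a cross-check of the documentation example, the Pearson correlation coefficient can be computed by hand in plain Python. This is only a sketch of the underlying formula, not Spark's implementation; pearson_corr is a hypothetical helper name.

```python
import math


def pearson_corr(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    xs, ys = list(xs), list(ys)
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Same data as in the documentation example: b is exactly twice a.
a = range(20)
b = [2 * x for x in range(20)]
c = pearson_corr(a, b)
print(round(c, 6))
```

Because b is a perfect linear function of a, the result is (up to floating-point rounding) 1.0. Many input rows collapse into a single number, which is why corr behaves like an aggregation.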
transactionsDf.select(corr(col("predError"), col("value")).alias("corr"))
Correct! After calculating the Pearson correlation coefficient, the resulting column is correctly renamed to corr.
transactionsDf.select(corr(predError, value).alias("corr"))
No. In this answer, Python will interpret column names predError and value as variable names.
transactionsDf.select(corr(col("predError"), col("value")).alias("corr")).first()
Incorrect. first() returns a row, not a DataFrame (see above and linked documentation below).
transactionsDf.select(corr("predError", "value"))
Wrong. While this statement returns a DataFrame in the desired shape, the column will have the name corr(predError, value) and not corr.
transactionsDf.select(corr(["predError", "value"]).alias("corr")).first()
False. In addition to first() returning a row, this code block also uses the wrong call structure for command corr, which takes two arguments (the two columns to correlate).
More info:
- pyspark.sql.functions.corr - PySpark 3.1.2 documentation
- pyspark.sql.DataFrame.first - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
NEW QUESTION 87
Which of the following code blocks generally causes a great amount of network traffic?
- A. DataFrame.rdd.map()
- B. DataFrame.count()
- C. DataFrame.collect()
- D. DataFrame.select()
- E. DataFrame.coalesce()
Answer: C
Explanation:
DataFrame.collect() sends all data in a DataFrame from executors to the driver, so this generally causes a great amount of network traffic in comparison to the other options listed.
DataFrame.coalesce() just reduces the number of partitions and generally aims to reduce network traffic in comparison to a full shuffle.
DataFrame.select() is evaluated lazily and, unless followed by an action, does not cause significant network traffic.
DataFrame.rdd.map() is evaluated lazily and therefore does not, by itself, cause great amounts of network traffic.
DataFrame.count() is an action. While it does cause some network traffic, for the same DataFrame, collecting all data in the driver would generally be considered to cause a greater amount of network traffic.
NEW QUESTION 88
The code block shown below should return a single-column DataFrame with a column named consonant_ct that, for each row, shows the number of consonants in column itemName of DataFrame itemsDf. Choose the answer that correctly fills the blanks in the code block to accomplish this.
DataFrame itemsDf:
+------+----------------------------------+-----------------------------+-------------------+
|itemId|itemName                          |attributes                   |supplier           |
+------+----------------------------------+-----------------------------+-------------------+
|1     |Thick Coat for Walking in the Snow|[blue, winter, cozy]         |Sports Company Inc.|
|2     |Elegant Outdoors Summer Dress     |[red, summer, fresh, cooling]|YetiX              |
|3     |Outdoors Backpack                 |[green, summer, travel]      |Sports Company Inc.|
+------+----------------------------------+-----------------------------+-------------------+
Code block:
itemsDf.select(__1__(__2__(__3__(__4__), "a|e|i|o|u|\s", "")).__5__("consonant_ct"))
- A. 1. size
2. regexp_replace
3. lower
4. "itemName"
5. alias
- B. 1. length
2. regexp_replace
3. lower
4. col("itemName")
5. alias
- C. 1. lower
2. regexp_replace
3. length
4. "itemName"
5. alias
- D. 1. size
2. regexp_extract
3. lower
4. col("itemName")
5. alias
- E. 1. length
2. regexp_extract
3. upper
4. col("itemName")
5. as
Answer: B
Explanation:
Correct code block:
itemsDf.select(length(regexp_replace(lower(col("itemName")), "a|e|i|o|u|\s", "")).alias("consonant_ct"))
Returned DataFrame:
+------------+
|consonant_ct|
+------------+
|          19|
|          16|
|          10|
+------------+
This question tries to make you think about the string functions Spark provides and in which order they should be applied. Arguably the most difficult part, the regular expression "a|e|i|o|u|\s", is not a numbered blank. However, if you are not familiar with the string functions, it may be a good idea to review those before the exam.
The size operator and the length operator can easily be confused. size works on arrays, while length works on strings. Luckily, this is something you can read up about in the documentation.
The code block works by first converting all uppercase letters in column itemName into lowercase (the lower() part). Then, it replaces all vowels and whitespace characters with an empty string "" (the regexp_replace() part). Now, only lowercase consonants without spaces remain in the column. Then, per row, the length operator counts these remaining characters. Note that column itemName in itemsDf does not include any numbers or other characters, so we do not need to make any provisions for these. Finally, by using the alias() operator, we rename the resulting column to consonant_ct.
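The same three steps (lowercase, strip vowels and whitespace, count what remains) can be replayed in plain Python to verify the counts shown above. This sketch uses Python's re module instead of Spark's regexp_replace, and consonant_count is a hypothetical helper name; the item names come from the itemsDf sample.

```python
import re


def consonant_count(item_name):
    """Lowercase the name, remove vowels and whitespace, and count the rest."""
    stripped = re.sub(r"a|e|i|o|u|\s", "", item_name.lower())
    return len(stripped)


item_names = [
    "Thick Coat for Walking in the Snow",
    "Elegant Outdoors Summer Dress",
    "Outdoors Backpack",
]
consonant_counts = [consonant_count(name) for name in item_names]
print(consonant_counts)  # [19, 16, 10]
```

Swapping the order, for example calling len before the replacement, would count all characters instead, which is what the answer options with a shuffled function order get wrong.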
More info:
- lower: pyspark.sql.functions.lower - PySpark 3.1.2 documentation
- regexp_replace: pyspark.sql.functions.regexp_replace - PySpark 3.1.2 documentation
- length: pyspark.sql.functions.length - PySpark 3.1.2 documentation
- alias: pyspark.sql.Column.alias - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2
NEW QUESTION 89
Which of the following code blocks returns a copy of DataFrame itemsDf where the column supplier has been renamed to manufacturer?
- A. itemsDf.withColumn("supplier").alias("manufacturer")
- B. itemsDf.withColumnsRenamed("supplier", "manufacturer")
- C. itemsDf.withColumn(["supplier", "manufacturer"])
- D. itemsDf.withColumnRenamed(col("manufacturer"), col("supplier"))
- E. itemsDf.withColumnRenamed("supplier", "manufacturer")
Answer: E
Explanation:
itemsDf.withColumnRenamed("supplier", "manufacturer")
Correct! This uses the relatively trivial DataFrame method withColumnRenamed for renaming column supplier to column manufacturer.
Note that the question asks for "a copy of DataFrame itemsDf". This may be confusing if you are not familiar with Spark yet. RDDs (Resilient Distributed Datasets) are the foundation of Spark DataFrames and are immutable. As such, DataFrames are immutable, too. Any command that changes anything in the DataFrame therefore necessarily returns a copy, or a new version, of it that has the changes applied.
itemsDf.withColumnsRenamed("supplier", "manufacturer")
Incorrect. Spark's DataFrame API does not have a withColumnsRenamed() method.
itemsDf.withColumnRenamed(col("manufacturer"), col("supplier"))
No. Watch out - although the col() method works for many methods of the DataFrame API, withColumnRenamed is not one of them. As outlined in the documentation linked below, withColumnRenamed expects strings.
itemsDf.withColumn(["supplier", "manufacturer"])
Wrong. While DataFrame.withColumn() exists in Spark, it has a different purpose than renaming columns.
withColumn is typically used to add columns to DataFrames, taking the name of the new column as a first, and a Column as a second argument. Learn more via the documentation that is linked below.
itemsDf.withColumn("supplier").alias("manufacturer")
No. While DataFrame.withColumn() exists, it requires 2 arguments. Furthermore, the alias() method on DataFrames would not help the cause of renaming a column much. DataFrame.alias() can be useful in addressing the input of join statements. However, this is far outside of the scope of this question. If you are curious nevertheless, check out the link below.
More info: pyspark.sql.DataFrame.withColumnRenamed - PySpark 3.1.1 documentation, pyspark.sql.DataFrame.withColumn - PySpark 3.1.1 documentation, and pyspark.sql.DataFrame.alias - PySpark 3.1.2 documentation (https://bit.ly/3aSB5tm, https://bit.ly/2Tv4rbE, https://bit.ly/2RbhBd2)
Static notebook | Dynamic notebook: See test 1 (https://flrs.github.io/spark_practice_tests_code/#1/31.html, https://bit.ly/sparkpracticeexams_import_instructions)
NEW QUESTION 90
Which of the following code blocks returns a DataFrame with an added column to DataFrame transactionsDf that shows the unix epoch timestamps in column transactionDate as strings in the format month/day/year in column transactionDateFormatted?
Excerpt of DataFrame transactionsDf:
- A. transactionsDf.withColumnRenamed("transactionDate", "transactionDateFormatted", from_unixtime("transactionDateFormatted", format="MM/dd/yyyy"))
- B. transactionsDf.apply(from_unixtime(format="MM/dd/yyyy")).asColumn("transactionDateFormatted")
- C. transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate"))
- D. transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="dd/MM/yyyy"))
- E. transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="MM/dd/yyyy"))
Answer: E
Explanation:
transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="MM/dd/yyyy"))
Correct. This code block adds a new column with the name transactionDateFormatted to DataFrame transactionsDf, using Spark's from_unixtime method to transform values in column transactionDate into strings, following the format requested in the question.
transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="dd/MM/yyyy"))
No. Although almost correct, this uses the wrong format for the timestamp to date conversion: day/month/year instead of month/day/year.
transactionsDf.withColumnRenamed("transactionDate", "transactionDateFormatted", from_unixtime("transactionDateFormatted", format="MM/dd/yyyy"))
Incorrect. This answer uses wrong syntax. The command DataFrame.withColumnRenamed() is for renaming an existing column and only has two string parameters, specifying the old and the new name of the column.
transactionsDf.apply(from_unixtime(format="MM/dd/yyyy")).asColumn("transactionDateFormatted")
Wrong. Although this answer looks very tempting, it is actually incorrect Spark syntax. In Spark, there is no method DataFrame.apply(). Spark has an apply() method that can be used on grouped data - but this is irrelevant for this question, since we do not deal with grouped data here.
transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate"))
No. Although this is valid Spark syntax, the strings in column transactionDateFormatted would look like 2020-04-26 15:35:32, which is the default format of from_unixtime and not what is asked for in the question.
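The difference between the three format choices can be illustrated with Python's standard library. This is only an analogue, not Spark code: Python's strftime uses %m/%d/%Y where Spark's pattern is MM/dd/yyyy, the epoch value below is a hypothetical example chosen to match the timestamp string quoted above, and UTC is assumed here whereas Spark's from_unixtime uses the session time zone.

```python
from datetime import datetime, timezone

# Hypothetical epoch value; it corresponds to 2020-04-26 15:35:32 in UTC.
epoch_seconds = 1587915332

moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

# Analogue of format="MM/dd/yyyy" (month/day/year, as the question asks).
month_day_year = moment.strftime("%m/%d/%Y")
# Analogue of format="dd/MM/yyyy" (day/month/year, the near-miss answer).
day_month_year = moment.strftime("%d/%m/%Y")
# Analogue of the default pattern used when no format is given.
default_style = moment.strftime("%Y-%m-%d %H:%M:%S")

print(month_day_year)  # 04/26/2020
print(day_month_year)  # 26/04/2020
print(default_style)   # 2020-04-26 15:35:32
```

For dates where the day is 12 or lower, the two custom formats produce strings that look equally plausible, so a swapped pattern is easy to miss when inspecting output by eye.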
More info: pyspark.sql.functions.from_unixtime - PySpark 3.1.1 documentation and pyspark.sql.DataFrame.withColumnRenamed - PySpark 3.1.1 documentation
Static notebook | Dynamic notebook: See test 1
NEW QUESTION 91
Which of the following code blocks returns an exact copy of DataFrame transactionsDf that does not include rows in which column storeId has the value 25?
- A. transactionsDf.drop(transactionsDf.storeId==25)
- B. transactionsDf.where(transactionsDf.storeId!=25)
- C. transactionsDf.remove(transactionsDf.storeId==25)
- D. transactionsDf.select(transactionsDf.storeId!=25)
- E. transactionsDf.filter(transactionsDf.storeId==25)
Answer: B
Explanation:
transactionsDf.where(transactionsDf.storeId!=25)
Correct. DataFrame.where() is an alias for the DataFrame.filter() method. Using this method, it is straightforward to keep only those rows whose value in column storeId is not 25.
transactionsDf.select(transactionsDf.storeId!=25)
Wrong. The select operator allows you to build DataFrames column-wise, but when using it as shown, it does not filter out rows.
transactionsDf.filter(transactionsDf.storeId==25)
Incorrect. Although the filter expression works for filtering rows, the == in the filtering condition is inappropriate. It should be != instead.
transactionsDf.drop(transactionsDf.storeId==25)
No. DataFrame.drop() is used to remove specific columns, but not rows, from the DataFrame.
transactionsDf.remove(transactionsDf.storeId==25)
False. There is no DataFrame.remove() operator in PySpark.
More info: pyspark.sql.DataFrame.where - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
NEW QUESTION 92
Which of the following code blocks writes DataFrame itemsDf to disk at storage location filePath, making sure to substitute any existing data at that location?
- A. itemsDf.write(filePath, mode="overwrite")
- B. itemsDf.write.mode("overwrite").path(filePath)
- C. itemsDf.write.mode("overwrite").parquet(filePath)
- D. itemsDf.write().parquet(filePath, mode="overwrite")
- E. itemsDf.write.option("parquet").mode("overwrite").path(filePath)
Answer: C
Explanation:
itemsDf.write.mode("overwrite").parquet(filePath)
Correct! itemsDf.write returns a pyspark.sql.DataFrameWriter instance whose overwriting behavior can be modified via the mode setting or by passing mode="overwrite" to the parquet() command.
Although the parquet format is not prescribed for solving this question, parquet() is a valid operator to initiate Spark to write the data to disk.
itemsDf.write.mode("overwrite").path(filePath)
No. A pyspark.sql.DataFrameWriter instance does not have a path() method.
itemsDf.write.option("parquet").mode("overwrite").path(filePath)
Incorrect, see above. In addition, a file format cannot be passed via the option() method.
itemsDf.write(filePath, mode="overwrite")
Wrong. Unfortunately, this is too simple. You need to obtain access to a DataFrameWriter for the DataFrame through calling itemsDf.write upon which you can apply further methods to control how Spark data should be written to disk. You cannot, however, pass arguments to itemsDf.write directly.
itemsDf.write().parquet(filePath, mode="overwrite")
False. See above.
More info: pyspark.sql.DataFrameWriter.parquet - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
NEW QUESTION 93
Which of the following statements about storage levels is incorrect?
- A. The cache operator on DataFrames is evaluated like a transformation.
- B. DISK_ONLY will not use the worker node's memory.
- C. MEMORY_AND_DISK replicates cached DataFrames both on memory and disk.
- D. Caching can be undone using the DataFrame.unpersist() operator.
- E. In client mode, DataFrames cached with the MEMORY_ONLY_2 level will not be stored in the edge node's memory.
Answer: C
Explanation:
MEMORY_AND_DISK replicates cached DataFrames both on memory and disk.
Correct, this statement is wrong. Spark prioritizes storage in memory, and will only store data on disk that does not fit into memory.
DISK_ONLY will not use the worker node's memory.
Wrong, this statement is correct. DISK_ONLY keeps data only on the worker node's disk, but not in memory.
In client mode, DataFrames cached with the MEMORY_ONLY_2 level will not be stored in the edge node's memory.
Wrong, this statement is correct. In fact, Spark does not have a provision to cache DataFrames in the driver (which sits on the edge node in client mode). Spark caches DataFrames in the executors' memory.
Caching can be undone using the DataFrame.unpersist() operator.
Wrong, this statement is correct. Caching, as achieved via the DataFrame.cache() or DataFrame.persist() operators can be undone using the DataFrame.unpersist() operator. This operator will remove all of its parts from the executors' memory and disk.
The cache operator on DataFrames is evaluated like a transformation.
Wrong, this statement is correct. DataFrame.cache() is evaluated like a transformation: Through lazy evaluation. This means that after calling DataFrame.cache() the command will not have any effect until you call a subsequent action, like DataFrame.cache().count().
More info: pyspark.sql.DataFrame.unpersist - PySpark 3.1.2 documentation
NEW QUESTION 94
Which of the following code blocks returns a one-column DataFrame for which every row contains an array of all integer numbers from 0 up to and including the number given in column predError of DataFrame transactionsDf, and null if predError is null?
Sample of DataFrame transactionsDf:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+
- A.
def count_to_target(target):
    result = list(range(target))
    return result

count_to_target_udf = udf(count_to_target, ArrayType(IntegerType()))

df = transactionsDf.select(count_to_target_udf('predError'))
- B.
def count_to_target(target):
    if target is None:
        return

    result = list(range(target))
    return result

count_to_target_udf = udf(count_to_target)

transactionsDf.select(count_to_target_udf('predError'))
- C.
def count_to_target(target):
    if target is None:
        return

    result = list(range(target))
    return result

count_to_target_udf = udf(count_to_target, ArrayType(IntegerType()))

transactionsDf.select(count_to_target_udf('predError'))
- D.
def count_to_target(target):
    if target is None:
        return

    result = [range(target)]
    return result

count_to_target_udf = udf(count_to_target, ArrayType[IntegerType])

transactionsDf.select(count_to_target_udf(col('predError')))
- E.
def count_to_target(target):
    if target is None:
        return

    result = list(range(target))
    return result

transactionsDf.select(count_to_target(col('predError')))
Answer: C
Explanation:
Correct code block:
def count_to_target(target):
    if target is None:
        return
    result = list(range(target))
    return result
count_to_target_udf = udf(count_to_target, ArrayType(IntegerType()))
transactionsDf.select(count_to_target_udf('predError'))
Output of correct code block:
+--------------------------+
|count_to_target(predError)|
+--------------------------+
|                 [0, 1, 2]|
|        [0, 1, 2, 3, 4, 5]|
|                 [0, 1, 2]|
|                      null|
|                      null|
|                 [0, 1, 2]|
+--------------------------+
This question is not exactly easy. You need to be familiar with the syntax around UDFs (user-defined functions). Specifically, in this question it is important to pass the correct types to the udf method - returning an array of a specific type rather than just a single type means you need to think harder about type implications than usual.
Remember that in Spark, you always pass types in an instantiated way like ArrayType(IntegerType()), not like ArrayType(IntegerType). The parentheses () are the key here - make sure you do not forget those.
You should also pay attention that you actually pass the UDF count_to_target_udf, and not the Python method count_to_target to the select() operator.
Finally, null values are always a tricky case with UDFs. So, take care that the code can handle them correctly.
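The null handling can be checked without Spark by calling the plain Python function directly, since Spark passes null values to a Python UDF as None. The sketch below contrasts the guarded function with an unguarded variant like the one in answer A; count_to_target_unsafe is a hypothetical name introduced only for this comparison, and the input list mirrors column predError from the sample.

```python
def count_to_target(target):
    """Guarded version: returns None for a null input, else [0, ..., target - 1]."""
    if target is None:
        return None
    return list(range(target))


def count_to_target_unsafe(target):
    """Unguarded version: range(None) raises a TypeError."""
    return list(range(target))


# Values of column predError in the sample, with None standing in for null.
pred_errors = [3, 6, 3, None, None, 3]
arrays = [count_to_target(value) for value in pred_errors]
print(arrays)

# The unguarded variant fails as soon as it meets a null value.
try:
    count_to_target_unsafe(None)
    unsafe_failed = False
except TypeError:
    unsafe_failed = True
print(unsafe_failed)
```

Note that range(target) stops before target itself, which is consistent with the output of the correct code block shown above.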
More info: How to Turn Python Functions into PySpark Functions (UDF) - Chang Hsin Lee - Committing my thoughts to words.
Static notebook | Dynamic notebook: See test 3
NEW QUESTION 95
Which of the following code blocks returns a single-column DataFrame showing the number of words in column supplier of DataFrame itemsDf?
Sample of DataFrame itemsDf:
+------+-----------------------------+-------------------+
|itemId|attributes                   |supplier           |
+------+-----------------------------+-------------------+
|1     |[blue, winter, cozy]         |Sports Company Inc.|
|2     |[red, summer, fresh, cooling]|YetiX              |
|3     |[green, summer, travel]      |Sports Company Inc.|
+------+-----------------------------+-------------------+
- A. spark.select(size(split(col(supplier), " ")))
- B. itemsDf.select(size(split("supplier", " ")))
- C. itemsDf.select(word_count("supplier"))
- D. itemsDf.split("supplier", " ").size()
- E. itemsDf.split("supplier", " ").count()
Answer: B
Explanation:
Output of correct code block:
+----------------------------+
|size(split(supplier,  , -1))|
+----------------------------+
|                           3|
|                           1|
|                           3|
+----------------------------+
This question shows a typical use case for the split command: Splitting a string into words. An additional difficulty is that you are asked to count the words. Although it is tempting to use the count method here, the size method (as in: size of an array) is actually the correct one to use. Familiarize yourself with the split and the size methods using the linked documentation below.
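The distinction between size and count can be mirrored in plain Python: the number of words is the length of each row's split result, not the number of rows. This is an analogue only; Python's str.split takes a literal separator, whereas Spark's split interprets its second argument as a regular expression. The supplier values come from the sample above.

```python
suppliers = ["Sports Company Inc.", "YetiX", "Sports Company Inc."]

# Per-row word count: split on a single space, then take the length of the
# resulting list (the role played by size(split("supplier", " ")) in Spark).
word_counts = [len(supplier.split(" ")) for supplier in suppliers]
print(word_counts)  # [3, 1, 3]

# Counting rows is a different operation and yields a single number.
row_count = len(suppliers)
print(row_count)  # 3
```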
More info:
- Split method: pyspark.sql.functions.split - PySpark 3.1.2 documentation
- Size method: pyspark.sql.functions.size - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2
NEW QUESTION 96
Which of the following code blocks reads the parquet file stored at filePath into DataFrame itemsDf, using a valid schema for the sample of itemsDf shown below?
Sample of itemsDf:
+------+-----------------------------+-------------------+
|itemId|attributes                   |supplier           |
+------+-----------------------------+-------------------+
|1     |[blue, winter, cozy]         |Sports Company Inc.|
|2     |[red, summer, fresh, cooling]|YetiX              |
|3     |[green, summer, travel]      |Sports Company Inc.|
+------+-----------------------------+-------------------+
- A.
itemsDfSchema = StructType([
    StructField("itemId", IntegerType()),
    StructField("attributes", StringType()),
    StructField("supplier", StringType())])

itemsDf = spark.read.schema(itemsDfSchema).parquet(filePath)
- B.
itemsDf = spark.read.schema('itemId integer, attributes <string>, supplier string').parquet(filePath)
- C.
itemsDfSchema = StructType([
    StructField("itemId", IntegerType),
    StructField("attributes", ArrayType(StringType)),
    StructField("supplier", StringType)])

itemsDf = spark.read.schema(itemsDfSchema).parquet(filePath)
- D.
itemsDfSchema = StructType([
    StructField("itemId", IntegerType()),
    StructField("attributes", ArrayType([StringType()])),
    StructField("supplier", StringType())])

itemsDf = spark.read(schema=itemsDfSchema).parquet(filePath)
- E.
itemsDfSchema = StructType([
    StructField("itemId", IntegerType()),
    StructField("attributes", ArrayType(StringType())),
    StructField("supplier", StringType())])

itemsDf = spark.read.schema(itemsDfSchema).parquet(filePath)
Answer: E
Explanation:
The challenge in this question comes from there being an array variable in the schema. In addition, you should know how to pass a schema to the DataFrameReader that is invoked by spark.read.
The correct way to define an array of strings in a schema is through ArrayType(StringType()). A schema can be passed to the DataFrameReader by simply appending schema(structType) to spark.read. Alternatively, you can also define a schema as a string. For example, for the schema of itemsDf, the following string would make sense: itemId integer, attributes array<string>, supplier string.
A thing to keep in mind is that in schema definitions, you always need to instantiate the types, like so: StringType(). Just using StringType does not work in PySpark and will fail.
Another concern with schemas is whether columns should be nullable, that is, allowed to have null values. In the case at hand, this is not a concern however, since the question just asks for a "valid" schema. Both non-nullable and nullable column schemas would be valid here, since no null value appears in the DataFrame sample.
More info: Learning Spark, 2nd Edition, Chapter 3
Static notebook | Dynamic notebook: See test 3
NEW QUESTION 97
The code block shown below should return a DataFrame with columns transactionId, predError, value, and f from DataFrame transactionsDf. Choose the answer that correctly fills the blanks in the code block to accomplish this.
transactionsDf.__1__(__2__)
- A. 1. select
2. ["transactionId", "predError", "value", "f"]
- B. 1. where
2. col("transactionId"), col("predError"), col("value"), col("f")
- C. 1. select
2. "transactionId, predError, value, f"
- D. 1. filter
2. "transactionId", "predError", "value", "f"
- E. 1. select
2. col(["transactionId", "predError", "value", "f"])
Answer: A
Explanation:
Correct code block:
transactionsDf.select(["transactionId", "predError", "value", "f"])
DataFrame.select() returns specific columns from the DataFrame and accepts either individual column names or a single list of column names as its argument.
Thus, this is the correct choice here. The option using col(["transactionId", "predError", "value", "f"]) is invalid, since inside col(), one can only pass a single column name, not a list. Likewise, all columns being specified in a single string like "transactionId, predError, value, f" is not valid syntax.
filter and where filter rows based on conditions; they do not control which columns to return.
Static notebook | Dynamic notebook: See test 2
NEW QUESTION 98
Which of the following code blocks returns a DataFrame that has all columns of DataFrame transactionsDf and an additional column predErrorSquared which is the squared value of column predError in DataFrame transactionsDf?
- A. transactionsDf.withColumn("predErrorSquared", pow(col("predError"), lit(2)))
- B. transactionsDf.withColumn("predError", pow(col("predErrorSquared"), 2))
- C. transactionsDf.withColumn("predErrorSquared", "predError"**2)
- D. transactionsDf.withColumn("predErrorSquared", pow(predError, lit(2)))
- E. transactionsDf.withColumnRenamed("predErrorSquared", pow(predError, 2))
Answer: A
Explanation:
Explanation
While only one of these code blocks works, the DataFrame API is pretty flexible when it comes to accepting columns into the pow() method. The following code blocks would also work:
transactionsDf.withColumn("predErrorSquared", pow("predError", 2))
transactionsDf.withColumn("predErrorSquared", pow("predError", lit(2)))
Static notebook | Dynamic notebook: See test 1 (https://flrs.github.io/spark_practice_tests_code/#1/26.html, https://bit.ly/sparkpracticeexams_import_instructions)
NEW QUESTION 99
The code block displayed below contains an error. The code block should return a DataFrame in which column predErrorAdded contains the results of Python function add_2_if_geq_3 as applied to numeric and nullable column predError in DataFrame transactionsDf. Find the error.
Code block:
def add_2_if_geq_3(x):
    if x is None:
        return x
    elif x >= 3:
        return x+2
    return x

add_2_if_geq_3_udf = udf(add_2_if_geq_3)

transactionsDf.withColumnRenamed("predErrorAdded", add_2_if_geq_3_udf(col("predError")))
- A. The Python function is unable to handle null values, resulting in the code block crashing on execution.
- B. Instead of col("predError"), the actual DataFrame with the column needs to be passed, like so transactionsDf.predError.
- C. The udf() method does not declare a return type.
- D. UDFs are only available through the SQL API, but not in the Python API as shown in the code block.
- E. The operator used to add the column does not add column predErrorAdded to the DataFrame.
Answer: E
Explanation:
Correct code block:
from pyspark.sql.functions import col, udf

def add_2_if_geq_3(x):
    # Pass null values through unchanged.
    if x is None:
        return x
    elif x >= 3:
        return x+2
    return x

add_2_if_geq_3_udf = udf(add_2_if_geq_3)
transactionsDf.withColumn("predErrorAdded", add_2_if_geq_3_udf(col("predError"))).show()

Instead of withColumnRenamed, you should use the withColumn operator.
The udf() method does not declare a return type.
It is fine that the udf() method does not declare a return type, since this is not a required argument. However, the default return type is StringType, which may not be ideal for numeric, nullable data. The code will nevertheless run without a specified return type.
The Python function is unable to handle null values, resulting in the code block crashing on execution.
The Python function is able to handle null values; this is what the statement if x is None does.
UDFs are only available through the SQL API, but not in the Python API as shown in the code block.
No, they are available through the Python API. The code in the code block that concerns UDFs is correct.
Instead of col("predError"), the actual DataFrame with the column needs to be passed, like so transactionsDf.predError.
You may choose to use the transactionsDf.predError syntax, but the col("predError") syntax is fine.
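The null handling discussed above can be checked without Spark by calling the Python function directly. This is only a sketch: the sample values are made up for illustration, and None stands in for a null in column predError.

```python
def add_2_if_geq_3(x):
    # Nulls arrive as None and are passed through unchanged.
    if x is None:
        return x
    elif x >= 3:
        return x + 2
    return x


# Hypothetical sample values, including a null.
sample = [None, 1.5, 3.0, 7]
results = [add_2_if_geq_3(value) for value in sample]
print(results)
```

Because the None check comes first, the comparison x >= 3 is never evaluated for a null, which is why the function does not crash on nullable data.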
NEW QUESTION 100
The code block displayed below contains an error. The code block is intended to perform an outer join of DataFrames transactionsDf and itemsDf on columns productId and itemId, respectively.
Find the error.
Code block:
transactionsDf.join(itemsDf, [itemsDf.itemId, transactionsDf.productId], "outer")
- A. The term [itemsDf.itemId, transactionsDf.productId] should be replaced by itemsDf.col("itemId") == transactionsDf.col("productId").
- B. The "outer" argument should be eliminated from the call and join should be replaced by joinOuter.
- C. The join type needs to be appended to the join() operator, like join().outer() instead of listing it as the last argument inside the join() call.
- D. The term [itemsDf.itemId, transactionsDf.productId] should be replaced by itemsDf.itemId == transactionsDf.productId.
- E. The "outer" argument should be eliminated, since "outer" is the default join type.
Answer: D
Explanation:
Correct code block:
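transactionsDf.join(itemsDf, itemsDf.itemId == transactionsDf.productId, "outer")
The join condition should be a single Column expression that equates the two key columns, not a list containing the two columns. The "outer" argument has to stay, because the default join type is inner.
Static notebook | Dynamic notebook: See test 1 (https://flrs.github.io/spark_practice_tests_code/#1/33.html, https://bit.ly/sparkpracticeexams_import_instructions)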
NEW QUESTION 101
Which of the following code blocks performs an inner join between DataFrame itemsDf and DataFrame transactionsDf, using columns itemId and transactionId as join keys, respectively?
- A. itemsDf.join(transactionsDf, "itemsDf.itemId == transactionsDf.transactionId", "inner")
- B. itemsDf.join(transactionsDf, "inner", itemsDf.itemId == transactionsDf.transactionId)
- C. itemsDf.join(transactionsDf, itemsDf.itemId == transactionsDf.transactionId, "inner")
- D. itemsDf.join(transactionsDf, col(itemsDf.itemId) == col(transactionsDf.transactionId))
- E. itemsDf.join(transactionsDf, itemId == transactionId)
Answer: C
Explanation:
More info: pyspark.sql.DataFrame.join - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2
NEW QUESTION 102
Which of the following code blocks selects all rows from DataFrame transactionsDf in which column productId is zero or smaller, or equal to 3?
- A. transactionsDf.where("productId"=3).or("productId"<1))
- B. transactionsDf.filter(productId==3 or productId<1)
- C. transactionsDf.filter((col("productId")==3) or (col("productId")<1))
- D. transactionsDf.filter(col("productId")==3 | col("productId")<1)
- E. transactionsDf.filter((col("productId")==3) | (col("productId")<1))
Answer: E
Explanation:
This question targets your knowledge about how to chain filtering conditions. Each filtering condition should be in parentheses. The correct operator for "or" is the pipe character (|) and not the word or. Another operator of concern is the equality operator. For the purpose of comparison, equality is expressed as two equal signs (==).
Static notebook | Dynamic notebook: See test 2
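Why the parentheses and the pipe character matter comes down to Python's operator rules: | binds more tightly than == and <, and the keyword or forces a conversion to bool. The sketch below uses a hypothetical toy class Expr as a stand-in for a column expression; it is not PySpark code and only assumes that such an expression refuses conversion to bool.

```python
class Expr:
    """Toy stand-in for a column expression; records how operators combine."""

    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        return Expr(f"({self.text} = {_show(other)})")

    def __lt__(self, other):
        return Expr(f"({self.text} < {_show(other)})")

    def __or__(self, other):
        return Expr(f"({self.text} OR {_show(other)})")

    def __ror__(self, other):
        return Expr(f"({_show(other)} OR {self.text})")

    def __bool__(self):
        # Assumption: an expression has no single truth value.
        raise ValueError("cannot convert an expression to bool; use | instead of or")


def _show(value):
    return value.text if isinstance(value, Expr) else repr(value)


def raises_value_error(build):
    try:
        build()
    except ValueError:
        return True
    return False


product_id = Expr("productId")

# Parenthesised conditions combined with | build one expression.
good = (product_id == 3) | (product_id < 1)
print(good.text)

# Without parentheses, Python evaluates 3 | product_id first and then
# chains the comparisons, which needs a bool and therefore fails.
unparenthesized_fails = raises_value_error(lambda: product_id == 3 | product_id < 1)

# The keyword or also asks for the truth value of its left operand.
or_keyword_fails = raises_value_error(lambda: (product_id == 3) or (product_id < 1))
print(unparenthesized_fails, or_keyword_fails)
```

In this toy model only the fully parenthesised form with | produces a combined condition, which mirrors the reasoning behind the correct answer.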
NEW QUESTION 103
Which of the following code blocks returns a copy of DataFrame transactionsDf where the column storeId has been converted to string type?
- A. transactionsDf.withColumn("storeId", col("storeId", "string"))
- B. transactionsDf.withColumn("storeId", col("storeId").convert("string"))
- C. transactionsDf.withColumn("storeId", convert("storeId").as("string"))
- D. transactionsDf.withColumn("storeId", col("storeId").cast("string"))
- E. transactionsDf.withColumn("storeId", convert("storeId", "string"))
Answer: D
Explanation:
This question tests your knowledge of the cast syntax. cast is a method of the Column class. It is worth noting that one could also convert a column type using the Column.astype() method, which is just an alias for cast.
Find more info in the documentation linked below.
More info: pyspark.sql.Column.cast - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2
NEW QUESTION 104
The code block shown below should read all files with the file ending .png in directory path into Spark.
Choose the answer that correctly fills the blanks in the code block to accomplish this.
spark.__1__.__2__(__3__).option(__4__, "*.png").__5__(path)
- A. 1. open
2. format
3. "image"
4. "fileType"
5. open
- B. 1. read()
2. format
3. "binaryFile"
4. "recursiveFileLookup"
5. load
- C. 1. open
2. as
3. "binaryFile"
4. "pathGlobFilter"
5. load
- D. 1. read
2. format
3. binaryFile
4. pathGlobFilter
5. load
- E. 1. read
2. format
3. "binaryFile"
4. "pathGlobFilter"
5. load
Answer: E
Explanation:
Correct code block:
spark.read.format("binaryFile").option("pathGlobFilter", "*.png").load(path)
Spark can deal with binary files, like images. Using the binaryFile format specification in the SparkSession's read API is the way to read in those files. Remember that, to access the read API, you need to start the command with spark.read (an attribute, so without parentheses). The pathGlobFilter option is a great way to filter files by name (and ending). Finally, the path can be specified using the load operator; the open operator shown in some of the answers does not exist.
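To get a feel for what a glob pattern like "*.png" selects, the following plain-Python sketch applies the same pattern with the standard library's fnmatch module. This is an analogy only: Spark's glob matching is not implemented with fnmatch, and the file names are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical directory listing.
file_names = ["cat.png", "dog.jpg", "notes.txt", "logo.png"]

# Keep only names that match the glob pattern, case-sensitively.
matched = [name for name in file_names if fnmatchcase(name, "*.png")]
print(matched)
```

For a simple pattern such as "*.png", the effect is the same idea as the pathGlobFilter option: files whose names do not end in .png are skipped.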
NEW QUESTION 105
Which of the following code blocks reorders the values inside the arrays in column attributes of DataFrame itemsDf from last to first one in the alphabet?
+------+-----------------------------+-------------------+
|itemId|attributes                   |supplier           |
+------+-----------------------------+-------------------+
|1     |[blue, winter, cozy]         |Sports Company Inc.|
|2     |[red, summer, fresh, cooling]|YetiX              |
|3     |[green, summer, travel]      |Sports Company Inc.|
+------+-----------------------------+-------------------+
- A. itemsDf.withColumn('attributes', sort_array(desc('attributes')))
- B. itemsDf.withColumn('attributes', sort_array(col('attributes').desc()))
- C. itemsDf.select(sort_array("attributes"))
- D. itemsDf.withColumn("attributes", sort_array("attributes", asc=False))
- E. itemsDf.withColumn('attributes', sort(col('attributes'), asc=False))
Answer: D
Explanation:
Output of correct code block:
+------+-----------------------------+-------------------+
|itemId|attributes                   |supplier           |
+------+-----------------------------+-------------------+
|1     |[winter, cozy, blue]         |Sports Company Inc.|
|2     |[summer, red, fresh, cooling]|YetiX              |
|3     |[travel, summer, green]      |Sports Company Inc.|
+------+-----------------------------+-------------------+
It can be confusing to differentiate between the different sorting functions in PySpark. In this case, a particularity of sort_array has to be considered: the sort direction is given by its second argument (asc), not by the desc method. This is described in the documentation (link below). To solve this question, you also need to understand the difference between sort and sort_array. With sort, you cannot sort values inside arrays. Moreover, sort is a method of DataFrame, while sort_array is a function in pyspark.sql.functions.
More info: pyspark.sql.functions.sort_array - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2
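The expected output above can be cross-checked in plain Python, where sorted(..., reverse=True) plays the role that sort_array with asc=False plays in Spark. This is a sketch outside of Spark; the dictionary below simply copies the attributes column from the sample DataFrame.

```python
# Values of column attributes, keyed by itemId, copied from the sample.
attributes_by_item = {
    1: ["blue", "winter", "cozy"],
    2: ["red", "summer", "fresh", "cooling"],
    3: ["green", "summer", "travel"],
}

# Sort each array from last to first in the alphabet.
descending = {
    item_id: sorted(values, reverse=True)
    for item_id, values in attributes_by_item.items()
}

for item_id, values in descending.items():
    print(item_id, values)
```

The sorting happens inside each array, row by row; the order of the rows themselves is untouched, which is the difference from sorting a DataFrame.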
NEW QUESTION 106
......
What is the cost of the Databricks Associate Developer Apache Spark Exam?
The cost of the Databricks Associate Developer Apache Spark Exam is 200 USD per attempt.
Associate-Developer-Apache-Spark Exam Dumps - 100% Marks In Associate-Developer-Apache-Spark Exam: https://www.practicetorrent.com/Associate-Developer-Apache-Spark-practice-exam-torrent.html
Verified Associate-Developer-Apache-Spark Exam Questions Certain Success: https://drive.google.com/open?id=1hm5WlLpd3BcGRThNJ5tlVVwNI5dJnKOz