Introduction
When to use cache or persist?
When you are creating multiple dataframes that depend on the same base dataframe(s), you can cache/persist the base dataframe(s) so that each derived dataframe reuses the stored result instead of recomputing it from scratch.
Here's a minimal sketch of the pattern (the parquet path and column names are made up for illustration):
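```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cache-example").getOrCreate()

# Hypothetical base dataframe that is expensive to compute
base_df = (
    spark.read.parquet("/path/to/events")  # made-up path
    .filter(F.col("status") == "ok")
)

# Cache it, since several dataframes below are derived from it.
# cache() is lazy - the data is materialized on the first action.
base_df.cache()

daily_counts = base_df.groupBy("date").count()
user_counts = base_df.groupBy("user_id").count()

# Both actions reuse the cached base_df instead of re-reading and re-filtering
daily_counts.show()
user_counts.show()

base_df.unpersist()  # release the storage when done
```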
cache() - Uses default storage level
The default storage levels are:
- RDD - MEMORY_ONLY
- Dataset - MEMORY_AND_DISK
If we want to use a storage level other than the default, we have to use persist()
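A minimal sketch of the difference, assuming an active SparkSession named spark:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)

# persist() accepts an explicit storage level; cache() does not
df.persist(StorageLevel.DISK_ONLY)
df.count()  # persisting is lazy, so an action is needed to materialize it

print(df.storageLevel)  # shows the level actually in effect
df.unpersist()
```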
Deciding storage level
Available storage levels
- NONE - No persistence
- OFF_HEAP - I'm not sure if we should really use it. Stackoverflow discussion
- MEMORY_ONLY - Store the dataframe deserialized in the JVM's RAM (RAM allocated in spark.executor.memory and spark.driver.memory)
- MEMORY_ONLY_2 - Save the dataframe in memory of the current node, and also save a copy on a 2nd node
- MEMORY_AND_DISK - Store the deserialized dataframe in the JVM's memory. If it is too large for memory, store the excess partitions on disk.
- MEMORY_AND_DISK_2 - Just like MEMORY_AND_DISK, but also store a copy on a 2nd node
- MEMORY_AND_DISK_DESER - Like MEMORY_AND_DISK, but the data is kept deserialized both in memory and on disk
- DISK_ONLY - Store only on disk
- DISK_ONLY_2 - Just like DISK_ONLY, but also store a copy on a 2nd node's disk
- DISK_ONLY_3 - Just like DISK_ONLY, but also store copies on a 2nd and 3rd node's disks
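These levels are exposed as constants on pyspark.StorageLevel. A small sketch of what they look like (the custom level at the end is purely for illustration):

```python
from pyspark import StorageLevel

# Replicated and disk-only variants from the list above
print(StorageLevel.MEMORY_ONLY)
print(StorageLevel.MEMORY_AND_DISK_2)
print(StorageLevel.DISK_ONLY_3)

# Each level is a combination of
# (useDisk, useMemory, useOffHeap, deserialized, replication),
# so a custom one can be built with the same constructor
two_replica_disk = StorageLevel(True, False, False, False, 2)
```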
Documentation link - pyspark.StorageLevel
Advice from the documentation
I'm copy-pasting from the documentation:
Which Storage Level to Choose?
Spark’s storage levels are meant to provide different trade-offs between memory usage and CPU efficiency. We recommend going through the following process to select one:
- If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.
- If not, try using MEMORY_ONLY_SER and selecting a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access. (Java and Scala)
- Don't spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition may be as fast as reading it from disk.
- Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition.