Introduction
When to use cache or persist?
When you are creating multiple dataframes that depend on the same base dataframe(s), you can cache/persist the base dataframe(s) so that each derived dataframe reuses the stored result instead of recomputing it from scratch.
Here's a minimal sketch of the pattern (the parquet path and column names are made up for illustration):
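```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cache-example").getOrCreate()

# Hypothetical base dataframe that is expensive to compute
base_df = (
    spark.read.parquet("/path/to/events")  # made-up path
    .filter(F.col("status") == "ok")
)

# Cache it, since several dataframes below are derived from it.
# cache() is lazy - the data is materialized on the first action.
base_df.cache()

daily_counts = base_df.groupBy("date").count()
user_counts = base_df.groupBy("user_id").count()

# Both actions reuse the cached base_df instead of re-reading and re-filtering
daily_counts.show()
user_counts.show()

base_df.unpersist()  # release the storage when done
```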
cache() - Uses default storage level
The default storage levels are:
- RDD - MEMORY_ONLY
- Dataset - MEMORY_AND_DISK
If we want to use a storage level other than the default, we have to use persist()
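A minimal sketch of the difference, assuming an active SparkSession named spark:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)

# persist() accepts an explicit storage level; cache() does not
df.persist(StorageLevel.DISK_ONLY)
df.count()  # persisting is lazy, so an action is needed to materialize it

print(df.storageLevel)  # shows the level actually in effect
df.unpersist()
```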
Deciding storage level
Available storage levels
- NONE - No persistence
- OFF_HEAP - I'm not sure if we should really use it. Stackoverflow discussion
- MEMORY_ONLY - Store the dataframe deserialized in the JVM's RAM (RAM allocated in spark.executor.memory and spark.driver.memory)
- MEMORY_ONLY_2 - Save the dataframe in memory of the current node, and also save a copy on a 2nd node
- MEMORY_AND_DISK - Store the deserialized dataframe in the JVM's memory. If it is too large for memory, store the excess partitions on disk.
- MEMORY_AND_DISK_2 - Just like MEMORY_AND_DISK, but also store a copy on a 2nd node
- MEMORY_AND_DISK_DESER - Like MEMORY_AND_DISK, but the data is kept deserialized both in memory and on disk
- DISK_ONLY - Store only on disk
- DISK_ONLY_2 - Just like DISK_ONLY, but also store a copy on a 2nd node's disk
- DISK_ONLY_3 - Just like DISK_ONLY, but also store copies on a 2nd and 3rd node's disks
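These levels are exposed as constants on pyspark.StorageLevel. A small sketch of what they look like (the custom level at the end is purely for illustration):

```python
from pyspark import StorageLevel

# Replicated and disk-only variants from the list above
print(StorageLevel.MEMORY_ONLY)
print(StorageLevel.MEMORY_AND_DISK_2)
print(StorageLevel.DISK_ONLY_3)

# Each level is a combination of
# (useDisk, useMemory, useOffHeap, deserialized, replication),
# so a custom one can be built with the same constructor
two_replica_disk = StorageLevel(True, False, False, False, 2)
```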
Documentation link - pyspark.StorageLevel
Advice from the documentation
I'm copy-pasting from the documentation:
Which Storage Level to Choose?
Spark’s storage levels are meant to provide different trade-offs between memory usage and CPU efficiency. We recommend going through the following process to select one:
- If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.
- If not, try using MEMORY_ONLY_SER and selecting a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access. (Java and Scala)
- Don't spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition may be as fast as reading it from disk.
- Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition.