Cache and persist - work in progress



Introduction

When to use cache or persist?

When you create multiple DataFrames that depend on the same base DataFrame(s), you can cache/persist the base DataFrame(s) so Spark does not recompute them for every downstream action, which speeds things up.

<Example is a work in progress>
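Until that example is written up, here is a minimal sketch of the pattern. The input path and column names below are made up purely for illustration; the point is caching a base DataFrame that several derived DataFrames reuse:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("cache-example").getOrCreate()

    # Hypothetical input and columns, purely for illustration.
    base_df = (
        spark.read.parquet("/data/events.parquet")
        .filter(F.col("event_date") >= "2023-01-01")
    )

    # Several DataFrames below are derived from base_df, so cache it once.
    base_df.cache()

    # Without the cache, each of these actions would re-read and re-filter the input.
    daily_counts = base_df.groupBy("event_date").count()
    user_totals = base_df.groupBy("user_id").agg(F.sum("amount").alias("total"))

    daily_counts.show()  # the first action materializes the cache
    user_totals.show()   # this one reuses the cached base_df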

cache() - Uses default storage level

The default storage levels are:

  • RDD - MEMORY_ONLY
  • Dataset - MEMORY_AND_DISK
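A quick way to check this yourself (a sketch, assuming an existing SparkSession named spark):

    rdd = spark.sparkContext.parallelize(range(100))
    rdd.cache()
    print(rdd.getStorageLevel())  # RDDs default to MEMORY_ONLY

    # The DataFrame default has shifted slightly across Spark versions
    # (MEMORY_AND_DISK vs MEMORY_AND_DISK_DESER), so just print what you get.
    df = spark.range(100)
    df.cache()
    print(df.storageLevel)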

If we want to use a storage level other than the default, we have to use persist() and pass the desired StorageLevel explicitly.
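For example (a sketch; df is assumed to be an existing DataFrame):

    from pyspark import StorageLevel

    df.persist(StorageLevel.DISK_ONLY)  # explicit, non-default storage level
    df.count()                          # an action materializes the persisted data
    print(df.storageLevel)

    df.unpersist()  # release the storage when the DataFrame is no longer needed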

Deciding on a storage level

Available storage levels

  1. NONE - No persistence
  2. OFF_HEAP - Store the serialized data in off-heap memory (off-heap memory has to be enabled for this). I'm not sure if we should really use it. Stack Overflow discussion
  3. MEMORY_ONLY - Store the DataFrame as deserialized objects in the JVM's RAM (the memory allocated via spark.executor.memory and spark.driver.memory)
  4. MEMORY_ONLY_2 - Same as MEMORY_ONLY, but each partition is also replicated on a second node
  5. MEMORY_AND_DISK - Store the deserialized DataFrame in the JVM's memory; partitions that don't fit in memory are spilled to disk and read from there when needed
  6. MEMORY_AND_DISK_2 - Same as MEMORY_AND_DISK, but each partition is also replicated on a second node
  7. MEMORY_AND_DISK_DESER - Same as MEMORY_AND_DISK, but the in-memory data is stored deserialized
  8. DISK_ONLY - Store the partitions only on disk
  9. DISK_ONLY_2 - Same as DISK_ONLY, but each partition is also replicated on a second node's disk
  10. DISK_ONLY_3 - Same as DISK_ONLY, but each partition is also replicated on a second and a third node's disks
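Each of these constants is just a pyspark StorageLevel with a particular combination of (useDisk, useMemory, useOffHeap, deserialized, replication) flags; printing a few of them makes the differences visible. A small sketch (the exact printed form depends on the PySpark version):

    from pyspark import StorageLevel

    for name in ("MEMORY_ONLY", "MEMORY_AND_DISK",
                 "MEMORY_AND_DISK_DESER", "DISK_ONLY_2"):
        # Each constant prints its flag combination; e.g. the _2 variants
        # differ only in having a replication factor of 2.
        print(name, getattr(StorageLevel, name))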

Documentation link - pyspark.StorageLevel

Advice from the documentation

I'm copy-pasting from the documentation:

Source link - https://spark.apache.org/docs/latest/rdd-programming-guide.html#which-storage-level-to-choose

Which Storage Level to Choose?

Spark’s storage levels are meant to provide different trade-offs between memory usage and CPU efficiency. We recommend going through the following process to select one:

  • If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.
  • If not, try using MEMORY_ONLY_SER and selecting a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access. (Java and Scala)
  • Don’t spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition may be as fast as reading it from disk.
  • Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition.
