Corridor Integrated Notebooks
This section addresses FAQs that a notebook user may have at the outset. Each of these topics is illustrated in more detail in the sections that follow.
What packages are available inside the Notebook?
By default, a set of basic packages comes pre-installed in the notebook (unless your system administrator has prescribed otherwise): NumPy, pandas, scikit-learn, and findspark.
To install any additional packages, contact your technology team and follow the appropriate process as per their guidelines.
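Before requesting an installation, you can check from a notebook cell whether a package is already available. A minimal sketch using only the Python standard library (the package names probed here are illustrative):

```python
from importlib import util

def is_installed(package_name: str) -> bool:
    """Return True if the named package can be imported in this notebook."""
    return util.find_spec(package_name) is not None

# Probe a module that is always present and one that is not.
print(is_installed("json"))                          # True
print(is_installed("definitely_not_a_real_package")) # False
```

This only checks importability; it does not report the installed version, for which `importlib.metadata.version` can be used instead.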
Can I use a pandas DataFrame for large datasets? How can I get more memory? Is this the right approach?
While the Platform itself imposes no additional limitations, the infrastructure setup for your installation may impose constraints.
pandas requires all the data being operated on to fit in memory. For larger datasets it may be beneficial to use PySpark, which is designed for data that does not fit on a single machine, but this may not always be possible.
Contact the technology team to understand how to handle your use-case - be it increasing the infrastructure pool available for your notebooks, or an alternative method.
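One common middle ground, when the data arrives as a file, is to process it in chunks so that only a slice is in memory at a time. A minimal sketch using pandas' `chunksize` option (the column name and chunk size are illustrative; the `StringIO` buffer stands in for a large file on disk):

```python
import io
import pandas as pd

# Simulate a CSV file; in practice this would be a path to a large file.
csv_data = io.StringIO("value\n" + "\n".join(str(i) for i in range(10)))

# Stream the file in fixed-size chunks instead of loading it all at once.
total = 0
for chunk in pd.read_csv(csv_data, chunksize=4):
    total += int(chunk["value"].sum())

print(total)  # → 45, the sum of 0..9
```

Each iteration holds at most `chunksize` rows in memory, so the peak footprint stays bounded regardless of the file size.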
How do I import PySpark? When I try to import pyspark, it says pyspark is not found
There are multiple ways to use PySpark in Notebooks - depending on what sort of setup was done in the Platform installation.
In our experience, the most robust way to find and use Spark is:
import findspark
findspark.init()  # locate the Spark installation and add it to sys.path

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()