Friday, April 19, 2024

PySpark code 1: How to delete all the CSV files from the Databricks File System

# All the files to be deleted under a given path have names that start with "part" and end with ".csv".



Code 

# dbutils is predefined in Databricks notebooks, so no import is needed

# Define the directory path
directory_path = "/FileStore/tables/Sample DataSource/"

# List the files and folders directly under this directory
# (dbutils.fs.ls does not descend into subdirectories)
csv_files = dbutils.fs.ls(directory_path)

# Delete each file whose name starts with "part" and ends with ".csv"
for file in csv_files:
    if file.name.startswith("part") and file.name.endswith(".csv"):
        dbutils.fs.rm(file.path, recurse=True)



  • In Databricks, when you use the dbutils.fs.ls() function to list files in a directory, it returns a list of FileInfo objects. Each FileInfo object represents a file or directory in the specified location. The file.name attribute of a FileInfo object contains the name of the file or directory.

    Here's an explanation of file.name:

    • file: This is a variable representing a single item in the list of files/directories returned by dbutils.fs.ls().
    • .: This is the dot operator, which is used to access attributes and methods of objects in Python.
    • name: This is an attribute of the FileInfo object that holds the name of the file or directory.

    When you access file.name, you're retrieving the name of the file or directory represented by the FileInfo object stored in the variable file.
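Because the delete cannot be undone, it can help to check which names the filter matches before removing anything. The snippet below is a minimal sketch of such a dry run. It runs outside Databricks by using a hypothetical stand-in for FileInfo (a namedtuple with only path and name fields) and made-up sample file names; the helper name select_part_csv_files is also an assumption for illustration, not part of dbutils.

```python
from collections import namedtuple

# Hypothetical stand-in for the FileInfo objects returned by dbutils.fs.ls();
# only the two attributes used here (path and name) are modelled.
FileInfo = namedtuple("FileInfo", ["path", "name"])


def select_part_csv_files(entries):
    """Return the entries whose name starts with 'part' and ends with '.csv'."""
    return [
        entry
        for entry in entries
        if entry.name.startswith("part") and entry.name.endswith(".csv")
    ]


# Made-up sample listing, used only to illustrate the filter
base = "dbfs:/FileStore/tables/Sample DataSource/"
sample_listing = [
    FileInfo(base + "part-00000.csv", "part-00000.csv"),
    FileInfo(base + "_SUCCESS", "_SUCCESS"),
    FileInfo(base + "part-00001.csv", "part-00001.csv"),
    FileInfo(base + "notes.csv", "notes.csv"),
    FileInfo(base + "part-00002.json", "part-00002.json"),
]

# Dry run: show what would be deleted, without deleting anything
to_delete = select_part_csv_files(sample_listing)
for entry in to_delete:
    print("would delete:", entry.path)
```

In a Databricks notebook, the same helper could be given dbutils.fs.ls(directory_path) instead of the sample listing, and dbutils.fs.rm(entry.path) called on each result once the printed list looks right.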

