A previous post demonstrated how to build an inventory of web-accessible files from the URL links discovered in a website's HTML source. A similar approach can be applied to a filesystem directory tree to build a catalog of all files with matching filetypes.
Use glob to recursively globstar-match filepaths
#
In this case I use Python’s builtin glob and regular expression (re) modules to list files and match extensions in their names. I used the os.path collection of utility functions to pull directory names from the full paths, but a more modern way would probably be to use the builtin pathlib. Pandas is the only non-builtin package used, and it could be removed if a DataFrame is not the desired output.
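As a rough sketch of the pathlib alternative mentioned above, the snippet below builds a small throwaway directory tree and lists the .nc files in it twice: once with glob.glob and once with Path.rglob. The temporary folder and file names are made up for illustration. Note that "**" only spans nested directories in glob.glob when recursive=True is passed.

```python
import glob
import tempfile
from pathlib import Path

# Hypothetical throwaway tree, created only for this demonstration.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in ("a/one.nc", "a/b/two.nc", "a/b/c/three.nc", "a/b/notes.txt"):
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()

    # glob: "**" spans any number of directories only when recursive=True.
    glob_matches = sorted(glob.glob(f"{root}/**/*.nc", recursive=True))

    # pathlib: rglob("*.nc") searches the root and every subdirectory.
    path_matches = sorted(root.rglob("*.nc"))

    glob_names = [Path(p).name for p in glob_matches]
    path_names = [p.name for p in path_matches]

    # Folder and extension come straight from Path attributes,
    # replacing os.path.dirname and os.path.splitext.
    folders = sorted({p.parent.relative_to(root).as_posix() for p in path_matches})
    extensions = {p.suffix.lstrip(".") for p in path_matches}

print(glob_names)
print(folders)
print(extensions)
```

Both approaches find the same three files here. The pathlib version returns Path objects, so the parent folder and suffix are available without extra string handling.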
Regex can match numbers to create template patterns for large file lists
#
There are probably other CLI tools that allow similar recursive searches for specific filetypes, but this code also reduces the matches into groups based on similar paths that differ only in their numbers. These groups can be more useful in the context of an inventory or summary than the full list of files found in the catalog.
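To make the number-matching idea concrete, here is a minimal sketch of the substitution on its own. The helper name to_pattern and the example paths are invented for illustration. Every run of digits is replaced with a single "*", so paths that differ only in their numbers collapse to the same template.

```python
import re


def to_pattern(path):
    """Collapse every run of digits into a single '*' wildcard."""
    return re.sub(r"\d+", "*", path)


# Invented example paths; only the numbers differ between the first two.
examples = [
    "/data/ERA5/monthly/201901_ERA5_temp.nc",
    "/data/ERA5/monthly/201902_ERA5_temp.nc",
    "/data/ERA5/monthly/readme.txt",
]

patterns = [to_pattern(p) for p in examples]
for original, pattern in zip(examples, patterns):
    print(f"{original} -> {pattern}")
```

Digits in directory names are replaced as well, so a folder such as ERA5 shows up as ERA* in the resulting pattern.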
Scan Filesystem Code
#
    import glob
    import os
    import re

    import pandas as pd

    # Filename endings to search for (glob wildcards are allowed).
    SEARCH_EXTENSIONS = ("win", "WIN", ".csv", ".asc",
                         ".nc", ".jp*g", ".png", ".bdf",
                         ".BDF", ".zip", ".xyz",)
    # Top-level directories to scan.
    SEARCH_ROOTS = ("/mnt/M/ERA5",)

    catalogs = []
    for ext in SEARCH_EXTENSIONS:
        for mountpath in SEARCH_ROOTS:
            # recursive=True is needed for "**" to span nested folders;
            # without it "**" behaves like a single "*".
            catalog = {
                "Fullpath": sorted(
                    glob.glob(f"{mountpath}/*/**/*{ext}", recursive=True)
                )
            }
            catalog = pd.DataFrame(catalog)
            # Skip extensions with no matches.
            if not catalog.empty:
                try:
                    # Replace every run of digits with "*" to form a template.
                    catalog["Pattern"] = catalog.Fullpath.apply(
                        lambda path: re.sub(r'\d+', '*', path)
                    )
                    catalog["Folder"] = catalog.Fullpath.apply(
                        os.path.dirname
                    )
                    catalog["LowerPath"] = catalog.Fullpath.str.lower()
                    # Extension without the leading dot.
                    catalog["Extension"] = catalog.Fullpath.apply(
                        lambda path: os.path.splitext(path)[-1][1:]
                    )
                except Exception:
                    # Show the offending paths before re-raising.
                    print(catalog.Fullpath)
                    raise
                catalogs.append(catalog)

    catalogs = pd.concat(catalogs, ignore_index=True)

    print("\nEvery Match:")
    print(catalogs.to_string())

    print("\nEach Pattern Grouping:")
    groups = {}
    for group in catalogs.Pattern.unique():
        groups[group] = catalogs[catalogs.Pattern == group]
        print("\n\n")
        print(f"File pattern {group}\n")
        # Print the first file of each group as a representative sample.
        print("Sample:", groups[group].iloc[0].to_string())
Sample path pattern group matches
#
    File pattern /mnt/M/ERA*/RDA/sfc/*.VAR_*D.e*.oper.an.sfc.*_*_*d.regn*sc.*_*.nc

    Sample: Fullpath     /mnt/M/ERA5/RDA/sfc/349912.VAR_2D.e5.oper.an.s...
    Pattern      /mnt/M/ERA*/RDA/sfc/*.VAR_*D.e*.oper.an.sfc.*_...
    Folder       /mnt/M/ERA5/RDA/sfc
    LowerPath    /mnt/m/era5/rda/sfc/349912.var_2d.e5.oper.an.s...
    Extension    nc

    File pattern /mnt/M/ERA*/RDA/sfc/*.VAR_*T.e*.oper.an.sfc.*_*_*t.regn*sc.*_*.nc

    Sample: Fullpath     /mnt/M/ERA5/RDA/sfc/349912.VAR_2T.e5.oper.an.s...
    Pattern      /mnt/M/ERA*/RDA/sfc/*.VAR_*T.e*.oper.an.sfc.*_...
    Folder       /mnt/M/ERA5/RDA/sfc
    LowerPath    /mnt/m/era5/rda/sfc/349912.var_2t.e5.oper.an.s...
    Extension    nc

    File pattern /mnt/M/ERA*/Unipost/MM_Unipost/*_ERA*_UniPost.nc

    Sample: Fullpath     /mnt/M/ERA5/Unipost/MM_Unipost/201901_ERA5_Uni...
    Pattern      /mnt/M/ERA*/Unipost/MM_Unipost/*_ERA*_UniPost.nc
    Folder       /mnt/M/ERA5/Unipost/MM_Unipost
    LowerPath    /mnt/m/era5/unipost/mm_unipost/201901_era5_uni...
    Extension    nc

    File pattern /mnt/M/ERA*/WINDENERGY/GOMOS/*_ERA*_TUVQW_GOMOS.nc

    Sample: Fullpath     /mnt/M/ERA5/WINDENERGY/GOMOS/197901_ERA5_TUVQW...
    Pattern      /mnt/M/ERA*/WINDENERGY/GOMOS/*_ERA*_TUVQW_GOMO...
    Folder       /mnt/M/ERA5/WINDENERGY/GOMOS
    LowerPath    /mnt/m/era5/windenergy/gomos/197901_era5_tuvqw...
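For a long catalog, a file count per pattern may be more useful than a single sample row. The sketch below is one possible pandas-free way to collect that summary with plain dictionaries. The paths are invented stand-ins for the glob results, not files from the listing above, and the summary keys simply mirror the column names used earlier.

```python
import os
import re
from collections import defaultdict

# Invented paths standing in for the glob results.
paths = [
    "/data/ERA5/sfc/201901_ERA5_sfc.nc",
    "/data/ERA5/sfc/201902_ERA5_sfc.nc",
    "/data/ERA5/sfc/201903_ERA5_sfc.nc",
    "/data/ERA5/plots/map_01.png",
]

# Group the sorted paths by their digit-free template.
groups = defaultdict(list)
for path in sorted(paths):
    groups[re.sub(r"\d+", "*", path)].append(path)

# One summary record per pattern: how many files, plus a sample.
summary = [
    {
        "Pattern": pattern,
        "Count": len(members),
        "Sample": members[0],
        "Folder": os.path.dirname(members[0]),
    }
    for pattern, members in groups.items()
]

for row in summary:
    print(f'{row["Count"]:>4}  {row["Pattern"]}')
```

Because the paths are sorted before grouping, the sample for each pattern is the first match in sorted order, which is the same choice the iloc[0] sample makes in the DataFrame version.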