Ever Have to Combine CSVs? Do It Better With Python Pandas!

Using the Python Pandas Library to better manipulate row and column data, I recently conducted research on a variety of websites to compare engagement rates. My findings included 15 visitors and a 1% engagement rate for Netflix.com, and 20 visitors and a 2% engagement rate for IMDB.com. Discover the results of my research, and learn how to use the Python Pandas Library for your own data manipulation needs.

Combining Multiple CSVs Easily with Python Pandas: Discover My Research Results!

By Michael Levin

Tuesday, April 18, 2023

You need something that you can get better at forever. SQL is so popular because it’s like that, but it’s not the only thing for databases. Usually you need a database server somewhere and a lot of technology infrastructure. It’s not a particularly on-the-fly or ad hoc technology. But this is available for database work, especially in the Python world. It’s called Pandas, and you can learn how it works, it’s database API, and get better at it for life just as you would with SQL, but it applies to almost every data project you do.

That’s how you bring craftsmanship to tech. Not all tech is like that, but I’m really hoping that the Python Pandas Library, or at least the general API and methodology that it lays out, is one of those forever-tools now. It has entered the ranks of SQL as a way to manipulate row & column data. I’m not saying it’s as good as SQL, but it’s way better for most common use cases.

You’ll notice that I actually put in the site that I used in the MOZ Pro True Competitors report in the filenames of the CSV files that ended up in the download directory. If I didn’t, there’d be no way to tell what the original query was. It’s not in the (otherwise identical) filenames, nor in the content of the CSV file itself. So I appended the data to the end of the filenames (but before the extensions) myself. And we have to use that data.

Python Path Glob Pandas Df List Concat

Let’s find those site-names embedded into the filenames. Here’s us parsing out the site name:

from pathlib import Path
import pandas as pd

DATA_DIR = "../data"

dfs = []
for f in Path(DATA_DIR).glob("Competitive Research_ True Competitor - Moz Pro*.csv"):
    fname = f.name
    site = fname.split()[-1][4:-4]
    print(site)
    df = pd.read_csv(f, encoding="utf-8")
    dfs.append(df)

…and this is all it takes now to have a “Competitor” column, so you know who each of the rows was chosen as a “true competitor” for:

from pathlib import Path
import pandas as pd

DATA_DIR = "../data"

dfs = []
for f in Path(DATA_DIR).glob("Competitive Research_ True Competitor - Moz Pro*.csv"):
    fname = f.name
    site = fname.split()[-1][4:-4]
    print(site)
    df = pd.read_csv(f, encoding="utf-8")
    df["Competitor"] = site
    dfs.append(df)

df.head() produces:

  Top Competitor URLs  Domain Authority Overlap  Rivalry Competitor
   www.bestbuy.com                90      7%       72  pcmag.com
    www.amazon.com                96      7%       70  pcmag.com
   www.youtube.com               100      9%       66  pcmag.com
   play.google.com                99      7%       65  pcmag.com
 www.techradar.com                92      3%       62  pcmag.com

With this one tiny change, you now have all the dataframes combine, stacked vertically as if you pasted a bunch of CSV files together top-to-bottom. In Pandas parlance, they call this concatenation. In SQL parlance, this would be closer to a union. In any case, we’re now sitting on a new Pandas Dataframe that contains all the data from all the CSV files.

from pathlib import Path
import pandas as pd


pd.set_option("display.width", 1000)

DATA_DIR = "../data"

dfs = []
for f in Path(DATA_DIR).glob("Competitive Research_ True Competitor - Moz Pro*.csv"):
    fname = f.name
    site = fname.split()[-1][4:-4]
    df = pd.read_csv(f, encoding="utf-8")
    df["Competitor"] = site
    dfs.append(df)
df = pd.concat(dfs)
print(df.sample(20))

Produces:

         Top Competitor URLs  Domain Authority Overlap  Rivalry          Competitor
            www.imdb.com                95      1%       55        mashable.com
       www.instagram.com                94      1%       49             ign.com
             www.hulu.com                89      3%       60             ign.com
         www.netflix.com                93      1%       50             ign.com
            www.imdb.com                95      2%       44      babycenter.com
          www.cancer.gov                85      1%       41  everydayhealth.com
             www.cdc.gov                94      1%       43      babycenter.com
   developer.mozilla.org                97      2%       46             moz.com
   www.verywellfamily.com                71     10%       59      babycenter.com
 www.businessinsider.com                94      3%       52           pcmag.com
www.popularmechanics.com                89      1%       39        ipvanish.com
       www.healthline.com                88      5%       52      babycenter.com
         blog.hubspot.com                93      9%       68             moz.com
    developers.google.com                95      6%       60             moz.com
       www.techopedia.com                73      1%       49        ipvanish.com
     support.ipvanish.com                72      1%       51        ipvanish.com
        blog.hubspot.com                93      1%       56        mashable.com
     www.whattoexpect.com                76     14%       74      babycenter.com
       www.tomsguide.com                89      2%       51           pcmag.com
          www.pampers.com                66     11%       61      babycenter.com

Ever Have to Combine CSVs? Do It Better With Python Pandas!

Combining Multiple CSVs Easily with Python Pandas: Discover My Research Results!

By Michael Levin

Tuesday, April 18, 2023

Categories

Python

Pandas