Analyzing Residential Property Assessments in Anne Arundel County, Maryland

Paul NG

Introduction

Buying a home seems to be getting harder every year, and as someone just entering the adult world, that scares me: I hardly know anything about real estate, nor am I equipped to deal with it.

With that said, what better way to prepare myself than to use data science skills to model and analyze trends in property value?

The objective of this project is to analyze public assessments of real estate value in Anne Arundel County, Maryland, using reports from 2021 to 2022, and hopefully to uncover and dissect trends and insights in the market. We will look at how concrete factors like location, size, and construction year contribute to a property's assessed value, and how we can use that information to understand facets of the market and of real estate as a whole, even as laymen to the topic.

With that said, I'd like to make a clear disclaimer that I'm not highly knowledgeable about properties or real estate, but I'll do my best to explain the terms and aspects of properties we interact with throughout this tutorial, with the help of outside resources and research. For those unfamiliar with the world of real estate, I hope this provides an interesting and accessible lens into the property assessment process. For those who are familiar, I hope I can offer some interesting insights through this exploratory analysis and model fitting.

And for both parties, I'll do what I can to show the steps you can take to analyze data like this yourself.

Setting Up

For this analysis, we’ll be using Python through Google Colab along with the imported libraries below:

In [154]:
# libraries to import
from google.colab import drive # For mounting drive files
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from glob import glob
import folium
from folium import plugins
from folium.plugins import HeatMap
from sklearn import linear_model
import statsmodels.api as sm
import statistics as stat
import warnings
warnings.filterwarnings("ignore")
plt.style.use('seaborn')

Keep in mind that if you are using Python through Jupyter, or are importing data some way other than Google Drive, importing 'drive' from 'google.colab' will not be necessary.
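For example, if you're working in a local Jupyter environment, a minimal alternative is to skip the mount and point pandas at a local file (the filename below is just a placeholder for wherever you saved the CSV):

# No drive.mount needed outside of Colab; just read the local file
path = "aa_county_residential_assessments.csv"  # hypothetical local filename
base_df = pd.read_csv(path)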

Reading in the data

The dataset we’ll be using for this is the Maryland Real Property Assessments data available through Maryland’s official Data portal.

Though the dataset is already in a form that can be read in as a dataframe, it's an enormous file with 2.44 million rows and 221 columns! In order to work with the data more easily without crashing my computer or the Colab client, I preprocessed and filtered some of the data using Excel. With Excel, I narrowed the dataset down to only properties in Anne Arundel County, and of those, only residential single-dwelling homes.

After doing that, I uploaded the smaller file into a Drive folder dedicated to the project, accessible here (it was still too large to upload to GitHub), and mounted my Colab notebook to access it.
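If you'd rather not use Excel, a rough sketch of doing the same prefiltering in pandas is shown below. It reads the full statewide CSV in chunks so it doesn't exhaust memory; the file path is a placeholder, the dwelling-unit filter value is an assumption you should verify against the raw data, and you'd also want to filter on the land use code column (whose residential code value I haven't confirmed here).

# Sketch: filter the full statewide export down to Anne Arundel single-dwelling homes
full_path = "Maryland_Real_Property_Assessments.csv"  # hypothetical path to the full export
chunks = []
for chunk in pd.read_csv(full_path, chunksize=100_000):
    mask = (
        (chunk['County Name (MDP Field: CNTYNAME)'] == 'Anne Arundel County')
        # Assumed: single-dwelling homes have exactly one dwelling unit
        & (chunk['C.A.M.A. SYSTEM DATA: Number of Dwelling Units (MDP Field: BLDG_UNITS. SDAT Field #239)'] == 1)
    )
    chunks.append(chunk[mask])
base_df = pd.concat(chunks, ignore_index=True)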

In [155]:
# Mounting my google drive where I keep a copy of the prefiltered dataset
drive.mount('/content/drive')
# This is the file path for me with respects to my own drive, make sure to change this to fit your csv importing process!
path = "/content/drive/MyDrive/Tutorial/Data/Anne Arundel County Maryland Residential Property Assessments.csv"
base_df = pd.read_csv(path)
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

Here’s the dataset:

In [156]:
base_df.head()
Out[156]:
Jurisdiction Code (MDP Field: JURSCODE) County Name (MDP Field: CNTYNAME) Account ID (MDP Field: ACCTID) Real Property Search Link FINDER Online Link Search Google Maps for this Location MDP Longitude (MDP Field: DIGXCORD converted to WGS84) MDP Latitude (MDP Field: DIGYCORD converted to WGS84) Mappable Latitude and Longitude RECORD KEY: County Code (SDAT Field #1) ... Plat Reference Liber (MDP Field: PLTLIBER. SDAT Field #267) Plat Reference Folio (MDP Field: PLTFOLIO. SDAT Field #268) PARENT ACCOUNT NUMBER: District/Ward (SDAT Field #387) PARENT ACCOUNT NUMBER: Account Number (SDAT Field #388) Last Activity Date (YYYY.MM.DD) (SDAT Field #392) Record Creation Date (YYYY.MM.DD) (SDAT Field #397) Record Deletion Date (YYYY.MM.DD) (SDAT Field #398) Assessment Cycle Year (SDAT Field #399) File Record Type (SDAT Field #400) Date of Most Recent Open Data Portal Record Update
0 ANNE Anne Arundel County 20174709207600 https://sdat.dat.maryland.gov/RealProperty/Pag... https://apps.planning.maryland.gov/finderonlin... https://maps.google.com/maps?t=h&q=38.90595657... -76.517997 38.905957 (38.90595657323789, -76.51799720169853) 2 ... 0000003 26 NaN NaN 2022.04.16 1991.00.00 0000.00.00 2022 M 20220504
1 ANNE Anne Arundel County 20100000035500 https://sdat.dat.maryland.gov/RealProperty/Pag... https://apps.planning.maryland.gov/finderonlin... https://maps.google.com/maps?t=h&q=38.92436856... -76.571792 38.924369 (38.92436856136531, -76.57179190498331) 2 ... 0000166 4 NaN NaN 2022.04.16 1991.00.00 0000.00.00 2022 M 20220504
2 ANNE Anne Arundel County 20100000048000 https://sdat.dat.maryland.gov/RealProperty/Pag... https://apps.planning.maryland.gov/finderonlin... https://maps.google.com/maps?t=h&q=38.84279759... -76.542110 38.842798 (38.84279759819853, -76.54211036504326) 2 ... 0000000 0 NaN NaN 2021.08.03 1991.00.00 0000.00.00 2022 M 20220504
3 ANNE Anne Arundel County 20100000067000 https://sdat.dat.maryland.gov/RealProperty/Pag... https://apps.planning.maryland.gov/finderonlin... https://maps.google.com/maps?t=h&q=38.90941636... -76.554680 38.909416 (38.90941636309363, -76.55468029456495) 2 ... 0000000 0 NaN NaN 2021.08.03 1991.00.00 0000.00.00 2022 M 20220504
4 ANNE Anne Arundel County 20100000067200 https://sdat.dat.maryland.gov/RealProperty/Pag... https://apps.planning.maryland.gov/finderonlin... https://maps.google.com/maps?t=h&q=38.91023312... -76.554364 38.910233 (38.91023312394375, -76.55436388543245) 2 ... 0000000 0 NaN NaN 2022.04.16 1991.00.00 0000.00.00 2022 M 20220504

5 rows × 221 columns

In [157]:
len(base_df.index)
Out[157]:
163854
In [158]:
base_df.dtypes
Out[158]:
Jurisdiction Code (MDP Field: JURSCODE)                object
County Name (MDP Field: CNTYNAME)                      object
Account ID (MDP Field: ACCTID)                          int64
Real Property Search Link                              object
FINDER Online Link                                     object
                                                        ...  
Record Creation Date (YYYY.MM.DD) (SDAT Field #397)    object
Record Deletion Date (YYYY.MM.DD) (SDAT Field #398)    object
Assessment Cycle Year (SDAT Field #399)                 int64
File Record Type (SDAT Field #400)                     object
Date of Most Recent Open Data Portal Record Update      int64
Length: 221, dtype: object

As you can see, it's pretty huge. Even after the prefiltering we still have more than 160,000 rows, each with those 221 column values. Keep in mind that this dataset contains a lot of information regarding taxes and various other details that we don't plan on using in this analysis, whether due to lack of relevance or because of the much greater complexity and length they would add to this tutorial. So before we can discuss the dataset, we should trim it down.

Tidying the data:

To start tidying our data, let’s decide what information we want to look at and keep columns that pertain to that.

For this tutorial's analysis we'll look at four general attributes, to keep things easy to understand while still hopefully being enough to find several interesting trends:

  • Property Value
  • Property Size
  • Year of Construction
  • City

We’ll also want information like coordinates and an address line in order to visualize the data and to reference specific entries when needed.

With those in mind, we’ll trim the table down to the following columns:

In [159]:
# Keeping only the columns we plan to use for our dataframe or for tidying it
df = base_df[[
              # For unique identification if I ever need it
              'LEGAL DESCRIPTION: Line 2 (MDP Field: LEGAL2. SDAT Field #18)',

              # Location data
              'MDP Longitude (MDP Field: DIGXCORD converted to WGS84)',
              'MDP Latitude (MDP Field: DIGYCORD converted to WGS84)',
              'PREMISE ADDRESS: City (MDP Field: PREMCITY. SDAT Field #25)',

              # Base cycle: assessment independent of current market cycle
              'BASE CYCLE DATA: Land Value (SDAT Field #154)',
              'BASE CYCLE DATA: Improvements Value (SDAT Field #155)',
              'BASE CYCLE DATA: Preferential Land Value (SDAT Field #156)',
              'BASE CYCLE DATA: Date Assessed (YYYY.MM) (SDAT Field #158)',

              # Current cycle: assessment of the current market cycle
              'CURRENT CYCLE DATA: Land Value (MDP Field Names: NFMLNDVL, CURLNDVL, and SALLNDVL. SDAT Field #164)',
              'CURRENT CYCLE DATA: Improvements Value (MDP Field Names: NFMIMPVL, CURIMPVL, and SALIMPVL. SDAT Field #165)',
              'CURRENT CYCLE DATA: Preferential Land Value (SDAT Field #166)',
              'CURRENT CYCLE DATA: Date Assessed (YYYY.MM) (MDP Field: LASTASSD. SDAT Field #169)',

              # Data regarding physical premise
              'C.A.M.A. SYSTEM DATA: Year Built (YYYY) (MDP Field: YEARBLT. SDAT Field #235)',
              'C.A.M.A. SYSTEM DATA: Structure Area (Sq.Ft.) (MDP Field: SQFTSTRC. SDAT Field #241)',
              'C.A.M.A. SYSTEM DATA: Land Area (MDP Field: LANDAREA. SDAT Field #242)',
              # a for acres, s for square feet, after looking at linked webpages for several addresses
              'C.A.M.A. SYSTEM DATA: Land Unit of Measure (MDP Field: LUOM. SDAT Field #243)',
              
              # 'Homestead Qualification Date (MDP Field: HOMQLDAT. SDAT Field #260)',
              'Last Activity Date (YYYY.MM.DD) (SDAT Field #392)',
              'Record Deletion Date (YYYY.MM.DD) (SDAT Field #398)',
              'Assessment Cycle Year (SDAT Field #399)',

              # These three columns were already filtered in this version of the csv, but feel free to use these if you decide to import the whole dataset
              #'County Name (MDP Field: CNTYNAME)',
              #'Land Use Code (MDP Field: LU/DESCLU. SDAT Field #50)',
              #'C.A.M.A. SYSTEM DATA: Number of Dwelling Units (MDP Field: BLDG_UNITS. SDAT Field #239)',
              ]]

Now, if you noticed above or looked through the original dataset, there are actually numerous groups of columns that refer to the monetary value of a property, and each group usually contains the following:

  • Land Value: the value of the land itself, without any buildings or manmade features.
  • Improvements Value: the value of the manmade features, most usually the house(s) or building(s) on the land.
  • Preferential Land Value: a value applied to agricultural land and added to the total value to account for the land's usage. We'll still import this to see whether it's in use for our entries, but we don't expect it to be relevant for residential houses.
  • Circuit Breaker Value: the total value of the property, adjusted in accordance with Maryland's Circuit Breaker program, which reduces property taxes for those who would have trouble affording them; you can learn more about this here. As I am looking purely at the value of the property itself and not the tax situations of the homeowners, this value is not important for this analysis.

Each of these values is also reported across the following groups:

  • Sales segments 1-3: Sales segments usually refer to sales for different markets (like selling into a residential market as opposed to a commercial one), but they also appear to be used in this data to represent previous transactions. Because these segments are only inconsistently populated and better alternatives are available, they will be skipped.
  • Previous Year Assessments: exactly what it says; interesting data to consider, but it only covers a limited span, so it won't be used.
  • Current cycle: the assessment with respect to the current market cycle. Cycles in this context refer to the phases of the real estate market (conventionally recovery, expansion, hyper supply, and recession). Since some of this data is assessed at different times, we won't rely on it heavily; it is more variable and reflects current market shifts rather than a more objective assessment of the property.
  • Base cycle: the assessment independent of current cycle trends, the base value to work from. Our assessment data will focus on these values.
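If you want to see these column groups in the raw data for yourself, a quick (and purely illustrative) way to list every value-related column is something like:

# Print every raw column whose name mentions a land or improvements value
value_cols = [c for c in base_df.columns if 'Land Value' in c or 'Improvements Value' in c]
for c in value_cols:
    print(c)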

With those explained, let's continue cleaning up this data by renaming these columns to simpler, more relevant names so they don't clutter our code or tables.

In [160]:
# renaming columns
df = df.rename(columns={
    'LEGAL DESCRIPTION: Line 2 (MDP Field: LEGAL2. SDAT Field #18)': 'street_address', 
    'MDP Longitude (MDP Field: DIGXCORD converted to WGS84)': 'long',
    'MDP Latitude (MDP Field: DIGYCORD converted to WGS84)': 'lat',
    'PREMISE ADDRESS: City (MDP Field: PREMCITY. SDAT Field #25)': 'city',

    'BASE CYCLE DATA: Land Value (SDAT Field #154)': 'base_land',
    'BASE CYCLE DATA: Improvements Value (SDAT Field #155)': 'base_improvements',
    'BASE CYCLE DATA: Preferential Land Value (SDAT Field #156)': 'base_preferential',
    'BASE CYCLE DATA: Date Assessed (YYYY.MM) (SDAT Field #158)': 'base_assess_date',

    'CURRENT CYCLE DATA: Land Value (MDP Field Names: NFMLNDVL, CURLNDVL, and SALLNDVL. SDAT Field #164)': 'curr_land',
    'CURRENT CYCLE DATA: Improvements Value (MDP Field Names: NFMIMPVL, CURIMPVL, and SALIMPVL. SDAT Field #165)': 'curr_improvements',
    'CURRENT CYCLE DATA: Preferential Land Value (SDAT Field #166)': 'curr_preferential',
    'CURRENT CYCLE DATA: Date Assessed (YYYY.MM) (MDP Field: LASTASSD. SDAT Field #169)': 'curr_assess_date',
  
    'CURRENT ASSESSMENT YEAR: Total Assessment (SDAT Field #172)': 'curr_year_total',
    'CURRENT ASSESSMENT YEAR: Circuit Breaker Assessment (SDAT Field #176)': 'curr_year_cb',

    'C.A.M.A. SYSTEM DATA: Year Built (YYYY) (MDP Field: YEARBLT. SDAT Field #235)': 'year_built',
    'C.A.M.A. SYSTEM DATA: Structure Area (Sq.Ft.) (MDP Field: SQFTSTRC. SDAT Field #241)': 'struct_area',
    'C.A.M.A. SYSTEM DATA: Land Area (MDP Field: LANDAREA. SDAT Field #242)': 'land_area_multityped',
    'C.A.M.A. SYSTEM DATA: Land Unit of Measure (MDP Field: LUOM. SDAT Field #243)': 'land_area_type',
    
    'Last Activity Date (YYYY.MM.DD) (SDAT Field #392)': 'rec_update_date',
    'Record Deletion Date (YYYY.MM.DD) (SDAT Field #398)': 'rec_delete_date',
    'Assessment Cycle Year (SDAT Field #399)': 'assessment_year'
    })

The table is a lot more legible and manageable now, but we’re still not done. Let's go through these columns to make sure they are relevant and aren’t missing values.

Locations

In [161]:
print('Missing address: ', len(df[df['street_address'].isna()]))
print('Missing long: ', len(df[(df['long'].isna()) | (df['long'] == 0)]))
print('Missing lat: ', len(df[(df['lat'].isna()) | (df['lat'] == 0)]))
Missing address:  0
Missing long:  164
Missing lat:  164

Here we see some rows are missing coordinate values. Since coordinates are essential to mapping the data and 164 rows isn't even a percent of the total data size, we can remove these entries without issue.

In [162]:
df = df.dropna(subset=['long', 'lat'], how='any')
df = df[(df['long'] != 0) & (df['lat'] != 0)]

Assessment values

In [163]:
print('Missing base land $: ', len(df[(df['base_land'].isna()) | (df['base_land'] == 0)]))
print('Missing base improvements $: ', len(df[(df['base_improvements'].isna()) | (df['base_improvements'] == 0)]))
print('Missing/no base preferential $: ', len(df[(df['base_preferential'].isna()) | (df['base_preferential'] == 0)]))
print('Missing base assessment date: ', len(df[(df['base_assess_date'].isna()) | (df['base_assess_date'] == 0)]))

print('Missing current land $: ', len(df[(df['curr_land'].isna()) | (df['curr_land'] == 0)]))
print('Missing current improvements $: ', len(df[(df['curr_improvements'].isna()) | (df['curr_improvements'] == 0)]))
print('Missing/no current preferential $: ', len(df[(df['curr_preferential'].isna()) | (df['curr_preferential'] == 0)]))
print('Missing curr assessment date: ', len(df[(df['curr_assess_date'].isna()) | (df['curr_assess_date'] == 0)]))
Missing base land $:  26
Missing base improvements $:  82
Missing/no base preferential $:  163690
Missing base assessment date:  0
Missing current land $:  26
Missing current improvements $:  81
Missing/no current preferential $:  163690
Missing curr assessment date:  0

Assessment values are integral to our analysis, so dropping these few rows with missing entries is in our best interest.

Also note that no rows have a value for preferential land value, so we can safely drop those columns.

In [164]:
df = df.dropna(subset=['base_land', 'base_improvements', 'curr_land', 'curr_improvements', 'base_assess_date'], how='any')
df = df[(df['base_land'] != 0) & (df['base_improvements'] != 0) & (df['curr_land'] != 0) & (df['curr_improvements'] != 0)]
In [165]:
print('Missing base land $: ', len(df[(df['base_land'].isna()) | (df['base_land'] == 0)]))
print('Missing base improvements $: ', len(df[(df['base_improvements'].isna()) | (df['base_improvements'] == 0)]))
print('Missing current land $: ', len(df[(df['curr_land'].isna()) | (df['curr_land'] == 0)]))
print('Missing current improvements $: ', len(df[(df['curr_improvements'].isna()) | (df['curr_improvements'] == 0)]))
Missing base land $:  0
Missing base improvements $:  0
Missing current land $:  0
Missing current improvements $:  0
In [166]:
df = df.drop(['base_preferential', 'curr_preferential'], axis=1)

Now that we've tidied the assessment data, we can also add total property values as columns, which will be the main focus of our analysis.

In [167]:
df['curr_total'] = df.apply(lambda row: row['curr_land'] + row['curr_improvements'], axis=1)
df['base_total'] = df.apply(lambda row: row['base_land'] + row['base_improvements'], axis=1)
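As an aside, since these are just element-wise sums, the same columns can be computed with plain vectorized arithmetic, which is typically much faster than a row-wise apply on a dataframe this size:

# Equivalent vectorized version of the two apply calls above
df['curr_total'] = df['curr_land'] + df['curr_improvements']
df['base_total'] = df['base_land'] + df['base_improvements']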

While we're here, let's also convert the assessment dates into a datetime format.

In [168]:
# YYYY.MM format got recognized as a float so we must convert it to a string first
df['base_assess_date'] = pd.to_datetime(df['base_assess_date'].astype(str), format ='%Y.%m', errors='coerce')
df['curr_assess_date'] = pd.to_datetime(df['curr_assess_date'].astype(str), format ='%Y.%m', errors='coerce')
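Because errors='coerce' silently turns any unparseable dates into NaT, it's worth a quick sanity check that nothing was lost in the conversion:

# Any dates that failed to parse will show up as NaT and be counted here
print('Unparsed base assessment dates: ', df['base_assess_date'].isna().sum())
print('Unparsed current assessment dates: ', df['curr_assess_date'].isna().sum())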

Other property data

In [169]:
print('Missing year built: ', len(df[(df['year_built'].isna()) | (df['year_built'] == 0)]))
print('Missing struct area: ', len(df[(df['struct_area'].isna()) | (df['struct_area'] == 0)]))
print('Missing land area: ', len(df[(df['land_area_multityped'].isna()) | (df['land_area_multityped'] == 0)]))
print('Missing land area types : ', len(df[df['land_area_type'].isna()]))
print('Land area types : ', df['land_area_type'].unique())
Missing year built:  173
Missing struct area:  3
Missing land area:  350
Missing land area types :  350
Land area types :  ['S' 'A' nan]

We have a small number of missing entries here too. These values aren't as integral to our analysis individually, but there is no easy way to impute them accurately (you could consider copying the year built and dimensions from the nearest geographic neighbor, but that seems involved for 350 entries), and the number of missing entries is still less than a percent of our total dataset, so I'll continue to drop the affected rows.
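For the curious, here's a rough sketch of that nearest-neighbor idea. It's purely illustrative and not used in this tutorial; it treats latitude/longitude distance naively and ignores entries where the year is coded as 0 rather than missing.

# Illustrative only: fill missing year_built from the nearest property with a known year
from sklearn.neighbors import NearestNeighbors

known = df.dropna(subset=['year_built'])
missing = df[df['year_built'].isna()]
if len(missing) > 0:
    nn = NearestNeighbors(n_neighbors=1).fit(known[['lat', 'long']])
    _, idx = nn.kneighbors(missing[['lat', 'long']])
    # Copy each missing property's year_built from its closest neighbor
    df.loc[missing.index, 'year_built'] = known.iloc[idx[:, 0]]['year_built'].values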

In [170]:
df = df.dropna(subset=['year_built', 'struct_area', 'land_area_multityped', 'land_area_type'], how='any')
df = df[(df['year_built'] != 0) & (df['struct_area'] != 0) & (df['land_area_multityped'] != 0)]
In [171]:
print('Missing year built: ', len(df[(df['year_built'].isna()) | (df['year_built'] == 0)]))
print('Missing struct area: ', len(df[(df['struct_area'].isna()) | (df['struct_area'] == 0)]))
print('Missing land area: ', len(df[(df['land_area_multityped'].isna()) | (df['land_area_multityped'] == 0)]))
Missing year built:  0
Missing struct area:  0
Missing land area:  0

Looking at the area values, you might've also noticed that land area has been split into a number column and a type column. Although structure areas are all already in square feet, land area is recorded under either an 'S' or an 'A' unit. Documentation for this dataset is pretty lacking, so it's not immediately clear what these units are, but after looking at the linked webpages for several of the properties and matching up the area data, I can confirm that S is for square feet and A is for acres.

To minimize this confusion and to keep the data consistent, let's convert all the land area values to square feet.

In [172]:
# Convert land area to be consistently in square feet
def toSqFt(row):
  if row['land_area_type'] == 'S':
    return row['land_area_multityped']
  elif row['land_area_type'] == 'A':
    # 1 acre = 43560 sq. ft
    return row['land_area_multityped'] * 43560
  else:
    return np.nan
In [173]:
df['land_area'] = df.apply(lambda row: toSqFt(row), axis=1)
# And now we can delete the old columns
df = df.drop(['land_area_multityped', 'land_area_type'], axis=1)
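For reference, the same conversion could have been written in the previous cell without apply by using np.where; this is just an equivalent alternative, and it would need to run before the old columns are dropped:

# Vectorized equivalent of toSqFt: multiply acre ('A') rows by 43,560, pass square-foot ('S') rows through
df['land_area'] = np.where(df['land_area_type'] == 'A',
                           df['land_area_multityped'] * 43560,
                           df['land_area_multityped'])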

Record dates

In [174]:
print('Missing update date: ', len(df[(df['rec_update_date'].isna()) | (df['rec_update_date'] == '0000.00.00')]))
print('Entries w/ deletion dates : ', len(df[df['rec_delete_date'] != '0000.00.00']))
print('Assessment years: ', df['assessment_year'].unique())
Missing update date:  0
Entries w/ deletion dates :  1
Assessment years:  [2022]

The deletion date column was kept in order to find and avoid deleted records in our data. Here we see we have one, so we'll simply remove it along with the column from the dataframe.

In [175]:
# Remove deleted records
df = df[df['rec_delete_date'] == '0000.00.00']
df = df.drop(['rec_delete_date'], axis=1)
In [176]:
df = df.drop(['assessment_year',], axis=1) # redundant column

While we're here, let's also convert the update column into a datetime format.

In [177]:
df['rec_update_date'] = pd.to_datetime(df['rec_update_date'],format='%Y.%m.%d', errors='coerce')
In [178]:
print('Missing update date: ', len(df[(df['rec_update_date'].isna())]))
Missing update date:  0

City

In [179]:
df['city'].unique()
Out[179]:
array(['EDGEWATER', 'GALESVILLE', 'RIVA', 'DAVIDSONVILLE', 'HARWOOD',
       'LOTHIAN', 'MAYO', 'WEST RIVER', 'EDEWATER', 'JAMES H HOLT PROP',
       'CHURCHTON', 'DAVIDSONVILE', 'DAVIDSONVILLE MD', 'ANNAPOLIS',
       'GAMBRILLS', 'CROWNSVILLE', 'MILLERSVILLE', 'ANNAP', 'CROFTON',
       'DAVDISONVILLE', nan, 'MD', 'RIVA MD', 'CRPWMSVILLE',
       'CROENSVILLE', 'MILLESVILLE', 'CROFOTN', 'SHERWOOD FOREST',
       'CROFTONLS', 'SEVERNA PARK', 'PASADENA', 'GLEN BURNIE', 'ARNOLD',
       'BALTIMORE', 'GLEN BUNRIE', 'WENGER PROPERTY', 'PASADEN',
       'SEVERNA APRK', 'PASADENA MD', 'SEVERN PARK', 'ANNNAPOLIS',
       'CARVELL BEACH', 'CARVEL BEACH', 'BALTIMORE MD',
       'CHESTNUT HILL COVE', 'CHESTNUTHILL COVE', 'CURTIS BAY',
       'ORCHARD BEACH', 'GLENBURNIE', 'GLEN BURNIE MD', 'GLEN BURNE',
       'GLEN BUNIE', 'GIBSON ISLAND', 'PASADEA', 'SEVERNA PARL',
       'RIVIERA BEACH', 'SEVERN', 'JESSUP', 'ODENTON', 'LAUREL',
       'HANOVER', 'HARMANS', 'COOK PROPERTY', 'SEVENN', 'SEVERN MD',
       'BROOKWOOD RUN', 'GRAMBRILLS', 'LARUEL', 'SEVEN', 'HARMONS',
       'HANOVR', 'ODENTON MD', 'ODENTTON', 'LINTHICUM', 'LINTHICUM HGTS',
       'BROOKLYN PARK', 'LINTHICUM HTS', 'LINTHICUM HEIGHTS', 'BROOKLYN',
       'LINTHIUM HEIGHTS', 'GLEN BURIE', 'EASTPORT', 'ANNAOOLIS',
       'SHADY SIDE', 'DEALE', 'SHADYSIDE', 'SHADY SIDE MD', 'CHRUCHTON',
       'TRACYS LANDING', 'OWINGS', 'FRIENDSHIP', 'DUNKIRK', 'DRURY',
       'BRISTOL', "TRACY'S LANDING", 'TRACEYS LANDING',
       'VERKLIN PROPERTY', 'TRACYS LANDIND', 'NORTH BEACH',
       'NORTH BEACH PARK', 'NORTH BEACH PK', 'ROSE HAVEN', 'LAREL'],
      dtype=object)

Saving the most frightening and troublesome one for last: there seem to be abundant typos and inconsistencies in the city entries (CRPWMSVILLE, CHRUCHTON, etc.). Though humorous, I really want to analyze how cities affect assessment values, so I'll need to manually fix the typos.

To help me identify and efficiently group city typos, I used this list sorter to look at the entries in alphabetical order, and identified values that referred to the same city by their common starting substring where possible. Otherwise I made explicit cases for each spelling variant.
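If hand-grouping prefixes like this feels too manual for your own data, one quick (and admittedly rough) alternative is to let Python's standard difflib suggest near-duplicates against a list of names you trust; the canonical list below is just a small illustrative sample:

# Surface likely typos by comparing each raw city string to a few known-good names
from difflib import get_close_matches

canonical = ['ANNAPOLIS', 'SEVERNA PARK', 'GLEN BURNIE', 'PASADENA', 'ODENTON', 'CROFTON']
for name in sorted(df['city'].dropna().unique()):
    match = get_close_matches(name, canonical, n=1, cutoff=0.8)
    if match and match[0] != name:
        print(f"{name!r} looks like {match[0]!r}")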

In [211]:
def fixcities(row):
  if not isinstance(row['city'], str):
    return np.nan
  elif row['city'].startswith('ANN'):
    return 'ANNAPOLIS'
  elif row['city'].startswith('ARN'):
    return 'ARNOLD'
  elif row['city'].startswith('BALT'):
    return 'BALTIMORE'
  elif row['city'].startswith('BROOKLYN'):
    return 'BROOKLYN PARK'
  elif row['city'].startswith('CAR'):
    return 'CARVEL BEACH'
  elif row['city'].startswith('CHES'):
    return 'CHESTNUT HILL COVE'
  elif row['city'].startswith('CHRUCH'):
    return 'CHURCHTON'
  elif row['city'].startswith('CROF'):
    return 'CROFTON'
  elif row['city'].startswith('CRPWMSVILLE') or row['city'].startswith('CROEN'):
    return 'CROWNSVILLE'
  elif row['city'].startswith('DAV'):
    return 'DAVIDSONVILLE'
  elif row['city'].startswith('EDE'):
    return 'EDGEWATER'
  elif row['city'].startswith('GLEN'):
    return 'GLEN BURNIE'
  elif row['city'].startswith('HAN'):
    return 'HANOVER'
  elif row['city'].startswith('HARM'):
    return 'HARMANS'
  # These aren't locations
  elif row['city'].startswith('JAMES H HOLT PROP') or row['city'].startswith('MD') or row['city'].startswith('WENGER') or row['city'].startswith('COOK'):
    return 'OTHER'
  elif row['city'].startswith('LA'):
    return 'LAUREL'
  elif row['city'].startswith('LIN'):
    return 'LINTHICUM'
  elif row['city'].startswith('MILL'):
    return 'MILLERSVILLE'
  elif row['city'].startswith('NORTH BEACH'):
    return 'NORTH BEACH'
  elif row['city'].startswith('ODE'):
    return 'ODENTON'
  elif row['city'].startswith('PAS'):
    return 'PASADENA'
  elif row['city'].startswith('RIVA MD'):
    return 'RIVA'
  elif row['city'].startswith('SEVERNA'):
    return 'SEVERNA PARK'
  elif row['city'].startswith('SEV') or row['city'].startswith('BROOKWOOD'):
    return 'SEVERN'
  elif row['city'].startswith('SHAD'):
    return 'SHADY SIDE'
  elif row['city'].startswith('TRAC'):
    return 'TRACYS LANDING'
  else:
    return row['city']
In [212]:
df['city'] = df.apply(lambda row: fixcities(row), axis=1)

Notice now we have a few empty city rows

In [182]:
print('Missing cities: ', len(df[(df['city'].isna())]))
Missing cities:  18

So let’s also drop those rows

In [183]:
df = df.dropna(subset=['city'])

And with that we now officially have tidied up our dataset:

Tidied Dataset

In [184]:
df.head()
Out[184]:
street_address long lat city base_land base_improvements base_assess_date curr_land curr_improvements curr_assess_date year_built struct_area rec_update_date curr_total base_total land_area
0 834 SELBY BLVD -76.517997 38.905957 EDGEWATER 184000.0 85300.0 2021-01-01 184000.0 89900.0 2021-01-01 1955 1260 2022-04-16 273900.0 269300.0 8000.0
1 3338 SOLOMONS ISLAND RD -76.571792 38.924369 EDGEWATER 135300.0 170900.0 2021-01-01 152800.0 180300.0 2021-01-01 1972 1372 2022-04-16 333100.0 306200.0 90169.2
2 991 GALESVILLE RD -76.542110 38.842798 GALESVILLE 359300.0 393400.0 2021-01-01 469300.0 411600.0 2021-01-01 2015 3347 2021-08-03 880900.0 752700.0 55321.0
3 3650 MUDDY CREEK RD -76.554680 38.909416 EDGEWATER 230400.0 405400.0 2021-01-01 250400.0 408600.0 2021-01-01 2019 3276 2021-08-03 659000.0 635800.0 139392.0
4 3638 MUDDY CREEK RD -76.554364 38.910233 EDGEWATER 237700.0 180600.0 2021-01-01 257700.0 182500.0 2021-01-01 1958 1880 2022-04-16 440200.0 418300.0 210830.4

This should be much better to work with than 221 columns with complete descriptions for column names!

Exploratory Data Analysis and Visualization

With data in hand for property values across the county, we can now plot the information and look for potential trends. In the following plots I'll focus on data and interactions that I believe are relevant to our objective or that may show interesting relationships that would otherwise be overlooked.

Boxplot of Assessment Value Distributions I

With the data available, the first thing I want to do is plot it all out on an actual map, but that raises a concern about dealing with major outliers. I'll go into more detail in the heatmapping section, but for now I want to check for any outliers and oddities.

To do so, we'll make a boxplot for our assessment values as the most direct way of identifying outliers.

In [185]:
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(12,10))

sns.boxplot(df['base_total'], ax=axs[0,0])
sns.boxplot(df['curr_total'], ax=axs[0,1])
sns.boxplot(df['base_land'], ax=axs[1,0])
sns.boxplot(df['base_improvements'], ax=axs[1,1])

axs[0,0].title.set_text('Property Values Distribution')
axs[0,0].set_xlabel("$")
axs[0,1].title.set_text('Property Values (Current Cycle)')
axs[0,1].set_xlabel("$")
axs[1,0].title.set_text('Property Land Values Distribution')
axs[1,0].set_xlabel("$")
axs[1,1].title.set_text('Property Improvements Values Distribution')
axs[1,1].set_xlabel("$")

plt.show()

Looking at these plots, the problem of outliers is blatantly apparent. The central 50% of the data is contained in the box, yet that box is completely shrunk down by some major outliers on the scale. Looking at the small proportion of data in this upper tail, we can make a good educated guess that these homes belong to particularly wealthy individuals and communities.

Apart from that, there's not too much else we can glean from these boxplots. It's interesting to be able to observe this small population present in the county, but it also makes it difficult to observe the rest of our distribution.

One last bit of tidying

With outliers disrupting the scale of our plots in mind, I'd like to have an additional set of plots without outliers to get a better understanding of the bulk of our data.

To do so, let's calculate the fences of the data and cut off all values outside them. In statistics, a fence is the boundary that distinguishes outliers from non-outliers; the inner fences conventionally sit 1.5 times the interquartile range (IQR = Q3 - Q1, the span of the boxes you saw) below Q1 and above Q3. Also, since I'm mainly interested in total value assessments, I'll only generate a fenced dataset with respect to the total values distribution.

In [186]:
def remove_outliers(data, col):
    Q3 = np.quantile(data[col], 0.75)
    Q1 = np.quantile(data[col], 0.25)
    IQR = Q3 - Q1
    
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR

    return data[(data[col] >= lower) & (data[col] <= upper)]
In [187]:
fenced_df = remove_outliers(df, 'base_total')

Boxplot of Assessment Value Distributions II

Now with outliers removed, let's see what the distribution looks like now.

In [188]:
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(12,10))

sns.boxplot(fenced_df['base_total'], ax=axs[0,0])
sns.boxplot(fenced_df['curr_total'], ax=axs[0,1])
sns.boxplot(fenced_df['base_land'], ax=axs[1,0])
sns.boxplot(fenced_df['base_improvements'], ax=axs[1,1])

axs[0,0].title.set_text('Property Values Distribution')
axs[0,0].set_xlabel("$")
axs[0,1].title.set_text('Property Values (Current Cycle)')
axs[0,1].set_xlabel("$")
axs[1,0].title.set_text('Property Land Values Distribution')
axs[1,0].set_xlabel("$")
axs[1,1].title.set_text('Property Improvements Values Distribution')
axs[1,1].set_xlabel("$")

plt.show()

The plots are a lot easier to read now! Looking at this distribution we can still see that high-value properties dominate the upper end of the data, but the median and general majority of property prices lie around $300,000 to $400,000.

The average home value in Maryland is currently $399,498 according to Zillow, so we're not too far off, though notably below that average. However, base prices don't account for cyclic and seasonal trends in the market, whereas the current cycle data does. With that adjustment, shown in the current cycle distribution, our IQR shifts upward a little, but the typical value still falls short of $400,000.

Though odd at first, this difference is mostly explained by the higher-value outliers of the set, which have a huge impact on the mean but none on the median. And considering that some residences reach into the next order of magnitude, topping $16 million, our data can still be considered representative.

In fact, we can take a look at the summary statistics explicitly to confirm this.

In [189]:
df['base_total'].describe()
Out[189]:
count    1.630640e+05
mean     4.053983e+05
std      2.450746e+05
min      2.250000e+04
25%      2.627000e+05
50%      3.466000e+05
75%      4.758000e+05
max      1.626550e+07
Name: base_total, dtype: float64

Looking at the summary (keep in mind this dataframe retains the outliers), we actually fall a bit above the Zillow average, passing $405,000 for this county, but with a drastically lower median the influence of the millionaires is quite apparent.

Heatmaps of Total Property Values in the county

Now that we've sorted out the outliers of the data, we can now generate heatmaps of property value that are reasonably scaled.

The max value for the heatmap would typically be the maximum value of the data, but since using the actual max would scale everything else down too drastically, I'll make use of the fenced data and use the max of the data with outliers removed.

In [190]:
aac_lat, aac_long = 38.9530, -76.5488
In [191]:
# Heatmap for base total assessment
heat_data = [[row['lat'],row['long'], row['base_total']] for index, row in df.iterrows()]

m = folium.Map([aac_lat, aac_long], tiles='OpenStreetMap', zoom_start=10)

HeatMap(heat_data,
        max_val = fenced_df['base_total'].max(),
        blur = 15,
        radius = 10,).add_to(folium.FeatureGroup(name='Heat Map').add_to(m))
folium.LayerControl().add_to(m)

m
base_total_hm

Looking at this map we get some interesting but not too surprising results. The majority of the high-value properties are concentrated in the northern part of the county, which makes sense knowing that Southern states tend to be more rural and Northern ones more urban; what's interesting is how this idea is reflected even within a small county like Anne Arundel. Maryland is considered a borderline state politically and geographically in the US, so it's interesting that the same trait can be observed at the county level in my home county.

Another thing to note is where exactly you can find these clusters. The three biggest ones I'd like to note, if you zoom in a bit, are Annapolis, Severna Park, and Edgewater. All three make sense, Annapolis being the state capital and the other two being locally regarded as high-income communities, but the heatmap makes the distinction much clearer and more apparent.
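If you want to double-check where those clusters sit, you can drop rough markers onto the same map object before displaying it again; the coordinates below are approximate city centers I looked up, not values from the dataset:

# Approximate markers for the three clusters discussed above
for name, m_lat, m_long in [('Annapolis', 38.978, -76.492),
                            ('Severna Park', 39.081, -76.545),
                            ('Edgewater', 38.957, -76.550)]:
    folium.Marker([m_lat, m_long], tooltip=name).add_to(m)
m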

Heatmaps of Property Land and Improvement Values in the county

With total property value mapped and analyzed, let's take a look at how each of the two components, land and improvement, vary from the total.

In [192]:
# Heatmap for base land assessment
heat_data = [[row['lat'],row['long'], row['base_land']] for index, row in df.iterrows()]

m = folium.Map([aac_lat, aac_long], tiles='OpenStreetMap', zoom_start=10)

HeatMap(heat_data,
        max_val = fenced_df['base_land'].max(),
        blur = 15,
        radius = 10,).add_to(folium.FeatureGroup(name='Heat Map').add_to(m))
folium.LayerControl().add_to(m)

m
base_land_hm

Looking at this map, we can see that the northern part of the county is much less crowded with high-value land, but what's incredibly interesting is that there still appear to be three major clusters of high-value land, and those happen to be Annapolis, Crofton, and Severna Park. I neglected to mention Crofton before, but it's also unsurprisingly regarded as a high-income area; what's interesting is how valued its land is compared to other similar communities, despite not being near the coast like Severna Park and Annapolis, which is a factor I personally suspect to be a positive influence on their land value. This more likely than not says more about the positive prospects of being part of the Crofton community, or of having the facilities and resources that come with living near the area, which you wouldn't be able to tell from total value alone.

In [193]:
# Heatmap for base improvements assessment
heat_data = [[row['lat'],row['long'], row['base_improvements']] for index, row in df.iterrows()]

m = folium.Map([aac_lat, aac_long], tiles='OpenStreetMap', zoom_start=10)

HeatMap(heat_data,
        max_val = fenced_df['base_improvements'].max(),
        blur = 15,
        radius = 10,).add_to(folium.FeatureGroup(name='Heat Map').add_to(m))
folium.LayerControl().add_to(m)

m
base_improv_hm

Looking at improvement values instead, the distribution seems a bit more spread out, but still largely confined to the northern part of the county. The three cities from before still show high values in their improvements, but the Glen Burnie and Severn areas also now show high improvement values, which was a little unexpected for me. Looking closer, this seems to be the product of smaller communities in those areas that have a few houses with notably high improvement values.

Unlike land, there's a lot more individual agency when it comes to changing a property's improvement value, and the fact that smaller neighborhoods and groups of houses seem to have more distinct impacts on improvement value here supports that notion.

Heatmap of Property Land Areas

Before looking at the direct relationships between specific property attributes, I also want to see how land area plays into property value. Before I map this, though, I will be fencing the data, since as you can see below outliers are also prevalent here.

In [194]:
sns.boxplot(fenced_df['land_area'])
Out[194]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f13a96ae1d0>
In [195]:
fenced_df2 = remove_outliers(df, 'land_area')
In [196]:
# Heatmap for land area
heat_data = [[row['lat'],row['long'], row['land_area']] for index, row in df.iterrows()]

m = folium.Map([aac_lat, aac_long], tiles='OpenStreetMap', zoom_start=10)

HeatMap(heat_data,
        max_val = fenced_df2['land_area'].max(),
        blur = 15,
        radius = 10,).add_to(folium.FeatureGroup(name='Heat Map').add_to(m))
folium.LayerControl().add_to(m)

m
land_area_hm

This is where some of the follies of heatmapping this type of data come in. As we can see, this is probably the most evenly distributed spread of values we've seen thus far, mostly since land area is unlikely to differ substantially unless you have both very urban and very rural communities together. We still see a greater density in the heatmap up north, but zooming in you can see that this is almost certainly due to the sheer density of properties in the north, correlating with a higher population.

Most of the blocks up north are more or less blue when you zoom in, but the south shows some neighborhoods with larger lots that support the notion of a more rural area, though not by much, which is to be expected in as compact a region as a county.

Values of Property by Year Built

Now let's look at some more direct relationships between our column attributes and property value, starting with how the year a property was built affects its value today.

I'll note beforehand that you may see fenced versions of the dataframe used in plots from here on without elaboration; this is usually done to avoid the extreme scaling problems that come from having a few multimillion-dollar homes.

In [197]:
fig, ax = plt.subplots(figsize=(10,6))
sns.lineplot(x='year_built', y='base_total', data=fenced_df)
ax.title.set_text('Property Values vs Time Built')
plt.show()

The relationship between construction year and property value reveals something a bit unexpected and not immediately easy to understand.

Values for houses built prior to the 1900s fluctuate wildly across the board, with those built at or before the early 1800s consistently valued higher.

This is more or less explainable by the quality of older architecture, and by buildings constructed much earlier tending to sit in more historically important or long-standing communities.

The major dip starting around 1900 is a clear display of the economic boom and depression that came with the 1920s, the introduction of credit, and the fun concept of total bank collapse, as well as the effects of active world war. With that context the drop makes sense: houses constructed during such a tumultuous time were likely built not with frivolousness or extravagance but with pragmatism and affordability in mind, given global circumstances at the time and a larger, more general market of buyers thanks to credit.

That this drop in value is tied to the costs and pressures of war is also cemented by the fast rise in the values of properties built after the end of WWII, as home values seem to have increased steadily year over year, which reflects both an improving market and, likely, the growing scarcity of space for homes.

In [198]:
fig, axs = plt.subplots(nrows = 2, figsize=(10,10))

sns.lineplot(x='year_built', y='base_land', data=fenced_df, ax=axs[0])
axs[0].title.set_text('Property Land Values vs Time Built')

sns.lineplot(x='year_built', y='base_improvements', data=fenced_df, ax=axs[1])
axs[1].title.set_text('Property Improvement Values vs Time Built')

plt.show()

Splitting the values into improvements and land is very enlightening regarding the influencing factors I suspected above. The major spike in improvements, while land value takes a slow decline (until recently) starting post-WWII, shows that the larger factor behind the growing prices of newer properties lies mainly in their construction.

With the higher overall land values prior to 1950 in mind too, this shows an interesting relationship: older properties are valued for their antiquity (or for size, as a product of being older), while newer properties are valued for being better built, or for lending themselves to better upkeep thanks to the relative recentness of their construction (older buildings should be expected to decline in overall quality from aging, after all).

Relationship of Land and Structure Area to Property Value

The relationship of land and structure area to a property's value also interested me.

In [199]:
fig, axs = plt.subplots(nrows = 2, figsize=(10,10))

sns.lineplot(x='land_area', y='base_land', data=fenced_df2, ax=axs[0])
axs[0].title.set_text('Property Land Values vs Land Area (sq ft)')

sns.lineplot(x='struct_area', y='base_improvements', data=fenced_df2, ax=axs[1])
axs[1].title.set_text('Property Improvement Values vs Building Area (sq ft)')

plt.show()

Plotting these relationships out was definitely not what I expected, especially for land area. In hindsight, the wide range of values on the x-axis, which reaches nearly 160,000 square feet, may not be ideal for the plot, but there's still plenty of information to be gleaned here.

For property land values, there seems to be a minor positive linear relationship with land area, albeit one that's almost negligible for many of the cases plotted, as most lie near the bottom of the land value range (and to those worried that zero-value cases weren't accounted for: they were; there's simply that much homogeneity on the land value side of the evaluations, which the heatmap supports as well, with only three small clusters visible in an otherwise similar value range).

More interesting is the very clear and distinct relationship of structure area to improvement value. It tapers off and gets noisier at very large areas, where wealthy owners (and possibly rural properties, depending on how structure area was determined) dominate, but otherwise this is a definite positive linear relationship, confirming the relatively straightforward idea that a bigger house costs more.

It's odd to see how much more impact buildings have on a property's value than land does, relative to their areas, despite land contributing a comparable share of the total value in most cases.

Average Property Value Across Cities

The last factor I want to observe is the relationship between a city and its property values. For this I'll be keeping the outliers, since some small, wealthy communities are mixed in with the cities too.

In [213]:
fig, ax = plt.subplots(figsize=(18,10))
sns.barplot(x="city", y="base_total", data=df)
plt.title("Average Property Value Across AAC Cities")
plt.xticks(rotation=90)
plt.show()

I'm still not sure whether having small independent communities as separate city entries is best for reading the data, but this plot gives us some insight into how property values vary across cities.

Gibson Island and Other show themselves to be the most notorious of the small unincorporated communities in this plot, which makes sense when looking into them. Gibson Island is an island on the Chesapeake Bay, so its property values are particularly distinct and cater to a wealthier audience as a disjoint part of the overall area. Other represents entries that don't have a city in their label but a real estate agency or property name instead. Such cases appear to be a product of more independent transactions I'm not familiar with, but they generally seem to involve particularly high-income individuals, so I'll set them aside as such.

Looking at the other cities, it's hard to see a very strong trend in the data, but we do see some cities, like Baltimore or Severna Park, whose property values correlate with the incomes expected in those cities.
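Since the bar heights here are means and therefore sensitive to those outliers, a complementary check is to rank cities by their median assessment instead:

# Median base total assessment per city, highest first
city_medians = df.groupby('city')['base_total'].median().sort_values(ascending=False)
print(city_medians.head(10))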

The city observations in particular make the idea of looking into other factors related to a property's environment very compelling to account for in the future.

Predicting Value

With the plots above, we've gotten an idea about the relationships of these factors on the value of a property, so let's see if they actually hold true by creating a regression model for them.

For our model, we want to account for all the factors we've analyzed in the previous section so our X values will be the following:

  • Year Built
  • Land Area
  • Struct Area
  • City

To generate this model we'll be using statsmodels (imported at the beginning as sm) along with pandas' get_dummies; please see the statsmodels user guide if you want to learn more for yourself!

In [201]:
# Each city must be represented numerically, so we one-hot encode the city column into binary indicator (dummy) columns
lm_df = pd.get_dummies(df, prefix = 'city_is', columns = ['city'], drop_first = False)
In [202]:
dummy_list = ['city_is_' + s for s in df['city'].unique().tolist()]
In [203]:
# These are our X values (a constant plus year built, land area, structure area, and the city dummies)
ind = sm.add_constant(lm_df[['year_built', 'land_area', 'struct_area'] + dummy_list].to_numpy())

# This is our y
dep = lm_df['base_total'].to_numpy()
In [204]:
model = sm.OLS(dep, ind).fit()

And now we have our model, let's look at the summary for it:

In [205]:
print(model.summary(xname = ['const', 'year_built', 'land_area', 'struct_area'] + dummy_list))
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.664
Model:                            OLS   Adj. R-squared:                  0.664
Method:                 Least Squares   F-statistic:                     6330.
Date:                Mon, 16 May 2022   Prob (F-statistic):               0.00
Time:                        19:43:38   Log-Likelihood:            -2.1659e+06
No. Observations:              163064   AIC:                         4.332e+06
Df Residuals:                  163012   BIC:                         4.332e+06
Df Model:                          51                                         
Covariance Type:            nonrobust                                         
==============================================================================================
                                 coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------
const                       1.385e+06   3.24e+04     42.715      0.000    1.32e+06    1.45e+06
year_built                  -687.3995     16.264    -42.265      0.000    -719.276    -655.523
land_area                      0.0100      0.001      7.821      0.000       0.007       0.013
struct_area                  193.1847      0.439    439.698      0.000     192.324     194.046
city_is_EDGEWATER           4.511e+04   8699.083      5.185      0.000    2.81e+04    6.22e+04
city_is_GALESVILLE          2.018e+04   1.34e+04      1.503      0.133   -6141.946    4.65e+04
city_is_RIVA                6.769e+04   9347.409      7.241      0.000    4.94e+04     8.6e+04
city_is_DAVIDSONVILLE       4.491e+04   8977.640      5.002      0.000    2.73e+04    6.25e+04
city_is_HARWOOD             2.051e+04   9945.393      2.062      0.039    1012.904       4e+04
city_is_LOTHIAN             -3.58e+04   9501.763     -3.768      0.000   -5.44e+04   -1.72e+04
city_is_MAYO                 1.68e+04   1.25e+04      1.345      0.179   -7687.454    4.13e+04
city_is_WEST RIVER          2.388e+04   1.01e+04      2.364      0.018    4078.765    4.37e+04
city_is_OTHER               4.901e+04   7.01e+04      0.699      0.484   -8.83e+04    1.86e+05
city_is_CHURCHTON          -3770.5060   9476.840     -0.398      0.691   -2.23e+04    1.48e+04
city_is_ANNAPOLIS           1.149e+05   8591.312     13.374      0.000    9.81e+04    1.32e+05
city_is_GAMBRILLS           4888.3172   8838.423      0.553      0.580   -1.24e+04    2.22e+04
city_is_CROWNSVILLE         4.427e+04   8890.699      4.979      0.000    2.68e+04    6.17e+04
city_is_MILLERSVILLE        2.322e+04   8757.295      2.651      0.008    6051.748    4.04e+04
city_is_CROFTON             2.202e+04   8753.208      2.515      0.012    4860.699    3.92e+04
city_is_SHERWOOD FOREST     3.449e+05   9.87e+04      3.493      0.000    1.51e+05    5.38e+05
city_is_SEVERNA PARK        1.196e+05   8667.224     13.793      0.000    1.03e+05    1.37e+05
city_is_PASADENA            1.674e+04   8609.561      1.944      0.052    -135.659    3.36e+04
city_is_GLEN BURNIE        -5.141e+04   8604.859     -5.974      0.000   -6.83e+04   -3.45e+04
city_is_ARNOLD              7.964e+04   8719.551      9.133      0.000    6.25e+04    9.67e+04
city_is_BALTIMORE          -9.063e+04   8734.498    -10.377      0.000   -1.08e+05   -7.35e+04
city_is_SEVERN             -4.421e+04   8687.042     -5.089      0.000   -6.12e+04   -2.72e+04
city_is_CARVEL BEACH        5183.1298   2.61e+04      0.199      0.842   -4.59e+04    5.62e+04
city_is_CHESTNUT HILL COVE  -1.96e+04    2.6e+04     -0.752      0.452   -7.07e+04    3.15e+04
city_is_CURTIS BAY         -1.645e+04   8.08e+04     -0.204      0.839   -1.75e+05    1.42e+05
city_is_ORCHARD BEACH       2.237e+04   7.01e+04      0.319      0.750   -1.15e+05     1.6e+05
city_is_GIBSON ISLAND       1.199e+06   1.31e+04     91.183      0.000    1.17e+06    1.22e+06
city_is_RIVIERA BEACH      -3.226e+04   1.39e+05     -0.231      0.817   -3.05e+05    2.41e+05
city_is_JESSUP              1986.3203   1.05e+04      0.189      0.850   -1.86e+04    2.26e+04
city_is_ODENTON            -1.255e+04   8702.868     -1.442      0.149   -2.96e+04    4508.617
city_is_LAUREL             -3.614e+04   8841.306     -4.088      0.000   -5.35e+04   -1.88e+04
city_is_HANOVER            -2.816e+04   8796.511     -3.202      0.001   -4.54e+04   -1.09e+04
city_is_HARMANS            -3.676e+04   1.45e+04     -2.533      0.011   -6.52e+04   -8318.548
city_is_COOK PROPERTY      -1.761e+05   1.39e+05     -1.263      0.207   -4.49e+05    9.72e+04
city_is_GRAMBRILLS         -2.785e+04   1.39e+05     -0.200      0.842   -3.01e+05    2.45e+05
city_is_LINTHICUM          -6.375e+04   8866.574     -7.190      0.000   -8.11e+04   -4.64e+04
city_is_BROOKLYN PARK      -1.278e+05   2.91e+04     -4.390      0.000   -1.85e+05   -7.07e+04
city_is_EASTPORT           -1.023e+05   9.87e+04     -1.036      0.300   -2.96e+05    9.13e+04
city_is_SHADY SIDE         -1.589e+04   9127.379     -1.741      0.082   -3.38e+04    1997.539
city_is_DEALE               -1.11e+04   9652.207     -1.150      0.250      -3e+04    7818.483
city_is_TRACYS LANDING      9298.4339   1.05e+04      0.884      0.377   -1.13e+04    2.99e+04
city_is_OWINGS              -2.44e+04   2.97e+04     -0.823      0.411   -8.25e+04    3.37e+04
city_is_FRIENDSHIP         -4.674e+04   1.24e+04     -3.758      0.000   -7.11e+04   -2.24e+04
city_is_DUNKIRK            -3.775e+04   1.19e+04     -3.177      0.001    -6.1e+04   -1.45e+04
city_is_DRURY               7.125e+04   1.39e+05      0.511      0.609   -2.02e+05    3.44e+05
city_is_BRISTOL             8.433e+04   1.39e+05      0.605      0.545   -1.89e+05    3.58e+05
city_is_VERKLIN PROPERTY    1.946e+04   1.39e+05      0.140      0.889   -2.54e+05    2.93e+05
city_is_NORTH BEACH         1.444e+04   1.09e+04      1.324      0.185   -6928.325    3.58e+04
city_is_ROSE HAVEN         -5.936e+04   1.39e+05     -0.426      0.670   -3.33e+05    2.14e+05
==============================================================================
Omnibus:                   177310.406   Durbin-Watson:                   1.265
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         36357366.507
Skew:                           5.215   Prob(JB):                         0.00
Kurtosis:                      75.404   Cond. No.                     4.63e+15
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 5.81e-16. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

This is a lot of information to take in at once, but keep in mind it's fluffed up by each individual city being listed.

There are a few key takeaways I grasped immediately from this. One is how little impact land area appears to have: with a coefficient of about 0.01 (roughly a cent per square foot), it has a minuscule positive influence, as we surmised from the analysis. Structure area, on the other hand, has a pretty strong positive influence, as we also assumed, with a coefficient of over 190 dollars per square foot, and both it and land area have P>|t| values that round to 0.000, meaning these coefficients are estimated with very high confidence!

Year built and a few of the cities also have very low P>|t| values (we'll look at time more in the actual graph later), but a good number, if not most, of the cities have pretty poor confidence values. This is likely a product of low sample sizes, as it appears to be mostly limited to smaller communities I don't recognize, while places like Laurel, Hanover, Crofton, and many other larger cities maintain P values near 0 and show their own varying coefficients for how they impact value assessment (e.g. Crofton has a positive coefficient and was seen having higher values in our heatmap).
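One more note on the summary: the "smallest eigenvalue" warning at the bottom is at least partly a product of including a dummy for every city alongside a constant term (the classic dummy variable trap, where the dummies sum to the constant). A sketch of one way you could address it, which I haven't re-run here, is to let get_dummies drop one category so the remaining city coefficients are read relative to that baseline city:

# Possible fix for the multicollinearity warning: drop one city category as the baseline
lm_df2 = pd.get_dummies(df, prefix='city_is', columns=['city'], drop_first=True)
dummy_list2 = [c for c in lm_df2.columns if c.startswith('city_is_')]
ind2 = sm.add_constant(lm_df2[['year_built', 'land_area', 'struct_area'] + dummy_list2].to_numpy())
model2 = sm.OLS(lm_df2['base_total'].to_numpy(), ind2).fit()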

There's still lots that's hard to grasp from this summary alone though, so let's plot it out and see what we get!

In [206]:
pred = model.predict(ind).reshape(1, dep.size)
In [207]:
df['prediction'] = pred[0]
In [208]:
pred_df = remove_outliers(df[['base_total', 'prediction', 'year_built']], 'base_total')

pred_df = pred_df.melt('year_built', var_name="Source",value_name="Property Value")
In [209]:
pred_df.head()
Out[209]:
year_built Source Property Value
0 1955 base_total 269300.0
1 1972 base_total 306200.0
2 2015 base_total 752700.0
3 2019 base_total 635800.0
4 1958 base_total 418300.0
In [210]:
fig, ax = plt.subplots(figsize=(10,6))

sns.lineplot(data=pred_df,x='year_built', y='Property Value', hue='Source')
ax.title.set_text('Property Values vs Time Built (Prediction vs Actual)')
plt.show()

Looking at this comparison between our prediction and the actual base values, the model appears pretty impressive in its accuracy, to an extent. There's nearly indistinguishable overlap for values past 1950.

There's definitely a lot of fluctuation before the late 1800s, though, and especially before 1800. This is most easily explained by the smaller number of houses remaining from that time, but the variance in the quality of a home and property that comes with age is certainly also a factor in how chaotic the plot gets.
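To put a rough number on that pre-1900 fluctuation, one simple check is to look at the model's typical error by decade of construction:

# Median absolute prediction error, grouped by decade built
resid = (df['base_total'] - df['prediction']).abs()
decade = (df['year_built'] // 10) * 10
print(resid.groupby(decade).median())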

Conclusion and Future Work

Even if I'm not very knowledgeable about the world of real estate, I learned about what factors go into how properties are assessed, and how strongly and in what ways some of them interact with the value of a home. Land area turned out to be much less relevant than I thought, and trends in construction year and cities bring up even more interesting ideas and possibilities.

If this tutorial caught your interest in real estate or the data science process, there's still lots left in this dataset for you to explore yourself. This tutorial only focused on Anne Arundel County in Maryland, but the whole statewide dataset is available for use. You could even look into collating datasets from other states and gain insights into larger-scale factors affecting property value. Additionally, I mentioned various columns of information that I chose to look past for this tutorial, like cycle data, construction information, and various tax-related evaluations, which are available for you to look into yourself. This tutorial was written mainly as a way to guide others through the world of data analysis and to offer a more layman's look into real estate. There's lots of data here and elsewhere for you to explore, so if nothing else I hope this has been a compelling preview into these worlds for you.