Buying a home seems to have grown harder in recent years, and as someone just entering the adult world, that scares me: I hardly know anything about real estate, nor am I equipped to deal with it.
With that said, what better way to prepare myself than to use data science skills to model and analyze trends in property value?
The objective of this project is to analyze public assessments of real estate value in Anne Arundel County, Maryland, using reports from 2021 to 2022, and hopefully to uncover and dissect trends and insights in the market. We will see how concrete factors like location, size, and other attributes of a plot contribute to its assessed value, and how we, as potential laymen to the topic, can use this information to understand facets of the market and real estate as a whole.
To be clear up front: I'm not highly knowledgeable about properties or real estate, but I'll do my best to explain the terms and aspects of properties we interact with throughout this tutorial, with the help of plenty of resources and research. For those unfamiliar with the world of real estate, I hope this provides an interesting and accessible lens into how properties are assessed. For those who are familiar, I hope I can offer some interesting insights through this exploratory analysis and data-fitting.
And to both parties I’ll do what I can to show the steps you can take to analyze data like this yourself.
For this analysis, we’ll be using Python through Google Colab along with the imported libraries below:
# libraries to import
from google.colab import drive # For mounting drive files
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from glob import glob
import folium
from folium import plugins
from folium.plugins import HeatMap
from sklearn import linear_model
import statsmodels.api as sm
import statistics as stat
import warnings
warnings.filterwarnings("ignore")
plt.style.use('seaborn')
Keep in mind that if you are using Python through Jupyter or are using an alternative method of importing data as opposed to Google Drive, importing ‘drive’ from ‘google.colab’ will not be necessary.
The dataset we’ll be using for this is the Maryland Real Property Assessments data available through Maryland’s official Data portal.
Though the dataset is already ready to be read in as a dataframe, it's an enormous file with 2.44 million rows and 221 columns! To work with the data more easily without crashing my computer or the Colab client, I preprocessed and filtered it using Excel, narrowing the dataset down to only properties in Anne Arundel County, and of those, only residential single-dwelling homes.
After doing that, I uploaded the smaller file into a Drive folder dedicated to the project accessible here (It was still too large to upload to GitHub) and mounted my Colab notebook to access it.
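As an aside, the same prefiltering could be done in pandas by reading the giant CSV in chunks instead of using Excel. This is only a sketch: the three column names are the ones I comment out in the column-selection code further down, but the exact labels stored under the county and land use columns are assumptions on my part, and the input file name is a placeholder.
# Optional sketch: prefilter the full statewide CSV in pandas instead of Excel
county_col = 'County Name (MDP Field: CNTYNAME)'
land_use_col = 'Land Use Code (MDP Field: LU/DESCLU. SDAT Field #50)'
units_col = 'C.A.M.A. SYSTEM DATA: Number of Dwelling Units (MDP Field: BLDG_UNITS. SDAT Field #239)'
filtered_chunks = []
# Read the 2.44-million-row file in pieces so it never has to fit in memory all at once
for chunk in pd.read_csv('maryland_real_property_assessments.csv', chunksize=100_000):
    mask = (
        chunk[county_col].astype(str).str.contains('ANNE ARUNDEL', case=False, na=False)
        & chunk[land_use_col].astype(str).str.contains('RESIDENTIAL', case=False, na=False)  # assumed label
        & (pd.to_numeric(chunk[units_col], errors='coerce') == 1)  # single-dwelling homes
    )
    filtered_chunks.append(chunk[mask])
pd.concat(filtered_chunks, ignore_index=True).to_csv(
    'Anne Arundel County Maryland Residential Property Assessments.csv', index=False)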
# Mounting my google drive where I keep a copy of the prefiltered dataset
drive.mount('/content/drive')
# This is the file path with respect to my own Drive; make sure to change it to fit your own CSV importing process!
path = "/content/drive/MyDrive/Tutorial/Data/Anne Arundel County Maryland Residential Property Assessments.csv"
base_df = pd.read_csv(path)
Here’s the dataset:
base_df.head()
len(base_df.index)
base_df.dtypes
As you can see, it's still pretty huge: more than 160,000 rows, each with those 221 column values. Keep in mind that this dataset contains a lot of information about taxes and various other details that we won't account for in this analysis, whether for lack of relevance or because of the far greater complexity and length they would bring to this tutorial. So before we can discuss the dataset, we should trim it down.
To start tidying our data, let’s decide what information we want to look at and keep columns that pertain to that.
For this tutorial's analysis we'll be looking at four general attributes, enough to keep things easy to understand but still hopefully enough to uncover several interesting trends: the year a property was built, its land area, its structure area, and the city it belongs to, all in relation to its assessed value.
We’ll also want information like coordinates and an address line in order to visualize the data and to reference specific entries when needed.
With those in mind, we’ll trim the table down to the following columns:
# Keeping only the columns we plan to use for our dataframe or for tidying it
df = base_df[[
# For unique identification if I ever need it
'LEGAL DESCRIPTION: Line 2 (MDP Field: LEGAL2. SDAT Field #18)',
# Location data
'MDP Longitude (MDP Field: DIGXCORD converted to WGS84)',
'MDP Latitude (MDP Field: DIGYCORD converted to WGS84)',
'PREMISE ADDRESS: City (MDP Field: PREMCITY. SDAT Field #25)',
# Base cycle: assessment independent of current market cycle
'BASE CYCLE DATA: Land Value (SDAT Field #154)',
'BASE CYCLE DATA: Improvements Value (SDAT Field #155)',
'BASE CYCLE DATA: Preferential Land Value (SDAT Field #156)',
'BASE CYCLE DATA: Date Assessed (YYYY.MM) (SDAT Field #158)',
# Base cycle: assessment of current market cycle
'CURRENT CYCLE DATA: Land Value (MDP Field Names: NFMLNDVL, CURLNDVL, and SALLNDVL. SDAT Field #164)',
'CURRENT CYCLE DATA: Improvements Value (MDP Field Names: NFMIMPVL, CURIMPVL, and SALIMPVL. SDAT Field #165)',
'CURRENT CYCLE DATA: Preferential Land Value (SDAT Field #166)',
'CURRENT CYCLE DATA: Date Assessed (YYYY.MM) (MDP Field: LASTASSD. SDAT Field #169)',
# Data regarding physical premise
'C.A.M.A. SYSTEM DATA: Year Built (YYYY) (MDP Field: YEARBLT. SDAT Field #235)',
'C.A.M.A. SYSTEM DATA: Structure Area (Sq.Ft.) (MDP Field: SQFTSTRC. SDAT Field #241)',
'C.A.M.A. SYSTEM DATA: Land Area (MDP Field: LANDAREA. SDAT Field #242)',
# A for acres, S for square feet, determined after looking at the linked webpages for several addresses
'C.A.M.A. SYSTEM DATA: Land Unit of Measure (MDP Field: LUOM. SDAT Field #243)',
# 'Homestead Qualification Date (MDP Field: HOMQLDAT. SDAT Field #260)',
'Last Activity Date (YYYY.MM.DD) (SDAT Field #392)',
'Record Deletion Date (YYYY.MM.DD) (SDAT Field #398)',
'Assessment Cycle Year (SDAT Field #399)',
# These three columns were already filtered in this version of the csv, but feel free to use these if you decide to import the whole dataset
#'County Name (MDP Field: CNTYNAME)',
#'Land Use Code (MDP Field: LU/DESCLU. SDAT Field #50)',
#'C.A.M.A. SYSTEM DATA: Number of Dwelling Units (MDP Field: BLDG_UNITS. SDAT Field #239)',
]]
Now, if you noticed above or looked through the original dataset, there are actually numerous groups of columns that refer to the monetary value of a property, and they usually include the following:
Land Value: the value of the land itself, without any buildings or manmade features
Improvement Value: the value of the manmade features, which usually means the house(s) or building(s) on the land
Preferential Land Value: a value applied to agricultural land that is added to the total value to account for the land's usage. We'll still import this value to see if it's in use for our entries, but we don't expect it to be relevant for residential houses.
Circuit Breaker Value: the total value of the property, adjusted in accordance with Maryland's Circuit Breaker program, which reduces property taxes for those who would have trouble affording them (you can learn more about this here). Since I'm looking purely at the value of the property itself and not the tax situations of the homeowners, this value isn't important for this analysis.
Each of these values are also distributed across the following groups:
Sales segments 1-3: Sales segments usually refer to sales for different markets (like selling to a residential market as opposed to a commercial one), but in this data they also appear to record previous transactions. Because data for these segments is highly inconsistent and better alternatives are available, these groups will be skipped.
Previous Year Assessments: Exactly what it sounds like; interesting to consider, but not something we'll use over such a limited time span.
Current cycle: Values that account for the current market cycle. Cycles in this context refer to the phases of the real estate market (conventionally recovery, expansion, hyper supply, and recession). Since some of this data is assessed at different times, we won't rely on it heavily; it's more variable and reflects current market shifts rather than a more objective assessment of the property.
Base cycle: The assessment independent of current cycle trends, the base value to work from. Our assessment data will focus on these values.
With those explained, let's continue cleaning up this data by renaming these columns to simpler, more relevant names so they don't clutter our code or tables.
# renaming columns
df = df.rename(columns={
'LEGAL DESCRIPTION: Line 2 (MDP Field: LEGAL2. SDAT Field #18)': 'street_address',
'MDP Longitude (MDP Field: DIGXCORD converted to WGS84)': 'long',
'MDP Latitude (MDP Field: DIGYCORD converted to WGS84)': 'lat',
'PREMISE ADDRESS: City (MDP Field: PREMCITY. SDAT Field #25)': 'city',
'BASE CYCLE DATA: Land Value (SDAT Field #154)': 'base_land',
'BASE CYCLE DATA: Improvements Value (SDAT Field #155)': 'base_improvements',
'BASE CYCLE DATA: Preferential Land Value (SDAT Field #156)': 'base_preferential',
'BASE CYCLE DATA: Date Assessed (YYYY.MM) (SDAT Field #158)': 'base_assess_date',
'CURRENT CYCLE DATA: Land Value (MDP Field Names: NFMLNDVL, CURLNDVL, and SALLNDVL. SDAT Field #164)': 'curr_land',
'CURRENT CYCLE DATA: Improvements Value (MDP Field Names: NFMIMPVL, CURIMPVL, and SALIMPVL. SDAT Field #165)': 'curr_improvements',
'CURRENT CYCLE DATA: Preferential Land Value (SDAT Field #166)': 'curr_preferential',
'CURRENT CYCLE DATA: Date Assessed (YYYY.MM) (MDP Field: LASTASSD. SDAT Field #169)': 'curr_assess_date',
'CURRENT ASSESSMENT YEAR: Total Assessment (SDAT Field #172)': 'curr_year_total',
'CURRENT ASSESSMENT YEAR: Circuit Breaker Assessment (SDAT Field #176)': 'curr_year_cb',
'C.A.M.A. SYSTEM DATA: Year Built (YYYY) (MDP Field: YEARBLT. SDAT Field #235)': 'year_built',
'C.A.M.A. SYSTEM DATA: Structure Area (Sq.Ft.) (MDP Field: SQFTSTRC. SDAT Field #241)': 'struct_area',
'C.A.M.A. SYSTEM DATA: Land Area (MDP Field: LANDAREA. SDAT Field #242)': 'land_area_multityped',
'C.A.M.A. SYSTEM DATA: Land Unit of Measure (MDP Field: LUOM. SDAT Field #243)': 'land_area_type',
'Last Activity Date (YYYY.MM.DD) (SDAT Field #392)': 'rec_update_date',
'Record Deletion Date (YYYY.MM.DD) (SDAT Field #398)': 'rec_delete_date',
'Assessment Cycle Year (SDAT Field #399)': 'assessment_year'
})
The table is a lot more legible and manageable now, but we’re still not done. Let's go through these columns to make sure they are relevant and aren’t missing values.
print('Missing address: ', len(df[df['street_address'].isna()]))
print('Missing long: ', len(df[(df['long'].isna()) | (df['long'] == 0)]))
print('Missing lat: ', len(df[(df['lat'].isna()) | (df['lat'] == 0)]))
Here we see some rows are missing coordinate values. Since coordinates are essential to mapping the data and 164 rows aren't even a percent of the total data size, we can remove these entries without issue.
df = df.dropna(subset=['long', 'lat'], how='any')
df = df[(df['long'] != 0) & (df['lat'] != 0)]
print('Missing base land $: ', len(df[(df['base_land'].isna()) | (df['base_land'] == 0)]))
print('Missing base improvements $: ', len(df[(df['base_improvements'].isna()) | (df['base_improvements'] == 0)]))
print('Missing/no base preferential $: ', len(df[(df['base_preferential'].isna()) | (df['base_preferential'] == 0)]))
print('Missing base assessment date: ', len(df[(df['base_assess_date'].isna()) | (df['base_assess_date'] == 0)]))
print('Missing current land $: ', len(df[(df['curr_land'].isna()) | (df['curr_land'] == 0)]))
print('Missing current improvements $: ', len(df[(df['curr_improvements'].isna()) | (df['curr_improvements'] == 0)]))
print('Missing/no current preferential $: ', len(df[(df['curr_preferential'].isna()) | (df['curr_preferential'] == 0)]))
print('Missing curr assessment date: ', len(df[(df['curr_assess_date'].isna()) | (df['curr_assess_date'] == 0)]))
Assessment values are integral to our analysis, so dropping these few rows with missing entries is in our best interest.
Also note that no rows have a preferential land value, so we can safely drop those columns.
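We can quickly verify that claim before dropping anything:
# True for each column if every remaining row has no preferential value (NaN or 0)
print(df[['base_preferential', 'curr_preferential']].fillna(0).eq(0).all())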
df = df.dropna(subset=['base_land', 'base_improvements', 'curr_land', 'curr_improvements', 'base_assess_date'], how='any')
df = df[(df['base_land'] != 0) & (df['base_improvements'] != 0) & (df['curr_land'] != 0) & (df['curr_improvements'] != 0)]
print('Missing base land $: ', len(df[(df['base_land'].isna()) | (df['base_land'] == 0)]))
print('Missing base improvements $: ', len(df[(df['base_improvements'].isna()) | (df['base_improvements'] == 0)]))
print('Missing current land $: ', len(df[(df['curr_land'].isna()) | (df['curr_land'] == 0)]))
print('Missing current improvements $: ', len(df[(df['curr_improvements'].isna()) | (df['curr_improvements'] == 0)]))
df = df.drop(['base_preferential', 'curr_preferential'], axis=1)
Now that we've tidied the assessment data, we can also add total property values as columns, which will be the main focus of our analysis.
df['curr_total'] = df.apply(lambda row: row['curr_land'] + row['curr_improvements'], axis=1)
df['base_total'] = df.apply(lambda row: row['base_land'] + row['base_improvements'], axis=1)
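As a side note, the same totals can be computed without apply() by adding the columns directly, which is vectorized and typically much faster on a dataframe of this size:
# Vectorized equivalent of the two apply() calls above
df['curr_total'] = df['curr_land'] + df['curr_improvements']
df['base_total'] = df['base_land'] + df['base_improvements']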
While we're here, let's also convert the assessment dates into a datetime format.
# YYYY.MM got recognized as a float, so a month like .10 can lose its trailing zero
# (2019.10 becomes 2019.1); format to two decimals before parsing to keep the month intact
df['base_assess_date'] = pd.to_datetime(df['base_assess_date'].map('{:.2f}'.format), format='%Y.%m', errors='coerce')
df['curr_assess_date'] = pd.to_datetime(df['curr_assess_date'].map('{:.2f}'.format), format='%Y.%m', errors='coerce')
print('Missing year built: ', len(df[(df['year_built'].isna()) | (df['year_built'] == 0)]))
print('Missing struct area: ', len(df[(df['struct_area'].isna()) | (df['struct_area'] == 0)]))
print('Missing land area: ', len(df[(df['land_area_multityped'].isna()) | (df['land_area_multityped'] == 0)]))
print('Missing land area types : ', len(df[df['land_area_type'].isna()]))
print('Land area types : ', df['land_area_type'].unique())
We have a small number of entries missing here too. These values aren't as integral to our analysis individually, but there's no easy way to impute them accurately (you could consider copying the nearest geographic neighbor's year built and dimensions, but that seems involved for 350 entries), and the missing entries still amount to less than a percent of our total dataset, so I'll continue by dropping these rows.
df = df.dropna(subset=['year_built', 'struct_area', 'land_area_multityped', 'land_area_type'], how='any')
df = df[(df['year_built'] != 0) & (df['struct_area'] != 0) & (df['land_area_multityped'] != 0)]
print('Missing year built: ', len(df[(df['year_built'].isna()) | (df['year_built'] == 0)]))
print('Missing struct area: ', len(df[(df['struct_area'].isna()) | (df['struct_area'] == 0)]))
print('Missing land area: ', len(df[(df['land_area_multityped'].isna()) | (df['land_area_multityped'] == 0)]))
Looking at the area values, you might've also noticed that land area has been split into a number column and a type column. While structure areas are all already in square feet, land area is marked as either S or A. Documentation for this dataset is pretty lacking, so it's not immediately clear what these codes mean, but after looking up the webpages for several of the properties and matching up the area data, I can confirm that S is for square feet and A is for acres.
To minimize this confusion and make the data consistent, let's convert all the land values into square feet.
# Convert land area to be consistently in square feet
def toSqFt(row):
if row['land_area_type'] == 'S':
return row['land_area_multityped']
elif row['land_area_type'] == 'A':
# 1 acre = 43560 sq. ft
return row['land_area_multityped'] * 43560
else:
return np.nan
df['land_area'] = df.apply(lambda row: toSqFt(row), axis=1)
# And now we can delete the old columns
df = df.drop(['land_area_multityped', 'land_area_type'], axis=1)
print('Missing update date: ', len(df[(df['rec_update_date'].isna()) | (df['rec_update_date'] == '0000.00.00')]))
print('Entries w/ deletion dates : ', len(df[df['rec_delete_date'] != '0000.00.00']))
print('Assessment years: ', df['assessment_year'].unique())
The deletion date column was kept so we could find and avoid deleted records in our data. Here we see we have four of them, so we'll simply remove those rows and then drop the column from the dataframe.
# Remove deleted records
df = df[df['rec_delete_date'] == '0000.00.00']
df = df.drop(['rec_delete_date'], axis=1)
df = df.drop(['assessment_year',], axis=1) # redundant column
While we're here, let's also convert the update column into a datetime format.
df['rec_update_date'] = pd.to_datetime(df['rec_update_date'],format='%Y.%m.%d', errors='coerce')
print('Missing update date: ', len(df[(df['rec_update_date'].isna())]))
df['city'].unique()
Saving the most frightening and troublesome issue for last: there seem to be abundant typos and inconsistencies in the city entries (CRPWMSVILLE, CHRUCHTON, etc.). Though humorous, I really want to analyze how cities affect assessment values, so I'll need to fix the typos manually.
To help identify and efficiently group the city typos, I used this list sorter to look at the entries in alphabetical order, and identified values that referred to the same city by a common starting substring where one existed. Otherwise I made an explicit case for each spelling variant.
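If you'd rather stay inside the notebook, the same alphabetical inspection can be done directly in pandas:
# Print the raw city spellings in alphabetical order to spot typos and variants
for name in sorted(df['city'].dropna().unique()):
    print(name)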
def fixcities(row):
if not isinstance(row['city'], str):
return np.nan
elif row['city'].startswith('ANN'):
return 'ANNAPOLIS'
elif row['city'].startswith('ARN'):
return 'ARNOLD'
elif row['city'].startswith('BALT'):
return 'BALTIMORE'
elif row['city'].startswith('BROOKLYN'):
return 'BROOKLYN PARK'
elif row['city'].startswith('CAR'):
return 'CARVEL BEACH'
elif row['city'].startswith('CHES'):
return 'CHESTNUT HILL COVE'
elif row['city'].startswith('CHRUCH'):
return 'CHURCHTON'
elif row['city'].startswith('CROF'):
return 'CROFTON'
elif row['city'].startswith('CRPWMSVILLE') or row['city'].startswith('CROEN'):
return 'CROWNSVILLE'
elif row['city'].startswith('DAV'):
return 'DAVIDSONVILLE'
elif row['city'].startswith('EDE'):
return 'EDGEWATER'
elif row['city'].startswith('GLEN'):
return 'GLEN BURNIE'
elif row['city'].startswith('HAN'):
return 'HANOVER'
elif row['city'].startswith('HARM'):
return 'HARMANS'
# These aren't locations
elif row['city'].startswith('JAMES H HOLT PROP') or row['city'].startswith('MD') or row['city'].startswith('WENGER') or row['city'].startswith('COOK'):
return 'OTHER'
elif row['city'].startswith('LA'):
return 'LAUREL'
elif row['city'].startswith('LIN'):
return 'LINTHICUM'
elif row['city'].startswith('MILL'):
return 'MILLERSVILLE'
elif row['city'].startswith('NORTH BEACH'):
return 'NORTH BEACH'
elif row['city'].startswith('ODE'):
return 'ODENTON'
elif row['city'].startswith('PAS'):
return 'PASADENA'
elif row['city'].startswith('RIVA MD'):
return 'RIVA'
elif row['city'].startswith('SEVERNA'):
return 'SEVERNA PARK'
elif row['city'].startswith('SEV') or row['city'].startswith('BROOKWOOD'):
return 'SEVERN'
elif row['city'].startswith('SHAD'):
return 'SHADY SIDE'
elif row['city'].startswith('TRAC'):
return 'TRACYS LANDING'
else:
return row['city']
df['city'] = df.apply(lambda row: fixcities(row), axis=1)
Notice that we now have a few empty city rows:
print('Missing cities: ', len(df[(df['city'].isna())]))
So let’s also drop those rows
df = df.dropna(subset=['city'])
And with that we now officially have tidied up our dataset:
df.head()
This should be much better to work with than 221 columns with full descriptions for column names!
With data in hand for property values across the county, we can now plot the information and look for potential trends. In the following plots I'll focus on the data and interactions I believe are relevant to our objective, or that may show interesting relationships that would otherwise be overlooked.
The first thing I want to do with the data is plot it all out on an actual map, but that raised a concern for me: dealing with major outliers. I'll go into more detail in the heat mapping section, but for now I want to check for any outliers and oddities.
To do so, we'll make a boxplot for our assessment values as the most direct way of identifying outliers.
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(12,10))
sns.boxplot(df['base_total'], ax=axs[0,0])
sns.boxplot(df['curr_total'], ax=axs[0,1])
sns.boxplot(df['base_land'], ax=axs[1,0])
sns.boxplot(df['base_improvements'], ax=axs[1,1])
axs[0,0].title.set_text('Property Values Distribution')
axs[0,0].set_xlabel("$")
axs[0,1].title.set_text('Property Values (Current Cycle)')
axs[0,1].set_xlabel("$")
axs[1,0].title.set_text('Property Land Values Distribution')
axs[1,0].set_xlabel("$")
axs[1,1].title.set_text('Property Improvements Values Distribution')
axs[1,1].set_xlabel("$")
plt.show()
Looking at these plots, the problem of outliers is blatantly apparent. The central 50% of the data is contained in each box, yet that box is completely shrunk down by some major outliers on the scale. Given the small proportion of data in this upper range, we can make a reasonable educated guess that these homes belong to particularly wealthy individuals and communities.
Apart from that, there's not much else we can glean from these boxplots. It's interesting to observe this small population within the county, but it also makes it difficult to see the rest of the distribution.
With outliers disrupting the scale of our plots, I'd like an additional view without them to better understand the bulk of our data.
To do so, let's calculate the fences of the data and cut off all values outside them. In statistics, a fence is the boundary that distinguishes outliers from non-outliers, typically placed 1.5 times the interquartile range (IQR = Q3 - Q1, the two ends of the boxes you saw) beyond each quartile. Also, since I'm mainly interested in total value assessments, I'll only generate a fenced dataset with respect to the total values distribution.
def remove_outliers(data, col):
Q3 = np.quantile(data[col], 0.75)
Q1 = np.quantile(data[col], 0.25)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
return data[(data[col] >= lower) & (data[col] <= upper)]
fenced_df = remove_outliers(df, 'base_total')
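As a quick sanity check on how aggressive the fence is, we can compare row counts before and after:
# How many rows the fence on base_total cut off
removed = len(df) - len(fenced_df)
print(f'Removed {removed} of {len(df)} rows ({removed / len(df):.1%}) as outliers')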
Now with outliers removed, let's see what the distribution looks like now.
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(12,10))
sns.boxplot(fenced_df['base_total'], ax=axs[0,0])
sns.boxplot(fenced_df['curr_total'], ax=axs[0,1])
sns.boxplot(fenced_df['base_land'], ax=axs[1,0])
sns.boxplot(fenced_df['base_improvements'], ax=axs[1,1])
axs[0,0].title.set_text('Property Values Distribution')
axs[0,0].set_xlabel("$")
axs[0,1].title.set_text('Property Values (Current Cycle)')
axs[0,1].set_xlabel("$")
axs[1,0].title.set_text('Property Land Values Distribution')
axs[1,0].set_xlabel("$")
axs[1,1].title.set_text('Property Improvements Values Distribution')
axs[1,1].set_xlabel("$")
plt.show()
The plots are a lot easier to read now! Looking at this distribution, we can still see that high-value properties dominate the upper end of the data, but the median and the general majority of property prices lie around $300,000 to $400,000.
The average home in Maryland is currently valued at $399,498 according to Zillow, so we're not too far off, though notably below that average. However, base prices don't account for cyclic and seasonal trends in the market, whereas the current cycle data does. With that adjustment, shown in the current cycle distribution, our IQR shifts upwards a little, but the median still falls short of $400,000.
The biggest explanation for this difference is the high-value outliers in the set, which have a huge impact on the mean but none on the median. Considering that some residences break into the next order of magnitude at over $1.6 million, our data can still be considered representative.
In fact, we can look at the summary statistics explicitly to confirm this.
df['base_total'].describe()
Looking at the summary (keep in mind this dataframe retains the outliers), we actually land a bit over that average, passing $405,000 in this county, but with a drastically lower median the influence of the millionaires is quite apparent.
Now that we've sorted out the outliers, we can generate heatmaps of property value that are reasonably scaled.
The max value for the heatmap is typically the maximum you'd expect from the data, but since using the actual max would scale everything else down too drastically, I'll use the max of the fenced data (with outliers removed) instead.
aac_lat, aac_long = 38.9530, -76.5488
# Heatmap for base total assessment
heat_data = [[row['lat'],row['long'], row['base_total']] for index, row in df.iterrows()]
m = folium.Map([aac_lat, aac_long], tiles='OpenStreetMap', zoom_start=10)
HeatMap(heat_data,
max_val = fenced_df['base_total'].max(),
blur = 15,
radius = 10,).add_to(folium.FeatureGroup(name='Heat Map').add_to(m))
folium.LayerControl().add_to(m)
m
Looking at this map we get some interesting but not too surprising results. The majority of high-value properties are concentrated in the northern part of the county, which makes sense knowing that Southern states tend to be more rural and Northern ones more urban; what's interesting is how that pattern shows up within a small county like Anne Arundel too. Maryland is considered a borderline state politically and geographically in the US, so it's interesting that the same trait can be observed at the county level in my home county.
Another thing to note is where exactly these clusters are. The three biggest ones, if you zoom in a bit, are Annapolis, Severna Park, and Edgewater. All three make sense, Annapolis being the state capital and the other two being well regarded locally as high-income communities, but the heatmap makes the distinction much clearer and more apparent.
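To double-check what the heatmap suggests, we can compare the median base assessments for those three cities against the county-wide median (a quick sketch using the cleaned city labels):
# Median base assessment for the three visually prominent clusters vs. the county overall
clusters = ['ANNAPOLIS', 'SEVERNA PARK', 'EDGEWATER']
print(df[df['city'].isin(clusters)].groupby('city')['base_total'].median())
print('County-wide median: ', df['base_total'].median())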
With total property value mapped and analyzed, let's take a look at how each of the two components, land and improvement, vary from the total.
# Heatmap for base land assessment
heat_data = [[row['lat'],row['long'], row['base_land']] for index, row in df.iterrows()]
m = folium.Map([aac_lat, aac_long], tiles='OpenStreetMap', zoom_start=10)
HeatMap(heat_data,
max_val = fenced_df['base_land'].max(),
blur = 15,
radius = 10,).add_to(folium.FeatureGroup(name='Heat Map').add_to(m))
folium.LayerControl().add_to(m)
m
Looking at this map, we see that the northern part of the county is much less crowded with high-value land, but what's incredibly interesting is that there still appear to be three major clusters of high-value land: Annapolis, Crofton, and Severna Park. I neglected to mention Crofton before, but it's also, unsurprisingly, regarded as a high-income area. It's interesting how valued its land is compared to other similar communities, despite not being near the coast like Severna Park and Annapolis, a factor I personally suspect has a positive influence on their land values. More likely than not, this says something about the positive prospects of being in the Crofton community, or having the facilities and resources that come with living near the area, which you wouldn't be able to tell from total value alone.
# Heatmap for base improvements assessment
heat_data = [[row['lat'],row['long'], row['base_improvements']] for index, row in df.iterrows()]
m = folium.Map([aac_lat, aac_long], tiles='OpenStreetMap', zoom_start=10)
HeatMap(heat_data,
max_val = fenced_df['base_improvements'].max(),
blur = 15,
radius = 10,).add_to(folium.FeatureGroup(name='Heat Map').add_to(m))
folium.LayerControl().add_to(m)
m
Looking at improvement values instead, the distribution seems a bit more spread out, but still largely confined to the northern part of the county. The three cities from before still show high improvement values, but the Glen Burnie and Severn areas now show high improvement values as well, which was a little unexpected for me. Looking closer, this seems to be the product of smaller communities in those areas that have a few houses with notably high improvement values.
Unlike land, there's a lot more individual agency in changing a property's improvement value, and the fact that smaller neighborhoods and groups of houses have more distinct impacts on improvement value here seems to support that notion.
Before looking at the direct relationships between specific property attributes, I also want to see how land area plays into property value. Before I map it, though, I'll fence the data, since, as you can see below, outliers are prevalent here as well.
sns.boxplot(fenced_df['land_area'])
fenced_df2 = remove_outliers(df, 'land_area')
heatdf = remove_outliers(df.dropna(subset=['land_area']), 'land_area')
# Heatmap for land area
heat_data = [[row['lat'],row['long'], row['land_area']] for index, row in df.iterrows()]
m = folium.Map([aac_lat, aac_long], tiles='OpenStreetMap', zoom_start=10)
HeatMap(heat_data,
max_val = fenced_df2['land_area'].max(),
blur = 15,
radius = 10,).add_to(folium.FeatureGroup(name='Heat Map').add_to(m))
folium.LayerControl().add_to(m)
m
This is where some of the follies of heatmapping this type of data come in. This is probably the most evenly distributed spread of values we've seen thus far, mostly because land area is unlikely to differ substantially unless you have both very urban and very rural communities together. We still see greater density in the heatmap up north, but zooming in shows this is almost certainly due to the sheer density of properties there, which also correlates with a higher population.
Most of the blocks up north are more or less blue when you zoom in, while the south shows some larger neighborhoods that support the notion of a more rural area, though not by much, which is to be expected in as compact a region as a county.
Now let's look at some more direct relationships between our column attributes and property value, starting with how the year a property was built affects its value today.
I'll note beforehand that you may see fenced versions of the dataframe used in plots from here on out without elaboration; this is usually done to avoid the extreme scaling problems that come from having a couple of million-dollar homes in the data.
fig, ax = plt.subplots(figsize=(10,6))
sns.lineplot(x='year_built', y='base_total', data=fenced_df)
ax.title.set_text('Property Values vs Time Built')
plt.show()
The relationship between construction year and property value is a bit unexpected and not immediately easy to understand.
Most houses built prior to the 1900s fluctuate in value completely across the board, with those built at or before the early 1800s being consistently valued more.
This particular observation is more or less explainable by the quality of older architecture and by the buildings that were built much earlier on standing in more historically important or long-established communities.
The major dip starting around 1900 is a clear display of the economic boom and depression that came with the 1920s, the introduction of credit, and the fun concept of total bank collapse, as well as the effects of an active world war. With that context the drop makes sense: houses constructed during such a tumultuous time were likely built not with frivolousness or extravagance but with pragmatism and affordability in mind, given the global circumstances at the time and a larger, more general market of buyers thanks to credit.
That the drop in value is explained by the costs and pressure of war is further cemented by the fast rise in the values of properties built after the end of WWII; the value of homes seems to have increased steadily year after year, which reflects both an improving market and, likely, the growing scarcity of space for homes.
fig, axs = plt.subplots(nrows = 2, figsize=(10,10))
sns.lineplot(x='year_built', y='base_land', data=fenced_df, ax=axs[0])
axs[0].title.set_text('Property Land Values vs Time Built')
sns.lineplot(x='year_built', y='base_improvements', data=fenced_df, ax=axs[1])
axs[1].title.set_text('Property Improvement Values vs Time Built')
plt.show()
Splitting the values into improvements and land is very enlightening regarding the influences I suspected above. The major spike in improvements, while land takes a slow decline (until recently) starting post-WWII, shows that the larger factor in the growing prices of newer properties lies mainly in their construction.
With the higher overall land values prior to 1950 in mind too, this shows an interesting relationship: older properties are valued for their antiquity (or for size, as a product of being older), while newer properties are valued for being better built, or for lending themselves to better upkeep of their land thanks to the relative recentness of their construction (older buildings should be expected to decline in quality with age, after all).
The relationship of land and structure area to a property's value also interested me.
fig, axs = plt.subplots(nrows = 2, figsize=(10,10))
sns.lineplot(x='land_area', y='base_land', data=fenced_df2, ax=axs[0])
axs[0].title.set_text('Property Land Values vs Land Area (sq ft)')
sns.lineplot(x='struct_area', y='base_improvements', data=fenced_df2, ax=axs[1])
axs[1].title.set_text('Property Improvement Values vs Building Area (sq ft)')
plt.show()
These relationships were definitely not what I expected, especially for land area. In hindsight, the large range of x-axis values, reaching nearly 160,000 square feet, may not be ideal for this plot, but there's still plenty of information to be observed here.
For land values, there seems to be a minor positive linear relationship with land area, albeit nearly negligible for many of the cases plotted, as most lie near the bottom of the land value range. (And for those worried that zero-value cases weren't accounted for: they were; there's simply that much homogeneity on the land value side of the evaluations, which the heatmap supports as well, with only three small clusters visible in an otherwise similar value range.)
More interesting is the very clear and distinct relationship of structure area to improvement value. It tapers off at large values, though rich people elude a common person's thought process, and rural homes may be contesting that section along with pricier properties depending on how structure area was determined. Otherwise, there's a definite positive linear relationship here, confirming the relatively straightforward idea that a bigger house costs more.
It's odd to see how much more impact buildings have on a property than land, relative to their values, despite land having a comparable impact on the total cost in most cases.
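To put rough numbers on these two relationships, we can check the Pearson correlations within the fenced data:
# Correlation of each area measure with its corresponding assessment component
print(fenced_df2[['land_area', 'base_land']].corr())
print(fenced_df2[['struct_area', 'base_improvements']].corr())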
The last factor I want to observe is the relationship between a city and its property values. For this I'll keep the outliers, since some small, wealthy communities are mixed into the cities too.
fig, ax = plt.subplots(figsize=(18,10))
sns.barplot(x="city", y="base_total", data=df)
plt.title("Average Property Value Across AAC Cities")
plt.xticks(rotation=90)
plt.show()
I'm still not sure whether having small independent communities as separate city entries is best for reading the data, but this plot gives us some insight into how property values vary between cities.
Gibson Island and Other show themselves to be the most notorious of the small unincorporated communities in this plot, which makes sense when you look into them. Gibson Island is an entire island by the Chesapeake Bay, so its property value is particularly distinct and caters to a wealthier audience as a disjoint part of the overall area. Other represents entries that don't have any city in their label but listed a real estate agency instead. Such cases appear to be the product of more independent transactions that I'm not familiar with, but they generally seem to involve particularly high-income individuals, so I'll set them aside as such.
Looking at the other cities, it's hard to see a very strong trend in the data, but we do see some cities, like Baltimore or Severna Park, whose property values correlate with the incomes expected in those cities.
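For a more precise reading than the bar heights, we can also rank the cities by their mean assessment directly:
# Mean base total assessment per city, highest first
print(df.groupby('city')['base_total'].mean().sort_values(ascending=False).head(10))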
The city observations in particular make the idea of accounting for other factors in a property's environment very compelling to look into.
With the plots above, we've gotten an idea of how these factors relate to the value of a property, so let's see if those relationships actually hold by creating a regression model.
For our model, we want to account for all the factors we analyzed in the previous section, so our X values will be the following:
Year Built
Land Area
Struct Area
City
To generate this model we'll be using the statsmodels library imported (as sm) at the beginning; please see its user guide if you want to learn more for yourself!
# Each city must be represented numerically, as binary indicator (dummy) variables
lm_df = pd.get_dummies(df, prefix = 'city_is', columns = ['city'], drop_first = False)
dummy_list = ['city_is_' + s for s in df['city'].unique().tolist()]
# These are our x values
ind = sm.add_constant(lm_df[['year_built', 'land_area', 'struct_area'] + dummy_list].to_numpy())
# This is our y
dep = lm_df['base_total'].to_numpy()
model = sm.OLS(dep, ind).fit()
And now that we have our model, let's look at its summary:
print(model.summary(xname = ['const', 'year_built', 'land_area', 'struct_area'] + dummy_list))
This is a lot of information to take in at once, but keep in mind it's padded out by each individual city being listed.
There are a few key takeaways I grasped immediately. One is how little impact land area appears to have: with a coefficient of only around 1, it has a minuscule positive influence, as we surmised from the analysis. Structure area, on the other hand, seems to have a pretty strong positive influence, as we also assumed from our analysis, with a coefficient over a hundred, and both it and land area have a P > |t| that rounds to 0.00, meaning these are extremely confident predictions!
Year built and a few of the cities also have very low P > |t| values (we'll look at the time trend more in the actual graph later), but a good number, if not most, of the cities have pretty poor confidence values. This is likely a product of low sample sizes, as it appears to be exclusive to smaller communities I don't recognize, while places like Laurel, Hanover, Crofton, and many other larger cities maintain P values very near 0, and also show their own varying coefficients for how they impact value assessment (e.g., Crofton has a positive coefficient and showed higher values in our heatmap).
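To pick the confident predictors out of that wall of numbers, we can pair each coefficient with its p-value and filter (a small sketch using the statsmodels results object):
# Keep only coefficients whose p-values indicate statistical confidence
names = ['const', 'year_built', 'land_area', 'struct_area'] + dummy_list
coefs = pd.Series(model.params, index=names)
pvals = pd.Series(model.pvalues, index=names)
print(coefs[pvals < 0.05].sort_values(ascending=False))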
There's still a lot that's hard to grasp from this summary alone though, so let's plot it out and see what we get!
pred = model.predict(ind).reshape(1, dep.size)
df['prediction'] = pred[0]
pred_df = remove_outliers(df[['base_total', 'prediction', 'year_built']], 'base_total')
pred_df = pred_df.melt('year_built', var_name="Source",value_name="Property Value")
pred_df.head()
fig, ax = plt.subplots(figsize=(10,6))
sns.lineplot(data=pred_df,x='year_built', y='Property Value', hue='Source')
ax.title.set_text('Property Values vs Time Built (Prediction vs Actual)')
plt.show()
Looking at this comparison between our prediction and the actual base values, the model appears to be pretty impressive in its accuracy, to an extent; there's nearly indistinguishable overlap for values past 1950.
There's definitely a lot of fluctuation once we look before the late 1800s, and especially before 1800. This is most easily explained by the smaller number of houses remaining from that time, but the variance in the quality of a home and property that comes with age is certainly also a factor in how chaotic the plot gets.
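For a more quantitative read on the fit than the overlapping lines, we can also look at the model's R-squared and the typical size of its errors (here including the outliers we kept in df):
# Overall fit plus the typical prediction error in dollars
residuals = df['base_total'] - df['prediction']
print('R-squared:', model.rsquared)
print('Median absolute error: $', round(residuals.abs().median(), 2))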
Even if I'm not very knowledgeable in the world of real estate, I learned about what factors go into how properties are assessed, and how strongly, and in what ways, some of them interact with the value of a home. Land area turned out to be much less relevant than I thought, and the trends across construction year and cities bring up even more interesting ideas and possibilities.
If this tutorial caught your interest in real estate or the data science process, there's still lots left in this dataset for you to explore yourself. This tutorial only focused on Anne Arundel County in Maryland, but the whole statewide dataset is available for use. You could even look into collating datasets from other states and gain insights on larger-scale factors in property value. Additionally, I mentioned various columns of information that I chose to look past for this tutorial, like cycle data, construction information, and various tax-related evaluations, which are available for you to dig into yourself. This tutorial was written mainly as a way to guide others through the world of data analysis and as a layman's look into real estate; there's lots of data here and elsewhere for you to explore, so if nothing else I hope this has been a compelling preview into both worlds.