AI in Real Estate, Part I: A Brief Introduction to the Relevant Techniques
ChatGPT has reinvigorated the public’s interest in AI. However, while large language models like ChatGPT are quite impressive, other AI techniques have more potential to revolutionize the real estate industry. Let’s consider the major AI techniques and their uses within real estate; then, with this information in hand, we can turn to the question of how AI will alter the industry in subsequent posts.
1. Linear Regression
Linear regression models are, in fact, a form of machine learning, even though they are often thought of as part of statistics. Indeed, machine learning and statistics are at bottom the same discipline, just pursued by two different academic cultures, statisticians and computer scientists. Linear regression essentially tries to fit a line (or a hyperplane, depending on how many features your model has) as closely as possible to a set of observations according to some metric, most often the sum of the squared residuals (the distances between the predicted values and the observed ones). The reason for squaring the residuals, of course, is that it penalizes large misses far more heavily than small ones, so the fit prefers many small misses to a few gigantic ones.
Linear regression can be used to predict property prices based on various features such as size, location, amenities, and historical sales data. It establishes a linear relationship between the input features and the target variable (price) to make predictions. One of its advantages over newer techniques is that it is incredibly easy to interpret and much less likely to overfit your data.
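To make this concrete, here is a minimal sketch of such a hedonic price model using scikit-learn. The features, coefficients, and data below are invented for illustration; a real model would be fit on actual sales records.

```python
# A minimal linear regression sketch with scikit-learn.
# All features and data here are synthetic, for illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical training data: square footage, bedrooms, lot size (acres)
X = rng.uniform([800, 1, 0.1], [4000, 6, 2.0], size=(200, 3))
# Synthetic prices: $150/sqft + $10k/bedroom + $50k/acre, plus noise
y = 150 * X[:, 0] + 10_000 * X[:, 1] + 50_000 * X[:, 2] + rng.normal(0, 20_000, 200)

model = LinearRegression().fit(X, y)

# The coefficients read directly as dollar effects per unit of each feature,
# which is the interpretability advantage mentioned above.
print(dict(zip(["sqft", "beds", "acres"], model.coef_.round(0))))
print(model.predict([[2000, 3, 0.5]]))  # predicted price for one home
```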
2. Support Vector Machines (SVMs) and Support Vector Regression
SVMs can be used for tasks such as property price prediction, property classification, and fraud detection. They are particularly useful when dealing with high-dimensional feature spaces and non-linear relationships. SVMs grew out of nearest-neighbor algorithms, which solved classification problems by comparing a particular item's features to those of several paradigmatic examples, usually after normalizing the different features (be it weight, volume, etc.) onto a single scale, generally z-scores for those of you familiar with that idea. SVMs are based on the idea that it would be helpful to precalculate the boundaries between classes; after all, every possible set of feature values theoretically already has an answer associated with it. What was found, however, is that often you could not completely separate your examples into distinct regions; they were jumbled together, making it impossible to segment the space into one region for each class you want to identify. The Soviet mathematician and computer scientist Vladimir Vapnik discovered that if you add dimensions and map the data into a higher-dimensional vector space (sadly, this does involve a little bit of linear algebra), you can often make the examples separable.
Support Vector Regression behaves similarly to linear regression except that it attempts to fit a fixed (and generally curved) boundary that captures as many of the instances as possible within a predefined error margin. It can be used to estimate the value of particular home features or as a home price model in and of itself. Its advantage over linear regression is that it can better accommodate how features affect value differently at different price levels.
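Here is a minimal SVR sketch along those lines, again using scikit-learn on invented data. The RBF kernel supplies the higher-dimensional mapping described above, and the hyperparameter values are placeholders that would need tuning on real data.

```python
# A minimal SVR sketch with scikit-learn on synthetic data.
# SVR is sensitive to feature scale, so features are standardized first;
# in practice the target often benefits from scaling as well.
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.uniform([800, 1], [4000, 6], size=(300, 2))          # sqft, bedrooms
y = 150 * X[:, 0] + 10_000 * X[:, 1] + rng.normal(0, 15_000, 300)

# epsilon sets the error margin inside which residuals are ignored;
# the RBF kernel provides the implicit higher-dimensional mapping.
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100_000, epsilon=5_000))
svr.fit(X, y)
print(svr.predict([[2000, 3]]))  # predicted price for one home
```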
[Image: Support Vector Machine (SVM) Tutorial, by Abhishek Ghose, Cube Dev, Medium]
3. Decision Trees
Decision trees are among the most important models in all of machine learning, as they perform well on a variety of tasks and are highly interpretable (though less so than linear regression). They form the basis for several other techniques, described below, that are among the most accurate available, especially when you are dealing with so-called “tabular data” (the sort of data you could represent on a spreadsheet, as opposed to photographic data, sonar data, etc.). Their only real downsides are that they do not perform well on “unstructured data” (as stated above, think photos and the like) and that they can be prone to overfitting. Indeed, the art of building decision tree models is largely the art of preventing overfitting. Here is a brief description of how a decision tree works:
Root Node
At the top of the tree, we have the root node, which represents the entire dataset or a subset of it. It is associated with a feature that best separates the data based on certain criteria.
Internal Nodes
Internal nodes in the tree represent decisions based on specific features. Each internal node has two or more branches, each corresponding to a possible value of the associated feature. The decision is made by evaluating the feature’s value against a splitting criterion, such as Gini impurity or information gain, to determine the best split that maximizes the separation of the classes or reduces the variance in the target variable.
Leaf Nodes
Leaf nodes are the terminal nodes of the tree and represent the final prediction or output. In classification tasks, each leaf node corresponds to a specific class label, while in regression tasks, it represents a numerical value. The prediction at a leaf node is typically determined by the majority class or the mean value of the training samples that reach that leaf.
Splitting and Growing
The process of building a decision tree involves recursively splitting the data based on the selected features at each internal node. The goal is to create homogeneous subsets of the data, where samples within each subset share similar characteristics or have similar target values. This process continues until a stopping criterion is met, such as reaching a maximum depth, achieving a minimum number of samples per leaf, or no further improvement in the splitting criterion.
Prediction
To make a prediction for a new data point, the model traverses the decision tree from the root node to a leaf node, following the path determined by the data point’s feature values. The prediction at the leaf node reached is then returned as the output.
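Putting these pieces together, here is a minimal scikit-learn sketch on invented data; printing the fitted tree shows the root node, internal splits, and leaf predictions described above. The depth and leaf-size settings are the overfitting controls mentioned earlier.

```python
# A minimal decision-tree sketch with scikit-learn; data is synthetic.
# max_depth and min_samples_leaf are the usual overfitting controls.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(2)
X = rng.uniform([800, 1], [4000, 6], size=(300, 2))          # sqft, bedrooms
y = 150 * X[:, 0] + 10_000 * X[:, 1] + rng.normal(0, 15_000, 300)

tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=20).fit(X, y)

# export_text prints the learned splits: the root node at the top,
# internal decision nodes below it, and predicted values at the leaves.
print(export_text(tree, feature_names=["sqft", "beds"]))
print(tree.predict([[2000, 3]]))
```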
[Image: A brief introduction to Decision Trees, by Guilherme Henrique dos Santos, Medium]
4. Random Forests
Random forests are an ensemble learning technique that combines multiple decision trees. Ensemble models are models that combine different models into a larger, single predictor. They are effective in real estate for property price prediction and classification. [They can also be used for time series modeling, though one must be careful to use walk-forward validation rather than k-fold cross-validation, as the latter can produce unrealistically optimistic results. Cross-validation involves splitting the data into n subsets, training your model on n-1 of them, and then testing its performance on the remaining one; you then repeat this process, using each of the other subsets in turn as the holdout. Walk-forward validation instead splits the data chronologically into prior-to-test and time-of-test sets: you might train your model on January through November of 2022 and then look at its performance in December; then you would add in December and look at January of 2023, and so on, using a rolling time window.] Random forests can handle complex relationships and reduce the risk of overfitting. By building multiple trees on different subsamples of the data (an idea referred to as bagging), you create a more general model (or rather set of models) that performs better on unseen examples.
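To illustrate the walk-forward idea concretely, here is a sketch using scikit-learn's random forest on a synthetic monthly series. The window sizes, features, and data are all invented for illustration.

```python
# A sketch of walk-forward validation for a random forest on monthly data.
# Everything here (data, window sizes, features) is synthetic/illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
n_months = 48
X = rng.normal(size=(n_months, 4))            # hypothetical monthly features
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(0, 0.5, n_months)

errors = []
for t in range(36, n_months):                 # expanding window over time
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X[:t], y[:t])                   # train only on months before t...
    pred = model.predict(X[t:t + 1])[0]       # ...then predict month t alone
    errors.append(abs(pred - y[t]))

print(f"walk-forward MAE: {np.mean(errors):.3f}")
```

Note that each fold trains strictly on the past and tests on the future, which is exactly the discipline k-fold cross-validation fails to impose on time series.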
5. Gradient Boosting
Gradient boosting algorithms, like XGBoost and LightGBM, can be used for a variety of tasks from AVMs to HPI forecasting. They iteratively build an ensemble of decision trees, each one focused on correcting the errors of the prior trees, with the largest errors weighted most heavily; in short, you fit each tree to predict the residuals of the earlier trees (except for the first tree, of course). These techniques win the majority of Kaggle competitions in which entrants are presented with tabular data.
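Here is a minimal sketch using the XGBoost library on invented data; the hyperparameters are placeholders rather than recommendations, and real AVM or HPI work would tune them against a validation set.

```python
# A minimal XGBoost sketch on synthetic tabular data.
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(4)
X = rng.uniform([800, 1, 0.1], [4000, 6, 2.0], size=(500, 3))  # sqft, beds, acres
y = 150 * X[:, 0] + 10_000 * X[:, 1] + 50_000 * X[:, 2] + rng.normal(0, 20_000, 500)

# Each successive tree is fit to the residuals of the ensemble so far,
# as described above; learning_rate shrinks each tree's correction.
model = XGBRegressor(n_estimators=300, learning_rate=0.05, max_depth=4)
model.fit(X, y)
print(model.predict(X[:1]))  # prediction for the first synthetic home
```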
[Image: Your Key To Understanding the XGBoost Algorithm, by Patrick McClory, Medium]
6. Neural Networks
Neural networks, particularly deep learning models, can be applied to various real estate tasks. They are used for property price prediction, HPI forecasting, property image analysis, sentiment analysis of real estate listings, and generating property descriptions. They show a great deal of promise specifically in extracting property condition and build quality from photographs. They can also act as visualization aids, depicting what a property might look like post-renovation, or how it might look twenty years in the future depending on what kinds of materials the homeowner decided to use. Neural networks are especially well suited to problems, such as image recognition and image generation, where it is hard to find a simple representation of the problem space. The points below summarize how they work; a small training-loop sketch follows the list.
- Neural networks, inspired by earlier beliefs about the human brain, consist of interconnected layers of artificial neurons (also called nodes).
- Neural networks have an input layer, one or more hidden layers, and an output layer.
- Each node takes input values, applies a transformation (activation function) to them, and produces an output. It essentially determines the “volume” of the message it passes to the next node in the network based on the strengths of the inputs it receives.
- The connections between nodes have associated weights that control the strength of the influence of one node on another.
- During training, the network adjusts these weights to minimize a loss function, which measures the difference between predicted outputs and desired outputs.
- This adjustment is performed using optimization algorithms such as gradient descent or its variants.
- Backpropagation is the procedure that calculates the gradient of the loss function with respect to the weights; the optimizer then uses those gradients to update the weights accordingly.
- Hidden layers allow the network to learn complex representations and extract relevant features from the input data.
- The output layer produces the final predictions or responses based on the learned relationships in the data.
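Here is the promised training-loop sketch: a minimal PyTorch example on synthetic data that walks through the forward pass, loss, backpropagation, and weight update described above. The architecture and hyperparameters are illustrative only.

```python
# A minimal PyTorch training-loop sketch on synthetic data.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.rand(256, 3)                         # hypothetical scaled features
y = (X @ torch.tensor([[2.0], [-1.0], [0.5]])) + 0.1 * torch.randn(256, 1)

# Input layer -> one hidden layer (with activation) -> output layer
model = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.MSELoss()                          # the loss function being minimized
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # gradient descent

for epoch in range(200):
    pred = model(X)                             # forward pass through the layers
    loss = loss_fn(pred, y)
    optimizer.zero_grad()
    loss.backward()                             # backpropagation: compute gradients
    optimizer.step()                            # adjust weights along the gradients

print(f"final training loss: {loss.item():.4f}")
```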
[Image: Introduction to Neural Nets (Without the Brain Metaphor), by Mark Riedl, Medium]
7. Large Language Models
Large language models like ChatGPT use a particular kind of neural net called a transformer, which uses attention scores to better incorporate context: in essence, the model learns to focus its attention on important words while ignoring “buffer words” that are less informative. Attention scores are a complicated enough topic that I plan to deal with them in a separate paper. For our purposes, they allowed neural nets to solve the problem of “keeping context in mind” during language processing. Interestingly enough, it is because real estate has put so much effort into standardizing its data that LLMs will not prove as revolutionary here as in other industries. Moreover, if you are trying to capture prices, you have moved away from natural language processing and towards more quantitative models like gradient boosting.
Large language models and other NLP techniques can produce property descriptions, handle customer interactions, conduct sentiment analysis, recommend properties and home improvement ideas based on a client’s observed preferences, and produce standardized property representations out of complicated text sources like permit data and MLS listings. Companies like Zillow may use them to replicate some of the tasks a realtor might perform, ultimately making those companies more competitive and profitable. They can even recommend renovations and then price out the components someone would need to finish them.
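As one concrete illustration, here is a hedged sketch of generating a listing description through the OpenAI Python SDK. The model name, prompt, and listing facts are invented for illustration, and the call assumes an API key is configured in the environment.

```python
# A sketch of generating a listing description with an LLM API.
# Assumes the OpenAI Python SDK is installed and OPENAI_API_KEY is set;
# the model name and prompt are illustrative choices, not recommendations.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

facts = "3 bed / 2 bath ranch, 1,850 sqft, renovated kitchen, 0.4 acre lot"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # hypothetical model choice
    messages=[
        {"role": "system", "content": "You write concise, factual MLS listing descriptions."},
        {"role": "user", "content": f"Write a 3-sentence listing description from these facts: {facts}"},
    ],
)
print(response.choices[0].message.content)
```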
[Image: A Very Gentle Introduction to Large Language Models without the Hype, by Mark Riedl, Medium]
8. Clustering
Clustering techniques like k-means or hierarchical clustering can group similar properties together based on their features. This can be useful for market segmentation, identifying property clusters with similar characteristics, or discovering patterns in large real estate datasets. We have used k-means clustering to identify neighborhood boundaries, which helps our AVM make use of more realistic location fixed effects.
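Here is a minimal k-means sketch along those lines using scikit-learn. The coordinates and number of clusters are invented; real neighborhood work would involve far more data and care.

```python
# A minimal k-means sketch: clustering listings by coordinates to
# approximate neighborhood groupings. Coordinates and k are synthetic.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
# Hypothetical listing coordinates scattered around three neighborhood centers
centers = np.array([[40.70, -73.95], [40.75, -73.98], [40.80, -73.96]])
coords = np.vstack([c + rng.normal(0, 0.01, size=(100, 2)) for c in centers])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(coords)

# Each listing gets a cluster label that can serve as a location
# fixed effect in a downstream price model, as described above.
print(kmeans.cluster_centers_)
print(kmeans.labels_[:10])
```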
9. Time Series Analysis
Time series analysis techniques, such as ARIMA (Autoregressive Integrated Moving Average), are relevant for real estate applications involving temporal data, such as forecasting interest rates and home prices or predicting rental yields over time. The most basic time series technique, the moving average, is key to reporting, where various artifacts of data collection need to be smoothed away for interpretability. (In fact, during the dark days of Covid, the refusal of so much reporting to use moving averages, showing us the familiar “Monday spike,” suggests that many inside our health establishment need to learn about building usable reports.) ARIMA combines three concepts: autoregression, differencing, and moving average. Autoregression refers to the idea that the value of a data point depends on its previous values. Differencing involves subtracting consecutive data points to remove trends or seasonality. The moving-average component, despite the name, models the current value as a function of past forecast errors, which helps the model absorb short-lived shocks and noise. By combining these three components, ARIMA models can capture and predict the patterns and dynamics in time series data.
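Here is a minimal ARIMA sketch using statsmodels on a synthetic index. The (p, d, q) order is illustrative; a real forecast would select it via diagnostics such as AIC or out-of-sample error.

```python
# A minimal ARIMA sketch with statsmodels on a synthetic monthly index.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(6)
# Synthetic index: an upward trend plus autocorrelated noise
trend = np.linspace(100, 140, 60)
noise = np.cumsum(rng.normal(0, 0.5, 60))
series = trend + noise

# order=(1, 1, 1): one autoregressive lag, one difference to remove
# the trend, and one moving-average (past-error) term
model = ARIMA(series, order=(1, 1, 1)).fit()
print(model.forecast(steps=6))   # six periods ahead
```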
10. Dimensionality Reduction
Techniques like Principal Component Analysis (PCA) or t-SNE can reduce the dimensionality of real estate data while preserving important information. This can aid in visualizing data, identifying key features, or improving the efficiency of other machine-learning models. PCA can also help in local market analysis by showing which features best explain price variations.
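Here is a minimal PCA sketch using scikit-learn on invented features; the explained variance ratios show how much of the data's overall variation each component captures, which is how PCA surfaces the dominant drivers in a local market.

```python
# A minimal PCA sketch on synthetic listing features.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 6))                       # six hypothetical features
X[:, 1] = X[:, 0] * 0.9 + rng.normal(0, 0.1, 300)   # a deliberately correlated pair

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(StandardScaler().fit_transform(X))

print(pca.explained_variance_ratio_)   # variance captured per component
print(X_reduced.shape)                 # (300, 3): compressed feature set
```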
AI promises not only to make real estate analytics vastly more accurate but also to lead real estate economists to deeper market insights, by extracting information from new sources and providing a means of investigating complex relationships that earlier techniques could not uncover. On the consumer side, it will allow buyers to tour homes remotely and visualize potential renovations, while allowing large companies to compete with smaller realtors on customer service and local market expertise. In the next post, we will discuss the question: What real estate jobs will AI replace?