Juliet Kelson

Predicting Airbnb Price

Mini-Project 1

Prompt

The ultimate goal of your analysis will be to build a predictive model of price and to understand the listing features with which this might be associated. Specifically: How accurately can we predict the price of an Airbnb listing by its features? Not only might this information help Airbnb, it might help new listers with guidance on how much they might charge. What makes some listings more expensive than others?!



Part 1: Ready the data

library(ggplot2)
library(dplyr)
library(caret)
airbnb <- read.csv("NYC_airbnb_kaggle.csv")
nbhd <- read.csv("NYC_nbhd_kaggle.csv")
airbnb_all <- left_join(airbnb, nbhd, by=c("neighbourhood_cleansed"= "neighbourhood"))
airbnb_all <- airbnb_all %>% 
  filter(price <1000)

amenities <- airbnb_all %>% select(amenities, id)
amenities <- amenities %>% 
  mutate(amenities_count = stringr::str_count(amenities, ",")) %>% 
  mutate(amenities_count = if_else(amenities != "{}", amenities_count+1, 0))

airbnb_all <- airbnb_all %>% 
  left_join(amenities, by="id")
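
The amenity count above relies on a simple rule: the number of commas plus one, with the empty set `{}` counted as zero. As a minimal sketch in base R, using three made-up amenity strings (hypothetical values, not rows from the dataset), the rule behaves like this:

```r
# Hypothetical amenity strings in the same "{a,b,c}" format as the dataset
amenities <- c("{TV,Wifi,Kitchen}", "{}", "{Heating}")

# Count the commas in each string (base R stand-in for stringr::str_count)
comma_count <- lengths(regmatches(amenities, gregexpr(",", amenities, fixed = TRUE)))

# Commas + 1 gives the number of items, except for the empty set "{}"
amenities_count <- ifelse(amenities != "{}", comma_count + 1, 0)

amenities_count
```

One caveat of this rule is that an amenity whose own name contains a comma would be counted more than once.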

airbnb_ints <- airbnb_all%>% 
  select(price, latitude, longitude, accommodates, bathrooms, bedrooms, beds, guests_included, minimum_nights, maximum_nights,
         number_of_reviews, review_scores_rating, amenities_count) 
cor(airbnb_ints)
##                             price     latitude    longitude accommodates
## price                 1.000000000  0.062250039 -0.268565871  0.560303649
## latitude              0.062250039  1.000000000  0.092210424 -0.039322363
## longitude            -0.268565871  0.092210424  1.000000000 -0.025220800
## accommodates          0.560303649 -0.039322363 -0.025220800  1.000000000
## bathrooms                      NA           NA           NA           NA
## bedrooms                       NA           NA           NA           NA
## beds                           NA           NA           NA           NA
## guests_included       0.345398137 -0.038478217  0.019759448  0.581187256
## minimum_nights       -0.004614455 -0.001311596 -0.028125565 -0.022663361
## maximum_nights       -0.001272293  0.005102524 -0.001844585 -0.005083025
## number_of_reviews    -0.005565926  0.006069145  0.008158253  0.116036073
## review_scores_rating           NA           NA           NA           NA
## amenities_count       0.184589733  0.002250424 -0.010799440  0.254538687
##                      bathrooms bedrooms beds guests_included
## price                       NA       NA   NA     0.345398137
## latitude                    NA       NA   NA    -0.038478217
## longitude                   NA       NA   NA     0.019759448
## accommodates                NA       NA   NA     0.581187256
## bathrooms                    1       NA   NA              NA
## bedrooms                    NA        1   NA              NA
## beds                        NA       NA    1              NA
## guests_included             NA       NA   NA     1.000000000
## minimum_nights              NA       NA   NA    -0.016184543
## maximum_nights              NA       NA   NA    -0.002707443
## number_of_reviews           NA       NA   NA     0.144415261
## review_scores_rating        NA       NA   NA              NA
## amenities_count             NA       NA   NA     0.217549624
##                      minimum_nights maximum_nights number_of_reviews
## price                  -0.004614455   -0.001272293      -0.005565926
## latitude               -0.001311596    0.005102524       0.006069145
## longitude              -0.028125565   -0.001844585       0.008158253
## accommodates           -0.022663361   -0.005083025       0.116036073
## bathrooms                        NA             NA                NA
## bedrooms                         NA             NA                NA
## beds                             NA             NA                NA
## guests_included        -0.016184543   -0.002707443       0.144415261
## minimum_nights          1.000000000   -0.001010053      -0.048465986
## maximum_nights         -0.001010053    1.000000000      -0.001739889
## number_of_reviews      -0.048465986   -0.001739889       1.000000000
## review_scores_rating             NA             NA                NA
## amenities_count        -0.026023869   -0.006109732       0.189857898
##                      review_scores_rating amenities_count
## price                                  NA     0.184589733
## latitude                               NA     0.002250424
## longitude                              NA    -0.010799440
## accommodates                           NA     0.254538687
## bathrooms                              NA              NA
## bedrooms                               NA              NA
## beds                                   NA              NA
## guests_included                        NA     0.217549624
## minimum_nights                         NA    -0.026023869
## maximum_nights                         NA    -0.006109732
## number_of_reviews                      NA     0.189857898
## review_scores_rating                    1              NA
## amenities_count                        NA     1.000000000
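
The NA entries in the matrix above arise because `cor()` returns NA for any pair of columns in which at least one value is missing. One possible way to still obtain correlations for those columns is the `use = "pairwise.complete.obs"` argument, sketched here on a small made-up data frame (the values are hypothetical and only illustrate the behavior):

```r
# Hypothetical toy data with one missing bathrooms value
toy <- data.frame(
  price        = c(100, 150, 80, 220, 60, 130),
  accommodates = c(2, 4, 2, 6, 1, 3),
  bathrooms    = c(1, NA, 1, 2, 1, 1.5)
)

# Default behavior: every pair involving bathrooms is NA
cor_default <- cor(toy)

# Pairwise deletion: each pair uses the rows where both values are present
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

cor_default
cor_pairwise
```

With pairwise deletion, different entries of the matrix may be based on different subsets of rows, so this is best treated as an exploratory summary.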
airbnb_all <- airbnb_all %>%
  select(-id, -host_response_time, -host_response_rate, -host_has_profile_pic, -calendar_updated, -require_guest_profile_picture, -availability_30, -is_location_exact, -guests_included, -beds, -amenities.x, -amenities.y, -minimum_nights, -maximum_nights, -cancellation_policy, -reviews_per_month, -bed_type, -is_business_travel_ready, -square_feet)

airbnb_all <- airbnb_all %>% 
  mutate(neighbourhood_cleansed = as.factor(neighbourhood_cleansed))
set.seed(253)

airbnb_sample <- sample_n(airbnb_all, 5000)
airbnb_int_sample <- sample_n(airbnb_ints, 5000)







Part 2: Analyze

Get to know the data:

summary(airbnb_sample)
##  host_is_superhost        neighbourhood_cleansed    latitude    
##   :  22            Williamsburg      : 438       Min.   :40.55  
##  f:4440            Bedford-Stuyvesant: 363       1st Qu.:40.69  
##  t: 538            Harlem            : 307       Median :40.72  
##                    Bushwick          : 267       Mean   :40.73  
##                    East Village      : 235       3rd Qu.:40.76  
##                    Hell's Kitchen    : 204       Max.   :40.90  
##                    (Other)           :3186                      
##    longitude          property_type            room_type   
##  Min.   :-74.17   Apartment  :4224   Entire home/apt:2474  
##  1st Qu.:-73.98   House      : 439   Private room   :2411  
##  Median :-73.96   Townhouse  : 108   Shared room    : 115  
##  Mean   :-73.95   Loft       :  87                         
##  3rd Qu.:-73.94   Condominium:  53                         
##  Max.   :-73.71   Other      :  33                         
##                   (Other)    :  56                         
##   accommodates      bathrooms        bedrooms          price      
##  Min.   : 1.000   Min.   :0.000   Min.   : 0.000   Min.   :  0.0  
##  1st Qu.: 2.000   1st Qu.:1.000   1st Qu.: 1.000   1st Qu.: 69.0  
##  Median : 2.000   Median :1.000   Median : 1.000   Median :100.0  
##  Mean   : 2.809   Mean   :1.128   Mean   : 1.159   Mean   :139.6  
##  3rd Qu.: 4.000   3rd Qu.:1.000   3rd Qu.: 1.000   3rd Qu.:175.0  
##  Max.   :16.000   Max.   :5.000   Max.   :10.000   Max.   :999.0  
##                   NA's   :18      NA's   :7                       
##  number_of_reviews review_scores_rating instant_bookable
##  Min.   :  0.00    Min.   : 20.00       f:3754          
##  1st Qu.:  1.00    1st Qu.: 91.00       t:1246          
##  Median :  5.00    Median : 96.00                       
##  Mean   : 18.09    Mean   : 93.46                       
##  3rd Qu.: 19.00    3rd Qu.:100.00                       
##  Max.   :489.00    Max.   :100.00                       
##                    NA's   :1155                         
##     neighbourhood_group amenities_count
##  Bronx        :  79     Min.   : 0.00  
##  Brooklyn     :2079     1st Qu.:12.00  
##  Manhattan    :2284     Median :15.00  
##  Queens       : 526     Mean   :16.22  
##  Staten Island:  32     3rd Qu.:20.00  
##                         Max.   :70.00  
## 
airbnb_sample %>%
  ggplot(aes(x = review_scores_rating, y = price)) +
  geom_jitter(aes(color = accommodates), alpha = 0.5) +
  facet_wrap(vars(host_is_superhost))
## Warning: Removed 1155 rows containing missing values (geom_point).

airbnb_sample %>%
  ggplot(aes(x = review_scores_rating, y = price)) +
  geom_point(aes(color = room_type), alpha = 0.5) +
  facet_wrap(vars(neighbourhood_group))
## Warning: Removed 1155 rows containing missing values (geom_point).

impute_info <- airbnb_sample %>% 
  select(-price) %>% 
  preProcess(method="knnImpute")

impute_info_int <- airbnb_ints %>% 
  select(-price) %>% 
  preProcess(method="knnImpute")

airbnb_sample_complete <- predict(impute_info, newdata=airbnb_sample)
airbnb_int_sample_complete <- predict(impute_info_int, newdata=airbnb_int_sample)

In preparing our data, we removed redundant predictors; for example, we kept accommodates and dropped guests_included. We also removed predictors that we suspected wouldn’t help predict price, such as id or host_has_profile_pic.

The correlation matrix helped us identify which numeric predictors are most strongly associated with price. After this process, we were left with 15 of the original 31 variables.

After studying the data and looking at various plots, a few insights about the price of Airbnb listings surfaced. To begin with, being a superhost does not seem to affect price. There are more listings from non-superhosts than from superhosts, yet both groups show the same trend: many cheap listings and few expensive ones. The review score is only weakly related to price. Listings rated at or near 100 span the full range of prices, but the most expensive listings all have high ratings. There is a similar trend with accommodates: the more expensive listings accommodate a large number of people. Furthermore, the more expensive listings tend to be entire homes or apartments, while the less expensive listings tend to be private or shared rooms.

Predictive model attempts:

#  airbnb_please_work <-
#    airbnb_sample_complete %>% select(
#   price,
#   neighbourhood_group,
#   review_scores_rating,
#   amenities_count,
#   bathrooms,
#   property_type,
#   host_is_superhost,
#   accommodates,
#   bedrooms,
#   number_of_reviews
#   )
# 
#   gam_model <- train(
#   price ~ .,
#   data = airbnb_please_work,
#   method = "gamLoess",
#   tuneGrid = data.frame(span = 0.5, degree = 1),
#   trControl = trainControl(
#   method = "cv",
#   number = 10,
#   selectionFunction = "best"
#   ),
#   metric = "MAE",
#   na.action = na.omit
#   )
# 
# par(mfrow = c(4,3))
# plot(gam_model$finalModel)
# 
# gam_model$results
# 
# model_data <- data.frame(model.matrix(price ~ accommodates + host_is_superhost + neighbourhood_group - 1, data = airbnb_sample_complete)) %>%
#    mutate(price = airbnb_sample_complete$price)
# 
# knn_model <- train(
#   price ~ .,
#   data = airbnb_sample_complete,
#   preProcess = c("center","scale"),
#   method = "knn",
#   tuneGrid = data.frame(k = seq(1, 80, by=5)),
#   trControl = trainControl(method = "cv", number = 10, selectionFunction = "best"),
#   metric = "MAE",
#   na.action = na.omit
# )
# 
# 
# knn_model$results %>% filter(k==knn_model$bestTune$k)
# plot(knn_model)

Final predictive model:

lambda_grid <- 10^seq(-8, 4, length = 100)
set.seed(253)

lasso_model <- train(
  price ~ .,
  data = airbnb_sample_complete,
  method = "glmnet",
  trControl = trainControl(method = "cv", number = 10, selectionFunction = "best"),
  tuneGrid = data.frame(alpha = 1, lambda = lambda_grid),
  metric = "MAE",
  na.action = na.omit
)

coef(lasso_model$finalModel, lasso_model$bestTune$lambda)
## 261 x 1 sparse Matrix of class "dgCMatrix"
##                                                            1
## (Intercept)                                      147.7541747
## host_is_superhostf                                 .        
## host_is_superhostt                                 .        
## neighbourhood_cleansedArden Heights                .        
## neighbourhood_cleansedArrochar                     .        
## neighbourhood_cleansedArverne                      .        
## neighbourhood_cleansedAstoria                      .        
## neighbourhood_cleansedBath Beach                   .        
## neighbourhood_cleansedBattery Park City           54.4443102
## neighbourhood_cleansedBay Ridge                    .        
## neighbourhood_cleansedBay Terrace                  .        
## neighbourhood_cleansedBay Terrace, Staten Island   .        
## neighbourhood_cleansedBaychester                   .        
## neighbourhood_cleansedBayside                      .        
## neighbourhood_cleansedBayswater                    .        
## neighbourhood_cleansedBedford-Stuyvesant          -8.3832068
## neighbourhood_cleansedBelle Harbor                 .        
## neighbourhood_cleansedBellerose                    .        
## neighbourhood_cleansedBelmont                      .        
## neighbourhood_cleansedBensonhurst                  .        
## neighbourhood_cleansedBergen Beach                 .        
## neighbourhood_cleansedBoerum Hill                  .        
## neighbourhood_cleansedBorough Park                 .        
## neighbourhood_cleansedBriarwood                    .        
## neighbourhood_cleansedBrighton Beach               .        
## neighbourhood_cleansedBronxdale                    .        
## neighbourhood_cleansedBrooklyn Heights             .        
## neighbourhood_cleansedBrownsville                  .        
## neighbourhood_cleansedBushwick                     .        
## neighbourhood_cleansedCambria Heights              .        
## neighbourhood_cleansedCanarsie                     .        
## neighbourhood_cleansedCarroll Gardens             19.7030319
## neighbourhood_cleansedCastle Hill                  .        
## neighbourhood_cleansedCastleton Corners            .        
## neighbourhood_cleansedChelsea                     26.4469781
## neighbourhood_cleansedChinatown                    .        
## neighbourhood_cleansedCity Island                  .        
## neighbourhood_cleansedCivic Center                 .        
## neighbourhood_cleansedClaremont Village            .        
## neighbourhood_cleansedClason Point                 .        
## neighbourhood_cleansedClifton                      .        
## neighbourhood_cleansedClinton Hill                 .        
## neighbourhood_cleansedCo-op City                   .        
## neighbourhood_cleansedCobble Hill                  .        
## neighbourhood_cleansedCollege Point                .        
## neighbourhood_cleansedColumbia St                  .        
## neighbourhood_cleansedConcord                      .        
## neighbourhood_cleansedConcourse                    .        
## neighbourhood_cleansedConcourse Village            .        
## neighbourhood_cleansedConey Island                 .        
## neighbourhood_cleansedCorona                       .        
## neighbourhood_cleansedCrown Heights               -1.1649206
## neighbourhood_cleansedCypress Hills                .        
## neighbourhood_cleansedDitmars Steinway             .        
## neighbourhood_cleansedDongan Hills                 .        
## neighbourhood_cleansedDouglaston                   .        
## neighbourhood_cleansedDowntown Brooklyn            .        
## neighbourhood_cleansedDUMBO                       14.3468410
## neighbourhood_cleansedDyker Heights                .        
## neighbourhood_cleansedEast Elmhurst                .        
## neighbourhood_cleansedEast Flatbush                .        
## neighbourhood_cleansedEast Harlem                -25.6811467
## neighbourhood_cleansedEast Morrisania              .        
## neighbourhood_cleansedEast New York               -6.0455536
## neighbourhood_cleansedEast Village                 .        
## neighbourhood_cleansedEastchester                  .        
## neighbourhood_cleansedEdenwald                     .        
## neighbourhood_cleansedEdgemere                     .        
## neighbourhood_cleansedElmhurst                     .        
## neighbourhood_cleansedEltingville                  .        
## neighbourhood_cleansedEmerson Hill                 .        
## neighbourhood_cleansedFar Rockaway                 .        
## neighbourhood_cleansedFieldston                    .        
## neighbourhood_cleansedFinancial District           .        
## neighbourhood_cleansedFlatbush                    -4.0736089
## neighbourhood_cleansedFlatiron District            .        
## neighbourhood_cleansedFlatlands                    .        
## neighbourhood_cleansedFlushing                     .        
## neighbourhood_cleansedFordham                     -1.9478980
## neighbourhood_cleansedForest Hills                 .        
## neighbourhood_cleansedFort Greene                 18.3671193
## neighbourhood_cleansedFort Hamilton                .        
## neighbourhood_cleansedFresh Meadows                .        
## neighbourhood_cleansedGerritsen Beach              .        
## neighbourhood_cleansedGlen Oaks                    .        
## neighbourhood_cleansedGlendale                     .        
## neighbourhood_cleansedGowanus                      .        
## neighbourhood_cleansedGramercy                     .        
## neighbourhood_cleansedGraniteville                 .        
## neighbourhood_cleansedGrant City                   .        
## neighbourhood_cleansedGravesend                  -14.9209243
## neighbourhood_cleansedGreat Kills                  .        
## neighbourhood_cleansedGreenpoint                   0.1253550
## neighbourhood_cleansedGreenwich Village           23.5339202
## neighbourhood_cleansedGrymes Hill                  .        
## neighbourhood_cleansedHarlem                     -38.9743215
## neighbourhood_cleansedHell's Kitchen               9.3845906
## neighbourhood_cleansedHighbridge                   .        
## neighbourhood_cleansedHollis                       .        
## neighbourhood_cleansedHollis Hills                 .        
## neighbourhood_cleansedHolliswood                   .        
## neighbourhood_cleansedHoward Beach                 .        
## neighbourhood_cleansedHowland Hook                 .        
## neighbourhood_cleansedHuguenot                     .        
## neighbourhood_cleansedHunts Point                  .        
## neighbourhood_cleansedInwood                     -22.5047353
## neighbourhood_cleansedJackson Heights              .        
## neighbourhood_cleansedJamaica                      .        
## neighbourhood_cleansedJamaica Estates              .        
## neighbourhood_cleansedJamaica Hills                .        
## neighbourhood_cleansedKensington                   .        
## neighbourhood_cleansedKew Gardens                  .        
## neighbourhood_cleansedKew Gardens Hills            .        
## neighbourhood_cleansedKingsbridge                  .        
## neighbourhood_cleansedKips Bay                     .        
## neighbourhood_cleansedLaurelton                    .        
## neighbourhood_cleansedLighthouse Hill              .        
## neighbourhood_cleansedLittle Italy                 .        
## neighbourhood_cleansedLittle Neck                  .        
## neighbourhood_cleansedLong Island City             .        
## neighbourhood_cleansedLongwood                     .        
## neighbourhood_cleansedLower East Side              .        
## neighbourhood_cleansedManhattan Beach              .        
## neighbourhood_cleansedMarble Hill                  .        
## neighbourhood_cleansedMariners Harbor              .        
## neighbourhood_cleansedMaspeth                      .        
## neighbourhood_cleansedMelrose                      .        
## neighbourhood_cleansedMiddle Village               .        
## neighbourhood_cleansedMidland Beach               -2.7175001
## neighbourhood_cleansedMidtown                     44.4721318
## neighbourhood_cleansedMidwood                     -3.7664180
## neighbourhood_cleansedMill Basin                   .        
## neighbourhood_cleansedMorningside Heights        -18.6227968
## neighbourhood_cleansedMorris Heights               .        
## neighbourhood_cleansedMorris Park                  .        
## neighbourhood_cleansedMorrisania                   .        
## neighbourhood_cleansedMott Haven                   .        
## neighbourhood_cleansedMount Eden                   .        
## neighbourhood_cleansedMount Hope                   .        
## neighbourhood_cleansedMurray Hill                  6.8202960
## neighbourhood_cleansedNavy Yard                    .        
## neighbourhood_cleansedNeponsit                     .        
## neighbourhood_cleansedNew Brighton                 .        
## neighbourhood_cleansedNew Dorp Beach               .        
## neighbourhood_cleansedNew Springville              .        
## neighbourhood_cleansedNoHo                         .        
## neighbourhood_cleansedNolita                       .        
## neighbourhood_cleansedNorth Riverdale              .        
## neighbourhood_cleansedNorwood                      .        
## neighbourhood_cleansedOakwood                      .        
## neighbourhood_cleansedOlinville                    .        
## neighbourhood_cleansedOzone Park                   .        
## neighbourhood_cleansedPark Slope                   .        
## neighbourhood_cleansedParkchester                  .        
## neighbourhood_cleansedPelham Bay                   .        
## neighbourhood_cleansedPelham Gardens               .        
## neighbourhood_cleansedPort Morris                  .        
## neighbourhood_cleansedPort Richmond                .        
## neighbourhood_cleansedProspect Heights            19.9941503
## neighbourhood_cleansedProspect-Lefferts Gardens    .        
## neighbourhood_cleansedQueens Village               .        
## neighbourhood_cleansedRandall Manor                .        
## neighbourhood_cleansedRed Hook                     .        
## neighbourhood_cleansedRego Park                    .        
## neighbourhood_cleansedRichmond Hill                .        
## neighbourhood_cleansedRichmondtown                 .        
## neighbourhood_cleansedRidgewood                   -4.5251855
## neighbourhood_cleansedRiverdale                    .        
## neighbourhood_cleansedRockaway Beach               .        
## neighbourhood_cleansedRoosevelt Island             .        
## neighbourhood_cleansedRosebank                     .        
## neighbourhood_cleansedRosedale                     .        
## neighbourhood_cleansedSchuylerville                .        
## neighbourhood_cleansedSea Gate                    -3.0723163
## neighbourhood_cleansedSheepshead Bay               .        
## neighbourhood_cleansedShore Acres                  .        
## neighbourhood_cleansedSilver Lake                  .        
## neighbourhood_cleansedSoHo                        48.2270760
## neighbourhood_cleansedSoundview                    .        
## neighbourhood_cleansedSouth Beach                  .        
## neighbourhood_cleansedSouth Ozone Park             .        
## neighbourhood_cleansedSouth Slope                  .        
## neighbourhood_cleansedSpringfield Gardens          .        
## neighbourhood_cleansedSpuyten Duyvil               .        
## neighbourhood_cleansedSt. Albans                   .        
## neighbourhood_cleansedSt. George                   .        
## neighbourhood_cleansedStapleton                    .        
## neighbourhood_cleansedStuyvesant Town              .        
## neighbourhood_cleansedSunnyside                    .        
## neighbourhood_cleansedSunset Park                 -5.1532333
## neighbourhood_cleansedTheater District             0.5162365
## neighbourhood_cleansedThrogs Neck                  .        
## neighbourhood_cleansedTodt Hill                    .        
## neighbourhood_cleansedTompkinsville                .        
## neighbourhood_cleansedTottenville                  .        
## neighbourhood_cleansedTremont                      .        
## neighbourhood_cleansedTribeca                     52.1090777
## neighbourhood_cleansedTwo Bridges                  .        
## neighbourhood_cleansedUnionport                    .        
## neighbourhood_cleansedUniversity Heights           .        
## neighbourhood_cleansedUpper East Side              .        
## neighbourhood_cleansedUpper West Side              .        
## neighbourhood_cleansedVan Nest                     .        
## neighbourhood_cleansedVinegar Hill                 .        
## neighbourhood_cleansedWakefield                    .        
## neighbourhood_cleansedWashington Heights         -40.1425090
## neighbourhood_cleansedWest Brighton                .        
## neighbourhood_cleansedWest Farms                   .        
## neighbourhood_cleansedWest Village                42.9916118
## neighbourhood_cleansedWestchester Square           .        
## neighbourhood_cleansedWesterleigh                -39.2776101
## neighbourhood_cleansedWhitestone                   .        
## neighbourhood_cleansedWilliamsbridge             -43.5505218
## neighbourhood_cleansedWilliamsburg                21.3550427
## neighbourhood_cleansedWindsor Terrace              .        
## neighbourhood_cleansedWoodhaven                    .        
## neighbourhood_cleansedWoodlawn                     .        
## neighbourhood_cleansedWoodrow                      .        
## neighbourhood_cleansedWoodside                     .        
## latitude                                           .        
## longitude                                         -9.9218856
## property_typeBed & Breakfast                       7.4737464
## property_typeBoat                                  .        
## property_typeBoutique hotel                        .        
## property_typeBungalow                             17.2926351
## property_typeCabin                                 .        
## property_typeCastle                                .        
## property_typeCave                                  .        
## property_typeChalet                                .        
## property_typeCondominium                          14.1720562
## property_typeDorm                                  .        
## property_typeEarth House                           .        
## property_typeGuest suite                           .        
## property_typeGuesthouse                            .        
## property_typeHostel                               -7.6782242
## property_typeHouse                                 .        
## property_typeIn-law                                .        
## property_typeLoft                                 15.2126182
## property_typeOther                                 .        
## property_typeServiced apartment                    .        
## property_typeTent                                  .        
## property_typeTimeshare                            71.7429067
## property_typeTownhouse                             .        
## property_typeTrain                                 .        
## property_typeTreehouse                             .        
## property_typeVacation home                        29.2170179
## property_typeVilla                                 .        
## property_typeYurt                                  .        
## room_typePrivate room                            -54.7248708
## room_typeShared room                             -71.0276434
## accommodates                                      30.7114725
## bathrooms                                         14.4697106
## bedrooms                                          16.5117145
## number_of_reviews                                 -3.0984611
## review_scores_rating                               3.8749745
## instant_bookablet                                 -0.9313157
## neighbourhood_groupBrooklyn                       -6.4798968
## neighbourhood_groupManhattan                      46.3814770
## neighbourhood_groupQueens                          .        
## neighbourhood_groupStaten Island                 -42.4578262
## amenities_count                                    1.6177372
lasso_model$results %>% filter(lambda == lasso_model$bestTune$lambda)
##   alpha   lambda     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
## 1     1 1.747528 77.48486 0.5315526 45.69189 6.581365 0.07319455 1.766787
table_resids <- data.frame(residuals = resid(lasso_model), fitted = fitted(lasso_model))

ggplot(table_resids, aes(x=fitted, y=residuals)) +
  geom_point() +
  geom_hline(yintercept = 0)

Before settling on our final model, we explored two alternatives: GAM and KNN. The GAM does not handle factors well, and the data that we are looking at has many categorical predictors. This can be seen in the GAM plots, which are linear. We tried removing all the categorical predictors and got very poor results. KNN also does not do well with categorical predictors. When we removed them, we got better results than with the GAM, but still not strong ones. We chose LASSO as the final model because it can use all the factors and gave slightly better results than KNN. While the difference in results isn’t large, LASSO is a much simpler model and less computationally expensive than KNN, so the choice between them was clear.
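
LASSO can take in every factor level because its penalty sets many coefficients exactly to zero, which is what the `.` entries in the coefficient output above represent. As a rough sketch of the idea, under the simplifying assumption of standardized, uncorrelated predictors (and up to scaling conventions), the LASSO estimate is the least-squares coefficient pulled toward zero by lambda and cut off at zero. The function name `soft_threshold` and the starting coefficients below are hypothetical; glmnet itself uses a more general coordinate descent algorithm.

```r
# Soft-thresholding: shrink each coefficient toward zero by lambda,
# and set it to exactly zero if its magnitude is below lambda
soft_threshold <- function(z, lambda) {
  sign(z) * pmax(abs(z) - lambda, 0)
}

# Hypothetical unpenalized coefficients (not taken from the fitted model)
ols_coefs <- c(accommodates = 30, bathrooms = 14, latitude = 0.8, superhost = -1.2)

# Apply a penalty of roughly the size of the lambda selected above
shrunk <- soft_threshold(ols_coefs, lambda = 1.75)

shrunk
```

Large coefficients survive with a small reduction, while coefficients smaller in magnitude than lambda are removed from the model entirely.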

We made many different attempts before settling on one approach. These included different combinations of predictors and different tuning parameters. For LASSO, the final set of predictors was one that wasn’t too large and didn’t contain redundant or uninformative variables. For the selection function we chose “best”, which picks the tuning value with the lowest cross-validated error; it also gave a better \(R^2\) and MAE than “oneSE”. For KNN and LASSO we searched a wide range of values for k and lambda, respectively. The values we settled on gave the best cross-validated MAE among those we tried. We used 10-fold cross-validation because 10 folds is a standard choice and because cross-validation evaluates the model on data it was not trained on, which guards against overfitting.
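
To make the tuning procedure concrete, here is a minimal base R sketch of 10-fold cross-validation over a grid of tuning values, followed by the “best” and “oneSE” selection rules. It is an illustration under stated assumptions, not caret's implementation: the data are synthetic, polynomial degree stands in for the tuning parameter (lower degree meaning a simpler model), and the names `toy`, `fold_mae`, and `cv_results` are hypothetical.

```r
set.seed(253)

# Synthetic data (hypothetical stand-in for the listings): a noisy quadratic
n   <- 200
x   <- runif(n, -2, 2)
y   <- 3 + 2 * x - 1.5 * x^2 + rnorm(n, sd = 1)
toy <- data.frame(x = x, y = y)

k       <- 10
degrees <- 1:6                                        # candidate tuning values
fold_id <- sample(rep(seq_len(k), length.out = n))    # assign each row to one fold

# fold_mae[i, d]: MAE on held-out fold i for a polynomial of degree d
fold_mae <- sapply(degrees, function(d) {
  sapply(seq_len(k), function(i) {
    train <- toy[fold_id != i, ]
    test  <- toy[fold_id == i, ]
    fit   <- lm(y ~ poly(x, d), data = train)
    mean(abs(test$y - predict(fit, newdata = test)))
  })
})

# Average the k fold errors for each tuning value
cv_results <- data.frame(
  degree = degrees,
  MAE    = colMeans(fold_mae),
  MAESD  = apply(fold_mae, 2, sd)
)

# "best": the tuning value with the lowest cross-validated MAE
best_row    <- which.min(cv_results$MAE)
best_degree <- cv_results$degree[best_row]

# "oneSE": the simplest model whose MAE is within one standard error of the best
threshold     <- cv_results$MAE[best_row] + cv_results$MAESD[best_row] / sqrt(k)
one_se_degree <- min(cv_results$degree[cv_results$MAE <= threshold])

cv_results
c(best = best_degree, oneSE = one_se_degree)
```

By construction, “oneSE” never picks a more complex model than “best”, which is why it tends to trade a little accuracy for simplicity.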

The cross-validated \(R^2\) for our LASSO model is approximately 0.532 and the MAE is 45.692, so predictions are off by about $46 on average. These are the best results among all our models, and the model is also computationally efficient. We also used the residual plot to evaluate whether the model is wrong. While this is the best model that we found for this data, the residual plot shows that it is not perfect: the residuals are not randomly scattered or balanced above and below the zero line. We would normally hope for a better residual plot, but we accept this one because the alternatives were much worse.
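
For reference, the metrics reported above can be computed directly from observed and predicted prices. The sketch below uses five made-up price pairs (hypothetical values, not model output) and computes \(R^2\) as the squared correlation between observed and predicted values, which we understand to be how caret reports it by default.

```r
# Hypothetical observed and predicted prices, in dollars
obs  <- c(100, 150, 80, 220, 60)
pred <- c(110, 140, 95, 180, 70)

errors <- obs - pred

mae  <- mean(abs(errors))      # average size of a miss, in dollars
rmse <- sqrt(mean(errors^2))   # penalizes large misses more heavily than MAE
r2   <- cor(obs, pred)^2       # share of variation captured, between 0 and 1

c(MAE = mae, RMSE = rmse, Rsquared = r2)
```

Here the MAE is 17 while the RMSE is larger, because the single $40 miss counts more heavily once errors are squared.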







Part 3: Summarize

Many of the predictors did not have a strong correlation with price, as seen in our correlation matrix in Part 1. While the model does not explain all of the variation in price and is not as strong as we hoped, it is still fairly good given the data that we had. There are also many aspects of an Airbnb listing that this dataset cannot capture as predictors, such as the content of a host’s profile picture.

Although there are some difficulties in analyzing this dataset, we can still see that some predictors are strongly associated with price. For example, a shared room, a location in Manhattan, and a timeshare all had large coefficients in the LASSO model (about -71, +46, and +72, respectively). Manhattan is the most expensive borough of New York, and its coefficient reflects that. On the other hand, having to share a room made a listing cheaper. This is an important factor for people choosing where to spend the night in an unfamiliar place. Lastly, a timeshare is more similar to renting an entire house, so it makes sense that this property type is associated with a higher price.

airbnb_sample_complete %>% 
  filter(price < 1000, neighbourhood_group == "Manhattan") %>% 
  filter(room_type %in% c("Shared room")) %>% 
  select(price) %>% 
  summary()
##      price       
##  Min.   : 30.00  
##  1st Qu.: 50.00  
##  Median : 60.00  
##  Mean   : 75.02  
##  3rd Qu.: 80.00  
##  Max.   :200.00
airbnb_sample_complete %>% 
  filter(price < 1000, neighbourhood_group != "Manhattan") %>% 
  filter(room_type %in% c("Shared room")) %>% 
  select(price) %>% 
  summary()
##      price       
##  Min.   :  0.00  
##  1st Qu.: 25.00  
##  Median : 35.00  
##  Mean   : 47.94  
##  3rd Qu.: 48.00  
##  Max.   :500.00
set.seed(253)

sample_n(airbnb_all %>% 
           filter(neighbourhood_group == "Manhattan"), 1) %>% 
  select(price, neighbourhood_group, room_type)
##   price neighbourhood_group       room_type
## 1   290           Manhattan Entire home/apt
set.seed(253)

sample_n(airbnb_all %>% 
           filter(neighbourhood_group != "Manhattan"), 1) %>% 
  select(price, neighbourhood_group, room_type)
##   price neighbourhood_group    room_type
## 1    69            Brooklyn Private room

Looking at example listings in the dataset alongside the quartiles of price inside and outside Manhattan supports the conclusion that the borough of Manhattan strongly affects price. Comparing Manhattan with the other boroughs while holding the room type fixed at shared room, the median price is higher in Manhattan: $60 versus $35 outside of Manhattan. Furthermore, the minimum is $0 outside of the borough and $30 inside. The maximum price runs against this pattern because it is higher outside of Manhattan ($500 versus $200), but the summary output shows that this value is an outlier: it is far above the third quartile of $48.
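
The comparison above leans on the median rather than the mean or maximum, because a single extreme listing can distort the latter two. A small base R sketch with made-up shared-room prices (hypothetical values, chosen only to mimic one $500 outlier outside Manhattan) shows the effect:

```r
# Hypothetical shared-room listings; the $500 listing plays the role of the outlier
toy <- data.frame(
  price = c(60, 80, 50, 35, 25, 48, 500),
  neighbourhood_group = c("Manhattan", "Manhattan", "Manhattan",
                          "Brooklyn", "Queens", "Brooklyn", "Bronx")
)
toy$in_manhattan <- toy$neighbourhood_group == "Manhattan"

# Summaries by group; names are "FALSE" (outside) and "TRUE" (inside Manhattan)
medians <- tapply(toy$price, toy$in_manhattan, median)
means   <- tapply(toy$price, toy$in_manhattan, mean)

rbind(median = medians, mean = means)
```

In this toy example the median is still higher inside Manhattan, while the mean is higher outside it purely because of the one $500 listing.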







Part 4: Contributions

Juliet Kelson and Anael Kuperwajs Cohen both worked on this project equally.
