Cleaning and adjusting Airbnb data for a more accurate analysis
This article is for lovers of data cleaning and wrangling who might be looking for some inspiration on how to improve the accuracy of their analyses and draw meaningful insights that actually speak to the real world.
I recently conducted an in-depth analysis of the Airbnb environment in the city of Napoli, Italy, and published my findings in this article. For that analysis, I used data provided by InsideAirbnb, an open-source project that scrapes data from Airbnb with the objective of bringing data to the debate on how Airbnb is affecting housing around the world. The raw data they provide is incredibly helpful, but I did have to make some adjustments to clean and narrow it down.
In this article, I’ll talk about the behind-the-scenes of that analysis, including some challenges I ran into. You may also want to have a look at the full project on GitHub.
About the data and preliminary cleaning:
The InsideAirbnb website provides three main datasets for each of the many cities it covers around the world.
The main dataset contains a myriad of information about each unique listing in the city present on the platform — a total of 7,151 listings. To begin with, I removed 13 columns I knew I wasn’t going to need, in order to reduce file size and improve query efficiency.
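For those who like to see the code, that step looks roughly like the sketch below. The file name follows InsideAirbnb’s conventions, and the columns shown are illustrative examples rather than the exact 13 I dropped:

```python
import pandas as pd

# Load the main listings dataset (InsideAirbnb ships it as listings.csv)
listings = pd.read_csv("listings.csv")

# Drop columns the analysis won't need, to reduce file size and speed up queries.
# Illustrative subset only, not the exact 13 columns I removed.
cols_to_drop = ["listing_url", "scrape_id", "picture_url",
                "host_url", "host_thumbnail_url", "host_picture_url"]
listings = listings.drop(columns=cols_to_drop)
```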
The second dataset was Calendar. This contained day-by-day calendar information for every listing: whether a date was available, the price set for that date, and the minimum and maximum number of nights allowed. This dataset was key to my analysis, but it had two limitations:
- First, it shows only a snapshot of what the calendars looked like when the data was scraped. For that reason, it was difficult to gauge occupancy numbers for Summer, for example, because the data was scraped on 15th December and most listings wouldn’t be booked for Summer that far in advance.
- Secondly, the availability column only told you whether a date was available or not. For an unavailable date, it was impossible to know whether it had been booked by a guest or blocked by the host. I ran an analysis of Earnings based on this data (unavailable dates x price, put simply), so I had to apply a rationale here to avoid counting dates blocked by the host — more on this later.
Finally, the third dataset was Reviews, which, as the name suggests, lists every review left on every listing, including dates. I didn’t use this dataset much in this instance, but I see potential for analysing guest sentiment based on the words used.
Data exploration: narrowing down the data to avoid skewed information
As mentioned before, the data included every listing in the city of Napoli, covering properties of various sizes and prices, in both the historical centre and neighbouring areas.
Whilst doing research on the pressing issues in Napoli and conducting some data exploration, I felt the need to further narrow down these listings in terms of region, listing activity, and size of property.
Region:
I spend a great deal of time in Napoli and know the city fairly well. For context, the city has one major touristic area: the historical centre. There were listings outside this area (1,064 of the 7,151 listings, or roughly 15%), but the touristic area will certainly have a different dynamic from the peripheral areas. Therefore, in order to analyse only listings in the same context, I narrowed the listings down to the historical centre, as per the Unesco definition, which includes the neighbourhoods of San Lorenzo, Pendino, San Ferdinando, Chiaia, San Giuseppe, Montecalvario, Avvocata, Stella, San Carlo all’Arena, Vomero, Vicaria, Posillipo and Mercato.
This left me with 6,087 listings in total.
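In pandas, that filter is essentially a one-liner. Here’s a minimal sketch, assuming InsideAirbnb’s standard neighbourhood_cleansed column and that the dataset spells the neighbourhood names as listed above:

```python
# Neighbourhoods that make up the historical centre, per the Unesco definition
historical_centre = [
    "San Lorenzo", "Pendino", "San Ferdinando", "Chiaia", "San Giuseppe",
    "Montecalvario", "Avvocata", "Stella", "San Carlo all'Arena",
    "Vomero", "Vicaria", "Posillipo", "Mercato",
]

# 'neighbourhood_cleansed' is InsideAirbnb's standardised neighbourhood column
centre = listings[listings["neighbourhood_cleansed"].isin(historical_centre)]
```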
PS: I’m aware that parts of Vomero and Chiaia wouldn’t be considered the historical centre in practical terms, but I was unable to break down location information more accurately. The vast majority of listings are not in these areas, though, so I doubt this would have had a massive effect on the data.
Listing Activity:
During the data exploration stage, I ran an analysis of the number of new hosts joining the platform each year and noticed a pronounced peak between 2015 and 2019, with an equally pronounced dip corresponding to the pandemic period. This led me to consider whether some hosts may have decided to leave the platform but never deleted their listings.
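That exploration boiled down to counting first-time hosts per year, something like this sketch (assuming the host_id and host_since columns from the listings file):

```python
# Count hosts by the year they joined the platform
listings["host_since"] = pd.to_datetime(listings["host_since"])
first_joined = listings.groupby("host_id")["host_since"].min()  # one join date per host
new_hosts_per_year = first_joined.dt.year.value_counts().sort_index()
print(new_hosts_per_year)  # peak around 2015-2019, dip during the pandemic
```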
In order to analyse how many inactive listings we had, I first had to define what “inactive” meant. With the data scraped on 15th December 2021, I made the arbitrary but reasonable decision to focus my analysis on listings that had received at least one review in the six months prior to the scrape. If a property hadn’t been reviewed in the whole second half of 2021, it was very likely an inactive listing. I could have analysed the Calendar data more deeply and even looked at response and acceptance rates, but I felt that would be wasted energy.
Therefore, from the 6,087 listings we had after narrowing down the region, we are left with 3,067 active listings in the historical centre of Napoli. This cut the number of listings roughly in half, but I’m convinced we’re only looking at genuinely active listings here.
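In code, the activity filter looks something like this, continuing from the earlier snippets (last_review is InsideAirbnb’s column for the date of a listing’s most recent review):

```python
# A listing counts as active if it received at least one review in the
# six months before the scrape date (15 December 2021)
scrape_date = pd.Timestamp("2021-12-15")
cutoff = scrape_date - pd.DateOffset(months=6)

centre = centre.copy()
centre["last_review"] = pd.to_datetime(centre["last_review"])
# Listings with no reviews at all (NaT) also drop out here, as intended
active = centre[centre["last_review"] >= cutoff]
```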
Size of property:
One of the joys of data exploration is figuring out what could potentially skew the numbers and turn what could be an insightful analysis into a compilation of numbers with no connection to the real world. When looking at the different property sizes and their dynamics, I couldn’t find any noticeable differences in terms of location or review score. However, one interesting thing about the average price caught my attention:
The data showed me that the average prices of 1-bedroom and 2-bedroom flats were not too dissimilar. There was a slight hike for 3-bedroom properties, and the bigger properties sat in another pricing tier altogether. I also ran an analysis on private-room listings and found they were priced similarly to 1-bedroom and 2-bedroom flats. So, in order to do a fair price analysis without the bigger properties skewing the numbers, I decided to focus solely on 1-bedroom flats, 2-bedroom flats and private rooms.
These amounted to around 85% of active listings in the historical centre and left us with a total of 2,588 listings for the analysis.
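A rough sketch of that final filter, using InsideAirbnb’s room_type labels (“Entire home/apt” and “Private room”) and the bedrooms column:

```python
# Keep only 1- and 2-bedroom entire flats, plus private rooms
small_flat = (active["room_type"] == "Entire home/apt") & active["bedrooms"].isin([1, 2])
private_room = active["room_type"] == "Private room"
final = active[small_flat | private_room]
```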
Challenges in the analysis
Letting the data dictate the course of the analysis
After I narrowed down the data, I was expecting to find a range of individual hosts, some with several properties but not too many, and others with a single property. Perhaps I would find that hosts were just taking the piss with the prices and offering a poor experience — which matches my overall experience in Napoli. What I actually discovered was the dominance of Airbnb Management Companies (AMCs) with aggressive pricing strategies who, judging by their review scores, were not providing great experiences.
Therefore, my analysis ended up becoming a host segmentation project: grouping hosts with similar characteristics and highlighting their patterns, with a big focus on these AMCs. I’m not going to go into detail about my findings; there’s a whole article dedicated to that. But it was interesting how the data led me in a completely different direction and challenged me to work around its limitations in order to draw useful conclusions.
Defining occupancy rates
As I mentioned, one of the data limitations was related to Calendar availability. Defining occupancy was complex because the unavailable dates in the data could have been either booked by guests or blocked by the host. And since the Calendar is really a snapshot of the day the data was scraped, I knew my occupancy figures wouldn’t be super accurate. In my preliminary analysis, some hosts already showed 70% occupancy for 2022 in December 2021, which seemed very unlikely.
The solution I found was to get the average occupancy per host and remove any hosts above that average. In effect, any host with more than 120 unavailable dates in 2022 was removed from the analysis. In hindsight, I could have done this at listing level — removing listings with over 120 days of unavailability, not hosts. But as I was segmenting hosts, I wanted to ensure I was not including potentially inactive hosts with calendars fully blocked for all or most of their listings.
Then, since my occupancy numbers were not exactly accurate in terms of how many days had actually been booked — but correct in terms of proportion — I decided to create an occupancy index by dividing each host’s unavailable days by the 369 days covered by the data. I ended up with a number that on its own didn’t mean much, but that could be used to compare occupancy levels amongst different hosts.
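Putting the threshold and the index together, the logic looks roughly like this. It’s a sketch rather than my exact code; the 'f' flag for unavailable dates and the listing_id column follow InsideAirbnb’s calendar file format:

```python
# Count unavailable days per listing in the calendar snapshot
calendar = pd.read_csv("calendar.csv")
unavailable = calendar[calendar["available"] == "f"]  # 'f' marks unavailable dates
days_per_listing = unavailable.groupby("listing_id").size()

# Average unavailable days per host (listing -> host mapping from the listings file)
listing_to_host = final.set_index("id")["host_id"]
days_per_host = days_per_listing.groupby(listing_to_host).mean()

# Drop hosts above the overall average (about 120 unavailable days in my data),
# then index the remainder against the 369 days covered by the snapshot
kept_hosts = days_per_host[days_per_host <= days_per_host.mean()]
occupancy_index = kept_hosts / 369
```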
Figuring out a way to calculate Earnings
A big portion of my analysis ended up centred around Earnings. I wanted to check whether these AMCs were actually taking a large portion of the Airbnb income in Napoli, even though they had fewer listings than other groups.
Again using the Calendar dataset, I simply added up the rates for the days marked as unavailable — with the occupancy threshold above still applying. The numbers I ended up with were of course not precise, given that the data was a snapshot of a point in time, but the proportions of Earnings could still be understood. In my analysis, I made sure not to say a host was making x amount of euros; instead, I focused on percentages.
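A simplified sketch of that earnings logic, continuing from the occupancy snippet above (InsideAirbnb stores calendar prices as strings like "$85.00", hence the cleaning step):

```python
# Nightly rate x unavailable day = estimated revenue for that day.
unavailable = unavailable.copy()
unavailable["price"] = (unavailable["price"]
                        .str.replace("[$,]", "", regex=True)
                        .astype(float))

earnings_per_listing = unavailable.groupby("listing_id")["price"].sum()
earnings_per_host = earnings_per_listing.groupby(listing_to_host).sum()

# Apply the occupancy threshold from earlier and work with shares, not euros
earnings_per_host = earnings_per_host.reindex(kept_hosts.index).dropna()
earnings_share = earnings_per_host / earnings_per_host.sum()
```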
This is another example of how, even when you can’t reach precise absolute numbers in your analysis, you can still work with proportions and understand the dynamics of, say, a segmentation project.
Hope you guys enjoyed this behind-the-scenes look at my Airbnb analysis and that it inspired you to overcome problems in the data you’re analysing now or in the future. If you think something in my analysis could have been improved, do let me know. I’m always looking for innovative ways to work around data limitations.