Data Scraping and Visualisation

Planning

Justification for Tools

We used Netlytic and Gephi as our only tools to gather and analyse our quantitative data.

As a group, we decided to use Twitter and Instagram to examine how hashtags about Thailand and travel relate to one another and how people share information and experiences on these platforms. The results helped us analyse a large amount of data and gave us an understanding of the hashtags used and the relationships between the people who used them.

Netlytic collects large amounts of data for our research: which accounts post, what content they post, what network they are on and how they engage with others. Once Netlytic has gathered the data, it can be downloaded as an Excel spreadsheet that shows, among other things, how many followers an account has, where it is located and how many comments and likes it received. The tool saved us time and effort and delivered the data in a comprehensible format. Netlytic’s biggest disadvantage is that it cannot search for two keywords at once on Instagram, which prevented us from scraping the data we wanted.

We then uploaded the file from Netlytic into Gephi, which allowed us to visualise this large amount of data easily. Gephi offers many customisations when visualising a network, and each layout can reveal something different about the relationships within it. Gephi also has preset filters, such as Giant Component, that make large networks more comprehensible. However, while the complex settings allow for extensive customisation, they can be difficult to adjust to get the desired result.

Justification of Platforms

Instagram

We selected two social media networks to gather information from: Instagram and Twitter. Instagram is a popular social media network on which users can share photos and short videos, as well as stories that their followers can view for 24 hours. Users range from average individuals and celebrities to large corporations, and Instagram lets them share images with a description or express their thoughts and opinions. Other users and their “followers” can also “like” and comment on these posts. The demographic of this platform is younger than that of Twitter.

Since Instagram is more visual and personal than most platforms, users are likely to prefer it for sharing their holiday photos. For this reason, lifestyle bloggers, especially travel bloggers, make full use of the site to engage with their followers. Given the nature of the platform, we expected to find many influencers posting about their travels in Thailand.

Twitter

The second platform we used, Twitter, was also hugely beneficial for gathering data on our subject. Twitter allows its users to share their thoughts with the public and their followers through short messages called “tweets”. Like Instagram, Twitter users include average individuals, celebrities and large corporations. Twitter also makes use of hashtags, which allowed us to work out which hashtags are relevant to our subject. This is how we quickly found what people say about travelling in Thailand.

Twitter gave us a clearer view of exactly what people are currently writing about their experience in Thailand. Unlike Instagram, it is largely information based, with a focus on sending out information quickly. Twitter limits posts to a maximum of 140 characters per tweet, keeping things concise, fun and exciting. This also helped reduce the amount of data we had to process.

Justification for keywords

We used “Thailand” and “travel” as the keywords when conducting research on Twitter. Although travelling in Thailand is widely discussed, there is no specific hashtag for it, as it is not a sudden trend but rather a constant conversation. We decided to keep the keywords simple and broad in order to gather as much data as we could. To distinguish travel tweets from general discussion about the country, we added “travel”.

We used more hashtags on Instagram, as we managed to find many related to our topic. Initially, we wanted to find posts containing both a popular travel hashtag, such as #instatravel or #travelgram, and a Thailand-related hashtag, as our preliminary research showed that users tend to combine them. However, due to the limitations of Netlytic, we could not use two keywords. This led us to examine accounts with a large following that had been to Thailand and see which hashtags they used. We found #explorethailand with nearly 80 000 posts, #thailandonly with 330 000 posts and #amazingthailand with 800 000 posts. Further research showed that #amazingthailand is a slogan initiated by the Tourism Authority of Thailand, while #thailandonly is likely related to a Thai movie of the same name. We considered scraping data from #landofsmiles as well; however, this keyword returned only 47 000 posts. Apart from these, we tried different combinations of “travel” and “Thailand”, including #travelthailand with 180 000 posts and #thailandtravel with 50 000 posts.
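One possible workaround for the two-keyword limitation is to scrape a single hashtag and filter the downloaded posts afterwards. The snippet below is a minimal sketch of that idea and not something we ran for this project. The sample posts, the two tag sets and the function names are made up for illustration, and it assumes the post text is available as plain strings.

```python
import re

# Matches a hashtag and captures the word after the "#".
# Because "#" is not a word character, unspaced runs such as
# "#travelgram#amazingthailand" are split into separate tags.
HASHTAG_RE = re.compile(r"#(\w+)")

# Hypothetical tag sets, taken from the hashtags named in the text.
TRAVEL_TAGS = {"instatravel", "travelgram"}
THAILAND_TAGS = {"amazingthailand", "explorethailand", "thailandonly"}


def hashtags(text):
    """Return the set of lowercase hashtags found in a post."""
    return {tag.lower() for tag in HASHTAG_RE.findall(text)}


def has_both(text, travel_tags=TRAVEL_TAGS, thailand_tags=THAILAND_TAGS):
    """True if the post uses at least one tag from each set."""
    tags = hashtags(text)
    return bool(tags & travel_tags) and bool(tags & thailand_tags)


# Made-up sample posts, for illustration only.
sample_posts = [
    "Sunset in Krabi #travelgram #amazingthailand",
    "Street food tour #amazingthailand",
    "Weekend in Lisbon #instatravel",
]
combined = [post for post in sample_posts if has_both(post)]
print(combined)
```

Only the first sample post is kept, because it is the only one that carries a hashtag from both sets.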

Connection to research question

With Netlytic, we want to answer our questions about emerging themes, the most used devices and the best time to post. With the aid of Gephi, we can also identify the main influencers and their networks in the conversation. From the content of the posts collected, we can identify the most used words and draw conclusions on common themes such as travel-related activities, destinations and the general feelings of travellers. Data from Twitter also tells us which devices people post from, whereas Instagram users post almost entirely from their phones. We will also be able to identify the time slot into which most posts fall and, in turn, suggest the best time to engage with the audience, as this is when they are most active.

Collecting the data

We collected data over two periods: the first from May 6th to May 9th 2017 and the second on June 9th 2017. During the first period, we collected two sets of data over three days via Netlytic, one for Twitter and one for Instagram. For this preliminary research, we used only two basic keywords, “travel” and “Thailand”. Following this, we tried collecting more data from Instagram using different search words but met with great difficulties: our Instagram data sets were contaminated with Twitter data, generating false results. We did not spot the mistake early enough, which resulted in a second attempt on June 9th. Here, we collected 2500 posts for each new keyword in one day.

Platform    Collection date   Keywords                  Volume   Nodes
Twitter     May 9th 2017      “travel” and “Thailand”   2500     1466
Instagram   June 9th 2017     #amazingthailand          2500     756
Instagram   June 9th 2017     #travelthailand           2500     701
Instagram   June 9th 2017     #Thailandtravel           2500     584
Instagram   June 9th 2017     #explorethailand          2500     679
Instagram   June 9th 2017     #Thailandonly             2500     725

To help answer the question of emerging themes, we also used keyword categories to identify the topics talked about most. Apart from the basic categories provided by Netlytic, we added two of our own to best describe the content of the posts:

  • Destinations: Ao Nang; Ayutthaya; Bangkok; Chiang Mai; Chiang Rai; Ko Samui; Ko Tao; Krabi; Pattaya; Phuket
  • Activities: boat; climb; club; elephant; festival; kayak; market; meditation; palace; swim; temple; trek; tuk tuk
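To illustrate how such categories turn into counts, here is a minimal sketch that tallies keyword mentions per category. It is not how Netlytic works internally, which we do not know; it simply uses case-insensitive substring matching, so “temples” also counts towards “temple”. The sample posts are made up.

```python
from collections import Counter

# The two custom categories listed above, in lowercase for matching.
CATEGORIES = {
    "destinations": [
        "ao nang", "ayutthaya", "bangkok", "chiang mai", "chiang rai",
        "ko samui", "ko tao", "krabi", "pattaya", "phuket",
    ],
    "activities": [
        "boat", "climb", "club", "elephant", "festival", "kayak", "market",
        "meditation", "palace", "swim", "temple", "trek", "tuk tuk",
    ],
}


def count_categories(posts, categories=CATEGORIES):
    """Count keyword mentions per category across all posts.

    Uses plain substring matching on lowercased text (an assumption
    made for this sketch, not Netlytic's documented behaviour).
    """
    totals = Counter()
    for post in posts:
        text = post.lower()
        for category, keywords in categories.items():
            totals[category] += sum(text.count(word) for word in keywords)
    return totals


# Made-up sample posts, for illustration only.
sample_posts = [
    "Temple hopping in Bangkok, then a boat to Ko Tao",
    "Chiang Mai night market",
]
totals = count_categories(sample_posts)
print(totals["destinations"], totals["activities"])
```

Substring matching is deliberately simple; it will over-count in cases such as “clubbing” for “club”, so a stricter word-boundary match may be preferable for real data.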

Visualisation and interpretation

We used both Netlytic and Gephi for our data visualisation. Netlytic’s online analysis tool was used for visualising keywords, while Gephi was used for visualising the network. In Gephi, we used two sets of colours when partitioning the nodes based on modularity. Due to the large amount of data gathered and the number of nodes identified, we set the resolution to 5 instead of the default 1. The first set consists of 8 colours for the top nodes, while the second uses bright blue and pink to better visualise the top two nodes. We also ranked the nodes on two factors: size, varying from 4 to 40, and label font size, varying from 1 to 5. We also wanted to visualise the strength of connections; therefore, we ranked edges based on weight. When adjusting the layout, we used a combination of Force Atlas, Force Atlas 2, Fruchterman Reingold, Contraction and Label Adjust.

Twitter – “travel” and “Thailand”

For analysing Twitter, we used only Netlytic and not Gephi, due to complications in exporting the network file. We were, however, able to identify the key players based on their network size and in-degree. The ones with the most direct connections are nytimes, truereporterau, blondtravels and travelingbytes.

The network generated is dense and highly interconnected. This means that many people are talking with each other, and the network diameter stays low despite the large number of nodes.
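For readers unfamiliar with these measures, the following minimal sketch shows how in-degree, density and diameter can be computed from a list of directed edges. It uses textbook definitions, which may differ in detail from what Netlytic or Gephi report, and the edge list is a made-up toy example rather than our data.

```python
from collections import defaultdict, deque


def in_degrees(edges):
    """Number of incoming edges per node (who gets mentioned or replied to)."""
    counts = defaultdict(int)
    for source, target in edges:
        counts[source] += 0  # make sure every node appears in the result
        counts[target] += 1
    return dict(counts)


def density(edges):
    """Share of possible directed edges that actually exist."""
    nodes = {node for edge in edges for node in edge}
    n = len(nodes)
    if n < 2:
        return 0.0
    return len(set(edges)) / (n * (n - 1))


def diameter(edges):
    """Longest shortest path, ignoring direction and unreachable pairs."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    longest = 0
    for start in list(neighbours):
        # Breadth-first search gives shortest distances from `start`.
        distance = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in distance:
                    distance[nxt] = distance[node] + 1
                    queue.append(nxt)
        longest = max(longest, max(distance.values()))
    return longest


# Toy edge list (hypothetical accounts): three users mention "hub",
# and one of them also mentions "d".
edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("c", "d")]
print(in_degrees(edges)["hub"], density(edges), diameter(edges))
```

In this toy network, “hub” has the highest in-degree, which is the sense in which we call an account a key player.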

Further inspection of the file produced by Netlytic showed that the following ten accounts have the most followers:

  1. Nytimes – 35936129
  2. Nytimestravel – 1095932
  3. MailOnline – 956592
  4. CDCgov – 795009
  5. BestEarthPix – 632283
  6. nationnews – 495551
  7. exploretravel1 – 311042
  8. exploreVSCO – 305543
  9. nytimevideo – 273069
  10. alhanda – 222573

The slideshow of top influencers on Twitter shows news outlets posting safety information about Thailand, which fits the concern raised in our preliminary research. The NY Times posted an article about travelling in Thailand that attracted many people and was retweeted often. The other influencers are travel bloggers visiting Thailand, with the exception of one local.

Top 50 most used words – created with Netlytic’s word cloud

In the top 50 most used words that Netlytic provided, backpacking emerged as a strong theme. Many people travelling in Thailand are backpacking through the country, so this is expected. Furthermore, Chiang Mai made it onto the list, in line with our trend analysis indicating rising interest in the city.

The data showed a variety of mentions: destinations were the most talked about category regarding Thailand, followed by activities and feelings. Among the destinations, Bangkok was the most popular, followed by Phuket and Chiang Rai. We did not expect Chiang Rai to surpass Chiang Mai.

Among the activities mentioned, temple was the most common, with many travellers visiting the temples Thailand has to offer. Lastly, the feelings expressed were mostly very positive, with words such as nice, great, happy and good.

Figure 1: Most used devices on Twitter

We used data collected from Netlytic to create the chart shown in Figure 1. It shows that people post from mobile phones the most, while posting from the web ranks third at over 300 entries. The other sources are platforms that allow scheduled posting. Users of these platforms are likely news outlets or professional content creators.

Figure 2: Distribution of posting time based on 2500 posts

To answer one of our sub-questions, we took the timestamps of 2500 tweets and created the scatter graph shown in Figure 2, with the y-axis showing the time in the Netherlands. It can be clearly observed that few posts fell between 3 and 9 p.m., indicating the least active time. Based on this graph, the best times to post are around 11 a.m. and 1 a.m.
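The same question can also be approached by counting posts per local hour. The sketch below is a minimal illustration under stated assumptions: it assumes the timestamps are exported in UTC in the format `YYYY-MM-DD HH:MM:SS`, and it uses a fixed GMT+1 offset to match the offset quoted in our overall findings. The three timestamps are made up.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Assumption: a fixed GMT+1 offset, with no daylight-saving adjustment.
LOCAL_TZ = timezone(timedelta(hours=1))

# Assumption: the export stores timestamps in this format, in UTC.
STAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def posts_per_hour(utc_stamps, tz=LOCAL_TZ):
    """Count posts by local hour of day (0-23)."""
    hours = Counter()
    for stamp in utc_stamps:
        utc_time = datetime.strptime(stamp, STAMP_FORMAT).replace(tzinfo=timezone.utc)
        hours[utc_time.astimezone(tz).hour] += 1
    return hours


# Made-up timestamps, for illustration only.
sample_stamps = [
    "2017-05-09 10:05:00",
    "2017-05-09 10:45:00",
    "2017-05-09 23:30:00",
]
hourly = posts_per_hour(sample_stamps)
print(hourly.most_common(1))
```

With real data, the hours with the highest counts would point to the busiest posting times and the hours with the lowest counts to the quietest ones.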

Instagram – #amazingthailand

Top 50 most used words – created with Netlytic’s word cloud

No clear themes emerged for the #amazingthailand hashtag. We did, however, identify the hashtags these users most often combined with it: #thailand, #travel, #travelgram, #bangkok and #instatravel.

The graphs produced from the data showed that destinations are the most commonly mentioned category, with Bangkok accounting for the majority. Activities are mentioned less here, in contrast with the trend found on Twitter. Finally, more positive feelings came up this time, as well as references to appearance.

The graph shows many disconnected clusters. The network diameter is four with a node count of 756, meaning that the network is not well connected. Connections do not run in many directions; instead, the nodes cluster around a few big groups.

Many large hubs were identified. The accounts badpassport and ourfamilypassport are the only ones with strong edges, and they are sparsely connected to a few other networks. Moreover, the accounts vannesstwin and benpstanton were not connected to anyone else despite having a high degree rank. This suggests that while their followers sought out their content, others talking about Thailand were not interested in them.

The creator of the hashtag, the Tourism Authority of Thailand (@thailandinsider), is not strongly represented in the visualisation, as it has a lower modularity rank and degree rank than other accounts. Perhaps this calls for better engagement from the Tourism Authority.

Instagram – #travelthailand

Top 50 most used words – created with Netlytic’s word cloud

Some of the most commonly associated hashtags we identified with Netlytic’s word cloud are #thailand, #travelphotography, #travel and #amazingthailand. Many positive descriptive terms were also found, such as #instagood, #beautiful and #cool. Phuket and Bangkok were also mentioned, which is expected as the two are the best-known destinations in Thailand.

In this graph, feelings were mentioned surprisingly more often than before, followed by destinations, appearance and activities. Chiang Mai ranked relatively low, despite being a rising destination in Thailand. The top activities were temple, elephant and boat, which makes sense given that Thailand is renowned for its beautiful temples, its elephant rides and sanctuaries, and its boat rides. Temples might be the most photogenic subject on Instagram for Thailand. The high scores for elephant and boat also show that users are interested in exotic and adventurous activities while in Thailand.

Most of the influencers identified were bloggers, and once again badpassport and ourfamilypassport came out with the best rank. These accounts held the top modularity score and the top degree score. The bloggers posting here are travelling in Thailand, with gezmelerdeyim being the only major account not from the U.S., Europe or Australia.

The network also proved to be quite sparse. A high proportion of the members sit many steps away from one another, and many are not connected at all. Many grey nodes are scattered towards the edges or simply stranded. Furthermore, the results showed clearly defined groups. Large hubs are clearly visible, but they are not communicating with each other, leaving the network disconnected with no focused direction.

Instagram – #thailandtravel

The associated hashtags remain largely the same as for the previous hashtags.

Feelings were again mentioned most in this graph, and they remained largely positive. Bangkok and Phuket reappeared as the top destinations, consistent with our preliminary research. The most used descriptive word, by an overwhelming majority, was beautiful. On closer inspection, the bad feelings that came up were words like “jealous” from other users; they describe not a bad experience but other people’s envy of the poster’s time in Thailand.

This network is less sparse than the previous two, suggesting stronger connections, although it still has a fair number of outliers. The network diameter is three with a node count of 500. Despite the network having fewer nodes, this can still be viewed as a good score.

Furthermore, the accounts wanderingtooth and the_travel_hub have a high modularity score. As @the_travel_hub is a community account for travellers, it makes sense that it is strongly connected to its followers and not to others. Some high-degree accounts observed are thelifeofjord and maytrara; their accounts provide information that their followers tend to seek.

In the graph, large hubs are clearly visible, but the grouping is not clearly defined. The account crystaldivekohtao has thick edges, which indicates strong connections. Closer inspection showed that the account belongs to a business, which suggests that it is doing well at communicating with people on the platform.

Instagram – #Thailandonly

Top 50 most used words – created with Netlytic’s word cloud

Some of the hashtags identified with Netlytic were #thailand_allshots and #amazingthailand. Many positive descriptive words were also seen again, such as #amazing, #nice and #beautiful.

Feelings are once again talked about most, followed by destinations, appearance and then activities. Among destinations, Pattaya has been mentioned often, although this is not reflected in our research. The city hosts many temples, which might be connected to temple also being the most mentioned activity.

In this graph, the network appears sparse, with many smaller groups of grey nodes scattered around. Once again, the accounts badpassport and ourfamilypassport reappear with a high ranking for degree and modularity. Large hubs are clearly visible, and some connections between them can be seen. Otherwise, there does not seem to be much connection between accounts. For instance, the account minto_ong ranked well in terms of degree but showed no connections to other networks.

Instagram – #explorethailand

Top 50 most used words – created with Netlytic’s word cloud

The data collected for this hashtag was affected by unspaced hashtags, which created long strings of hashtags that were probably repeated through conversations. Besides that, #thailand, #amazingthailand, #travel and #travelgram can be seen, but with relatively low frequency, suggesting a weak connection with the #explorethailand hashtag.

Feelings once again took the top spot for most mentioned, followed by destinations, activities and appearance. Within the activity category, club was the most mentioned for the first time, making it an anomaly compared with the datasets from the other hashtags. It was followed by temple, elephant and market.

The graph Gephi produced is very sparse. On the other hand, the connections between subgroups are strong, as visualised by the weighted edges. A large number of outlying nodes are scattered around the outside, indicating how sparse this network is.

The accounts eternal_lifeforce, wanderingtooth and alwayzdope are those with higher modularity, as visualised by bright pink and blue in the graph. The account alwayzdope appears to have strong connections with its nodes, as demonstrated by the thick edges, meaning it is a highly interactive account. The high-degree accounts observed were hownottotravellikeabasicbitch, csventure and eternal_lifeforce. These accounts, although they are large hubs, show little connection with each other and are isolated.

Overall findings

Our overall findings showed some connections and similarities to our trend analysis and literature review. As expected, Twitter proved to be more information based, with people having more conversations on the site, as shown by its denser and more connected network. We learned that the accounts with the most followers within the network are news sources. We found top posts from two news sites warning people of dangers they may face in Thailand, which was consistent with our literature review, as we expected some concerns over safety. However, these news sources, with the exception of the NY Times, are much less connected to other accounts than travel bloggers are. This suggests that the bloggers are the ones being sought out. On Twitter, Chiang Mai was mentioned quite frequently, which is consistent with our trend analysis indicating rising interest in the city. However, we did not expect Chiang Rai to have a higher frequency. Bangkok and Phuket were still the most talked about destinations. Overall, based on the keyword categories, the content found on Twitter is fairly positive and descriptive.

On Instagram, we also found travel bloggers to have the biggest influence in terms of connectivity, in-degree and following. We also found that they made most of their posts while travelling through Thailand. Interestingly, the Tourism Authority of Thailand, despite its large following on Instagram, has considerably less influence on users of the hashtags than travel bloggers do. This suggests that not many people sought it out or talked to it. The content of the posts was less descriptive than the Twitter content and focused largely on feelings. Most users have had a positive experience in Thailand. Lastly, the networks on Instagram appear sparse, showing that not many people are communicating with each other. It would therefore be best for companies to target accounts with a large following directly rather than accounts that are well connected.

We also found that Twitter users post from their phones twice as often as from web browsers. The worst time frame to post (GMT+1) is between 3 and 11 p.m. This might be working time for people in America or close to bedtime for people in Asia.

Limitations

We faced several limitations that impeded our research. Firstly, data should have been collected over a longer period so that we could gain a broader perspective on the subject. If we had not contaminated our data, we would have been able to produce more meaningful results. We also discovered several spam accounts that affected our data slightly, such as @amazingthailand, a feature account that charges money.

Furthermore, Netlytic is unable to interpret photo content (it can only show which filters were applied to the images), which matters more for Instagram as a photo-based platform. However, not many programmes are capable of this yet.

Moreover, too many retweets were collected, which may have affected our word categories and word counts, and means we are seeing less original content. Lastly, many destinations in Thailand have several spelling variations (e.g. Ko Samui or Koh Samui), which affected our destination category.
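Both of these issues could be reduced with a small cleaning step before counting words. The snippet below is a minimal sketch of that idea and was not part of our analysis. It assumes retweets can be recognised by the conventional “RT @” prefix in the tweet text, and the variant mapping is a hypothetical starting list that would need extending for real data. The sample tweets are made up.

```python
# Hypothetical mapping from spelling variants to the spelling used in
# our destination category; extend as further variants are found.
SPELLING_VARIANTS = {
    "koh samui": "ko samui",
    "koh tao": "ko tao",
}


def normalise(text):
    """Lowercase the text and replace known spelling variants."""
    text = text.lower()
    for variant, canonical in SPELLING_VARIANTS.items():
        text = text.replace(variant, canonical)
    return text


def is_retweet(text):
    """Assumption: retweets start with the conventional 'RT @' prefix."""
    return text.lstrip().upper().startswith("RT @")


# Made-up sample tweets, for illustration only.
sample_tweets = [
    "RT @traveller: Koh Samui is stunning",
    "Koh Samui sunrise",
    "Ferry to Ko Tao",
]
originals = [normalise(tweet) for tweet in sample_tweets if not is_retweet(tweet)]
print(originals)
```

Dropping retweets keeps only original content, and normalising the spellings means “Koh Samui” and “Ko Samui” count towards the same destination.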
