Hierarchical Diffusion Of An Innovation

1. Introduction

Spatial dependence is the propensity of nearby locations to influence each other and to possess similar attributes (Goodchild 1992). It is one of the key concepts for the understanding and analysis of spatial phenomena. The concept of spatial dependence is essential for recognizing spatial autocorrelation and is also important to studying the diffusion of information. Early studies of information diffusion looked into the spatial relationships among places where information had spread. By reviewing and mapping the patterns of diffusion of agricultural innovations, Hägerstrand (1967) described spatial diffusion as governed by the flow of inter-personal information and suggested three diffusion stages: (a) an initial stage with local concentrations of early acceptances; (b) a second stage showing radial outward dissemination of adoptions, a rise of secondary agglomerations, and continued growth of initial agglomerations; and (c) a saturation stage. Research by Hägerstrand noted that spatial dependency in diffusion included two phases of the phenomenon: hierarchical diffusion and the neighborhood effect. The neighborhood effect assured outward spread in space from an initial point, while hierarchical spread added new spatial structures as additional geographic regions joined the base source of information spread.

New technologies have reshaped the diffusion of innovation and information. The advance of Information Communication Technology (ICT) and the emergence of the Internet not only changed the way people interact with others but also led to the notion of the death of distance (Cairncross 1997). While the statement was more about the new era of international economic development and trade, it does provide a preview of how social media changed the speed and the scale of information sharing in the early 21st century. Social media is a communication platform built upon mobile and web-based technologies that have created highly interactive means through which individuals and communities share, co-create, discuss, and modify user-generated content (Kaplan and Haenlein 2010; Kietzmann et al. 2011). The abundant user-generated content on social media and its real-time interaction capability have introduced substantial and pervasive changes to communication between communities and individuals. Furthermore, its low-cost data accessibility and Application Programming Interfaces (APIs) open up opportunities for scientists to study individual behavior in real time in a way that is both fine-grained and massively global in scale (Lazer et al. 2009; Leetaru et al. 2013; Tsou 2015).

Using social media such as Twitter, the usually private and fleeting communication between individuals can now be captured, and information sharing is no longer limited by barriers of location and time. Cairncross's statement seems to be valid in this instance, which suggests that spatial dependence is not a factor in the diffusion of information. However, how the spatial dependence of regions in physical space shapes information diffusion processes in virtual space has not yet been fully investigated. Social media-based diffusion research has mostly focused on the interactions between users in network space. Although some studies included the temporal dimension of how these interactions change through time (Kossinets and Watts 2006; Lewis et al. 2008), few have thoroughly examined the impacts of the spatial dimension. Moreover, the modeling of information diffusion has not included spatial dependence. Existing diffusion models, e.g. the Linear Threshold model (Granovetter 1978) and the Independent Cascade model (Goldenberg, Libai, and Muller 2001), are built on node-edge graph structures where the distance between actors is the geodesic distance, i.e. the shortest path between two nodes. This distance has been suitable for graph modeling purposes; however, it ignores the fact that human activities happen at a specific location in geographic space. Understanding how location influences diffusion is important for quantifying any spatial dependence in the development of diffusion models.

This research aims to examine the relationship between information diffusion patterns and the urban hierarchy. Our overall assumption is that the response behavior of an urban region to new information is related to its position in the urban hierarchy. The urban hierarchy is a function of the population of the regions as defined by Hägerstrand (1966), and the response behavior in a region is defined by the frequency of responses and the diffusion pattern over time. Geotagged conversations from Twitter reflecting two selected topics were collected from the top 30 populated U.S. cities and Metropolitan Statistical Areas (MSAs). The response behavior of urban areas was examined with multiple statistical methods and two Diffusion Rate Indexes (DRI) over time. The contributions of this work are as follows:

We demonstrate a framework and procedures for understanding the diffusion of information using geotagged Twitter data. This framework is not limited to Twitter data and can be adapted to conversation posts from other location-based social media data.
We examine the Twitter activities between different types of urban hierarchy (city and MSA) at multiple temporal resolutions to create baseline behavior and suggest two Diffusion Rate Indexes (DRI) that can be derived from the baseline.
We propose multiple statistical features¹ that can be extracted from geotagged Twitter posts to study diffusion behavior. These represent different aspects of conversations and can be separately analyzed depending on the purpose of the research.
We implement four distance metrics to formalize dissimilarity of the response behavior over time between urban regions. In addition, procedures for comparing response behavior across urban hierarchy classes are introduced.

The rest of this paper is organized as follows: Section 2 introduces the related work, focusing on studies of information diffusion, the nodes and links in diffusion studies, and the characteristics of the diffusion nodes. After describing the data collection and pre-processing procedures, Section 3 introduces the statistical features that represent Twitter conversation and the two Diffusion Rate Indexes (DRI) for measuring response behavior over time. The results of baseline behavior and topic-based response patterns are presented in Section 4. The research is concluded with discussions of the key findings and suggestions for future research in this domain.

2. Related works

2.1. Information diffusion

Diffusion was first studied as a descriptive concept by the sociologist Gabriel Tarde (1897), when the process of people imitating behavior, desires, or motives transmitted from one individual to another was theorized as diffusion in the social system. Applying the diffusion concept, researchers started to examine how innovations are diffused with empirical data when agricultural technology was advancing. Ryan and Gross (1943) investigated how independent farmers were adopting new hybrid seeds in the state of Iowa, United States. Their work considered the diffusion of innovation as a social process, and they focused on the relative influence of economic versus social factors on adopting a technological innovation. Works by Hägerstrand (1966, 1967) provided innovation diffusion with a spatial context and also served as early examples of quantitatively modeling the diffusion process. The concept of diffusion of innovation has been applied to different types of innovations, such as the diffusion of knowledge about news stories through media (Deutschmann and Danielson 1960), the diffusion of a medical drug among doctors (Coleman, Katz, and Menzel 1966), the diffusion of Smart Card technology in a medical system (Aubert and Hamel 2001), the diffusion and adoption of new policy and technology in schools (McCormick, Steckler, and McLeroy 1995; Frank, Zhao, and Borman 2004), the adoption of hashtags by social media users (Romero, Meeder, and Kleinberg 2011), the diffusion of microfinance through social networks (Banerjee et al. 2013), and the diffusion patterns of scientific articles that are shared on social media platforms (Alperin, Gomez, and Haustein 2019), to name just a few.

2.2. Links and nodes in diffusion

Analysis of the diffusion phenomenon can be categorized as the study of the links or the study of the nodes (Hägerstrand 1967). In the sense of information diffusion, the links represent the connections along which information is passed, and the nodes are the individuals that react to the information. In the study-of-links category, different types of networks derived from Twitter data were studied to understand the diffusion of information among users. Yang and Counts (2010) analyzed the mentioning network with Twitter data and measured the diffusion of information with speed, scale, and range. Suh et al. (2010) considered retweeting as the important indicator of diffusion in Twitter's network. In search of the factors that affect whether a tweet is diffused, they found that content features such as URLs and hashtags have strong relationships with retweetability. Focusing on following-follower networks, Kwak et al. (2010) looked at trending topics on Twitter and analyzed diffusion by retweet activities. In their work, influential users were ranked similarly by either the number of followers or by PageRank. For the study of nodes, much research has tried to quantify the influence of Twitter users. In contrast to the traditional concept of opinion leaders in diffusion of innovation, research in this domain has found that ordinary users may also have strong impacts on the diffusion of information (Cha et al. 2010; Bakshy et al. 2011). This branch has also gained considerable traction recently due to the popular use of social media for information propagation. Efforts have been made to examine the sharing of true and false information on social network platforms (Vosoughi, Roy, and Aral 2018), the role and impact of social bots (Ferrara et al. 2016; Shao et al. 2018), and the influence of nodes in the network during major election events (Bovet and Makse 2019; Grinberg et al. 2019).
Besides links and nodes, content features such as the use of hashtags and the topic of the message are also considered important factors that contribute to the diffusion of information (Romero, Meeder, and Kleinberg 2011).

2.3. Diffusion node characteristics

In the early studies of the diffusion of agricultural innovations, the size of the farm was recognized as one of the key factors related to the acceptance of a diffusing innovation (Ryan and Gross 1943; Gross 1949; Gross and Taves 1952), which Hägerstrand (1967) later described as an economic factor in the diffusion process. The size of a city was first included to explain the spread of the Rotary Club movement in the Scandinavian countries, in which Hägerstrand (1966) found that innovation diffusion moved down the central place hierarchy in Sweden. In this example, innovations were initially spread to the places highest in the hierarchy system of city rankings and then diffused to places at lower ranks. The urban hierarchy was considered as the ranking of cities by population in the case study. In addition to the size of the diffusion nodes, such as farm size and city population, density is another factor that has been examined in diffusion case studies. Graham's law of diffusion (1829) highlights the role of density in the diffusion of gases, and this concept can also be found in Hägerstrand's work (1967), where population density was used to categorize sub-regions for the simulations.

This research follows the concept of innovation diffusion as a spatial process and falls into the study-of-nodes category. In contrast to previous studies, instead of looking at interpersonal diffusion in social networks, we extend the scope to the study of geographic regions and the urban hierarchy that connects them. Furthermore, instead of using raw counts of responses or normalizing the responses by the regional population, we first analyze the baseline behavior of study regions over time and then derive statistical features that can reflect the response intensities. Lastly, in addition to using population as a node variable as suggested by previous research, we also classify urban regions with both population density and GDP to test the underlying type of urban hierarchy in the diffusion process. Population density is selected because many consider that it may affect posting frequencies on social media. The GDP-based urban ranking is included as an alternative urban hierarchy that represents the economic perspective, in contrast to the traditional population-based indexes.

3. Methods

3.1. Urban hierarchy

Two types of urban hierarchy were used in this research to establish the geographic regions of interest: population ranked by city boundaries and by Metropolitan Statistical Areas (MSAs). For the first approach, we constructed the urban hierarchy using the top 30 U.S. cities based on the Census 2010 city population. These cities are referenced to the incorporated places as defined by the United States Census Bureau, which include a variety of designations, including city, town, village, borough, and municipality. The second approach builds the urban hierarchy with the top 30 populated Metropolitan Statistical Areas defined by the United States Office of Management and Budget (OMB). The OMB defines an MSA as one or more adjacent counties or county equivalents that have at least one urban core area of more than 50,000 population, plus adjacent territory that has a high degree of social and economic integration with the core. MSAs normally extend over wider areas than cities.

3.2. Data collection

This research uses Twitter data to study the diffusion of information in urban regions. The Twitter Search API was utilized to collect tweets reacting to two selected topics, the Nepal Earthquake and #JesuisCharlie, using pre-selected keywords defined by domain expertise (the topic datasets). The two topics were selected because they both represent events that drew attention all over the world and had an epicenter outside the U.S., which reduces the impacts of local and regional interests that can be seen when analyzing domestic topics. Data collection for the two topics was initiated at the start of the events and also retrospectively traced back to the time before the events. The Streaming API was used to collect geotagged tweets from the top U.S. populated regions to generate the baseline behavior (the baseline dataset). Several types of location information may be available in a given tweet object for location-based analysis, including geotagged location, user profile hometown, and place names appearing in the tweet text. Although geotagged tweets are often a fraction of the entire dataset, this type of location presents less uncertainty than the others when assigning tweets to urban regions, and thus this research focused on geotagged tweets only.

Topic dataset 1 – Nepal Earthquake (referred to as NE from now on): The Nepal earthquake (also known as the Gorkha earthquake) happened on 25 April 2015, at an epicenter east of the district of Lamjung, Nepal. The 7.8 M_w (or 8.1 M_s) magnitude main earthquake and its aftershocks caused major damage to lives and properties in the affected area. To understand the spread of information about this disaster and the responses from the U.S., 20 keywords² were selected to collect Twitter messages. From 21 April 2015 to 31 May 2015, we retrieved 1,135,638 tweets from 474,746 unique users. Of these, 60,297 (5%) contain geotagged latitude/longitude coordinates, and 32,454 of them were from the U.S.

Topic dataset 2 – #JesuisCharlie (referred to as JC from now on): #JesuisCharlie was the trending hashtag on Twitter adopted by supporters of freedom of speech and freedom of the press after an attack on 7 January 2015 at the offices of the French satirical weekly newspaper Charlie Hebdo in Paris, France. The hashtag was first posted on Twitter and then went viral on other social media platforms. The #JesuisCharlie hashtag was selected as one of the topics for this research to examine the diffusion of a trending term over social media at a global scale. From 7 January 2015 to 26 January 2015, we retrieved 6,368,131 tweets from 2,174,805 unique users that contain the hashtag #JesuisCharlie. Over 45% of the tweets are in French and about 30% are in English. Of these, 88,693 (2%) tweets came with geotagged coordinates and 48,693 tweets were from the U.S.

Baseline dataset: To properly normalize topic-related conversations for each geographic region, a baseline behavior dataset was established. We used the minimal bounding boxes of the top 30 U.S. cities and MSAs as spatial filters to capture random samples of geotagged tweets from these regions. Over a two-month time frame (17 April 2016 to 20 June 2016), a total of 11.2 million posts were harvested from the defined bounding boxes. All geotagged posts were first clipped by the top 30 study region boundaries (city and MSA) in a geographic information system. The timestamps of posts are originally in Coordinated Universal Time (UTC) but were assigned local timestamps using the geotagged coordinates of each post to obtain the matching time zone and its UTC offset.

3.3. Noise filtering from raw data

To focus on the overall trend of information diffusion, potential noise, including Twitterbots and overly active users, was filtered from the dataset. A Twitterbot produces automatic posts or automatically follows other Twitter user accounts. Previous studies (Chu et al. 2012; Stringhini, Kruegel, and Vigna 2010) on detecting Twitterbots have provided some starting points, but this remains a significant challenge in social network research working with social media. We analyzed the source attribute of each post and removed tweets that were not posted via the official Twitter mobile applications or from the Twitter website, to filter out potential Twitterbots that might be programmed through third-party clients. Users who generate a large number of topic-related tweets are considered overly active users and can possibly be Twitterbots as well. The overall distributions of user-tweet counts for both topics show a long tail, indicating that some users contribute a very high number of topic-related tweets, which aligns well with the identified common characteristic of Twitter data (Asur and Huberman 2010; Longley and Adnan 2016). We are more interested in the general patterns of diffusion than in the extreme cases, thus users with post counts above the 90th percentile of posting frequencies are removed. The 90th percentile for the NE topic is 676.2 tweets and the 90th percentile for the JC topic is 543.3 tweets. Figure 1(a) shows the Empirical Cumulative Distribution Function (ECDF) of both topics after data filtering, and Figure 1(b) plots the user counts by tweets of both topics. Although there are some users with a large number of tweets in the NE dataset, the majority of users contributing to this topic tweeted less when compared to the JC dataset. When fitting the CDF of tweets per user to a power law distribution, the NE dataset shows an overall better fit according to the Kolmogorov-Smirnov (KS) test.
Figure 2 indicates the fitted power law distributions with an exponent of 2.22 for the JC dataset and an exponent of 2.5 for the NE dataset.
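The percentile-based removal of overly active users described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the tweet records are assumed to be dictionaries with a hypothetical `user` key, and the linear-interpolation percentile is one common convention that may differ from the one used in the study.

```python
from collections import Counter


def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def drop_overly_active(tweets, q=90.0):
    """Keep tweets whose author's post count is at or below the q-th percentile.

    `tweets` is a list of dicts with a hypothetical 'user' key (an assumption
    made for this sketch, not a field name taken from the paper).
    """
    counts = Counter(t["user"] for t in tweets)
    cutoff = percentile(list(counts.values()), q)
    return [t for t in tweets if counts[t["user"]] <= cutoff]
```

With nine users posting once and one user posting fifty times, the cutoff falls between 1 and 50, so only the fifty posts of the heavy user are dropped.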

Figure 1. Distribution of tweets per user and user counts by tweets after filtering potential Twitterbots and overly active users. The red line in both charts represents the JC dataset and the green line represents the NE dataset.

Figure 2. Plots of CDFs and power law best-fit lines after noise filtering. The lines indicate the best-fit power law of each dataset. The red line represents the JC dataset fitted with α of 2.22, and the green line represents the NE dataset fitted with α of 2.5.

3.4. Representing conversation aspects

In addition to the number of geotagged responses (Geo) about a topic, there are several other aspects of the conversation that can be retrieved to represent different types of response to the two topics in urban areas. The number of Unique Users (UU) represents the number of users that responded to the topic (Kryvasheyeu et al. 2016), and the average number of posts per unique user (Post-User) can be treated as the intensity of engagement from the users. Retweeting activities have been used in Twitter-based diffusion studies as the indicator of diffusion (Yang and Counts 2010; Suh et al. 2010), and thus the total numbers of retweets (RT) and non-retweets (noRT) are included. Tweets with URLs (URL) may have higher credibility (Duan et al. 2010) or spur more retweets (Suh et al. 2010); however, tweets without URLs (noURL) may correlate better with real-world events such as influenza outbreaks (Nagel et al. 2013; Aslam et al. 2014). For each city and MSA, active users within the region were filtered and thus a set of advanced filtered features can be used (Filt_noRT, Filt_noURL, and Filt_All). Furthermore, these chosen statistical variables were normalized by the number of unique users to generate an additional five variables (noRT_UU, URL_UU, noURL_UU, Filt_noRT_UU, and Filt_noURL_UU). In addition to Post-User, four derived variables are selected: the percentage of non-retweets (noRT%), the percentage of tweets without URLs (noURL%), and the normalization of these two by unique users (noRT_UU% and noURL_UU%). These 19 statistical features give us an overview of the responses from the study region.
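As an illustration of how some of the count-based and derived features above could be computed for one region, consider the following sketch. The record keys (`user`, `is_retweet`, `has_url`) are hypothetical, and the choice of the total geotagged count as the denominator for noRT% and noURL% is an assumption; the filtered and user-normalized variants are omitted.

```python
def conversation_features(tweets):
    """Summarise one region's topic tweets into a subset of the features.

    Each tweet is a dict with hypothetical keys 'user', 'is_retweet' and
    'has_url' (names assumed for this sketch only).
    """
    geo = len(tweets)                                # Geo: geotagged responses
    uu = len({t["user"] for t in tweets})            # UU: unique users
    rt = sum(1 for t in tweets if t["is_retweet"])   # RT: retweets
    url = sum(1 for t in tweets if t["has_url"])     # URL: tweets with a URL
    return {
        "Geo": geo,
        "UU": uu,
        "Post-User": geo / uu if uu else 0.0,
        "RT": rt,
        "noRT": geo - rt,
        "URL": url,
        "noURL": geo - url,
        # Percentages use the total geotagged count as denominator (assumed).
        "noRT%": 100.0 * (geo - rt) / geo if geo else 0.0,
        "noURL%": 100.0 * (geo - url) / geo if geo else 0.0,
    }
```

Returning a dictionary keyed by feature name keeps each aspect separately addressable, which matches the idea that the features can be analyzed independently depending on the purpose of the research.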

3.5. Measuring similarity by response behavior over time

Adding the temporal dimension, we analyze diffusion for each region by measuring the frequency of responses over time using two measurements: (1) the frequency of social media postings related to a topic, and (2) the number of unique social media users accounted for in these postings. These two measurements are represented by two Diffusion Rate Indexes (DRI).

Diffusion Rate Index by Posting Frequency (DRI_p): DRI_p considers the frequency of social media posts about a topic as the indicator of diffusion. It represents the intensity of information diffusion in a given area. Within a time frame, the diffusion rate of a given topic in a region is the ratio of the frequency of topic-related posts (Post_t) to the average frequency of posts at the same temporal resolution (AvgPost_t).

Diffusion Rate Index by Unique Social Media Users (DRI_u): DRI_u considers the increase of social media users who had topic-related posts as an indicator of information being more diffused. It represents the number of users in the region that are aware of the information. Within a time frame, the diffusion rate of a given topic in a region is the ratio of unique users who had topic-related posts (User_t) to the average number of unique users during the same time frame (AvgUser_t).
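Both indexes share the same ratio form, DRI_p = Post_t / AvgPost_t and DRI_u = User_t / AvgUser_t, evaluated per time frame. A minimal sketch of that computation is given below; the function name is hypothetical, and returning 0.0 when the baseline average is zero is an assumption of this sketch rather than a rule stated in the text.

```python
def diffusion_rate_index(topic_counts, baseline_avgs):
    """Per-time-frame ratio of topic-related counts to baseline averages.

    Pass post counts and AvgPost_t for DRI_p, or unique-user counts and
    AvgUser_t for DRI_u. A zero baseline yields 0.0 (an assumed convention).
    """
    if len(topic_counts) != len(baseline_avgs):
        raise ValueError("series must cover the same time frames")
    return [c / b if b > 0 else 0.0 for c, b in zip(topic_counts, baseline_avgs)]
```

For example, 5 topic posts in an hour whose baseline average is 100 posts gives a DRI_p of 0.05 for that hour.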

In this study, the values of AvgPost_t and AvgUser_t were derived by monitoring the baseline behavior of geotagged posts in each study region. Calculating DRI_p and DRI_u over time generates time series observations for each region, which we used to analyze whether regions show similar diffusion patterns in response to the same topic. It is worth noting that the distance metrics should be selected based on the aspect of diffusion patterns being studied. If responses happen at the same time points, Euclidean distance and other correlation-based distances may capture the differences in diffusion magnitude. However, regions may respond to information differently and at different temporal intensities, thus we formalize dissimilarity of response behavior with four distance metrics that focus on the overall pattern and better account for the time skew in the time series data: Dynamic Time Warping (DTW) (Berndt and Clifford 1994), Earth Mover's Distance (EMD) (Rubner, Tomasi, and Guibas 1998), Fréchet Distance (FD) (Alt and Godau 1995), and Hausdorff Distance (HD) (Huttenlocher, Klanderman, and Rucklidge 1993).
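To illustrate why a warping-based metric tolerates time skew, the following is a minimal sketch of the classic dynamic-programming form of DTW on two one-dimensional DRI series. It assumes absolute difference as the local cost and no warping-window constraint; the configuration used in the study is not specified here.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two numeric sequences.

    Local cost is the absolute difference (an assumption of this sketch);
    no window constraint is applied.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat b[j-1]
                                 cost[i][j - 1],      # repeat a[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]
```

Two series with the same shape but a one-step delay, such as `[0, 1, 2, 1, 0]` and `[0, 0, 1, 2, 1, 0]`, have a DTW distance of zero, whereas a point-by-point comparison would report a difference.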

4. Results

4.1. Baseline behavior of posts and users in urban regions

Baseline behavior for each region was established to properly normalize the responses of each geographic region and create the two DRI measurements. We first examined the posting and user behavior in the baseline dataset at multiple temporal resolutions to see if they correspond with daily activity (commuting and work) and capability constraints (sleeping and eating) at local time.

Daily activity: Among the top 30 cities, the average number of posts drops as the population of the city decreases. Figure 3(a) indicates the daily average of posts/user behavior in the top U.S. cities. However, several cities stand out with much higher average posts compared to cities of equivalent rank. For example, San Francisco is ranked 13th by population but its daily average post number is between those of Los Angeles (2nd) and Chicago (3rd). Boston, Seattle, and especially Washington are lower-ranked cities but are in fact very active in tweeting behavior. The daily average numbers of unique users showed a similar trend to the average posts for the cities. San Francisco has the largest difference between the two measurements, which means that San Francisco has a higher post-per-user behavior than all the other cities. Figure 3(b) indicates the daily average of posts/user behavior in MSAs. Several MSAs stand out from the trend, such as Miami, San Diego, and Orlando, showing higher activity than MSAs with similar populations. Riverside has the lowest daily average posts and the lowest number of unique users while being ranked 13th by population. These MSAs suggest that there are other factors affecting their baseline behavior, such as (a) Riverside is one of the satellite regions of the greater Los Angeles metropolitan area, where many residents commute to and work in the Los Angeles MSA on a daily basis; or (b) Miami, San Diego, and Orlando are popular tourist destinations, so additional posts could be contributed by tourists instead of residents. San Francisco, when comparing its behavior across city and MSA, presents a very interesting case where its city area is relatively small but accounts for most of the tweeting behavior of the whole MSA.

Figure 3. Average posts/user behavior in the top 30 U.S. populated cities and MSAs. In both sub-figures, the x-axis represents the list of urban regions in descending order of population from left to right. The first y-axis on the left represents the average number of posts, with its value shown in blue. The secondary y-axis is for the average user behavior, with values indicated by red horizontal bars.

Activity by day of the week: The cyclical temporal patterns of weekday behavior and hourly behavior are also similar across the two urban types. Figure 4(a) shows a boxplot of average tweet frequency by weekday in our study regions. Both urban types show the same patterns, with the highest post frequency on Sundays and the lowest on Saturdays. However, the variances between regions are much higher among the cities than among the MSAs, especially on Sundays. As shown in Figure 4(b), the order of user frequency by weekday is exactly the same as that of the tweet frequency. Still, the figure shows that there is a higher variation of user frequency over the weekends for cities than for MSAs. This is perhaps because several cities have only a small portion of residential areas, and thus the commuting population is not tweeting in those city areas during weekends.

Figure 4. Twitter activity by day of the week. The red boxes represent activity of the top 30 U.S. populated cities, and activities for the MSAs are shown by blue boxes.

Hourly activity: Tweet frequency and user frequency behave similarly for cities and MSAs when posts and numbers of unique users are aggregated to the hourly temporal resolution. Figure 5(a) indicates the baseline post frequency behavior in the urban areas by hour of the day. The behavior is consistent in both urban types, which basically follow human daily activity patterns with three time blocks: (a) people wake up in the morning and activities increase throughout the day, (b) activities decrease after people get off work, followed by a steady decline during the night, and (c) activity reaches its lowest during the mid-night hours when most people are sleeping. One significant difference between the two urban types is that the posting frequency varies more among cities at night. In addition, cities have a higher portion of nighttime activities, and night activities decline more slowly than for the MSAs. The baseline behavior of total unique users by hour is very similar between cities and MSAs (Figure 5(b)). There are large increases in the early morning (7 AM to 8 AM) and large decreases during evening commuting hours (5 PM to 6 PM), which can be interpreted as: (a) users tend to post multiple tweets early in the morning and (b) the evening commute is the least favored time for posting multiple tweets.

Figure 5. Hourly tweet and user activity in top U.S. populated regions. The red boxes represent activity of the top 30 U.S. populated cities, and activities for the MSAs are shown by blue boxes.

4.2. Spatial patterns of the topic dataset

The point distributions of geotagged tweets for both topics generally follow the population distribution patterns in the U.S. A closer look, however, reveals some variation across the two topics. Geotagged tweets for each topic were mapped into distribution heatmaps using Kernel Density Estimation (KDE), which calculates the density of features in a neighborhood around those features. In addition, when multiple heatmaps use the same radius and total number of cells for calculating the KDE, a differential information landscape map (Tsou et al. 2013) can be created to illustrate the geospatial fingerprints of the two topics in different regions of the U.S., as shown in Figure 6. The intensity of the differential values is visualized by the shading of the red/blue color, where red indicates more attention to the NE topic than to the JC topic in the region and blue indicates the opposite. On the West coast, responses to the NE topic were significantly higher in Southern California compared to the other topic. In general, the East coast shows a slightly more evenly distributed pattern of conversation hot spots, while the NE hot spots dominated the South-East regions, changing to the JC topic in the states closer to the Canadian border and in New Orleans, where most of the French-speaking population resides. Conversations about the two study topics were then spatially aggregated into the two urban area types and compared to the population of each selected region. The total counts of responses to both topics are highly correlated with the population in the urban regions. Pearson's correlation coefficients are high for the NE topic (city: 0.76 & MSA: 0.82) and for the JC topic (city: 0.89 & MSA: 0.92).
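The correlation between regional population and response counts reported above uses Pearson's coefficient, which can be sketched in a few lines. The function name is hypothetical, and any numbers passed to it in an example would be illustrative rather than the study's data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(vx * vy)
```

Here `x` would hold the populations of the 30 regions and `y` the total response counts for one topic, in the same region order.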

Figure 6. Differential information landscape map of geotagged tweets for two topics; red indicates more Nepal Earthquake attention than #JesuisCharlie; blue indicates more #JesuisCharlie than Nepal Earthquake.

4.3. Diffusion patterns for the top U.S. populated urban areas

4.3.1. Feature reduction and classification of urban areas

The 19 statistical features calculated from the topic dataset were normalized by the tweeting behavior established from the baseline dataset. We then analyzed the correlations of these statistical features and removed the highly correlated ones. This is important as highly correlated features may provide very similar information about the phenomenon under study. Including these features in the analysis may place extra weight on specific features and thus impact the overall results. Figure 7 shows the correlation matrices before and after the step of variable reduction. The first row is the correlation matrix of all 19 statistical variables for both topics and for both urban types. After variable reduction, 10 out of 19 statistical variables were preserved for analyzing the diffusion similarity among the study regions, including 5 count-based features (UU, RT, URL, noRT, noURL) and 5 derived features (Post_User, noRT%, noURL%, noRT_UU%, noURL_UU%). We implemented multidimensional scaling (MDS) with the 10 remaining variables to analyze and visualize the similarities among urban regions. These variables represent different aspects of the Twitter responses and each of them was used as one dimension in the MDS. Furthermore, the 30 cities and MSAs were separately grouped into four classes using Jenks' Natural Breaks method to examine whether the size of the urban regions impacts their response similarities. Often used in GIS, this method tries to reduce the within-class variance and maximize the variance among classes. We applied Jenks' Natural Breaks classification method to group the study regions by population, population density, and GDP (MSAs only). The classification of cities and MSAs was used as an additional attribute in the MDS graphs. The next step was to analyze the similarities among the urban regions with MDS, which utilized visualizations of multivariate data points in a two-dimensional space.
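The variable reduction step can be sketched as a greedy correlation filter. This is only one plausible reading of the procedure: the 0.9 threshold, the greedy keep-first order, the feature name "Post", and all values below are assumptions made for illustration, since the section does not state how highly correlated features were chosen for removal.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def drop_correlated(features, threshold=0.9):
    """Keep a feature only if |r| with every already-kept feature is <= threshold."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Toy per-region values (hypothetical); one list entry per urban region.
features = {
    "UU":   [10.0, 20.0, 30.0, 40.0, 50.0],
    "Post": [11.0, 19.0, 32.0, 41.0, 49.0],  # nearly proportional to UU
    "RT":   [5.0, 1.0, 4.0, 2.0, 3.0],
}
kept = drop_correlated(features)
```

In this toy case "Post" is dropped because it is almost perfectly correlated with "UU", while "RT" is retained. A greedy filter depends on the order in which features are visited, so a different ordering could retain a different subset.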

Figure 7. Correlation matrix of the statistical features. First row: correlation matrix of all features before reduction (19 features); second row: correlation matrix of all features after reduction (10 features).

4.3.2. Similarity among urban regions

Top 30 U.S. cities: Figure 8 shows the MDS results with colors indicating the population classes. Overall, one major cluster can be found across both topics, where class members included a mixture of cities from all four population classes. For both topics, New York and Los Angeles are very close in MDS distance, which means that they behave more similarly based on the 10 aspects of responses to the two topics than other cities. Cities that responded differently than others are Fort Worth and Jacksonville, both positioned further away from the cluster in both topics. Washington and San Francisco form another very similar set across the response to both topics. However, this pair is distant from the main cluster in the JC topic, while being right at the center in the NE topic. Overall, cities in class 2 and class 3 seemed to be more similar to other within-class cities, which is not true for the lowest class 4. In Figure 9, point colors are changed to represent the population density classes. For the JC topic, cities of the lower classes (class 3 and class 4) are scattered in the MDS graph. Although the class 1 city and class 2 cities are relatively closer to each other, they are less similar than in the MDS using population. Population density explains very well the cluster of class 3, in which cities are very similar besides San Jose. This is also true for class 2 with only Chicago being dissimilar to most cities.

Figure 8. Multidimensional scaling of the top 30 U.S. cities; color indicates population class of each urban region.

Figure 9. Multidimensional scaling of the top 30 U.S. cities; color indicates population density class of each urban region.

Top 30 U.S. MSAs: Figure 10 depicts the MDS of Twitter response behavior of the top 30 U.S. MSAs to both topics, where the color of the points indicates the population class of the MSAs. For the JC topic, a loose cluster formed by lower populated MSAs (class 4) can be seen in the middle. Around that cluster, higher populated MSAs (class 1 and 2) are nearby with class 3 MSAs scattered further away from the cluster. For the NE topic, class 3 and class 4 MSAs create two clusters. Among the higher classes, New York and Los Angeles are very similar in terms of their responses to both topics although in different classes. Chicago is more similar to them in the JC topic while being far away from the cluster for the NE topic. Another interesting MSA is Washington, which is classified as class 3 but is actually more similar to New York and Los Angeles than to other MSAs of its class.

Figure 10. Multidimensional scaling of the top 30 U.S. MSAs; color indicates population class of each urban region.

Switching from population to population density (Figure 11) better explains the closeness within classes. New York and Los Angeles form the only MSAs in class 1 and show very high similarity in terms of their response behavior to the two topics. More MSAs are included in class 2, and three out of four MSAs are relatively similar in both topics. Lower density classes (class 3 and 4) also benefit from the use of population density. The main cluster formed by class 4 in the JC topic is denser than before, and the class 3 MSAs are also more concentrated in both topics. While several MSAs switched classes in the case of grouping by GDP (Figure 12), the across-class patterns in the MDS did not change significantly from the two population-based MDS results. One significant change is that class 3 MSAs are more separated from class 4 MSAs and are starting to form a second cluster, which is happening in the NE topic as the distances among class 3 MSAs are shrinking.

Figure 11. Multidimensional scaling of the top 30 U.S. MSAs; color indicates population density class of each urban region.

Figure 12. Multidimensional scaling of the top 30 U.S. MSAs; color indicates GDP class of each urban region.

4.3.3. Within-class similarity

The MDS visualizations demonstrate the similarity among urban regions based on multiple aspects of their Twitter responses to the topic. The next step is to examine whether the response behavior of these regions over time reflects the urban hierarchy. We focus on whether urban regions that are categorized in the same class respond similarly to a topic over time. The two diffusion indexes (DRI_p and DRI_u) were calculated for each region as time series data, and then pairwise distances among the regions were created using the four selected distance metrics. To examine within-class region similarity, two mean metrics were created:

Within-class mean: The mean of pairwise distances between all regions in each class.
Random sample mean: The mean of pairwise distances between 10 random samples picked from all regions. Calculated for 1000 iterations.

The combinations of two topics, three class types (population, population density, and GDP for MSAs), and four region classes (class 1 to class 4) resulted in 16 within-class mean/random sample mean pairs for cities and 24 pairs for MSAs. For each pair, we checked if the within-class mean was smaller than the random sample mean. In the cases where the within-class mean is smaller, the response behaviors over time are interpreted as more similar among the urban regions classified in the same class. Figure 13 indicates the comparative results for cities and Figure 14 shows the results for MSAs. Note that, besides the population density class for MSAs, the other class 1 comparisons are tagged as NA, as there was only one region in that class and so the within-class mean cannot be calculated.
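The comparison of the two mean metrics can be sketched as follows. The sample size of 10 and the 1000 iterations come from the definitions above, and the 800-iteration cut-off comes from the captions of Figures 13 and 14. The function names, the seed, and the toy one-dimensional regions are hypothetical; in the study the pairwise distances come from the DRI time series and the four distance metrics.

```python
import itertools
import random

def mean_pairwise(regions, dist):
    """Mean of the precomputed distances over all unordered pairs of regions."""
    pairs = list(itertools.combinations(regions, 2))
    return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

def count_within_smaller(class_members, all_regions, dist,
                         sample_size=10, iterations=1000, seed=0):
    """Count the iterations in which the within-class mean is below a random-sample mean."""
    rng = random.Random(seed)
    within = mean_pairwise(class_members, dist)
    wins = 0
    for _ in range(iterations):
        sample = rng.sample(all_regions, sample_size)
        if within < mean_pairwise(sample, dist):
            wins += 1
    return within, wins

# Toy setup (hypothetical): 12 regions on a line, distance = absolute difference.
values = {f"r{i}": float(i) for i in range(12)}
regions = list(values)
dist = {frozenset((a, b)): abs(values[a] - values[b])
        for a, b in itertools.combinations(regions, 2)}
within, wins = count_within_smaller(["r0", "r1", "r2"], regions, dist)
flagged = wins > 800  # cut-off used in the figure captions
```

A class with a single member has no pairs, so `mean_pairwise` cannot be evaluated for it; this mirrors the NA entries noted above.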

Figure 13. Comparison of within-class city similarity to random sampled city similarity. Red indicates that the within-class mean is smaller than the random sampled mean more than 800 times out of the 1000 iterations.

Figure 14. Comparison of within-class MSA similarity to random sampled MSA similarity. Red indicates that the within-class mean is smaller than the random sampled mean more than 800 times out of the 1000 iterations.

DRI_p and DRI_u as a Diffusion Index: When it comes to representing the response behavior over time, the differences are not very significant using either DRI_p or DRI_u. However, DRI_u seems to work better as it results in more within-class similarity than DRI_p. Especially for the MSAs, DRI_u contributes to more cases where the within-class mean is smaller than the random sample mean. For example, MSAs in the highest region class (excluding NA) are always more similar than random samples. In addition, representing regions with DRI_u also makes the within-class similarity more stable across the four selected distance metrics.

Distance Metrics for Region Similarity: Among the four distance metrics, EMD performs the best in terms of showing the within-class similarity. EMD works well with DRI_p for the cities, and particularly for higher region classes (class 2 and class 3) as 7 of the 8 classes have smaller means than random. The same finding holds true for the MSAs, where MSAs belonging to the two higher region classes have very similar response behavior over time with both DRI_p and DRI_u (24 out of 24 region classes). Unlike EMD, the other three distance metrics do not support finding within-class similarity in most cases. However, the four distance metrics are in agreement in places including cities by population (class 2 and class 3 with DRI_u) and MSAs of the highest region classes with DRI_u (class 2 of population, class 1 of population density, and class 2 of GDP). Another interesting finding is that more agreements can be found among the distance metrics in either the highest region class (class 1 or 2) or the lowest region class (class 4), especially with DRI_u.
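For readers unfamiliar with EMD, a minimal one-dimensional sketch is given here. It assumes that each time series is non-negative, is rescaled to sum to one, and uses unit spacing between time steps; how the DRI series were actually prepared for EMD is not stated in this section, so these choices and the toy series are assumptions.

```python
from itertools import accumulate

def normalize(series):
    """Rescale a non-negative series so that it sums to 1."""
    total = sum(series)
    return [v / total for v in series]

def emd_1d(series_a, series_b):
    """1-D Earth Mover's Distance between two equal-length series (unit bin spacing).

    In one dimension the EMD equals the summed absolute difference
    between the two cumulative distributions.
    """
    a, b = normalize(series_a), normalize(series_b)
    cdf_a, cdf_b = accumulate(a), accumulate(b)
    return sum(abs(x - y) for x, y in zip(cdf_a, cdf_b))

# Toy response curves (hypothetical): one region reacts early, the other late.
early = [8.0, 4.0, 2.0, 1.0, 1.0]
late = [1.0, 1.0, 2.0, 4.0, 8.0]
d_same = emd_1d(early, early)
d_diff = emd_1d(early, late)
```

Because EMD measures how far mass must be shifted along the time axis, two regions whose responses peak a few time steps apart are treated as closer than two regions whose peaks are far apart, which a pointwise metric would not capture.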

5. Discussion and future work

In this research the relationship between information diffusion patterns and the urban hierarchy has been explored. We assumed that the diffusion of information can be captured by the response behavior of an urban region and hypothesized that this behavior is related to the region's position in the urban hierarchy. We presented a research framework for analyzing diffusion with the urban hierarchy that demonstrated procedures of data collection and filtering, establishing and examining baseline behavior, engineering conversation features, and analyzing diffusion patterns of selected topics at urban regions. Social media postings from Twitter about two topics during 2015, the Nepal earthquake and #JesuisCharlie, were collected for the top 30 populated cities and the top 30 populated metropolitan areas (MSAs) in the U.S. We first examined the baseline behavior and then analyzed the similarity of how urban regions responded to the Twitter information.

The regional baseline behavior showed that cities and MSAs have very similar posting and unique user activity patterns at both daily (day of the week) and hourly (hour of the day) temporal resolutions. In addition, the baseline hourly tweeting activity aligns well with human daily activity patterns known as the capability constraint. Overall, the average number of posts and unique users from each city and MSA gradually decreased as the regional population decreased. We introduced 19 variables to represent different aspects of communication on the Twitter platform, including personal opinions, echoing other messages, and references to external sources. We used Multidimensional Scaling (MDS) to analyze the similarities among urban regions with three urban variables (population, population density, and GDP for MSAs). For the top 30 U.S. cities, population seems to be the better variable to explain the similar response behavior. Yet, population density exceeds the other two variables for the MSAs. The results also showed that: (1) some regions have very different interests in the two study topics, such as Chicago; (2) New York and Los Angeles are very similar in both topics and at both city and MSA scale; and (3) although Washington D.C. and San Francisco are not really close to Los Angeles/New York in the urban hierarchy by all three classification variables (population, population density, and GDP), they are the most similar urban regions to them in terms of their Twitter responses to new information. These cases suggest future work should examine other factors that influence the diffusion of information, for example, how connected a region is to the others or why a region is specifically interested in the given topics.

We introduced two Diffusion Rate Indexes (DRI) to analyze the Twitter response from urban regions to a new topic over time. To formalize the dissimilarity among the response behavior, four distance metrics were designed to compare the cities. As a result, cities that are classified in the same class by population are very likely to share similar response behavior over time, especially for the higher populated cities at the top of the urban hierarchy. For the MSAs, the within-class similarity is shown in the highest and lowest classes in all classification variables. Overall, the diffusion rate index with unique users (DRI_u) is the better measurement to analyze response behavior over time. Earth Mover's Distance (EMD), among the four distance metrics, has the best performance when comparing within-class region similarities to random sampling region similarities. Furthermore, the MSA is more suitable than the city to demonstrate the impact of the urban hierarchy on the diffusion of information. MSAs in the same class have higher chances of responding similarly to new information on Twitter over time.

Three limitations of this study merit acknowledgment. First, the framework was demonstrated with geotagged tweets that could be a fraction of the whole conversation population on Twitter. The framework can be applied to non-geotagged posts as well if geocoding algorithms are used to assign proper locations and enrich the location information of each post. Although the often short and noisy text content remains the main challenge of geocoding tweets (Zheng, Han, and Sun 2018), efforts have been made in this domain to infer the location of the social media users with the help of user profiles (Hecht et al. 2011), with text content and timestamp (Li et al. 2011), and with models that combine multiple components (Mahmud, Nichols, and Drews 2012; Rodrigues et al. 2013; Ghahremanlou, Sherchan, and Thom 2014; Miura et al. 2017). In addition, future works applying this proposed framework using data other than tweets would also need to reconstruct the 19 conversation variables to represent their selected information source. Second, the diffusion patterns that we discovered are based on the 19 variables generated by raw and derived counts of the tweets to represent different aspects of communication. It would be interesting to add variables that carry the context of communication such as the sentiment of posts or other features from the Natural Language Processing (NLP) of text. In addition, this research removed highly correlated variables for the MDS, and other feature reduction methods such as random forest and PCA could be applied when more conversation variables are included. Third, the spatiotemporal diffusion patterns and their relation to the three urban hierarchies are the results of the two selected information topics and defined keywords. Applying the research framework to examine the spatial and temporal diffusion patterns of more topics would make the findings more general.

Summing up, this research used two types of urban hierarchy, cities and MSAs, to study information diffusion over time. The results show that the urban hierarchy does impact the diffusion of information in virtual space with diffusion channels such as Twitter. Such an impact is rooted in the higher similarity of the response behavior among regions when they are positioned closer in the urban hierarchy. This is especially significant when the urban regions are at the top of the urban hierarchy. Future research will involve expanding the dataset to include additional topics that represent the diversity of information each region is exposed to, and other forms of social media. Another direction is related to constructing the urban hierarchy, which involves expanding the number of regions to build a more complete urban hierarchy or introducing other metrics that can rank cities and regions differently. Other possible indexes can include the Creative City Index (Hartley et al. 2012) or the economic performance and network connectivity of cities (Pain et al. 2015). The hierarchy of political importance in the country may also be important, as Washington, D.C. clearly stands out in this research. Furthermore, to better gauge the responses from local residents in each region, future work should separate tourists from locals using user profiles, past geotagged post locations, or content of the historical posts of the user. Moreover, future research that utilizes global topics as in this study can also analyze the worldwide spatiotemporal diffusion and investigate the role of time zone delay in response behavior. While the studied cities and MSAs have very similar baseline behavior, some responded very differently to the two topics. This aligns with the argument for the local nature of information and the view that space and place can facilitate social engagement from the Physical-Cyber-Social point of view (Sheth, Anantharam, and Henson 2013). The potential differences in social, cultural, religious, and economic characteristics among these urban regions might provide insights into the dissimilarity in their responding behaviors. We would like to explore the theory and mechanism behind those differences in follow-up studies. Lastly, we also suggest future research to examine the patterns of diffusion at different spatial resolutions. While this study is specifically interested in diffusion at aggregated regions, future research can utilize fine-grained spatial data to identify the types of location where diffusion would most likely take place or to analyze the Neighborhood Effects (Hägerstrand 1967) in diffusion across nearby regions.

Additional information

Notes on contributors

Jiue-An Yang

Jiue-An Yang earned his PhD in Geography from the San Diego State University/University of California, Santa Barbara. He is currently a postdoctoral research scientist at the Center for Wireless and Population Health Systems (CWPHS) and the Qualcomm Institute (QI) of the University of California, San Diego. His research interests include information diffusion in cyberspace, geospatial perspectives of personal and public health, spatiotemporal analysis of big geo-data, and geospatial artificial intelligence.

Ming-Hsiang Tsou

Ming-Hsiang Tsou is a Professor of Geography and the Director of the Center for Human Dynamics in the Mobile Age at San Diego State University. He received a Ph.D. (2001) in Geography from the University of Colorado at Boulder. His research interests are in Big Data, Human Dynamics, Social Media, Cancer Disparity, Visualization, Web GIS, and Cartography. He has published one academic book (Internet GIS) and over 76 refereed academic papers. Dr. Tsou has led multiple NSF projects as PI or Co-PI (over $4 million in awards since 2010) and served on the editorial boards of the Annals of GIS, Cartography and GIScience, and The Professional Geographer.

Krzysztof Janowicz

Krzysztof Janowicz earned his PhD in Geoinformatics from the University of Muenster, Germany. He is currently an associate professor at the University of California, Santa Barbara and associate director of the Center for Spatial Studies. His research interests include knowledge graphs, geosemantics, geographic information retrieval, and cyberinfrastructure. He has (co)-authored over 200 scientific publications, including a book on internet security.

Keith C. Clarke

Keith C. Clarke is a research cartographer and professor, with the M.A. and PhD from the University of Michigan, specializing in Analytical Cartography. His research covers environmental simulation modeling, modeling urban growth, terrain mapping and analysis, and visualization. He is the author of three textbooks in eight editions, and about 300 book chapters, journal articles, and papers in the fields of cartography, remote sensing, and geographic information systems. He was elected an AAAS Fellow in 2017.

Piotr Jankowski

Piotr Jankowski earned his PhD from the University of Washington. He is currently a professor and chair of the Department of Geography at San Diego State University. His research focuses on spatial decision support systems, participatory geographic information systems, and sensitivity analysis in spatial models. He is an author of over 100 peer-reviewed journal papers and the co-author of two books: Geographic Information Systems for Group Decision Making and GIS for Urban and Regional Environments: A Spatial Decision Support Approach.