Hi. We are Arbitrary.

This is our project,

SIM Spam Slam

Like the popular canned food brand it is named after, spam texts can come in all kinds of varieties. This data project aims to uncover the commonalities between these messages and analyze the trends and patterns that they follow.

  • Hans Miguel Salazar WFY
  • Ehren Castillo WFY
  • Kevin Trinidad WFY

Overview

What is this project for?

The advent of Short Message Service (SMS) technology greatly shifted the nature of communication. The ease and accessibility of this method led to a rise in the number of spam texts. Bad actors are taking advantage of other people through malicious use of this technology.

SMS Relevancy

The rise of instant messaging (IM) has not eliminated SMS technology due to mobile data limitations and availability of devices. Many people in the Philippines still rely on SMS regularly.

Republic Act No. 11934

Also known as the Subscriber Identity Module (SIM) Registration Act, it was approved on 10 Oct 2022 with the primary purpose of assisting local authorities in eliminating malicious usage of SMS technology.

Preventive Measures

Technology and society are ever-changing, so a static solution is a band-aid at best. Machine learning models may provide an expandable framework for filtering spam texts.

Background

Despite the country's best efforts, spam texts remain a prevalent and persistent form of cybercriminal activity, with millions of messages being detected as of 2024.

Telecommunication companies such as Globe, PLDT and Smart have taken action against 5.5 million spam messages in the past year. However, even with this preventive action, the Cybercrime Investigation and Coordinating Center (CICC) noted that the losses of Filipino mobile phone users to spam texts have reached millions of dollars. Moreover, some instances of spam messages have also been found in instant messaging apps, which are not covered by the SIM Registration Act.

References: [1] [2] [3]

What do we aim to do?

Through this project, we want to address the following questions in an effort to further understand the nature of SMS spam texts in the Philippines.

Main Question

What could be the trends and patterns of spam messages in the Philippines?


Secondary Question

Is there a difference between spam messages before and after the deadline of the SIM Card Registration Act?


Null hypothesis

There is no difference in spam messages before and after the deadline of the SIM card registration act.


Alternative hypothesis

There are differences in spam messages before and after the deadline of the SIM card registration act.


Action Plan

Utilize data science tools and Natural Language Processing (NLP) to identify the patterns and trends of Spam text content.

How was the data collected?

The dataset used is a compilation of two public datasets available on Kaggle. Both of these datasets contain Philippine spam messages personally received and compiled by these Kaggle users. Additionally, both datasets contain timestamps for when messages were received.

Keywords: Text, Spam, SMS, Registration, Scam, Philippines

Data Set A

  • Author: Scott Lee Chua
  • Date Start: 09/30/2022
  • Date End: 02/27/2024
  • Raw Count: 3464
  • Processed Count: 467

Data Set B

  • Author: BwandoWando
  • Date Start: 12/11/2022
  • Date End: 03/31/2024
  • Raw Count: 619
  • Processed Count: 595

How was the data handled?

Through Python and its many libraries

Tools: Python, Jupyter Notebook, Google Colab, Google Sheets, Pandas, Deep Translator, Natural Language Toolkit, Regular Expressions, NumPy, Scikit-learn, Plotly

Cleaning

The datasets were cleaned by filtering out entries that had insufficient data or contained data which could not be used.

For this step, the entries containing the following properties were filtered out of the initial datasets:

  • non-exact duplicates
  • non-spam (Data Set A)
  • no phone number
  • no alphabetical characters in body
  • hidden or broken content
    ("Content not supported" or "<Redacted>")
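The per-entry filters above can be sketched in plain Python. This is a minimal illustration rather than the project's actual notebook code: the dictionary keys `sender`, `body`, and `is_spam` are hypothetical field names, and the duplicate rule is left out because the source does not define how duplicates were matched.

```python
import re

# Placeholder strings the list above names as hidden or broken content.
BROKEN_MARKERS = ("Content not supported", "<Redacted>")


def is_usable(entry):
    """Return True if an entry passes the per-entry filters.

    `entry` is a dict with hypothetical keys: 'sender', 'body', 'is_spam'.
    """
    body = entry.get("body") or ""
    if not entry.get("is_spam", True):       # drop non-spam rows
        return False
    if not entry.get("sender"):              # drop rows with no phone number
        return False
    if not re.search(r"[A-Za-z]", body):     # drop bodies with no letters
        return False
    if any(marker in body for marker in BROKEN_MARKERS):
        return False                         # drop hidden or broken content
    return True


def clean(entries):
    """Keep only the entries that pass every filter, preserving order."""
    return [entry for entry in entries if is_usable(entry)]
```

With pandas, the same checks would typically become boolean masks combined on a DataFrame.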

Preprocessing

With both datasets cleaned, they were combined into a single compiled dataset. From here, several pieces of information were derived from each entry, including:

  • year
  • month
  • day of week
  • sender carrier
  • SMS length
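As a rough sketch of how these fields could be derived from one entry, assuming an ISO-formatted timestamp and a caller-supplied prefix table. The keys `received_at`, `sender`, and `body`, and the `prefix_to_carrier` mapping, are hypothetical; the carrier names in the usage example are placeholders, not real prefix assignments.

```python
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def derive_features(entry, prefix_to_carrier):
    """Derive year, month, day of week, carrier, and SMS length.

    `entry` uses hypothetical keys: 'received_at' (ISO 8601 string),
    'sender' (local-format number), and 'body'.
    `prefix_to_carrier` maps a 4-digit prefix to a carrier name and must
    be supplied by the caller; no real assignments are assumed here.
    """
    received = datetime.fromisoformat(entry["received_at"])
    return {
        "year": received.year,
        "month": received.month,
        "day_of_week": DAY_NAMES[received.weekday()],  # Monday is 0
        "carrier": prefix_to_carrier.get(entry["sender"][:4], "unknown"),
        "sms_length": len(entry["body"]),
    }
```

For example:

```python
features = derive_features(
    {"received_at": "2023-07-25T09:30:00",
     "sender": "0900123456",
     "body": "Claim your prize now"},
    {"0900": "Carrier A"},  # placeholder mapping
)
```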

Lastly, the SMS messages were prepared for topic modeling by applying the following:

  • emoticons were removed
  • text was translated to English
  • tokenization
  • lemmatization
  • stopword filtering
  • stemming
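A reduced sketch of this preparation step, using only the standard library. It covers emoticon removal, tokenization, and stopword filtering. Translation, lemmatization, and stemming are omitted because they rely on external libraries such as Deep Translator and the Natural Language Toolkit. The stopword set and the emoticon pattern below are small illustrative assumptions, not the lists used in the project.

```python
import re

# Tiny illustrative stopword list; a real run would use a full list.
STOPWORDS = {"a", "an", "the", "to", "and", "of", "is", "you", "your", "for"}

# Common ASCII emoticons such as :) ;-( =D, only when they stand alone
# between whitespace so that text like "promo:Deal" is left untouched.
EMOTICON_RE = re.compile(r"(?<!\S)[:;=][-']?[)(DPpOo|](?!\S)")


def prepare(text):
    """Strip emoticons, lowercase, tokenize on letters, drop stopwords."""
    text = EMOTICON_RE.sub(" ", text)
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]
```

For instance, `prepare("Claim your FREE prize :) now!!")` keeps the tokens `claim`, `free`, `prize`, and `now`.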

Topic Modeling

Using the prepared data obtained, topic modeling was performed on the SMS messages in the dataset to form eleven topic clusters.

From this, each entry was assigned to a cluster depending on which topic it was most dominantly similar to, thus grouping SMS messages with similar or possibly related content.
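The assignment step can be sketched with scikit-learn, which the project lists among its tools. This assumes the topic model is scikit-learn's `LatentDirichletAllocation` fitted on a bag-of-words matrix from `CountVectorizer`; the source does not state which vectorizer or parameters were used, so `assign_topics` and its defaults are illustrative only.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer


def assign_topics(documents, n_topics=11, seed=0):
    """Fit a topic model and label each document with its dominant topic.

    `documents` is a list of already-prepared strings (tokens joined by
    spaces). Returns one topic index per document plus the fitted model
    and vectorizer.
    """
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(documents)        # document-term counts
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    doc_topic = lda.fit_transform(counts)               # one row of topic weights per document
    dominant = doc_topic.argmax(axis=1)                 # index of the strongest topic
    return dominant, lda, vectorizer
```

Fixing `random_state` keeps the clusters reproducible between runs, which matters when comparing different topic counts.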

Dataset Splitting

The deadline for SIM Card Registration was July 25, 2023. The processed dataset was split into two dataframes based on whether the message was received before or after this date. From each of these two dataframes, 200 random samples were taken to equalize them for statistical testing.
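A minimal sketch of the split and the equal-sized sampling. The `received` key is a hypothetical field name, and placing messages received on the deadline day itself in the "after" group is an assumption, since the source does not say which side that day falls on.

```python
import random
from datetime import date

DEADLINE = date(2023, 7, 25)  # SIM registration deadline stated above


def split_and_sample(entries, n=200, seed=0):
    """Split entries at the deadline, then draw n random entries from each side.

    Each entry is a dict with a hypothetical 'received' key holding a date.
    Raises ValueError if either side has fewer than n entries.
    """
    before = [e for e in entries if e["received"] < DEADLINE]
    after = [e for e in entries if e["received"] >= DEADLINE]  # assumption: deadline day counts as "after"
    rng = random.Random(seed)  # fixed seed so the samples are reproducible
    return rng.sample(before, n), rng.sample(after, n)
```

Sampling without replacement keeps each message at most once per group, so both groups contribute exactly `n` distinct entries to the tests.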

Here are the results of data exploration.

With the data prepared, visualizations were generated. The confidence level for hypothesis testing was set to 95%. This will determine if the SIM Card Registration deadline on July 25, 2023 has an effect on certain parameters of spam messages.

Frequency of words


There is an association between the period before and after the SIM card registration act and the frequency of the top 10 most common words in spam messages.

For the hypothesis testing, a chi-square test for independence was used on the top 10 most frequent words found in the spam messages, which yielded a p-value of 5e-23. The primary goal of spam messages is to mislead people and incite thoughtless behavior. This can be accomplished through various methods, but one of the most common tactics of spam messages is to take advantage of scarcity and desires [4]. Certain words are more effective for this than others and may be an indicator of spam.
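To make the test concrete, here is a standard-library sketch of a Pearson chi-square test of independence on a period-by-word count table. The helper names are hypothetical, no continuity correction is applied, and the code is not the project's own. In practice a library routine such as SciPy's `chi2_contingency` would normally be used instead.

```python
import math
from collections import Counter


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    y = x / 2.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= y / i
            total += term
        return math.exp(-y) * total
    total = math.erfc(math.sqrt(y))
    for i in range(1, (df - 1) // 2 + 1):
        total += math.exp(-y) * y ** (i - 0.5) / math.gamma(i + 0.5)
    return total


def chi2_independence(table):
    """Return (statistic, df, p-value) for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return stat, df, chi2_sf(stat, df)


def top_word_table(before_tokens, after_tokens, k=10):
    """Count the k most common words overall in each period."""
    before, after = Counter(before_tokens), Counter(after_tokens)
    top = [word for word, _ in (before + after).most_common(k)]
    return top, [[before[w] for w in top], [after[w] for w in top]]
```

A p-value below 0.05 corresponds to rejecting independence at the 95% confidence level used here.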


Before


After

Frequency of spam SMS received by day of the week


There is no association between the period before and after the SIM card registration act and the frequency of spam messages received by day of the week.

For the hypothesis testing, a chi-square test for independence yielded a p-value of 0.499. This lack of effect could be attributed to the prominence of automated bots and scripts. While regular people only work during the weekdays, bots are different. The day of the week could be said to be irrelevant to the sources of these spam messages.

Before

After

Links in spam SMS


There is an association between the period before and after the SIM card registration act and the presence of links in spam messages.

For the presence of links, a chi-square test for independence was used, and it yielded a p-value of 0.023. Due to the ever-increasing accessibility of the internet, website links are now often seen in spam messages. If the words are meant to trigger psychological responses, then the link is presented as the "gateway" for acting on them. However, the presence of a link in a message, though a good indicator, is not a conclusive metric.
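A sketch of how link presence could be flagged and tested on a 2x2 table (period by link or no link). The regular expression is a simple heuristic and the function names are hypothetical. The statistic uses the shortcut formula for 2x2 tables without a continuity correction, so results may differ slightly from library defaults.

```python
import math
import re

# Heuristic: treat anything starting with http(s):// or www. as a link.
LINK_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def has_link(body):
    """Return True if the message body appears to contain a link."""
    return bool(LINK_RE.search(body))


def chi2_2x2(a, b, c, d):
    """Chi-square statistic and p-value (df = 1) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    stat = n * (a * d - b * c) ** 2 / denom
    return stat, math.erfc(math.sqrt(stat / 2.0))


def link_association(before_bodies, after_bodies):
    """Test whether link presence is associated with the period."""
    a = sum(has_link(text) for text in before_bodies)
    c = sum(has_link(text) for text in after_bodies)
    return chi2_2x2(a, len(before_bodies) - a, c, len(after_bodies) - c)
```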

Another important component of website links is the top-level domain (TLD), such as .com or .edu. No hypothesis testing was done for TLDs. TLDs are overseen by the Internet Corporation for Assigned Names and Numbers (ICANN), and the restrictions for each type can vary; lesser-known TLDs have less stringent restrictions than TLDs like ".com.ph" [5]. As a result, websites under these lesser-known TLDs can be malicious in nature, so a link's TLD is also a good indicator.
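Tallying TLDs could look like the sketch below. It assumes links start with `http://` or `https://` and takes only the last label of the host name, so a two-part suffix such as ".com.ph" is counted as ".ph". Handling such suffixes properly would need a public suffix list, which is outside this sketch. The function name is hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)


def tld_counts(bodies):
    """Count the last host-name label of every http(s) link in the messages."""
    counts = Counter()
    for body in bodies:
        for url in URL_RE.findall(body):
            try:
                host = (urlparse(url).hostname or "").rstrip(".")
            except ValueError:      # malformed URL, skip it
                continue
            if "." in host:
                counts["." + host.rsplit(".", 1)[1]] += 1
    return counts
```

Running this separately on the before and after samples gives the two distributions that the charts compare.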

Before


After

Notice how the number of links with the ".com" TLD dropped sharply. This may be attributed to the risk by association that ".com" site owners took on after the deadline.

Length of spam messages


There is no association between the period before and after the SIM card registration act and the character length of spam messages.

For the hypothesis test, the Mann-Whitney U test was used, and it yielded a p-value of 0.639. Normality testing showed that the pre- and post-deadline samples were not normally distributed, so a t-test would be unreliable. The distribution can be attributed to the 160-characters-per-peso rate that prepaid carriers charge users for sending SMS messages. The test result may be an indicator that spammers made little to no changes in terms of message length.
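For readers unfamiliar with the test, the following standard-library sketch computes a two-sided Mann-Whitney U test using the normal approximation with a tie correction and no continuity correction. It is an illustration under those assumptions, not the project's code. A library routine such as SciPy's `mannwhitneyu` would normally be used, and its defaults may give slightly different p-values.

```python
import math


def mann_whitney_u(x, y):
    """Return (U, two-sided p-value) for two samples of numbers."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1                                  # [i, j) is one group of tied values
        avg_rank = (i + 1 + j) / 2.0                # average of ranks i+1 .. j
        size = j - i
        tie_term += size ** 3 - size
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if combined[k][1] == 0)
        i = j
    u1 = rank_sum_x - n1 * (n1 + 1) / 2.0
    u = min(u1, n1 * n2 - u1)
    mean = n1 * n2 / 2.0
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        raise ValueError("all values are identical")
    z = (u - mean) / math.sqrt(var)
    return u, math.erfc(abs(z) / math.sqrt(2.0))
```

Because the test works on ranks rather than raw values, it does not require the message lengths to be normally distributed.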

Before


After

Clustering


For the machine learning algorithm, Latent Dirichlet Allocation (LDA) was used with topic count = 11. The spam messages were clustered through natural language processing (NLP) in Python using the scikit-learn (sklearn) library. The resulting cluster plot can be found as an image below. Note that on larger displays, an interactive version is available.

Multiple values for topic count were tested, with 11 giving the most reasonable result. However, the resulting plot is still sub-optimal: the 11 topics are not distinct enough, and some overlaps can be found between them on closer inspection. For example, multiple topics have keywords connected to gambling such as "draw", "spin", and "roulette". Nonetheless, the plot still showcases the general themes of spam messages. These would be: False Security Alerts, Gambling "Opportunities", Free High-value Items, and Easy Money.

Recommendations for future projects would be to use other models aside from LDA clustering and increase the diversity of the data set further.

What do the results mean?

Now, the results need to be connected to our questions.


As shown in the results, it was found that the period before and after the SIM card registration act was correlated with the frequency of the top 10 most common words appearing in spam messages and the presence of links in spam messages. On the other hand, it was also found that it did not correlate with the day of the week spam messages are received and the character length of spam messages. At face value, these results show that while the SIM card registration act did not correlate with any changes in the form of the spam message (i.e. character length, day of the week sent), it did correlate with changes in the actual contents of the spam messages (i.e. presence of links, words that appear in a message).

Therefore, this suggests that we may reject the null hypothesis on the basis that whether a spam message was sent before or after the SIM card registration act was correlated to some difference in the spam messages. However, the extent of this difference and whether this has a cause-and-effect relationship cannot be conclusively determined.

Moreover, it must be noted that this study is limited primarily by the small number of parameters tested, the relatively small sample size of the dataset, and the small number of sources contributing entries to the dataset (which results in bias). As such, further study with a larger and more comprehensive dataset and set of parameters is needed to determine the validity of the correlations found and to provide better interpretations. Still, this all suggests that spam messages in the Philippines did change in some manner before and after the deadline of the SIM card registration act. Whether these differences are for the better or for the worse, and how this reflects on the overall effectiveness of the SIM card registration act in the Philippines, requires further research.

Besides that, the results of the topic clustering also show the existence of some pattern or trend in the content of spam messages. However, again, due to the limitations of the study and the nature of topic clustering, it is difficult to determine what exactly these patterns are. Still, this is significant in showing that spam messages in the Philippines are not just random and do have some pattern in their contents. The information gained through the topic clustering can be used to better determine whether a certain SMS message is spam based on the presence of words commonly found in a certain cluster, thus creating better filters for spam messages received in the country.

The Conclusion

TLDR, what now?


Even today, SMS communication and instant messaging technology serve as the primary methods of communication within the Philippines, with millions of users relying on them nationwide. In line with Sustainable Development Goal number 9, it is in our best interest to make this technology safer and easier to use, with our project primarily focusing on the fight against spam messages and their pernicious content.

The data shows that certain trends and patterns can be determined based on the features of these messages, such as the common terminologies found within them as well as the presence of links. The identified patterns can be linked to the application of certain persuasive techniques combined with malicious intent in order to trick unsuspecting Filipinos into getting scammed, providing personal information, or investing in shady financial schemes. However, there were also some features that did not exhibit any trend, such as the day of the week on which the messages are received and the length of the spam messages, which seemed to stay consistent, at least with the data that we were able to gather. This is understandable as they are features that are more closely related to the method by which the spam messages are sent rather than their content, which is unlikely to change barring major technological advancements in the medium. Moreover, some of the changes within these trends did seem to be associated with the recently implemented SIM Card Registration Act with some degree of statistical significance.

Finally, the application of clustering on the data shows us that these spam messages have the potential to be further categorized based on their content, which could be used in developing more robust spam detection filters and systems.

The proponents of this project hope that these results will primarily serve as a reference that will aid Filipinos in discerning what messages follow the patterns of malicious spam messages and help them avoid the concerning outcomes that may result from entertaining these kinds of messages.

It also serves as a data-driven look into the potential impact of the SIM Card Registration Act on one of the issues that it aims to address, assisting future lawmakers and citizens in further improving the implementation of this law.


So join us in the fight to make sure that every Filipino SIM out there will have most of their SPAM messages SLAMmed out of existence.

Meet the team.

Ehren Castillo


Hey! I'm Ehren. I'm a 2nd year BS Computer Science student at the University of the Philippines, Diliman. I'm generally interested in software and web development, as well as trying out a wide variety of technologies and programming languages. In my free time, I enjoy programming, playing games, and reading manga.

Hans Salazar


Heyo! I'm Hans Miguel Salazar, a 4th year computer science student with interests in web development, artificial intelligence and game development. I hope to contribute to the nation through the projects that we create and to make the world a better place, one line of code at a time. When I'm not doing that though, I read manga and novels, and play games as well.

Kevin Trinidad


Hello! I'm Kevin and I'm a 2nd year computer science student with interests in game and software development. I've been helping my dad with computers ever since I was young, so both the physical and digital aspects of computers are something I'm fond of. My free time is usually spent playing games, reading novels, and exploring software tweaks.