Explore an open-source malware classifier and dataset in this conference talk from BSidesSF 2018. Learn why the scarcity of public datasets has held back machine learning for static malware detection. Discover a new open-source dataset of labels for diverse Windows PE files, which includes feature vectors for model building and a pre-trained model for research. Follow the reasoning behind feature selection and labeling, and see how the model performs on real-world samples. Gain insights into the EMBER dataset, its naming convention, and the composition of its training set. Examine two types of features, how they are calculated, and categories such as section information, strings, and file size. Understand the feature vectorization, model training, and scoring processes. Explore the code base, Python notebook, and feature engineering techniques. Investigate semi-supervised learning and offensive research applications. Conclude with a live demonstration covering data download, analysis of packed samples, and metadata examination.