This project introduces an approach to automating the parsing of unstructured text from searchable PDF invoices into structured schemas (named entity recognition). Traditional methods rely heavily on manual effort and rigid templates, leading to inefficiency and poor scalability. We leverage Transformer networks to improve operational efficiency and data accuracy by developing a machine learning model capable of understanding and structuring invoice data without predefined rules. Using pre-trained Transformer networks, combined with techniques for both sequential and simultaneous field extraction, we aim to significantly reduce manual data entry and improve the reliability of database information. The project also explores the integration of cloud infrastructure for scalability and a Human-in-the-Loop system for validation and quality assurance (QA).
Deliverables include a trained ML model, comprehensive documentation, and integration strategies for real-world applications, aiming to set a new standard for document-processing automation.
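As an illustration of the sequential field-extraction step, the sketch below groups BIO-tagged tokens (the kind of output a Transformer token classifier produces) into structured invoice fields. The tag set and field names here are hypothetical, chosen only for the example:

```python
def group_bio_tags(tokens, tags):
    """Group BIO-tagged tokens (e.g. from a Transformer token
    classifier) into structured invoice fields."""
    fields = {}
    current_field, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_field:  # close any open field
                fields.setdefault(current_field, []).append(" ".join(current_tokens))
            current_field, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_field == tag[2:]:
            current_tokens.append(token)  # continue the current field
        else:  # an "O" tag (or a stray I-tag) ends any open field
            if current_field:
                fields.setdefault(current_field, []).append(" ".join(current_tokens))
            current_field, current_tokens = None, []
    if current_field:  # flush the last open field
        fields.setdefault(current_field, []).append(" ".join(current_tokens))
    return fields

# Illustrative tokens and tags for a toy invoice line.
tokens = ["Invoice", "No", "12345", "Acme", "Corp", "Total", "$99.00"]
tags = ["O", "O", "B-INVOICE_ID", "B-VENDOR", "I-VENDOR", "O", "B-TOTAL"]
```

Calling `group_bio_tags(tokens, tags)` on this toy input yields a structured record with `INVOICE_ID`, `VENDOR`, and `TOTAL` fields, which is the schema-filling step the project automates.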
Hai-Ming Xu, Lingqiao Liu, Hao Chen, Ehsan Abbasnejad, Rafael Felix
As an effective way to alleviate the burden of data annotation, semi-supervised learning (SSL) provides an attractive solution due to its ability to leverage both labeled and unlabeled data to build a predictive model. While significant progress has been made recently, SSL algorithms are often evaluated and developed under the assumption that the network is randomly initialized. This is in sharp contrast to most vision recognition systems, which are built by fine-tuning a pretrained network for better performance. While the marriage of SSL and a pretrained model seems straightforward, recent literature suggests that naively applying state-of-the-art SSL with a pretrained model fails to unleash the full potential of the training data. In this paper, we postulate that the underlying reason is that the pretrained feature representation can carry a bias inherited from the source data, and that this bias tends to be magnified through the self-training process of a typical SSL algorithm. To overcome this issue, we propose to use pseudo-labels from the unlabeled data to update the feature extractor, which is less sensitive to incorrect labels, and to train the classifier only on the labeled data. More specifically, we progressively adjust the feature extractor to ensure that its induced feature distribution maintains good class separability even under strong input perturbation. Through extensive experimental studies, we show that the proposed approach achieves superior performance over existing solutions.
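The proposed training split can be sketched in a few lines: the classifier head is updated only on labeled data, while the feature extractor is adjusted using pseudo-labels on the unlabeled data. This is a minimal numpy sketch with linear layers and hand-derived softmax cross-entropy gradients; the toy data, dimensions, and learning rate are illustrative, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-class problem: inputs in R^4, "pretrained" linear extractor to R^2.
X_lab = rng.normal(size=(16, 4))
y_lab = rng.integers(0, 2, size=16)
X_unl = rng.normal(size=(64, 4))

W_feat = rng.normal(scale=0.1, size=(4, 2))  # feature extractor: pseudo-label updates
W_clf = rng.normal(scale=0.1, size=(2, 2))   # classifier head: labeled data only

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def onehot(y, k=2):
    return np.eye(k)[y]

lr = 0.1
for _ in range(50):
    # 1) Update the classifier head on labeled data only.
    F = X_lab @ W_feat
    P = softmax(F @ W_clf)
    grad_clf = F.T @ (P - onehot(y_lab)) / len(X_lab)
    W_clf -= lr * grad_clf

    # 2) Pseudo-label the unlabeled data, update the feature extractor only.
    pseudo = np.argmax(softmax((X_unl @ W_feat) @ W_clf), axis=1)
    F_u = X_unl @ W_feat
    P_u = softmax(F_u @ W_clf)
    grad_feat = X_unl.T @ ((P_u - onehot(pseudo)) @ W_clf.T) / len(X_unl)
    W_feat -= lr * grad_feat
```

The key point of the sketch is the asymmetry: incorrect pseudo-labels only touch `W_feat`, never `W_clf`, mirroring the paper's argument that the feature extractor is less sensitive to label noise than the classifier.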
Cite:
@inproceedings{xu2023progressive,
title={Progressive Feature Adjustment for Semi-supervised Learning from Pretrained Models},
author={Xu, Hai-Ming and Liu, Lingqiao and Chen, Hao and Abbasnejad, Ehsan and Felix, Rafael},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={3292--3302},
year={2023}
}
Arpit Garg, Cuong Nguyen, Rafael Felix, Thanh-Toan Do, Gustavo Carneiro
Noisy labels are unavoidable yet troublesome in the ecosystem of deep learning because models can easily overfit them. There are many types of label noise, such as symmetric, asymmetric and instance-dependent noise (IDN), with IDN being the only type that depends on image information. Such dependence makes IDN a critical type of label noise to study, given that labelling mistakes are caused in large part by insufficient or ambiguous information about the visual classes present in images. Aiming to provide an effective technique to address IDN, we present a new graphical modelling approach called InstanceGM, which combines discriminative and generative models. The main contributions of InstanceGM are: i) the use of the continuous Bernoulli distribution to train the generative model, offering significant training advantages, and ii) the exploration of a state-of-the-art noisy-label discriminative classifier to generate clean labels from instance-dependent noisy-label samples. InstanceGM is competitive with current noisy-label learning approaches, particularly on instance-dependent noise benchmarks using synthetic and real-world datasets, where our method shows better accuracy than the competitors in most experiments.
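For reference, the continuous Bernoulli density used to train the generative model is p(x|λ) = C(λ) λ^x (1−λ)^(1−x) on x ∈ [0, 1], where the normalizing constant C(λ) = 2 artanh(1−2λ)/(1−2λ) (with C(1/2) = 2) is what distinguishes it from naively applying a Bernoulli likelihood to continuous inputs. A small numerical sketch, not the paper's code:

```python
import numpy as np

def cb_log_norm_const(lam, eps=1e-6):
    """Log normalizing constant C(lam) of the continuous Bernoulli:
    p(x|lam) = C(lam) * lam**x * (1-lam)**(1-x) for x in [0, 1],
    with C(lam) = 2*artanh(1-2*lam)/(1-2*lam) and C(0.5) = 2."""
    lam = np.asarray(lam, dtype=float)
    near_half = np.abs(lam - 0.5) < eps
    safe = np.where(near_half, 0.4, lam)  # dummy value; masked out below
    c = 2.0 * np.arctanh(1.0 - 2.0 * safe) / (1.0 - 2.0 * safe)
    return np.where(near_half, np.log(2.0), np.log(c))

def cb_log_pdf(x, lam):
    """Log-density of the continuous Bernoulli at x in [0, 1]."""
    return cb_log_norm_const(lam) + x * np.log(lam) + (1 - x) * np.log(1 - lam)
```

Without C(λ), the unnormalized `lam**x * (1-lam)**(1-x)` does not integrate to one over [0, 1], which biases the generative model's reconstruction term; including it is the training advantage the paper exploits.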
Cite:
@inproceedings{garg2023instance,
title={Instance-dependent noisy label learning via graphical modelling},
author={Garg, Arpit and Nguyen, Cuong and Felix, Rafael and Do, Thanh-Toan and Carneiro, Gustavo},
booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
pages={2288--2298},
year={2023}
}
In a challenging project, I developed a real-time Point of Gaze (PoG) estimation model for analyzing ad viewership on social media. As tech lead and an active individual contributor, I orchestrated a novel, multi-disciplinary approach encompassing data collection, validation, and model construction using RGB images in desktop settings. Key actions included managing large-scale data collection via Amazon Mechanical Turk, creating efficient data-processing pipelines, and developing a robust CNN model built on pre-trained architectures. We significantly mitigated overfitting and optimized performance using custom loss functions, achieving a low error margin.
This endeavor not only achieved its direct objectives but also propelled the company into new markets such as the UK, Ireland, and Twitch streaming platforms, generating substantial contracts and similar subsequent agreements. The project delivered high customer satisfaction, especially in ROI, solidifying the company's market standing and showcasing the team's ability to meet tight deadlines, plan strategically, and collaborate under pressure. It underlines the importance of clear communication, leadership, and adaptability in dynamic project environments, setting a benchmark for future initiatives.
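The error metric behind such a model is typically the mean Euclidean distance between predicted and true on-screen gaze points; the snippet below is a generic sketch of that metric, not the project's proprietary custom loss:

```python
import numpy as np

def pog_error(pred_xy, true_xy):
    """Mean Euclidean error between predicted and ground-truth
    point-of-gaze screen coordinates, both arrays of shape (N, 2)
    in pixels. Commonly reported as the PoG error margin."""
    return float(np.mean(np.linalg.norm(pred_xy - true_xy, axis=1)))
```

In practice this pixel error is often also converted to visual degrees using the viewing distance, so results are comparable across screen sizes.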
A Garg, C Nguyen, R Felix, TT Do, G Carneiro
Noisy labels are challenging for deep learning due to the high capacity of deep models, which can overfit noisy-label training samples. Arguably the most realistic, and coincidentally the most challenging, type of label noise is instance-dependent noise (IDN), where labelling errors are caused by ambivalent information present in the images. The most successful label noise learning techniques to address IDN problems usually contain a noisy-label sample selection stage to separate clean and noisy-label samples during training. Such sample selection depends on a criterion, such as loss or gradient, and on a curriculum that defines the proportion of training samples to be classified as clean at each training epoch. Even though the noise rate estimated from the training set appears to be a natural signal for defining this curriculum, to the best of our knowledge previous approaches generally rely on arbitrary thresholds or pre-defined selection functions. This paper addresses this research gap by proposing a new noisy-label learning graphical model that can easily accommodate state-of-the-art (SOTA) noisy-label learning methods and provide them with a reliable noise rate estimate to be used in a new sample selection curriculum. We show empirically that our model, integrated with many SOTA methods, can improve their results on many IDN benchmarks, including synthetic and real-world datasets.
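The resulting selection step can be sketched as a small-loss criterion driven by the estimated noise rate: at each epoch, the (1 − noise rate) fraction of samples with the lowest loss is flagged as clean. This is an illustrative simplification of the paper's curriculum:

```python
import numpy as np

def select_clean(losses, noise_rate):
    """Small-loss sample selection: flag the (1 - noise_rate)
    fraction of samples with the lowest loss as clean.

    losses: per-sample training losses, shape (N,)
    noise_rate: estimated fraction of noisy labels in [0, 1]
    """
    n_clean = int(round((1.0 - noise_rate) * len(losses)))
    order = np.argsort(losses)  # ascending: smallest loss first
    clean = np.zeros(len(losses), dtype=bool)
    clean[order[:n_clean]] = True
    return clean
```

The paper's contribution is to supply `noise_rate` from a principled estimate rather than an arbitrary threshold, so the same selection mechanism can be dropped into existing SOTA methods.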
Cite:
@article{garg2023noisy,
title={Noisy-label Learning with Sample Selection based on Noise Rate Estimate},
author={Garg, Arpit and Nguyen, Cuong and Felix, Rafael and Do, Thanh-Toan and Carneiro, Gustavo},
journal={arXiv preprint arXiv:2305.19486},
year={2023}
}
Rafael Felix, Boris Repasky, Samuel Hodge, Reza Zolfaghari, Ehsan Abbasnejad, Jamie Sherrah
While querying of structured geo-spatial data such as Google Maps has become commonplace, there remains a wealth of unstructured information in overhead imagery that is largely inaccessible to users. This information can be made accessible using machine learning for Visual Question Answering (VQA) about remote sensing imagery. We propose a novel method for Earth observation based on answering natural language questions about satellite images that uses cross-modal attention between image objects and text. The image is encoded with an object-centric feature space, with self-attention between objects, and the question is encoded with a language transformer network. The image and question representations are fed to a cross-modal transformer network that uses cross-attention between the image and text modalities to generate the answer. Our method is applied to the RSVQA remote sensing dataset and achieves a significant accuracy increase over the previous benchmark.
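The cross-attention step can be sketched as single-head scaled dot-product attention in which text-token queries attend to image-object keys and values (learned projections and multi-head machinery omitted for brevity):

```python
import numpy as np

def cross_attention(text_q, img_kv, d_k):
    """Single-head scaled dot-product cross-attention: text-token
    queries (T, d) attend over image-object features (O, d), which
    serve as both keys and values. Returns one attended image
    summary per text token, shape (T, d)."""
    scores = text_q @ img_kv.T / np.sqrt(d_k)      # (T, O) similarity
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over objects
    return weights @ img_kv
```

In the full model, separate learned projections produce the queries, keys and values, and this operation is stacked inside a cross-modal transformer; the sketch shows only the attention core.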
Cite:
@inproceedings{felix2021cross,
title={Cross-modal visual question answering for remote sensing data},
author={Felix, Rafael and Repasky, Boris and Hodge, Samuel and Zolfaghari, Reza and Abbasnejad, Ehsan and Sherrah, Jamie},
booktitle={2021 Digital Image Computing: Techniques and Applications (DICTA)},
pages={1--9},
year={2021},
organization={IEEE}
}
Rafael Felix, Michele Sasdelli, Gustavo Carneiro, Ian Reid
Corresponding author: Rafael Felix – rafael dot felixalves at adelaide dot edu dot au
Generalised zero-shot learning (GZSL) is defined by a training process containing a set of visual samples from seen classes and a set of semantic samples from seen and unseen classes, while the testing process consists of the classification of visual samples from the seen and the unseen classes. Current approaches are based on inference processes that rely on the result of a single modality classifier (visual, semantic, or latent joint space) that balances the classification between the seen and unseen classes using gating mechanisms. There are a couple of problems with such approaches: 1) multi-modal classifiers are known to generally be more accurate than single modality classifiers, and 2) gating mechanisms rely on a complex one-class training of an external domain classifier that modulates the seen and unseen classifiers. In this paper, we mitigate these issues by proposing a novel GZSL method – augmentation network that tackles multi-modal and multi-domain inference for generalised zero-shot learning (AN-GZSL). The multi-modal inference combines visual and semantic classification and automatically balances the seen and unseen classification using temperature calibration, without requiring any gating mechanisms or external domain classifiers. Experiments show that our method produces the new state-of-the-art GZSL results for fine-grained benchmark data sets CUB and FLO and for the large-scale data set ImageNet. We also obtain competitive results for coarse-grained data sets SUN and AWA. We show an ablation study that justifies each stage of the proposed AN-GZSL.
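The temperature-calibration idea can be sketched as follows: seen-class logits are softened by a temperature before the joint softmax over seen and unseen classes, damping the bias toward seen classes without any gating mechanism. The temperature value below is illustrative, not a calibrated value from the paper:

```python
import numpy as np

def calibrated_scores(seen_logits, unseen_logits, T_seen=2.0):
    """Joint softmax over seen and unseen classes, with seen-class
    logits divided by a temperature T_seen > 1 so that the usual
    over-confidence on seen classes is reduced."""
    z = np.concatenate([seen_logits / T_seen, unseen_logits])
    z = z - z.max()  # numerical stability
    p = np.exp(z)
    return p / p.sum()
```

Raising `T_seen` flattens the seen-class scores, shifting probability mass toward the unseen classes; the paper fits this calibration rather than training an external domain classifier.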
Rafael Felix, Ben Harwood, Michele Sasdelli, Gustavo Carneiro
Corresponding author: Rafael Felix – rafael dot felixalves at adelaide dot edu dot au
Generalised zero-shot learning (GZSL) methods aim to classify previously seen and unseen visual classes by leveraging the semantic information of those classes. In the context of GZSL, semantic information is non-visual data such as a text description of the seen and unseen classes. Previous GZSL methods have explored transformations between visual and semantic spaces, as well as the learning of a latent joint visual and semantic space. In these methods, even though learning has explored a combination of spaces (i.e., visual, semantic or joint latent space), inference tended to focus on using just one of the spaces. By hypothesising that inference must explore all three spaces, we propose a new GZSL method based on a multi-modal classification over visual, semantic and joint latent spaces. Another issue affecting current GZSL methods is the intrinsic bias toward the classification of seen classes – a problem that is usually mitigated by a domain classifier which modulates seen and unseen classification. Our proposed approach replaces the modulated classification by a computationally simpler multi-domain classification based on averaging the multi-modal calibrated classifiers from the seen and unseen domains. Experiments on GZSL benchmarks show that our proposed GZSL approach achieves competitive results compared with the state-of-the-art.
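The multi-domain classification can be sketched as a simple average of calibrated per-space probabilities, replacing a trained domain classifier; the uniform averaging below is an illustrative stand-in for the paper's calibrated combination over the visual, semantic and joint latent spaces:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def multimodal_average(logits_per_space):
    """Average per-class probabilities from several single-space
    classifiers (e.g. visual, semantic, joint latent), replacing a
    modulated domain classifier with a simple mean."""
    probs = [softmax(z) for z in logits_per_space]
    return np.mean(probs, axis=0)
```

Because each term is already a probability vector, the mean is itself a valid distribution, so no extra normalization or gating network is needed, which is the computational simplification the abstract describes.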
In generalized zero-shot learning (GZSL), the set of classes is split into seen and unseen classes, where training relies on the semantic features of the seen and unseen classes and the visual representations of only the seen classes, while testing uses the visual representations of the seen and unseen classes. Current methods address GZSL by learning a transformation from the visual to the semantic space, exploring the assumption that the distribution of classes in the semantic and visual spaces is relatively similar. Such methods tend to transform unseen testing visual representations into one of the seen classes' semantic features instead of the semantic features of the correct unseen class, resulting in low-accuracy GZSL classification. Recently, generative adversarial networks (GANs) have been explored to synthesize visual representations of the unseen classes from their semantic features; the synthesized representations of the seen and unseen classes are then used to train the GZSL classifier. This approach has been shown to boost GZSL classification accuracy, but one important constraint is missing: there is no guarantee that the synthetic visual representations can generate back their semantic features in a multi-modal cycle-consistent manner. Without this constraint, the synthetic visual representations may not represent their semantic features well, which means that enforcing it can improve GAN-based approaches. In this project, we propose the use of such a constraint, based on a new regularization for GAN training that forces the generated visual features to reconstruct their original semantic features.
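The proposed regularization can be sketched as a cycle-consistency term: visual features synthesized from semantic features must map back to those semantic features under a visual-to-semantic regressor. `G` and `R` below are hypothetical stand-ins for the GAN generator and the regressor:

```python
import numpy as np

def cycle_consistency_loss(sem, G, R):
    """Multi-modal cycle-consistency penalty: semantic features sem
    (N, d) are mapped to synthetic visual features by generator G,
    then back to the semantic space by regressor R; the loss is the
    mean squared reconstruction error, added to the GAN objective."""
    recon = R(G(sem))
    return float(np.mean((recon - sem) ** 2))
```

During training this term is weighted and added to the adversarial loss, so the generator is penalized whenever its synthetic visual features lose the semantic information they were conditioned on.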