Large and Small Magellanic Cloud clean samples for Gaia DR3

In the field of view of the Magellanic Clouds (MCs), astronomers observe both stars that belong to these dwarf galaxies and stars that belong to the Milky Way (MW) halo. In order to perform any analysis of the Large and Small Magellanic Clouds (LMC and SMC, respectively), we need to get rid of all this foreground contamination by MW stars. Hence the need for LMC and SMC clean samples. We focus our efforts on the latest Gaia Data Release (Gaia DR3). The classification probability of each object is available in electronic form at the CDS: LMC, SMC.

Figure 1. Artistic impression of an observer close to the Sun, which is embedded in the MW disk (in light blue) and halo (light purple). When looking at the MCs (red stars), the observer also detects foreground MW stars (yellow stars).

At first, one may think of distinguishing MCs and MW stars through their distances. However, this is not possible because of the large uncertainties in the Gaia parallax-based distances at the MCs (Lindegren et al. 2021). Instead, a classification method based on a Neural Network (NN) is developed in Jiménez-Arranz+23a and Jiménez-Arranz+23b to obtain LMC and SMC clean samples, respectively.

The input of the NN is Gaia's astrometry (with its uncertainties) and photometry, corresponding to 11 input neurons. The NN has a single output, which gives for each object the probability 𝑃 of being an MCs star (or, conversely, the probability of not being a MW star). The object is very likely to belong to the MCs (MW) if the 𝑃 value is close to 1 (0). See top panel of Fig. 2.
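To make the input/output layout concrete, here is a minimal sketch of a forward pass with 11 inputs and a single probability output. It is not the trained network from the papers: the feature names (taken from the Gaia archive column names), the single hidden layer, its size, the activation functions, and the random weights are all assumptions made for illustration.

```python
import numpy as np

# Assumed set of 11 inputs: astrometry, its uncertainties, and photometry.
FEATURES = [
    "ra", "dec", "parallax", "parallax_error",
    "pmra", "pmra_error", "pmdec", "pmdec_error",
    "phot_g_mean_mag", "phot_bp_mean_mag", "phot_rp_mean_mag",
]


def sigmoid(x):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One hidden layer (hypothetical) followed by a single sigmoid output neuron."""
    hidden = np.tanh(x @ w_hidden + b_hidden)
    return sigmoid(hidden @ w_out + b_out)


rng = np.random.default_rng(0)
n_hidden = 6  # hypothetical hidden-layer size
w_hidden = rng.normal(size=(len(FEATURES), n_hidden))  # untrained, random weights
b_hidden = np.zeros(n_hidden)
w_out = rng.normal(size=n_hidden)
b_out = 0.0

# Five fake, already standardized stars (one row per star, one column per feature).
x = rng.normal(size=(5, len(FEATURES)))
prob = forward(x, w_hidden, b_hidden, w_out, b_out)  # one value of P per star
```

With trained weights, values of `prob` close to 1 would indicate likely MCs members and values close to 0 likely MW stars.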

We must establish a probability threshold 𝑃𝑐𝑢𝑡 in order to obtain a binary classification from the probabilities that the classifier generates for each star. A star is classified as belonging to the MCs if 𝑃 > 𝑃𝑐𝑢𝑡 and to the MW if 𝑃 < 𝑃𝑐𝑢𝑡.
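The thresholding step can be sketched in a few lines. The probabilities and the two 𝑃𝑐𝑢𝑡 values below are made up for illustration and are not the thresholds adopted for our samples. The sketch also assumes that a star with 𝑃 exactly equal to 𝑃𝑐𝑢𝑡 is assigned to the MW.

```python
import numpy as np


def classify(prob, p_cut):
    """Return a boolean mask that is True for stars classified as MCs (P > P_cut)."""
    prob = np.asarray(prob, dtype=float)
    return prob > p_cut


# Hypothetical NN output probabilities for five stars.
prob = np.array([0.02, 0.40, 0.55, 0.93, 0.99])

is_mc_low = classify(prob, p_cut=0.1)   # low threshold: more stars kept as MCs
is_mc_high = classify(prob, p_cut=0.9)  # high threshold: fewer stars kept as MCs
```

With the low threshold four of the five toy stars are classified as MCs, while with the high threshold only the last two are.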

Figure 2. Representation of the NN output neuron. Top panel: a star is very likely to belong to the MCs (MW) if the 𝑃 value is close to 1 (0). Middle panel: low probability threshold that prioritizes the completeness of the MCs sample. Bottom panel: high probability threshold that prioritizes the purity of the MCs sample.

Fixing a low probability threshold minimizes the number of MCs objects that are missed, but at the cost of having more "mistaken" MW stars in the MCs-classified sample. In this case, we prioritize completeness (middle panel of Fig. 2) over purity.

Conversely, by setting a high probability threshold, we can reduce contamination in the resulting MCs-classified sample, but at the cost of omitting some MCs stars and producing a less complete sample. In this case, we prioritize purity (bottom panel of Fig. 2) over completeness.
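The trade-off can be illustrated on a small labelled toy set. Here we assume the usual definitions: completeness is the fraction of true MCs stars that are selected, and purity is the fraction of selected stars that are true MCs stars. The probabilities, labels, and thresholds are invented for this sketch.

```python
import numpy as np


def completeness_and_purity(prob, is_true_mc, p_cut):
    """Completeness and purity of the MCs-classified sample for a given P_cut."""
    prob = np.asarray(prob, dtype=float)
    is_true_mc = np.asarray(is_true_mc, dtype=bool)
    selected = prob > p_cut
    true_pos = np.count_nonzero(selected & is_true_mc)
    n_true = np.count_nonzero(is_true_mc)
    n_selected = np.count_nonzero(selected)
    completeness = true_pos / n_true if n_true else float("nan")
    purity = true_pos / n_selected if n_selected else float("nan")
    return completeness, purity


# Toy data: NN probabilities and the (normally unknown) true membership.
prob = np.array([0.05, 0.20, 0.30, 0.60, 0.70, 0.80, 0.95, 0.97])
truth = np.array([0, 0, 1, 0, 1, 1, 1, 1], dtype=bool)

low = completeness_and_purity(prob, truth, p_cut=0.25)   # favors completeness
high = completeness_and_purity(prob, truth, p_cut=0.75)  # favors purity
```

In this toy case the low threshold recovers every true MCs star but lets one MW star in, while the high threshold yields a pure sample that misses two of the five true MCs stars.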

The choice made in the purity-completeness trade-off determines the characteristics of the final sample and may, therefore, have an impact on the results. Thus, we defined two different samples: the NN complete sample and the NN optimal sample.

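Additionally, because the number of MW stars rises exponentially at fainter magnitudes whereas the number of MCs stars rapidly decreases beyond 𝐺 ≃ 19.5, we introduced a third case after carefully studying the results for the optimal sample: the NN truncated-optimal sample, which adds a cut on faint stars to the optimal sample.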

Finally, for each of the four samples we consider two datasets. The first is the full sample, where we assume that none of the stars has line-of-sight velocity information. The second is a sub-sample of the first, where we only keep stars with Gaia DR3 line-of-sight velocities. We refer to these sub-samples as the corresponding 𝑉𝑙𝑜𝑠 sub-samples. The number of stars per dataset is given in the second and third columns of Table 1, respectively, together with the mean astrometric information.
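Building a 𝑉𝑙𝑜𝑠 sub-sample amounts to keeping the stars with a measured line-of-sight velocity. In the sketch below we assume the velocities come from the `radial_velocity` column of the Gaia DR3 source table and that missing values are stored as NaN. The identifiers and velocities are invented.

```python
import numpy as np

# Hypothetical source identifiers and line-of-sight velocities in km/s.
source_id = np.array([101, 102, 103, 104])
radial_velocity = np.array([262.1, np.nan, 35.7, np.nan])  # NaN = no measurement

# Keep only the stars with a measured line-of-sight velocity.
has_vlos = np.isfinite(radial_velocity)
vlos_source_id = source_id[has_vlos]
vlos = radial_velocity[has_vlos]
```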

Table 1. Comparison of the number of sources and mean astrometry of the LMC and SMC samples between the proper motion selection (the sample used as reference, introduced in Gaia Collaboration, X. Luri+21) and the NN samples. Parallax is in mas and proper motions are in mas yr⁻¹.

The sky density distributions for the classified LMC/MW (SMC/MW) members in our different samples are shown in Figure 3 left (right). In the left column of each panel, we show the LMC/SMC selection in each of the samples, while in the right column, we show the sources classified as MW. Each row corresponds to one selection strategy: proper motion selection (first row) followed by the three NN-based ones (complete, optimal and truncated-optimal, respectively).

Figure 3. Sky density distribution in equatorial coordinates of both the MCs (left columns) and MW (right columns) samples obtained from the different classifiers. The left panel shows the LMC/MW classifier, whereas the right panel shows the SMC/MW classifier. First row: proper motion selection classification. Second row: Complete NN classification. Third row: Optimal NN classification. Fourth row: Truncated-optimal NN classification.

To validate the results of our selection criteria, we compare them with external independent classifications. To do so, we cross-matched our base sample with three external samples: Cepheids, RR-Lyrae, and the SH distance-based classification.

Table 2 compares the outcomes of our four classification criteria as applied to the stars in the three validation samples. The results using the LMC (SMC) Cepheids, RR-Lyrae, and SH validation samples reveal that the completeness of the resulting MCs classifications is excellent, typically exceeding 95% (85%). The truncated-optimal sample is the exception, where the cut in faint stars reduces the completeness of the RR-Lyrae sample.

Table 2. Matches of the classified LMC (top) and SMC (bottom) members in our four considered samples against the validation samples. The total number of stars, which is listed beneath the sample name, is used to determine percentages.
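The percentages in Table 2 follow from counting how many stars of a validation sample end up in a given MCs-classified sample and dividing by the total size of the validation sample. A minimal sketch, assuming the cross-match is done on a shared source identifier and using invented identifiers:

```python
import numpy as np


def validation_completeness(validation_ids, classified_mc_ids):
    """Percentage of validation stars that were classified as MCs members."""
    validation_ids = np.asarray(validation_ids)
    recovered = np.isin(validation_ids, classified_mc_ids)
    return 100.0 * np.count_nonzero(recovered) / validation_ids.size


# Hypothetical identifiers: a validation sample and an MCs-classified sample.
cepheid_ids = np.array([11, 12, 13, 14])
mc_ids = np.array([10, 11, 12, 14, 20])

pct = validation_completeness(cepheid_ids, mc_ids)  # 3 of 4 recovered
```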

On the other hand, the relative contamination by MW stars in the MCs samples is more difficult to assess. We have to rely on the SH distance-based classification as an external comparison, with the caveat that this classification contains its own classification errors. These results point to a possible contamination by MW stars in our samples of some tens of percent.

However, we can do an additional check using the line-of-sight velocities in Gaia DR3, which are available only for a (small) subset of the total sample. These line-of-sight velocities are not used by any of our classification criteria and have different mean values for the MW and LMC/SMC (therefore providing an independent check).

In Fig. 4, we plot the histograms of line-of-sight velocities separately for stars classified as MW and LMC/SMC. It is clear from these that the contamination of the MCs sample is reduced, and likely significantly below the levels suggested above. We estimate the MW contamination to be around 5% (10%) if we consider the LMC (SMC) NN complete sample and roughly separate the MW stars with a cut at 𝑉𝑙𝑜𝑠 < 125 (75) km s⁻¹. Note, however, that this check is not entirely representative, since only stars at the bright end of the sample (𝐺 ≲ 16) are included in the subset of Gaia DR3 stars with observed line-of-sight velocities.

Figure 4. Line-of-sight velocity distribution for the LMC/MW (first two columns) and SMC/MW (last two columns) classifiers. We show the stars classified as LMC or SMC (top) and MW (bottom). We show two 𝑉𝑙𝑜𝑠 sub-samples: NN complete (first and third columns) and NN optimal (second and fourth columns) samples.
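The rough contamination estimate from the 𝑉𝑙𝑜𝑠 cut can be sketched as the fraction of MCs-classified stars whose line-of-sight velocity falls below the cut. The velocities below are invented toy values, not our data. Only the 125 km s⁻¹ cut for the LMC is taken from the text.

```python
import numpy as np


def mw_contamination(vlos_kms, vlos_cut_kms):
    """Fraction of MCs-classified stars with V_los below the cut (likely MW stars)."""
    vlos = np.asarray(vlos_kms, dtype=float)
    vlos = vlos[np.isfinite(vlos)]  # ignore stars without a measured velocity
    return np.count_nonzero(vlos < vlos_cut_kms) / vlos.size


# Toy line-of-sight velocities (km/s) of ten stars classified as LMC members.
vlos_lmc_classified = np.array(
    [40.0, 210.0, 225.0, 240.0, 255.0, 262.0, 270.0, 281.0, 295.0, 310.0]
)

frac = mw_contamination(vlos_lmc_classified, vlos_cut_kms=125.0)  # 1 of 10 stars
```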

As a final test, we made a query to the Gaia archive in a nearby region with homogeneous sky density. This way we can estimate the number of MW stars expected in a region similar to that covered by our Gaia base sample. From this new query, we obtained 4 240 771 (932 332) stars, so we would expect a similar number of MW stars in the region we selected around the LMC (SMC).

Given that the LMC (SMC) Gaia base sample contains 18M (4M) objects and the number of objects classified as LMC (SMC) is around 6 to 12 (1 to 2) million (see Table 1), the number of stars classified as MW is around 12 to 6 (3 to 2) million. Compared with the roughly 4.2 (0.9) million MW stars expected from the nearby region, there are too many stars classified as MW (an excess of 2 to 8 million for the LMC, and of 1 to 2 million for the SMC). Therefore, we can conclude that our NN LMC/SMC samples prioritize purity over completeness.
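The arithmetic behind these excess estimates can be written out explicitly. The sketch uses only the rounded numbers quoted in the text (base sample sizes, the range of stars classified as MCs, and the star counts from the nearby region). The variable and function names are ours.

```python
# Rounded numbers quoted in the text.
base_sample = {"LMC": 18_000_000, "SMC": 4_000_000}
classified_mc = {"LMC": (6_000_000, 12_000_000), "SMC": (1_000_000, 2_000_000)}
expected_mw = {"LMC": 4_240_771, "SMC": 932_332}  # counts in the nearby region


def mw_excess(galaxy):
    """Return the (min, max) excess of MW-classified stars over the expected MW count."""
    n_mc_min, n_mc_max = classified_mc[galaxy]
    n_mw_max = base_sample[galaxy] - n_mc_min  # fewest MCs stars -> most MW stars
    n_mw_min = base_sample[galaxy] - n_mc_max  # most MCs stars -> fewest MW stars
    return n_mw_min - expected_mw[galaxy], n_mw_max - expected_mw[galaxy]


excess_lmc = mw_excess("LMC")  # about 2 to 8 million
excess_smc = mw_excess("SMC")  # about 1 to 2 million
```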

Finally, since Gaia also provides photometric information, we can convert the density plots of Fig. 3 into color images of the MCs. We can also compare them with their respective base samples (where the MW contaminants are present). To highlight the performance of the classifier, notice, for example, how the two globular clusters 47 Tuc and NGC 362 are successfully removed from the SMC clean samples.

Figure 5. The Large (top) and Small (bottom) Magellanic Cloud as viewed by the European Space Agency (ESA)’s Gaia satellite using astrometric and photometric information from the mission’s Data Release 3 (DR3). Each column compares different samples. First column: Base sample (with MW contamination). Second column: Complete NN classification. Third column: Optimal NN classification. Fourth column: Truncated-optimal NN classification. Scale is maintained in the image.

Again, we would like to emphasize that the classification probability of each object is publicly available in electronic form at the CDS: LMC, SMC. If you require assistance with using the catalog, feel free to reach out to me.