During scATAC-seq data preprocessing, does it make sense to filter the data matrix so that it contains only the most variable peaks (the same way we do it for scRNA-seq) before any further dimensionality reduction or clustering analysis?
To better define cell types, it makes sense.
That depends on the type of analysis you're referring to. PCA, for example, will always focus on the most variable regions. I haven't looked at scATAC-seq data myself but given that it's basically binary, I'm not sure how well the typical variance measures even hold up.
Can I do sparse PCA on the scATAC-seq matrix and see which peaks contribute to, let's say, the first component? And then choose those as the most informative (variable) ones?
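For what it's worth, here is a minimal sketch of that idea: rank peaks by the absolute value of their loading on one principal component. It uses plain PCA via NumPy's SVD rather than sparse PCA, and it assumes a cells x peaks 0/1 matrix; the function name `top_loading_peaks` and the toy data are made up for illustration.

```python
import numpy as np

def top_loading_peaks(X, n_top=10, component=0):
    """Rank peaks by absolute loading on one principal component.

    X is assumed to be a cells x peaks binary (0/1) accessibility matrix.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # center each peak (column)
    # Rows of Vt are the principal axes, i.e. per-peak loadings.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[component]
    order = np.argsort(-np.abs(loadings))
    return order[:n_top], loadings

# Toy data: 60 cells, 30 peaks; peaks 0-4 are open only in the first 30 cells.
rng = np.random.default_rng(0)
X = (rng.random((60, 30)) < 0.1).astype(int)
X[:30, :5] = 1
X[30:, :5] = 0

top, loadings = top_loading_peaks(X, n_top=5)
print(sorted(top.tolist()))
```

On real data the matrix is far larger and sparser, so a truncated or sparse solver would probably be needed instead of a dense SVD.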
Well, I'm not sure how "peaks" would be defined in scATAC-seq as there's a maximum of 2 reads per open region per cell. Maybe you want to collapse the information from multiple cells at the same region? What exactly is the question you're trying to address?
Usually, dimensionality reduction is done on the top variable features (usually the top 500). So you can take the top variable peaks, run PCA on them, and see what the tSNE clusters look like. If you want to overcome the sparsity of the data, you could use a KNN approach to merge data from the n most similar cells. Before doing that, I would check the tSNE on the top 500 variable peaks.
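A rough sketch of the KNN merging step, under my own assumptions: cells are rows, peaks are columns, and cell similarity is measured with the Jaccard index on the binary profiles (any other similarity would do). The helper `knn_pool` is hypothetical, not a function from an existing package.

```python
import numpy as np

def knn_pool(X, k=5):
    """Add to each cell the profiles of its k most similar cells.

    X is assumed to be a cells x peaks binary (0/1) matrix.
    Similarity is the Jaccard index, which is an assumption of this sketch.
    """
    X = np.asarray(X, dtype=float)
    inter = X @ X.T                      # peaks open in both cells
    n_open = X.sum(axis=1)               # open peaks per cell
    union = n_open[:, None] + n_open[None, :] - inter
    jaccard = np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)
    np.fill_diagonal(jaccard, -1.0)      # never pick a cell as its own neighbour
    nn = np.argsort(-jaccard, axis=1)[:, :k]
    pooled = X + X[nn].sum(axis=1)       # counts now range from 0 to k + 1
    return pooled, nn

rng = np.random.default_rng(1)
X = (rng.random((40, 200)) < 0.05).astype(int)  # toy sparse binary matrix
pooled, nn = knn_pool(X, k=5)
print(pooled.shape, pooled.max())
```

The pooled matrix is no longer binary, so the usual variance-based feature selection becomes more meaningful, at the cost of smoothing away some cell-to-cell differences.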
I did not know that the data is binary, so this paper seems to have a nice method to process the data.
Thanks! But how do you find the most variable peaks if the data is binary?
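One way to look at it: for a 0/1 peak that is open in a fraction p of cells, the variance is p(1 - p), so it depends only on how often the peak is open and is largest at p = 0.5. A minimal sketch, assuming a cells x peaks binary matrix (the function name `top_variable_peaks` is just illustrative):

```python
import numpy as np

def top_variable_peaks(X, n_top=500):
    """Rank peaks of a binary cells x peaks matrix by Bernoulli variance."""
    X = np.asarray(X)
    p = X.mean(axis=0)       # fraction of cells in which each peak is open
    var = p * (1.0 - p)      # variance of a 0/1 variable, maximal at p = 0.5
    n_top = min(n_top, X.shape[1])
    order = np.argsort(-var)[:n_top]
    return order, var

# Toy example: 4 cells, 3 peaks open in 2/4, 1/4 and 4/4 cells.
X = np.array([
    [1, 1, 1],
    [1, 0, 1],
    [0, 0, 1],
    [0, 0, 1],
])
order, var = top_variable_peaks(X, n_top=2)
print(order, var)
```

So on strictly binary data, "most variable" just means "open in close to half of the cells", which is why filtering on accessibility frequency may be the more honest way to describe this step.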
Sorry, I was not aware that it's binary. I updated my answer and moved it to a comment, as it doesn't qualify as an answer anymore.
Asked on BioC in the first place and then cross-posted here as suggested there.